Gradient Descent vs Newton's Method

Newton’s method for finding roots:


In optimization we are essentially finding roots of the derivative.
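As a small illustration (my example, not from the note), Newton's root-finding update x ← x − f(x)/f'(x) applied to f(x) = x² − 2:

```python
def newton_root(f, fprime, x0, iters=10):
    """Newton's method for a root of f: repeatedly apply x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)
    return x

# The positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```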




We approximate the function using a second-order Taylor series and find the minimum of this approximation. The root of the approximation's derivative becomes our new guess. This perspective is used in the derivation.
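A minimal sketch of that idea (my toy function, not from the note): minimizing f means running Newton's root-finding update on f', i.e. x ← x − f'(x)/f''(x).

```python
def newton_minimize(fprime, fsecond, x0, iters=10):
    """Newton's method for optimization: find a root of the derivative."""
    x = x0
    for _ in range(iters):
        x = x - fprime(x) / fsecond(x)
    return x

# f(x) = x - log(x) has its minimum at x = 1
# (f'(x) = 1 - 1/x, f''(x) = 1/x^2); start at x0 = 0.5
xmin = newton_minimize(lambda x: 1 - 1 / x, lambda x: 1 / (x * x), x0=0.5)
```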



Andrew Ng Lecture [2]

 Gradient Descent | Newton's Method
 Simpler | Slightly more complex (requires computing and inverting the Hessian)
 Needs a choice of learning rate alpha | No parameters to tune
 Needs more iterations | Needs fewer iterations
 Each iteration is cheap, O(n) where n is the number of features | Each iteration is costly: the Hessian is (n+1) x (n+1), and inverting a matrix is roughly O(n^3)
 Use when the number of features is large (n > 10,000) | Use when the number of features is small (n < 1000)








Iterative Method for Unconstrained Optimization

Newton’s Method



  • Based on Taylor series expansion
  • Advantages
    • Convergence is rapid in general, and quadratic near the optimal point
    • Insensitive to the number of variables
    • Performance does not depend on the choice of parameters
      • The gradient method depends on the learning rate
  • Disadvantages
    • Cost of computing and storing the Hessian
    • Cost of computing a single Newton step
      • You need the second derivative (the example in the note is a simple root-finding problem)


Gradient Descent

  • Very popular method that needs little introduction
  • Exhibits approximately linear convergence
  • Advantage
    • Very simple to implement
  • Disadvantage
    • Convergence rate depends on the condition number of the Hessian
    • Very slow for a large number of variables (say 1000 or more)
    • Performance depends on the choice of parameters like the learning rate
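For contrast with the Newton sketch above, a minimal gradient-descent loop (my toy example) showing the learning-rate parameter the bullets mention:

```python
def gradient_descent(grad, x0, lr=0.1, iters=50):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(iters):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); minimum at x = 3.
# Convergence is linear: the error shrinks by a constant factor per step.
xmin = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```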


Golden Section Search

  • Typically applicable in one dimension only
  • We used it to calculate mobile/tablet adjustments
    • As good practice we avoided recursion and took at most 20 iterations, breaking the loop with a stopping criterion
  • Applicable to strictly unimodal functions
  • Three points that maintain the golden ratio (phi)
  • The bisection method is fine for finding a root, but for finding an extremum golden section search is preferred
  • Sample code :
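The original sample code is not in this copy of the note; below is a hedged sketch of an iterative golden-section search, capped at 20 iterations as the bullet above describes:

```python
import math

def golden_section_min(f, a, b, max_iter=20, tol=1e-6):
    """Golden-section search for the minimum of a strictly unimodal f on [a, b].
    Iterative (no recursion), capped at max_iter, with an early-exit criterion."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi ~ 0.618
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if f(c) < f(d):
            # Minimum lies in [a, d]; reuse c as the new d
            b, d = d, c
            c = b - invphi * (b - a)
        else:
            # Minimum lies in [c, b]; reuse d as the new c
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2

# Toy example: f(x) = (x - 2)^2 is unimodal on [0, 5] with minimum at x = 2
xmin = golden_section_min(lambda x: (x - 2) ** 2, 0.0, 5.0)
```

Each iteration shrinks the bracketing interval by a factor of 1/phi ≈ 0.618, so 20 iterations narrow [0, 5] to roughly 3e-4.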





Ref :








Negative sampling in word2vec

In the previous post we talked about the skip-gram model.


Now let’s say we have 1,000 words and 300 hidden units; we then have 300,000 weights in both the hidden and output layers, which is too many parameters.

The output label during training is a one-hot vector with 999 zeros and a single 1. We randomly select 5 of the zeros and update weights for six words only (the 5 zeros and the single 1). The more frequent a word is, the higher the probability of it being selected; the Google paper mentions an empirical formula for this. This is known as negative sampling.

The above applies to the output layer. In the hidden layer, weights are updated only for the input words (regardless of whether negative sampling is used).
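A sketch of the frequency-weighted selection (my toy vocabulary, not the Google implementation); the 3/4 exponent is the empirical formula from the word2vec paper:

```python
import random

random.seed(0)

# Toy vocabulary with raw counts (hypothetical numbers)
counts = {"the": 500, "on": 200, "cat": 50, "sat": 40, "mat": 30}

# Empirical formula from the word2vec paper: weight each word by count^(3/4),
# which dampens the dominance of very frequent words
weights = {w: c ** 0.75 for w, c in counts.items()}

def sample_negatives(positive, k=5):
    """Draw k negative words (excluding the positive word), freq^(3/4)-weighted."""
    words = [w for w in weights if w != positive]
    probs = [weights[w] for w in words]
    return random.choices(words, weights=probs, k=k)

negs = sample_negatives("cat")
```

Only the output weights for these 5 sampled words plus the positive word are updated on that training step.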





[Example] Lagrange Multiplier With Equality Constraints


Stationary Point

Definition of a stationary point from Wikipedia:

In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).


The Lagrange multiplier method helps us find all the stationary points; each can be a local minimum, local maximum, global minimum, or global maximum. Once we evaluate the objective function at each of these stationary points, we can classify which is a local/global minimum or maximum.
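A small worked example (mine, not from the note): maximize f(x, y) = x·y subject to the equality constraint x + y = 10, with the hand-derived stationarity conditions checked numerically:

```python
# Lagrangian: L(x, y, lam) = x*y - lam*(x + y - 10)
# Stationarity conditions (set all partial derivatives to zero):
#   dL/dx   = y - lam        = 0
#   dL/dy   = x - lam        = 0
#   dL/dlam = -(x + y - 10)  = 0
# Solving: x = y = lam = 5, giving the stationary point (5, 5) with f = 25.
x, y, lam = 5.0, 5.0, 5.0

# Verify each stationarity condition holds at the solved point
assert y - lam == 0.0
assert x - lam == 0.0
assert x + y - 10.0 == 0.0

f_star = x * y  # objective value at the stationary point
```

Evaluating f at this single stationary point (and comparing against nearby feasible points) confirms it is the constrained maximum.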