Newton’s method for finding roots:

In optimization we are essentially finding roots of the derivative: a minimum of f satisfies f'(x) = 0.
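The root-finding iteration itself is x_{k+1} = x_k - f(x_k)/f'(x_k). A minimal sketch (function and variable names are illustrative, not from the lectures):

```python
# Newton's method for root finding: x_{k+1} = x_k - f(x_k) / f'(x_k).
# Example: f(x) = x^2 - 2, whose positive root is sqrt(2).
def newton_root(f, fprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)   # how far the local linear model says to move
        x -= step
        if abs(step) < tol:       # stop when the update is negligible
            break
    return x

root = newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0)
```

Convergence is quadratic near the root, so a handful of iterations already reaches machine-precision accuracy here.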

We approximate the function with a second-order Taylor series around the current guess, then find the minimum of this quadratic approximation (equivalently, the root of its derivative). That minimizer becomes our new guess. This perspective is used in the derivation.
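Applying root finding to f' gives the one-dimensional optimization update x_{k+1} = x_k - f'(x_k)/f''(x_k), which is exactly the minimizer of the local quadratic Taylor model. A minimal sketch (names are illustrative):

```python
# Newton's method for 1-D minimization: run root finding on f'(x).
# Update: x_{k+1} = x_k - f'(x_k) / f''(x_k), the minimizer of the
# quadratic Taylor approximation of f around x_k.
# Example: f(x) = (x - 3)^2 + 1, minimized at x = 3.
def newton_minimize(fprime, fsecond, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = fprime(x) / fsecond(x)
        x -= step
        if abs(step) < tol:
            break
    return x

x_min = newton_minimize(lambda x: 2*(x - 3), lambda x: 2.0, x0=0.0)
```

Because the example f is itself quadratic, its Taylor model is exact and a single Newton step lands on the minimizer.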

Andrew Ng's lecture [2] compares the two methods:

| Gradient Descent | Newton's Method |
| --- | --- |
| Simpler | Slightly more complex (requires computing and inverting the Hessian) |
| Needs a choice of learning rate alpha | No parameters to tune |
| Needs more iterations | Needs fewer iterations |
| Each iteration is cheap: O(n), where n is the number of features | Each iteration is costly: the Hessian is (n+1) × (n+1), and inverting a matrix is roughly O(n^3) |
| Use when the number of features is large (n > 10,000) | Use when the number of features is small (n < 1,000) |
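In the multivariate case, the per-iteration cost in the table comes from the linear solve with the Hessian. A minimal NumPy sketch of one Newton step on a quadratic (the matrix A and vector b are illustrative assumptions):

```python
import numpy as np

# One multivariate Newton step: theta_new = theta - H^{-1} * grad.
# Solving the n x n linear system costs roughly O(n^3) -- the expensive
# part the table refers to (solving is preferred over explicit inversion).
# Example: f(theta) = 0.5 * theta^T A theta - b^T theta, whose gradient is
# A theta - b and whose Hessian is A, so the minimizer solves A theta = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                # Hessian of the quadratic
b = np.array([1.0, 1.0])

theta = np.zeros(2)                       # initial guess
grad = A @ theta - b                      # gradient at current theta
theta = theta - np.linalg.solve(A, grad)  # Newton step
```

Since f here is exactly quadratic, this single step reaches the minimizer; for a general f, the step is repeated with the Hessian and gradient recomputed at each new theta.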

References:

[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

[2] https://www.youtube.com/watch?v=iwO0JPt59YQ