## Gradient Descent vs Newton’s Method

Newton’s method finds roots of a function. In optimization we apply it to the derivative: a minimum satisfies f′(x) = 0, so we are essentially finding roots of the derivative.

Equivalently, we approximate the function with a second-order Taylor series, find the minimum of this approximated function, and take that point as our new guess. This perspective is used in the derivation.
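A minimal 1-D sketch of this idea, assuming the update x ← x − f′(x)/f″(x); the example function and starting point are illustrative choices, not from the lecture:

```python
# Newton's method for optimization: root-finding applied to the derivative f'.

def newton_minimize(df, d2f, x0, iters=20):
    """Iterate x <- x - f'(x)/f''(x), i.e. Newton root-finding on f'."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# Illustrative example: f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3), f''(x) = 2.
x_min = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
print(x_min)  # converges to x = 3, the minimizer (in one step for a quadratic)
```

For a quadratic, the second-order Taylor approximation is exact, so a single Newton step lands on the minimizer.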

Andrew Ng’s lecture [2]

| Gradient Descent | Newton’s Method |
| --- | --- |
| Simpler | Slightly more complex (requires computing and inverting the Hessian) |
| Needs a choice of learning rate α | No parameters (the third point in the image is optional) |
| Needs more iterations | Needs fewer iterations |
| Each iteration is cheap: O(n), where n is the number of features | Each iteration is costly: the Hessian is (n+1) × (n+1), and inverting a matrix is roughly O(n³) |
| Use when the number of features is large (n > 10,000) | Use when the number of features is small (n < 1000) |
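To make the trade-off concrete, here is a hedged sketch on a small quadratic (the problem, learning rate, and iteration count are illustrative choices; assumes NumPy is available). Gradient descent takes many cheap steps; Newton solves against the Hessian once and lands on the minimizer:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)       # symmetric positive definite Hessian
b = rng.normal(size=n)
# Minimize f(x) = 0.5 x^T H x - b^T x; gradient is Hx - b, minimizer H^{-1} b.

# Gradient descent: many cheap O(n)-gradient iterations, alpha chosen by hand.
x_gd = np.zeros(n)
alpha = 0.01
for _ in range(2000):
    x_gd = x_gd - alpha * (H @ x_gd - b)

# Newton: one costly step, solving an n x n linear system (roughly O(n^3)).
x_newton = np.zeros(n)
x_newton = x_newton - np.linalg.solve(H, H @ x_newton - b)

print(np.allclose(x_gd, x_newton, atol=1e-4))  # both reach the same minimizer
```

For a quadratic the Hessian is constant, so the single Newton step is exact; gradient descent needs the 2000 iterations (and a stable α) to get there.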

References:

## Class 12 Geometry Notes

The motivation behind these notes is that geometry provides intuitive derivations for machine learning models and optimization scenarios!

A line in 2D corresponds to a plane in 3D (both are hyperplanes), not to a line in 3D.

The concept of distance is essentially projection: it uses either the sine of the angle (cross product) or the cosine (dot product).
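A small sketch of this in 2D, for a point P and a line through A with unit direction d (the helper name and example values are illustrative):

```python
def decompose(P, A, d):
    """Return (along, away) components of vector AP relative to unit direction d."""
    apx, apy = P[0] - A[0], P[1] - A[1]
    along = apx * d[0] + apy * d[1]      # dot product  -> |AP| cos(theta), projection onto the line
    away = abs(apx * d[1] - apy * d[0])  # 2-D cross product -> |AP| sin(theta), distance to the line
    return along, away

# Point (3, 4) against the x-axis (A at the origin, d = (1, 0)):
print(decompose((3, 4), (0, 0), (1, 0)))  # (3, 4): projection 3, distance 4
```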

References:

http://www.ncert.nic.in/index.html

## On the Multivariate Gaussian

### Formulas

Formula for the multivariate Gaussian distribution:

Formula for the univariate Gaussian distribution:
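The two formulas referenced above (rendered as images in the original notes) are the standard densities, reproduced here for reference:

```latex
% Multivariate Gaussian, x, \mu \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n \times n}
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)

% Univariate Gaussian
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \, \sigma}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```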

Notes:

• There is a normalizing constant in both equations
• Σ being positive definite ensures the quadratic bowl in the exponent opens downwards, so the density peaks at the mean
• Likewise, σ² being positive ensures the parabola in the univariate exponent opens downwards

### On Covariance Matrix

Definition of covariance between two random variables:

When we have more than two variables, we arrange the pairwise covariances in matrix form. So the covariance matrix will look like

• The formula for the multivariate Gaussian distribution demands that Σ be non-singular and symmetric positive semidefinite, which in turn means Σ must be symmetric positive definite
• For some data these demands might not be met (for example, the sample covariance matrix is singular when there are fewer samples than dimensions)
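A hedged sketch of the failure case mentioned above, assuming NumPy (the data shape is an illustrative choice): with fewer samples than dimensions, the sample covariance matrix is symmetric and positive semidefinite but singular, so the density formula cannot be evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))      # 3 samples in 5 dimensions
S = np.cov(X, rowvar=False)      # 5 x 5 sample covariance matrix

print(np.allclose(S, S.T))       # symmetric: True
eigvals = np.linalg.eigvalsh(S)
print(np.all(eigvals > -1e-10))  # positive semidefinite (up to round-off): True
print(np.linalg.matrix_rank(S))  # rank at most 2 < 5, so S is singular
```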

### Derivations

The following derivations are available at [0]:

• We can prove [0] that when the covariance matrix is diagonal (i.e., the variables are independent), the multivariate Gaussian distribution is simply the product of the univariate Gaussian distributions of the individual variables.
• It was derived that the isocontours (figure 1) are elliptical, with axis lengths proportional to the standard deviation of the corresponding variable
• The above holds even when the covariance matrix is not diagonal, and in dimensions n > 2 (ellipsoids)
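The first bullet can be checked numerically. A sketch assuming NumPy, with the density written out from the standard formulas; the test point and parameters are arbitrary illustrative choices:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density, written directly from the formula."""
    n = len(x)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def uni_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu = np.array([1.0, -2.0, 0.5])
variances = np.array([0.5, 2.0, 1.5])
Sigma = np.diag(variances)            # diagonal covariance: independent variables
x = np.array([0.3, -1.0, 1.2])

lhs = mvn_pdf(x, mu, Sigma)
rhs = np.prod([uni_pdf(x[i], mu[i], variances[i]) for i in range(3)])
print(np.isclose(lhs, rhs))  # the joint density factors into the product: True
```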

### Linear Transformation Interpretation

This was proved in two steps [0]:

Step-1: Factorizing the covariance matrix

Step-2: Change of variables, applied to the density function
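The two steps above can be sketched empirically, assuming NumPy (the mean, covariance, and sample count are illustrative choices): if Σ = BBᵀ, e.g. via a Cholesky factorization (Step-1), and z ~ N(0, I), then x = μ + Bz ~ N(μ, Σ) (Step-2, the change of variables).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

B = np.linalg.cholesky(Sigma)               # Step-1: Sigma = B @ B.T
z = rng.standard_normal(size=(100_000, 2))  # z ~ N(0, I)
x = mu + z @ B.T                            # Step-2: x = mu + B z

# Empirical mean and covariance of the transformed samples match (mu, Sigma).
print(np.allclose(x.mean(axis=0), mu, atol=0.05))
print(np.allclose(np.cov(x, rowvar=False), Sigma, atol=0.05))
```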

## Quick Note on Probability Rules

p(X, Y) is the joint distribution

p(X|Y) is the conditional distribution

p(X) is the marginal distribution (Y is marginalized out).

You cannot get the conditional distribution from the joint distribution just by integration; integrating the joint gives a marginal. The conditional comes from dividing: p(X|Y) = p(X, Y) / p(Y).

There are just two rules of probability: the sum rule and the product rule. Bayes’ theorem then follows from them.
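In symbols, for discrete variables:

```latex
\text{Sum rule:}\qquad p(X) = \sum_{Y} p(X, Y)

\text{Product rule:}\qquad p(X, Y) = p(X \mid Y)\, p(Y)

\text{Bayes' theorem:}\qquad p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}
```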

We might want to look at a table like the one below, calculate the joint and conditional distributions, and marginalize out one of the variables. [1]
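A small sketch of that exercise; the table values below are made up for illustration (a joint p(X, Y) over X ∈ {0, 1}, Y ∈ {0, 1} whose entries sum to 1), not taken from [1]:

```python
joint = {
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Sum rule: marginalize Y out to get p(X).
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# Product rule rearranged: p(Y | X) = p(X, Y) / p(X), keyed as (y, x).
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

print(p_x)                  # marginals p(X=0) = 0.4, p(X=1) = 0.6 (up to float round-off)
print(p_y_given_x[(1, 0)])  # p(Y=1 | X=0) = 0.3 / 0.4, i.e. about 0.75
```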