Newton’s method for finding roots:
In optimization we are essentially finding roots of the derivative.
We approximate the function with its Taylor series and find the minimum of this approximation (a root of the approximation's derivative). That point becomes our new guess. This perspective is used in the derivation.
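A minimal sketch of this idea: iterate x ← x − f′(x)/f″(x), which is Newton root-finding applied to f′. The objective f(x) = x − log(x) (minimum at x = 1) is a made-up example, not from the notes.

```python
# Newton's method for optimization: find a root of f'(x)
# by iterating x <- x - f'(x) / f''(x).
import math

def newton_minimize(x0, grad, hess, tol=1e-10, max_iter=50):
    """Minimize a 1-D function by Newton's method on its derivative."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)   # root-finding step on f'
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical objective: f(x) = x - log(x), so f'(x) = 1 - 1/x, f''(x) = 1/x**2
x_star = newton_minimize(0.5, grad=lambda x: 1 - 1 / x, hess=lambda x: 1 / x**2)
print(x_star)  # converges to 1.0
```

Note how the error roughly squares each iteration (quadratic convergence), which is why Newton's method needs so few steps.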
Andrew Ng Lecture
| Gradient Descent | Newton's Method |
| --- | --- |
| Simpler | Slightly more complex (requires computing and inverting the Hessian) |
| Needs a choice of learning rate α | No parameters to tune (the third point in the image is optional) |
| Needs more iterations | Needs fewer iterations |
| Each iteration is cheaper: O(n), where n is the number of features | Each iteration is costly: the Hessian is (n+1) × (n+1), and inverting a matrix is roughly O(n³) |
| Use when the number of features is large (n > 10,000) | Use when the number of features is small (n < 1,000) |
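The trade-off above can be seen on a toy quadratic, where Newton's method lands on the minimum in one step while gradient descent takes many cheap steps. The matrices here are illustrative, not from the lecture.

```python
import numpy as np

# Quadratic objective f(w) = 0.5 w^T A w - b^T w, gradient A w - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # symmetric positive definite "Hessian"
b = np.array([1.0, 2.0])
w_star = np.linalg.solve(A, b)           # exact minimizer

# Gradient descent: O(n) per step, needs a learning rate, many iterations.
w, alpha = np.zeros(2), 0.1
gd_iters = 0
while np.linalg.norm(A @ w - b) > 1e-8:
    w -= alpha * (A @ w - b)
    gd_iters += 1

# Newton's method: solves an (n x n) linear system (~O(n^3)) per step,
# but on a quadratic a single step reaches the minimum exactly.
w0 = np.zeros(2)
w_newton = w0 - np.linalg.solve(A, A @ w0 - b)

print(gd_iters)                          # hundreds of iterations
print(np.allclose(w_newton, w_star))     # True after one Newton step
```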
The motivation behind these notes is that geometry helps provide intuitive derivations for machine learning models and optimization scenarios!
A line in 2D is the analogue of a plane in 3D (both are hyperplanes), not of a line in 3D.
The concept of distance is essentially projection: it can come from either the sine of the angle (cross product) or the cosine (dot product).
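A small sketch of that point, with made-up vectors: the dot product gives the cosine component (projection along a direction) and the 2-D cross product gives the sine component (perpendicular distance to the line).

```python
import numpy as np

a = np.array([3.0, 4.0])    # the point, as a vector from the origin
u = np.array([1.0, 0.0])    # unit direction of a line through the origin

along = a @ u                          # |a| cos(theta): projection length onto the line
perp = abs(a[0] * u[1] - a[1] * u[0])  # |a| sin(theta): 2-D cross product, distance to the line

print(along, perp)  # 3.0 4.0
```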
Formula for the multivariate Gaussian distribution:
Formula for the univariate Gaussian distribution:
- There is a normalizing constant in both equations
- Σ being positive definite ensures the quadratic form in the exponent gives a downward-opening bowl (a single peak)
- σ² being positive likewise ensures the parabola in the exponent opens downwards
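The two densities referred to above can be written out directly; as a sanity check, in one dimension the multivariate formula reduces to the univariate one. The evaluation point and parameters below are arbitrary.

```python
import numpy as np

def univariate_pdf(x, mu, sigma2):
    # N(x; mu, sigma^2) = exp(-(x-mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def multivariate_pdf(x, mu, Sigma):
    # N(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    #                   / ((2 pi)^{n/2} |Sigma|^{1/2})
    n = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

p1 = univariate_pdf(0.3, mu=0.0, sigma2=2.0)
p2 = multivariate_pdf(np.array([0.3]), np.array([0.0]), np.array([[2.0]]))
print(np.isclose(p1, p2))  # True
```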
On the Covariance Matrix
Definition of covariance between two variables:
When we have more than two variables we present them in matrix form, so the covariance matrix will look like:
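As a sketch, the covariance matrix Cov(Xᵢ, Xⱼ) = E[(Xᵢ − E[Xᵢ])(Xⱼ − E[Xⱼ])] can be built from data directly and checked against NumPy's `np.cov`; the data here is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))       # 500 samples, 3 variables (one per column)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / (len(X) - 1)   # unbiased sample covariance

# np.cov expects variables in rows, hence the transpose.
print(np.allclose(Sigma, np.cov(X.T)))   # True
print(np.allclose(Sigma, Sigma.T))       # covariance matrices are symmetric
```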
- The formula for the multivariate Gaussian distribution demands that Σ be non-singular and symmetric positive definite (so that Σ⁻¹ exists and |Σ| > 0).
- For some data these demands might not be met, e.g. when one variable is a linear combination of the others, the sample covariance matrix is singular.
The following derivations are available at:
- We can prove that when the covariance matrix is diagonal (i.e. the variables are independent), the multivariate Gaussian distribution is simply the product of the univariate Gaussian distributions of the individual variables.
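That claim is easy to verify numerically: with a diagonal Σ, the joint density at a point equals the product of the univariate densities. The means, variances, and evaluation point below are made up.

```python
import numpy as np

def uni(x, mu, s2):
    # univariate Gaussian density
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

mu = np.array([1.0, -2.0])
s = np.array([0.5, 3.0])            # variances on the diagonal
Sigma = np.diag(s)

x = np.array([0.7, -1.1])
diff = x - mu
# 2-D multivariate density with normalizer (2 pi)^{n/2} |Sigma|^{1/2}
joint = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
    2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
product = uni(x[0], mu[0], s[0]) * uni(x[1], mu[1], s[1])

print(np.isclose(joint, product))  # True
```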
- It was derived that the shape of the isocontours (figure 1) is elliptical, with each axis length proportional to the standard deviation (square root of the variance) of that variable
- The above holds even when the covariance matrix is not diagonal, and for dimensions n > 2 (ellipsoids)
Linear Transformation Interpretation
This was proved in two steps:
Step 1: Factorize the covariance matrix as Σ = BBᵀ
Step 2: Change of variables, which we apply to the density function
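A sketch of the two steps above, under the standard linear-transformation view (Σ and μ below are arbitrary): factorize Σ = BBᵀ (e.g. by Cholesky), then map standard normals Z through X = μ + BZ, so X has covariance Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # hypothetical covariance
mu = np.array([1.0, -1.0])

B = np.linalg.cholesky(Sigma)            # Step 1: Sigma = B @ B.T
Z = rng.standard_normal((100_000, 2))    # Z ~ N(0, I)
X = mu + Z @ B.T                         # Step 2: change of variables X = mu + B Z

print(np.allclose(B @ B.T, Sigma))       # factorization is exact
print(np.cov(X.T))                       # sample covariance is close to Sigma
```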
p(X, Y) is the joint distribution
p(X | Y) is the conditional distribution
p(X) is the marginal distribution (Y is marginalized out)
You cannot get the conditional distribution from the joint distribution by integration alone; integration gives the marginal, and the conditional is the joint divided by that marginal.
There are just two rules of probability: the sum rule and the product rule. Bayes' theorem then follows from the product rule.
We might want to look at a table like the one below, calculate the joint and conditional distributions, and marginalize out one of the variables.
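Such a table calculation can be sketched with a small made-up joint table p(X, Y) (rows index X, columns index Y, entries sum to 1):

```python
import numpy as np

joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])         # hypothetical joint table p(X, Y)

# Sum rule: marginalize by summing over one axis.
p_x = joint.sum(axis=1)                  # p(X) = [0.3, 0.7]
p_y = joint.sum(axis=0)                  # p(Y) = [0.4, 0.6]

# Product rule rearranged: p(X | Y) = p(X, Y) / p(Y) -- division, not integration.
p_x_given_y = joint / p_y                # column j holds p(X | Y = j)

# Bayes' theorem is the product rule applied twice:
p_y_given_x = joint / p_x[:, None]
bayes = p_y_given_x * p_x[:, None] / p_y
print(np.allclose(bayes, p_x_given_y))   # True: recovers p(X | Y)
```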
Further reading :