- Pattern Recognition and Machine Learning by Bishop [Page no 244]
- Andrew Ng's course by deeplearning.ai
Definition of a stationary point, from Wikipedia:
In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).
The method of Lagrange multipliers helps us find all the stationary points of a constrained problem. A stationary point can be a local minimum, a local maximum, a global minimum or a global maximum. Once we evaluate the objective function at each of these stationary points, we can tell which of them are local and which are global minima and maxima.
There are three things :
Suppose p is the optimal value of the primal problem and d is the optimal value of the dual problem.
If the original problem is a minimization, we are interested in a lower bound d such that d ≤ p. We want the largest such d, so the dual problem becomes a maximization problem.
If the original problem is a maximization, we are interested in an upper bound d such that p ≤ d. We want the smallest such d, so the dual problem becomes a minimization problem.
The dual problem is always a convex optimization problem, irrespective of the primal problem.
For a minimization primal, d ≤ p is weak duality, which always holds, while d = p is strong duality.
If the primal problem is convex, strong duality usually holds, but not always.
Slater’s condition is a sufficient condition for strong duality in convex optimization problems.
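As a small illustration of weak and strong duality, the sketch below uses a toy problem that is not from these notes: minimize x² subject to x ≥ 1. The names dual_function, p and d are chosen here for illustration only. The Lagrangian is L(x, λ) = x² + λ(1 − x) with λ ≥ 0, and minimizing it over x (at x = λ/2) gives the dual function q(λ) = λ − λ²/4.

```python
# Toy primal (an assumption for illustration): minimize x**2 subject to x >= 1.
# The primal optimum is at x = 1, so the primal optimal value is p = 1.
p = 1.0


def dual_function(lam):
    """q(lam) = min over x of x**2 + lam * (1 - x), attained at x = lam / 2."""
    x = lam / 2.0
    return x ** 2 + lam * (1.0 - x)


# Scan multipliers lam >= 0 on a grid; every q(lam) is a lower bound on p.
lams = [i / 100.0 for i in range(0, 501)]
values = [dual_function(lam) for lam in lams]

d = max(values)                       # best lower bound found on the grid
best_lam = lams[values.index(d)]      # multiplier that attains it

weak_duality_holds = all(q <= p + 1e-12 for q in values)
duality_gap = p - d
```

In this toy problem the primal is convex and x = 2 is strictly feasible, so Slater's condition holds and the gap closes at λ = 2.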
Further reading :
Knapsack Problem here
Convex Optimization by Stephen Boyd
The motivation behind these notes is that geometry helps provide intuitive derivations of machine learning models and optimization scenarios.
A line in 2D is the analogue of a plane in 3D, not of a line in 3D: each is described by a single linear equation.
The concept of distance is essentially projection. It can be computed with either the sine (cross product) or the cosine (dot product).
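A minimal sketch of this idea in 2D, with a made-up point and line (the helper names dot and cross and the sample values are illustrative, not from the notes): the component of a vector along the line comes from the dot product (cosine), and the perpendicular distance to the line comes from the cross product (sine).

```python
import math


def dot(u, v):
    """Dot product of two 2D vectors."""
    return u[0] * v[0] + u[1] * v[1]


def cross(u, v):
    """z-component of the 3D cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]


# Hypothetical example: a line through `a` with direction `u`, and a point `p`.
a = (0.0, 0.0)
u = (1.0, 0.0)
p = (3.0, 4.0)

ap = (p[0] - a[0], p[1] - a[1])       # vector from the line's anchor to p
u_len = math.hypot(u[0], u[1])

along = dot(ap, u) / u_len                  # |ap| * cos(theta): projection onto the line
perpendicular = abs(cross(u, ap)) / u_len   # |ap| * sin(theta): distance to the line
```

The two components satisfy along² + perpendicular² = |ap|², which is the Pythagorean split of the vector into its parts along and across the line.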
Lagrange multipliers help us solve constrained optimization problems.
An example would be to maximize f(x, y) subject to the constraint g(x, y) = 0.
The geometrical intuition is that at the points on g where f is maximized or minimized, the gradients of f and g are parallel:
∇ f(x, y) = λ ∇ g(x, y)
We need three equations to solve for x, y and λ.
Writing the above gradient equation component-wise in x and y gives two equations, and the third is g(x, y) = 0.
Solving these gives the points where f is either maximum or minimum; we then evaluate f at each of them to find the point of interest.
The Lagrangian is a function that wraps the above into a single equation.
L(x, y, λ) = f(x, y) – λ g(x, y)
And then we solve ∇ L(x, y, λ) = 0
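A worked example may help here. The functions below are hypothetical choices, not from the notes: f(x, y) = x·y with the constraint g(x, y) = x + y − 10 = 0. Solving ∇L = 0 by hand gives y = λ, x = λ and x + y = 10, so x = y = λ = 5. The sketch checks that candidate numerically with a central-difference gradient (the helper grad is an illustrative name).

```python
def f(x, y):
    return x * y


def g(x, y):
    return x + y - 10.0


def lagrangian(x, y, lam):
    return f(x, y) - lam * g(x, y)


def grad(func, point, h=1e-6):
    """Central-difference gradient of func at point (a tuple of floats)."""
    out = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        out.append((func(*up) - func(*down)) / (2 * h))
    return out


# Candidate obtained by hand from the three equations.
candidate = (5.0, 5.0, 5.0)          # (x, y, lambda)

grad_L = grad(lagrangian, candidate)  # should vanish in all three components
grad_f = grad(f, candidate[:2])
grad_g = grad(g, candidate[:2])
```

At the candidate, grad_f is (5, 5) and grad_g is (1, 1), so ∇f = λ∇g with λ = 5, and the constrained maximum value is f(5, 5) = 25.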
Let f represent the revenue R and g represent the cost C.
Also let g(x, y) = b, where b is the maximum cost that we can afford.
Let M* be the optimal value obtained by solving the Lagrangian, and let λ* be the corresponding value of λ.
Then λ* is the rate at which the maximum revenue M* increases per unit increase in b, that is, λ* = dM*/db.
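This interpretation can be checked numerically on a toy problem. The revenue f(x, y) = x·y and cost g(x, y) = x + y below are hypothetical choices, not from the notes, and the Lagrangian is assumed to be L = f − λ(g − b). Setting ∇L = 0 by hand gives x = y = λ = b/2, so M*(b) = b²/4.

```python
def solve(b):
    """Closed-form stationary point of L = x*y - lam*(x + y - b) for budget b."""
    x = y = lam = b / 2.0
    max_revenue = x * y
    return x, y, lam, max_revenue


b = 10.0
db = 1e-4

_, _, lam_star, m_star = solve(b)
_, _, _, m_up = solve(b + db)
_, _, _, m_down = solve(b - db)

# Central-difference estimate of dM*/db, to compare against lam_star.
dM_db = (m_up - m_down) / (2 * db)
```

With b = 10 the multiplier is λ* = 5, and the finite-difference slope of M* with respect to b agrees with it up to rounding.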
Here is the problem of finding the maximum of f(x) subject to an inequality constraint:
The above conditions are known as the Karush-Kuhn-Tucker (KKT) conditions.
First – primal feasibility
Second – dual feasibility
Third – complementary slackness conditions
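The problem statement and the conditions themselves are missing from the text above. Assuming the convention of Bishop's appendix on Lagrange multipliers (maximize f(x) subject to g(x) ≥ 0, which is an assumption since the original formulas are not shown), they would read as follows. Note the plus sign in the Lagrangian; for an equality constraint the sign of λ is immaterial, but for an inequality it is tied to the requirement λ ≥ 0.

```latex
\begin{aligned}
&\text{maximize } f(\mathbf{x}) \quad \text{subject to } g(\mathbf{x}) \ge 0, \\
&L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda\, g(\mathbf{x}), \\
&g(\mathbf{x}) \ge 0 \quad \text{(primal feasibility)}, \\
&\lambda \ge 0 \quad \text{(dual feasibility)}, \\
&\lambda\, g(\mathbf{x}) = 0 \quad \text{(complementary slackness)}.
\end{aligned}
```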
For the problem of finding the minimum value of f(x), we minimize the following with the same KKT conditions:
This can be extended to an arbitrary number of constraints.
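A minimal numerical check of the KKT conditions for a one-dimensional minimization. It assumes Bishop's convention for minimization, L(x, λ) = f(x) − λ g(x) with constraint g(x) ≥ 0; the objective f(x) = (x − 2)² and both constraints are made-up examples, and kkt_report is an illustrative helper name.

```python
def kkt_report(x, lam, f_prime, g, g_prime, tol=1e-9):
    """Check the KKT conditions for: minimize f(x) subject to g(x) >= 0.

    Assumes the Lagrangian L(x, lam) = f(x) - lam * g(x).
    """
    return {
        "stationarity": abs(f_prime(x) - lam * g_prime(x)) <= tol,
        "primal_feasibility": g(x) >= -tol,
        "dual_feasibility": lam >= -tol,
        "complementary_slackness": abs(lam * g(x)) <= tol,
    }


def f_prime(x):
    """Derivative of the hypothetical objective f(x) = (x - 2)**2."""
    return 2.0 * (x - 2.0)


# Case 1: constraint x <= 1, active at the solution x = 1 with lam = 2.
active = kkt_report(1.0, 2.0, f_prime, lambda x: 1.0 - x, lambda x: -1.0)

# Case 2: constraint x <= 3, inactive at the solution x = 2 with lam = 0.
inactive = kkt_report(2.0, 0.0, f_prime, lambda x: 3.0 - x, lambda x: -1.0)

# A non-solution: x = 1 with lam = 0 is feasible but not stationary.
wrong = kkt_report(1.0, 0.0, f_prime, lambda x: 1.0 - x, lambda x: -1.0)
```

The two cases show both sides of complementary slackness: either the constraint is active (g = 0, λ > 0) or it is inactive (g > 0, λ = 0).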
Pattern Recognition and Machine Learning by Bishop
Formula for the multivariate Gaussian distribution
Formula for the univariate Gaussian distribution
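The formulas themselves do not appear above. For reference, the standard forms, written in Bishop's notation (D is the dimensionality, μ the mean and Σ the covariance matrix), are the multivariate density followed by the univariate one:

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}}
    \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}

\mathcal{N}(x \mid \mu, \sigma^{2})
  = \frac{1}{(2\pi\sigma^{2})^{1/2}}
    \exp\!\left\{ -\frac{1}{2\sigma^{2}} (x - \mu)^{2} \right\}
```

Setting D = 1 and Σ = σ² in the first expression recovers the second.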
Definition of covariance between two vectors:
When we have more than two variables, we present the covariances in matrix form, so the covariance matrix will look like:
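Neither formula is shown above. The standard definition is cov[x, y] = E[(x − E[x])(y − E[y])ᵀ], and entry (i, j) of the covariance matrix is the covariance between variables i and j. A minimal sketch on made-up data, using the population form (dividing by N, which is an assumption; a sample estimate would divide by N − 1):

```python
def covariance_matrix(rows):
    """Population covariance matrix of a list of equal-length observations."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    cov = [[0.0] * dims for _ in range(dims)]
    for i in range(dims):
        for j in range(dims):
            # Average product of deviations from the mean for variables i and j.
            cov[i][j] = sum(
                (row[i] - means[i]) * (row[j] - means[j]) for row in rows
            ) / n
    return cov


# Hypothetical data: three observations of two variables, with y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
sigma = covariance_matrix(data)
```

The diagonal entries are the variances of the individual variables, and the matrix is symmetric because cov[xᵢ, xⱼ] = cov[xⱼ, xᵢ].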
Following derivations are available at :
This was proved in two steps:
Step 1: Factorizing the covariance matrix
Step 2: Change of variables, which we apply to the density function
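The two steps can be illustrated on a 2×2 example. This is a sketch under assumptions: it takes the factorization to be the eigendecomposition (as in Bishop's treatment), the covariance matrix and the point are made up, and eig_sym_2x2 is a hypothetical closed-form helper that only handles symmetric 2×2 matrices with a nonzero off-diagonal entry. Step 1 expresses Σ through its eigenvalues λᵢ and orthonormal eigenvectors uᵢ; step 2 changes variables to yᵢ = uᵢᵀ(x − μ), under which the quadratic form in the exponent becomes the sum of yᵢ²/λᵢ and |Σ| becomes the product of the eigenvalues.

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of [[a, b], [b, c]], assuming b != 0."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = b, lam - a            # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


# Hypothetical covariance matrix [[a, b], [b, c]], mean, and evaluation point.
a, b, c = 2.0, 1.0, 2.0
mu = (0.0, 0.0)
x = (1.0, 2.0)
dx, dy = x[0] - mu[0], x[1] - mu[1]

# Direct route: (x - mu)^T Sigma^{-1} (x - mu) using the 2x2 inverse formula.
det = a * c - b * b
quad_direct = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

# Step 1: factorize Sigma. Step 2: change variables to y_i = u_i . (x - mu).
pairs = eig_sym_2x2(a, b, c)
quad_eigen = sum((ux * dx + uy * dy) ** 2 / lam for lam, (ux, uy) in pairs)
det_from_eigs = pairs[0][0] * pairs[1][0]
```

Both routes give the same quadratic form, and the determinant equals the product of the eigenvalues, so in the rotated coordinates the density factorizes into a product of independent univariate Gaussians.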