[Example] Lagrange Multiplier With Equality Constraints

 

Stationary Point

Definition of a stationary point, from Wikipedia:

In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).

 

The Lagrange multiplier method helps us find all the stationary points; each can be a local minimum, local maximum, global minimum, or global maximum. Once we evaluate the objective function at each of these stationary points, we can classify which one is a local/global minimum or maximum.

 

Example
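A small worked instance (the choice of f and g here is our own illustration): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

∇f = λ ∇g gives 1 = 2λx and 1 = 2λy, so x = y = 1/(2λ).

Substituting into the constraint gives 2x² = 1, so x = y = ±1/√2.

Evaluating f at the two stationary points: f(1/√2, 1/√2) = √2 is the (global) maximum, and f(−1/√2, −1/√2) = −√2 is the (global) minimum.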

 

 


Lagrange Duality

There are three objects involved:

  1. Original problem (primal problem)
  2. Dual function (a function of the Lagrange multipliers)
  3. Dual problem

 

Suppose p is the optimal value of the primal problem and d is the optimal value of the dual problem.

 

If the original problem is a minimization, we are interested in a lower bound d such that d ≤ p. We want to find the maximum value of d, so the dual problem becomes a maximization problem.

 

If the original problem is a maximization, we are interested in an upper bound d such that p ≤ d. We want to find the minimum value of d, so the dual problem becomes a minimization problem.

 

The dual problem is always convex irrespective of the primal problem: the dual function is always concave, so maximizing it is a convex problem.

 

d ≤ p is weak duality, while d = p is strong duality.

If the primal problem is convex, strong duality generally holds.

Slater's condition is a sufficient condition for strong duality in convex optimization.
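A small worked illustration (our own, not from the reading below): minimize x² subject to x ≥ 1, so p = 1, attained at x = 1.

L(x, λ) = x² + λ(1 − x), with λ ≥ 0

g(λ) = minimum over x of L(x, λ) = λ − λ²/4  (the unconstrained minimizer is x = λ/2)

d = maximum over λ ≥ 0 of g(λ) = 1, attained at λ = 2

The dual function g(λ) is concave regardless of the primal, and here d = p: strong duality holds, as Slater's condition guarantees for this convex problem (x = 2 is strictly feasible).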

 

 

Further reading:

Knapsack problem

http://ocw.nctu.edu.tw/upload/classbfs121109082344836.pdf

Convex Optimization by Stephen Boyd

 


[Theory] Lagrange Multiplier and Constrained Optimization

Lagrange multipliers help us solve constrained optimization problems.

An example would be to maximize f(x, y) subject to the constraint g(x, y) = 0.

The geometric intuition is that at the points on g where f attains a maximum or minimum, the gradients of f and g are parallel:

∇f(x, y) = λ ∇g(x, y)

We need three equations to solve for x, y, and λ.

Expanding the gradient condition above in x and y gives two equations; the third is g(x, y) = 0.

These give us the candidate points where f may be maximum or minimum; we can then evaluate f at each candidate to find the point of interest.

The Lagrangian is a function that wraps the above into a single equation:

L(x, y, λ) = f(x, y) − λ g(x, y)

We then solve ∇L(x, y, λ) = 0.
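As a minimal sketch of this procedure in code (the particular f and g are our own illustrative choices), sympy can solve ∇L = 0 directly:

    # Solve grad L = 0 with sympy for an illustrative problem:
    # maximize f(x, y) = x*y subject to g(x, y) = x + y - 10 = 0.
    import sympy as sp

    x, y, lam = sp.symbols('x y lambda', real=True)
    f = x * y
    g = x + y - 10

    L = f - lam * g  # the Lagrangian L(x, y, lambda)

    # Stationary points of L: all three partial derivatives vanish.
    grad = [sp.diff(L, v) for v in (x, y, lam)]
    for sol in sp.solve(grad, [x, y, lam], dict=True):
        print(sol, ' f =', f.subs(sol))
    # Prints x = y = 5, lambda = 5, f = 25: the constrained maximum.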

 

Budget Intuition

Let f represent the revenue R and g represent the cost C.

Also let g(x, y) = b, where b is the maximum cost (budget) we can afford.

Let M* be the optimal value obtained by solving the Lagrangian, and let λ* be the value of λ at the optimum.

dM*/db = λ*

So λ* is the rate at which the maximum revenue increases with a unit change in the budget b.
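One line of justification (a standard envelope-theorem argument, added for completeness): writing the Lagrangian for the constraint g(x, y) = b as L = f − λ(g − b), at the optimum

M*(b) = L(x*, y*, λ*) = f(x*, y*) − λ* (g(x*, y*) − b)

and since all partial derivatives of L with respect to x, y, and λ vanish at the stationary point, differentiating with respect to b leaves only the explicit b term, giving dM*/db = λ*.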

 

Inequality Constraint

 

Here is the problem of finding the maximum of f(x) subject to an inequality constraint [2]:

maximize f(x)  subject to  g(x) ≥ 0,  via the Lagrangian  L(x, λ) = f(x) + λ g(x)

g(x) ≥ 0,    λ ≥ 0,    λ g(x) = 0

The above conditions are known as the Karush–Kuhn–Tucker (KKT) conditions.

First: primal feasibility, g(x) ≥ 0

Second: dual feasibility, λ ≥ 0

Third: complementary slackness, λ g(x) = 0

For the problem of finding the minimum of f(x), we instead minimize the following, subject to the same KKT conditions:

L(x, λ) = f(x) − λ g(x)

 

This can be extended to an arbitrary number of constraints, with one multiplier per constraint:

L(x, {λⱼ}, {μₖ}) = f(x) + Σⱼ λⱼ gⱼ(x) + Σₖ μₖ hₖ(x)

where gⱼ(x) = 0 are equality constraints, hₖ(x) ≥ 0 are inequality constraints, μₖ ≥ 0, and μₖ hₖ(x) = 0.
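A minimal sketch of solving a KKT system in code (the problem is our own illustration, not from [2]): minimize f(x, y) = (x − 1)² + (y − 1)² subject to g(x, y) = x + y − 3 ≥ 0.

    # Solve the KKT system with sympy for the illustrative problem above.
    import sympy as sp

    x, y, lam = sp.symbols('x y lambda', real=True)
    f = (x - 1)**2 + (y - 1)**2
    g = x + y - 3

    L = f - lam * g                      # minimization convention from above
    stationarity = [sp.diff(L, x), sp.diff(L, y)]
    slackness = [lam * g]                # complementary slackness: lambda * g = 0

    for sol in sp.solve(stationarity + slackness, [x, y, lam], dict=True):
        # Keep solutions satisfying primal (g >= 0) and dual (lambda >= 0) feasibility.
        if g.subs(sol) >= 0 and sol[lam] >= 0:
            print(sol, ' f =', f.subs(sol))
    # The active solution x = y = 3/2, lambda = 1 is the constrained minimum, f = 1/2.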

 

Notes

  • This relates to solving the dual of a problem for a convex function, as in Boyd's book
  • For unconstrained optimization, computing the Hessian is one way to classify stationary points when the function is twice continuously differentiable [1]

 


References

[0] https://www.khanacademy.org/math/multivariable-calculus/applications-of-multivariable-derivatives/constrained-optimization/a/lagrange-multipliers-single-constraint

[1] https://www.youtube.com/watch?v=TyfZaZSF5nA

[2] Pattern Recognition and Machine Learning by Bishop


On the Multivariate Gaussian

Formulas

Formula for the multivariate Gaussian distribution:

p(x; μ, Σ) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) )

Formula for the univariate Gaussian distribution:

p(x; μ, σ²) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

Notes:

  • There is a normalization constant in both equations
  • Σ being positive definite ensures that the quadratic form in the exponent opens downward, so the density decays in every direction
  • σ² being positive likewise ensures that the parabola in the univariate exponent opens downward
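As a sanity check (our own sketch, not from the post), the multivariate formula above can be implemented directly and compared against scipy's reference implementation:

    # Direct implementation of the multivariate Gaussian density, checked
    # against scipy.stats.multivariate_normal.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_pdf(x, mu, Sigma):
        # Density of N(mu, Sigma) at x, straight from the formula above.
        n = mu.shape[0]
        diff = x - mu
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])
    x = np.array([0.5, 0.5])
    assert np.isclose(gaussian_pdf(x, mu, Sigma),
                      multivariate_normal(mean=mu, cov=Sigma).pdf(x))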

 

On Covariance Matrix

Definition of the covariance between two random variables:

Cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]

When we have more than two variables, we collect the pairwise covariances in matrix form, so the covariance matrix looks like

Σ = E[ (X − μ)(X − μ)ᵀ ],  with entries Σᵢⱼ = Cov(Xᵢ, Xⱼ)

  • The formula for the multivariate Gaussian distribution demands that Σ be non-singular and symmetric positive semidefinite, which together mean Σ must be symmetric positive definite
  • For some data these demands might not be met; for example, the sample covariance matrix can be singular when there are fewer samples than dimensions (see the sketch below)
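A quick numerical sketch (our own) of the covariance matrix as the matrix of pairwise covariances:

    # Sample covariance: element-wise definition versus numpy.cov.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))        # 500 samples of 3 variables

    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / (X.shape[0] - 1)   # unbiased E[(X-mu)(X-mu)^T]

    assert np.allclose(Sigma, np.cov(X, rowvar=False))
    assert np.allclose(Sigma, Sigma.T)                 # symmetric by construction
    assert np.all(np.linalg.eigvalsh(Sigma) >= 0)      # positive semidefinite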

 

Derivations

The following derivations are available at [0]:

  • We can prove [0] that when the covariance matrix is diagonal (i.e., the variables are independent), the multivariate Gaussian distribution is simply the product of the univariate Gaussian distributions of each variable; the sketch after figure 1 checks this numerically
  • It was derived that the isocontours (figure 1) are elliptical, with each axis length proportional to the standard deviation (square root of the variance) of the corresponding variable
  • The above holds even when the covariance matrix is not diagonal, and for dimension n > 2 (ellipsoids)

[Figure 1: elliptical isocontours of a two-dimensional Gaussian density]
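A quick numerical check of the first bullet (our own sketch): with a diagonal covariance, the joint density equals the product of the univariate densities.

    # Diagonal-covariance Gaussian factorizes into univariate Gaussians.
    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu = np.array([1.0, -2.0])
    sigmas = np.array([0.5, 2.0])        # standard deviations
    Sigma = np.diag(sigmas**2)           # diagonal covariance matrix

    x = np.array([0.3, -1.0])
    joint = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
    product = np.prod(norm(loc=mu, scale=sigmas).pdf(x))
    assert np.isclose(joint, product)    # identical up to floating point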

Linear Transformation Interpretation

If X ∼ N(μ, Σ) with Σ symmetric positive definite, then X has the same distribution as B Z + μ, where Z ∼ N(0, I) and B is an invertible matrix satisfying Σ = B Bᵀ.

This was proved in two steps [0]:

Step 1: Factorize the covariance matrix as Σ = B Bᵀ for some invertible B (for instance, B = U Λ^(1/2) from the eigendecomposition Σ = U Λ Uᵀ).

Step 2: Apply the change of variables Z = B⁻¹(X − μ) to the density function:

p_X(x) = p_Z( B⁻¹(x − μ) ) · |det(B⁻¹)|
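This interpretation is also how one samples from N(μ, Σ) in practice; a minimal sketch (our own, using a Cholesky factorization as the choice of B):

    # Sample X ~ N(mu, Sigma) via the linear transformation X = B Z + mu.
    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 2.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    B = np.linalg.cholesky(Sigma)          # one valid B with Sigma = B @ B.T
    Z = rng.standard_normal((100_000, 2))  # Z ~ N(0, I)
    X = Z @ B.T + mu                       # rows are samples of X ~ N(mu, Sigma)

    print(X.mean(axis=0))                  # ~ mu
    print(np.cov(X, rowvar=False))         # ~ Sigma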

 

References

[0] http://cs229.stanford.edu/section/gaussians.pdf