Reinforcement learning

OpenAI Gym Environment

Open AI provides framework for creating environment and training on that environment. In this post I am pasting a simple notebook for a quick look up on how to use this environments and what all functions are available on environment object.

I have used environment available on github by Denny Britz and here are the references :

References :


[Example] Lagrange Multiplier With Equality Constraints


Stationary Point

Definition of stationary point from wikipedia :

In mathematics, particularly in calculus, a stationary point or critical point of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).


Lagrange multiplier helps us to find all the stationary points, It can be local minima, local maxima, global minima or global maxima. Once we evaluate objective function at each of these stationary point we can classify which one is local/global minima and maxima.





Reinforcement learning

Thompson Sampling


Thompson sampling is one approach for Multi Armed Bandits problem and about the Exploration-Exploitation dilemma faced in reinforcement learning.

Challenge in solving such a problem is that we might end up fetching the same arm again and again. Bayesian approach helps us solving this dilemma by setting prior with somewhat high variance.

Here is the code for two armed bandit. One has success probability of 40% (bandit 0) and another has 25% (bandit 1).

We are using beta distribution for deciding which arm to pull. Beta distribution has two parameter alpha and beta. Higher values of alpha, pulls distribution towards 1. Beta distribution is always confined between 0 and 1.

How we train is that for each feedback we receive we increment alpha by 1 if it was success or beta by 1 in case of failure. For choosing the arm we draw random sample from the distribution of each arm and select the arm with highest value.


And here is simulation results. We see that initially both the the armed are pulled frequently but slowly arm 1 is pulled less and less, but it is never straight away zero.





Deep Learning

Deep learning taking off

I recently started Andrew Ng’s specialization on deep learning and found these two interesting points :

One is about how performance of algorithm changes with the amount of data. Traditional algorithms have limits but Deep neural network has more advantages.



Also for the small amount of data traditional algorithms may win over neural nets with good feature engineering.

Second reason is that deep learning requires data, computation and efficient algorithms. Recent years have seen significant advancement in algorithm to increase computation efficiency. For example sigmoid to ReLU was an algorithmic change which allowed gradient to converge faster.


Ref :


Lagrange Duality

There are three things :

  1. Original problem (Primal problem)
  2. Dual function (Function of lagrange multiplier)
  3. Dual problem


Suppose p is the solution of primal problem and d the dual problem.


If original problem is minimization, we are interested in lower bound (d) such that d<p.  We want to find maximum value of d and dual problem becomes maximization problem.


If original problem is maximization, we are interested in finding upper bound (d) such that p<d. We want to find minimum value of d and dual problem become minimization problem.


Dual problem is always convex irrespective of primal problem.


d<p is weak duality, while d=p is strong duality.

If primal problem is convex strong duality generally holds.

Slater’s condition is sufficient tests for convex optimization.



Further reading :

Kanpsack Problem here

Convex Optimization by Stephen Boyd



Parametric and Nonparametric tests

We rarely heard of nonparametric tests while reading standard statistical books. However there are some scenarios where they should be used instead of parametric tests. [1] has beautiful blog about it, I am putting just a summary from that.


Different Tests

Table below displays various tests, I have verified that all of these tests are available in python stats package.


When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • It is important to follow guidance in the sample size of data as shown in table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • It has Statistical power


Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed and median is more useful than mean
    • Few billionaires can boost up the mean significantly
  • You have a very small sample size
    • Even less than what is mentioned in table above
  • You have ordinal data, ranked data, or outliers that you can’t remove




[1] :


[Theory] Lagrange Multiplier and Constrained Optimization

Lagrange multipliers helps us to solve constrained optimization problem.

An example would to maximize f(x, y) with the constraint of g(x, y) = 0.

Geometrical intuition is that points on g where f either maximizes or minimizes would be will have a parallel gradient of f and g

∇ f(x, y)  =  λ ∇  g(x, y)

We need three equations to solve for x, y and λ.

Solving above gradient with respect to x and y gives two equation and third is g(x, y) = 0

These will give us the point where f is either maximum or minimum and then we can calculate f manually to find out point of interest.

Lagrange is a function to wrap above in single equation.

L(x, y, λ) = f(x, y) – λ g(x, y)

And then we solve ∇ L(x, y, λ) = 0


Budget Intuition

Let f represent the revenue R and g represent cost C.

Also let g(x, y) = b where b is maximum cost that we can afford.

Let M* be the optimized value obtained by solving lagrangian and value of λ we got was λ*.


So λ is the rate with which maximum revenue will increase with unit change in b.


Inequality Constraint


Here is the problem of finding maximum of f(x) [2]



Above conditions are known as Karush-Kuhn-Tucker (KKT) conditions.

First – primal feasibility

Second – dual feasibility

Third  –  complementary slackness conditions

For the problem of finding minimum value of f(x) we minimize following with same KKT conditions:



This can be extended to arbitrary number of constraints.




  • This is relates to solving the dual of a problem for convex function as we had in Boyd’s book
  • For unconstrained optimization computing hessian of a matrix is one way of solving if it is continuous  [1]


formal summary




[2] Pattern Recognition and machine learning by Bishop


On multivariate Gaussian


Formula for multivariate gaussian distribution


Formula of univariate gaussian distribution



  • There is normality constant in both equations
  • Σ being a positive definite ensure quadratic bowl is downwards
  • σ2 also being positive ensure that parabola is downwards


On Covariance Matrix

Definition of covariance between two vectors:


When we have more than two variable we present them in matrix form. So covariance matrix will look like


  • Formula of multivariate gaussian distribution demands Σ to be singular and symmetric positive semidefinite, which in terms means sigma will be symmetric positive semidefinite.
  • For some data above demands might not meet



Following derivations  are available at [0]:

  • We can prove[0] that when covariance matrix is diagonal (i.e there is variables are independent) multivariate gaussian distribution is simply multiplication of single gaussian distribution of each variable.
  • It was derived that shape of isocontours (figure 1) is elliptical and axis length is proportional to individual variance of that variable
  • Above is true even when covariance matrix is not diagonal and for dimension n>2 (ellipsoids)


Linear Transformation Interpretation


This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix


Step-2 : Change of variables, which we apply to density function









Quick Note on Probability Rules

p(X, Y) is joint distribution

p(X/Y) is conditional distribution

p(X) is marginal distribution (Y is marginalized out).


You can not get conditional distribution from joint distribution with just by integration. There is no such relationship.

There are just two rules for probability. Sum rule and product rules. And then there is Bayes theorem.




We might want to look at a table like below and calculate joint and conditional distribution and marginalized out one of the variable. [1]



Further reading : 

[0] :