IID Assumption

This is an interesting thread on the topic : https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning


Some notes:


1) Method of cross validation changes based in i.i.d assumption.



Looking at the last three equation, p(D/h) = p((d1, d2, d3)/h) can be reduced to multiplication only when d1, d2, d3 are independent.



Panel Data And Analysis

From Wikipedia

In statistics and econometrics, panel data or longitudinal data[1][2] are multi-dimensional data involving measurements over time. Panel data contain observations of multiple phenomena obtained over multiple time periods for the same firms or individuals

Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).

A study that uses panel data is called a longitudinal study or panel study.


Cross sectional data:

Cross-sectional data, or a cross section of a study population, in statistics and econometrics is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.

Analysis of cross-sectional data usually consists of comparing the differences among the subjects.




Panel Data


Cross Sectional Data


Time Series


Gradient Descent vs Netwon’s Method

Newton’s method for finding roots:


In optimization we are essentially finding roots of derivative.




We are approximating function using Taylor series. We are finding minimum of this approximated function. This root will be our new guess. This perspective is used in derivation.



Andrew N.G Lecture [2]

 Gradient Descent  Newton
 Simpler  Slightly more complex (Requires computing and inverting hessian)
 Needs choice of learning rate alpha  No parameters (third point in image is optional )
 Needs more iteration  Needs fewer iteration
 Each iteration is cheaper O(n) where n is no of features  Each iteration is costly. Hessian is (n+1) * (n+1). Inverting matrix is roughly O(n^3)
 Use when no of features are less (n<1000)  Use when (n > 10,000)




*[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

*[2] https://www.youtube.com/watch?v=iwO0JPt59YQ


Regression Trees

Figure below shows how decision tree creates rectangle in predictor space:

As we have describe classification tree, the only difference here is how to come up with output instead of label and how to define a metric other than entropy/gini.

For output it uses the mean of leaves.

For metric CART methodology uses simple SSE.

  • SSE = ∑ (y_i – y1)² + ∑(y_i – y2)²
    • where y1 and y2 are means of two newly crated groups.
  • diff = SSE_before – SSE_after
    • We choose a predictor which minimizes SSE the most.
    • diff is highest



Applied predictive modeling by Max Kuhn and Kjell Johnson




Classification Trees

How classification Tree creates Rectangle in Predictor Space

  • Criterion to choose which predictor to Split on
    • Predictor can be continuous, binary, categorical
    • Analysis in this post applies to first two, categorical predictor is handled in separate blog post
    • In case of continuous predictor we sort it first and choose the midpoint to split on
  • Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes
  • Two Popular Criterion are Gini and Entropy
    • Note this criterion are not directed used to select predictor to split on
    • Rather we take a difference of them before and after splitting
    • While computing gini/entropy after splitting we apply weight to both splits



  •  Formula:
    • H = ∑ -p * log(p)
  • About :
    • Randomness
    • Information Carries
    • Highest when p=0.5
    • Higher Entropy => Higher Randomness => Carries more information
    • After splitting we want entropy to reduce
  • Initial :
    • n0 positive and m0 negative sample
    • g0 = (n+m) total samples
  • After Splitting
    • Group 1:
      • n1 positive
      • m1 negative
      • g1 = (n1 + m1) total
    • Group 2:
      • n2 positive
      • m2 negative
      • g2 = (n2 + m2) total
  • Before Entropy :
    • H_before =  -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
  • After Entropy :
    • H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
    • H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
    • H_after = (g1/g0) * H1 + (g2/g0)*H2
  • diff = H_before – H_after
  • Select a predictor for which diff is highest and split on it
    • Which means we are selecting a predictor which reduces randomness more
    • Which also mean we are selecting a predictor which reduces information more
      • So we are selecting a variable which carries more information
        • More important feature



  • Gini for 2 class :
    • G = p1 * (1-p1) + p2 * (1 – p2)
    • Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
  • Gini in general:
    • G = ∑ p * (1 – p) = 1 – ∑ p²
  • Like entropy gini is maximum when p1 = p2 = 0.5
    • In this case nodes are least pure
  • Gini is a measure of impurity which we want to reduce by splitting on predictor
  • diff = G_before – G_after
    • We will select a predictor for which diff is highest


ROC Curve

  • ROC curve can be constructed easily for classifier which outputs ranking.
    • For example decision tree outputs probability where we can change the threshold.
  • On the contrary decision tree outputs label
  • However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
  • Then we can change the threshold and plot ROC.



Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

  1. limit max. depth of trees
  2. Cost complexity pruning
  3. ensembles / bag more than just 1 tree
  4. set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)



Applied predictive modeling by Max Kuhn and Kjell Johnson


Value Function


Optimal Policy


Learning Via Value Function is not possible when we don’t know immediate rewards and next state. If they are known we can use dynamic programming methods described at Dynamic Programming for RL.


Q function



How it solve the problem of unknown reward and next state functions?

We calculate Q directly without using these values as described next in training rule for Q

Training Rule for Q

We initialize Q table with random values and keep updating it on each action based on reward get.


Convergence Conditions

  • System is deterministic MDP (Markov Decision Process)
  • Immediate reward values are bounded
  • Agent visit every state action pair infinitely often


Derivation for Convergence


Experimental Strategies

  • We also need to do exploration
  • So we pick up action with a probability such that
    • action having more Q value has higher probability
  • One way to formulate this is as follows where in value of k control exploration-exploitation trade-off.
    • Initially k is small (exploration) and then is slowly increased (exploitation)



How can we speed up Updating Sequence

  • Strategy 1
    • Consider the case for single absorbing goal state case. (What it means that if agent’s goal is to reach particular state) and initially all (q,a) pairs are initialized with 0.
    • So only one (q,a) pair will be update when we reach to reward state,
    • To speed up things, we can store intermediate state and action and then update backward till origin once we reach reward state.
    • Of course this will increase memory requirement
  • Strategy 2
    • Store immediate reward of past along with state-action pair and retrain periodically
    • If one Q(s, a) changes it will help change others as well
  • Strategy 3
    • If performing real environment action in environment is difficult we can perform simulation



Reference :

Machine Learning by Tom Mitchell