This is an interesting thread on the topic: https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning
1) The method of cross-validation changes based on the i.i.d. assumption.
Looking at the last three equations, p(D|h) = p((d1, d2, d3)|h) can be reduced to a product of per-sample likelihoods only when d1, d2, d3 are independent.
Newton’s method for finding roots:
In optimization we are essentially finding roots of the derivative.
We approximate the function with a second-order Taylor series and find the minimum of this approximation; that minimum becomes our new guess. This perspective is used in the derivation.
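A minimal sketch of that iteration (the example function, starting point, and helper names are mine, for illustration): each step jumps to the minimum of the local quadratic approximation, i.e. x_new = x - f'(x) / f''(x).

```python
def newton_minimize(df, ddf, x, tol=1e-10, max_iter=100):
    """Minimize f by applying Newton's root finding to f':
    x_new = x - f'(x) / f''(x), i.e. the minimum of the quadratic
    Taylor approximation of f at the current x."""
    for _ in range(max_iter):
        step = df(x) / ddf(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x**4 - 3*x**3 + 2 has a local minimum at x = 9/4.
df = lambda x: 4 * x**3 - 9 * x**2    # f'(x)
ddf = lambda x: 12 * x**2 - 18 * x    # f''(x)
x_min = newton_minimize(df, ddf, x=3.0)
```

Starting from x = 3, the iterates converge quadratically to 2.25.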
Andrew Ng Lecture

| Gradient Descent | Newton's Method |
| --- | --- |
| Simpler | Slightly more complex (requires computing and inverting the Hessian) |
| Needs a choice of learning rate alpha | No parameters to tune |
| Needs more iterations | Needs fewer iterations |
| Each iteration is cheaper: O(n), where n is the number of features | Each iteration is costly: the Hessian is (n+1) x (n+1), and inverting a matrix is roughly O(n^3) |
| Use when the number of features is large (n > 10,000) | Use when the number of features is small (n < 1,000) |
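The iteration-count difference can be made concrete with a toy sketch (the function, learning rate, and function names are mine): on f(x) = (x - 3)^2 the quadratic Taylor approximation is exact, so Newton's method converges in one step, while gradient descent needs many.

```python
def grad(x):
    """f(x) = (x - 3)**2, so f'(x) = 2*(x - 3) and f''(x) = 2."""
    return 2 * (x - 3)

def gd_steps(x, alpha=0.1, tol=1e-6):
    """Count gradient descent iterations until |x - 3| <= tol."""
    steps = 0
    while abs(x - 3) > tol:
        x -= alpha * grad(x)
        steps += 1
    return steps

def newton_steps(x, tol=1e-6):
    """Count Newton iterations: x_new = x - f'(x) / f''(x)."""
    steps = 0
    while abs(x - 3) > tol:
        x -= grad(x) / 2.0    # divide by f''(x) = 2
        steps += 1
    return steps
```

From x = 0, `newton_steps` returns 1 while `gd_steps` needs dozens of iterations; the trade-off is that each Newton step requires second-derivative (Hessian) information.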
Figure below shows how a decision tree creates rectangles in the predictor space:
Classification trees have already been described; the only differences for a regression tree are how to produce an output (a number instead of a label) and how to define a splitting metric other than entropy/Gini.
For the output, it uses the mean of the samples in each leaf.
For the metric, the CART methodology uses the simple SSE.
- SSE = ∑ (y_i - y1)² + ∑ (y_i - y2)²
- where y1 and y2 are the means of the two newly created groups
- diff = SSE_before - SSE_after
- We choose the predictor (and split point) for which diff is highest, i.e., the one that reduces SSE the most
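A minimal sketch of this split search for a single continuous predictor (function and variable names are mine): sort by the predictor, try each midpoint, and keep the split with the largest SSE decrease.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(x, y):
    """Try every midpoint between consecutive sorted predictor values;
    return (threshold, diff) maximizing diff = SSE_before - SSE_after."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    sse_before = sse(ys)
    best = None
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue                       # no midpoint between equal values
        threshold = (xs[i] + xs[i - 1]) / 2
        diff = sse_before - (sse(ys[:i]) + sse(ys[i:]))
        if best is None or diff > best[1]:
            best = (threshold, diff)
    return best
```

For example, `best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 9, 9, 9])` splits at 6.5, separating the two target clusters and removing all of the SSE.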
Applied Predictive Modeling by Max Kuhn and Kjell Johnson
How a Classification Tree Creates Rectangles in the Predictor Space
- Criterion to choose which predictor to split on
- Predictors can be continuous, binary, or categorical
- The analysis in this post applies to the first two; categorical predictors are handled in a separate blog post
- For a continuous predictor, we first sort its values and use the midpoints between consecutive values as candidate split points
- Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes
- Two popular criteria are Gini and entropy
- Note that these criteria are not directly used to select the predictor to split on
- Rather, we take the difference of their values before and after splitting
- While computing the Gini/entropy after splitting, we weight each child node by its share of the samples
- About entropy:
- It measures the information carried by a node's class distribution
- It is highest when p = 0.5
- Higher entropy => higher randomness => carries more information
- After splitting we want the entropy to decrease
- Initial:
- n0 positive and m0 negative samples
- g0 = (n0 + m0) total samples
- After Splitting
- Group 1:
- n1 positive
- m1 negative
- g1 = (n1 + m1) total
- Group 2:
- n2 positive
- m2 negative
- g2 = (n2 + m2) total
- Entropy before splitting:
- H_before = -(n0/g0) log(n0/g0) - (m0/g0) log(m0/g0)
- Entropy after splitting:
- H1 = -(n1/g1) log(n1/g1) - (m1/g1) log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) - (m2/g2) log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0) * H2
- diff = H_before - H_after
- Select the predictor for which diff is highest and split on it
- This means we are selecting the predictor that reduces randomness the most
- Equivalently, the chosen split removes the most uncertainty about the label
- So we are selecting the variable that carries the most information about the label
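The before/after entropy computation above, as a small sketch (function names are mine; log base 2 assumed):

```python
from math import log2

def entropy(n, m):
    """Entropy of a node with n positive and m negative samples."""
    total = n + m
    h = 0.0
    for c in (n, m):
        if c:                       # 0 * log(0) is taken as 0
            p = c / total
            h -= p * log2(p)
    return h

def information_gain(n1, m1, n2, m2):
    """diff = H_before - H_after for a split into (n1, m1) and (n2, m2),
    with each child weighted by its share of the samples."""
    n0, m0 = n1 + n2, m1 + m2
    g0, g1, g2 = n0 + m0, n1 + m1, n2 + m2
    h_after = (g1 / g0) * entropy(n1, m1) + (g2 / g0) * entropy(n2, m2)
    return entropy(n0, m0) - h_after
```

A perfectly separating split of a 5/5 node gains one full bit, while a split that leaves both children at the parent's ratio gains nothing.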
- Gini for 2 classes:
- G = p1 * (1 - p1) + p2 * (1 - p2)
- Using p1 + p2 = 1, this simplifies to G = 2 * p1 * p2
- Gini in general:
- G = ∑ p * (1 - p) = 1 - ∑ p²
- Like entropy, Gini is maximal when p1 = p2 = 0.5
- In this case the nodes are least pure
- Gini is a measure of impurity, which we want to reduce by splitting on a predictor
- diff = G_before - G_after
- We select the predictor for which diff is highest
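The same before/after bookkeeping with Gini, as a sketch (names mine; each group is a list of per-class counts):

```python
def gini(counts):
    """Gini impurity of a node given per-class sample counts: 1 - sum(p**2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_decrease(group1, group2):
    """diff = G_before - G_after for a binary split,
    with children weighted by their share of the samples."""
    parent = [a + b for a, b in zip(group1, group2)]
    g0, g1, g2 = sum(parent), sum(group1), sum(group2)
    g_after = (g1 / g0) * gini(group1) + (g2 / g0) * gini(group2)
    return gini(parent) - g_after
```

As with entropy, a perfectly separating split of a balanced node gives the maximum decrease (0.5 here), and a split that preserves the parent's class ratio gives zero.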
- An ROC curve can be constructed easily for a classifier that outputs a ranking or probability, since we can vary the decision threshold.
- A decision tree, on the contrary, outputs a label.
- However, there is a workaround: each leaf node contains some positive and negative samples, so we can use (# positive / # total) in a sample's leaf as its probability.
- Then we can vary the threshold over these probabilities and plot the ROC curve.
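A sketch of the workaround (data and names invented): each sample is scored with its leaf's positive fraction, then the threshold is swept over the distinct scores.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the threshold over the
    distinct scores, highest first; labels are 1 (positive) / 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Two leaves: one with 3 positives / 1 negative (leaf probability 0.75),
# one with 1 positive / 4 negatives (leaf probability 0.2).
scores = [0.75] * 4 + [0.2] * 5
labels = [1, 1, 1, 0] + [1, 0, 0, 0, 0]
```

With only two leaves there are only two nontrivial thresholds, so the ROC "curve" has just a few points; deeper trees give more leaves and a finer curve.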
Regularization in decision trees will be explained in detail in a separate blog post, but here is a list of candidate techniques.
- limit max. depth of trees
- Cost complexity pruning
- ensembles / bag more than just 1 tree
- set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)
Applied Predictive Modeling by Max Kuhn and Kjell Johnson
Learning via a value function is not possible when we don't know the immediate-reward and next-state functions. If they are known, we can use the dynamic programming methods described at Dynamic Programming for RL.
How does Q-learning solve the problem of unknown reward and next-state functions?
We calculate Q directly, without using those functions, as described next in the training rule for Q.
Training Rule for Q
We initialize the Q table (e.g., with zeros or random values) and keep updating it after each action based on the reward received, using the deterministic training rule Q(s, a) ← r + γ max_a' Q(s', a'). Convergence to the true Q is guaranteed when:
- System is deterministic MDP (Markov Decision Process)
- Immediate reward values are bounded
- The agent visits every state-action pair infinitely often
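The training rule can be sketched on a tiny invented example (a deterministic 1-D corridor with an absorbing goal; the states, reward of 100, discount, and episode counts are all mine):

```python
import random

# States 0..4, actions -1 (left) / +1 (right); entering the absorbing
# goal state 4 yields reward 100, every other transition yields 0.
GAMMA = 0.9
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)

def step(state, action):
    """Deterministic transition, clipped at the corridor ends."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 100 if nxt == GOAL else 0
    return nxt, reward

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

random.seed(0)
for _ in range(500):                    # training episodes
    s = random.randrange(GOAL)          # start anywhere except the goal
    while s != GOAL:
        a = random.choice(ACTIONS)      # pure random exploration, for simplicity
        s2, r = step(s, a)
        # training rule: Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS)
        s = s2
```

After training, moving right one, two, three, and four steps from the goal converges to Q values of 100, 90, 81, and 72.9: the immediate reward discounted by γ per extra step.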
Derivation for Convergence
- We also need to do exploration
- So we pick an action with a probability such that actions with higher Q values have higher probability
- One way to formulate this is P(a_i | s) = k^Q(s, a_i) / ∑_j k^Q(s, a_j), where the value of k > 0 controls the exploration-exploitation trade-off
- Initially k is small (exploration) and then it is slowly increased (exploitation)
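A sketch of that selection rule (function name and the toy Q values are mine):

```python
import random

def select_action(q_values, k):
    """Pick action a_i with probability k**Q(s, a_i) / sum_j k**Q(s, a_j).
    k near 1 gives a near-uniform choice (exploration); large k strongly
    favours the highest-Q action (exploitation)."""
    actions = list(q_values)
    weights = [k ** q_values[a] for a in actions]
    return random.choices(actions, weights=weights)[0]
```

With `k = 1` every action is equally likely regardless of Q; as k grows, the action with the largest Q dominates, so slowly increasing k anneals from exploration to exploitation.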
How can we speed up the updating sequence?
- Strategy 1
- Consider the case of a single absorbing goal state (meaning the agent's goal is to reach one particular state), with all Q(s, a) pairs initialized to 0.
- Then only one Q(s, a) pair is updated to a nonzero value, when we reach the reward state.
- To speed things up, we can store the sequence of intermediate states and actions and, once we reach the reward state, propagate updates backward toward the origin.
- Of course, this will increase the memory requirement.
- Strategy 2
- Store past immediate rewards along with their state-action pairs and retrain on them periodically
- If one Q(s, a) value changes, replaying old transitions helps propagate the change to others
- Strategy 3
- If performing actions in the real environment is difficult, we can act in a simulation instead
Machine Learning by Tom Mitchell
The purpose is to see what solving linear programs looks like using libraries and solvers.
Reference: https://pythonhosted.org/PuLP/CaseStudies/a_blending_problem.html
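The linked case study uses PuLP; as a sketch of the same idea with a different solver, here is the simplified two-ingredient blending problem expressed with `scipy.optimize.linprog` (the cost and nutrition numbers are from memory of that Whiskas example, so treat them as illustrative).

```python
from scipy.optimize import linprog

# Variables: x = grams of chicken, y = grams of beef in a 100 g can.
c = [0.013, 0.008]          # cost per gram: minimize 0.013*x + 0.008*y
A_eq = [[1, 1]]             # x + y = 100 (the can must weigh 100 g)
b_eq = [100]
# linprog only accepts <= rows, so the >= constraints are negated:
A_ub = [
    [-0.100, -0.200],       # protein: 0.100*x + 0.200*y >= 8.0
    [-0.080, -0.100],       # fat:     0.080*x + 0.100*y >= 6.0
    [ 0.001,  0.005],       # fibre:   0.001*x + 0.005*y <= 2.0
    [ 0.002,  0.005],       # salt:    0.002*x + 0.005*y <= 0.4
]
b_ub = [-8.0, -6.0, 2.0, 0.4]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)])
```

The solver should find roughly 33.3 g chicken and 66.7 g beef (the salt constraint binds), at a cost of about 0.97 per can; the PuLP version builds the same model with `LpProblem`/`LpVariable` objects instead of matrices.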