K-mean is probably most popular algorithm and most taught algorithms in academia. However it has got many limitation and listing some of them here:
- You need to specify value of k
- Can cluster non-clustered data
- Sensitive to scale
- Even on perfect data sets, it can get stuck in a local minimum
- Means are continuous
- Hidden assumption: SSE is worth minimizing
- K-means serves more as quantification
In Hierarchical clustering you don’t need to specify values of k, you can sample any level from the tree it build either by top down or bottom up approach. Such a tree is called Dendrogram.
Scikit also supports variety of clustering algorithms including DBSCAN and list which one suits when. http://scikit-learn.org/stable/modules/clustering.html
When averaged across all possible situations, every algorithm performs equally well 🙂
This is especially true in supervised learning; validation or cross-validation is commonly used to assess the predictive accuracy of multiple models of varying complexity to find the best model. A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.
In the precious post we talked about skipgram model.
Now let’s say we have 1000 words and 300 hidden units, we shall have 300,000 wights in both hidden and output unit, which are two many parameters.
Output label during training is one hot vector with 999 zeros and single 1. We randomly select 5 zeros and update weight for six words only. (5 zeros and single 1). The more frequent word is the highr the probability of it getting selected. Google paper has mentioned some emperical formula for this. This is know as negative sampling.
Above was for output layer. In hidden layer weights are updated only for input words. (Irrespective if it is negative sampling or not)
Skip gram model
- Weights of hidden layer serves as word vectors
- There is just one hidden layer and one output layer(softmax)
- Hidden layer does not have any activation
- As input is one-hot vector output of hidden layer would be correposding word vector
- Training pair would be nearby words in predefined video
- We can imagine how huge can that be
- It is pair of words both one-hot encoded
- Sure, we need to know previously size of our vocabulary (which will be dimension of one-hot vector)
- The paper google release was trained on google news data and used 300 dimension vector, which means 300 neuron in hidden unit. The paper lists this no and size of training words and efficiency.
- Not there is one more parameter called named window size which was set to 5.
- It means that 5 words before and after center words are considered as pair for training data.
- Earlier NLP methods used to rely on synonyms/hypernyms which is not totally contextual
- “proficient” is synonym of good only in some context
- New words are getting added everyday
- All words are one-hot encoded
- Somewhat similar word might be orthagonal
- Size of vector become too large
There are two more things:
- Continuous Bag Of Words
- Negative sampling
This is taken from https://courses.csail.mit.edu/6.867/wiki/images/a/a7/Qp-cvxopt.pdf
Converting to standard form
- Unlike Java python module can have multiple classes
- One module is one python file
- Generally one module is some complete functionality
- It contains multiple classes and function
- Package needs to be specified with __init__.py files
- One sample structure
Reference : https://docs.python.org/3/tutorial/modules.html#packages
Dynamic Programming is one of the method to solve reinforcement learning problem. It assumes that complete dynamics of MDP are known and we are interested in
- FInding value function for given policy (Prediction problem)
- Finding optimal policy for given MDP (Control problem)
There are three things :
- Policy Evaluation
- Takes Probability of actions into account
- Policy Iteration
- First evaluates policy
- Then generates new policy based on this evaluation
- Value Iteration
- Policy evaluation is time taking process
- Let’s go with just one iteration
- We don’t need to update policy in each iteration, just keep using new value function does the job
- So it is like keep updating value function (choosing action that maximizes value instead of taking probabilities) until it converges and then selecting a policy based on this optimal values function
Here are two slides from : http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/DP.pdf