Now let’s say we have 1,000 words in the vocabulary and 300 hidden units. That gives 1,000 × 300 = 300,000 weights in the hidden layer and another 300,000 in the output layer, which is too many parameters to update on every training example.
During training, the output label is a one-hot vector with 999 zeros and a single 1. Instead of updating all 1,000 output words, we randomly select 5 of the zeros and update weights for only six words (the 5 sampled zeros plus the single 1). The more frequent a word is, the higher the probability of it being selected; the Google word2vec paper gives an empirical formula for this sampling distribution. This technique is known as negative sampling.
The above applies to the output layer. In the hidden layer, weights are updated only for the input word, whether or not negative sampling is used.
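A minimal sketch of the sampling step, using a hypothetical toy vocabulary. The exponent 3/4 is the empirical formula from the word2vec paper (P(w) ∝ count(w)^0.75); the word names and counts here are made up for illustration:

```python
import random

# Hypothetical toy corpus counts (a real vocabulary would have ~1,000 words).
word_counts = {"the": 500, "cat": 50, "sat": 40, "on": 300, "mat": 10}

# Empirical formula from the word2vec paper: P(w) proportional to count(w)^(3/4),
# which damps very frequent words while still favouring them.
weights = {w: c ** 0.75 for w, c in word_counts.items()}

def sample_negatives(target, k=5):
    """Draw k distinct negative words (never the target word itself)."""
    population = [w for w in weights if w != target]
    probs = [weights[w] for w in population]
    negatives = set()
    while len(negatives) < min(k, len(population)):
        # sample one word at a time, proportionally to its weight
        negatives.add(random.choices(population, weights=probs, k=1)[0])
    return list(negatives)

negatives = sample_negatives("cat", k=3)  # only 3 here since the toy vocab is tiny
# During training, output-layer weights would be updated only for
# "cat" (label 1) and these 3 negatives (label 0) -- 4 rows instead of 5.
```

With a real 1,000-word vocabulary and k=5, each step touches 6 output rows instead of 1,000.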
In mathematics, particularly in calculus, a stationary point (or critical point) of a differentiable function of one variable is a point on the graph of the function where the function’s derivative is zero. Informally, it is a point where the function “stops” increasing or decreasing (hence the name).
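A quick worked example of the definition, using f(x) = x³ − 3x (chosen here for illustration): its derivative 3x² − 3 is zero at x = −1 and x = 1, and the sign of the second derivative tells us which kind of stationary point each one is.

```python
# f(x) = x^3 - 3x has derivative f'(x) = 3x^2 - 3, zero at x = -1 and x = 1.
def f(x):
    return x**3 - 3 * x

def f_prime(x):
    return 3 * x**2 - 3

def f_double_prime(x):
    return 6 * x

stationary_points = [-1.0, 1.0]
for x in stationary_points:
    assert abs(f_prime(x)) < 1e-12  # derivative vanishes at each stationary point

# Second-derivative test classifies them:
# f''(-1) = -6 < 0  -> local maximum (f(-1) = 2)
# f''(1)  =  6 > 0  -> local minimum (f(1) = -2)
```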
The method of Lagrange multipliers helps us find all the stationary points of a constrained problem. Each one can be a local minimum, local maximum, global minimum, or global maximum. Once we evaluate the objective function at each of these stationary points, we can classify which ones are local/global minima and maxima.
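A small sketch of this workflow on a hypothetical toy problem: maximize f(x, y) = x·y subject to x + y = 10. The Lagrange condition ∇f = λ∇g gives (y, x) = λ(1, 1), so x = y = λ, and the constraint then forces x = y = 5. Evaluating f along the constraint near that point classifies it:

```python
# Toy constrained problem: maximize f(x, y) = x*y subject to x + y = 10.
# Lagrange condition: grad f = lam * grad g  =>  (y, x) = lam * (1, 1),
# so x = y = lam, and the constraint x + y = 10 gives x = y = 5.
x_star = y_star = 5.0
f_star = x_star * y_star  # objective value at the stationary point

# Classify the stationary point by evaluating f along the constraint,
# where y = 10 - x:
def f_on_constraint(x):
    return x * (10 - x)

# Neighbouring points on the constraint both give smaller values,
# so (5, 5) is a maximum of f on the constraint set.
assert f_on_constraint(5.0) > f_on_constraint(4.9)
assert f_on_constraint(5.0) > f_on_constraint(5.1)
```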