This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.
Problem: Document Retrieval
Given a document, we want to search for similar documents in a corpus.
Representation: How to Represent Documents as Vectors?
- Bag of Words: Count the frequency of each word in the vocabulary.
- TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
- TF: Same as bag of words.
- IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)).
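A minimal sketch of both representations, assuming already-tokenized documents and the IDF formula above:

```python
import math
from collections import Counter

def bag_of_words(doc):
    """Count the frequency of each word in a tokenized document."""
    return Counter(doc)

def tf_idf(docs):
    """Weight each word count (TF) by IDF = log(D_total / (1 + D_word)),
    where D_word is the number of documents containing the word."""
    d_total = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    vectors = []
    for doc in docs:
        tf = bag_of_words(doc)
        vectors.append({w: c * math.log(d_total / (1 + doc_freq[w]))
                        for w, c in tf.items()})
    return vectors

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
print(tf_idf(docs))
```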
Distance Metrics
There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:
Euclidean Distance:
- Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
- It is beneficial to normalize features because features can have different units and varying ranges.
- Normalization can be achieved by dividing each feature by its range or by its standard deviation.
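A small sketch of the vectorized Euclidean distance together with range-based normalization (one of the options above); the example matrix is made up:

```python
import numpy as np

def euclidean_distance(x, y):
    """Vectorized form: sqrt((x - y).T @ (x - y))."""
    diff = x - y
    return np.sqrt(diff.T @ diff)

def normalize_by_range(X):
    """Divide each feature (column) by its range so differing units do not dominate."""
    spread = X.max(axis=0) - X.min(axis=0)
    spread[spread == 0] = 1.0  # leave constant features unchanged
    return X / spread

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])
Xn = normalize_by_range(X)
print(euclidean_distance(Xn[0], Xn[1]))
```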
Scaled Euclidean Distance:
- Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
- For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
- This can be achieved by keeping the word counts of the title and the body as separate features and using a diagonal matrix A that contains a weight for each feature.
- The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).
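A sketch of the scaled form; which features belong to the title versus the body, and the weights themselves, are hypothetical choices for illustration:

```python
import numpy as np

def scaled_euclidean_distance(x, y, weights):
    """Vectorized form: sqrt((x - y).T @ A @ (x - y)) with A = diag(weights)."""
    A = np.diag(weights)
    diff = x - y
    return np.sqrt(diff.T @ A @ diff)

# Hypothetical layout: the first two features come from the title, the rest from the body.
x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 1.0])
weights = np.array([5.0, 5.0, 1.0, 1.0])  # weight title features 5x more than body features
print(scaled_euclidean_distance(x, y, weights))
```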
Cosine Similarity:
- Cosine similarity measures the cosine of the angle between two vectors.
- Similarity = x.dot(y).
- Distance = 1 – similarity.
- Cosine similarity is not a proper “distance” metric because 1 - similarity does not satisfy the triangle inequality (d(a, c) ≤ d(a, b) + d(b, c)).
- Cosine similarity is commonly used in Natural Language Processing (NLP) tasks due to its computational efficiency for sparse features.
- Feature vectors in NLP tend to be sparse.
- The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
- Normalization can be performed instead of using the inner product: Similarity = x.dot(y) / (|x||y|).
- Normalization makes the similarity insensitive to document length; for example, copying a document so that every word count doubles does not change its normalized similarity to other documents.
- One side effect of normalization is that long documents and tweets may appear similar, which is undesired when suggesting tweets to someone reading a document.
- To address this, an alternative approach is to cap the maximum word count that can appear in a vector.
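A short sketch of cosine similarity and distance, illustrating the invariance to copying a document (the vectors are made up):

```python
import numpy as np

def cosine_similarity(x, y):
    """x.dot(y) / (|x||y|); in [0, 1] for non-negative vectors such as TF-IDF."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

x = np.array([1.0, 2.0, 0.0])
doubled = 2 * x                       # "copying" the document doubles every count...
print(cosine_similarity(x, doubled))  # ...but the normalized similarity is still 1.0
```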
Cosine vs Euclidean:
- The choice between cosine and Euclidean distance depends on the use case.
- Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
- Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
- For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.
Brute Force
- Calculate distance to each document in corpus
- Suppose there are N documents in corpus
- Complexity would be O(N) for 1-NN
- It will be O(N log k) for k-NN when using a priority queue (heap) to keep the k best candidates found so far (see the sketch after this list)
- It is computationally inefficient when N is large and we need to query often
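A minimal brute-force k-NN sketch using a heap of size k, assuming plain Euclidean distance over dense vectors:

```python
import heapq
import numpy as np

def knn_brute_force(query, corpus, k):
    """Scan all N documents, keeping the k closest in a heap of size k: O(N log k)."""
    heap = []  # stores (-distance, index) so the worst current candidate sits on top
    for i, doc in enumerate(corpus):
        d = np.linalg.norm(query - doc)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:  # closer than the current worst candidate
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in heap)

corpus = np.random.rand(1000, 20)
print(knn_brute_force(np.random.rand(20), corpus, k=3))
```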
KD-Trees
KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:
- At each node, we store the smallest bounding box that contains all the data points in that region.
- While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
- First, we find the leaf node where the query point belongs.
- Then, we find the nearest neighbor within that node.
- We keep moving up in the tree.
- If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
- If it is less, we traverse down the tree.
- To calculate the distance between the query point and a bounding box, an appropriate distance metric such as Euclidean distance can be used. This is computationally easy because the rectangles are axis-aligned[1] (see the sketch below).
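A small sketch of that point-to-box distance, assuming the box is given by its min/max corners: clamp the query point to the box and measure the Euclidean distance to the clamped point (zero if the query lies inside the box).

```python
import numpy as np

def distance_to_box(query, box_min, box_max):
    """Euclidean distance from a point to an axis-aligned bounding box."""
    closest = np.clip(query, box_min, box_max)  # nearest point of the box to the query
    return np.linalg.norm(query - closest)

print(distance_to_box(np.array([5.0, 0.5]),
                      box_min=np.array([0.0, 0.0]),
                      box_max=np.array([2.0, 1.0])))  # -> 3.0
```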
Construction of KD-Trees:
Here are the steps involved in constructing KD-Trees:
- Choose a splitting strategy:
- Widest dimension: Split based on the dimension with the widest range.
- Alternating dimension: Split alternately between dimensions.
- Determine the value to split on (split value):
- Median: Choose the median value of the points along the splitting dimension.
- Center point: Compute the center point as the average of the left and right extremes.
- Stop conditions for splitting:
- If there are fewer points left than a specified threshold.
- If the width of the region falls below a minimum threshold.
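A rough construction sketch combining the widest-dimension and median-split choices above; the stopping threshold min_points is an arbitrary illustration:

```python
import numpy as np

def build_kd_tree(points, min_points=3):
    """Recursively split on the widest dimension at the median value.
    Each node stores the bounding box of its points; small nodes become leaves."""
    points = np.asarray(points, dtype=float)
    box_min, box_max = points.min(axis=0), points.max(axis=0)
    if len(points) <= min_points:              # stop condition: few points left
        return {"box": (box_min, box_max), "points": points}
    dim = int(np.argmax(box_max - box_min))    # widest dimension
    split = float(np.median(points[:, dim]))   # median split value
    left = points[points[:, dim] <= split]
    right = points[points[:, dim] > split]
    if len(left) == 0 or len(right) == 0:      # degenerate split: make a leaf
        return {"box": (box_min, box_max), "points": points}
    return {"box": (box_min, box_max), "dim": dim, "split": split,
            "left": build_kd_tree(left, min_points),
            "right": build_kd_tree(right, min_points)}

tree = build_kd_tree(np.random.rand(20, 2))
```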
How to Prune More Aggressively:
To prune more aggressively in KD-Trees, consider the following approach:
- Suppose the nearest distance found is r.
- Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
- This allows for more aggressive pruning and can speed up the search process.
Limitations of KD-Trees:
KD-Trees face challenges in high-dimensional spaces. Here are some limitations:
- In high dimensions, the radius of the search space is generally large.
- In d dimensions, a search ball of radius r can intersect up to 2^d hypercubes, resulting in an extensive search space.
- One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.
LSH – Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) is a technique used to classify data points into various bins and search within a given bin. Here’s how it works:
- The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
- If necessary, nearby bins can also be explored to expand the search.
- To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
- Let’s consider a motivating example with points in a 2D plane:
- A single random line passing through the origin is chosen.
- Points that are closer to each other will mostly fall into the same bin.
- However, using just one line can leave many data points in each bin, increasing query time.
- The solution is to use multiple random lines (or planes), which split the space into more, smaller bins and make the search more efficient.
- For example, let’s consider three lines f1, f2, and f3 through the origin, each of the form f(x, y) = ax + by (e.g., f1(x, y) = 3x + 4y, the line itself being f1(x, y) = 0):
- If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
- To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
- To find nearby bins:
- Flip one bit to find bins with a single bit difference.
- For more exploration, flip two bits to find bins with two bits difference.
- It is possible for the nearest point to be in a different bin.
- The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
- In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
- LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
- LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
- For assigning bins to data points, dot product calculations are performed, which work well even in large dimensions, especially when the feature vectors are sparse.
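A compact sketch of random-hyperplane LSH as described above; the class name and parameters are invented for illustration, and exploring neighboring bins by flipping bits is omitted for brevity:

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Hash each point to a bit vector via the sign of dot products with random
    hyperplanes through the origin, then keep a dictionary of bins."""
    def __init__(self, n_planes, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_planes, dim))
        self.bins = defaultdict(list)

    def key(self, x):
        return tuple((self.planes @ x > 0).astype(int))  # e.g. (0, 1, 1)

    def index(self, points):
        for i, x in enumerate(points):
            self.bins[self.key(x)].append(i)

    def query(self, x):
        return self.bins.get(self.key(x), [])  # candidate neighbors in the same bin

points = np.random.rand(100, 5)
lsh = RandomHyperplaneLSH(n_planes=3, dim=5)
lsh.index(points)
print(lsh.query(points[0]))
```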