This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.
Problem: Document Retrieval
Given a document, we want to search for similar documents in a corpus.
Representation: How to Represent Documents as Vectors?
- Bag of Words: Count the frequency of each word in the vocabulary.
- TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
- TF: Same as bag of words.
- IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)), where D_total is the number of documents in the corpus and D_word is the number of documents containing the word.
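A minimal sketch (not from the course) of how these representations could be computed with numpy, assuming `docs` is a list of tokenized documents and using the IDF formula above; the function names are illustrative.

```python
import numpy as np
from collections import Counter

def bag_of_words(doc_tokens, vocab):
    """Raw term counts (TF) for one tokenized document."""
    counts = Counter(doc_tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def tfidf_matrix(docs, vocab):
    """TF-IDF vectors for a corpus, with IDF = log(D_total / (1 + D_word))."""
    tf = np.vstack([bag_of_words(d, vocab) for d in docs])
    d_total = len(docs)
    d_word = (tf > 0).sum(axis=0)            # number of documents containing each word
    idf = np.log(d_total / (1.0 + d_word))   # rare words get a higher weight
    return tf * idf
```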
Distance Metrics
There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:
Euclidean Distance:
- Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
- It is beneficial to normalize features because features can have different units and varying ranges.
- Normalization can be achieved by dividing either by the range or by the variance of the features.
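A minimal sketch of both ideas, assuming `X` is an (n_samples, n_features) numpy array; normalizing by the range is shown, as mentioned above.

```python
import numpy as np

def euclidean(x, y):
    """sqrt((x - y)^T (x - y)) in vectorized form."""
    d = x - y
    return np.sqrt(d @ d)

def normalize_by_range(X):
    """Scale each feature to a comparable range before computing distances."""
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                    # avoid division by zero for constant features
    return (X - X.min(axis=0)) / span
```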
Scaled Euclidean Distance:
- Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
- For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
- This can be achieved by summing up the word vectors of the article and title and using a diagonal matrix A that contains weights for each feature.
- The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).
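A short sketch of the same formula with a diagonal A, where `weights` holds the diagonal entries (e.g., larger values for title/abstract features); the helper name is illustrative.

```python
import numpy as np

def scaled_euclidean(x, y, weights):
    """sqrt((x - y)^T A (x - y)) with A = diag(weights)."""
    d = x - y
    return np.sqrt(d @ (weights * d))        # diagonal A reduces to elementwise weighting
```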
Cosine Similarity:
- Cosine similarity measures the cosine of the angle between two vectors.
- Similarity = x.dot(y) (assuming the vectors are normalized; see normalization below).
- Distance = 1 - similarity.
- Cosine similarity is not a proper "distance" metric because cosine distance does not satisfy the triangle inequality (d(a, b) + d(b, c) >= d(a, c)).
- Cosine similarity is commonly used in Natural Language Processing (NLP) tasks due to its computational efficiency for sparse features.
- Feature vectors in NLP tend to be sparse.
- The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
- Normalization can be performed instead of using the inner product: Similarity = x.dot(y) / (|x||y|).
- Normalization makes the representation insensitive to document length: a document concatenated with a copy of itself maps to the same unit vector as the original.
- One side effect of normalization is that a long document and a short tweet may appear very similar, which is undesirable when, for example, suggesting tweets to someone reading a long article.
- To address this, an alternative approach is to cap the maximum word count that can appear in a vector.
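A minimal sketch of cosine similarity/distance and of the capping idea above; the `max_count` threshold is an illustrative value, not from the course.

```python
import numpy as np

def cosine_similarity(x, y):
    """x . y / (|x| |y|); efficient for sparse, non-negative TF-IDF vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

def cap_counts(x, max_count=50):
    """Alternative to normalization: cap the maximum word count per feature."""
    return np.minimum(x, max_count)
```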
Cosine vs Euclidean:
- The choice between cosine and Euclidean distance depends on the use case.
- Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
- Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
- For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.
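One possible way to combine the two metrics on different feature subsets; the index sets and weights here are illustrative assumptions, and would need to be tuned for the problem at hand.

```python
import numpy as np

def combined_distance(x, y, text_idx, count_idx, w_text=0.7, w_count=0.3):
    """Weighted mix of cosine distance on text features and Euclidean
    distance on count features; weights and index sets are illustrative."""
    xt, yt = x[text_idx], y[text_idx]
    d_text = 1.0 - xt @ yt / (np.linalg.norm(xt) * np.linalg.norm(yt))
    d_count = np.linalg.norm(x[count_idx] - y[count_idx])
    return w_text * d_text + w_count * d_count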
Brute Force
- Calculate distance to each document in corpus
- Suppose there are N documents in corpus
- Complexity would be O(N) for 1-NN
- It is O(N log k) for k-NN
- Using a priority queue (a max-heap of the k closest points seen so far), as in the sketch below
- It is computationally inefficient when N is large and we need to query often
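A minimal sketch of brute-force k-NN with a heap, assuming `corpus` is an iterable of numpy vectors; the function name is illustrative.

```python
import heapq
import numpy as np

def knn_brute_force(query, corpus, k):
    """O(N log k): keep a max-heap of the k closest documents seen so far."""
    heap = []                                 # stores (-distance, index)
    for i, doc in enumerate(corpus):
        dist = np.linalg.norm(query - doc)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i))
        elif -heap[0][0] > dist:              # current worst of the k is farther away
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-d, i) for d, i in heap)   # (distance, index), nearest first
```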
KD-Trees
KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:
- At each node, we store the smallest bounding box that contains all the data points in that region.
- While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
- First, we find the leaf node where the query point belongs.
- Then, we find the nearest neighbor within that node.
- We keep moving up in the tree.
- If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
- If it is less, we traverse down the tree.
- To calculate the distance between the query point and a bounding box, we use the same distance metric (e.g., Euclidean distance). This is computationally cheap because the rectangles are axis-aligned[1]; see the sketch below.
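A short sketch of the point-to-box distance, assuming `box_min` and `box_max` are the per-dimension extents of the axis-aligned bounding box.

```python
import numpy as np

def dist_to_box(query, box_min, box_max):
    """Euclidean distance from a point to an axis-aligned bounding box.
    Because the box is axis-aligned, each coordinate can be clamped independently."""
    closest = np.clip(query, box_min, box_max)   # nearest point inside the box
    return np.linalg.norm(query - closest)
```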
Construction of KD-Trees:
Here are the steps involved in constructing KD-Trees:
- Choose a splitting strategy:
- Widest dimension: Split based on the dimension with the widest range.
- Alternating dimension: Split alternately between dimensions.
- Determine the value to split on (split value):
- Median: Choose the median value of the points along the splitting dimension.
- Center point: Compute the center point as the average of the left and right extremes.
- Stop conditions for splitting:
- If there are fewer points left than a specified threshold.
- If the minimum width of the region is achieved.
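A minimal construction sketch using the widest-dimension / median-split strategy and the stop conditions above; the `Node` class and thresholds are illustrative, not the course's implementation.

```python
import numpy as np

class Node:
    def __init__(self, points, depth=0, min_points=5, min_width=1e-3):
        self.box_min = points.min(axis=0)        # smallest bounding box of this region
        self.box_max = points.max(axis=0)
        self.points = points
        self.left = self.right = None

        widths = self.box_max - self.box_min
        # stop conditions: too few points, or region already narrow enough
        if len(points) <= min_points or widths.max() <= min_width:
            return

        # depth could instead drive an alternating-dimension strategy
        self.dim = int(np.argmax(widths))                    # widest dimension
        self.split = float(np.median(points[:, self.dim]))   # median split value
        mask = points[:, self.dim] <= self.split
        if mask.all() or (~mask).all():                      # degenerate split, stop
            return
        self.left = Node(points[mask], depth + 1, min_points, min_width)
        self.right = Node(points[~mask], depth + 1, min_points, min_width)
```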
How to Prune More Aggressively:
To prune more aggressively in KD-Trees, consider the following approach:
- Suppose the nearest distance found is r.
- Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
- This allows for more aggressive pruning and can speed up the search process.
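The modified pruning test is a one-line change in the query loop; a sketch, with `alpha` playing the role of a above:

```python
def should_prune(box_distance, best_distance, alpha=1.0):
    """Prune a region whose bounding box is farther than best_distance / alpha.
    alpha = 1 gives exact search; alpha > 1 prunes more aggressively and
    returns an approximate nearest neighbour faster."""
    return box_distance > best_distance / alpha
```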
Limitations of KD-Trees:
KD-Trees face challenges in high-dimensional spaces. Here are some limitations:
- In high dimensions, the search radius found so far is generally large relative to the cells.
- A hypersphere of radius r around the query can intersect up to 2^d neighbouring hypercubes, so very little of the tree can be pruned and the search space stays extensive.
- One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.
LSH – Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) is a technique used to classify data points into various bins and search within a given bin. Here’s how it works:
- The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
- If necessary, nearby bins can also be explored to expand the search.
- To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
- Let’s consider a motivating example with points in a 2D plane:
- A single random line passing through the origin is chosen.
- Points that are closer to each other will mostly fall into the same bin.
- However, using just one line can lead to more data points in a single bin, increasing query time.
- The solution is to use multiple random lines (or planes) to achieve more efficient classification of neighboring bins.
- For example, consider three lines f1, f2, and f3, each defined by an equation of the form f(x, y) = ax + by (e.g., the line 3x + 4y = 0); the sign of f at a point gives one bit of the bin index:
- If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
- To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
- To find nearby bins:
- Flip one bit to find bins with a single bit difference.
- For more exploration, flip two bits to find bins that differ in two bits.
- It is possible for the nearest point to be in a different bin.
- The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
- In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
- LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
- LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
- For assigning bins to data points, dot product calculations are performed, which work well even in large dimensions, especially when the feature vectors are sparse.
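A minimal end-to-end sketch of the scheme described above, assuming `data` is an (n, d) numpy array; the class name, random hyperplanes, and `max_flips` multi-probe parameter are illustrative, not the course's code.

```python
import itertools
import numpy as np
from collections import defaultdict

class LSH:
    def __init__(self, data, n_planes=3, seed=0):
        rng = np.random.default_rng(seed)
        dim = data.shape[1]
        self.planes = rng.normal(size=(n_planes, dim))   # random hyperplanes through the origin
        self.data = data
        self.table = defaultdict(list)                   # bin (bit vector) -> list of point indices
        for i, x in enumerate(data):
            self.table[self._bin(x)].append(i)

    def _bin(self, x):
        """Bit vector: 1 if the point lies on the positive side of each plane."""
        return tuple((self.planes @ x > 0).astype(int))

    def query(self, q, max_flips=1):
        """Search the query's bin, plus bins that differ in up to max_flips bits."""
        key = self._bin(q)
        candidates = list(self.table.get(key, []))
        for r in range(1, max_flips + 1):
            for flip in itertools.combinations(range(len(key)), r):
                other = tuple(b ^ 1 if i in flip else b for i, b in enumerate(key))
                candidates.extend(self.table.get(other, []))
        if not candidates:
            return None
        dists = np.linalg.norm(self.data[candidates] - q, axis=1)
        return candidates[int(np.argmin(dists))]         # index of the best candidate found
```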