Nearest Neighbour Search

 

This post is a lecture summary of a course by the University of Washington available on Coursera: Machine Learning: Clustering and Retrieval.

Problem: Document Retrieval

Given a document, we want to search for similar documents in a corpus.

Representation: How to Represent Documents as Vectors?

  • Bag of Words: Count the frequency of each word in the vocabulary.
  • TF-IDF: Calculate the IDF (inverse document frequency) of each word in the vocabulary. Weight word frequency (TF) by its IDF in a given document.
    • TF: Same as bag of words.
    • IDF: Higher for rare words, calculated as log(D_total / (1 + D_word)), where D_total is the number of documents in the corpus and D_word is the number of documents containing the word (see the sketch after this list).
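
As a minimal sketch of these two representations in NumPy, assuming a tiny toy corpus and simple whitespace tokenization (both illustrative choices, not from the lecture):

```python
import numpy as np
from collections import Counter

# Toy corpus (illustrative only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag of words: raw term counts over a fixed vocabulary.
vocab = sorted({w for doc in corpus for w in doc.split()})
tf = np.array([[Counter(doc.split())[w] for w in vocab] for doc in corpus],
              dtype=float)

# IDF = log(D_total / (1 + D_word)): higher for rare words.
doc_freq = (tf > 0).sum(axis=0)
idf = np.log(len(corpus) / (1.0 + doc_freq))

# TF-IDF: weight each term count by its IDF.
tfidf = tf * idf
```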

Distance Metrics

There are different distance metrics used to measure similarity or dissimilarity between vectors. Let’s explore some commonly used distance metrics:

Euclidean Distance:

  • Euclidean distance between two vectors x and y can be calculated using the vectorized form: sqrt((x-y).T.(x-y)).
  • It is beneficial to normalize features because features can have different units and varying ranges.
  • Normalization can be achieved by dividing either by the range or by the variance of the features.
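
A possible vectorized sketch of the distance and of range normalization, assuming the feature matrix X holds one document per row:

```python
import numpy as np

def euclidean(x, y):
    """sqrt((x - y)^T (x - y))."""
    d = x - y
    return np.sqrt(d @ d)

def normalize_by_range(X):
    """Scale each feature to [0, 1] by dividing by its range (one of the
    options above; dividing by the variance works analogously)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    rng = hi - lo
    rng[rng == 0] = 1.0  # avoid division by zero for constant features
    return (X - lo) / rng
```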

Scaled Euclidean Distance:

  • Scaled Euclidean distance allows for weighting different dimensions differently based on their importance.
  • For example, in text analysis, we may want to give more weight to the title/abstract than the body of an article.
  • This can be achieved by combining the word vectors of the title and the body into a single feature vector and using a diagonal matrix A that holds a weight for each feature.
  • The vectorized form for scaled Euclidean distance is: sqrt((x-y).T.A.(x-y)).
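
A sketch of the weighted form, where the split between title and body features and the 3x title weight are hypothetical choices for illustration:

```python
import numpy as np

def scaled_euclidean(x, y, weights):
    """sqrt((x - y)^T A (x - y)) with A = diag(weights)."""
    d = x - y
    return np.sqrt(d @ (weights * d))  # same as d.T @ np.diag(weights) @ d

# Hypothetical layout: the first 5 features come from the title,
# the remaining 5 from the body; weight the title features 3x.
weights = np.concatenate([np.full(5, 3.0), np.full(5, 1.0)])
```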

Cosine Similarity:

  • Cosine similarity measures the cosine of the angle between two vectors.
  • Similarity = x.dot(y).
  • Distance = 1 – similarity.
  • Cosine similarity is not a proper “distance” metric because 1 – similarity does not satisfy the triangle inequality (d(a,c) ≤ d(a,b) + d(b,c)).
  • Cosine similarity is commonly used in Natural Language Processing (NLP) tasks due to its computational efficiency for sparse features.
  • Feature vectors in NLP tend to be sparse.
  • The range of similarity values is generally between -1 and 1. However, when features are non-negative, the range becomes 0 to 1, which is the case with TF-IDF.
  • Instead of using the raw inner product, the vectors can be normalized: Similarity = x.dot(y) / (|x||y|).
  • Normalization makes the similarity insensitive to document length; for example, a document concatenated with a copy of itself gets the same normalized vector as the original.
  • One side effect of normalization is that a long document and a short tweet may appear very similar, which is undesired when suggesting tweets to someone reading a document.
  • To address this, an alternative approach is to cap the maximum word count that can appear in a vector.
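
A small sketch tying these pieces together; the cap of 5 in cap_counts is an arbitrary illustrative value:

```python
import numpy as np

def cosine_similarity(x, y):
    """x.dot(y) / (|x||y|); in [0, 1] for non-negative features such as TF-IDF."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

def cap_counts(x, max_count=5):
    """Alternative to normalization: cap how large any word count can get."""
    return np.minimum(x, max_count)
```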

Cosine vs Euclidean:

  • The choice between cosine and Euclidean distance depends on the use case.
  • Cosine distance is more suitable for text features where the magnitude of the vector is not as important as the orientation.
  • Euclidean distance is a better choice for count features, such as measuring how many times a document was read.
  • For mixed feature types, both cosine and Euclidean distances can be computed (possibly with a subset of relevant features) and combined using appropriate weights, considering the trade-offs and requirements of the specific problem.

 

Brute Force

  • Calculate the distance to each document in the corpus
  • Suppose there are N documents in the corpus
  • The complexity would be O(N) per query for 1-NN
  • It will be O(N log k) for k-NN
    • Using a priority queue (a max-heap of size k)
  • It is computationally inefficient when N is large and we need to query often (see the sketch below)
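
A brute-force k-NN sketch using Python's heapq as the priority queue (a max-heap of size k via negated distances); the default Euclidean distance is just one possible choice:

```python
import heapq
import numpy as np

def brute_force_knn(query, corpus, k, dist=lambda a, b: np.linalg.norm(a - b)):
    """Scan all N documents and keep the k closest in a max-heap of size k."""
    heap = []  # stores (-distance, index); the largest distance sits on top
    for i, doc in enumerate(corpus):
        d = dist(query, doc)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:          # closer than the current k-th best
            heapq.heapreplace(heap, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in heap)  # (distance, index) pairs
```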

KD-Trees

KD-Trees are used to divide the space into hierarchical regions via a tree structure. Here’s how it works:

  • At each node, we store the smallest bounding box that contains all the data points in that region.
  • While querying (for 1 nearest neighbor, which can be extended for k nearest neighbors):
    • First, we find the leaf node where the query point belongs.
    • Then, we find the nearest neighbor within that node.
    • We keep moving up in the tree.
    • If the distance between the query point and the bounding box is greater than the distance found so far, we skip that region and move up the tree.
    • If it is less, we traverse down the tree.
  • To calculate the distance between a point and a rectangle, you can use appropriate distance metrics like Euclidean distance or other suitable measures. It is computationally easy because the rectangles are axis-aligned [1][2]; a short sketch follows.
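
A minimal sketch of the point-to-box distance, using the standard clamping trick from [1][2]:

```python
import numpy as np

def point_to_box_distance(point, box_min, box_max):
    """Euclidean distance from a point to an axis-aligned bounding box:
    clamp the point to the box, then measure the distance to the clamped point."""
    closest = np.clip(point, box_min, box_max)
    return np.linalg.norm(point - closest)
```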

Construction of KD-Trees:

Here are the steps involved in constructing KD-Trees:

  • Choose a splitting strategy:
    • Widest dimension: Split based on the dimension with the widest range.
    • Alternating dimension: Split alternately between dimensions.
  • Determine the value to split on (split value):
    • Median: Choose the median value of the points along the splitting dimension.
    • Center point: Compute the center point as the average of the left and right extremes.
  • Stop conditions for splitting:
    • If there are fewer points left than a specified threshold.
    • If the width of the region falls below a minimum threshold.
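
A minimal construction sketch, assuming the widest-dimension strategy, the median split value, and an arbitrary leaf-size threshold:

```python
import numpy as np

class KDNode:
    def __init__(self, points, leaf_size=10):
        # Smallest bounding box containing all points in this region.
        self.box_min = points.min(axis=0)
        self.box_max = points.max(axis=0)
        self.points = self.left = self.right = None
        widths = self.box_max - self.box_min
        if len(points) <= leaf_size or widths.max() == 0:
            self.points = points                   # stop condition: store a leaf
            return
        self.dim = int(np.argmax(widths))          # widest-dimension strategy
        self.split = float(np.median(points[:, self.dim]))  # median split value
        mask = points[:, self.dim] <= self.split
        if mask.all() or not mask.any():           # degenerate split: keep a leaf
            self.points = points
            return
        self.left = KDNode(points[mask], leaf_size)
        self.right = KDNode(points[~mask], leaf_size)
```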

How to Prune More Aggressively:

To prune more aggressively in KD-Trees, consider the following approach:

  • Suppose the nearest distance found is r.
  • Instead of pruning points that are farther than distance r, prune all points that are farther than (r/a), where a > 1.
  • This allows for more aggressive pruning and can speed up the search process.
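
As a sketch, the pruning test used during backtracking might change as follows (alpha is a hypothetical parameter name; point_to_box_distance is the helper sketched earlier):

```python
def should_prune(query, box_min, box_max, best_dist, alpha=2.0):
    """Standard pruning skips a box only when it is farther than best_dist;
    with alpha > 1 we already skip boxes farther than best_dist / alpha,
    trading exactness for speed."""
    return point_to_box_distance(query, box_min, box_max) > best_dist / alpha
```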

Limitations of KD-Trees:

KD-Trees face challenges in high-dimensional spaces. Here are some limitations:

  • In high dimensions, the radius of the search space is generally large.
  • In d dimensions, a search ball of radius r can intersect up to 2^d neighboring hypercubes, resulting in an extensive search space.
  • One possible solution is to prune all points that are farther away than (r/a), where a > 1, to mitigate these issues.

 

LSH – Locality Sensitive Hashing

Locality Sensitive Hashing (LSH) is a technique used to classify data points into various bins and search within a given bin. Here’s how it works:

  • The idea is to group data points into bins based on their proximity and search within the relevant bin(s).
  • If necessary, nearby bins can also be explored to expand the search.
  • To determine nearby bins, multiple random lines (or planes in higher dimensions) are used.
  • Let’s consider a motivating example with points in a 2D plane:
    • A single random line passing through the origin is chosen.
    • Points that are closer to each other will mostly fall into the same bin.
  • However, using just one line can lead to more data points in a single bin, increasing query time.
  • The solution is to use multiple random lines (or planes) to achieve more efficient classification of neighboring bins.
  • For example, let’s consider three lines, f1, f2, and f3, each defined by a score function such as f(x,y) = 3x + 4y (the line itself is where f(x,y) = 0), and use the sign of each score as one bit of the bin index:
    • If f1(p,q) < 0, f2(p,q) > 0, and f3(p,q) > 0 for a given point (p,q), we assign it to the bucket [0,1,1].
  • To facilitate searching within bins, a dictionary can be used, where the key represents the bin (e.g., [0,1,1], [0,1,0]), and the value is a list of all points in that bin.
  • To find nearby bins:
    • Flip one bit to find bins with a single bit difference.
    • For more exploration, flip two bits to find bins with two bits difference.
  • It is possible for the nearest point to be in a different bin.
  • The number of neighboring bins to explore depends on factors such as computational budget or a predefined quality threshold.
  • In some cases, an exact nearest neighbor may not be necessary, and a point with a predefined quality (epsilon) is sufficient.
  • LSH is useful in scenarios like document suggestion, where similar documents need to be retrieved efficiently.
  • LSH can also be extended to high-dimensional spaces by using planes or hyperplanes instead of lines.
  • Assigning a bin to a data point only requires dot products with the random hyperplanes, which remains cheap even in large dimensions, especially when the feature vectors are sparse; a compact sketch follows.
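
A compact sketch of the whole scheme with random hyperplanes through the origin; the number of planes, the seed, and the bit-flip search radius are illustrative choices:

```python
import itertools
import numpy as np
from collections import defaultdict

class RandomProjectionLSH:
    def __init__(self, data, n_planes=3, seed=0):
        rng = np.random.default_rng(seed)
        # One random hyperplane through the origin per bit of the bin index.
        self.planes = rng.normal(size=(n_planes, data.shape[1]))
        self.data = data
        self.bins = defaultdict(list)  # key: bit tuple, value: point indices
        for i, sig in enumerate(self._hash(data)):
            self.bins[sig].append(i)

    def _hash(self, X):
        # The sign of the dot product with each hyperplane gives one bit.
        return [tuple((row > 0).astype(int)) for row in X @ self.planes.T]

    def query(self, q, max_flips=1):
        """Search the query's bin, then bins differing in up to max_flips bits."""
        sig = self._hash(q.reshape(1, -1))[0]
        candidates = list(self.bins.get(sig, []))
        for n_flips in range(1, max_flips + 1):
            for dims in itertools.combinations(range(len(sig)), n_flips):
                flipped = tuple(b ^ 1 if i in dims else b for i, b in enumerate(sig))
                candidates.extend(self.bins.get(flipped, []))
        if not candidates:
            return None
        dists = np.linalg.norm(self.data[candidates] - q, axis=1)
        return candidates[int(np.argmin(dists))]  # index of the (approximate) NN
```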

References

[1] https://stackoverflow.com/questions/5254838/calculating-distance-between-a-point-and-a-rectangular-box-nearest-point

[2] https://gamedev.stackexchange.com/questions/44483/how-do-i-calculate-distance-between-a-point-and-an-axis-aligned-rectangle
