Locality-sensitive hashing and its applications
Paolo Ferragina University of Pisa
ACM Kanellakis Award 2013
A frequent issue
Given U users, each described by a set of d features, the goal is to find the (largest) group of similar users
Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/following, …
A feature is typically a numerical value: binary or real
Users could also be Web pages (dedup), products (recommendation), tweets/news/search results (visualization)
[Example: binary feature vectors such as 010001110, 000110010, 100110010, or real-valued ones such as 0.3, 0.7, 0.1]
Similarity(u1,u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1]
Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users.
# Sim computations ≈ 2^U
In the case of Facebook U > 1 billion (10^9), so this is > 2^(10^9)
If we limit groups to have a size ≤ L users:
# Sim computations ≈ (U choose L) ≈ U^L
(Even at 1 ns/Sim and L = 10, this is > (10^9)^10 / 10^9 secs > 10^70 years)
No faster CPU/GPU, multi-cores, … could help!
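A quick back-of-the-envelope check of this bound (the numbers U = 10^9, L = 10 and 1 ns/Sim come from the slide; the script is only illustrative):

```python
# Back-of-the-envelope: ~U^L candidate groups, at 1 ns per Sim computation.
U, L = 10**9, 10
sims = U**L                      # crude upper bound on # groups of size <= L
secs = sims / 10**9              # 10^9 Sim computations per second
years = secs / (3600 * 24 * 365)
print(f"~{years:.1e} years")     # ~3e73 years, indeed > 10^70
```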
Solution #2: introduce approximation
Interpret every user as a point in a d-dim space, and then apply a clustering algorithm
[K-means illustration over features f1 and f2: pick K = 2 centroids at random → compute clusters → re-determine centroids → re-compute clusters → re-determine centroids → re-compute clusters → converged!]
K-means: each iteration takes K × U computations of Sim
Solution #2: a few considerations
Cost per iteration = K × U; the number of iterations is typically small
What about optimality? It is only locally optimal
[recently, some researchers showed how to introduce some guarantees]
What about the Sim-cost? Comparing users/points costs Θ(d) in time and space [note that d may be in the millions or billions]
What about K? Iterating over K = 1, …, U costs ≈ U^3 < U^L [still years of computation]
In T time, we can manage U = T^(1/3) users
Using an s-times faster CPU ≈ using an old CPU for time s*T:
we can manage (s*T)^(1/3) = s^(1/3) * T^(1/3) users
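To make the per-iteration cost concrete, here is a minimal sketch of the K-means loop (illustrative Python/NumPy, not the slides' code; no empty-cluster handling):

```python
import numpy as np

def kmeans(points, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K centroids at random among the points.
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(iters):
        # Compute clusters: K*U distance evaluations, each costing O(d).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine centroids (this sketch assumes no cluster goes empty).
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(K)])
    return labels, centroids

labels, cents = kmeans(np.random.rand(100, 5), K=2)
```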
Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints.
It is randomized, correct with high probability
It guarantees local access to data, which is good for speed in disk-based/distributed settings
A warm-up problem
Consider vectors p,q of d binary features
Hamming distance
D(p,q)= #bits where p and q differ
Define hash function h by choosing a set I of k random coordinates
h(p) = projection of vector p on I’s coordinates
Example: Pick I={1,4} (k=2), then h(p=01011) =01
A key property
Pr[picking x s.t. p[x]=q[x]]= (d - D(p,q))/d
We can vary the probability by changing k
Pr[h(p) = h(q)] = [1 − D(p,q)/d]^k
(the k coordinates of I are drawn at random from {1, 2, …, d})
[Plot: Pr[h(p) = h(q)] as a function of the distance D(p,q), for k = 2 and k = 4]
Larger k ⇒ smaller false positives
What about False Negatives?
[Comparing p versus q position by position: # mismatching positions = D(p,q), # matching positions = d − D(p,q)]
Pr[h(p) = h(q)] = s^k, where s = 1 − D(p,q)/d is the similarity between p and q
Reiterate L times
1) Repeat the k-projection L times, getting h1(p), …, hL(p)
2) Set g(p) = <h1(p), h2(p), …, hL(p)>
3) Declare «p matches q» if at least one hi(p) = hi(q)
Example:
We set k=2, L=3, let p = 01001 and q = 01101
•I1 = {3,4}, we have h1(p) = 00 and h1(q)=10
•I2 = {1,3}, we have h2(p) = 00 and h2(q)=01
•I3 = {1,5}, we have h3(p) = 01 and h3(q)=01
p and q are declared to match!
Larger L ⇒ smaller false negatives
g(p) = <h1(p), …, hL(p)> is called the sketch of p
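A minimal sketch of the whole scheme (illustrative Python; it reproduces the k = 2, L = 3 example above, with the 1-based index sets rewritten 0-based):

```python
import random

def make_projections(d, k, L, seed=42):
    rng = random.Random(seed)
    return [sorted(rng.sample(range(d), k)) for _ in range(L)]

def h(p, I):                       # projection of bit-string p on coordinates I
    return ''.join(p[i] for i in I)

def match(p, q, projections):      # «p matches q» iff some h_i agrees
    return any(h(p, I) == h(q, I) for I in projections)

# The slides' example: I1={3,4}, I2={1,3}, I3={1,5} (1-based) -> 0-based lists.
p, q = "01001", "01101"
print(match(p, q, [[2, 3], [0, 2], [0, 4]]))   # True, thanks to h3
```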
Measuring the error probability
The g() consists of L independent hashes hi
Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q) for all i = 1, …, L]
= 1 − (1 − s^k)^L
[Plot: Pr[g(p) matches g(q)] as a function of the similarity s = 1 − D(p,q)/d; the curve has a sharp threshold around s ≈ (1/L)^(1/k)]
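One can check the sharpness of this threshold numerically (illustrative parameters k = 5, L = 20):

```python
k, L = 5, 20                                   # illustrative parameters
threshold = (1 / L) ** (1 / k)                 # ~0.55 here
for s in (0.3, 0.5, 0.55, 0.7, 0.9):
    print(f"s = {s:.2f}   Pr[match] = {1 - (1 - s**k)**L:.3f}")
# The probability jumps from ~0 to ~1 around s ~ threshold.
```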
The case: Groups of similar items
Buckets provide the candidate similar items
«Merge» similar sets over L rounds if they share items
[Figure: hash tables T1, T2, …, TL, one per hash function h1, h2, …, hL; if p ≈ q, then p and q fall into at least one common bucket]
No tables! SORT the pairs <hi(x), x> instead: items with an equal fingerprint end up adjacent (e.g. one run holding p, q, … and another holding q, z, …)
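A sketch of this sort-based variant (illustrative Python; h is the bit-sampling projection defined earlier):

```python
from itertools import groupby

def h(p, I):                                   # bit-sampling projection, as before
    return ''.join(p[i] for i in I)

def candidate_pairs(items, projections):
    pairs = set()
    for I in projections:                      # one round per h_i
        keyed = sorted((h(p, I), p) for p in items)    # SORT: equal keys adjacent
        for _, run in groupby(keyed, key=lambda kv: kv[0]):
            bucket = [p for _, p in run]       # one «bucket» = one run of the sort
            pairs.update((a, b) for i, a in enumerate(bucket) for b in bucket[i+1:])
    return pairs

print(candidate_pairs(["01001", "01101", "11110"], [[2, 3], [0, 2], [0, 4]]))
```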
The case of on-line query
Given a query w, find the similar indexed vectors:
check the vectors in the buckets hj(w), for all j = 1, …, L
[Figure: the query w is hashed into tables T1, T2, …, TL via h1(w), h2(w), …, hL(w); the probed buckets contain e.g. {p, q}, {p, z, t}, {r, q}]
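A sketch of the table-based index and query (illustrative Python, with dictionaries standing in for the tables T1, …, TL):

```python
from collections import defaultdict

def h(p, I):
    return ''.join(p[i] for i in I)

def build_tables(items, projections):
    tables = [defaultdict(list) for _ in projections]   # T_1 ... T_L
    for p in items:
        for T, I in zip(tables, projections):
            T[h(p, I)].append(p)               # p goes in one bucket per table
    return tables

def query(w, tables, projections):
    candidates = set()
    for T, I in zip(tables, projections):      # probe bucket h_j(w), j = 1..L
        candidates.update(T[h(w, I)])
    return candidates                          # candidates, to be verified

P = [[2, 3], [0, 2], [0, 4]]
tables = build_tables(["01001", "01101", "11110"], P)
print(query("01101", tables, P))
```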
LSH versus K-means
What about optimality ? K-means is locally optimal [LSH finds correct clusters with high probability]
What about the Sim-cost ? K-means compares vectors of d components [LSH compares very short (sketch) vectors]
What about the cost per iteration? Typically K-means requires few iterations, each costing K × U × d
[LSH sorts U short items, then makes a few scans]
What about K? In principle one has to iterate over K = 1, …, U [LSH does not need to know the number of clusters]
You could even apply K-means over the LSH-sketch vectors!!
Document duplication
(exact or approximate)
Duplicate documents
The web is full of duplicated content
Few cases of exact duplicates
Many cases of near duplicates
E.g., the Last-modified date is the only difference between two copies of a page
Sec. 19.6
Obvious techniques
Checksum – no worst-case collision-probability guarantees
MD5 – cryptographically secure string hashes, but relatively slow
Karp-Rabin's scheme
Rolling hash: split the doc into many pieces
Algebraic technique – arithmetic on primes
Efficient, with other nice properties…
Exact-Duplicate Detection
Karp-Rabin Fingerprints
Consider an m-bit string A = 1 a1 a2 … am (note the prepended leading 1)
Basic values:
Choose a prime p in the universe U, such that 2p fits in a few memory words (hence U ≈ 2^64)
Fingerprints: f(A) = A mod p
Nice property: if B = 1 a2 … am am+1 (the window shifted by one), then
f(B) = [2 (A − 2^m − a1 2^(m−1)) + 2^m + am+1] mod p
Prob[false hit between A and B] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≈ log(A + B) / #primes(U) ≈ (m log U) / U
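A sketch of the rolling computation (illustrative Python over bit-strings; the Mersenne prime 2^61 − 1 is an assumed choice for p):

```python
def karp_rabin_windows(bits, m, p=(1 << 61) - 1):
    """Yield f(window) = (1 a_1 ... a_m read as an integer) mod p
    for every m-bit window of `bits`."""
    pow_m = pow(2, m, p)                       # 2^m     mod p
    pow_m1 = pow(2, m - 1, p)                  # 2^(m-1) mod p
    f = 1                                      # the prepended leading 1
    for a in bits[:m]:                         # first window, Horner-style
        f = (2 * f + a) % p
    yield f
    for i in range(m, len(bits)):
        # Update rule above: f(B) = [2(A - 2^m - a1 2^(m-1)) + 2^m + a_(m+1)] mod p
        f = (2 * (f - pow_m - bits[i - m] * pow_m1) + pow_m + bits[i]) % p
        yield f

print(list(karp_rabin_windows([1, 0, 1, 1, 0], m=3)))   # [13, 11, 14]
```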
Near-Duplicate Detection
Problem
Given a large collection of documents
Identify the near-duplicate documents
Web search engines
Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …
Mistaken – spider errors
30% of web-pages are near-duplicates [1997]
Shingling: from docs to sets of shingles
dissect document into q-grams (shingles)
T = I live and study in Pisa, …
If we set q=3, the 3-grams (over words) are:
<I live and> <live and study> <and study in> <study in Pisa> …
represent documents by sets of hashes/shingles
The near-duplicate document detection problem reduces to set intersection among sets of integers (shingles)
[Figure: Doc A → shingle set S_A, Doc B → shingle set S_B]
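A sketch of word-level shingling (illustrative Python; the built-in hash stands in for a proper shingle fingerprint):

```python
def shingles(text, q=3):
    """Dissect `text` into word-level q-grams and hash each one to an int."""
    words = text.split()
    return {hash(' '.join(words[i:i + q])) for i in range(len(words) - q + 1)}

A = shingles("I live and study in Pisa and I like it")
B = shingles("I live and study in Pisa and I love it")
print(len(A & B) / len(A | B))     # near-duplicates share most shingles
```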
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible
Stream processing: once the sketch is computed, the source is unavailable
Error guarantees: at this problem scale, small biases have a large impact;
we need formal guarantees – heuristics will not do
More applications
Sets & Jaccard similarity
Set similarity = Jaccard similarity:
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Goal: compute Jaccard-sim(S_A, S_B)
[Figure: the elements of Set A and Set B are mapped into the universe 2^64 by a random permutation; take the permuted minimum of each set: are these minima equal?]
Use 200 random permutations (keeping the minimum under each), or pick the 200 smallest items under one random permutation: this creates one 200-dim vector of integers per set, and we evaluate the Hamming distance between the two arrays!
Sec. 19.6
Lemma: Prob[min π(S_A) = min π(S_B)] is exactly the Jaccard-sim(S_A, S_B)
In practice π(x) = (a·x + b) mod 2^64, and the sketch keeps the permuted minimum
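A minimal min-hashing sketch under these assumptions (illustrative Python; a is forced odd so that x → (a·x + b) mod 2^64 is a bijection):

```python
import random

def minhash_sketch(S, n_perm=200, seed=0):
    rng = random.Random(seed)
    M = 1 << 64
    # a odd => x -> (a*x + b) mod 2^64 is a permutation of the universe.
    coeffs = [(rng.randrange(M) | 1, rng.randrange(M)) for _ in range(n_perm)]
    return [min((a * x + b) % M for x in S) for a, b in coeffs]

def estimated_jaccard(sk_a, sk_b):
    # Fraction of agreeing permuted minima estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

A, B = {1, 2, 3, 4, 5, 6}, {3, 4, 5, 6, 7, 8}
print(estimated_jaccard(minhash_sketch(A), minhash_sketch(B)))  # ~0.5 = 4/8
```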
Cosine distance btw p and q
Construct a random hyperplane r of d-dim and unit norm
Sketch of a vector p is hr(p)=sign(p r) = ±1
Sketch of a vector q is hr(q)=sign(q r) = ±1
Lemma: Pr[h_r(p) = h_r(q)] = 1 − θ/π, where θ is the angle between p and q, i.e. cos(θ) = p·q / (||p|| · ||q||)
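A sketch of the random-hyperplane scheme (illustrative Python/NumPy; it compares the empirical agreement rate with 1 − θ/π):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_planes = 100, 64
R = rng.standard_normal((n_planes, d))         # rows = random hyperplanes r
p, q = rng.standard_normal(d), rng.standard_normal(d)

sketch = lambda v: np.sign(R @ v)              # h_r(v) = sign(v . r) = +-1
agree = np.mean(sketch(p) == sketch(q))        # fraction of agreeing bits
theta = np.arccos(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
print(agree, 1 - theta / np.pi)                # empirical vs. 1 - theta/pi
```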
Other distances
The main theorem
Set k = (log n) / (log 1/p2) and L = n^ρ, with ρ = (ln p1) / (ln p2) < 1; then
the LSH construction described before guarantees
Extra space ≈ n·L = n^(1+ρ) fingerprints, each of size k
Query time ≈ L = n^ρ buckets accessed
It is correct with probability ≈ 0.3
Repeating the LSH construction 1/δ times, the success probability becomes 1 − δ
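Plugging illustrative numbers into the theorem (hypothetical p1, p2; Python):

```python
import math

n, p1, p2 = 10**6, 0.8, 0.4                    # hypothetical values
rho = math.log(p1) / math.log(p2)              # ~0.24, indeed < 1
k = round(math.log(n) / math.log(1 / p2))      # fingerprint length: 15
L = round(n ** rho)                            # number of repetitions: ~29
print(rho, k, L)                               # space ~ n^(1+rho), query ~ n^rho
```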
Whenever you have an LSH function that maps close items to an equal value (with probability ≥ p1) and far items to an equal value only rarely (with probability ≤ p2 < p1), then the construction above applies…
Many variants and improvements exist nowadays!