Locality-sensitive hashing and its applications
Paolo Ferragina University of Pisa
ACM Kanellakis Award 2013
A frequent issue
Given U users, described with a set of d features, the goal is to find (the largest) group of similar users
Features = Personal data, preferences, purchases,
navigational behavior, search behavior, followers/ing,
…
A feature is typically a numerical value: binary or real
Users could also be Web pages (dedup), products (recommendation),
tweets/news/search results (visualization)
[Figure: users as feature vectors, e.g. binary 010001110, 000110010, 100110010, or real-valued 0.3, 0.7, 0.1]
Similarity(u1,u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1]
Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users.
# Sim computations ≥ 2^U (one per candidate group of users)
In the case of Facebook this is > 2^(10^9), since U > 1 billion (10^9)
If we limit groups to have a size of L users:
# Sim computations ≈ C(U, L) · C(L, 2) ≈ U^L
(Even at 1 ns/Sim and L=10, it is > (10^9)^10 / 10^9 secs > 10^70 years)
No faster CPU/GPU, multi-cores, … could help !
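The back-of-the-envelope figure above can be checked directly (a sketch; the seconds-per-year constant is an assumed approximation):

```python
# Check the slide's estimate: ~ (10^9)^10 similarity computations
# for U = 10^9 users and groups of L = 10, at 1 ns per Sim.
SECONDS_PER_YEAR = 3.15e7          # assumed approximation

sims = (10 ** 9) ** 10             # ~ U^L similarity computations
seconds = sims / 10 ** 9           # 10^9 Sims per second (1 ns each)
years = seconds / SECONDS_PER_YEAR
print(f"{years:.1e} years")        # far beyond 10^70 years
```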
Solution #2: introduce approximation
Interpret every user as a point in a d-dim
space, and then apply a clustering algorithm
K-means:
1) Pick K centroids at random (e.g. K=2)
2) Compute clusters: assign every point to its closest centroid
3) Re-determine the centroids
4) Re-compute the clusters, and repeat 3)-4) until converged
[Figure: the K-means iterations on points in the (f1, f2) plane, centroids marked with x]
Each iteration takes K · U computations of Sim
Solution #2: few considerations
Cost per iteration = K · U; #iterations is typically small
What about optimality ? It is locally optimal
[recently, some researchers showed how to introduce some guarantees]
What about the Sim-cost ? Comparing users/points costs Θ(d) in time and space [notice that d may be in the millions]
What about K ? Iterating over K=1, …, U costs ≈ U^3 ≪ U^L, but still [years]
In T time, we can manage U = T^(1/3) users
Using an s-times faster CPU ≈ using time s·T on an old CPU:
we can manage (s·T)^(1/3) = s^(1/3) · T^(1/3) users
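The K-means procedure of Solution #2 can be sketched in plain Python (Lloyd's iterations over points as tuples; squared Euclidean distance stands in for Sim, and all names are illustrative):

```python
import random

def kmeans(points, K, iters=20):
    # Pick K centroids at random among the points
    centroids = random.sample(points, K)
    clusters = [[] for _ in range(K)]
    for _ in range(iters):
        # Compute clusters: K * U distance (Sim) computations per iteration
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Re-determine centroids as the mean of each cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters
```

With well-separated points the loop settles after a handful of iterations, matching the "#iterations is typically small" observation.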
Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints.
It is randomized, correct with high probability
It guarantees local access to data, which is good for speed in disk/distributed settings
A warm-up problem
Consider vectors p,q of d binary features
Hamming distance
D(p,q)= #bits where p and q differ
Define hash function h by choosing a set I of k random coordinates
h(p) = projection of vector p on I’s coordinates
Example: Pick I={1,4} (k=2), then h(p=01011) =01
A key property
Pr[picking x s.t. p[x]=q[x]]= (d - D(p,q))/d
We can vary the probability by changing k
Given coordinates 1, 2, …, d and a random subset I of size k:
Pr[h(p) = h(q)] = ( 1 − D(p,q)/d )^k
[Figure: Pr[h(p) = h(q)] as a function of the distance D(p,q), plotted for k=2 and k=4]
Larger k ⟹ Smaller False Positives
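A minimal sketch of the k-projection hash (0-indexed coordinates, unlike the slide's 1-indexed sets; the names are illustrative), estimating Pr[h(p) = h(q)] = (1 − D(p,q)/d)^k by drawing the k coordinates with replacement:

```python
import random

def h(p, I):
    # Project the bit-string p onto the coordinates in I (0-indexed)
    return "".join(p[i] for i in I)

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

random.seed(42)
p, q = "0101101001", "0101001010"                 # d = 10, D(p,q) = 3
d, k, trials = len(p), 2, 200_000
hits = 0
for _ in range(trials):
    I = [random.randrange(d) for _ in range(k)]   # k coords, with replacement
    hits += (h(p, I) == h(q, I))
print(hits / trials)    # close to (1 - 3/10)^2 = 0.49
```

Drawing I as a set of k distinct coordinates changes the exact probability slightly (hypergeometric rather than a k-th power), but the formula above is the with-replacement case.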
What about False Negatives?
p versus q: they differ on D(p,q) coordinates and agree on the remaining d − D(p,q)
Pr[h(p) = h(q)] = ( (d − D(p,q)) / d )^k = s^k
where s is the similarity between p and q
Reiterate L times
1) Repeat L times the k-projections hi(p)
2) We set g(p) = < h1(p), h2(p), …, hL(p) >
3) Declare «p matches q» if at least one hi(p) = hi(q)
Example: we set k=2, L=3, and let p = 01001, q = 01101
I1 = {3,4}: h1(p) = 00 and h1(q) = 10
I2 = {1,3}: h2(p) = 00 and h2(q) = 01
I3 = {1,5}: h3(p) = 01 and h3(q) = 01
p and q are declared to match !!
Larger L ⟹ Smaller False Negatives
The vector g(p) is the Sketch(p)
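The worked example above (k=2, L=3) can be replayed directly; coordinates are translated to 0-indexing and the helper names are illustrative:

```python
def h(p, I):
    # Project the bit-string p onto the coordinates in I (0-indexed)
    return "".join(p[i] for i in I)

def matches(p, q, Is):
    # «p matches q» if at least one of the L projections agrees
    return any(h(p, I) == h(q, I) for I in Is)

p, q = "01001", "01101"
# I1={3,4}, I2={1,3}, I3={1,5} from the slide, shifted to 0-indexing:
Is = [[2, 3], [0, 2], [0, 4]]
print([(h(p, I), h(q, I)) for I in Is])  # [('00', '10'), ('00', '01'), ('01', '01')]
print(matches(p, q, Is))                 # True: the third projection agrees
```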
Measuring the error probability
g() consists of L independent hashes hi
Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q), for all i = 1, …, L]
Pr[g(p) matches g(q)] = 1 − ( 1 − s^k )^L
where s = 1 − D(p,q)/d is the similarity between p and q, so that Pr[hi(p) = hi(q)] = s^k
[Figure: 1 − (1 − s^k)^L plotted as a function of s: an S-curve whose threshold falls at s ≈ (1/L)^(1/k)]
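The amplified probability 1 − (1 − s^k)^L and its threshold s ≈ (1/L)^(1/k) can be tabulated (the parameter values are illustrative):

```python
def match_prob(s, k, L):
    # Pr[g(p) matches g(q)] when each hi collides with probability s^k
    return 1 - (1 - s ** k) ** L

k, L = 4, 20
s_star = (1 / L) ** (1 / k)        # threshold of the S-curve
print(match_prob(s_star, k, L))    # = 1 - (1 - 1/L)^L, about 0.64
print(match_prob(0.9, k, L))       # similar pairs: probability near 1
print(match_prob(0.3, k, L))       # dissimilar pairs: probability small
```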
The case: Groups of similar items
Buckets provide the candidate similar items
«Merge» the candidate sets over the L rounds if they share items
[Figure: p is hashed by h1(p), h2(p), …, hL(p) into tables T1, T2, …, TL; if p ≈ q, then they fall in at least one same bucket]
No Tables ! Alternatively, SORT the pairs ⟨hi(p), p⟩: items with equal fingerprints become adjacent, forming buckets such as {p, q, …} and {q, z, …}
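The table-free variant can be sketched with one sort per round: items with equal fingerprints become adjacent and each run of equal keys is a candidate bucket (function names are illustrative):

```python
from itertools import groupby

def candidate_groups(items, hash_fns):
    # One SORT per round: items sharing the fingerprint hi(p) become
    # adjacent, so each run of equal keys yields one candidate bucket
    for hi in hash_fns:
        for _, run in groupby(sorted(items, key=hi), key=hi):
            bucket = list(run)
            if len(bucket) > 1:
                yield bucket

def proj(I):
    # k-bit projection hash from the earlier slides (0-indexed)
    return lambda p: "".join(p[i] for i in I)

items = ["01001", "01101", "11000"]
for b in candidate_groups(items, [proj([2, 3]), proj([0, 2]), proj([0, 4])]):
    print(b)
```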
The case of on-line query
Given a query w, find the similar indexed vectors:
check the vectors in the buckets hj(w), for all j = 1, …, L
[Figure: w is hashed by h1(w), h2(w), …, hL(w) into tables T1, T2, …, TL, reaching buckets such as {p, q}, {p, z, t}, {r, q}]
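A minimal table-based index for the on-line query, assuming the projection hashes of the earlier slides (class and method names are illustrative):

```python
from collections import defaultdict

class LSHIndex:
    def __init__(self, hash_fns):
        # One table Tj per hash function hj, j = 1, ..., L
        self.hash_fns = hash_fns
        self.tables = [defaultdict(list) for _ in hash_fns]

    def insert(self, p):
        for table, hj in zip(self.tables, self.hash_fns):
            table[hj(p)].append(p)

    def query(self, w):
        # Check the vectors in the buckets hj(w), for all j = 1, ..., L
        candidates = set()
        for table, hj in zip(self.tables, self.hash_fns):
            candidates.update(table[hj(w)])
        return candidates

def proj(I):
    # k-bit projection hash from the earlier slides (0-indexed)
    return lambda p: "".join(p[i] for i in I)

idx = LSHIndex([proj([2, 3]), proj([0, 2]), proj([0, 4])])
for v in ["01001", "11000", "00110"]:
    idx.insert(v)
print(idx.query("01101"))  # contains "01001", which agrees on the third projection
```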
LSH versus K-means
What about optimality ? K-means is locally optimal [LSH finds correct clusters with high probability]
What about the Sim-cost ? K-means compares vectors of d components [LSH compares very short (sketch) vectors]
What about the cost per iteration? Typically K- means requires few iterations, each costs K U d
[LSH sorts U short items, few scans]
What about K ? In principle we have to iterate over K=1, …, U [LSH does not need to know the number of clusters]
You could apply K-means over the LSH-sketch vectors !!
More applications
Sets & Jaccard similarity
Set similarity = Jaccard similarity:
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Goal: compute Jaccard-sim(S_A, S_B)
[Figure: the items of Set A and Set B are mapped by a random permutation into [0, 2^64); for each set keep the permuted minimum: are these equal?]
Use 200 random permutations (keeping the minimum under each), or pick the 200 smallest items from one random permutation; this creates one 200-dim vector per set, over which we evaluate the Hamming distance !
Sec. 19.6
Lemma: Prob[the two permuted minima are equal] is exactly the Jaccard-sim(S_A, S_B)
The permutation can be implemented as x ↦ (a·x + b) mod 2^64, keeping the permuted minimum
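A min-hashing sketch following the (a·x + b) permutation above; a prime modulus (2^61 − 1) replaces the slide's 2^64 so that x ↦ a·x + b is a true bijection, and all other choices are illustrative:

```python
import random

M = 2 ** 61 - 1    # prime modulus: x -> (a*x + b) mod M is a permutation for a != 0

def minhash_sig(S, perms):
    # The permuted minimum of set S under each pair (a, b)
    return [min((a * x + b) % M for x in S) for a, b in perms]

def jaccard_estimate(sigA, sigB):
    # Pr[minima are equal] = Jaccard-sim, so the equal fraction estimates it
    return sum(x == y for x, y in zip(sigA, sigB)) / len(sigA)

random.seed(1)
perms = [(random.randrange(1, M), random.randrange(M)) for _ in range(200)]

A = set(range(0, 80))
B = set(range(40, 120))            # true Jaccard-sim = 40 / 120 = 1/3
print(jaccard_estimate(minhash_sig(A, perms), minhash_sig(B, perms)))
```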
Cosine distance between p and q
Construct a random hyperplane r of d dimensions and unit norm
Sketch of a vector p is hr(p) = sign(p · r) = ±1
Sketch of a vector q is hr(q) = sign(q · r) = ±1
Lemma: Pr[hr(p) = hr(q)] = 1 − θ/π, where cos(θ) = p · q / (||p|| · ||q||)
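The lemma can be checked empirically with random Gaussian hyperplanes (a sketch; the names and the orthogonal test vectors are illustrative):

```python
import random

def hyperplane_sketch(p, planes):
    # One ±1 bit per random hyperplane r: sign(p · r)
    return [1 if sum(x * r for x, r in zip(p, rv)) >= 0 else -1 for rv in planes]

random.seed(7)
d, n = 3, 10_000
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

p, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]     # orthogonal, so theta = pi/2
sp, sq = hyperplane_sketch(p, planes), hyperplane_sketch(q, planes)
agree = sum(a == b for a, b in zip(sp, sq)) / n
print(agree)    # close to 1 - theta/pi = 0.5
```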
Other distances
The main theorem
Set k = (log n) / (log 1/p2)
L = n^ρ, with ρ = (ln p1) / (ln p2) < 1
Then the LSH-construction described before guarantees:
Extra space ≈ n·L = n^(1+ρ) fingerprints, each of size k
Query time ≈ L = n^ρ buckets accessed
It is correct with probability ≈ 0.3. Repeating the LSH-construction 1/δ times, the success probability becomes 1 − δ.
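Plugging sample values into the theorem's formulas (p1, p2 and n here are assumptions, not values from the slides):

```python
import math

n, p1, p2 = 10 ** 6, 0.9, 0.5          # illustrative collision probabilities
rho = math.log(p1) / math.log(p2)      # rho = ln p1 / ln p2 < 1
k = round(math.log(n) / math.log(1 / p2))
L = round(n ** rho)
print(rho, k, L)   # space ~ n^(1+rho) fingerprints, query ~ L = n^rho buckets
```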
Whenever you have an LSH-function which maps close items to an equal value and far items to different values, the same construction applies.
Nowadays there exist many variants and improvements !