Locality-sensitive hashing and its applications
Paolo Ferragina University of Pisa
ACM Kanellakis Award 2013
A frequent issue
Given U users, each described by a set of d features, the goal is to find the (largest) group of similar users
Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/following, …
A feature is typically a numerical value: binary or real
Users could also be Web pages (dedup), products (recommendation), tweets/news/search results (visualization)
[Example: binary feature vectors such as 010001110, 000110010, 100110010, or real-valued ones such as 0.3, 0.7, 0.1]
Similarity(u1,u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1]
Solution #1
Try all groups of users and, for each group, check the (average) similarity among all its users.
# Sim computations ≈ 2^U
In the case of Facebook U > 1 billion (10^9), so this is > 2^(10^9)
If we limit groups to have a size ≤ L users:
# Sim computations ≈ (U choose L) ≈ U^L
(Even at 1 ns/Sim and L = 10, this is > (10^9)^10 / 10^9 secs > 10^70 years)
No faster CPU/GPU, multi-cores, … could help!
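A quick back-of-the-envelope check of this bound (the numbers U = 10^9, L = 10 and 1 ns/Sim come from the slide; the script is only illustrative):

```python
# Back-of-the-envelope: ~U^L candidate groups, at 1 ns per Sim computation.
U, L = 10**9, 10
sims = U**L                      # crude upper bound on # groups of size <= L
secs = sims / 10**9              # 10^9 Sim computations per second
years = secs / (3600 * 24 * 365)
print(f"~{years:.1e} years")     # ~3e73 years, indeed > 10^70
```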
Solution #2: introduce approximation
Interpret every user as a point in a d-dim space, and then apply a clustering algorithm
[K-means illustration over features f1 and f2: pick K = 2 centroids at random → compute clusters → re-determine centroids → re-compute clusters → re-determine centroids → re-compute clusters → converged!]
K-means: each iteration takes K × U computations of Sim
Solution #2: a few considerations
Cost per iteration = K × U; the number of iterations is typically small
What about optimality? It is only locally optimal
[recently, some researchers showed how to introduce some guarantees]
What about the Sim-cost? Comparing users/points costs Θ(d) in time and space [note that d may be in the millions or billions]
What about K? Iterating over K = 1, …, U costs ≈ U^3 < U^L [still years of computation]
In T time, we can manage U = T^(1/3) users
Using an s-times faster CPU ≈ using an old CPU for time s*T:
we can manage (s*T)^(1/3) = s^(1/3) * T^(1/3) users
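To make the per-iteration cost concrete, here is a minimal sketch of the K-means loop (illustrative Python/NumPy, not the slides' code; no empty-cluster handling):

```python
import numpy as np

def kmeans(points, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Pick K centroids at random among the points.
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(iters):
        # Compute clusters: K*U distance evaluations, each costing O(d).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine centroids (this sketch assumes no cluster goes empty).
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(K)])
    return labels, centroids

labels, cents = kmeans(np.random.rand(100, 5), K=2)
```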
Solution #3: introduce randomization
Generate a fingerprint for every user that is much shorter than d and allows us to transform similarity into equality of fingerprints.
It is randomized, correct with high probability
It guarantees local access to data, which is good for speed in disk-based/distributed settings
A warm-up problem
Consider vectors p,q of d binary features
Hamming distance
D(p,q)= #bits where p and q differ
Define hash function h by choosing a set I of k random coordinates
h(p) = projection of vector p on I’s coordinates
Example: Pick I={1,4} (k=2), then h(p=01011) =01
A key property
Pr[picking x s.t. p[x]=q[x]]= (d - D(p,q))/d
We can vary the probability by changing k
Pr[h(p) = h(q)] = [1 − D(p,q)/d]^k
(the k coordinates of I are drawn at random from {1, 2, …, d})
[Plot: Pr[h(p) = h(q)] as a function of the distance D(p,q), for k = 2 and k = 4]
Larger k ⇒ smaller false positives
What about False Negatives?
[Comparing p versus q position by position: # mismatching positions = D(p,q), # matching positions = d − D(p,q)]
Pr[h(p) = h(q)] = s^k, where s = 1 − D(p,q)/d is the similarity between p and q
Reiterate L times
1) Repeat the k-projection L times, getting h1(p), …, hL(p)
2) Set g(p) = <h1(p), h2(p), …, hL(p)>
3) Declare «p matches q» if at least one hi(p) = hi(q)
Example:
We set k=2, L=3, let p = 01001 and q = 01101
•I1 = {3,4}, we have h1(p) = 00 and h1(q)=10
•I2 = {1,3}, we have h2(p) = 00 and h2(q)=01
•I3 = {1,5}, we have h3(p) = 01 and h3(q)=01
p and q are declared to match!
Larger L ⇒ smaller false negatives
g(p) = <h1(p), …, hL(p)> is called the sketch of p
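A minimal sketch of the whole scheme (illustrative Python; it reproduces the k = 2, L = 3 example above, with the 1-based index sets rewritten 0-based):

```python
import random

def make_projections(d, k, L, seed=42):
    rng = random.Random(seed)
    return [sorted(rng.sample(range(d), k)) for _ in range(L)]

def h(p, I):                       # projection of bit-string p on coordinates I
    return ''.join(p[i] for i in I)

def match(p, q, projections):      # «p matches q» iff some h_i agrees
    return any(h(p, I) == h(q, I) for I in projections)

# The slides' example: I1={3,4}, I2={1,3}, I3={1,5} (1-based) -> 0-based lists.
p, q = "01001", "01101"
print(match(p, q, [[2, 3], [0, 2], [0, 4]]))   # True, thanks to h3
```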
Measuring the error probability
The g() consists of L independent hashes hi
Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q) for all i = 1, …, L]
= 1 − (1 − s^k)^L
[Plot: Pr[g(p) matches g(q)] as a function of the similarity s = 1 − D(p,q)/d; the curve has a sharp threshold around s ≈ (1/L)^(1/k)]
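One can check the sharpness of this threshold numerically (illustrative parameters k = 5, L = 20):

```python
k, L = 5, 20                                   # illustrative parameters
threshold = (1 / L) ** (1 / k)                 # ~0.55 here
for s in (0.3, 0.5, 0.55, 0.7, 0.9):
    print(f"s = {s:.2f}   Pr[match] = {1 - (1 - s**k)**L:.3f}")
# The probability jumps from ~0 to ~1 around s ~ threshold.
```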
The case: Groups of similar items
Buckets provide the candidate similar items
«Merge» similar sets over L rounds if they share items
[Figure: hash tables T1, T2, …, TL, one per hash function h1, h2, …, hL; if p ≈ q, then p and q fall into at least one common bucket]
No tables! SORT the pairs <hi(x), x> instead: items with an equal fingerprint end up adjacent (e.g. one run holding p, q, … and another holding q, z, …)
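A sketch of this sort-based variant (illustrative Python; h is the bit-sampling projection defined earlier):

```python
from itertools import groupby

def h(p, I):                                   # bit-sampling projection, as before
    return ''.join(p[i] for i in I)

def candidate_pairs(items, projections):
    pairs = set()
    for I in projections:                      # one round per h_i
        keyed = sorted((h(p, I), p) for p in items)    # SORT: equal keys adjacent
        for _, run in groupby(keyed, key=lambda kv: kv[0]):
            bucket = [p for _, p in run]       # one «bucket» = one run of the sort
            pairs.update((a, b) for i, a in enumerate(bucket) for b in bucket[i+1:])
    return pairs

print(candidate_pairs(["01001", "01101", "11110"], [[2, 3], [0, 2], [0, 4]]))
```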
The case of on-line query
Given a query w, find the similar indexed vectors:
check the vectors in the buckets hj(w), for all j = 1, …, L
[Figure: the query w is hashed into tables T1, T2, …, TL via h1(w), h2(w), …, hL(w); the probed buckets contain e.g. {p, q}, {p, z, t}, {r, q}]
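A sketch of the table-based index and query (illustrative Python, with dictionaries standing in for the tables T1, …, TL):

```python
from collections import defaultdict

def h(p, I):
    return ''.join(p[i] for i in I)

def build_tables(items, projections):
    tables = [defaultdict(list) for _ in projections]   # T_1 ... T_L
    for p in items:
        for T, I in zip(tables, projections):
            T[h(p, I)].append(p)               # p goes in one bucket per table
    return tables

def query(w, tables, projections):
    candidates = set()
    for T, I in zip(tables, projections):      # probe bucket h_j(w), j = 1..L
        candidates.update(T[h(w, I)])
    return candidates                          # candidates, to be verified

P = [[2, 3], [0, 2], [0, 4]]
tables = build_tables(["01001", "01101", "11110"], P)
print(query("01101", tables, P))
```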
LSH versus K-means
What about optimality ? K-means is locally optimal [LSH finds correct clusters with high probability]
What about the Sim-cost ? K-means compares vectors of d components [LSH compares very short (sketch) vectors]
What about the cost per iteration? Typically K-means requires few iterations, each costing K × U × d
[LSH sorts U short items, then makes a few scans]
What about K? In principle one has to iterate over K = 1, …, U [LSH does not need to know the number of clusters]
You could even apply K-means over the LSH-sketch vectors!!
Document duplication
(exact or approximate)
Duplicate documents
The web is full of duplicated content
Few cases of exact duplicates
Many cases of near duplicates
E.g., the Last-modified date is the only difference between two copies of a page
Sec. 19.6
Obvious techniques
Checksum – no worst-case collision-probability guarantees
MD5 – cryptographically secure string hashes, but relatively slow
Karp-Rabin's scheme
Rolling hash: split the doc into many pieces
Algebraic technique – arithmetic on primes
Efficient, with other nice properties…
Exact-Duplicate Detection
Karp-Rabin Fingerprints
Consider an m-bit string A = 1 a1 a2 … am (note the prepended leading 1)
Basic values:
Choose a prime p in the universe U, such that 2p fits in a few memory words (hence U ≈ 2^64)
Fingerprints: f(A) = A mod p
Nice property: if B = 1 a2 … am am+1 (the window shifted by one), then
f(B) = [2 (A − 2^m − a1 2^(m−1)) + 2^m + am+1] mod p
Prob[false hit between A and B] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≈ log(A + B) / #primes(U) ≈ (m log U) / U
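A sketch of the rolling computation (illustrative Python over bit-strings; the Mersenne prime 2^61 − 1 is an assumed choice for p):

```python
def karp_rabin_windows(bits, m, p=(1 << 61) - 1):
    """Yield f(window) = (1 a_1 ... a_m read as an integer) mod p
    for every m-bit window of `bits`."""
    pow_m = pow(2, m, p)                       # 2^m     mod p
    pow_m1 = pow(2, m - 1, p)                  # 2^(m-1) mod p
    f = 1                                      # the prepended leading 1
    for a in bits[:m]:                         # first window, Horner-style
        f = (2 * f + a) % p
    yield f
    for i in range(m, len(bits)):
        # Update rule above: f(B) = [2(A - 2^m - a1 2^(m-1)) + 2^m + a_(m+1)] mod p
        f = (2 * (f - pow_m - bits[i - m] * pow_m1) + pow_m + bits[i]) % p
        yield f

print(list(karp_rabin_windows([1, 0, 1, 1, 0], m=3)))   # [13, 11, 14]
```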
Near-Duplicate Detection
Problem
Given a large collection of documents
Identify the near-duplicate documents
Web search engines
Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …
Mistaken – spider errors
30% of web-pages are near-duplicates [1997]
Shingling: from docs to sets of shingles
dissect document into q-grams (shingles)
T = I live and study in Pisa, …
If we set q=3, the 3-grams (over words) are:
<I live and> <live and study> <and study in> <study in Pisa> …
represent documents by sets of hashes/shingles
The near-duplicate document detection problem reduces to set intersection among sets of integers (shingles)
[Figure: Doc A → shingle set S_A, Doc B → shingle set S_B]
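A sketch of word-level shingling (illustrative Python; the built-in hash stands in for a proper shingle fingerprint):

```python
def shingles(text, q=3):
    """Dissect `text` into word-level q-grams and hash each one to an int."""
    words = text.split()
    return {hash(' '.join(words[i:i + q])) for i in range(len(words) - q + 1)}

A = shingles("I live and study in Pisa and I like it")
B = shingles("I live and study in Pisa and I love it")
print(len(A & B) / len(A | B))     # near-duplicates share most shingles
```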
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible
Stream processing: once the sketch is computed, the source is unavailable
Error guarantees: at this problem scale, small biases have a large impact;
we need formal guarantees – heuristics will not do
More applications
Sets & Jaccard similarity
Set similarity = Jaccard similarity:
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Goal: compute Jaccard-sim(S_A, S_B)
[Figure: the elements of Set A and Set B are mapped into the universe 2^64 by a random permutation; take the permuted minimum of each set: are these minima equal?]
Use 200 random permutations (keeping the minimum under each), or pick the 200 smallest items under one random permutation: this creates one 200-dim vector of integers per set, and we evaluate the Hamming distance between the two arrays!
Sec. 19.6
Lemma: Prob[min π(S_A) = min π(S_B)] is exactly the Jaccard-sim(S_A, S_B)
In practice π(x) = (a·x + b) mod 2^64, and the sketch keeps the permuted minimum
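A minimal min-hashing sketch under these assumptions (illustrative Python; a is forced odd so that x → (a·x + b) mod 2^64 is a bijection):

```python
import random

def minhash_sketch(S, n_perm=200, seed=0):
    rng = random.Random(seed)
    M = 1 << 64
    # a odd => x -> (a*x + b) mod 2^64 is a permutation of the universe.
    coeffs = [(rng.randrange(M) | 1, rng.randrange(M)) for _ in range(n_perm)]
    return [min((a * x + b) % M for x in S) for a, b in coeffs]

def estimated_jaccard(sk_a, sk_b):
    # Fraction of agreeing permuted minima estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

A, B = {1, 2, 3, 4, 5, 6}, {3, 4, 5, 6, 7, 8}
print(estimated_jaccard(minhash_sketch(A), minhash_sketch(B)))  # ~0.5 = 4/8
```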
Cosine distance btw p and q
Construct a random hyperplane r of d-dim and unit norm
Sketch of a vector p is hr(p)=sign(p r) = ±1
Sketch of a vector q is hr(q)=sign(q r) = ±1
Lemma: Pr[h_r(p) = h_r(q)] = 1 − θ/π, where θ is the angle between p and q, i.e. cos(θ) = p·q / (||p|| · ||q||)
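A sketch of the random-hyperplane scheme (illustrative Python/NumPy; it compares the empirical agreement rate with 1 − θ/π):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_planes = 100, 64
R = rng.standard_normal((n_planes, d))         # rows = random hyperplanes r
p, q = rng.standard_normal(d), rng.standard_normal(d)

sketch = lambda v: np.sign(R @ v)              # h_r(v) = sign(v . r) = +-1
agree = np.mean(sketch(p) == sketch(q))        # fraction of agreeing bits
theta = np.arccos(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
print(agree, 1 - theta / np.pi)                # empirical vs. 1 - theta/pi
```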
Other distances
The main theorem
Set k = (log n) / (log 1/p2) and L = n^ρ, with ρ = (ln p1) / (ln p2) < 1; then
the LSH construction described before guarantees
Extra space ≈ n·L = n^(1+ρ) fingerprints, each of size k
Query time ≈ L = n^ρ buckets accessed
It is correct with probability ≈ 0.3
Repeating the LSH construction 1/δ times, the success probability becomes 1 − δ
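Plugging illustrative numbers into the theorem (hypothetical p1, p2; Python):

```python
import math

n, p1, p2 = 10**6, 0.8, 0.4                    # hypothetical values
rho = math.log(p1) / math.log(p2)              # ~0.24, indeed < 1
k = round(math.log(n) / math.log(1 / p2))      # fingerprint length: 15
L = round(n ** rho)                            # number of repetitions: ~29
print(rho, k, L)                               # space ~ n^(1+rho), query ~ n^rho
```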
Whenever you have an LSH function that maps close items to an equal value (with probability ≥ p1) and far items to an equal value only rarely (with probability ≤ p2 < p1), then the construction above applies…
Many variants and improvements exist nowadays!