(1)

Locality-sensitive hashing and its applications

Paolo Ferragina, University of Pisa

ACM Kanellakis Award 2013

(2)

A frequent issue

Given U users, each described by a set of d features, the goal is to find the (largest) group of similar users.

Features = personal data, preferences, purchases, navigational behavior, search behavior, followers/following, …

A feature is typically a numerical value: binary or real.

Users could also be Web pages (dedup), products (recommendation), tweets/news/search results (visualization).

[Examples of feature vectors: binary, e.g. 010001110, 000110010, 100110010, or real-valued, e.g. 0.3, 0.7, 0.1]

Similarity(u1, u2) is a function that, given the feature sets of users u1 and u2, returns a value in [0,1].

(3)

Solution #1

Try all groups of users and, for each group, check the (average) similarity among all its users.

# Sim computations ≈ 2^U · U^2

In the case of Facebook this is > 2^(1 billion) · (10^9)^2

If we limit groups to have size ≤ L users:

# Sim computations ≈ U^L · L^2

(Even at 1 ns/Sim and L=10, it is > (10^9)^10 / 10^9 secs > 10^70 years)

No faster CPU/GPU, multi-cores, … could help!

(4)

Solution #2: introduce approximation

Interpret every user as a point in a d-dimensional space, and then apply a clustering algorithm.

K-means:

1) Pick K=2 centroids at random
2) Compute clusters
3) Re-determine centroids
4) Re-compute clusters
5) Re-determine centroids
6) Re-compute clusters
7) Converged!

[Figure: K-means iterations on points in the (f1, f2) plane, centroids marked with x]

Each iteration takes ≈ K · U computations of Sim
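As an illustration (not the lecturer's code), a minimal Python sketch of the iteration above, assuming users are rows of a numpy array and Euclidean distance stands in for Sim:

```python
import numpy as np

# A sketch of K-means (Lloyd's algorithm): pick centroids, compute clusters,
# re-determine centroids, repeat until converged.
def kmeans(points, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=K, replace=False)]
    for _ in range(iters):
        # Compute clusters: ~ K * U distance (Sim) computations per iteration.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-determine each centroid as the mean of its cluster.
        new = np.array([points[labels == j].mean(axis=0)
                        if (labels == j).any() else centroids[j]
                        for j in range(K)])
        if np.allclose(new, centroids):   # Converged!
            break
        centroids = new
    return labels, centroids
```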

(5)

Solution #2: few considerations

Cost per iteration = K · U; the number of iterations is typically small.

What about optimality? It is locally optimal [recently, some researchers showed how to introduce some guarantees].

What about the Sim-cost? Comparing two users/points costs Θ(d) in time and space [notice that d may be millions or billions].

What about K? Iterating over K = 1, …, U costs ≈ U^3 < U^L [still years].

In T time, we can manage U = T^(1/3) users.

Using an s-times faster CPU ≈ using time s·T on the old CPU
⇒ we can manage (s·T)^(1/3) = s^(1/3) · T^(1/3) users

(6)

Solution #3: introduce randomization

Generate for every user a fingerprint that is much shorter than d and that transforms similarity into equality of fingerprints.

ACM Kanellakis Award 2013

It is randomized, and correct with high probability.

It guarantees local access to data, which is good for speed in disk-based or distributed settings.

(7)

A warm-up problem

Consider vectors p,q of d binary features

Hamming distance

D(p,q)= #bits where p and q differ

Define hash function h by choosing a set I of k random coordinates

h(p) = projection of vector p on I’s coordinates

Example: Pick I={1,4} (k=2), then h(p=01011) =01
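A minimal sketch of this projection hash in Python (the helper name make_projection is hypothetical):

```python
import random

def make_projection(d, k, seed=None):
    """Choose a set I of k random coordinates out of d."""
    I = sorted(random.Random(seed).sample(range(d), k))
    # h(p) = projection of the binary string p onto I's coordinates.
    return lambda p: ''.join(p[i] for i in I)

# The slide's example: I = {1,4} in 1-based indexing, i.e. {0,3} 0-based.
h = lambda p: ''.join(p[i] for i in (0, 3))
assert h("01011") == "01"
```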

(8)

A key property

Pr[picking x s.t. p[x]=q[x]]= (d - D(p,q))/d

We can vary the probability by changing k

Pr[h(p) = h(q)] = (1 − D(p,q)/d)^k = s^k, where s is the similarity between p and q

[Plot: Pr[h(p)=h(q)] as a function of the distance D(p,q), for k=2 and k=4: the probability drops faster for larger k]

[Diagram: comparing p versus q position by position, #differing bits = D(p,q), #equal bits = d − D(p,q)]

Larger k ⇒ smaller false-positive probability. What about false negatives?
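As a quick numeric check of the property above: a similar pair with s = 0.9 still collides with probability 0.9^2 = 0.81 for k = 2, but only 0.9^4 ≈ 0.66 for k = 4, while a far pair with s = 0.5 drops from 0.25 to ≈ 0.06. Larger k thus filters false positives, at the price of more false negatives.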

(9)

Reiterate L times

1) Repeat L times the k-projections hi(p)
2) Set g(p) = < h1(p), h2(p), …, hL(p) > — this is the sketch of p
3) Declare «p matches q» if at least one hi(p) = hi(q)

Example: set k=2, L=3, and let p = 01001 and q = 01101

• I1 = {3,4}: h1(p) = 00 and h1(q) = 10
• I2 = {1,3}: h2(p) = 00 and h2(q) = 01
• I3 = {1,5}: h3(p) = 01 and h3(q) = 01

p and q are declared to match!

Larger L ⇒ smaller false-negative probability
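A sketch in Python of the L-round matching, reproducing the slide's example (names lsh_sketch and match are hypothetical; indices are 0-based):

```python
def lsh_sketch(p, projections):
    """g(p) = <h1(p), ..., hL(p)>: one k-subset of coordinates per round."""
    return [''.join(p[i] for i in I) for I in projections]

def match(p, q, projections):
    """Declare «p matches q» if at least one hi(p) = hi(q)."""
    return any(a == b for a, b in zip(lsh_sketch(p, projections),
                                      lsh_sketch(q, projections)))

# The slide's sets {3,4}, {1,3}, {1,5}, written 0-based:
projections = [(2, 3), (0, 2), (0, 4)]
print(match("01001", "01101", projections))   # True: h3 agrees on "01"
```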

(10)

Measuring the error probability

The g() consists of L independent hashes hi:

Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q), for all i = 1, …, L] = 1 − (1 − s^k)^L

where s = Pr[hi(p) = hi(q)] = 1 − D(p,q)/d.

[Plot: Pr[g(p) matches g(q)] as a function of s; the curve rises sharply around the threshold s ≈ (1/L)^(1/k)]
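For instance (made-up numbers), with k = 2 and L = 3 a similar pair with s = 0.8 matches with probability 1 − (1 − 0.8^2)^3 = 1 − 0.36^3 ≈ 0.95, while a far pair with s = 0.3 matches with probability 1 − (1 − 0.09)^3 ≈ 0.25; the transition happens around s ≈ (1/L)^(1/k) = (1/3)^(1/2) ≈ 0.58.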

(11)

The case: Groups of similar items

Buckets provide the candidate similar items; «merge» candidate sets over the L rounds if they share items.

[Figure: tables T1, T2, …, TL addressed by h1(p), h2(p), …, hL(p); buckets such as p,q,… and q,z]

If p ≈ q, then they fall into at least one common bucket.

No tables needed! ⇒ SORT the fingerprints, round by round: equal values become adjacent.
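A possible rendering of the sort-based grouping in Python (the helper name candidate_buckets is hypothetical):

```python
from itertools import groupby

def candidate_buckets(users, projections):
    buckets = []
    for I in projections:                           # one round per projection
        key = lambda u: ''.join(u[i] for i in I)    # the fingerprint hi(u)
        # Sorting makes equal fingerprints adjacent: no hash tables needed.
        for _, run in groupby(sorted(users, key=key), key=key):
            run = list(run)
            if len(run) > 1:
                buckets.append(run)                 # candidate similar items
    return buckets   # buckets sharing items are then merged into groups
```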

(12)

The case of on-line query

Given a query w, find the similar indexed vectors: check the vectors in the buckets hj(w), for all j = 1, …, L.

[Figure: w hashed by h1(w), h2(w), …, hL(w) into tables T1, T2, …, TL, with buckets such as p,q — p,z,t — r,q]
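A possible rendering of the table-based query in Python (helper names are hypothetical):

```python
from collections import defaultdict

def build_tables(users, projections):
    """One table Tj per projection; the bucket key of user u is hj(u)."""
    tables = [defaultdict(list) for _ in projections]
    for u in users:
        for T, I in zip(tables, projections):
            T[''.join(u[i] for i in I)].append(u)
    return tables

def query(w, tables, projections):
    """Collect the candidates found in the buckets hj(w), j = 1, ..., L."""
    candidates = set()
    for T, I in zip(tables, projections):
        candidates.update(T.get(''.join(w[i] for i in I), []))
    return candidates   # the true similarity is verified only on these few
```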

(13)

LSH versus K-means

What about optimality? K-means is locally optimal. [LSH finds correct clusters with high probability]

What about the Sim-cost? K-means compares vectors of d components. [LSH compares very short (sketch) vectors]

What about the cost per iteration? K-means typically requires few iterations, each costing K · U · d. [LSH sorts U short items, in a few scans]

What about K? In principle, K-means has to iterate over K = 1, …, U. [LSH does not need to know the number of clusters]

You could also apply K-means over the LSH-sketch vectors!

(14)

Document duplication

(exact or approximate)

(15)

Duplicate documents

The web is full of duplicated content

Few exact duplicates

Many cases of near duplicates

E.g., the last-modified date is the only difference between two copies of a page

Sec. 19.6

(16)

Obvious techniques

Checksum – no worst-case collision-probability guarantees

MD5 – cryptographically-secure string hashes, but relatively slow

Karp-Rabin's scheme:

Rolling hash: split the doc into many pieces

Algebraic technique – arithmetic on primes

Efficient, and with other nice properties…

Exact-Duplicate Detection

(17)

Karp-Rabin Fingerprints

Consider an m-bit string A = 1 a1 a2 … am (a leading 1 is prepended)

Basic values: choose a prime p in the universe U, such that 2p fits in a few memory-words (hence U ≈ 2^64)

Fingerprint: f(A) = A mod p

Nice property: if B = a2 … am am+1 (the window slid one bit to the right), then

f(B) = [2 · (f(A) − 2^m − a1 · 2^(m−1)) + 2^m + am+1] mod p

Prob[false hit between A and B] = Prob[p divides (A − B)] = #prime-divisors(A − B) / #primes(U) ≈ log(A + B) / #primes(U) ≈ (m · log U) / U
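A runnable sketch of the rolling fingerprint in Python; the prime P below is a stand-in choice (a Mersenne prime fitting the "few memory-words" requirement), and the helper names are hypothetical:

```python
P = (1 << 61) - 1   # stand-in prime p; 2p fits in a machine word

def fingerprint(bits):
    """f(A) = (1 a1 ... am) mod p, with a leading 1 prepended."""
    a = 1
    for b in bits:
        a = (a * 2 + b) % P
    return a

def slide(f_a, a1, a_next, m):
    """f(B) = [2*(f(A) - 2^m - a1*2^(m-1)) + 2^m + a_{m+1}] mod p."""
    return (2 * (f_a - pow(2, m, P) - a1 * pow(2, m - 1, P))
            + pow(2, m, P) + a_next) % P

bits = [1, 0, 1, 1, 0, 1]
m = 4
assert slide(fingerprint(bits[:m]), bits[0], bits[m], m) == fingerprint(bits[1:m+1])
```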

(18)

Near-Duplicate Detection

Problem

Given a large collection of documents

Identify the near-duplicate documents

Web search engines

Proliferation of near-duplicate documents

Legitimate – mirrors, local copies, updates, …

Malicious – spam, spider-traps, dynamic URLs, …

Mistaken – spider errors

30% of web-pages are near-duplicates [1997]

(19)

Shingling: from docs to sets of shingles

Dissect the document into q-grams (shingles)

T = I live and study in Pisa, …

If we set q=3, the 3-grams are:

<I live and> <live and study> <and study in> <study in Pisa> …

Represent documents by their sets of hashed shingles.

The near-duplicate document detection problem reduces to set intersection among sets of integers (the shingles).

[Figure: Doc A and Doc B with their shingle sets SA and SB]
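A minimal shingling sketch in Python, hashing each q-gram to a 64-bit integer (blake2b truncated to 8 bytes is an arbitrary choice, not prescribed by the slides):

```python
import hashlib

def shingles(text, q=3):
    """Dissect a document into word q-grams, each hashed to a 64-bit int."""
    words = text.split()
    grams = (' '.join(words[i:i + q]) for i in range(len(words) - q + 1))
    return {int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(),
                           'big') for g in grams}

a = shingles("I live and study in Pisa and love it")
b = shingles("I live and study in Pisa and like it")
jaccard = len(a & b) / len(a | b)   # near-duplicate check = set intersection
```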

(20)

Desiderata

Storage: only small sketches of each document.

Computation: the fastest possible.

Stream processing: once the sketch is computed, the source is unavailable.

Error guarantees: at this problem scale, small biases have a large impact, so we need formal guarantees – heuristics will not do.

(21)

More applications

(22)

Sets & Jaccard similarity

Set similarity ⇒ Jaccard similarity:

sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|

[Figure: two overlapping sets SA and SB, showing the intersection and the union]
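A toy example: SA = {1,2,3} and SB = {2,3,4} share 2 of their 4 distinct elements, so sim(SA, SB) = 2/4 = 0.5.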

(23)

Compute Jaccard-sim(SA, SB)

[Figure: Set A and Set B as columns of 64-bit shingles (universe ≈ 2^64); both are permuted with π(x) = (a·x + b) mod 2^64, and the permuted minima are compared: are these equal?]

Lemma: Prob[the permuted minimum of SA equals the permuted minimum of SB] is exactly Jaccard-sim(SA, SB)

Use 200 random permutations (taking the minimum under each), or pick the 200 smallest items from one random permutation; this creates one 200-dim vector per set, and we evaluate the Hamming distance between the two arrays of integers!

Sec. 19.6
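A min-hash sketch in Python; note that, unlike the slide's mod 2^64, a prime modulus is used here so that x ↦ (a·x + b) mod M is a bijection for any a ≠ 0 (an implementation choice, not the slide's):

```python
import random

M = (1 << 61) - 1   # prime modulus standing in for 2^64

def minhash_sketch(shingle_set, n_perm=200, seed=0):
    """One 200-dim vector per set: the minimum under 200 maps a*x+b mod M."""
    rnd = random.Random(seed)
    coeffs = [(rnd.randrange(1, M), rnd.randrange(M)) for _ in range(n_perm)]
    return [min((a * x + b) % M for x in shingle_set) for a, b in coeffs]

def estimated_jaccard(s1, s2):
    """Hamming-style comparison: each component matches with probability
    exactly Jaccard-sim (the Lemma), so the match rate estimates it."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```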

(24)

Cosine distance btw p and q

Construct a random hyperplane r, d-dimensional and of unit norm

Sketch of a vector p: hr(p) = sign(p · r) = ±1

Sketch of a vector q: hr(q) = sign(q · r) = ±1

Lemma: P[hr(p) = hr(q)] = 1 − θ/π

where θ is the angle between p and q: cos(θ) = p · q / (||p|| · ||q||)
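A sketch of the random-hyperplane construction in Python, which also inverts the Lemma to estimate the angle θ (an illustration with arbitrary dimensions, not the lecturer's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_planes = 100, 64
R = rng.standard_normal((n_planes, d))
R /= np.linalg.norm(R, axis=1, keepdims=True)   # random unit-norm hyperplanes

def sketch(p):
    """hr(p) = sign(p . r), one ±1 bit per hyperplane r."""
    return np.sign(R @ p)

p, q = rng.standard_normal(d), rng.standard_normal(d)
agree = np.mean(sketch(p) == sketch(q))
theta_est = np.pi * (1 - agree)   # invert the Lemma: P[agree] = 1 - theta/pi
```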

Other distances

(25)

The main theorem

Set k = (log n) / (log 1/p2) and L = n^ρ, with ρ = (ln p1 / ln p2) < 1.

Then the LSH-construction described before guarantees:

Extra space ≈ n · L = n^(1+ρ) fingerprints, each of size k

Query time ≈ L = n^ρ buckets accessed

It is correct with probability ≈ 0.3

Repeating the LSH-construction 1/ε times, the success probability becomes 1 − ε.

Whenever you have an LSH-function which maps close items to an equal value and far items to different values, then…

Many variants and improvements exist nowadays!
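As a quick sanity check with made-up numbers: for n = 10^6, p1 = 0.9 and p2 = 0.5, we get k = ln(10^6)/ln 2 ≈ 20 and ρ = ln 0.9 / ln 0.5 ≈ 0.15, hence L = n^ρ ≈ 8: about 8 buckets are accessed per query, and ≈ 8n fingerprints of 20 coordinates each are stored.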
