Packing to fewer dimensions
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Speeding up cosine computation
What if we could take our vectors and
“pack” them into fewer dimensions
(say 50,000 → 100) while preserving distances?
Now: O(nm) to compute cos(d,q) for all n docs.
Then: O(km + kn), where k << n, m.
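As a rough illustration (sizes are hypothetical, not from the slides): with n = 10^6 docs, m = 10^5 terms and k = 100, nm is about 10^11 operations, while km + kn is about 1.1 × 10^8, roughly a thousand-fold saving.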
Two methods:
“Latent semantic indexing” (LSI)
Random projection
Briefly
LSI is data-dependent
Create a k-dim subspace by eliminating redundant axes
Pull together “related” axes – hopefully:
car and automobile
Random projection is data-independent
Choose a k-dim subspace that guarantees good stretching properties with high
probability between any pair of points.
What about polysemy?
Latent Semantic Indexing
courtesy of Susan Dumais
Sec. 18.4
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
Example
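A minimal numpy sketch of these notions (the matrix and vector below are illustrative, not the example from the original slide):

```python
import numpy as np

A = np.array([[2., 0.],
              [1., 3.]])
v = np.array([1., 1.])

At = A.T                       # matrix transpose A^t
Av = A @ v                     # matrix-vector product
r  = np.linalg.matrix_rank(A)  # rank (here 2)

# eigenvalues lam and eigenvectors (columns of W): A w = lam w
lam, W = np.linalg.eig(A)
print(At, Av, r, lam, sep="\n")
```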
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value
Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A.
A has rank r ≤ min(m, n)
Define the term-term correlation matrix T = A A^t
T is a square, symmetric m × m matrix
Let U be the m × r matrix of the r eigenvectors of T
Define the doc-doc correlation matrix D = A^t A
D is a square, symmetric n × n matrix
Let V be the n × r matrix of the r eigenvectors of D
A's decomposition
Given U (for T, m × r) and V (for D, n × r), both formed by orthonormal columns (unit norm, zero pairwise dot-products),
it turns out that A = U S V^t
where S is the diagonal matrix of the singular values of A (the square roots of the eigenvalues of T = A A^t), in decreasing order.
In picture form: A (m × n) = U (m × r) · S (r × r) · V^t (r × n).
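A small numpy sketch of the decomposition, using a random toy term-doc matrix (sizes chosen only for illustration):

```python
import numpy as np

m, n = 6, 4                                 # terms x docs (toy sizes)
A = np.random.rand(m, n)

# economy-size SVD: A = U S V^t, with r = min(m, n) here
U, S, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, S.shape, Vt.shape)           # (6, 4) (4,) (4, 4)
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True: A = U S V^t
print(S)                                    # singular values in decreasing order
```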
Dimensionality reduction
Fix some k << r, and zero out all but the k biggest singular values in S [the choice of k is crucial].
Denote by S_k this new version of S, having rank k.
Typically k is about 100, while r (A's rank) is > 10,000.
A_k = U S_k V^t
Because of the zero rows/columns of S_k, all but the first k columns of U and the first k rows of V^t are useless: effectively A_k = U_k S_k V_k^t, with U_k of size m × k, S_k of size k × k, and V_k^t of size k × n.
A running example
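Continuing the sketch above, a toy version of the truncation step (the real r and k are far larger); it also checks numerically the guarantee stated on the next slide:

```python
import numpy as np

m, n, k = 6, 4, 2
A = np.random.rand(m, n)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# zero out all but the k biggest singular values
Sk = S.copy()
Sk[k:] = 0.0
Ak = U @ np.diag(Sk) @ Vt                            # same as U_k S_k V_k^t

print(np.linalg.matrix_rank(Ak))                     # k = 2
# best rank-k approximation: ||A - A_k||_2 = sigma_{k+1}
print(np.isclose(np.linalg.norm(A - Ak, 2), S[k]))   # True
```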
Guarantee
A_k is a pretty good approximation to A:
Relative distances are (approximately) preserved
Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
Reduction
Since we are interested in doc/doc correlation, we consider:
D = A^t A = (U S V^t)^t (U S V^t) = (S V^t)^t (S V^t), since U^t U = I.
Hence X = S V^t, an r × n matrix, may play the role of A.
To reduce its size we set X_k = S_k V_k^t, a k × n matrix, and thus get A^t A ≈ X_k^t X_k (both are n × n matrices).
We use X_k to define how to project A:
Since X_k = S_k V_k^t, it holds that X_k = U_k^t A (by the SVD of A).
Since X_k may play the role of A, its columns are the projected docs.
Similarly, a query q can be interpreted as a new column of A, so it suffices to multiply U_k^t by q to get the projected query, in O(km) time.
U and V are formed by orthonormal eigenvectors of the matrices T and D, respectively.
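A numpy sketch of this reduction (toy sizes again; Uk, Sk, Xk below just mirror the slide notation):

```python
import numpy as np

m, n, k = 6, 4, 2
A = np.random.rand(m, n)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(S[:k]), Vt[:k, :]

Xk = Sk @ Vkt                                  # k x n matrix of projected docs
print(np.allclose(Xk, Uk.T @ A))               # True: X_k = U_k^t A

D = A.T @ A                                    # doc/doc correlation (n x n)
err = np.linalg.norm(D - Xk.T @ Xk, 2)
print(np.isclose(err, S[k] ** 2))              # discarded part is sigma_{k+1}^2
```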
Which are the concepts?
The c-th concept = the c-th column of U_k (which is m × k).
U_k[i][c] = strength of association between the c-th concept and the i-th term.
V_k^t[c][j] = strength of association between the c-th concept and the j-th document.
Projected document: d'_j = U_k^t d_j
d'_j[c] = strength of concept c in d_j
Projected query: q' = U_k^t q
q'[c] = strength of concept c in q
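A hedged end-to-end sketch of the projection step (the tiny term-doc matrix and query below are made up for illustration):

```python
import numpy as np

# toy term-doc matrix A (m=4 terms x n=3 docs) and a query q over the same terms
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
q = np.array([1., 1., 0., 0.])

k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                       # m x k matrix of term/concept strengths

docs_k = Uk.T @ A                   # projected docs d'_j = U_k^t d_j   (k x n)
q_k = Uk.T @ q                      # projected query q' = U_k^t q      (k,)

# rank documents by cosine similarity in the k-dimensional concept space
cos = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.argsort(-cos))             # doc indices, best match first
```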
Random Projections
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, '82)
Let P be a set of n distinct points in m dimensions.
Given ε > 0, there exists a function f : P → R^k, with k = O(ε^{-2} log n), such that for every pair of points u, v in P it holds:
(1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2
f() is called a JL-embedding.
Setting v = 0 we also get a bound on f(u)'s stretching!
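For a feel of the sizes involved (illustrative numbers): with n = 10^6 points and ε = 0.1, k = O(ε^{-2} log n) is on the order of 100 · ln 10^6 ≈ 1,400 dimensions, independently of the original dimension m.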
What about the cosine-distance?
Bound f(u)'s and f(v)'s stretching (the v = 0 case above), then substitute the formula above for ||u − v||^2 to bound the cosine-distance in the projected space.
How to compute a JL-embedding?
If we set the projection matrix P = (p_{i,j}) as a random m × k matrix whose components are independent random variables with E[p_{i,j}] = 0 and Var[p_{i,j}] = 1 (for instance standard Gaussian entries, or the values +1 and −1 each with probability 1/2), then the induced map u ↦ (1/√k) P^t u is a JL-embedding with high probability.
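A minimal numpy sketch of such a projection, assuming the ±1 choice for the entries and a 1/√k scaling (all sizes are illustrative):

```python
import numpy as np

m, k, n = 10_000, 400, 100           # original dim, reduced dim, number of points
X = np.random.rand(n, m)             # n points in m dimensions

# random projection matrix: independent entries with E = 0, Var = 1 (+1/-1 each w.p. 1/2)
P = np.random.choice([-1.0, 1.0], size=(m, k))
f = (X @ P) / np.sqrt(k)             # each point mapped to k dimensions

# pairwise squared distances are roughly preserved
u, v, fu, fv = X[0], X[1], f[0], f[1]
ratio = np.sum((fu - fv) ** 2) / np.sum((u - v) ** 2)
print(ratio)                         # close to 1, within (1 - eps, 1 + eps) w.h.p.
```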
Finally...
Random projections hide large constants:
k ≈ (1/ε)^2 · log n, so k may be large…
but they are simple and fast to compute.
LSI is intuitive and may scale to any k:
it is optimal under various metrics,
but costly to compute, although good libraries do exist.