Packing to fewer dimensions
(for space compression and query speedup)
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now, O(nm) to compute cos(d,q) for all n docs
Then, O(km+kn) where k << n,m
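For concreteness (illustrative numbers, not from the slides): with n = 10^6 docs, m = 50,000 terms and k = 100, nm = 5·10^10 operations, while km + kn = 5·10^6 + 10^8 ≈ 10^8, roughly a 500× saving.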
Two methods:
Latent semantic indexing
Random projection
Briefly
LSI is data-dependent
Create a k-dim subspace by eliminating redundant axes
Pull together “related” axes – hopefully
car and automobile
Random projection is data-independent
Choose a k-dim subspace that, with high probability, guarantees small stretching of the distance between any pair of points (a sketch follows below)
What about polysemy?
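A minimal sketch of the data-independent projection, assuming dense numpy vectors (sizes and names are illustrative, not from the slides):

  import numpy as np

  rng = np.random.default_rng(0)
  m, k = 50_000, 100                                  # original vs projected dims
  R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(m, k))  # random Gaussian subspace

  def project(x):
      # maps an m-dim vector to k dims; pairwise distances are
      # approximately preserved with high probability
      return x @ R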
Latent Semantic Indexing
courtesy of Susan Dumais
Sec. 18.4
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: A v = λ v
Example
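A tiny numpy illustration of the eigenvalue equation (values chosen for illustration, not from the deck):

  import numpy as np

  A = np.array([[2.0, 1.0],
                [1.0, 2.0]])
  vals, vecs = np.linalg.eig(A)            # eigenvalues and eigenvectors (as columns)
  v = vecs[:, 0]
  assert np.allclose(A @ v, vals[0] * v)   # A v = λ v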
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value
Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
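A rough end-to-end sketch of this pipeline in numpy (toy random matrix; representing each doc as a column of Σ_k V_k^t is one common convention and an assumption on my part):

  import numpy as np

  rng = np.random.default_rng(0)
  A = rng.random((1000, 200))                # toy m x n term-doc matrix
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 50                                     # number of retained "concepts"
  docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T    # each doc as a k-dim vector
  # queries get mapped into the same k-dim space and compared by cosine there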
Singular-Value Decomposition
Recall the m × n matrix A of terms × docs.
A has rank r ≤ m, n
Define the term-term correlation matrix T = A A^t
T is a square, symmetric m × m matrix
Let U be the m × r matrix of the r eigenvectors of T
Define the doc-doc correlation matrix D = A^t A
D is a square, symmetric n × n matrix
Let V be the n × r matrix of the r eigenvectors of D
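A tiny numpy check of these two correlation matrices (toy sizes, illustrative only):

  import numpy as np

  rng = np.random.default_rng(1)
  A = rng.normal(size=(5, 3))   # toy m x n matrix, rank r = 3
  T = A @ A.T                   # m x m term-term correlation
  D = A.T @ A                   # n x n doc-doc correlation
  # T and D share the same r nonzero eigenvalues
  eT = np.sort(np.linalg.eigvalsh(T))[::-1][:3]
  eD = np.sort(np.linalg.eigvalsh(D))[::-1][:3]
  assert np.allclose(eT, eD)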
A's decomposition
Given U (for T, m × r) and V (for D, n × r), both formed by orthonormal columns (U^t U = V^t V = I)
It turns out that A = U Σ V^t, where Σ is an r × r diagonal matrix holding the singular values (= square roots of the eigenvalues of T = A A^t) in decreasing order.

A    =    U    Σ    V^t
(m × n)  (m × r)  (r × r)  (r × n)
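A quick numerical check of the decomposition (numpy, illustrative):

  import numpy as np

  rng = np.random.default_rng(1)
  A = rng.normal(size=(5, 3))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is decreasing
  assert np.allclose(A, U @ np.diag(s) @ Vt)        # A = U Σ V^t
  # singular values = square roots of the eigenvalues of T = A A^t
  eT = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:3]
  assert np.allclose(s**2, eT)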
The case of r = 2
The following figures are from Jure Leskovec's slides (Stanford).
[Figure: a 7 × 5 example matrix decomposed as A = U Σ V^t, with rank r = 2]
Dimensionality reduction
Fix some k << r and zero out all but the k biggest singular values in Σ [the choice of k is crucial]
Denote by Σ_k this new version of Σ, having rank k
Typically k is about 100, while r (A's rank) is > 10,000

A_k    =    U_k    Σ_k    V_k^t
(m × n)  (m × k)  (k × k)  (k × n)
(the last r − k columns of U and rows of V^t become useless, since they multiply the 0-columns/0-rows of Σ_k)
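A short numpy sketch of this truncation (toy sizes; a sketch, not the lecture's code):

  import numpy as np

  rng = np.random.default_rng(2)
  A = rng.normal(size=(100, 80))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 10                                        # keep the k largest singular values
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k version of A
  assert np.linalg.matrix_rank(A_k) == k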
Guarantee
A_k is a pretty good approximation to A:
Relative distances are (approximately) preserved
Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
min_{B : rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
min_{B : rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
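A quick numerical check of both guarantees (numpy, illustrative):

  import numpy as np

  rng = np.random.default_rng(3)
  A = rng.normal(size=(60, 40))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 5
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
  assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])   # spectral norm = σ_{k+1}
  assert np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2))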
Interpret as another row in the matrix
d is a new user that expresses preferences on-the-fly at query time
The product U Σ gives the projection of the existing users onto the 2 new dims, corresponding to the vectors v1, v2 of the V matrix; the new row d is projected likewise (see the Recap below)
Recap
c-th concept = c-th column of U_k (which is m × k)
U_k[i][c] = strength of association between the c-th concept and the i-th row (term/user/…)
V_k^t[c][j] = strength of association between the c-th concept and the j-th column (document/movie/…)
Projected query: q' = q V_k
q'[c] = strength of concept c in q
Projected column (doc/movie): d'_j = d_j V_k
d'_j[c] = strength of concept c in d_j
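Following the recap's conventions, a minimal sketch of projecting a new row q onto the k concepts (variable names are my own):

  import numpy as np

  rng = np.random.default_rng(4)
  A = rng.normal(size=(100, 80))        # rows: terms/users, cols: docs/movies
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 10
  V_k = Vt[:k, :].T                     # n x k: top-k right singular vectors
  q = rng.normal(size=(1, A.shape[1]))  # a new row of on-the-fly preferences
  q_proj = q @ V_k                      # q'[c] = strength of concept c in q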