Packing to fewer dimensions
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Speeding up cosine computation
What if we could take our vectors and
“pack” them into fewer dimensions
(say 50,000 → 100) while preserving distances?
Now: O(nm) to compute cos(d,q) for all n docs.
Then: O(km + kn), where k << n, m.
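As a rough illustration (sizes are hypothetical, not from the slides): with n = 10^6 docs, m = 10^5 terms and k = 100, nm is about 10^11 operations, while km + kn is about 1.1 × 10^8, roughly a thousand-fold saving.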
Two methods:
“Latent semantic indexing” (LSI)
Random projection
Briefly
LSI is data-dependent
Create a k-dim subspace by eliminating redundant axes
Pull together “related” axes – hopefully:
car and automobile
Random projection is data-independent
Choose a k-dim subspace that guarantees good stretching properties with high
probability between any pair of points.
What about polysemy?
Latent Semantic Indexing
courtesy of Susan Dumais
Sec. 18.4
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
Example
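A minimal numpy sketch of these notions (the matrix and vector below are illustrative, not the example from the original slide):

```python
import numpy as np

A = np.array([[2., 0.],
              [1., 3.]])
v = np.array([1., 1.])

At = A.T                       # matrix transpose A^t
Av = A @ v                     # matrix-vector product
r  = np.linalg.matrix_rank(A)  # rank (here 2)

# eigenvalues lam and eigenvectors (columns of W): A w = lam w
lam, W = np.linalg.eig(A)
print(At, Av, r, lam, sep="\n")
```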
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value
Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A.
A has rank r ≤ min(m, n)
Define the term-term correlation matrix T = A A^t
T is a square, symmetric m × m matrix
Let U be the m × r matrix of the r eigenvectors of T
Define the doc-doc correlation matrix D = A^t A
D is a square, symmetric n × n matrix
Let V be the n × r matrix of the r eigenvectors of D
A's decomposition
Given U (for T, m × r) and V (for D, n × r), both formed by orthonormal columns (unit norm, zero pairwise dot-products),
it turns out that A = U S V^t
where S is the diagonal matrix of the singular values of A (the square roots of the eigenvalues of T = A A^t), in decreasing order.
In picture form: A (m × n) = U (m × r) · S (r × r) · V^t (r × n).
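A small numpy sketch of the decomposition, using a random toy term-doc matrix (sizes chosen only for illustration):

```python
import numpy as np

m, n = 6, 4                                 # terms x docs (toy sizes)
A = np.random.rand(m, n)

# economy-size SVD: A = U S V^t, with r = min(m, n) here
U, S, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, S.shape, Vt.shape)           # (6, 4) (4,) (4, 4)
print(np.allclose(A, U @ np.diag(S) @ Vt))  # True: A = U S V^t
print(S)                                    # singular values in decreasing order
```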
Dimensionality reduction
Fix some k << r, and zero out all but the k biggest singular values in S [the choice of k is crucial].
Denote by S_k this new version of S, having rank k.
Typically k is about 100, while r (A's rank) is > 10,000.
A_k = U S_k V^t
Because of the zero rows/columns of S_k, all but the first k columns of U and the first k rows of V^t are useless: effectively A_k = U_k S_k V_k^t, with U_k of size m × k, S_k of size k × k, and V_k^t of size k × n.
A running example
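Continuing the sketch above, a toy version of the truncation step (the real r and k are far larger); it also checks numerically the guarantee stated on the next slide:

```python
import numpy as np

m, n, k = 6, 4, 2
A = np.random.rand(m, n)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# zero out all but the k biggest singular values
Sk = S.copy()
Sk[k:] = 0.0
Ak = U @ np.diag(Sk) @ Vt                            # same as U_k S_k V_k^t

print(np.linalg.matrix_rank(Ak))                     # k = 2
# best rank-k approximation: ||A - A_k||_2 = sigma_{k+1}
print(np.isclose(np.linalg.norm(A - Ak, 2), S[k]))   # True
```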
Guarantee
A_k is a pretty good approximation to A:
Relative distances are (approximately) preserved
Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
Reduction
Since we are interested in doc/doc correlation, we consider:
D = A^t A = (U S V^t)^t (U S V^t) = (S V^t)^t (S V^t), since U^t U = I.
Hence X = S V^t, an r × n matrix, may play the role of A.
To reduce its size we set X_k = S_k V_k^t, a k × n matrix, and thus get A^t A ≈ X_k^t X_k (both are n × n matrices).
We use X_k to define how to project A:
Since X_k = S_k V_k^t, it holds that X_k = U_k^t A (by the SVD of A).
Since X_k may play the role of A, its columns are the projected docs.
Similarly, a query q can be interpreted as a new column of A, so it suffices to multiply U_k^t by q to get the projected query, in O(km) time.
U and V are formed by orthonormal eigenvectors of the matrices T and D, respectively.
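A numpy sketch of this reduction (toy sizes again; Uk, Sk, Xk below just mirror the slide notation):

```python
import numpy as np

m, n, k = 6, 4, 2
A = np.random.rand(m, n)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(S[:k]), Vt[:k, :]

Xk = Sk @ Vkt                                  # k x n matrix of projected docs
print(np.allclose(Xk, Uk.T @ A))               # True: X_k = U_k^t A

D = A.T @ A                                    # doc/doc correlation (n x n)
err = np.linalg.norm(D - Xk.T @ Xk, 2)
print(np.isclose(err, S[k] ** 2))              # discarded part is sigma_{k+1}^2
```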
Which are the concepts?
The c-th concept = the c-th column of U_k (which is m × k).
U_k[i][c] = strength of association between the c-th concept and the i-th term.
V_k^t[c][j] = strength of association between the c-th concept and the j-th document.
Projected document: d'_j = U_k^t d_j
d'_j[c] = strength of concept c in d_j
Projected query: q' = U_k^t q
q'[c] = strength of concept c in q
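A hedged end-to-end sketch of the projection step (the tiny term-doc matrix and query below are made up for illustration):

```python
import numpy as np

# toy term-doc matrix A (m=4 terms x n=3 docs) and a query q over the same terms
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
q = np.array([1., 1., 0., 0.])

k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                       # m x k matrix of term/concept strengths

docs_k = Uk.T @ A                   # projected docs d'_j = U_k^t d_j   (k x n)
q_k = Uk.T @ q                      # projected query q' = U_k^t q      (k,)

# rank documents by cosine similarity in the k-dimensional concept space
cos = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.argsort(-cos))             # doc indices, best match first
```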
Random Projections
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only!
An interesting math result
Lemma (Johnson-Lindenstrauss, '82)
Let P be a set of n distinct points in m dimensions.
Given ε > 0, there exists a function f : P → R^k, with k = O(ε^{-2} log n), such that for every pair of points u, v in P it holds:
(1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2
f() is called a JL-embedding.
Setting v = 0 we also get a bound on f(u)'s stretching!
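For a feel of the sizes involved (illustrative numbers): with n = 10^6 points and ε = 0.1, k = O(ε^{-2} log n) is on the order of 100 · ln 10^6 ≈ 1,400 dimensions, independently of the original dimension m.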
What about the cosine-distance?
Bound f(u)'s and f(v)'s stretching (the v = 0 case above), then substitute the formula above for ||u − v||^2 to bound the cosine-distance in the projected space.
How to compute a JL-embedding?
If we set the projection matrix P = (p_{i,j}) as a random m × k matrix whose components are independent random variables with E[p_{i,j}] = 0 and Var[p_{i,j}] = 1 (for instance standard Gaussian entries, or the values +1 and −1 each with probability 1/2), then the induced map u ↦ (1/√k) P^t u is a JL-embedding with high probability.
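A minimal numpy sketch of such a projection, assuming the ±1 choice for the entries and a 1/√k scaling (all sizes are illustrative):

```python
import numpy as np

m, k, n = 10_000, 400, 100           # original dim, reduced dim, number of points
X = np.random.rand(n, m)             # n points in m dimensions

# random projection matrix: independent entries with E = 0, Var = 1 (+1/-1 each w.p. 1/2)
P = np.random.choice([-1.0, 1.0], size=(m, k))
f = (X @ P) / np.sqrt(k)             # each point mapped to k dimensions

# pairwise squared distances are roughly preserved
u, v, fu, fv = X[0], X[1], f[0], f[1]
ratio = np.sum((fu - fv) ** 2) / np.sum((u - v) ** 2)
print(ratio)                         # close to 1, within (1 - eps, 1 + eps) w.h.p.
```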
Finally...
Random projections hide large constants:
k ≈ (1/ε)^2 · log n, so k may be large…
but they are simple and fast to compute.
LSI is intuitive and may scale to any k:
it is optimal under various metrics,
but costly to compute, although good libraries do exist.