
(1)

Packing to fewer dimensions

(for space compression and query speedup)

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(2)

Speeding up cosine computation

What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?

Now: O(nm) to compute cos(d,q) for all n docs

Then: O(km + kn), where k << n, m (see the sketch below)

Two methods:

Latent semantic indexing

Random projection
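A rough sketch of the claimed speed-up, assuming the n documents have already been packed offline into k dimensions and that P is some generic m × k projection matrix (produced by LSI or by random projection); all names and sizes below are illustrative, not from the slides.

```python
# Sketch: cosine scoring entirely in the packed k-dimensional space.
import numpy as np

m, n, k = 50_000, 10_000, 100            # terms, docs, packed dimensions
rng = np.random.default_rng(0)

P = rng.standard_normal((m, k))          # placeholder projection matrix (assumption)
docs_k = rng.standard_normal((n, k))     # packed docs, computed once offline

def cosine_scores(q):
    """Cosine of the query against all n docs, in the packed space."""
    q_k = q @ P                                          # O(km): project the m-dim query
    q_k = q_k / (np.linalg.norm(q_k) + 1e-12)
    D = docs_k / (np.linalg.norm(docs_k, axis=1, keepdims=True) + 1e-12)
    return D @ q_k                                       # O(kn): one k-dim dot product per doc

print(cosine_scores(rng.standard_normal(m)).shape)       # (10000,)
```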

(3)

Briefly

LSI is data-dependent

Create a k-dim subspace by eliminating redundant axes

Pull together "related" axes – hopefully

car and automobile

Random projection is data-independent

Choose a k-dim subspace that guarantees good stretching properties with high probability between any pair of points (a rough sketch follows below).

What about polysemy?
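A small illustration of the data-independent route: projecting points with a Gaussian random matrix approximately preserves pairwise distances with high probability. The 1/sqrt(k) scaling and the sizes below are one common choice, assumed here for illustration rather than taken from the slides.

```python
# Sketch: a Gaussian random projection roughly preserves pairwise distances.
import numpy as np

rng = np.random.default_rng(1)
m, k = 10_000, 200
X = rng.standard_normal((50, m))               # 50 points in m dimensions

R = rng.standard_normal((m, k)) / np.sqrt(k)   # random projection matrix, m x k
Y = X @ R                                      # packed points, 50 x k

for i, j in [(0, 1), (2, 3), (10, 40)]:        # spot-check a few pairs
    print(round(np.linalg.norm(X[i] - X[j]), 1),
          round(np.linalg.norm(Y[i] - Y[j]), 1))
```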

(4)

Latent Semantic Indexing

courtesy of Susan Dumais

Sec. 18.4

(5)

Notions from linear algebra

Matrix A, vector v

Matrix transpose (At)

Matrix product

Rank

Eigenvalues and eigenvector v: Av = v

Example
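A tiny numpy check of the eigenpair relation Av = λv; the 2 × 2 symmetric matrix here is an arbitrary example, not one taken from the slides.

```python
# Verify A v = lambda v for each eigenpair of a small symmetric matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(A)                # eigenvalues 1 and 3
for lam, v in zip(vals, vecs.T):
    print(lam, np.allclose(A @ v, lam * v))   # True for each pair
```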

(6)

Overview of LSI

Pre-process docs using a technique from linear algebra called Singular Value Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space

(7)

Singular-Value Decomposition

Recall the m × n matrix of terms × docs, A.

A has rank r ≤ m, n

Define the term-term correlation matrix T = AA^t

T is a square, symmetric m × m matrix

Let U be the m × r matrix of the r eigenvectors of T

Define the doc-doc correlation matrix D = A^tA

D is a square, symmetric n × n matrix

Let V be the n × r matrix of the r eigenvectors of D
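A numerical sketch of this construction on a small random matrix: U comes from the eigenvectors of T = AA^t, V from those of D = A^tA, and the nonzero eigenvalues of T and D coincide (they are the squared singular values).

```python
# Build T and D from a small random A and compare their nonzero eigenvalues.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))          # m = 5, n = 3, rank r = 3

T = A @ A.T                              # term-term matrix, m x m
D = A.T @ A                              # doc-doc matrix,  n x n

t_vals, U = np.linalg.eigh(T)            # columns of U: eigenvectors of T
d_vals, V = np.linalg.eigh(D)            # columns of V: eigenvectors of D

print(np.allclose(np.sort(t_vals)[-3:], np.sort(d_vals)))   # True
```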

(8)

A's decomposition

Given U (for T, m × r) and V (for D, n × r) formed by orthonormal columns (unit dot-product)

It turns out that

A = U Σ V^t

where Σ is an r × r diagonal matrix with the singular values (= the square roots of the eigenvalues of T = AA^t) in decreasing order.

A (m × n)  =  U (m × r) · Σ (r × r) · V^t (r × n)

(9)

The case of r = 2

From Jure Leskovec’s slides (Stanford), this and next ones


(10)

The case of r = 2

(11)

k = projected space size (k ≤ r)

(12)

r = 3

(17)

Dimensionality reduction

Fix some k << r, zero out all but the k biggest singular values in Σ [choice of k is crucial]

Denote by Σ_k this new version of Σ, having rank k

Typically k is about 100, while r (A's rank) is > 10,000

A_k (m × n)  =  U (m × r) · Σ_k (r × r) · V^t (r × n)

The zeroed columns/rows of Σ_k make the last r − k columns of U and the last r − k rows of V^t useless, so equivalently (columns of A = documents):

A_k (m × n)  =  U_k (m × k) · Σ_k (k × k) · V_k^t (k × n)
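A minimal sketch of the truncation step on a random matrix with an illustrative k: zeroing all but the k largest singular values is the same as keeping the first k columns of U, the k × k top-left block of Σ, and the first k rows of V^t.

```python
# Build the rank-k approximation A_k = U_k Sigma_k V_k^t.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 40))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # U_k Sigma_k V_k^t

print(A_k.shape, np.linalg.matrix_rank(A_k))       # (100, 40) 5
```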

(18)

Guarantee

A_k is a pretty good approximation to A:

Relative distances are (approximately) preserved

Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:

min_{B: rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}

min_{B: rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2

Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
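These identities can be spot-checked numerically; the sketch below uses a random matrix and numpy, so it is an illustration of the guarantee rather than a proof.

```python
# Check: spectral error = sigma_{k+1}; squared Frobenius error = sum of
# the squared discarded singular values.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((60, 30))
k = 7

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))                    # sigma_{k+1}
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2)))
```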

(31)

Interpret the new user as another row in the matrix

(32)

This product U * Σ gives the projection of the users onto the 2 new dims, corresponding to the vectors v1, v2 of the V matrix below
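A small sketch of this projection on a made-up users × movies matrix: with A = U Σ V^t, the rows projected onto the first 2 concepts are the first two columns of U Σ, which coincide with A V_2.

```python
# Row projection onto the first 2 concepts: U_2 Sigma_2 == A V_2.
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 5))                    # 8 users x 5 movies (illustrative)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
row_concepts = U[:, :2] * s[:2]                    # U_2 Sigma_2, shape 8 x 2

print(np.allclose(row_concepts, A @ Vt[:2, :].T))  # True
```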

(33)
(34)

d: this is a new user that expresses preferences on-the-fly at query time

(37)

Recap

c-th concept = c-th column of U_k (which is m × k)

U_k[i][c] = strength of association between the c-th concept and the i-th row (term/user/…)

V^t_k[c][j] = strength of association between the c-th concept and the j-th column (document/movie/…)

Projected query: q' = q V_k

q'[c] = strength of concept c in q

Projected column (doc/movie): d'_j = d_j V_k

d'_j[c] = strength of concept c in d_j
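A compact sketch of the recap formula for the query, on a random matrix and assuming (as in the new-user example above) that q is expressed over the columns of A; the sizes and names are illustrative.

```python
# Fold a new query into the k-dimensional concept space: q' = q V_k.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 9))                 # rows (terms/users) x columns (docs/movies)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, V_k = U[:, :k], Vt[:k, :].T                # m x k and n x k

q = rng.standard_normal(9)                      # new query/user expressed over the columns
q_proj = q @ V_k                                # q'[c] = strength of concept c in q
print(q_proj.shape)                             # (3,)
```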
