• Non ci sono risultati.

Link-based Ranking

N/A
N/A
Protected

Academic year: 2021

Condividi "Link-based Ranking"

Copied!
30
0
0

Testo completo

(1)

Ranking

Link-based Ranking

(2° generation)

(2)

The Web as a Directed Graph

Assumption 1: A hyperlink between pages denotes

author perceived relevance (quality signal) Assumption 2: The text in the anchor of the hyperlink

describes the target page (textual context)

Page A Anchor hyperlink Page B

Sec. 21.1

(3)

Indexing anchor text

When indexing a document D, include anchor text from links pointing to D.

www.ibm.com

Armonk, NY-based computer giant IBM announced today

Joe’s computer hardware links

Sun

Big Blue today announced record profits for the quarter

Sec. 21.1.1

Can score anchor text with weight

(4)

Random Walks

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(5)

Definitions

Adjacency matrix A Transition matrix P

1

2

3

1

2

3

1

1/2

1 1/2

(6)

6

What is a random walk

1

1/2

1 1/2

t=0

(7)

What is a random walk

1

1/2

1 1/2

1

1/2

1 1/2

t=0 t=1

(8)

What is a random walk

1

1/2

1 1/2 1

1/2

1 1/2

t=0 t=1

1

1/2

1 1/2

t=2

(9)

What is a random walk

1

1/2

1 1/2

1

1/2

1 1/2

t=0 t=1

1

1/2

1 1/2

t=2

1

1/2

1 1/2

t=3

(10)

Probability Distributions

xt(i) = probability that surfer is at node i at time t

xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i) = xt * P

Transition matrix P

1

2

3

1

1/2

1 1/2

t=2

0 0 1 x

t

1/2 1/2 0

x

=

(11)

Probability Distributions

xt(i) = probability that surfer is at node i at time t

xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i)

= x

t

P

xt+1 = xt P = xt-1* P * P = xt-2* P * P * P = …= x0 Pt+1

What happens when the surfer keeps walking for a long time? So called Stationary distribution

(12)

12

Stationary Distribution

The stationary distribution at a node is related to the amount/proportion of time a random walker spends visiting that node.

It is when the distribution does not change anymore: i.e. xT+1 = xT  xT P = 1* xT

(left eigenvector of eigenvalue 1)

For “well-behaved” graphs this does not depend on the start distribution:

x

t+1

= x

0

P

t+1

(13)

Interesting questions

Does a stationary distribution always exist? Is it unique?

Yes, if the graph is “well-behaved”, namely the markov chain is irreducible and

aperiodic.

How fast will the random surfer

approach this stationary distribution?

Mixing Time!

(14)

14

Well behaved graphs

Irreducible: There is a path from every node to every other node ( it is an SCC).

Irreducible Not irreducible

(15)

Well behaved graphs

Aperiodic: The GCD of all cycle lengths is 1.

The GCD is also called period.

Aperiodic Periodicity is 3

(16)

16

About undirected graphs

A connected undirected graph is irreducible

A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution!

Makes sense, since larger the degree of the

node more likely a random walk is to come back to it.

(17)

PageRanks and HITS

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(18)

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Undirected popularity:

Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5).

Directed popularity:

Score of a page = number of its in-links (es. 3).

Easy to SPAM

(19)

Second generation:

PageRank

Deploy the graph structure, and each link has its own importance !

PageRank is

independent of the query

many interpretations:

Linear algebra – eigenvectors, eigenvalues

Markov chains – steady state probability distribution

(20)

Basic Intuition…

Random jump to any node

1-d Random jump

d

(21)

Google’s Pagerank

1 i j

N j

out j i r

r

i B j

) 1 1

) ( (

#

) ) (

(

) (

 

B(i) : set of pages linking to i.B(i) : set of pages linking to i.

#out(j)

#out(j) : number of outgoing links from j. : number of outgoing links from j.

e : vector of components 1/sqrt{N}.e : vector of components 1/sqrt{N}.

eer

r    T  ( 1   ) T

Random jump

Principal eigenvector

(22)

Pagerank: use in Search Engines

Preprocessing:

Given graph of links, build matrix P

Compute its principal eigenvector r

r[i] is the pagerank of page i

We are interested in the relative order

At query time:

Retrieve pages containing query terms

Rank them by their Pagerank

The final order is query- independent

(23)

How to compute it

Fast (approximate) computation:

Given the Web graph, build the matrix

P

Compute r = e * Pt for t = 0, 1, …

r[i] is the pagerank of page i

(24)

(Personalized) Pagerank

Bias the random jump:

Substitute e = [1 1 ….1] with a preference

vector which jumps to preferred pages (topics)

If e = ei , then r[j] = relatedness between node j and node i

(25)

CoSim Rank to «compare»

nodes

Personalized PageRank, p(0)(i) varies with node i The adjacency matrix A has normalized rows

>>> Set d=1, so no teleport step

(26)

HITS: Hypertext Induced Topic

Search

(27)

Calculating HITS

It is query- dependent

Produces two scores per page:

Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.

Hub score: A good hub page for a topic points to many authoritative pages for that

(28)

Authority and Hub scores

2

3

4

1 1

5

6

7

a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)

(29)

HITS: Link Analysis Computation

Where

a: Vector of Authority’ s scores h: Vector of Hub’ s scores.

A: Adjacency matrix in which ai,j = 1 if ij

h AA

h

Aa A

a Aa

h

h A

a

T T T

 

 

Thus, h is an eigenvector of AA

t

Symmetric matrices

(30)

Weighting links

Weight more if the query occurs in the

neighborhood of the link (e.g. anchor text).

y x

y a

x h

) (

) (

x y

y h x

a

) (

)

( ( ) ( , ) ( )

) ( )

, ( )

(

y h y

x w x

a

y a y

x w x

h

x y

y x

Riferimenti

Documenti correlati

The prototype park is divided into different thematic areas which allow the users to experiment inusual kinds of landscape, from “ruins” of old cistern (pre-existing in the area, to

Based on the previous discussion of the results, we can conclude that literature and practitioners seem to be generally aligned on the digital twin paradigm; nevertheless, even if

Because the Canonical table provides information about the gene/protein position in each chromosome as well as AA composition, we can investigate the content of single amino acids

Al contrario i livelli di espressione di miR-Let-7d sono risultati estremamente costanti e omogenei nelle PBMC di tutta la popolazione analizzata (Figura 9, panel A),

Preliminary evidences about the sample composition, values of mean and Gini index of annual labour income, mean value of the WFH attitude index and share of employees with

Solution proposed by Roberto Tauraso, Dipartimento di Matematica, Universit`a di Roma “Tor Vergata”, via della Ricerca Scientifica, 00133 Roma,

Without loss of generality we may assume that A, B, C are complex numbers along the unit circle. |z|

Per maggiore chiarezza, riassumo i termini che ricavo dal ragiona- mento di Imperatori: 1) la Sindone esiste, e può essere studiata scien- tificamente; 2) secondo i sindonologi