Link-based Ranking

(1)

Ranking

Link-based Ranking

(2° generation)

(2)

The Web as a Directed Graph

Assumption 1: A hyperlink between pages denotes

author perceived relevance (quality signal) Assumption 2: The text in the anchor of the hyperlink

describes the target page (textual context)

Page A _Anchor ^hyperlink Page B

Sec. 21.1

(3)

Indexing anchor text



When indexing a document D, include anchor text from links pointing to D.

www.ibm.com

Armonk, NY-based computer giant IBM announced today

Joe’s computer hardware links

Sun

Big Blue today announced record profits for the quarter

Sec. 21.1.1

Can score anchor text with weight

(4)

Random Walks

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(5)

Definitions

Adjacency matrix A Transition matrix P

1

2

3

1

2

3

1

1/2

1 1/2

(6)

6

What is a random walk

1

1/2

1 1/2

t=0

(7)

What is a random walk

1

1/2

1 1/2

1

1/2

1 1/2

t=0 t=1

(8)

What is a random walk

1

1/2

1 1/2 1

1/2

1 1/2

t=0 t=1

1

1/2

1 1/2

t=2

(9)

What is a random walk

1

1/2

1 1/2

1

1/2

1 1/2

t=0 t=1

1

1/2

1 1/2

t=2

1

1/2

1 1/2

t=3

(10)

Probability Distributions

 x_t(i) = probability that surfer is at node i at time t

 xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i) = xt * P

Transition matrix P

1

2

3

1

1/2

1 1/2

t=2

0 0 1 x

_t

1/2 1/2 0

x

=

(11)

Probability Distributions

 xt(i) = probability that surfer is at node i at time t

 xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i)

= x

_t

P

 xt+1 = xt P = xt-1* P * P = xt-2* P * P * P = …= x0 P^t+1

 What happens when the surfer keeps walking for a long time? So called Stationary distribution

(12)

12

Stationary Distribution

 The stationary distribution at a node is related to the amount/proportion of time a random walker spends visiting that node.

 It is when the distribution does not change anymore: i.e. xT+1 = xT  xT P = 1* xT

(left eigenvector of eigenvalue 1)

 For “well-behaved” graphs this does not depend on the start distribution:

x

t+1

= x

0

P

^t+1

(13)

Interesting questions



Does a stationary distribution always exist? Is it unique?

 Yes, if the graph is “well-behaved”, namely the markov chain is irreducible and

aperiodic.



How fast will the random surfer

approach this stationary distribution?

 Mixing Time!

(14)

14

Well behaved graphs

 Irreducible: There is a path from every node to every other node ( it is an SCC).

Irreducible Not irreducible

(15)

Well behaved graphs

 Aperiodic: The GCD of all cycle lengths is 1.

The GCD is also called period.

Aperiodic Periodicity is 3

(16)

16

About undirected graphs

 A connected undirected graph is irreducible

 A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution!

 Makes sense, since larger the degree of the

node more likely a random walk is to come back to it.

(17)

PageRanks and HITS

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(18)

Query-independent ordering



First generation: using link counts as simple measures of popularity.

 Undirected popularity:

 Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5).

 Directed popularity:

 Score of a page = number of its in-links (es. 3).

Easy to SPAM

(19)

Second generation:

PageRank



Deploy the graph structure, and each link has its own importance !



PageRank is



independent of the query



many interpretations:

 Linear algebra – eigenvectors, eigenvalues

 Markov chains – steady state probability distribution

(20)

Basic Intuition…

Random jump to any node

1-d Random jump

d

(21)

Google’s Pagerank

 1 i  j

N j

out j i r

r

i B j

) 1 1

) ( (

#

) ) (

(

) (









 





B(i) : set of pages linking to i.B(i) : set of pages linking to i.

#out(j)

#out(j) : number of outgoing links from j. : number of outgoing links from j.

e : vector of components 1/sqrt{N}.e : vector of components 1/sqrt{N}.

 ^ee  ^r

r    ^T  ( 1   ) ^T 

Random jump

Principal eigenvector

(22)

Pagerank: use in Search Engines



Preprocessing:

 Given graph of links, build matrix P

 Compute its principal eigenvector r

 r[i] is the pagerank of page i

We are interested in the relative order



At query time:

 Retrieve pages containing query terms

 Rank them by their Pagerank

The final order is query- independent

(23)

How to compute it



Fast (approximate) computation:

 Given the Web graph, build the matrix

P

 Compute r = e * P^t for t = 0, 1, …

 r[i] is the pagerank of page i

(24)

(Personalized) Pagerank



Bias the random jump:

 Substitute e = [1 1 ….1] with a preference

vector which jumps to preferred pages (topics)

 If e = ei , then r[j] = relatedness between node j and node i

(25)

CoSim Rank to «compare»

nodes

Personalized PageRank, p⁽⁰⁾(i) varies with node i The adjacency matrix A has normalized rows

>>> Set d=1, so no teleport step

(26)

HITS: Hypertext Induced Topic

Search

(27)

Calculating HITS



It is query- dependent



Produces two scores per page:

 Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.

 Hub score: A good hub page for a topic points to many authoritative pages for that

Link-based Ranking

Ranking