Ranking
Link-based Ranking
(2° generation)
The Web as a Directed Graph
Assumption 1: A hyperlink between pages denotes
author perceived relevance (quality signal) Assumption 2: The text in the anchor of the hyperlink
describes the target page (textual context)
Page A Anchor hyperlink Page B
Sec. 21.1
Indexing anchor text
When indexing a document D, include anchor text from links pointing to D.
www.ibm.com
Armonk, NY-based computer giant IBM announced today
Joe’s computer hardware links
Sun
Big Blue today announced record profits for the quarter
Sec. 21.1.1
Can score anchor text with weight
Random Walks
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Definitions
Adjacency matrix A Transition matrix P
1
2
3
1
2
3
1
1/2
1 1/2
6
What is a random walk
1
1/2
1 1/2
t=0
What is a random walk
1
1/2
1 1/2
1
1/2
1 1/2
t=0 t=1
What is a random walk
1
1/2
1 1/2 1
1/2
1 1/2
t=0 t=1
1
1/2
1 1/2
t=2
What is a random walk
1
1/2
1 1/2
1
1/2
1 1/2
t=0 t=1
1
1/2
1 1/2
t=2
1
1/2
1 1/2
t=3
Probability Distributions
xt(i) = probability that surfer is at node i at time t
xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i) = xt * P
Transition matrix P
1
2
3
1
1/2
1 1/2
t=2
0 0 1 x
t1/2 1/2 0
x
=
Probability Distributions
xt(i) = probability that surfer is at node i at time t
xt+1(i) = ∑j (Probability of being at node j)*Pr(j->i) = ∑j xt(j)*P(j,i)
= x
tP
xt+1 = xt P = xt-1* P * P = xt-2* P * P * P = …= x0 Pt+1
What happens when the surfer keeps walking for a long time? So called Stationary distribution
12
Stationary Distribution
The stationary distribution at a node is related to the amount/proportion of time a random walker spends visiting that node.
It is when the distribution does not change anymore: i.e. xT+1 = xT xT P = 1* xT
(left eigenvector of eigenvalue 1)
For “well-behaved” graphs this does not depend on the start distribution:
x
t+1= x
0P
t+1Interesting questions
Does a stationary distribution always exist? Is it unique?
Yes, if the graph is “well-behaved”, namely the markov chain is irreducible and
aperiodic.
How fast will the random surfer
approach this stationary distribution?
Mixing Time!
14
Well behaved graphs
Irreducible: There is a path from every node to every other node ( it is an SCC).
Irreducible Not irreducible
Well behaved graphs
Aperiodic: The GCD of all cycle lengths is 1.
The GCD is also called period.
Aperiodic Periodicity is 3
16
About undirected graphs
A connected undirected graph is irreducible
A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution!
Makes sense, since larger the degree of the
node more likely a random walk is to come back to it.
PageRanks and HITS
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity:
Each page gets a score given by the number of in-links plus the number of out-links (es. 3+2=5).
Directed popularity:
Score of a page = number of its in-links (es. 3).
Easy to SPAM
Second generation:
PageRank
Deploy the graph structure, and each link has its own importance !
PageRank is
independent of the query
many interpretations:
Linear algebra – eigenvectors, eigenvalues
Markov chains – steady state probability distribution
Basic Intuition…
Random jump to any node
1-d Random jump
d
Google’s Pagerank
1 i j
N j
out j i r
r
i B j
) 1 1
) ( (
#
) ) (
(
) (
B(i) : set of pages linking to i.B(i) : set of pages linking to i.
#out(j)
#out(j) : number of outgoing links from j. : number of outgoing links from j.
e : vector of components 1/sqrt{N}.e : vector of components 1/sqrt{N}.
ee r
r T ( 1 ) T
Random jump
Principal eigenvector
Pagerank: use in Search Engines
Preprocessing:
Given graph of links, build matrix P
Compute its principal eigenvector r
r[i] is the pagerank of page i
We are interested in the relative order
At query time:
Retrieve pages containing query terms
Rank them by their Pagerank
The final order is query- independent
How to compute it
Fast (approximate) computation:
Given the Web graph, build the matrix
P
Compute r = e * Pt for t = 0, 1, …
r[i] is the pagerank of page i
(Personalized) Pagerank
Bias the random jump:
Substitute e = [1 1 ….1] with a preference
vector which jumps to preferred pages (topics)
If e = ei , then r[j] = relatedness between node j and node i
CoSim Rank to «compare»
nodes
Personalized PageRank, p(0)(i) varies with node i The adjacency matrix A has normalized rows
>>> Set d=1, so no teleport step
HITS: Hypertext Induced Topic
Search
Calculating HITS
It is query- dependent
Produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: A good hub page for a topic points to many authoritative pages for that
Authority and Hub scores
2
3
4
1 1
5
6
7
a(1) = h(2) + h(3) + h(4) h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
Where
a: Vector of Authority’ s scores h: Vector of Hub’ s scores.
A: Adjacency matrix in which ai,j = 1 if ij
h AA
h
Aa A
a Aa
h
h A
a
T T T
Thus, h is an eigenvector of AA
tSymmetric matrices
Weighting links
Weight more if the query occurs in the
neighborhood of the link (e.g. anchor text).
y x
y a
x h
) (
) (
x y
y h x
a
) (
)
( ( ) ( , ) ( )
) ( )
, ( )
(
y h y
x w x
a
y a y
x w x
h
x y
y x