Crawling
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
24h, 7 days “walking” over a Graph
What about the Graph?
BowTie
Directed graph G = (N, E)
N changes (insert, delete) >> 50 × 10^9 nodes
E changes (insert, delete), > 10 links per node
10 × 50 × 10^9 = 500 × 10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, Server load concerns (Minimize load)
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Freshness: How much has changed?
How to parallelize the process
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
Is this page a new one?
Check if file has been parsed or downloaded before
after 20 mil pages, we have “seen” over 200 million URLs
each URL is at least 100 bytes on average
Overall we have about 20GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, Altavista)
Also, two-level indexing with Front-coding compression
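A Bloom filter for the URL-seen test can be sketched as follows. This is a minimal, illustrative implementation: the filter size and the double-hashing scheme for deriving k hash functions are assumptions, not the Archive crawler’s actual code.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=4):
        # m_bits bits of state, k hash functions (illustrative defaults)
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _hashes(self, url):
        # Derive k positions from one sha1 digest via double hashing
        d = hashlib.sha1(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, url):
        for h in self._hashes(url):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, url):
        # May report false positives, never false negatives
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._hashes(url))

bf = BloomFilter()
bf.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in bf)   # True
```

A false positive makes the crawler skip a genuinely new URL, which is acceptable for crawling; a disk-based table (Mercator-style) avoids this at the cost of I/O.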
Link Extractor:
while(<Page Repository is not empty>){
<take a page p (check if it is new)>
<extract links contained in p within href>
<extract links contained in javascript>
<extract …..
<insert these links into the Priority Queue>
}
Downloaders:
while(<Assigned Repository is not empty>){
<extract url u>
<download page(u)>
<send page(u) to the Page Repository>
<store page(u) in a proper archive, possibly compressed>
}
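The download-and-store steps of the loop above can be sketched in Python. This is illustrative only: the function names, the gzip archive layout, and the omission of robots.txt handling and retries are simplifications.

```python
import gzip
import hashlib
import os
import urllib.request

def download(url):
    # <download page(u)>; a real Downloader adds robots.txt checks,
    # politeness delays, and retry logic
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def archive_page(url, page, archive_dir="archive"):
    # <store page(u) in a proper archive, possibly compressed>
    os.makedirs(archive_dir, exist_ok=True)
    name = hashlib.sha1(url.encode()).hexdigest() + ".gz"
    path = os.path.join(archive_dir, name)
    with gzip.open(path, "wb") as f:
        f.write(page)
    return path
```

Hashing the URL to a fixed-length file name sidesteps file-system limits on path length and forbidden characters.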
Crawler Manager:
while(<Priority Queue is not empty>){
<extract some URL u having the highest priority>
foreach u extracted {
if ( (u ∉ “Already Seen Pages”) ||
( (u ∈ “Already Seen Pages”) && <u’s version on the Web is more recent> ) ) {
<resolve u wrt DNS>
<send u to the Assigned Repository>
} }
}
Crawler “cycle of life”
[Figure: Link Extractor → Priority Queue (PQ) → Crawler Manager → Assigned Repository (AR) → Downloaders → Page Repository (PR) → back to the Link Extractor]
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to the Central coordinator (bottleneck?)
Static assignment
Web is statically partitioned and assigned to crawlers
Crawler only crawls its part of the web
Two problems with static assignment
Load balancing the #URLs assigned to downloaders:
Static schemes based on hosts may fail
www.geocities.com/….
www.di.unipi.it/
Dynamic “relocation” schemes may be complicated
Managing the fault-tolerance:
What about the death of downloaders? D → D-1, new hash!!!
What about new downloaders? D → D+1, new hash!!!
Let D be the number of downloaders.
hash(URL) maps a URL to {0, ..., D-1}.
Downloader x fetches the URLs U s.t. hash(U) = x
Which hash would you use?
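A quick experiment shows why the naive choice hash(URL) mod D is fragile: going from D to D+1 downloaders remaps almost every URL. The URLs below are synthetic and md5 is an arbitrary choice of hash.

```python
import hashlib

def bucket(url, D):
    # naive static assignment: hash(URL) mod D
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % D

urls = [f"http://site{i}.example/page" for i in range(10000)]
# Add one downloader: D = 10 -> D = 11
moved = sum(bucket(u, 10) != bucket(u, 11) for u in urls)
print(f"{moved / len(urls):.0%} of URLs change downloader")  # roughly 90%
```

In expectation a fraction 1 − 1/(D+1) of the URLs move, i.e. nearly all of them, forcing a massive reshuffle of the crawl state.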
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Routers Load Balance
Distributed FS
Items and servers are mapped to the unit circle
Item K is assigned to the first server N such that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves points only from one old server to the new one.
[balance] Prob. an item goes to a server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can copy each server more times...
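A minimal sketch of consistent hashing with virtual nodes (class and server names are illustrative): servers and URLs are hashed to the same circular space, and a URL goes to the first server position at or after its own hash, wrapping around.

```python
import bisect
import hashlib
import math

def h(key):
    # map any string to a point on the "circle" of 160-bit sha1 values
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentHash:
    def __init__(self, servers, replicas=None):
        # each server is replicated ~log2(S) times as virtual nodes
        self.replicas = replicas or max(1, int(math.log2(len(servers))))
        self.ring = sorted((h(f"{s}#{i}"), s)
                           for s in servers for i in range(self.replicas))
        self.keys = [pos for pos, _ in self.ring]

    def server_for(self, url):
        # first server position >= hash(url), wrapping around the circle
        i = bisect.bisect_left(self.keys, h(url)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["D0", "D1", "D2", "D3"])
print(ch.server_for("http://www.di.unipi.it/"))
```

The monotone property is visible directly: adding a downloader "D4" inserts only D4's virtual nodes, so any URL whose assignment changes must now map to D4, and all other URLs stay where they were.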
Examples: Open Source
Nutch, also used by WikiSearch
http://nutch.apache.org/
Ranking
Link-based Ranking
(2nd generation)
Reading 21
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity:
Each page gets a score given by the number of its in-links plus the number of its out-links (e.g. 3+2=5).
Directed popularity:
Score of a page = number of its in-links (e.g. 3).
Easy to SPAM
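Both first-generation counts follow directly from the edge list. The toy graph below is hypothetical, built to mirror the slide’s numbers: a page P with 3 in-links and 2 out-links.

```python
from collections import defaultdict

# hypothetical toy graph: (source, destination) edges
edges = [("A", "P"), ("B", "P"), ("C", "P"), ("P", "D"), ("P", "E")]

indeg, outdeg = defaultdict(int), defaultdict(int)
for u, v in edges:
    outdeg[u] += 1
    indeg[v] += 1

print("undirected score of P:", indeg["P"] + outdeg["P"])  # 3 + 2 = 5
print("directed score of P:", indeg["P"])                  # 3
```

Both scores are trivial to spam: creating pages that link to P raises its score with no editorial cost, which is what motivates the second generation.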
Second generation:
PageRank
Each link has its own importance!!
PageRank is
independent of the query
many interpretations…
Basic Intuition…
What about nodes with no in/out links?
Google’s Pagerank
T_{i,j} = 1/#out(j) if j ∈ B(i), 0 otherwise
r(i) = α · Σ_{j ∈ B(i)} r(j)/#out(j) + (1-α) · (1/N)
B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector of components 1/sqrt(N).
Random jump
Principal eigenvector
r = [ α T + (1-α) e e^T ] × r
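The PageRank recurrence can be run as a plain power iteration. The 4-node graph and α = 0.85 below are illustrative choices, not values from the slides.

```python
alpha = 0.85
# out-links of each node on a hypothetical 4-node graph
out = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(out)

r = [1 / N] * N
for _ in range(100):
    new = [(1 - alpha) / N] * N           # (1-α)·(1/N): random-jump mass
    for j, targets in out.items():
        share = r[j] / len(targets)       # r(j) / #out(j)
        for i in targets:
            new[i] += alpha * share       # α · Σ_{j ∈ B(i)} r(j)/#out(j)
    s = sum(new)
    r = [x / s for x in new]              # keep r a probability vector

print([round(x, 3) for x in r])
```

Node 2, with in-links from three pages, ends up with the highest score; node 3, with no in-links at all, still receives the random-jump share, which is exactly how the (1-α) term handles nodes nobody points to.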
Three different interpretations
Graph (intuitive interpretation)
Co-citation
Matrix (easy for computation)
Eigenvector computation or a linear system solution
Markov Chain (useful to prove convergence)
a sort of Usage Simulation
[Figure: random walk over the Web graph, from any node to its neighbors]
“In the steady state” each page has a long-term visit rate - use this as the page’s score.
Pagerank: use in Search Engines
Preprocessing:
Given graph, build matrix
Compute its principal eigenvector r
r[i] is the pagerank of page i
We are interested in the relative order
Query processing:
Retrieve pages containing query terms
Rank them by their Pagerank
The final order is query-independent
α T + (1-α) e e^T
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent
Produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.