Crawling
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
24h, 7 days “walking” over a Graph
What about the Graph?
The “BowTie” structure of the Web graph
Directed graph G = (N, E)
N changes (insert, delete) >> 50 * 10^9 nodes
E changes (insert, delete) > 10 links per node
10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, server load concerns (minimize load); see the robots.txt sketch after this list
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Freshness: How much has changed?
How to parallelize the process?
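A minimal sketch of the etiquette check mentioned above, using Python's standard urllib.robotparser; the crawler name and the URLs are illustrative placeholders.

# Sketch: consult robots.txt before fetching a URL.
# "MyCrawler" and the target URLs are illustrative placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.di.unipi.it/robots.txt")
rp.read()  # fetches and parses robots.txt

url = "https://www.di.unipi.it/some/page.html"
if rp.can_fetch("MyCrawler", url):
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt")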
What does it mean to crawl?
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
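A minimal sketch of a priority-driven frontier, where the priority function is pluggable: a constant priority gives BFS-like FIFO order, a popularity score gives PageRank-style ordering. The scores below are illustrative placeholders, not a real PageRank computation.

# Sketch of a crawl frontier ordered by a pluggable priority function.
import heapq
import itertools

class CrawlFrontier:
    def __init__(self, priority_fn):
        self.heap = []                      # entries: (priority, tie-breaker, url)
        self.counter = itertools.count()    # tie-breaker preserves FIFO order
        self.priority_fn = priority_fn

    def push(self, url):
        # Lower priority value means the URL is crawled earlier.
        heapq.heappush(self.heap, (self.priority_fn(url), next(self.counter), url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

# BFS-like behavior: all URLs share the same priority, so FIFO order decides.
bfs_frontier = CrawlFrontier(priority_fn=lambda url: 0)

# Popularity-driven behavior: hypothetical scores, e.g. from a partial PageRank.
score = {"http://a.example/": 0.9, "http://b.example/": 0.1}
popularity_frontier = CrawlFrontier(priority_fn=lambda url: -score.get(url, 0.0))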
Is this page a new one?
Check if the URL has been parsed or downloaded before
after 20 million pages, we have “seen” over 200 million URLs
each URL is at least 100 bytes on average
Overall we have about 20 GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, Altavista)
Also, two-level indexing with Front-coding compression
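A minimal sketch of the in-memory option, a Bloom filter over URLs; the filter size, number of hash functions and hash choice are illustrative assumptions, not the parameters of the Archive crawler.

# Sketch of a Bloom filter for the "already seen URL" test.
# m_bits and k are illustrative; a real crawler tunes them for the
# expected number of URLs and the target false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 23, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)   # True
print("http://example.org/new" in seen)    # False with high probability

False positives are possible (a genuinely new URL may occasionally be skipped), which is why the disk-based options above are used when that risk is not acceptable.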
Link Extractor:
while(<Page Repository is not empty>){
<take a page p (check if it is new)>
<extract links contained in p within href>
<extract links contained in javascript>
<extract …..
<insert these links into the Priority Queue>
}
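A minimal sketch (an assumption, not the original code) of the href-extraction step using Python's standard html.parser; extraction of links embedded in JavaScript is omitted.

# Sketch: extract href links from a downloaded page and resolve them
# against the page's own URL. Only <a href="..."> links are handled.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://www.di.unipi.it/")
extractor.feed('<a href="/people.html">People</a>')
print(extractor.links)   # ['http://www.di.unipi.it/people.html']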
Downloaders:
while(<Assigned Repository is not empty>){
<extract url u>
<download page(u)>
<send page(u) to the Page Repository>
<store page(u) in a proper archive, possibly compressed>
}
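A minimal sketch of a downloader worker; the queue objects and the archive directory are illustrative placeholders for the Assigned Repository, the Page Repository and the compressed archive mentioned above.

# Sketch: pull URLs from the assigned repository, fetch each page,
# hand it to the page repository, and store a gzip-compressed copy.
import gzip
import os
import urllib.request
from queue import Queue

def downloader(assigned_repo: Queue, page_repo: Queue, archive_dir="archive"):
    os.makedirs(archive_dir, exist_ok=True)
    while not assigned_repo.empty():
        url = assigned_repo.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read()
        except Exception:
            continue                                   # skip unreachable URLs in this sketch
        page_repo.put((url, page))                     # later consumed by the Link Extractor
        fname = os.path.join(archive_dir, f"{abs(hash(url))}.gz")   # naive naming, for illustration
        with gzip.open(fname, "wb") as f:
            f.write(page)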
Crawler Manager:
while(<Priority Queue is not empty>){
<extract some URL u having the highest priority>
foreach u extracted {
if ( (u ∉ “Already Seen Page”) ||
( u ∈ “Already Seen Page” && <u’s version on the Web is more recent> ) ) {
<resolve u wrt DNS>
<send u to the Assigned Repository>
} }
}
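A minimal sketch of the manager loop; frontier, seen, assigned_repo and is_fresher_on_web are placeholders for the priority queue, the “Already Seen Page” structure (e.g. the Bloom filter sketched earlier), the Assigned Repository and the freshness test described above.

# Sketch of the Crawler Manager: pop the highest-priority URL, skip it if
# already seen (unless a fresher version is expected on the Web), resolve
# its host via DNS, and hand it to the Assigned Repository.
import socket
from urllib.parse import urlparse

def crawler_manager(frontier, seen, assigned_repo, is_fresher_on_web):
    while True:
        try:
            url = frontier.pop()               # highest-priority URL
        except IndexError:
            break                              # priority queue is empty
        if url in seen and not is_fresher_on_web(url):
            continue
        try:
            ip = socket.gethostbyname(urlparse(url).hostname)   # DNS resolution
        except (socket.gaierror, TypeError):
            continue
        seen.add(url)
        assigned_repo.put((ip, url))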
Crawler “cycle of life”
[Diagram: the Link Extractor feeds the Priority Queue (PQ); the Crawler Manager moves URLs to the Assigned Repository (AR); the Downloaders fill the Page Repository (PR), which feeds the Link Extractor again]
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided while avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to the central coordinator (possible bottleneck)
Static assignment
Web is statically partitioned and assigned to crawlers
Each crawler only crawls its part of the Web
Two problems with static assignment
Load balancing the #URLs assigned to downloaders:
Static schemes based on hosts may fail
www.geocities.com/….
www.di.unipi.it/
Dynamic “relocation” schemes may be complicated
Managing the fault-tolerance:
What about the death of downloaders? D → D-1, new hash !!!
What about new downloaders? D → D+1, new hash !!!
Let D be the number of downloaders.
hash(URL) maps a URL to {0,...,D-1}.
Downloader x fetches the URLs U s.t.
hash(U) = x
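A small sketch showing why plain hash(URL) mod D is fragile: when D changes (a downloader dies or a new one joins), almost every URL is reassigned. The hash and the URL list are illustrative.

# Sketch: static assignment by hash(URL) mod D, and the effect of D -> D-1.
import hashlib

def assign(url, D):
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D

urls = [f"http://example.org/page{i}" for i in range(1000)]
D = 10
before = {u: assign(u, D) for u in urls}
after = {u: assign(u, D - 1) for u in urls}            # one downloader died
moved = sum(before[u] != after[u] for u in urls)
print(f"{moved}/{len(urls)} URLs change downloader")   # roughly 90% of them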
Which hash would you use?
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Router load balancing
Distributed FS
Items and servers are mapped to the unit circle
Item K assigned to first server N such that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves items only from one old server to the new one
[balance] the probability that an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate each server more times...
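A compact consistent-hashing sketch: downloaders are replicated a few times on the circle, and each URL goes to the first downloader at or after its point. The replication factor and the hash choice are illustrative assumptions.

# Sketch of consistent hashing over downloaders.
import bisect
import hashlib

def point(key):
    # Map a string to a position on the (discretized) unit circle.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers, replicas=8):
        self.ring = sorted((point(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))
        self.keys = [p for p, _ in self.ring]

    def server_for(self, url):
        # First server whose point is >= the URL's point, wrapping around.
        i = bisect.bisect_left(self.keys, point(url)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["downloader-0", "downloader-1", "downloader-2"])
print(ch.server_for("http://www.di.unipi.it/"))
# Removing (or adding) one downloader only reassigns the URLs that fell
# in its arcs: the [monotone] property above.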
Open Source
Nutch (+ Hadoop), also used by WikiSearch