Crawling
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
24h, 7 days “walking” over a Graph
What about the Graph?
The “BowTie” structure of the Web graph
Directed graph G = (N, E)
N changes (insert, delete) >> 50 * 10^9 nodes
E changes (insert, delete) > 10 links per node
10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, server load concerns (minimize load); see the robots.txt sketch after this list
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Freshness: How much has changed?
How to parallelize the process?
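A minimal sketch of the etiquette check mentioned above, using Python's standard urllib.robotparser; the crawler name and the URLs are illustrative placeholders.

# Sketch: consult robots.txt before fetching a URL.
# "MyCrawler" and the target URLs are illustrative placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.di.unipi.it/robots.txt")
rp.read()  # fetches and parses robots.txt

url = "https://www.di.unipi.it/some/page.html"
if rp.can_fetch("MyCrawler", url):
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt")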
What does it mean to crawl?
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
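A minimal sketch of a priority-driven frontier, where the priority function is pluggable: a constant priority gives BFS-like FIFO order, a popularity score gives PageRank-style ordering. The scores below are illustrative placeholders, not a real PageRank computation.

# Sketch of a crawl frontier ordered by a pluggable priority function.
import heapq
import itertools

class CrawlFrontier:
    def __init__(self, priority_fn):
        self.heap = []                      # entries: (priority, tie-breaker, url)
        self.counter = itertools.count()    # tie-breaker preserves FIFO order
        self.priority_fn = priority_fn

    def push(self, url):
        # Lower priority value means the URL is crawled earlier.
        heapq.heappush(self.heap, (self.priority_fn(url), next(self.counter), url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

# BFS-like behavior: all URLs share the same priority, so FIFO order decides.
bfs_frontier = CrawlFrontier(priority_fn=lambda url: 0)

# Popularity-driven behavior: hypothetical scores, e.g. from a partial PageRank.
score = {"http://a.example/": 0.9, "http://b.example/": 0.1}
popularity_frontier = CrawlFrontier(priority_fn=lambda url: -score.get(url, 0.0))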
Is this page a new one?
Check if the URL has been parsed or downloaded before
after 20 million pages, we have “seen” over 200 million URLs
each URL is at least 100 bytes on average
Overall we have about 20 GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, Altavista)
Also, two-level indexing with Front-coding compression
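A minimal sketch of the in-memory option, a Bloom filter over URLs; the filter size, number of hash functions and hash choice are illustrative assumptions, not the parameters of the Archive crawler.

# Sketch of a Bloom filter for the "already seen URL" test.
# m_bits and k are illustrative; a real crawler tunes them for the
# expected number of URLs and the target false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 23, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

seen = BloomFilter()
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)   # True
print("http://example.org/new" in seen)    # False with high probability

False positives are possible (a genuinely new URL may occasionally be skipped), which is why the disk-based options above are used when that risk is not acceptable.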
Link Extractor:
while(<Page Repository is not empty>){
<take a page p (check if it is new)>
<extract links contained in p within href>
<extract links contained in javascript>
<extract …..
<insert these links into the Priority Queue>
}
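A minimal sketch (an assumption, not the original code) of the href-extraction step using Python's standard html.parser; extraction of links embedded in JavaScript is omitted.

# Sketch: extract href links from a downloaded page and resolve them
# against the page's own URL. Only <a href="..."> links are handled.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("http://www.di.unipi.it/")
extractor.feed('<a href="/people.html">People</a>')
print(extractor.links)   # ['http://www.di.unipi.it/people.html']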
Downloaders:
while(<Assigned Repository is not empty>){
<extract url u>
<download page(u)>
<send page(u) to the Page Repository>
<store page(u) in a proper archive, possibly compressed>
}
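A minimal sketch of a downloader worker; the queue objects and the archive directory are illustrative placeholders for the Assigned Repository, the Page Repository and the compressed archive mentioned above.

# Sketch: pull URLs from the assigned repository, fetch each page,
# hand it to the page repository, and store a gzip-compressed copy.
import gzip
import os
import urllib.request
from queue import Queue

def downloader(assigned_repo: Queue, page_repo: Queue, archive_dir="archive"):
    os.makedirs(archive_dir, exist_ok=True)
    while not assigned_repo.empty():
        url = assigned_repo.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read()
        except Exception:
            continue                                   # skip unreachable URLs in this sketch
        page_repo.put((url, page))                     # later consumed by the Link Extractor
        fname = os.path.join(archive_dir, f"{abs(hash(url))}.gz")   # naive naming, for illustration
        with gzip.open(fname, "wb") as f:
            f.write(page)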
Crawler Manager:
while(<Priority Queue is not empty>){
<extract some URL u having the highest priority>
foreach u extracted {
if ( (u ∉ “Already Seen Page”) ||
( u ∈ “Already Seen Page” && <u’s version on the Web is more recent> ) ) {
<resolve u wrt DNS>
<send u to the Assigned Repository>
} }
}
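A minimal sketch of the manager loop; frontier, seen, assigned_repo and is_fresher_on_web are placeholders for the priority queue, the “Already Seen Page” structure (e.g. the Bloom filter sketched earlier), the Assigned Repository and the freshness test described above.

# Sketch of the Crawler Manager: pop the highest-priority URL, skip it if
# already seen (unless a fresher version is expected on the Web), resolve
# its host via DNS, and hand it to the Assigned Repository.
import socket
from urllib.parse import urlparse

def crawler_manager(frontier, seen, assigned_repo, is_fresher_on_web):
    while True:
        try:
            url = frontier.pop()               # highest-priority URL
        except IndexError:
            break                              # priority queue is empty
        if url in seen and not is_fresher_on_web(url):
            continue
        try:
            ip = socket.gethostbyname(urlparse(url).hostname)   # DNS resolution
        except (socket.gaierror, TypeError):
            continue
        seen.add(url)
        assigned_repo.put((ip, url))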
Crawler “cycle of life”
[Diagram: the Link Extractor feeds the Priority Queue (PQ); the Crawler Manager moves URLs to the Assigned Repository (AR); the Downloaders fill the Page Repository (PR), which feeds the Link Extractor again]
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided while avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to the central coordinator (possible bottleneck)
Static assignment
Web is statically partitioned and assigned to crawlers
Each crawler only crawls its part of the Web
Two problems with static assignment
Load balancing the #URLs assigned to downloaders:
Static schemes based on hosts may fail
www.geocities.com/….
www.di.unipi.it/
Dynamic “relocation” schemes may be complicated
Managing the fault-tolerance:
What about the death of downloaders? D → D-1, new hash !!!
What about new downloaders? D → D+1, new hash !!!
Let D be the number of downloaders.
hash(URL) maps a URL to {0,...,D-1}.
Downloader x fetches the URLs U s.t.
hash(U) = x
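A small sketch showing why plain hash(URL) mod D is fragile: when D changes (a downloader dies or a new one joins), almost every URL is reassigned. The hash and the URL list are illustrative.

# Sketch: static assignment by hash(URL) mod D, and the effect of D -> D-1.
import hashlib

def assign(url, D):
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D

urls = [f"http://example.org/page{i}" for i in range(1000)]
D = 10
before = {u: assign(u, D) for u in urls}
after = {u: assign(u, D - 1) for u in urls}            # one downloader died
moved = sum(before[u] != after[u] for u in urls)
print(f"{moved}/{len(urls)} URLs change downloader")   # roughly 90% of them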
Which hash would you use?
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Router load balancing
Distributed FS
Items and servers are mapped to the unit circle
Item K assigned to first server N such that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves items only from one old server to the new one
[balance] the probability that an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate each server more times...
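A compact consistent-hashing sketch: downloaders are replicated a few times on the circle, and each URL goes to the first downloader at or after its point. The replication factor and the hash choice are illustrative assumptions.

# Sketch of consistent hashing over downloaders.
import bisect
import hashlib

def point(key):
    # Map a string to a position on the (discretized) unit circle.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers, replicas=8):
        self.ring = sorted((point(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))
        self.keys = [p for p, _ in self.ring]

    def server_for(self, url):
        # First server whose point is >= the URL's point, wrapping around.
        i = bisect.bisect_left(self.keys, point(url)) % len(self.ring)
        return self.ring[i][1]

ch = ConsistentHash(["downloader-0", "downloader-1", "downloader-2"])
print(ch.server_for("http://www.di.unipi.it/"))
# Removing (or adding) one downloader only reassigns the URLs that fell
# in its arcs: the [monotone] property above.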
Open Source
Nutch (+ Hadoop), also used by WikiSearch