
(1)

Crawling

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 20.1, 20.2 and 20.3

(2)

Spidering

24h, 7 days “walking” over a Graph

What about the Graph?

BowTie

Directed graph G = (N, E)

N changes (insert, delete): >> 50 * 10^9 nodes

E changes (insert, delete): > 10 links per node, i.e. 10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
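As a quick back-of-the-envelope check of the numbers above (the sparsity remark at the end is an added observation, not on the slide):

# Rough size check of the Web graph estimate above
nodes = 50 * 10**9                 # |N| > 50 billion pages
edges = 10 * nodes                 # about 10 links per node -> 500 * 10^9 1-entries
full_matrix = nodes ** 2           # a dense adjacency matrix would need 2.5 * 10^21 cells
print(f"{edges:.1e} edges vs {full_matrix:.1e} matrix cells")   # the graph is extremely sparse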

(3)

Crawling Issues

How to crawl?

Quality: “Best” pages first

Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns (Minimize load)

How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have?

How often to crawl?

Freshness: How much has changed?

How to parallelize the process?

(4)

What does it mean to crawl?

(5)

Page selection

Given a page P, define how “good” P is.

Several metrics:

BFS, DFS, Random

Popularity driven (PageRank, full vs partial)

Topic driven or focused crawling

Combined
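To make these policies concrete, here is a minimal sketch (an illustration, not code from the lecture) of a crawl frontier with a pluggable scoring function: scoring by depth approximates BFS, while scoring by a partial in-link count approximates popularity-driven selection. The names Frontier, score and meta are assumptions.

import heapq
import itertools

class Frontier:
    """Priority queue of URLs to crawl; smaller score = crawled sooner."""
    def __init__(self, score):
        self.score = score                     # score(url, meta) -> priority
        self.heap = []
        self.tie = itertools.count()           # tie-breaker so URLs are never compared directly

    def push(self, url, meta):
        heapq.heappush(self.heap, (self.score(url, meta), next(self.tie), url))

    def pop(self):
        return heapq.heappop(self.heap)[2]

# BFS-like ordering: shallower pages first.
bfs = Frontier(score=lambda url, meta: meta["depth"])
# Popularity-driven (partial): more in-links seen so far = higher priority.
popular = Frontier(score=lambda url, meta: -meta.get("inlinks", 0))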

(6)

Is this page a new one?

Check if file has been parsed or downloaded before

After 20 million pages, we have “seen” over 200 million URLs

each URL is at least 100 bytes on average

Overall we have about 20 GB of URLs

Options: compress URLs in main memory, or use disk

Bloom Filter (Archive)

Disk access with caching (Mercator, Altavista)

Also, two-level indexing with Front-coding compression
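As an illustration of the Bloom-filter option listed above (attributed on the slide to the Archive crawler), here is a minimal in-memory sketch; the sizes and the salted-SHA-1 hash construction are assumptions, not the actual crawler code.

import hashlib

class BloomFilter:
    """Set membership with false positives but no false negatives."""
    def __init__(self, n_bits, n_hashes=7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, url):
        # k bit positions derived from salted digests (an arbitrary choice for the sketch)
        for i in range(self.n_hashes):
            h = hashlib.sha1(f"{i}|{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, url):
        for p in self._positions(url):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, url):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(url))

seen = BloomFilter(n_bits=10**8)   # ~12.5 MB; scale n_bits to ~10 bits per expected URL
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen, "http://example.com/" in seen)

With about 10 bits per key and 7 hash functions the false-positive rate is roughly 1%, so 200 million URLs fit in roughly 250 MB of bits instead of 20 GB of raw URL strings.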

(7)

Link Extractor:

while(<Page Repository is not empty>){
   <take a page p (check if it is new)>
   <extract links contained in p within href>
   <extract links contained in javascript>
   <extract …>
   <insert these links into the Priority Queue>
}

Downloaders:

while(<Assigned Repository is not empty>){
   <extract url u>
   <download page(u)>
   <send page(u) to the Page Repository>
   <store page(u) in a proper archive, possibly compressed>
}

Crawler Manager:

while(<Priority Queue is not empty>){
   <extract some URL u having the highest priority>
   foreach u extracted {
      if ( (u ∉ “Already Seen Page”) ||
           (u ∈ “Already Seen Page” && <u’s version on the Web is more recent>) ) {
         <resolve u wrt DNS>
         <send u to the Assigned Repository>
      }
   }
}

Crawler “cycle of life” (diagram): Link Extractor → PQ (Priority Queue) → Crawler Manager → AR (Assigned Repository) → Downloaders → PR (Page Repository) → back to the Link Extractor.
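The toy sketch below (an illustration, not the slides' system) wires the three components together with in-memory queues playing the roles of PQ, AR and PR; robots.txt handling, politeness, DNS resolution and the recency check are all omitted, and the seed URL, timeout and href regex are assumptions.

import queue, re, threading, time, urllib.request
from urllib.parse import urljoin

priority_queue = queue.PriorityQueue()   # PQ: extracted links waiting to be scheduled
assigned_repo  = queue.Queue()           # AR: URLs handed to the downloaders
page_repo      = queue.Queue()           # PR: downloaded pages waiting to be parsed
seen = set()                             # stands in for "Already Seen Page" (no recency check)

def crawler_manager():
    while True:
        _, url = priority_queue.get()
        if url not in seen:
            seen.add(url)
            assigned_repo.put(url)       # DNS resolution skipped in this toy

def downloader():
    while True:
        url = assigned_repo.get()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            page_repo.put((url, html))   # a real crawler also archives it, possibly compressed
        except (OSError, ValueError):
            pass                         # skip unreachable or non-HTTP links

def link_extractor():
    while True:
        base, html = page_repo.get()
        for href in re.findall(r'href="([^"]+)"', html):
            priority_queue.put((0, urljoin(base, href)))   # constant priority = FIFO-like

for worker in (crawler_manager, downloader, link_extractor):
    threading.Thread(target=worker, daemon=True).start()

priority_queue.put((0, "https://example.com/"))
time.sleep(10)                           # let the toy pipeline run for a few pages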

(8)

Parallel Crawlers

The Web is too big to be crawled by a single crawler; the work should be divided, avoiding duplication

Dynamic assignment

Central coordinator dynamically assigns URLs to crawlers

Links are given to the Central coordinator (bottleneck?)

Static assignment

Web is statically partitioned and assigned to crawlers

Crawler only crawls its part of the web

(9)

Two problems with static assignment

Load balancing the #URLs assigned to downloaders:

Static schemes based on hosts may fail

www.geocities.com/….

www.di.unipi.it/

Dynamic “relocation” schemes may be complicated

Managing fault tolerance:

What about the death of downloaders? D → D-1, new hash !!!

What about new downloaders? D → D+1, new hash !!!

Let D be the number of downloaders.

hash(URL) maps a URL to {0,...,D-1}.

Downloader x fetches the URLs U s.t. hash(U) = x

Which hash would you use?
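A quick way to see why a plain modulo hash is a poor answer to the question above (the choice of D = 10 vs 11 and the synthetic URLs are arbitrary): when D changes, almost every URL is reassigned.

import hashlib

def bucket(url, d):                      # naive scheme: hash(URL) mod D
    return int(hashlib.md5(url.encode()).hexdigest(), 16) % d

urls = [f"http://site{i}.example/page" for i in range(100_000)]
moved = sum(bucket(u, 10) != bucket(u, 11) for u in urls)
print(f"{moved / len(urls):.0%} of URLs change downloader when D goes from 10 to 11")
# Expect roughly 90% to move; consistent hashing (next slide) moves only ~1/11 of them.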

(10)

A nice technique: Consistent Hashing

A tool for:

Spidering

Web Cache

P2P

Routers Load Balance

Distributed FS

Items and servers are mapped to the unit circle

Item K is assigned to the first server N such that ID(N) ≥ ID(K)

What if a downloader goes down?

What if a new downloader appears?

Each server gets replicated log S times

[monotone] adding a new server moves items only from one old server to the new one

[balance] the probability that an item goes to a given server is ≤ O(1)/S

[load] any server gets ≤ (I/S) log S items w.h.p.

[scale] you can replicate each server more times...
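A minimal consistent-hashing sketch along the lines above (an illustration; the class and the downloader names are assumptions): each downloader is placed on the circle with log S replicas, and a URL goes to the first replica whose ID is ≥ its hash, wrapping around.

import bisect, hashlib, math

class ConsistentHash:
    def __init__(self, servers, replicas=None):
        self.replicas = replicas or max(1, int(math.log2(len(servers))))
        self.ring = sorted(
            (self._h(f"{s}#{r}"), s) for s in servers for r in range(self.replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(x):
        return int(hashlib.sha1(x.encode()).hexdigest(), 16)

    def server_for(self, url):
        # first replica with ID >= hash(url), wrapping around the circle
        i = bisect.bisect_left(self.keys, self._h(url)) % len(self.keys)
        return self.ring[i][1]

ring = ConsistentHash([f"downloader{i}" for i in range(8)])
print(ring.server_for("http://www.di.unipi.it/"))

Removing a downloader only reassigns the URLs that fell on the arcs of its replicas; everything else stays put, which is exactly the [monotone] property above.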

(11)

Open Source

Nutch (+ Hadoop), also used by WikiSearch

http://nutch.apache.org/

+ page storage:
