(1)

Crawling

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 20.1, 20.2 and 20.3

(2)

Spidering

24h, 7 days: “walking” over a graph

What about the graph?

BowTie

Directed graph G = (N, E)

N changes (insert, delete) >> 50 × 10^9 nodes

E changes (insert, delete) > 10 links per node:
10 × 50 × 10^9 = 500 × 10^9 1-entries in the adjacency matrix

(3)

Crawling Issues

How to crawl?

Quality: “Best” pages first

Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns (Minimize load)

How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have?

How often to crawl?

Freshness: How much has changed?

How to parallelize the process?

(4)

Page selection

Given a page P, define how “good” P is.

Several metrics:

BFS, DFS, Random

Popularity driven (PageRank, full vs partial)

Topic driven or focused crawling

Combined

(5)

Is this page new?

Check if the file has been parsed or downloaded before:

After crawling 20 million pages, we have “seen” over 200 million URLs

Each URL is at least 100 bytes on average

Overall, about 20 GB of URLs

Options: compress URLs in main memory, or use disk

Bloom Filter (Archive): see the sketch below

Disk access with caching (Mercator, Altavista)

Also, two-level indexing with front-coding compression
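A minimal Python sketch of the Bloom-filter option, assuming double hashing over a SHA-1 digest (sizes and hash scheme are illustrative, not the Archive crawler's actual parameters): the filter answers “definitely new” or “probably seen”, trading a small false-positive rate for constant memory.

import hashlib

class BloomFilter:
    """Approximate URL-seen test: no false negatives, few false positives."""
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # k bit positions derived from one digest via double hashing.
        d = hashlib.sha1(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url: str):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

seen = BloomFilter(num_bits=8 * 2**20, num_hashes=5)   # ~1 MB instead of 20 GB
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)               # True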

(6)

Link Extractor:

while(<Page Repository is not empty>){
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract …>
  <insert these links into the Priority Queue>
}

Downloaders:

while(<Assigned Repository is not empty>){
  <extract url u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}

Crawler Manager:

while(<Priority Queue is not empty>){
  <extract some URLs u having the highest priority>
  foreach u extracted {
    if ( (u ∉ “Already Seen Page”) ||
         (u ∈ “Already Seen Page” && <u’s version on the Web is more recent>) ) {
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}

Crawler “cycle of life”

[Figure: Link Extractor → Priority Queue (PQ) → Crawler Manager → Assigned Repository (AR) → Downloaders → Page Repository (PR) → back to the Link Extractor]
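The same cycle of life as a single-threaded, runnable Python sketch: the queues mirror the figure (PQ, AR, PR), while priority() and fetch_page() are placeholder stubs standing in for the real scoring and HTTP machinery.

from collections import deque
from urllib.parse import urljoin
import re

seen = set()                       # "Already Seen Page" (really: Bloom filter or disk)
PQ, AR, PR = [], deque(), deque()  # Priority Queue, Assigned Rep., Page Rep.

def priority(url):                 # placeholder: could be BFS depth, PageRank, ...
    return -len(url)

def fetch_page(url):               # stub standing in for the real HTTP download
    return '<a href="http://www.di.unipi.it/">Dipartimento</a>'

def extract_links(base, page):     # naive href extraction (real crawlers do much more)
    return [urljoin(base, u) for u in re.findall(r'href="([^"]+)"', page)]

def crawler_manager():
    while PQ:
        u = max(PQ, key=priority)  # extract a URL with highest priority
        PQ.remove(u)
        if u not in seen:          # or: seen, but a more recent version is on the Web
            seen.add(u)            # DNS resolution would also happen here
            AR.append(u)

def downloader():
    while AR:
        u = AR.popleft()
        PR.append((u, fetch_page(u)))  # page also stored in a compressed archive

def link_extractor():
    while PR:
        u, page = PR.popleft()
        PQ.extend(extract_links(u, page))

PQ.append("http://nutch.apache.org/")
for _ in range(3):                 # a few turns of the cycle of life
    crawler_manager(); downloader(); link_extractor()
print(sorted(seen))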

(7)

Parallel Crawlers

The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication

Dynamic assignment

Central coordinator dynamically assigns URLs to crawlers

Links are given to the Central coordinator (a possible bottleneck)

Static assignment

Web is statically partitioned and assigned to crawlers

Crawler only crawls its part of the web

(8)

Two problems with static assignment

Load balancing the #URLs assigned to downloaders:

Static schemes based on hosts may fail

www.geocities.com/….

www.di.unipi.it/

Dynamic “relocation” schemes may be complicated

Managing the fault-tolerance:

What about the death of a downloader? D → D−1, new hash!!!

What about new downloaders? D → D+1, new hash!!!

Let D be the number of downloaders.

hash(URL) maps a URL to {0, ..., D−1}.

Downloader x fetches the URLs U s.t. hash(U) = x.

Which hash would you use?
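A toy experiment (with hypothetical URLs) showing why the naive answer, a modular hash, behaves badly when D changes: if one downloader dies, almost every URL changes owner, not just the dead downloader's share.

import hashlib

def assign(url: str, D: int) -> int:
    # Static assignment: downloader x fetches the URLs U s.t. hash(U) = x.
    h = int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")
    return h % D

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]  # hypothetical
moved = sum(assign(u, 10) != assign(u, 9) for u in urls)           # a downloader dies
print(f"{moved / len(urls):.0%} of URLs change downloader")        # ~90%, not 10%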

(9)

A nice technique: Consistent Hashing

A tool for:

Spidering

Web Cache

P2P

Routers Load Balance

Distributed FS

Items and servers are mapped to the unit circle

Item K is assigned to the first server N such that ID(N) ≥ ID(K)

What if a downloader goes down?

What if a new downloader appears?

Each server gets replicated log S times

[monotone] adding a new server moves items only from one old server to the new one

[balance] Prob. an item goes to a server is ≤ O(1)/S

[load] any server gets ≤ (I/S) log S items w.h.p.

[scale] you can replicate each server more times...
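A minimal consistent-hashing sketch under these definitions (the replica count and server names are illustrative); rerunning the earlier experiment shows that only about 1/D of the URLs move when a downloader dies.

import bisect, hashlib

def _id(key: str) -> int:
    # Position on the circle (here: the 64-bit integer line, wrapped around).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, servers, replicas=4):
        # Each server is replicated a few (~ log S) times on the circle.
        self.ring = sorted((_id(f"{s}#{r}"), s)
                           for s in servers for r in range(replicas))
        self.ids = [i for i, _ in self.ring]

    def server_of(self, item: str) -> str:
        # Item K goes to the first server N clockwise with ID(N) >= ID(K).
        pos = bisect.bisect_left(self.ids, _id(item)) % len(self.ring)
        return self.ring[pos][1]

downloaders = [f"downloader{x}" for x in range(10)]
before = ConsistentHash(downloaders)
after = ConsistentHash(downloaders[:-1])      # downloader9 goes down
urls = [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]
moved = sum(before.server_of(u) != after.server_of(u) for u in urls)
print(f"{moved / len(urls):.0%} of URLs moved")   # ~10%: only the dead server's share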

(10)

Examples: Open Source

Nutch, also used by WikiSearch

http://nutch.apache.org/

(11)

Ranking

Link-based Ranking

(2nd generation)

Reading 21

(12)

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Undirected popularity:

Each page gets a score given by the number of its in-links plus the number of its out-links (e.g. 3+2=5).

Directed popularity:

Score of a page = number of its in-links (e.g. 3).

Easy to SPAM
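Both first-generation scores computed on a toy adjacency list (the four-page graph is hypothetical, chosen to reproduce the 3 and 3+2=5 examples above):

graph = {1: [2, 3], 2: [1], 3: [1], 4: [1, 2]}   # page -> pages it links to

in_links = {p: 0 for p in graph}                  # count in-links
for p, out in graph.items():
    for q in out:
        in_links[q] += 1

directed = in_links                                           # score = #in-links
undirected = {p: in_links[p] + len(graph[p]) for p in graph}  # in-links + out-links
print(directed[1], undirected[1])                 # page 1: 3 and 3+2 = 5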

(13)

Second generation:

PageRank

Each link has its own importance!!

PageRank is independent of the query

many interpretations…

(14)

Basic Intuition…

What about nodes with no in/out links?

(15)

Google’s PageRank

r(i) = α · Σ_{j ∈ B(i)} r(j)/#out(j) + (1 − α) · (1/N)

B(i): set of pages linking to i.

#out(j): number of outgoing links from j.

e: vector of components 1/sqrt{N}.

Random jump

Principal eigenvector: r = [ α T + (1 − α) e e^T ] × r

where T_{i,j} = 1/#out(j) if j links to i, and 0 otherwise.
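A small power-iteration sketch of this formula (α = 0.85 and the four-node graph are illustrative; the rank of dangling nodes is redistributed uniformly, one common answer to the “no out-links” question above):

def pagerank(graph, alpha=0.85, iters=50):
    # graph: node -> list of successors. Returns the PageRank vector r as a dict.
    N = len(graph)
    r = {i: 1.0 / N for i in graph}
    B = {i: [j for j in graph if i in graph[j]] for i in graph}   # back-links B(i)
    for _ in range(iters):
        dangling = sum(r[j] for j in graph if not graph[j])       # no out-links
        r = {i: alpha * (sum(r[j] / len(graph[j]) for j in B[i]) + dangling / N)
                + (1 - alpha) / N
             for i in graph}
    return r

toy = {1: [2, 3], 2: [3], 3: [1], 4: [3]}   # hypothetical 4-node graph
print(pagerank(toy))                         # node 3 collects the highest rank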

(16)

Three different interpretations

Graph (intuitive interpretation)

Co-citation

Matrix (easy for computation)

Eigenvector computation or a linear system solution

Markov Chain (useful to prove convergence)

a sort of Usage Simulation

[Figure: random walk: from any node, follow a link to a neighbor or jump to any node]

In the “steady state”, each page has a long-term visit rate: use this as the page’s score.

(17)

Pagerank: use in Search Engines

Preprocessing:

Given graph, build matrix

Compute its principal eigenvector r

r[i] is the pagerank of page i

We are interested in the relative order

Query processing:

Retrieve pages containing query terms

Rank them by their Pagerank

The final order is query-independent

α T + (1 − α) e e^T

(18)

HITS: Hypertext Induced Topic Search

(19)

Calculating HITS

It is query-dependent

Produces two scores per page:

Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.

Hub score: a good hub page for a topic points to many authoritative pages for that topic.

(20)

Authority and Hub scores

[Figure: pages 2, 3 and 4 point to page 1; page 1 points to pages 5, 6 and 7]

a(1) = h(2) + h(3) + h(4)

h(1) = a(5) + a(6) + a(7)

(21)

HITS: Link Analysis Computation

Where

a: vector of Authority scores.

h: vector of Hub scores.

A: adjacency matrix, in which a_{i,j} = 1 if i → j.

h = A a

a = A^T h

so that h = A A^T h and a = A^T A a

Thus, h is an eigenvector of A A^T and a is an eigenvector of A^T A: symmetric matrices.
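The same fixed-point computation as a short sketch: iterate h = A a and a = A^T h with normalization (the toy graph reproduces the figure of the previous slide):

import math

def hits(graph, iters=50):
    # graph: node -> successors. Returns (authority, hub) score dicts.
    a = {x: 1.0 for x in graph}
    h = {x: 1.0 for x in graph}
    for _ in range(iters):
        a = {x: sum(h[y] for y in graph if x in graph[y]) for x in graph}  # a = A^T h
        h = {x: sum(a[y] for y in graph[x]) for x in graph}                # h = A a
        na = math.sqrt(sum(v * v for v in a.values()))    # normalize, or the
        nh = math.sqrt(sum(v * v for v in h.values()))    # scores blow up
        a = {x: v / na for x, v in a.items()}
        h = {x: v / nh for x, v in h.items()}
    return a, h

toy = {1: [5, 6, 7], 2: [1], 3: [1], 4: [1], 5: [], 6: [], 7: []}
a, h = hits(toy)
print(a[1], h[1])   # 1 is the best authority (pointed to by hubs 2, 3, 4)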

(22)

Weighting links

Weight more if the query occurs in the neighborhood of the link (e.g. in the anchor text).

h(x) = Σ_{x→y} a(y)    a(x) = Σ_{y→x} h(y)

become

h(x) = Σ_{x→y} w(x, y) · a(y)    a(x) = Σ_{y→x} w(y, x) · h(y)
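As a variation on the previous hits() sketch, only the two update lines change; w is a hypothetical weight function, e.g. 1 plus the number of query occurrences near the anchor:

def weighted_hits_step(graph, a, h, w=lambda x, y: 1.0):
    # One update of the weighted equations; w(x, y) weighs the link x -> y.
    new_a = {x: sum(w(y, x) * h[y] for y in graph if x in graph[y]) for x in graph}
    new_h = {x: sum(w(x, y) * new_a[y] for y in graph[x]) for x in graph}
    return new_a, new_h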

Riferimenti

Documenti correlati

Government Printing Office , Social Security Administration (US)..

We look for small (sometimes “minimal”) set of generators for the fields K(E[m]) generated by the m-torsion points of an elliptic curve E (where K is any number field).. For m = 3

For the point-like source search using the direction information of upward-going muons, we evaluate the background due to atmospheric neutrino induced muons randomly mixing for

This paper reports an experimental study of transient natural convection heat and mass transfer between two large cylindrical reservoirs connected through a short

[r]

[r]

To drill a cube is to partition it into 27 congruent subcubes and remove the closed central cube along with the 6 remaining subcubes sharing a face with that cube, resulting in a

Solution proposed by Roberto Tauraso, Dipartimento di Matematica, Universit`a di Firenze, viale Morgagni 67/A, 50134 Firenze, Italy.. We denote by σ(n) the sum of the divisors of