
(1)

Crawling

(2)

A small example

(3)

Is this page a new one?

Check if the page has been parsed/downloaded before

Duplicate match

Near-duplicate match

Options:

Hashing on URLs

after 50 bln pages, we have “seen” over 500 bln URLs

each URL is at least 100 bytes on average

Overall we have about 50,000 TB (= 50 PB) for just the URLs

Disk access with caching (e.g., AltaVista)

> 5 ms per URL check

> 5 ms * 5 * 10^13 URL-checks => 8,000 years on 1 PC => ~30 days on 10^5 PCs (see the arithmetic check below)

Bloom Filter (Archive)

For 50 bln URLs → about 50 TB [cf. 1/1000 of hashing]
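To get a feel for these numbers, here is the back-of-envelope arithmetic behind the disk-access estimate, as a small Python check (the constants are the ones quoted above):

```python
# Sanity check of the disk-access estimate above.
ms_per_check = 5e-3      # > 5 ms per URL check (disk access with caching)
url_checks   = 5e13      # 5 * 10^13 URL-checks
total_sec = ms_per_check * url_checks            # 2.5 * 10^11 seconds
years_on_1_pc = total_sec / (365 * 24 * 3600)    # ~8,000 years on a single PC
days_on_100k  = total_sec / 1e5 / (24 * 3600)    # ~29 days on 10^5 PCs
print(f"{years_on_1_pc:,.0f} years on 1 PC; {days_on_100k:.0f} days on 10^5 PCs")
```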

(4)

Is the page new? [Bloom Filter, 1970]

Create a binary array B[1,m]

Consider a family of k hash functions that map a key (URL) to a position (integer) in B

Pro: No need to store keys, less complex

≈ 50*X bln bits in the array versus 50,000 bln chars

Cons: false positives

[Figure: bit array B, with k = 3 hash functions mapping a URL to three positions in B set to 1]
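The following minimal Python sketch mirrors the slide's construction: a bit array B of m bits and k hash functions, with no keys stored and false positives as the price. The use of blake2b with per-function salts is an implementation choice for illustration, not something the slides prescribe.

```python
import hashlib

class BloomFilter:
    """Bit array B[0..m-1] plus k hash functions: answers 'surely new'
    or 'possibly seen before' (false positives, no false negatives)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, url):
        # k salted hashes map the URL to k positions in B.
        for i in range(self.k):
            h = hashlib.blake2b(url.encode(), salt=i.to_bytes(8, 'big')).digest()
            yield int.from_bytes(h[:8], 'big') % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

# m/n = 30 bits per key for n = 1,000 URLs; k = round(30 ln 2) = 21 (cf. next slides).
bf = BloomFilter(m=30 * 1000, k=21)
bf.add("http://www.di.unipi.it/")
print(bf.maybe_contains("http://www.di.unipi.it/"))   # True
print(bf.maybe_contains("http://nutch.apache.org/"))  # False (w.h.p.)
```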


(6)

Minimize the prob. of error: the optimal number of hash functions is k = (m/n) ln 2, giving a false-positive probability of ≈ (0.62)^(m/n)

E.g., m/n = 30 bits per key gives k ≈ 21 and a false-positive rate of roughly 6 * 10^-7 (see the quick check below)

Advantageous when (m/n) ≪ (key length in bits + log n)
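A quick numeric check of the formula (plain Python, constants as above):

```python
import math

bits_per_key = 30                      # m/n
k = bits_per_key * math.log(2)         # optimal k = (m/n) ln 2 ~ 20.8
err = 0.62 ** bits_per_key             # false-positive prob ~ 5.9 * 10^-7
print(round(k), f"{err:.1e}")          # -> 21 5.9e-07
```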

(7)

Parallel Crawlers

The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication

Dynamic assignment

Central coordinator dynamically assigns URLs to crawlers

It needs communication between the coordinator and the crawling threads

Static assignment

The Web is statically partitioned and assigned to crawlers

Each crawler crawls only its part of the Web, with no need for a coordinator and thus no communication

(8)

Two problems with static assignment

Load balancing the #URLs assigned to downloaders:

Static schemes based on hosts may fail

www.geocities.com/….

www.di.unipi.it/

Dynamic “relocation” schemes may be complicated

Managing the fault-tolerance:

What about the death of a downloader? D → D-1, new hash!!!

What about new downloaders? D → D+1, new hash!!!

Let D be the number of downloaders.

hash(URL) maps a URL to {0,...,D-1}.

Downloader x fetches the URLs U s.t. hash(U) = x

Which hash would you use?
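The obvious first answer is a modular hash. The sketch below (illustrative names, not from the slides) shows both the assignment rule hash(U) = x and why D → D ± 1 forces a “new hash”: almost every URL changes its downloader.

```python
import hashlib

def owner(url: str, D: int) -> int:
    # Static assignment: map a URL to one of D downloaders via hash(URL) mod D.
    h = int.from_bytes(hashlib.sha1(url.encode()).digest(), 'big')
    return h % D

urls = [f"http://www.di.unipi.it/page{i}" for i in range(10_000)]

before = [owner(u, 16) for u in urls]   # D = 16 downloaders
after  = [owner(u, 15) for u in urls]   # one downloader dies: D -> 15

moved = sum(b != a for b, a in zip(before, after))
print(f"{moved / len(urls):.0%} of URLs changed downloader")  # typically ~94%
```

This wholesale reshuffling is exactly what consistent hashing (next slide) avoids.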

(9)

A nice technique: Consistent Hashing

A tool for:

Spidering

Web Cache

P2P

Router load balancing

Distributed FS

Items and servers are mapped to the unit circle via a hash function ID()

Item K is assigned to the first server N, moving clockwise, such that ID(N) ≥ ID(K)

What if a downloader goes down?

What if a new downloader appears?

Each server gets replicated log S times

[monotone] adding a new server moves items only from one old server to the new one

[balance] the prob. that an item goes to a given server is ≤ O(1)/S

[load] any server gets ≤ (I/S) log S items w.h.p.

[scale] you can copy each server more times, to tune its load...
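A compact Python sketch of the scheme just described (server names and the ring size are illustrative): items and servers are hashed onto a circle, each server is replicated about log S times, and an item goes to the first server clockwise from its position.

```python
import bisect, hashlib, math

def ID(key: str) -> int:
    # Hash a name/URL to a point on the circle (integers mod 2^32 here).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], 'big')

class ConsistentHash:
    def __init__(self, servers):
        reps = max(1, round(math.log2(len(servers))))  # ~log S virtual copies
        self.ring = sorted((ID(f"{s}#{r}"), s) for s in servers for r in range(reps))

    def server_for(self, item: str) -> str:
        # First server N clockwise with ID(N) >= ID(item); wrap around the circle.
        i = bisect.bisect_left(self.ring, (ID(item),))
        return self.ring[i % len(self.ring)][1]

ch = ConsistentHash([f"downloader{x}" for x in range(8)])
print(ch.server_for("http://www.di.unipi.it/"))
# Removing or adding one downloader only moves the items between it and its
# neighbours on the ring [monotone]; all other assignments are untouched.
```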

(10)

Open Source

Nutch (+ Hadoop), also used by WikiSearch

http://nutch.apache.org/
