
(1)

Crawling

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 20.1, 20.2 and 20.3

(2)

Spidering

24h, 7 days: “walking” over a graph

What about the Graph?

BowTie

Directed graph G = (N, E)

N changes (insert, delete) >> 50 * 10^9 nodes

E changes (insert, delete), > 10 links per node

10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix

(3)

Crawling Issues

How to crawl?

Quality: “Best” pages first

Efficiency: Avoid duplication (or near duplication)

Etiquette: Robots.txt, Server load concerns (Minimize load)

How much to crawl? How much to index?

Coverage: How big is the Web? How much do we cover?

Relative Coverage: How much do competitors have?

How often to crawl?

Freshness: How much has changed?

How to parallelize the process?

(4)

Page selection

Given a page P, define how “good” P is.

Several metrics:

BFS, DFS, Random

Popularity driven (PageRank, full vs partial)

Topic driven or focused crawling

Combined

(5)

Is this page a new one?

Check if file has been parsed or downloaded before

after 20 million pages, we have “seen” over 200 million URLs

each URL is at least 100 bytes on average

Overall we have about 20GB of URLs

Options: compress URLs in main memory, or use disk

Bloom Filter (Archive)

Disk access with caching (Mercator, Altavista)
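A minimal Python sketch of the Bloom-filter option for the “already seen URLs” test (the sizes and the MD5 double-hashing scheme are assumptions of this sketch, not the Archive crawler’s actual implementation): a bit array plus k hash functions answers membership queries in a few memory probes, with no false negatives and a tunable false-positive rate, at a fraction of the 20GB needed for the raw URLs.

```python
import hashlib

class BloomFilter:
    """m-bit array + k hash functions; no false negatives, few false positives."""
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url: str):
        # k bit positions via double hashing on the two halves of an MD5 digest
        d = hashlib.md5(url.encode()).digest()
        h1 = int.from_bytes(d[:8], "little")
        h2 = int.from_bytes(d[8:], "little") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, url: str):
        for p in self._positions(url):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(url))

# ~12MB of bits for this demo; growing m and k trades space for false-positive rate
seen = BloomFilter(m_bits=10**8, k_hashes=7)
seen.add("http://www.di.unipi.it/")
print("http://www.di.unipi.it/" in seen)    # True
print("http://www.example.org/x" in seen)   # False (w.h.p.)
```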

(6)

Link Extractor:

while(<Page Repository is not empty>){

<take a page p (check if it is new)>

<extract links contained in p within href>

<extract links contained in javascript>

<extract …>

<insert these links into the Priority Queue>

}

Downloaders:

while(<Assigned Repository is not empty>){

<extract url u>

<download page(u)>

<send page(u) to the Page Repository>

<store page(u) in a proper archive, possibly compressed>

}

Crawler Manager:

while(<Priority Queue is not empty>){

<extract some URL u having the highest priority>

foreach u extracted {

if ( (u ∉ “Already Seen Page” ) ||

( u ∈ “Already Seen Page” && <u’s version on the Web is more recent> ) ) {

<resolve u wrt DNS>

<send u to the Assigned Repository>

} }

}

Crawler “cycle of life”: Link Extractor → Priority Queue (PQ) → Crawler Manager → Assigned Repository (AR) → Downloaders → Page Repository (PR) → Link Extractor
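The three modules above can be condensed into a single-threaded Python sketch (a real crawler runs them as parallel processes around the shared repositories; `fetch_page` and `extract_links` are hypothetical callbacks supplied by the caller, and the priorities are toy values):

```python
import heapq
from urllib.parse import urljoin

def crawl(seeds, fetch_page, extract_links, max_pages=1000):
    """Single-threaded sketch of the Link Extractor / Manager / Downloader cycle."""
    pq = [(-1.0, url) for url in seeds]        # Priority Queue (max-priority first)
    heapq.heapify(pq)
    already_seen = set(seeds)                  # the "Already Seen Page" test
    page_repository = {}
    while pq and len(page_repository) < max_pages:
        prio, url = heapq.heappop(pq)          # Crawler Manager: highest-priority URL
        html = fetch_page(url)                 # Downloader (DNS resolution omitted)
        page_repository[url] = html            # store page, possibly compressed
        for link in extract_links(html):       # Link Extractor
            link = urljoin(url, link)
            if link not in already_seen:
                already_seen.add(link)
                heapq.heappush(pq, (-0.5, link))   # toy priority; could be PageRank-driven
    return page_repository
```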

(7)

Parallel Crawlers

The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication

Dynamic assignment

Central coordinator dynamically assigns URLs to crawlers

Links are given to the Central coordinator (bottleneck?)

Static assignment

Web is statically partitioned and assigned to crawlers

Crawler only crawls its part of the web

(8)

Two problems

Load balancing the #URLs assigned to downloaders:

Static schemes based on hosts may fail

www.geocities.com/….

www.di.unipi.it/

Dynamic “relocation” schemes may be complicated

Managing the fault-tolerance:

What about the death of downloaders? D → D−1, new hash !!!

What about new downloaders? D → D+1, new hash !!!

Let D be the number of downloaders.

hash(URL) maps a URL to {0,...,D−1}.

Downloader x fetches the URLs U s.t.

hash(U) = x

(9)

A nice technique: Consistent Hashing

A tool for:

Spidering

Web Cache

P2P

Routers Load Balance

Distributed FS

Item and servers mapped to unit circle

Item K assigned to first server N such that ID(N) ≥ ID(K)

What if a downloader goes down?

What if a new downloader appears?

Each server gets replicated log S times

[monotone] adding a new server moves points between one old to the new, only.

[balance] Prob. an item goes to a server is ≤ O(1)/S

[load] any server gets ≤ (I/S) log S items w.h.p.

[scale] you can copy each server more times...
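A minimal Python sketch of a consistent-hashing ring for assigning URLs to downloaders (the MD5-based placement on a discretized circle and the constant replica count are assumptions of this sketch): when a downloader dies or appears, only the items adjacent to it on the circle move.

```python
import bisect, hashlib

def _h(key: str) -> int:
    # Position on the "unit circle", discretized to [0, 2^32)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class ConsistentHash:
    """Each downloader is placed on the circle `replicas` times (the 'log S copies')."""
    def __init__(self, downloaders, replicas=64):
        self.replicas = replicas
        self.downloaders = set(downloaders)
        self._build()

    def _build(self):
        self.ring = sorted((_h(f"{d}#{i}"), d)
                           for d in self.downloaders for i in range(self.replicas))
        self.points = [p for p, _ in self.ring]

    def owner(self, url: str) -> str:
        # First downloader clockwise from the URL's position (wrapping around)
        i = bisect.bisect_right(self.points, _h(url)) % len(self.ring)
        return self.ring[i][1]

    def add(self, d):      # a new downloader appears: only nearby URLs move to it
        self.downloaders.add(d); self._build()

    def remove(self, d):   # a downloader dies: its URLs move to their successors
        self.downloaders.discard(d); self._build()

ch = ConsistentHash(["dl0", "dl1", "dl2"])
print(ch.owner("http://www.di.unipi.it/"))
ch.add("dl3")   # unlike hash(URL) mod D, most assignments stay unchanged
print(ch.owner("http://www.di.unipi.it/"))
```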

(10)

Examples: Open Source

Nutch, also used by WikiSearch

http://www.nutch.org

Heritrix, used by Archive.org

http://archive-crawler.sourceforge.net/index.html

Consistent Hashing

Amazon’s Dynamo

(11)

Ranking

Link-based Ranking

(2nd generation)

Reading 21

(12)

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Undirected popularity:

Each page gets a score given by the number of in-links plus the number of out-links (e.g. 3+2=5).

Directed popularity:

Score of a page = number of its in-links (e.g. 3).

Easy to SPAM

(13)

Second generation:

PageRank

Each link has its own importance!!

PageRank is

independent of the query

many interpretations…

(14)

Basic Intuition…

What about nodes with no in/out links?

(15)

Google’s Pagerank

r(i) = α · Σ_{j ∈ B(i)} r(j) / #out(j)  +  (1 − α) · 1/N

B(i): set of pages linking to i.

#out(j): number of outgoing links from j.

e: vector of components 1/sqrt{N}.

The (1 − α) addend is the random jump.

r is the principal eigenvector of the system

r = [ α W^T + (1 − α) e e^T ] × r

where W_{j,i} = 1/#out(j) if j ∈ B(i), and 0 otherwise.
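A small Python sketch of the power-iteration reading of the formula above (the dangling-node rule and the toy graph are assumptions of this sketch): repeatedly apply r ← α W^T r + (1 − α)/N until the ranks stabilize.

```python
import numpy as np

def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration for r(i) = alpha * sum_{j in B(i)} r(j)/#out(j) + (1-alpha)/N.
    out_links[j] = list of pages j points to; dangling nodes jump uniformly."""
    N = len(out_links)
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        new = np.full(N, (1.0 - alpha) / N)    # random-jump mass
        for j, outs in enumerate(out_links):
            if outs:
                share = alpha * r[j] / len(outs)
                for i in outs:
                    new[i] += share            # j passes rank to the pages it links
            else:
                new += alpha * r[j] / N        # dangling node: spread rank uniformly
        r = new
    return r

# Toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0
print(pagerank([[1, 2], [2], [0]]))
```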

(16)

Three different interpretations

Graph (intuitive interpretation)

Co-citation

Matrix (easy for computation)

Eigenvector computation or a linear system solution

Markov Chain (useful to prove convergence)

a sort of Usage Simulation

Any node → its neighbors

“In the steady state” each page has a long-term visit rate

- use this as the page’s score.

(17)

Pagerank: use in Search Engines

Preprocessing:

Given graph, build matrix

Compute its principal eigenvector r

r[i] is the pagerank of page i

We are interested in the relative order

Query processing:

Retrieve pages containing query terms

Rank them by their Pagerank

The final order is query-independent.

(The matrix is α W^T + (1 − α) e e^T, as above.)

(18)

HITS: Hypertext Induced Topic Search

(19)

Calculating HITS

It is query-dependent

Produces two scores per page:

Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.

Hub score: a good hub page for a topic points to many authoritative pages for that topic.

(20)

Authority and Hub scores

[Figure: node 1 is pointed to by nodes 2, 3, 4, and points to nodes 5, 6, 7]

a(1) = h(2) + h(3) + h(4)

h(1) = a(5) + a(6) + a(7)

(21)

HITS: Link Analysis Computation

Where:

a: vector of Authority scores

h: vector of Hub scores

A: adjacency matrix, in which a_{i,j} = 1 if i → j

h = A a

a = A^T h

so h = A A^T h and a = A^T A a

Thus, h is an eigenvector of A A^T and a is an eigenvector of A^T A.

Both A A^T and A^T A are symmetric matrices.
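A compact numpy sketch of this computation (the toy base set is made up): alternating h = A a and a = A^T h with normalization converges to the principal eigenvectors of A A^T and A^T A.

```python
import numpy as np

def hits(adj, iters=50):
    """Iterate h = A a, a = A^T h with normalization, so that a converges to the
    principal eigenvector of A^T A and h to that of A A^T."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h; a /= np.linalg.norm(a)   # authorities: pointed to by good hubs
        h = A @ a;   h /= np.linalg.norm(h)   # hubs: point to good authorities
    return a, h

# Toy base set: pages 0,1 point to page 2; page 2 points to page 3
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])
a, h = hits(A)
print("authority:", a.round(2), "hub:", h.round(2))
```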

(22)

Weighting links

Weight a link more if the query occurs in the neighborhood of the link (e.g. anchor text).

Unweighted:

h(x) = Σ_{x→y} a(y)

a(x) = Σ_{y→x} h(y)

Weighted:

h(x) = Σ_{x→y} w(x, y) · a(y)

a(x) = Σ_{y→x} w(y, x) · h(y)

(23)

Latent Semantic Indexing

(mapping onto a smaller space of latent concepts)

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 18

(24)

Speeding up cosine computation

What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?

Now, O(nm)

Then, O(km+kn) where k << n,m

Two methods:

“Latent semantic indexing”

Random projection

(25)

A sketch

LSI is data-dependent

Create a k-dim subspace by eliminating redundant axes

Pull together “related” axes – hopefully

car and automobile

Random projection is data-independent

Choose a k-dim subspace that guarantees good stretching properties with high probability between pairs of points.

What about polysemy ?

(26)

Notions from linear algebra

Matrix A, vector v

Matrix transpose (A^t)

Matrix product

Rank

Eigenvalues and eigenvectors v: A v = λ v

(27)

Overview of LSI

Pre-process docs using a technique from linear algebra called Singular Value

Decomposition

Create a new (smaller) vector space

Queries handled (faster) in this new space

(28)

Singular-Value Decomposition

Recall the m × n matrix A of terms × docs.

A has rank r ≤ m, n

Define the term-term correlation matrix T = A A^t

T is a square, symmetric m × m matrix

Let P be the m × r matrix of the eigenvectors of T

Define the doc-doc correlation matrix D = A^t A

D is a square, symmetric n × n matrix

Let R be the n × r matrix of the eigenvectors of D

(29)

A’s decomposition

There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product).

It turns out that A = P Σ R^t,

where Σ is a diagonal matrix with the eigenvalues of T = A A^t in decreasing order.

A  =  P  ·  Σ  ·  R^t
m×n   m×r   r×r   r×n

(30)

For some k << r, zero out all but the k biggest eigenvalues in Σ [the choice of k is crucial]

Denote by Σ_k this new version of Σ, having rank k

Typically k is about 100, while r (A’s rank) is > 10,000

A_k  =  P  ·  Σ_k  ·  R^t
m×n     m×r   r×r    r×n

Dimensionality reduction: the 0-columns/0-rows of Σ_k make the last r − k columns of P and rows of R^t useless, so

A_k  =  P_k  ·  Σ_k  ·  R_k^t
m×n     m×k    k×k     k×n

(31)

Guarantee

A_k is a pretty good approximation to A:

Relative distances are (approximately) preserved

Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:

min_{B : rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}

min_{B : rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2

where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2

(32)

Reduction

X_k = Σ_k R^t is the doc-matrix, k × n, hence reduced to k dimensions

Take the doc-correlation matrix:

It is D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t)

Approximating Σ with Σ_k, we get A^t A ≈ X_k^t X_k (both are n × n matrices)

We use X_k to define A’s projection:

X_k = Σ_k R^t; substituting R^t = Σ^{−1} P^t A, we get X_k = Σ_k Σ^{−1} P^t A = P_k^t A.

In fact, Σ_k Σ^{−1} P^t = P_k^t, which is a k × m matrix.

This means that to reduce a doc/query vector it is enough to multiply it by P_k^t

Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn)

R, P are formed by the orthonormal eigenvectors of the matrices D, T
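A small numpy sketch of the whole pipeline under the slides’ notation (the toy term-doc matrix and query are made up): compute the SVD, keep the top k concepts, verify that X_k = Σ_k R^t equals P_k^t A as derived above, then project a query and rank documents in k dimensions.

```python
import numpy as np

# Toy m x n term-doc matrix A (m=5 terms, n=4 docs)
A = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 0.]])

P, s, Rt = np.linalg.svd(A, full_matrices=False)   # A = P diag(s) R^t
k = 2
Pk = P[:, :k]                                      # m x k: top-k "concept" directions

Xk = np.diag(s[:k]) @ Rt[:k, :]                    # k x n doc matrix: X_k = Sigma_k R^t
assert np.allclose(Xk, Pk.T @ A)                   # equals P_k^t A, as derived above

q = np.array([1., 1., 0., 0., 0.])                 # query on terms 0 and 1
q_k = Pk.T @ q                                     # reduce the query: q' = P_k^t q

# Cosine similarity in k dims: O(kn + km) instead of O(mn)
sims = (Xk.T @ q_k) / (np.linalg.norm(Xk, axis=0) * np.linalg.norm(q_k))
print(sims.round(2))    # docs 0,1 (about terms 0,1) score highest
```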

(33)

Which are the concepts?

The c-th concept = the c-th row of P_k^t (which is k × m)

Denote it by P_k^t[c], whose size is m = #terms

P_k^t[c][i] = strength of association between the c-th concept and the i-th term

Projected document: d’_j = P_k^t · d_j

d’_j[c] = strength of concept c in d_j

Projected query: q’ = P_k^t · q

q’[c] = strength of concept c in q

(34)

Random Projections

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Slides only !

(35)

An interesting math result

[The statement itself was an image lost in extraction: a suitable random linear map f preserves, with high probability, the distance ||u − v|| of every pair of points up to a factor (1 ± ε).]

Setting v = 0 we also get a bound on f(u)’s stretching!!!

d is our previous m = #terms

(36)

What about the cosine-distance ?

[Derivation lost in extraction: bound it via f(u)’s and f(v)’s stretching, substituting the formula above.]

(37)

A practical-theoretical idea !!!

E[r_{i,j}] = 0,  Var[r_{i,j}] = 1
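A minimal numpy sketch of this idea (the sizes and the Gaussian choice of entries are assumptions; any distribution with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 works): project with a random k × m matrix scaled by 1/√k, which is data-independent, and compare a pairwise distance before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10_000, 500, 100          # terms, reduced dims, #docs (toy sizes)

# Random matrix with E[r_ij] = 0, Var[r_ij] = 1 (Gaussian; +1/-1 entries also work)
R = rng.standard_normal((k, m))

docs = rng.random((m, n))           # toy doc vectors in R^m
proj = (R @ docs) / np.sqrt(k)      # f(u) = R u / sqrt(k), data-independent

# Pairwise distances are approximately preserved, with high probability
u, v = docs[:, 0], docs[:, 1]
fu, fv = proj[:, 0], proj[:, 1]
print(np.linalg.norm(u - v), np.linalg.norm(fu - fv))   # close to each other
```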

(38)

Finally...

Random projections hide large constants:

k ≈ (1/ε)^2 · log d, so it may be large…

but they are simple and fast to compute

LSI is intuitive and may scale to any k:

optimal under various metrics,

but costly to compute

(39)

Document duplication

(exact or approximate)

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Slides only!

(40)

Duplicate documents

The web is full of duplicated content

Few cases of exact duplicates

Many cases of near duplicates

E.g., Last modified date the only difference between two copies of a page

Sec. 19.6

(41)

Natural Approaches

Fingerprinting:

only works for exact matches

Random Sampling

sample substrings (phrases, sentences, etc)

hope: similar documents → similar samples

But – even samples of the same document will differ

Edit-distance

metric for approximate string-matching

expensive – even for one pair of strings

impossible – for 10^32 web documents

(42)

Obvious techniques

Checksum – no worst-case collision probability guarantees

MD5 – cryptographically-secure string hashes

relatively slow

Karp-Rabin’s Scheme

Algebraic technique – arithmetic on primes

Efficient and other nice properties…

Exact-Duplicate Detection

(43)

Karp-Rabin Fingerprints

Consider an m-bit string A = a_1 a_2 … a_m

Assume a_1 = 1 and fixed-length strings (wlog)

Basic values:

Choose a prime p in the universe U, such that 2p uses few memory-words (hence U ≈ 2^64)

Set h = d^{m−1} mod p

Fingerprints: f(A) = A mod p

Nice property: if B = a_2 … a_m a_{m+1}, then

f(B) = [ d · (A − a_1 · h) + a_{m+1} ] mod p

Prob[false hit] = Prob[p divides (A − B)] = #div(A − B)/|U| ≈ log(A + B)/|U| = m/|U|
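A runnable Python sketch of the rolling rule above over byte strings, i.e. base d = 256 (the fixed Mersenne prime is an assumption for readability; in practice p is drawn at random so that false hits have the stated probability):

```python
def karp_rabin_windows(s: bytes, m: int, d: int = 256, p: int = (1 << 61) - 1):
    """Yield (position, fingerprint) for every length-m window of s,
    updating in O(1) per shift via f(B) = [d*(f(A) - a_1*h) + a_{m+1}] mod p."""
    f = 0
    for c in s[:m]:                     # fingerprint of the first window: A mod p
        f = (f * d + c) % p
    h = pow(d, m - 1, p)                # h = d^{m-1} mod p
    yield 0, f
    for i in range(m, len(s)):
        # slide the window: drop s[i-m] (the leading digit), append s[i]
        f = (d * (f - s[i - m] * h) + s[i]) % p
        yield i - m + 1, f

for pos, fp in karp_rabin_windows(b"abracadabra", m=4):
    print(pos, fp)
```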

(44)

Near-Duplicate Detection

Problem

Given a large collection of documents

Identify the near-duplicate documents

Web search engines

Proliferation of near-duplicate documents

Legitimate – mirrors, local copies, updates, …

Malicious – spam, spider-traps, dynamic URLs, …

Mistaken – spider errors

30% of web-pages are near-duplicates

[1997]

(45)

Desiderata

Storage: only small sketches of each document.

Computation: the fastest possible

Stream Processing:

once the sketch is computed, the source is unavailable

Error Guarantees:

problem scale → small biases have large impact

need formal guarantees – heuristics will not do

(46)

Basic Idea [Broder 1997]

Shingling

dissect document into q-grams (shingles)

represent documents by shingle-sets

reduce problem to set intersection [ Jaccard ]

They are near-duplicates if large shingle-sets intersect enough

We know how to cope with “Set Intersection”

fingerprints of shingles (for space efficiency)

min-hash to estimate intersections sizes (for time and space efficiency)
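A minimal Python sketch of shingling plus exact Jaccard on two toy documents (word-level 4-grams are an assumption of this sketch; real systems fingerprint the shingles, as the next slide shows):

```python
def shingles(text: str, q: int = 4):
    """Set of word q-grams (shingles) of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)}

def jaccard(sa: set, sb: set) -> float:
    """|intersection| / |union| of two shingle-sets."""
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox jumped over the lazy dog"
print(jaccard(shingles(d1), shingles(d2)))   # high value -> near-duplicates
```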

(47)

Doc → shingling → multiset of shingles → fingerprinting → multiset of fingerprints

Documents → sets of 64-bit fingerprints

Fingerprints:

• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)

• Fingerprint space [0, …, U−1]

• In practice, use 64-bit fingerprints, i.e., U = 2^64

• Prob[collision] ≈ (8q)/2^64 << 1

(48)

Similarity of Documents

[Figure: Doc A with shingle-set S_A, Doc B with shingle-set S_B]

• Jaccard measure – similarity of S_A, S_B ⊆ U = [0 … N−1]

• Claim: A & B are near-duplicates if sim(S_A, S_B) is high

sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|

(49)

Speeding-up: Sketch of a document

Intersecting directly the shingles is too costly

Create a “sketch vector” (of size ~200) for each document

Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates

Sec. 19.6

(50)

Sketching by Min-Hashing

Consider S_A, S_B ⊆ P

Pick a random permutation π of P (such as ax + b mod |P|)

Define α = π^{−1}( min{π(S_A)} ) and β = π^{−1}( min{π(S_B)} ),

the minimal elements of S_A and S_B under the permutation π

Lemma:  P[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|

(51)

Sum up…

Similarity sketch sk(A) = the k minimal elements under π(S_A)

Is k fixed, or a fixed ratio of |S_A|, |S_B| ?

We might also take k permutations and the min of each

Similarity Sketches sk(A):

Succinct representation of the fingerprint set S_A

Allows efficient estimation of sim(S_A, S_B)

Basic idea is to use min-hash of fingerprints

Note: we can reduce the variance by using a larger k
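A minimal Python sketch of the k-permutations variant (the linear maps ax + b mod p standing in for permutations, and the toy sets, are assumptions of this sketch): by the Lemma above, the fraction of agreeing minima estimates the Jaccard similarity.

```python
import random

def minhash_sketch(fingerprints: set, k: int = 200,
                   prime: int = (1 << 61) - 1, seed: int = 42):
    """sk(A): one min per 'permutation' pi(x) = (a*x + b) mod prime, k permutations."""
    rnd = random.Random(seed)
    perms = [(rnd.randrange(1, prime), rnd.randrange(prime)) for _ in range(k)]
    return [min((a * x + b) % prime for x in fingerprints) for a, b in perms]

def estimate_sim(sk_a, sk_b) -> float:
    # P[mins agree] = |A ∩ B| / |A ∪ B|, so agreement frequency estimates Jaccard
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

A = set(range(0, 100))     # toy fingerprint sets,
B = set(range(20, 120))    # true Jaccard = 80/120 ≈ 0.67
print(estimate_sim(minhash_sketch(A), minhash_sketch(B)))
```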

(52)

Computing Sketch[i] for Doc1

[Figure: the 64-bit fingerprint line [0, 2^64) for Document 1]

Start with the 64-bit fingerprints f(shingles)

Permute them on the number line with π_i

Pick the min value

Sec. 19.6

(53)

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Figure: the permuted fingerprint lines [0, 2^64) of Document 1 and Document 2; are their min values equal?]

Test for 200 random permutations: π_1, π_2, …, π_200

Sec. 19.6

(54)

However…

[Figure: the fingerprints of Document 1 and Document 2 on the same line [0, 2^64)]

A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)

Claim: this happens with probability

Size_of_intersection / Size_of_union

Sec. 19.6

(55)

Sum up…

Brute-force:

compare sk(A) vs. sk(B) for all pairs of documents A and B.

Locality sensitive hashing (LSH):

Compute sk(A) for each document A

Use LSH of all sketches, briefly (see the sketch below):

Take h elements of sk(A) as an ID (may induce false positives)

Create t IDs (to reduce the false negatives)

If one ID matches with another one (wrt the same h-selection), then the corresponding docs are probably near-duplicates; hence compare them.
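A minimal Python sketch of this banding scheme (the parameter names h, t follow the slide; the bucket construction and the toy sketches are assumptions of this sketch): each of the t IDs is a band of h consecutive sketch elements, and documents sharing an ID for the same band become candidate pairs.

```python
from collections import defaultdict

def lsh_candidates(sketches: dict, h: int = 4, t: int = 50):
    """sketches: doc_id -> min-hash sketch (length >= h*t).
    Build t IDs per doc, each from h sketch elements; docs sharing an ID
    (for the same h-selection) become candidate near-duplicate pairs."""
    candidates = set()
    for band in range(t):
        buckets = defaultdict(list)
        for doc, sk in sketches.items():
            band_id = tuple(sk[band * h:(band + 1) * h])   # h elements as the ID
            buckets[band_id].append(doc)
        for docs in buckets.values():                      # collisions -> candidates
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add((docs[i], docs[j]))     # then compare sketches fully
    return candidates

sk = {"doc1": [1, 2, 3, 4] * 50,
      "doc2": [1, 2, 3, 4] * 50,
      "doc3": list(range(200))}
print(lsh_candidates(sk))   # {('doc1', 'doc2')}
```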
