Crawling
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
24h, 7days “walking” over a Graph
What about the Graph?
BowTie
Directed graph G = (N, E)
N changes (insert, delete) >> 50 * 10^9 nodes
E changes (insert, delete) > 10 links per node
10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, Server load concerns (Minimize load)
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Freshness: How much has changed?
How to parallelize the process?
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
Is this page a new one?
Check if file has been parsed or downloaded before
after 20 million pages, we have “seen” over 200 million URLs
each URL is at least 100 bytes on average
Overall we have about 20 GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, Altavista)
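A minimal Python sketch (an illustration, not the Archive crawler’s actual code) of the Bloom-filter option above for the “have we already seen this URL?” test; the filter size and the hash construction are arbitrary assumptions.

import hashlib

class BloomFilter:
    # Compact, probabilistic "already seen" set: no false negatives,
    # a small false-positive rate that depends on the size and #hashes.
    def __init__(self, n_bits=8 * 1024 * 1024, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        # Derive n_hashes bit positions from a single MD5 digest of the URL.
        digest = hashlib.md5(url.encode()).digest()
        for i in range(self.n_hashes):
            chunk = digest[(4 * i) % 16:(4 * i) % 16 + 4]
            yield (int.from_bytes(chunk, "big") + i) % self.n_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
url = "http://www.di.unipi.it/"
if url not in seen:      # possibly a false positive, never a false negative
    seen.add(url)        # mark the URL as already crawled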
Link Extractor:
while(<Page Repository is not empty>){
<take a page p (check if it is new)>
<extract links contained in p within href>
<extract links contained in javascript>
<extract …>
<insert these links into the Priority Queue>
}
Downloaders:
while(<Assigned Repository is not empty>){
<extract url u>
<download page(u)>
<send page(u) to the Page Repository>
<store page(u) in a proper archive, possibly compressed>
}
Crawler Manager:
while(<Priority Queue is not empty>){
<extract some URL u having the highest priority>
foreach u extracted {
if ( (u ∉ “Already Seen Page”) ||
( u ∈ “Already Seen Page” && <u’s version on the Web is more recent> ) ) {
<resolve u wrt DNS>
<send u to the Assigned Repository>
} }
}
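Putting the three modules together, here is a compact Python sketch (a toy under many assumptions: single process, in-memory queues, a made-up priority() function), not the slides’ actual system.

import heapq, re, socket
import urllib.parse, urllib.request

priority_queue = []        # (priority, url) pairs; smaller value = higher priority
assigned_repository = []   # URLs handed over to the downloaders
page_repository = []       # (url, html) pairs waiting for the Link Extractor
already_seen = set()       # the "Already Seen Page" structure

def priority(url):
    # Placeholder "goodness" score (see Page selection): BFS order, PageRank, topic, ...
    return len(url)

def link_extractor(page):
    # Take a page, extract its href links, insert them into the Priority Queue.
    url, html = page
    for link in re.findall(r'href="(http[^"]+)"', html):
        heapq.heappush(priority_queue, (priority(link), link))

def crawler_manager():
    # Move new (or refreshed) URLs from the Priority Queue to the Assigned Repository.
    while priority_queue:
        _, u = heapq.heappop(priority_queue)
        if u not in already_seen:            # or: seen, but with a newer version on the Web
            already_seen.add(u)
            host = urllib.parse.urlsplit(u).hostname
            if host:
                socket.gethostbyname(host)   # resolve u wrt DNS
            assigned_repository.append(u)

def downloader():
    # Fetch the assigned URLs and store the pages in the Page Repository.
    while assigned_repository:
        u = assigned_repository.pop()
        with urllib.request.urlopen(u) as resp:
            page_repository.append((u, resp.read().decode("utf-8", "ignore")))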
Crawler “cycle of life”
[Diagram: the Link Extractor feeds links into the Priority Queue (PQ); the Crawler Manager moves URLs from the PQ to the Assigned Repository (AR); the Downloaders fetch the assigned URLs and store the pages in the Page Repository (PR), which in turn feeds the Link Extractor.]
Parallel Crawlers
Web is too big to be crawled by a single crawler, work should be divided avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to the Central coordinator (possible bottleneck)
Static assignment
Web is statically partitioned and assigned to crawlers
Crawler only crawls its part of the web
Two problems
Load balancing the #URLs assigned to downloaders:
Static schemes based on hosts may fail
www.geocities.com/….
www.di.unipi.it/
Dynamic “relocation” schemes may be complicated
Managing the fault-tolerance:
What about the death of downloaders? D → D-1, new hash!!!
What about new downloaders? D → D+1, new hash!!!
Let D be the number of downloaders.
hash(URL) maps a URL to {0,...,D-1}.
Downloader x fetches the URLs U s.t. hash(U) = x
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Routers Load Balance
Distributed FS
Item and servers mapped to unit circle
Item K assigned to first server N such that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves points only from one old server to the new one.
[balance] Prob. an item goes to a server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can copy each server more times...
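A minimal consistent-hashing sketch in Python (the hash function, identifiers and replication factor are illustrative assumptions): downloaders and URLs are mapped to the same circle (here the 64-bit integer ring), each downloader is replicated about log S times, and a URL goes to the first downloader clockwise from it.

import bisect, hashlib, math

def ring_id(key):
    # Map a string to a point on the "unit circle" (here: 0 .. 2^64 - 1).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, downloaders, replicas=None):
        self.replicas = replicas or max(1, int(math.log2(len(downloaders))) + 1)
        self._points = []                     # sorted (position, downloader) pairs
        for d in downloaders:
            self.add(d)

    def add(self, downloader):                # new downloader: only nearby URLs move to it
        for r in range(self.replicas):
            bisect.insort(self._points, (ring_id(f"{downloader}#{r}"), downloader))

    def remove(self, downloader):             # dead downloader: its URLs go to the successors
        self._points = [p for p in self._points if p[1] != downloader]

    def assign(self, url):
        pos = ring_id(url)
        i = bisect.bisect_right(self._points, (pos, chr(0x10FFFF)))
        return self._points[i % len(self._points)][1]

ch = ConsistentHash(["d0", "d1", "d2", "d3"])
print(ch.assign("http://www.di.unipi.it/"))   # which downloader fetches this URL
ch.remove("d2")                               # only d2's URLs get reassigned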
Examples: Open Source
Nutch, also used by WikiSearch
http://www.nutch.org
Heritrix, used by Archive.org
http://archive-crawler.sourceforge.net/index.html
Consistent Hashing
Amazon’s Dynamo
Ranking
Link-based Ranking
(2nd generation)
Reading 21
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity:
Each page gets a score given by the number of in-links plus the number of out-links (e.g. 3+2=5).
Directed popularity:
Score of a page = number of its in-links (e.g. 3).
Easy to SPAM
Second generation:
PageRank
Each link has its own importance!!
PageRank is
independent of the query
many interpretations…
Basic Intuition…
What about nodes with no in/out links?
Google’s Pagerank
r(i) = α · Σ_{j ∈ B(i)} r(j) / #out(j) + (1 - α) · 1/N

where the matrix T has T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise.

B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector of components 1/sqrt{N}.

In matrix form, r is the principal eigenvector, and the (1-α) e e^T term is the random jump:
r = [α T + (1-α) e e^T] × r
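A small power-iteration sketch of the formula above (a toy example; α = 0.85 is an assumed damping value, and dangling nodes with no out-links are handled by spreading their rank uniformly).

def pagerank(out_links, alpha=0.85, iters=50):
    # out_links: dict node -> list of nodes it points to. Returns r with sum(r) = 1.
    nodes = list(out_links)
    N = len(nodes)
    r = {i: 1.0 / N for i in nodes}
    for _ in range(iters):
        new = {i: (1.0 - alpha) / N for i in nodes}    # random-jump component
        for j in nodes:
            targets = out_links[j]
            if targets:                                # follow-the-links component
                share = alpha * r[j] / len(targets)    # alpha * r(j) / #out(j)
                for i in targets:
                    new[i] += share
            else:                                      # dangling node: spread uniformly
                for i in nodes:
                    new[i] += alpha * r[j] / N
        r = new
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))        # only the relative order matters for ranking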
Three different interpretations
Graph (intuitive interpretation)
Co-citation
Matrix (easy for computation)
Eigenvector computation or a linear system solution
Markov Chain (useful to prove convergence)
a sort of Usage Simulation
[Random-surfer figure: from the current page, either jump to any node (teleport) or follow a link to one of its neighbors.]
“In the steady state” each page has a long-term visit rate: use this as the page’s score.
Pagerank: use in Search Engines
Preprocessing:
Given the graph, build the matrix [α T + (1-α) e e^T]
Compute its principal eigenvector r
r[i] is the pagerank of page i
We are interested in the relative order
Query processing:
Retrieve pages containing query terms
Rank them by their Pagerank
The final order is query-independent
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent
Produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: A good hub page for a topic
points to many authoritative pages for that
topic.
Authority and Hub scores
[Figure: node 1 receives links from nodes 2, 3, 4 and points to nodes 5, 6, 7.]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
h = A a,  a = A^T h   ⇒   h = A A^T h,  a = A^T A a
Where
a: vector of Authority scores; h: vector of Hub scores.
A: adjacency matrix in which a_{i,j} = 1 if i → j
Thus, h is an eigenvector of A A^T, and a is an eigenvector of A^T A
Symmetric matrices
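A short iterative sketch (illustrative, not the slides’ code) of the computation h = A a, a = Aᵗ h with normalization, on a toy adjacency list.

import math

def hits(out_links, iters=50):
    # out_links: dict x -> list of pages x points to. Returns (authority, hub) scores.
    nodes = list(out_links)
    a = {x: 1.0 for x in nodes}
    h = {x: 1.0 for x in nodes}
    for _ in range(iters):
        # a = A^T h : a good authority is pointed to by many good hubs
        a = {x: 0.0 for x in nodes}
        for y in nodes:
            for x in out_links[y]:
                a[x] += h[y]
        # h = A a : a good hub points to many good authorities
        h = {y: sum(a[x] for x in out_links[y]) for y in nodes}
        # normalize to keep the values bounded
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {x: v / na for x, v in a.items()}
        h = {x: v / nh for x, v in h.items()}
    return a, h

graph = {1: [5, 6, 7], 2: [1], 3: [1], 4: [1], 5: [], 6: [], 7: []}
authority, hub = hits(graph)   # node 1: in-links from 2,3,4 and out-links to 5,6,7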
Weighting links
Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).
Unweighted:
a(x) = Σ_{y → x} h(y)
h(x) = Σ_{x → y} a(y)
Weighted:
a(x) = Σ_{y → x} w(y, x) · h(y)
h(x) = Σ_{x → y} w(x, y) · a(y)
Latent Semantic Indexing
(mapping onto a smaller space of latent concepts)
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 18
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now, O(nm)
Then, O(km+kn) where k << n,m
Two methods:
“Latent semantic indexing”
Random projection
A sketch
LSI is data-dependent
Create a k-dim subspace by eliminating redundant axes
Pull together “related” axes – hopefully
car and automobile
Random projection is data-independent
Choose a k-dim subspace that guarantees good stretching properties with high probability between pairs of points.
What about polysemy ?
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalues λ and eigenvectors v: A v = λ v
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value
Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A.
A has rank r ≤ m, n
Define the term-term correlation matrix T = A A^t
T is a square, symmetric m × m matrix
Let P be the m × r matrix of eigenvectors of T
Define the doc-doc correlation matrix D = A^t A
D is a square, symmetric n × n matrix
Let R be the n × r matrix of eigenvectors of D
A’s decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product)
It turns out that A = P Σ R^t
Where Σ is a diagonal matrix with the eigenvalues of T = A A^t in decreasing order.
A = P Σ R^t, with dimensions (m × n) = (m × r)(r × r)(r × n)
For some k << r, zero out all but the k biggest eigenvalues in Σ [choice of k is crucial]
Denote by Σ_k this new version of Σ, having rank k
Typically k is about 100, while r (A’s rank) is > 10,000
Dimensionality reduction
A_k = P Σ_k R^t   with dimensions (m × r)(r × r)(r × n)
The zeroed rows/columns of Σ_k make the corresponding columns of P and rows of R^t useless, so effectively:
A_k = P_k Σ_k R_k^t   with dimensions (m × k)(k × k)(k × n)
Guarantee
A_k is a pretty good approximation to A:
Relative distances are (approximately) preserved
Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
Reduction
X_k = Σ_k R^t is the doc-matrix k × n, hence reduced to k dims
Take the doc-correlation matrix:
It is D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t)
Approximating Σ with Σ_k, we thus get A^t A ≈ X_k^t X_k (both are n × n matrices)
We use X_k to define A’s projection:
X_k = Σ_k R^t; substituting R^t = Σ^{-1} P^t A, we get X_k = P_k^t A
In fact, Σ_k Σ^{-1} P^t = P_k^t, which is a k × m matrix
This means that to reduce a doc/query vector it is enough to multiply it by P_k^t
Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn)
R, P are formed by orthonormal eigenvectors of the matrices D, T
Which are the concepts ?
c-th concept = c-th row of P_k^t (which is k × m)
Denote it by P_k^t[c], whose size is m = #terms
P_k^t[c][i] = strength of association between the c-th concept and the i-th term
Projected document: d'_j = P_k^t d_j
d'_j[c] = strength of concept c in d_j
Projected query: q' = P_k^t q
q'[c] = strength of concept c in q
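A numpy sketch of the LSI pipeline just described (the term-document matrix, its values and k are toy assumptions): compute the SVD A = P Σ Rᵗ, keep the top k singular values, and project docs and queries with P_kᵗ.

import numpy as np

# Toy term-document matrix A (m terms x n docs); entries are term counts.
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

k = 2                                     # number of latent concepts to keep

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P diag(sigma) R^t
Pk_t = P[:, :k].T                         # P_k^t: the k x m projection onto concepts

docs_k = Pk_t @ A                         # each column: a doc reduced to k dims
query = np.array([1, 1, 0, 0], dtype=float)
query_k = Pk_t @ query                    # the query projected into the same k-dim space

# Cosine similarity in the reduced space: O(k) per doc instead of O(m).
sims = (docs_k.T @ query_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(query_k) + 1e-12)
print(sims.round(3))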
Random Projections
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only !
An interesting math result (the Johnson–Lindenstrauss lemma)
For any set of points in R^d there is a map f: R^d → R^k, with k = O((1/ε)^2 · log #points), such that for every pair u, v:
(1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2
Setting v = 0 we also get a bound on f(u)’s stretching!!!
Here d is our previous m = #terms
What about the cosine-distance? It follows from f(u)’s and f(v)’s stretching, substituting the formula above.
A practical-theoretical idea !!!
Take R to be a k × d random matrix with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1, and map u → R u / sqrt{k}
Finally...
Random projections hide large constants
k ≈ (1/ε)^2 · log d, so it may be large…
it is simple and fast to compute
LSI is intuitive and may scale to any k
optimal under various metrics
but costly to compute
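A sketch (with made-up sizes) of the practical idea above: a k × m random matrix with zero-mean, unit-variance entries (here ±1), scaled by 1/√k, approximately preserves distances, and hence cosine similarity, with high probability.

import numpy as np

rng = np.random.default_rng(0)

m, k = 10000, 400                          # original #terms, reduced dimension
R = rng.choice([-1.0, 1.0], size=(k, m))   # E[r_ij] = 0, Var[r_ij] = 1

def project(v):
    return (R @ v) / np.sqrt(k)            # f(v): distances preserved up to (1 ± eps) w.h.p.

u = rng.random(m)
v = rng.random(m)
print(np.linalg.norm(u - v), np.linalg.norm(project(u) - project(v)))  # should be close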
Document duplication
(exact or approximate)
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only!
Duplicate documents
The web is full of duplicated content
Few cases of exact duplicates
Many cases of near duplicates
E.g., Last modified date the only difference between two copies of a page
Natural Approaches
Fingerprinting:
only works for exact matches
Random Sampling
sample substrings (phrases, sentences, etc)
hope: similar documents ⇒ similar samples
But – even samples of same document will differ
Edit-distance
metric for approximate string-matching
expensive – even for one pair of strings
impossible – for 10^32 web documents
Obvious techniques
Checksum – no worst-case collision probability guarantees
MD5 – cryptographically-secure string hashes
relatively slow
Karp-Rabin’s Scheme
Algebraic technique – arithmetic on primes
Efficient and other nice properties…
Exact-Duplicate Detection
Karp-Rabin Fingerprints
Consider – an m-bit string A = a_1 a_2 … a_m
Assume – a_1 = 1 and fixed-length strings (wlog)
Basic values:
Choose a prime p in the universe U, such that 2p uses few memory-words (hence U ≈ 2^64)
Set h = d^{m-1} mod p
Fingerprints: f(A) = A mod p
Nice property is that if B = a_2 … a_m a_{m+1}
f(B) = [d · (A - a_1 · h) + a_{m+1}] mod p
Prob[false hit] = Prob[p divides (A - B)] = #div(A - B)/U ≈ (log(A + B))/U = m/U
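A toy sketch of the sliding fingerprint above for bit strings (the prime and the window length are illustrative; a real implementation would pick p at random so that 2p fits in a machine word).

d = 2                             # alphabet size: bit strings
m = 16                            # window (string) length
p = (1 << 61) - 1                 # a prime such that 2p fits in a 64-bit word
h = pow(d, m - 1, p)              # h = d^(m-1) mod p

def fingerprint(bits):            # f(A) = A mod p, reading the bits as a base-d number
    f = 0
    for b in bits:
        f = (f * d + b) % p
    return f

def roll(f_A, a1, a_next):        # f(B) = [d*(f(A) - a1*h) + a_{m+1}] mod p
    return (d * (f_A - a1 * h) + a_next) % p

text = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
f = fingerprint(text[:m])
for i in range(1, len(text) - m + 1):          # slide the window one bit at a time
    f = roll(f, text[i - 1], text[i + m - 1])
    assert f == fingerprint(text[i:i + m])     # same value, computed in O(1) per step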
Near-Duplicate Detection
Problem
Given a large collection of documents
Identify the near-duplicate documents
Web search engines
Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …
Mistaken – spider errors
30% of web-pages are near-duplicates
[1997]
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible
Stream Processing :
once sketch computed, source is unavailable
Error Guarantees
problem scale ⇒ small biases have large impact
need formal guarantees – heuristics will not do
Basic Idea [Broder 1997]
Shingling
dissect document into q-grams (shingles)
represent documents by shingle-sets
reduce problem to set intersection [ Jaccard ]
They are near-duplicates if large shingle-sets intersect enough
We know how to cope with “Set Intersection”
fingerprints of shingles (for space efficiency)
min-hash to estimate intersections sizes (for time and space efficiency)
[Pipeline: Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints; i.e., Documents → Sets of 64-bit fingerprints]
Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• Fingerprint space [0, …, U-1]
• In practice, use 64-bit fingerprints, i.e., U = 2^64
• Prob[collision] ≈ (8q)/2^64 << 1
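A small sketch of the shingling step (q and the documents are toy choices; the slides use Karp-Rabin fingerprints, here a truncated SHA-1 stands in for a 64-bit fingerprint, which is an assumption).

import hashlib

def shingles(text, q=8):
    # Set of q-gram shingles of a document (character q-grams for simplicity).
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def fingerprint64(shingle):
    # Stand-in for a Karp-Rabin fingerprint: a 64-bit value in [0, 2^64).
    return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

def fingerprint_set(text, q=8):
    return {fingerprint64(s) for s in shingles(text, q)}

S_A = fingerprint_set("the quick brown fox jumps over the lazy dog")
S_B = fingerprint_set("the quick brown fox jumped over the lazy dog")
jaccard = len(S_A & S_B) / len(S_A | S_B)      # exact set-intersection similarity
print(round(jaccard, 2))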
Similarity of Documents
• Jaccard measure – similarity of S_A, S_B ⊆ U = [0 … N-1]
• Claim: A & B are near-duplicates if sim(S_A, S_B) is high
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Speeding-up: Sketch of a document
Intersecting directly the shingles is too costly
Create a “sketch vector” (of size
~200) for each document
Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
Sketching by Min-Hashing
Consider S_A, S_B ⊆ P
Pick a random permutation π of P (such as ax + b mod |P|)
Define α = π^{-1}( min{π(S_A)} ), β = π^{-1}( min{π(S_B)} )
i.e., the minimal element of each set under the permutation π
Lemma:
P[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|
Sum up…
Similarity sketch sk(A) = the k minimal elements under π(S_A)
Is k fixed, or a fixed ratio of |S_A|, |S_B|?
We might also take k permutations and the min of each
Similarity Sketches sk(A):
Succinct representation of the fingerprint sets S_A
Allows efficient estimation of sim(S_A, S_B)
Basic idea is to use min-hash of fingerprints
Note: we can reduce the variance by using a larger k
Computing Sketch[i] for Doc1
[Figure: Document 1’s shingle fingerprints on the number line [0, 2^64)]
Start with the 64-bit f(shingles)
Permute on the number line with π_i
Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: the permuted fingerprints of Document 1 and Document 2 on the number line [0, 2^64); are their minima equal?]
Test for 200 random permutations: π_1, π_2, …, π_200
However…
[Figure: the shingle fingerprints of Document 1 and Document 2 under the same permutation]
α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: this happens with probability Size_of_intersection / Size_of_union
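A sketch (with toy parameters and toy fingerprint sets) of min-hashing over the fingerprint sets: 200 random “permutations” of the form π(x) = (a·x + b) mod P for a prime P larger than the universe, and the fraction of agreeing minima estimates the Jaccard similarity. Comparing the minima of the π-values is equivalent to comparing π^{-1} of them.

import random

P = (1 << 89) - 1                        # a Mersenne prime larger than the 2^64 universe
K = 200                                  # number of random permutations

random.seed(42)
perms = [(random.randrange(1, P), random.randrange(P)) for _ in range(K)]

def sketch(fingerprints):
    # sk(S): for each permutation pi(x) = (a*x + b) mod P, keep the minimal pi-value.
    return [min((a * x + b) % P for x in fingerprints) for a, b in perms]

def estimated_similarity(sk_A, sk_B):
    # P[minima agree] = |S_A ∩ S_B| / |S_A ∪ S_B|  (the Lemma above)
    return sum(x == y for x, y in zip(sk_A, sk_B)) / K

S_A = {11, 22, 33, 44, 55, 66, 77, 88}   # toy fingerprint sets (true Jaccard = 0.6)
S_B = {11, 22, 33, 44, 55, 66, 99, 100}
print(round(estimated_similarity(sketch(S_A), sketch(S_B)), 2))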
Sum up…
Brute-force:
compare sk(A) vs. sk(B) for all the pairs of documents A and B.
Locality sensitive hashing (LSH)
Compute sk(A) for each document A
Use LSH of all sketches, briefly:
Take h elements of sk(A) as ID (may induce false positives)
Create t IDs (to reduce the false negatives)
If one ID matches with another one (wrt same h-selection), then the corresponding docs are probably near-duplicates;
hence compare.
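A brief sketch of the LSH step just outlined (h, t and the grouping scheme are illustrative assumptions): split each 200-element sketch into t bands of h elements, use each band as an ID, and only documents sharing at least one band ID become candidates for the full sketch-vs-sketch comparison.

from collections import defaultdict
import random

def candidate_pairs(sketches, h=5, t=40):
    # sketches: dict doc_id -> list of 200 min-hash values (t bands of h elements each).
    buckets = defaultdict(set)
    for doc, sk in sketches.items():
        for band in range(t):
            band_id = (band, tuple(sk[band * h:(band + 1) * h]))   # one ID per band
            buckets[band_id].add(doc)
    pairs = set()
    for docs in buckets.values():          # docs sharing an ID: probable near-duplicates
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs                           # only these pairs get compared in full

# Toy usage: three 200-element sketches, two of them nearly identical.
random.seed(0)
base = [random.randrange(1 << 32) for _ in range(200)]
sketches = {"A": base,
            "B": base[:190] + [0] * 10,
            "C": [random.randrange(1 << 32) for _ in range(200)]}
print(candidate_pairs(sketches))           # expect {('A', 'B')}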