Crawling
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
24h, 7days “walking” over a Graph
What about the Graph?
BowTie
Directed graph G = (N, E)
N changes (insert, delete) >> 50 * 10^9 nodes
E changes (insert, delete) > 10 links per node
10 * 50*10^9 = 500*10^9 1-entries in the adjacency matrix
Crawling Issues
How to crawl?
Quality: “Best” pages first
Efficiency: Avoid duplication (or near duplication)
Etiquette: Robots.txt, Server load concerns (Minimize load)
How much to crawl? How much to index?
Coverage: How big is the Web? How much do we cover?
Relative Coverage: How much do competitors have?
How often to crawl?
Freshness: How much has changed?
How to parallelize the process?
Page selection
Given a page P, define how “good” P is.
Several metrics:
BFS, DFS, Random
Popularity driven (PageRank, full vs partial)
Topic driven or focused crawling
Combined
Is this page a new one?
Check if file has been parsed or downloaded before
after 20 million pages, we have “seen” over 200 million URLs
each URL is at least 100 bytes on average
Overall we have about 20 GB of URLs
Options: compress URLs in main memory, or use disk
Bloom Filter (Archive)
Disk access with caching (Mercator, Altavista)
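A minimal Python sketch (an illustration, not the Archive crawler’s actual code) of the Bloom-filter option above for the “have we already seen this URL?” test; the filter size and the hash construction are arbitrary assumptions.

import hashlib

class BloomFilter:
    # Compact, probabilistic "already seen" set: no false negatives,
    # a small false-positive rate that depends on the size and #hashes.
    def __init__(self, n_bits=8 * 1024 * 1024, n_hashes=5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, url):
        # Derive n_hashes bit positions from a single MD5 digest of the URL.
        digest = hashlib.md5(url.encode()).digest()
        for i in range(self.n_hashes):
            chunk = digest[(4 * i) % 16:(4 * i) % 16 + 4]
            yield (int.from_bytes(chunk, "big") + i) % self.n_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
url = "http://www.di.unipi.it/"
if url not in seen:      # possibly a false positive, never a false negative
    seen.add(url)        # mark the URL as already crawled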
Link Extractor:
while(<Page Repository is not empty>){
<take a page p (check if it is new)>
<extract links contained in p within href>
<extract links contained in javascript>
<extract …>
<insert these links into the Priority Queue>
}
Downloaders:
while(<Assigned Repository is not empty>){
<extract url u>
<download page(u)>
<send page(u) to the Page Repository>
<store page(u) in a proper archive, possibly compressed>
}
Crawler Manager:
while(<Priority Queue is not empty>){
<extract some URL u having the highest priority>
foreach u extracted {
if ( (u ∉ “Already Seen Page”) ||
( u ∈ “Already Seen Page” && <u’s version on the Web is more recent> ) ) {
<resolve u wrt DNS>
<send u to the Assigned Repository>
} }
}
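Putting the three modules together, here is a compact Python sketch (a toy under many assumptions: single process, in-memory queues, a made-up priority() function), not the slides’ actual system.

import heapq, re, socket
import urllib.parse, urllib.request

priority_queue = []        # (priority, url) pairs; smaller value = higher priority
assigned_repository = []   # URLs handed over to the downloaders
page_repository = []       # (url, html) pairs waiting for the Link Extractor
already_seen = set()       # the "Already Seen Page" structure

def priority(url):
    # Placeholder "goodness" score (see Page selection): BFS order, PageRank, topic, ...
    return len(url)

def link_extractor(page):
    # Take a page, extract its href links, insert them into the Priority Queue.
    url, html = page
    for link in re.findall(r'href="(http[^"]+)"', html):
        heapq.heappush(priority_queue, (priority(link), link))

def crawler_manager():
    # Move new (or refreshed) URLs from the Priority Queue to the Assigned Repository.
    while priority_queue:
        _, u = heapq.heappop(priority_queue)
        if u not in already_seen:            # or: seen, but with a newer version on the Web
            already_seen.add(u)
            host = urllib.parse.urlsplit(u).hostname
            if host:
                socket.gethostbyname(host)   # resolve u wrt DNS
            assigned_repository.append(u)

def downloader():
    # Fetch the assigned URLs and store the pages in the Page Repository.
    while assigned_repository:
        u = assigned_repository.pop()
        with urllib.request.urlopen(u) as resp:
            page_repository.append((u, resp.read().decode("utf-8", "ignore")))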
Crawler “cycle of life”
[Diagram: the Link Extractor feeds links into the Priority Queue (PQ); the Crawler Manager moves URLs from the PQ to the Assigned Repository (AR); the Downloaders fetch the assigned URLs and store the pages in the Page Repository (PR), which in turn feeds the Link Extractor.]
Parallel Crawlers
Web is too big to be crawled by a single crawler, work should be divided avoiding duplication
Dynamic assignment
Central coordinator dynamically assigns URLs to crawlers
Links are given to the Central coordinator (possible bottleneck)
Static assignment
Web is statically partitioned and assigned to crawlers
Crawler only crawls its part of the web
Two problems
Load balancing the #URLs assigned to downloaders:
Static schemes based on hosts may fail
www.geocities.com/….
www.di.unipi.it/
Dynamic “relocation” schemes may be complicated
Managing the fault-tolerance:
What about the death of downloaders? D → D-1, new hash!!!
What about new downloaders? D → D+1, new hash!!!
Let D be the number of downloaders.
hash(URL) maps a URL to {0,...,D-1}.
Downloader x fetches the URLs U s.t. hash(U) = x
A nice technique: Consistent Hashing
A tool for:
Spidering
Web Cache
P2P
Routers Load Balance
Distributed FS
Item and servers mapped to unit circle
Item K assigned to first server N such that ID(N) ≥ ID(K)
What if a downloader goes down?
What if a new downloader appears?
Each server gets replicated log S times
[monotone] adding a new server moves points only from one old server to the new one.
[balance] Prob. an item goes to a server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can copy each server more times...
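A minimal consistent-hashing sketch in Python (the hash function, identifiers and replication factor are illustrative assumptions): downloaders and URLs are mapped to the same circle (here the 64-bit integer ring), each downloader is replicated about log S times, and a URL goes to the first downloader clockwise from it.

import bisect, hashlib, math

def ring_id(key):
    # Map a string to a point on the "unit circle" (here: 0 .. 2^64 - 1).
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

class ConsistentHash:
    def __init__(self, downloaders, replicas=None):
        self.replicas = replicas or max(1, int(math.log2(len(downloaders))) + 1)
        self._points = []                     # sorted (position, downloader) pairs
        for d in downloaders:
            self.add(d)

    def add(self, downloader):                # new downloader: only nearby URLs move to it
        for r in range(self.replicas):
            bisect.insort(self._points, (ring_id(f"{downloader}#{r}"), downloader))

    def remove(self, downloader):             # dead downloader: its URLs go to the successors
        self._points = [p for p in self._points if p[1] != downloader]

    def assign(self, url):
        pos = ring_id(url)
        i = bisect.bisect_right(self._points, (pos, chr(0x10FFFF)))
        return self._points[i % len(self._points)][1]

ch = ConsistentHash(["d0", "d1", "d2", "d3"])
print(ch.assign("http://www.di.unipi.it/"))   # which downloader fetches this URL
ch.remove("d2")                               # only d2's URLs get reassigned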
Examples: Open Source
Nutch, also used by WikiSearch
http://www.nutch.org
Heritrix, used by Archive.org
http://archive-crawler.sourceforge.net/index.html
Consistent Hashing
Amazon’s Dynamo
Ranking
Link-based Ranking
(2nd generation)
Reading 21
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity:
Each page gets a score given by the number of in-links plus the number of out-links (e.g. 3+2=5).
Directed popularity:
Score of a page = number of its in-links (e.g. 3).
Easy to SPAM
Second generation:
PageRank
Each link has its own importance!!
PageRank is
independent of the query
many interpretations…
Basic Intuition…
What about nodes with no in/out links?
Google’s Pagerank
r(i) = α · Σ_{j ∈ B(i)} r(j) / #out(j) + (1 - α) · 1/N

where the matrix T has T_{i,j} = 1/#out(j) if j ∈ B(i), and 0 otherwise.

B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector of components 1/sqrt{N}.

In matrix form, r is the principal eigenvector, and the (1-α) e e^T term is the random jump:
r = [α T + (1-α) e e^T] × r
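A small power-iteration sketch of the formula above (a toy example; α = 0.85 is an assumed damping value, and dangling nodes with no out-links are handled by spreading their rank uniformly).

def pagerank(out_links, alpha=0.85, iters=50):
    # out_links: dict node -> list of nodes it points to. Returns r with sum(r) = 1.
    nodes = list(out_links)
    N = len(nodes)
    r = {i: 1.0 / N for i in nodes}
    for _ in range(iters):
        new = {i: (1.0 - alpha) / N for i in nodes}    # random-jump component
        for j in nodes:
            targets = out_links[j]
            if targets:                                # follow-the-links component
                share = alpha * r[j] / len(targets)    # alpha * r(j) / #out(j)
                for i in targets:
                    new[i] += share
            else:                                      # dangling node: spread uniformly
                for i in nodes:
                    new[i] += alpha * r[j] / N
        r = new
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))        # only the relative order matters for ranking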
Three different interpretations
Graph (intuitive interpretation)
Co-citation
Matrix (easy for computation)
Eigenvector computation or a linear system solution
Markov Chain (useful to prove convergence)
a sort of Usage Simulation
[Random-surfer figure: from the current page, either jump to any node (teleport) or follow a link to one of its neighbors.]
“In the steady state” each page has a long-term visit rate: use this as the page’s score.
Pagerank: use in Search Engines
Preprocessing:
Given the graph, build the matrix [α T + (1-α) e e^T]
Compute its principal eigenvector r
r[i] is the pagerank of page i
We are interested in the relative order
Query processing:
Retrieve pages containing query terms
Rank them by their Pagerank
The final order is query-independent
HITS: Hypertext Induced Topic Search
Calculating HITS
It is query-dependent
Produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: A good hub page for a topic
points to many authoritative pages for that
topic.
Authority and Hub scores
[Figure: node 1 receives links from nodes 2, 3, 4 and points to nodes 5, 6, 7.]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS: Link Analysis Computation
h = A a,  a = A^T h   ⇒   h = A A^T h,  a = A^T A a
Where
a: vector of Authority scores; h: vector of Hub scores.
A: adjacency matrix in which a_{i,j} = 1 if i → j
Thus, h is an eigenvector of A A^T, and a is an eigenvector of A^T A
Symmetric matrices
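A short iterative sketch (illustrative, not the slides’ code) of the computation h = A a, a = Aᵗ h with normalization, on a toy adjacency list.

import math

def hits(out_links, iters=50):
    # out_links: dict x -> list of pages x points to. Returns (authority, hub) scores.
    nodes = list(out_links)
    a = {x: 1.0 for x in nodes}
    h = {x: 1.0 for x in nodes}
    for _ in range(iters):
        # a = A^T h : a good authority is pointed to by many good hubs
        a = {x: 0.0 for x in nodes}
        for y in nodes:
            for x in out_links[y]:
                a[x] += h[y]
        # h = A a : a good hub points to many good authorities
        h = {y: sum(a[x] for x in out_links[y]) for y in nodes}
        # normalize to keep the values bounded
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {x: v / na for x, v in a.items()}
        h = {x: v / nh for x, v in h.items()}
    return a, h

graph = {1: [5, 6, 7], 2: [1], 3: [1], 4: [1], 5: [], 6: [], 7: []}
authority, hub = hits(graph)   # node 1: in-links from 2,3,4 and out-links to 5,6,7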
Weighting links
Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).
Unweighted:
a(x) = Σ_{y → x} h(y)
h(x) = Σ_{x → y} a(y)
Weighted:
a(x) = Σ_{y → x} w(y, x) · h(y)
h(x) = Σ_{x → y} w(x, y) · a(y)
Latent Semantic Indexing
(mapping onto a smaller space of latent concepts)
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 18
Speeding up cosine computation
What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now, O(nm)
Then, O(km+kn) where k << n,m
Two methods:
“Latent semantic indexing”
Random projection
A sketch
LSI is data-dependent
Create a k-dim subspace by eliminating redundant axes
Pull together “related” axes – hopefully
car and automobile
Random projection is data-independent
Choose a k-dim subspace that guarantees good stretching properties with high probability between pairs of points.
What about polysemy ?
Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalues λ and eigenvectors v: A v = λ v
Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value
Decomposition
Create a new (smaller) vector space
Queries handled (faster) in this new space
Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A.
A has rank r ≤ m, n
Define the term-term correlation matrix T = A A^t
T is a square, symmetric m × m matrix
Let P be the m × r matrix of eigenvectors of T
Define the doc-doc correlation matrix D = A^t A
D is a square, symmetric n × n matrix
Let R be the n × r matrix of eigenvectors of D
A’s decomposition
There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product)
It turns out that A = P Σ R^t
Where Σ is a diagonal matrix with the eigenvalues of T = A A^t in decreasing order.
A = P Σ R^t, with dimensions (m × n) = (m × r)(r × r)(r × n)
For some k << r, zero out all but the k biggest eigenvalues in Σ [choice of k is crucial]
Denote by Σ_k this new version of Σ, having rank k
Typically k is about 100, while r (A’s rank) is > 10,000
Dimensionality reduction
A_k = P Σ_k R^t   with dimensions (m × r)(r × r)(r × n)
The zeroed rows/columns of Σ_k make the corresponding columns of P and rows of R^t useless, so effectively:
A_k = P_k Σ_k R_k^t   with dimensions (m × k)(k × k)(k × n)
Guarantee
A_k is a pretty good approximation to A:
Relative distances are (approximately) preserved
Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2
Reduction
X_k = Σ_k R^t is the doc-matrix k × n, hence reduced to k dims
Take the doc-correlation matrix:
It is D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t)
Approximating Σ with Σ_k, we thus get A^t A ≈ X_k^t X_k (both are n × n matrices)
We use X_k to define A’s projection:
X_k = Σ_k R^t; substituting R^t = Σ^{-1} P^t A, we get X_k = P_k^t A
In fact, Σ_k Σ^{-1} P^t = P_k^t, which is a k × m matrix
This means that to reduce a doc/query vector it is enough to multiply it by P_k^t
Cost of sim(q, d), for all d, is O(kn + km) instead of O(mn)
R, P are formed by orthonormal eigenvectors of the matrices D, T
Which are the concepts ?
c-th concept = c-th row of P_k^t (which is k × m)
Denote it by P_k^t[c], whose size is m = #terms
P_k^t[c][i] = strength of association between the c-th concept and the i-th term
Projected document: d'_j = P_k^t d_j
d'_j[c] = strength of concept c in d_j
Projected query: q' = P_k^t q
q'[c] = strength of concept c in q
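A numpy sketch of the LSI pipeline just described (the term-document matrix, its values and k are toy assumptions): compute the SVD A = P Σ Rᵗ, keep the top k singular values, and project docs and queries with P_kᵗ.

import numpy as np

# Toy term-document matrix A (m terms x n docs); entries are term counts.
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2]], dtype=float)

k = 2                                     # number of latent concepts to keep

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)   # A = P diag(sigma) R^t
Pk_t = P[:, :k].T                         # P_k^t: the k x m projection onto concepts

docs_k = Pk_t @ A                         # each column: a doc reduced to k dims
query = np.array([1, 1, 0, 0], dtype=float)
query_k = Pk_t @ query                    # the query projected into the same k-dim space

# Cosine similarity in the reduced space: O(k) per doc instead of O(m).
sims = (docs_k.T @ query_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(query_k) + 1e-12)
print(sims.round(3))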
Random Projections
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only !
An interesting math result (the Johnson–Lindenstrauss lemma)
For any set of points in R^d there is a map f: R^d → R^k, with k = O((1/ε)^2 · log #points), such that for every pair u, v:
(1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2
Setting v = 0 we also get a bound on f(u)’s stretching!!!
Here d is our previous m = #terms
What about the cosine-distance? It follows from f(u)’s and f(v)’s stretching, substituting the formula above.
A practical-theoretical idea !!!
Take R to be a k × d random matrix with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1, and map u → R u / sqrt{k}
Finally...
Random projections hide large constants
k ≈ (1/ε)^2 · log d, so it may be large…
it is simple and fast to compute
LSI is intuitive and may scale to any k
optimal under various metrics
but costly to compute
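A sketch (with made-up sizes) of the practical idea above: a k × m random matrix with zero-mean, unit-variance entries (here ±1), scaled by 1/√k, approximately preserves distances, and hence cosine similarity, with high probability.

import numpy as np

rng = np.random.default_rng(0)

m, k = 10000, 400                          # original #terms, reduced dimension
R = rng.choice([-1.0, 1.0], size=(k, m))   # E[r_ij] = 0, Var[r_ij] = 1

def project(v):
    return (R @ v) / np.sqrt(k)            # f(v): distances preserved up to (1 ± eps) w.h.p.

u = rng.random(m)
v = rng.random(m)
print(np.linalg.norm(u - v), np.linalg.norm(project(u) - project(v)))  # should be close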
Document duplication
(exact or approximate)
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Slides only!
Duplicate documents
The web is full of duplicated content
Few cases of exact duplicates
Many cases of near duplicates
E.g., Last modified date the only difference between two copies of a page
Natural Approaches
Fingerprinting:
only works for exact matches
Random Sampling
sample substrings (phrases, sentences, etc)
hope: similar documents ⇒ similar samples
But – even samples of same document will differ
Edit-distance
metric for approximate string-matching
expensive – even for one pair of strings
impossible – for 10^32 web documents
Obvious techniques
Checksum – no worst-case collision probability guarantees
MD5 – cryptographically-secure string hashes
relatively slow
Karp-Rabin’s Scheme
Algebraic technique – arithmetic on primes
Efficient and other nice properties…
Exact-Duplicate Detection
Karp-Rabin Fingerprints
Consider – an m-bit string A = a_1 a_2 … a_m
Assume – a_1 = 1 and fixed-length strings (wlog)
Basic values:
Choose a prime p in the universe U, such that 2p uses few memory-words (hence U ≈ 2^64)
Set h = d^{m-1} mod p
Fingerprints: f(A) = A mod p
Nice property is that if B = a_2 … a_m a_{m+1}
f(B) = [d · (A - a_1 · h) + a_{m+1}] mod p
Prob[false hit] = Prob[p divides (A - B)] = #div(A - B)/U ≈ (log(A + B))/U = m/U
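A toy sketch of the sliding fingerprint above for bit strings (the prime and the window length are illustrative; a real implementation would pick p at random so that 2p fits in a machine word).

d = 2                             # alphabet size: bit strings
m = 16                            # window (string) length
p = (1 << 61) - 1                 # a prime such that 2p fits in a 64-bit word
h = pow(d, m - 1, p)              # h = d^(m-1) mod p

def fingerprint(bits):            # f(A) = A mod p, reading the bits as a base-d number
    f = 0
    for b in bits:
        f = (f * d + b) % p
    return f

def roll(f_A, a1, a_next):        # f(B) = [d*(f(A) - a1*h) + a_{m+1}] mod p
    return (d * (f_A - a1 * h) + a_next) % p

text = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
f = fingerprint(text[:m])
for i in range(1, len(text) - m + 1):          # slide the window one bit at a time
    f = roll(f, text[i - 1], text[i + m - 1])
    assert f == fingerprint(text[i:i + m])     # same value, computed in O(1) per step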
Near-Duplicate Detection
Problem
Given a large collection of documents
Identify the near-duplicate documents
Web search engines
Proliferation of near-duplicate documents
Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …
Mistaken – spider errors
30% of web-pages are near-duplicates
[1997]
Desiderata
Storage: only small sketches of each document.
Computation: the fastest possible
Stream Processing :
once sketch computed, source is unavailable
Error Guarantees
problem scale ⇒ small biases have large impact
need formal guarantees – heuristics will not do
Basic Idea [Broder 1997]
Shingling
dissect document into q-grams (shingles)
represent documents by shingle-sets
reduce problem to set intersection [ Jaccard ]
They are near-duplicates if large shingle-sets intersect enough
We know how to cope with “Set Intersection”
fingerprints of shingles (for space efficiency)
min-hash to estimate intersections sizes (for time and space efficiency)
[Pipeline: Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints; i.e., Documents → Sets of 64-bit fingerprints]
Fingerprints:
• Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
• Fingerprint space [0, …, U-1]
• In practice, use 64-bit fingerprints, i.e., U = 2^64
• Prob[collision] ≈ (8q)/2^64 << 1
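A small sketch of the shingling step (q and the documents are toy choices; the slides use Karp-Rabin fingerprints, here a truncated SHA-1 stands in for a 64-bit fingerprint, which is an assumption).

import hashlib

def shingles(text, q=8):
    # Set of q-gram shingles of a document (character q-grams for simplicity).
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def fingerprint64(shingle):
    # Stand-in for a Karp-Rabin fingerprint: a 64-bit value in [0, 2^64).
    return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

def fingerprint_set(text, q=8):
    return {fingerprint64(s) for s in shingles(text, q)}

S_A = fingerprint_set("the quick brown fox jumps over the lazy dog")
S_B = fingerprint_set("the quick brown fox jumped over the lazy dog")
jaccard = len(S_A & S_B) / len(S_A | S_B)      # exact set-intersection similarity
print(round(jaccard, 2))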
Similarity of Documents
• Jaccard measure – similarity of S_A, S_B ⊆ U = [0 … N-1]
• Claim: A & B are near-duplicates if sim(S_A, S_B) is high
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Speeding-up: Sketch of a document
Intersecting directly the shingles is too costly
Create a “sketch vector” (of size
~200) for each document
Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
Sketching by Min-Hashing
Consider S_A, S_B ⊆ P
Pick a random permutation π of P (such as ax + b mod |P|)
Define α = π^{-1}( min{π(S_A)} ), β = π^{-1}( min{π(S_B)} )
i.e., the minimal element of each set under the permutation π
Lemma:
P[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|
Sum up…
Similarity sketch sk(A) = the k minimal elements under π(S_A)
Is k fixed, or a fixed ratio of |S_A|, |S_B|?
We might also take k permutations and the min of each
Similarity Sketches sk(A):
Succinct representation of the fingerprint sets S_A
Allows efficient estimation of sim(S_A, S_B)
Basic idea is to use min-hash of fingerprints
Note: we can reduce the variance by using a larger k
Computing Sketch[i] for Doc1
[Figure: Document 1’s shingle fingerprints on the number line [0, 2^64)]
Start with the 64-bit f(shingles)
Permute on the number line with π_i
Pick the min value
Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: the permuted fingerprints of Document 1 and Document 2 on the number line [0, 2^64); are their minima equal?]
Test for 200 random permutations: π_1, π_2, …, π_200
However…
[Figure: the shingle fingerprints of Document 1 and Document 2 under the same permutation]
α = β iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: this happens with probability Size_of_intersection / Size_of_union
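A sketch (with toy parameters and toy fingerprint sets) of min-hashing over the fingerprint sets: 200 random “permutations” of the form π(x) = (a·x + b) mod P for a prime P larger than the universe, and the fraction of agreeing minima estimates the Jaccard similarity. Comparing the minima of the π-values is equivalent to comparing π^{-1} of them.

import random

P = (1 << 89) - 1                        # a Mersenne prime larger than the 2^64 universe
K = 200                                  # number of random permutations

random.seed(42)
perms = [(random.randrange(1, P), random.randrange(P)) for _ in range(K)]

def sketch(fingerprints):
    # sk(S): for each permutation pi(x) = (a*x + b) mod P, keep the minimal pi-value.
    return [min((a * x + b) % P for x in fingerprints) for a, b in perms]

def estimated_similarity(sk_A, sk_B):
    # P[minima agree] = |S_A ∩ S_B| / |S_A ∪ S_B|  (the Lemma above)
    return sum(x == y for x, y in zip(sk_A, sk_B)) / K

S_A = {11, 22, 33, 44, 55, 66, 77, 88}   # toy fingerprint sets (true Jaccard = 0.6)
S_B = {11, 22, 33, 44, 55, 66, 99, 100}
print(round(estimated_similarity(sketch(S_A), sketch(S_B)), 2))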
Sum up…
Brute-force:
compare sk(A) vs. sk(B) for all the pairs of documents A and B.
Locality sensitive hashing (LSH)
Compute sk(A) for each document A
Use LSH of all sketches, briefly:
Take h elements of sk(A) as ID (may induce false positives)
Create t IDs (to reduce the false negatives)
If one ID matches with another one (wrt same h-selection), then the corresponding docs are probably near-duplicates;
hence compare.
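A brief sketch of the LSH step just outlined (h, t and the grouping scheme are illustrative assumptions): split each 200-element sketch into t bands of h elements, use each band as an ID, and only documents sharing at least one band ID become candidates for the full sketch-vs-sketch comparison.

from collections import defaultdict
import random

def candidate_pairs(sketches, h=5, t=40):
    # sketches: dict doc_id -> list of 200 min-hash values (t bands of h elements each).
    buckets = defaultdict(set)
    for doc, sk in sketches.items():
        for band in range(t):
            band_id = (band, tuple(sk[band * h:(band + 1) * h]))   # one ID per band
            buckets[band_id].add(doc)
    pairs = set()
    for docs in buckets.values():          # docs sharing an ID: probable near-duplicates
        docs = sorted(docs)
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs                           # only these pairs get compared in full

# Toy usage: three 200-element sketches, two of them nearly identical.
random.seed(0)
base = [random.randrange(1 << 32) for _ in range(200)]
sketches = {"A": base,
            "B": base[:190] + [0] * 10,
            "C": [random.randrange(1 << 32) for _ in range(200)]}
print(candidate_pairs(sketches))           # expect {('A', 'B')}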