Web search engines
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Two main difficulties
The Web:
Size: more than tens of billions of pages
Language and encodings: hundreds…
Distributed authorship: SPAM, format-less,…
Dynamic: within one year, only 35% of pages survive and just 20% remain untouched
The User:
Query composition: short (2.5 terms avg) and imprecise
Query results: 85% of users look at just one page of results
Several needs: Informational, Navigational, Transactional
Extracting “significant data” is difficult !!
Matching “user needs” is difficult !!
Evolution of Search Engines
First generation -- use only on-page, web-text data
Word frequency and language
Second generation -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor-text (How people refer to a page)
Third generation -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data
[Timeline figure: 1995-1997 AltaVista, Excite, Lycos, etc.; 1998 Google; second- and third-generation engines (Google, Yahoo, MSN, ASK, ...) through 2009; fourth generation "Information Supply" around 2009-12]
[Andrei Broder, VP emerging search tech, Yahoo! Research]
Quality of a search engine
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 8
Is it good?
How fast does it index?
Number of documents/hour
(Average document size)
How fast does it search?
Latency as a function of index size
Expressiveness of the query language
Measures for a search engine
All of the preceding criteria are measurable
The key measure: user happiness
…useless answers won’t make a user happy
Happiness: elusive to measure
The commonest approach is to measure the relevance of the search results
How do we measure it?
Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system
Standard benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
Other doc collections: marked by human experts who assess, for each query and for each doc, Relevant or Irrelevant
On the Web everything is more complicated, since we cannot mark the entire corpus!!
General scenario
[Venn diagram: the retrieved set of docs vs. the relevant set, inside the whole collection]
Precision: % of the retrieved docs that are relevant [issue: how much "junk" is found]

Precision vs. Recall
Recall: % of the relevant docs that are retrieved [issue: how much of the "info" is found]

How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
              | Relevant            | Not Relevant
Retrieved     | tp (true positive)  | fp (false positive)
Not Retrieved | fn (false negative) | tn (true negative)
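As a minimal sketch, the two formulas above translate directly into code (the counts are made-up toy figures):

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Toy figures: 40 relevant docs retrieved, 10 junk docs retrieved,
# 20 relevant docs missed.
print(precision(tp=40, fp=10))  # 0.8  -> 80% of what we retrieved is relevant
print(recall(tp=40, fn=20))     # ~0.67 -> we found 2/3 of the relevant docs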
Some considerations
Can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
Precision usually decreases
Precision-Recall curve
We measure precision at various levels of recall
Note: it is an AVERAGE over many queries (a small computational sketch follows the plots below)
[Plot: measured (recall, precision) points, precision on the y-axis, recall on the x-axis]
A common picture
[Plot: a typical precision-recall curve, generally decreasing; precision on the y-axis, recall on the x-axis]
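A minimal sketch of how such (recall, precision) points can be computed for a single query, assuming a ranked list of binary relevance judgments (toy data, not from the slides):

def pr_points(ranked, n_relevant):
    # `ranked` lists relevance (1 = relevant) in the order returned by the engine;
    # `n_relevant` is the total number of relevant docs in the collection.
    points, tp = [], 0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            tp += 1
            points.append((tp / n_relevant, tp / rank))  # (recall, precision)
    return points

print(pr_points([1, 0, 1, 1, 0, 0, 1], n_relevant=10))
# [(0.1, 1.0), (0.2, ~0.67), (0.3, 0.75), (0.4, ~0.57)] -> precision tends to drop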
F measure
Combined measure (weighted harmonic mean):
1/F = α · (1/P) + (1 − α) · (1/R)
People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R)
Use this if you need to optimize a single measure that balances precision and recall.
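A small sketch of the weighted harmonic mean above; `alpha` plays the role of α, and the toy P and R values are illustrative only:

def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# With alpha = 1/2 this is the balanced F1 = 2PR/(P+R):
print(f_measure(0.8, 0.4))       # ~0.533 = 2*0.8*0.4/(0.8+0.4)
print(f_measure(0.8, 0.4, 0.7))  # ~0.615: weighting precision more pulls F towards P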
The web graph: properties
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 19.1 and 19.2
The Web’s Characteristics
Size
1 trillion pages known to exist (Google, 7/2008)
50 billion static pages
5-40K per page => terabytes & terabytes
Size grows every day!!
Change
8% new pages and 25% new links appear each week
Average page lifetime: about 10 days
The Bow Tie
[Figure: the bow-tie structure of the Web, with its giant SCC core inside the larger WCC]
Some definitions
Weakly connected component (WCC): a set of nodes such that from any node we can reach any other via an undirected path.
Strongly connected component (SCC): a set of nodes such that from any node we can reach any other via a directed path.
Find the CORE
Iterate the following process:
Pick a random vertex v
Compute all nodes reached from v: O(v)
Compute all nodes that reach v: I(v)
Compute SCC(v):= I(v) ∩ O(v)
Check whether it is the largest SCC
If the CORE is about ¼ of the vertices, then after 20 iterations the probability of not finding the core is (3/4)^20 < 1% (given that the graph is available). A minimal sketch follows.
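A minimal in-memory sketch of this iteration, assuming the graph is given as dictionaries `succ` and `pred` of adjacency lists (hypothetical names):

import random

def reachable(adj, v):
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def find_core(succ, pred, nodes, iters=20):
    best = set()
    for _ in range(iters):
        v = random.choice(nodes)                         # pick a random vertex
        scc_v = reachable(succ, v) & reachable(pred, v)  # SCC(v) = O(v) ∩ I(v)
        if len(scc_v) > len(best):                       # keep the largest SCC seen
            best = scc_v
    return best  # if the CORE holds ~1/4 of the nodes, Pr[miss] < (3/4)^20 < 1%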
Compute SCCs
Classical Algorithm:
1) DFS(G)
2) Compute the transpose graph G^T
3) DFS(G^T), visiting the vertices in decreasing order of the time their visit ended at step 1
4) Every DFS tree is an SCC
DFS is hard to compute on disk: no locality
DFS
DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time ← time + 1
  foreach v in succ[u] do
    if (color[v] = WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time ← time + 1
Classical Approach
main() {
  foreach vertex v do color[v] ← WHITE endFor
  foreach vertex v do
    if (color[v] = WHITE) then DFS(v)
  endFor
}
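For concreteness, a compact in-memory sketch of the classical algorithm (steps 1-4 above), written iteratively to avoid deep recursion; `succ` is an assumed dict of adjacency lists:

def kosaraju_sccs(succ):
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    # Step 1: DFS on G, recording the order in which visits end.
    order, visited = [], set()
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(succ.get(s, ())))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(succ.get(v, ()))))
                    break
            else:
                order.append(u)  # u's visit has ended
                stack.pop()
    # Step 2: transpose G.
    pred = {v: [] for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    # Steps 3-4: DFS on G^T in decreasing finishing time; each tree is an SCC.
    sccs, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in pred[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        sccs.append(comp)
    return sccs

print(kosaraju_sccs({1: [2], 2: [3], 3: [1, 4], 4: []}))  # [[1, 3, 2], [4]]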
Semi-External DFS
Key observation: if the bit-array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.
• Bit array of nodes (visited or not)
• Array of successors
• Stack of the DFS-recursion
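A minimal sketch of this idea: the visited array (one byte per node here, one bit in practice) is the only structure assumed to fit in memory, the explicit stack replaces the DFS recursion, and `fetch_succ` is a hypothetical stand-in for reading a successor list from disk, once per node:

def semi_external_dfs(n, fetch_succ, root=0):
    visited = bytearray(n)                    # in-memory "visited" array
    stack = [(root, iter(fetch_succ(root)))]  # stack of the DFS-recursion
    visited[root] = 1
    preorder = [root]
    while stack:
        u, it = stack[-1]
        for v in it:
            if not visited[v]:
                visited[v] = 1
                preorder.append(v)
                stack.append((v, iter(fetch_succ(v))))  # one disk fetch per node
                break
        else:
            stack.pop()                       # all successors of u processed
    return preorder

# e.g. semi_external_dfs(4, lambda u: {0: [1, 2], 1: [3], 2: [], 3: []}[u]) -> [0, 1, 3, 2]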
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the web is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and death of nodes and links
Why is it interesting?
It is the largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
The size of the web
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 19.5
What is the size of the web ?
Issues
The web is really infinite
Dynamic content, e.g., calendar
Static web contains syntactic duplication, mostly due to mirroring (~30%)
Some servers are seldom connected
Who cares?
Media, and consequently the user
Engine design
What can we attempt to measure?
The relative sizes of search engines
Document extension: e.g. engines index pages not yet crawled, by indexing anchor-text.
Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)
The coverage of a search engine relative to another particular crawling process.
Relative Size from Overlap: given two engines A and B
Sample URLs randomly from A, check if contained in B, and vice versa.
Each test involves: (i) sampling a URL, (ii) checking the URL.
Suppose we find:
A ∩ B = (1/2) · Size(A)
A ∩ B = (1/6) · Size(B)
Then (1/2) · Size(A) = (1/6) · Size(B), hence
Size(A) / Size(B) = (1/6) / (1/2) = 1/3
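The arithmetic above as a tiny sketch (the fractions are the toy values of the example):

def size_ratio(frac_A_in_B: float, frac_B_in_A: float) -> float:
    # |A ∩ B| ≈ frac_A_in_B * |A| and |A ∩ B| ≈ frac_B_in_A * |B|
    # => |A| / |B| ≈ frac_B_in_A / frac_A_in_B
    return frac_B_in_A / frac_A_in_B

print(size_ratio(frac_A_in_B=1/2, frac_B_in_A=1/6))  # ~0.33: A is a third of B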
Sampling URLs
Ideal strategy: Generate a random URL and check for containment in each index.
Problem: Random URLs are hard to find!
Approach 1: Generate a random URL surely contained in a given search engine
Approach 2: Random walks or random IP addresses
#1: Random URL in SE via random queries
Generate a random query:
Lexicon: 400,000+ words from a web crawl
Conjunctive queries: w1 AND w2 (e.g., vocalists AND rsi)
Get 100 result URLs from engine A
Choose a random URL as the candidate to check for presence in search engine B (next slide)
This induces a probability weight W(p) for each page p.
Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
URL checking
Download the document D at the sampled URL.
Get its list of words.
Use 8 low-frequency words as an AND query to B.
Check if D is present in the result set (sketch below).
Problems:
Near duplicates
Engine time-outs
Is an 8-word query good enough?
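A hedged sketch of this checking step; `corpus_freq` and `query_engine_B` are hypothetical stand-ins (a word-frequency table and a function issuing an AND query to engine B), not a real API:

import re

def strong_query(doc_text, corpus_freq, k=8):
    words = set(re.findall(r"[a-z]+", doc_text.lower()))
    rarest = sorted(words, key=lambda w: corpus_freq.get(w, 0))[:k]
    return " AND ".join(rarest)               # k low-frequency words

def is_in_B(url, doc_text, corpus_freq, query_engine_B):
    results = query_engine_B(strong_query(doc_text, corpus_freq))
    return url in results   # caveats: near-duplicates, engine time-outs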
Advantages & disadvantages
Statistically sound under the induced weight.
Biases induced by the random queries:
Query bias: favors content-rich pages in the language(s) of the lexicon
Ranking bias [solution: use conjunctive queries & fetch all results]
Query restriction bias: the engine might not deal properly with an 8-word conjunctive query
Malicious bias: sabotage by the engine
Operational problems: time-outs, failures, engine inconsistencies, index modification.

#2: Random IP addresses
Find a web server at the given (randomly generated) IP address
If there is one:
Collect all pages from the server
From these, choose a page at random
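A minimal sketch of the sampling step, using only the standard socket library; real studies must also handle virtual hosting, robots.txt, and non-standard ports:

import random, socket

def random_ip() -> str:
    # Draw a uniformly random IPv4 address.
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip: str, timeout: float = 1.0) -> bool:
    # Probe port 80; many valid servers will still be missed (firewalls, etc.).
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False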
Advantages & disadvantages
Advantages
Clean statistics
Independent of crawling strategies
Disadvantages
Many hosts might share one IP, or not accept requests
No guarantee all pages are linked to root page, and thus can be collected.
The power law for #pages/host generates a bias towards sites with few pages.
Conclusions
No sampling solution is perfect.
Lots of new ideas ...
...but the problem is getting harder
Quantitative studies are fascinating and a good research problem
The web-graph: storage
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.4
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT links)
Three key properties:
Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1

The In-degree distribution
[Plots: in-degree distributions of an Altavista crawl (1999) and the WebBase crawl (2001)]
In-degree follows a power-law distribution:
Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1
This is true also for: out-degree, sizes of components, ...
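As an aside, a common way to estimate such an exponent from observed degrees is the standard discrete maximum-likelihood approximation α ≈ 1 + n / Σ ln(k_i / (k_min − ½)); a sketch, with toy data:

import math

def powerlaw_alpha(degrees, k_min=1):
    # `degrees` is a list of observed in-degrees; the power law is assumed
    # to hold for degrees >= k_min.
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)

print(powerlaw_alpha([1, 1, 1, 2, 1, 3, 10, 2, 1, 5]))  # ~1.75 on this toy sample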
Definition (cont.)
Two more key properties:
Locality: usually most of the hyperlinks (about 80%) point to other URLs on the same host.
Similarity: pages close in lexicographic order tend to share many outgoing links.
A Picture of the Web Graph
[Adjacency-matrix plot: a dot at (i,j) when page i links to page j]
21 million pages, 150 million links
URL-sorting
[Figure: URLs sorted lexicographically; pages of the same host (e.g. Stanford, Berkeley) cluster together]
Front-compression (front-coding) of the URLs and adjacent storage of their ids
Encoding of positive/negative ints
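A minimal sketch of front-coding over a lexicographically sorted URL list (toy URLs, hypothetical): each URL is stored as the length of the prefix shared with the previous URL, plus the remaining suffix.

def front_code(sorted_urls):
    out, prev = [], ""
    for url in sorted_urls:
        lcp = 0  # length of the longest common prefix with the previous URL
        while lcp < min(len(prev), len(url)) and prev[lcp] == url[lcp]:
            lcp += 1
        out.append((lcp, url[lcp:]))
        prev = url
    return out

urls = ["www.unipi.it/", "www.unipi.it/di/", "www.unipi.it/di/people"]
print(front_code(urls))
# [(0, 'www.unipi.it/'), (13, 'di/'), (16, 'people')]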
Copy-lists: close nodes share many successors
Uncompressed adjacency list
Each bit of the copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Adjacency list with copy-lists (exploits similarity)
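A small sketch of the copy-list idea, with `succ_x` the successor list of the reference node and `succ_y` the list being compressed (assumed in-memory lists of ids):

def copy_list(succ_x, succ_y):
    sx, sy = set(succ_x), set(succ_y)
    bits = [1 if v in sy else 0 for v in succ_x]    # one bit per successor of x
    extras = [v for v in succ_y if v not in sx]     # successors of y not copied
    return bits, extras

bits, extras = copy_list(succ_x=[13, 15, 16, 17, 18], succ_y=[13, 15, 19])
print(bits, extras)  # [1, 1, 0, 0, 0] [19]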
Reference chains are possibly limited (to bound decompression time)
Copy-blocks = RLE(copy-list)
Adjacency list with copy-lists.
The first copy-block is 0 if the copy-list starts with 0;
Each RLE length is decremented by one (for all blocks except possibly the first, which may be 0);
The last block is omitted (its length is known from the out-degree Outd).
Adjacency list with copy-blocks (RLE on the bit sequences)
[Example copy-blocks from the figure: 1,3 and 3]
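One consistent reading of these conventions as a sketch (here only the blocks after the first are decremented, since the first may legitimately be 0):

def copy_blocks(copy_list):
    # RLE: lengths of maximal runs; blocks conventionally start with a run of
    # 1s, so a leading 0-length block is emitted if the list starts with 0.
    runs = [0] if copy_list and copy_list[0] == 0 else []
    i = 0
    while i < len(copy_list):
        j = i
        while j < len(copy_list) and copy_list[j] == copy_list[i]:
            j += 1
        runs.append(j - i)
        i = j
    runs = runs[:-1]                       # last block omitted (known from Outd)
    return runs if not runs else [runs[0]] + [r - 1 for r in runs[1:]]

print(copy_blocks([1, 1, 1, 0, 0, 1, 0]))  # runs 3,2,1,(1) -> emitted as [3, 1, 0]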
Extra-nodes: Intervals and Residuals
Adjacency list with copy-blocks.
Consecutivity in the extra-nodes is exploited:
Intervals: encoded by their left extreme and length; the interval length is decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals, or w.r.t. the source for the first one
Worked figures from the example: 0 = (15-15)*2 (positive); 2 = (23-19)-2 (jump >= 2); 600 = (316-16)*2; 12 = (22-16)*2 (positive); 3018 = 3041-22-1 (difference)
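A minimal sketch of the splitting step, echoing the figures of the example (runs of consecutive ids shorter than Lmin stay as residuals):

def intervals_and_residuals(extra, l_min=2):
    intervals, residuals, run = [], [], []
    for v in extra:
        if run and v == run[-1] + 1:
            run.append(v)                 # extend the current run of consecutive ids
        else:
            if len(run) >= l_min:
                intervals.append((run[0], len(run)))   # (left extreme, length)
            else:
                residuals.extend(run)
            run = [v]
    if len(run) >= l_min:
        intervals.append((run[0], len(run)))
    else:
        residuals.extend(run)
    return intervals, residuals

print(intervals_and_residuals([15, 16, 17, 22, 316, 317, 3041]))
# ([(15, 3), (316, 2)], [22, 3041])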
These techniques are implemented in a Java and a C++ library, achieving ≈3 bits/edge.