Web search engines
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Two main difficulties
The Web:
Size: more than tens of billions of pages
Language and encodings: hundreds…
Distributed authorship: SPAM, format-less,…
Dynamic: within one year, only 35% of pages survive and just 20% remain untouched
The User:
Query composition: short (2.5 terms avg) and imprecise
Query results: 85% of users look at just one page of results
Several needs: Informational, Navigational, Transactional
Extracting “significant data” is difficult !!
Matching “user needs” is difficult !!
Evolution of Search Engines
First generation -- use only on-page, web-text data
Word frequency and language
Second generation -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor-text (How people refer to a page)
Third generation -- answer “the need behind the query”
Focus on “user need”, rather than on query
Integrate multiple data-sources
Click-through data
[Timeline figure: 1995-1997 AltaVista, Excite, Lycos, etc.; 1998 Google; second- and third-generation engines (Google, Yahoo, MSN, ASK, ...) through 2009; fourth generation "Information Supply" around 2009-12]
[Andrei Broder, VP emerging search tech, Yahoo! Research]
Quality of a search engine
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 8
Is it good?
How fast does it index?
Number of documents/hour
(Average document size)
How fast does it search?
Latency as a function of index size
Expressiveness of the query language
Measures for a search engine
All of the preceding criteria are measurable
The key measure: user happiness
…useless answers won’t make a user happy
Happiness: elusive to measure
The commonest approach is to measure the relevance of the search results
How do we measure it?
Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system
Standard benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
Other doc collections: marked by human experts who assess, for each query and for each doc, Relevant or Irrelevant
On the Web everything is more complicated, since we cannot mark the entire corpus!!
General scenario
[Venn diagram: the retrieved set of docs vs. the relevant set, inside the whole collection]
Precision: % of the retrieved docs that are relevant [issue: how much "junk" is found]

Precision vs. Recall
Recall: % of the relevant docs that are retrieved [issue: how much of the "info" is found]

How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
              | Relevant            | Not Relevant
Retrieved     | tp (true positive)  | fp (false positive)
Not Retrieved | fn (false negative) | tn (true negative)
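As a minimal sketch, the two formulas above translate directly into code (the counts are made-up toy figures):

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Toy figures: 40 relevant docs retrieved, 10 junk docs retrieved,
# 20 relevant docs missed.
print(precision(tp=40, fp=10))  # 0.8  -> 80% of what we retrieved is relevant
print(recall(tp=40, fn=20))     # ~0.67 -> we found 2/3 of the relevant docs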
Some considerations
Can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved
Precision usually decreases
Precision-Recall curve
We measure precision at various levels of recall
Note: it is an AVERAGE over many queries (a small computational sketch follows the plots below)
[Plot: measured (recall, precision) points, precision on the y-axis, recall on the x-axis]
A common picture
[Plot: a typical precision-recall curve, generally decreasing; precision on the y-axis, recall on the x-axis]
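A minimal sketch of how such (recall, precision) points can be computed for a single query, assuming a ranked list of binary relevance judgments (toy data, not from the slides):

def pr_points(ranked, n_relevant):
    # `ranked` lists relevance (1 = relevant) in the order returned by the engine;
    # `n_relevant` is the total number of relevant docs in the collection.
    points, tp = [], 0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            tp += 1
            points.append((tp / n_relevant, tp / rank))  # (recall, precision)
    return points

print(pr_points([1, 0, 1, 1, 0, 0, 1], n_relevant=10))
# [(0.1, 1.0), (0.2, ~0.67), (0.3, 0.75), (0.4, ~0.57)] -> precision tends to drop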
F measure
Combined measure (weighted harmonic mean):
1/F = α · (1/P) + (1 − α) · (1/R)
People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R)
Use this if you need to optimize a single measure that balances precision and recall.
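A small sketch of the weighted harmonic mean above; `alpha` plays the role of α, and the toy P and R values are illustrative only:

def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# With alpha = 1/2 this is the balanced F1 = 2PR/(P+R):
print(f_measure(0.8, 0.4))       # ~0.533 = 2*0.8*0.4/(0.8+0.4)
print(f_measure(0.8, 0.4, 0.7))  # ~0.615: weighting precision more pulls F towards P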
The web graph: properties
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 19.1 and 19.2
The Web’s Characteristics
Size
1 trillion pages known to exist (Google, 7/2008)
50 billion static pages
5-40K per page => terabytes & terabytes
Size grows every day!!
Change
8% new pages and 25% new links appear each week
Average page lifetime: about 10 days
The Bow Tie
[Figure: the bow-tie structure of the Web, with its giant SCC core inside the larger WCC]
Some definitions
Weakly connected component (WCC): a set of nodes such that from any node we can reach any other via an undirected path.
Strongly connected component (SCC): a set of nodes such that from any node we can reach any other via a directed path.
Find the CORE
Iterate the following process:
Pick a random vertex v
Compute all nodes reached from v: O(v)
Compute all nodes that reach v: I(v)
Compute SCC(v):= I(v) ∩ O(v)
Check whether it is the largest SCC
If the CORE is about ¼ of the vertices, then after 20 iterations the probability of not finding the core is (3/4)^20 < 1% (given that the graph is available). A minimal sketch follows.
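A minimal in-memory sketch of this iteration, assuming the graph is given as dictionaries `succ` and `pred` of adjacency lists (hypothetical names):

import random

def reachable(adj, v):
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def find_core(succ, pred, nodes, iters=20):
    best = set()
    for _ in range(iters):
        v = random.choice(nodes)                         # pick a random vertex
        scc_v = reachable(succ, v) & reachable(pred, v)  # SCC(v) = O(v) ∩ I(v)
        if len(scc_v) > len(best):                       # keep the largest SCC seen
            best = scc_v
    return best  # if the CORE holds ~1/4 of the nodes, Pr[miss] < (3/4)^20 < 1%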
Compute SCCs
Classical Algorithm:
1) DFS(G)
2) Compute the transpose graph G^T
3) DFS(G^T), visiting the vertices in decreasing order of the time their visit ended at step 1
4) Every DFS tree is an SCC
DFS is hard to compute on disk: no locality
DFS
DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time ← time + 1
  foreach v in succ[u] do
    if (color[v] = WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time ← time + 1
Classical Approach
main() {
  foreach vertex v do color[v] ← WHITE endFor
  foreach vertex v do
    if (color[v] = WHITE) then DFS(v)
  endFor
}
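For concreteness, a compact in-memory sketch of the classical algorithm (steps 1-4 above), written iteratively to avoid deep recursion; `succ` is an assumed dict of adjacency lists:

def kosaraju_sccs(succ):
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    # Step 1: DFS on G, recording the order in which visits end.
    order, visited = [], set()
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(succ.get(s, ())))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(succ.get(v, ()))))
                    break
            else:
                order.append(u)  # u's visit has ended
                stack.pop()
    # Step 2: transpose G.
    pred = {v: [] for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    # Steps 3-4: DFS on G^T in decreasing finishing time; each tree is an SCC.
    sccs, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in pred[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        sccs.append(comp)
    return sccs

print(kosaraju_sccs({1: [2], 2: [3], 3: [1, 4], 4: []}))  # [[1, 3, 2], [4]]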
Semi-External DFS
Key observation: if the bit-array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.
• Bit array of nodes (visited or not)
• Array of successors
• Stack of the DFS-recursion
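A minimal sketch of this idea: the visited array (one byte per node here, one bit in practice) is the only structure assumed to fit in memory, the explicit stack replaces the DFS recursion, and `fetch_succ` is a hypothetical stand-in for reading a successor list from disk, once per node:

def semi_external_dfs(n, fetch_succ, root=0):
    visited = bytearray(n)                    # in-memory "visited" array
    stack = [(root, iter(fetch_succ(root)))]  # stack of the DFS-recursion
    visited[root] = 1
    preorder = [root]
    while stack:
        u, it = stack[-1]
        for v in it:
            if not visited[v]:
                visited[v] = 1
                preorder.append(v)
                stack.append((v, iter(fetch_succ(v))))  # one disk fetch per node
                break
        else:
            stack.pop()                       # all successors of u processed
    return preorder

# e.g. semi_external_dfs(4, lambda u: {0: [1, 2], 1: [3], 2: [], 3: []}[u]) -> [0, 1, 3, 2]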
Observing Web Graph
We do not know which percentage of it we know
The only way to discover the graph structure of the web is via large scale crawls
Warning: the picture might be distorted by
Size limitation of the crawl
Crawling rules
Perturbations of the "natural" process of birth and death of nodes and links
Why is it interesting?
It is the largest artifact ever conceived by humans
Exploit the structure of the Web for
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predict the evolution of the Web
Sociological understanding
Many other large graphs…
Physical network graph
V = Routers
E = communication links
The “cosine” graph (undirected, weighted)
V = static web pages
E = semantic distance between pages
Query-Log graph (bipartite, weighted)
V = queries and URLs
E = (q,u) if u is a result for q and has been clicked by some user who issued q
Social graph (undirected, unweighted)
V = users
E = (x,y) if x knows y (facebook, address book, email,..)
The size of the web
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 19.5
What is the size of the web ?
Issues
The web is really infinite
Dynamic content, e.g., calendar
Static web contains syntactic duplication, mostly due to mirroring (~30%)
Some servers are seldom connected
Who cares?
Media, and consequently the user
Engine design
What can we attempt to measure?
The relative sizes of search engines
Document extension: e.g. engines index pages not yet crawled, by indexing anchor-text.
Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)
The coverage of a search engine relative to another particular crawling process.
Relative Size from Overlap: given two engines A and B
Sample URLs randomly from A, check if contained in B, and vice versa.
Each test involves: (i) sampling a URL, (ii) checking the URL.
Suppose we find:
A ∩ B = (1/2) · Size(A)
A ∩ B = (1/6) · Size(B)
Then (1/2) · Size(A) = (1/6) · Size(B), hence
Size(A) / Size(B) = (1/6) / (1/2) = 1/3
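The arithmetic above as a tiny sketch (the fractions are the toy values of the example):

def size_ratio(frac_A_in_B: float, frac_B_in_A: float) -> float:
    # |A ∩ B| ≈ frac_A_in_B * |A| and |A ∩ B| ≈ frac_B_in_A * |B|
    # => |A| / |B| ≈ frac_B_in_A / frac_A_in_B
    return frac_B_in_A / frac_A_in_B

print(size_ratio(frac_A_in_B=1/2, frac_B_in_A=1/6))  # ~0.33: A is a third of B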
Sampling URLs
Ideal strategy: Generate a random URL and check for containment in each index.
Problem: Random URLs are hard to find!
Approach 1: Generate a random URL surely contained in a given search engine
Approach 2: Random walks or random IP addresses
#1: Random URL in SE via random queries
Generate a random query:
Lexicon: 400,000+ words from a web crawl
Conjunctive queries: w1 AND w2 (e.g., vocalists AND rsi)
Get 100 result URLs from engine A
Choose a random URL as the candidate to check for presence in search engine B (next slide)
This induces a probability weight W(p) for each page p.
Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
URL checking
Download the document D at the sampled URL.
Get its list of words.
Use 8 low-frequency words as an AND query to B.
Check if D is present in the result set (sketch below).
Problems:
Near duplicates
Engine time-outs
Is an 8-word query good enough?
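A hedged sketch of this checking step; `corpus_freq` and `query_engine_B` are hypothetical stand-ins (a word-frequency table and a function issuing an AND query to engine B), not a real API:

import re

def strong_query(doc_text, corpus_freq, k=8):
    words = set(re.findall(r"[a-z]+", doc_text.lower()))
    rarest = sorted(words, key=lambda w: corpus_freq.get(w, 0))[:k]
    return " AND ".join(rarest)               # k low-frequency words

def is_in_B(url, doc_text, corpus_freq, query_engine_B):
    results = query_engine_B(strong_query(doc_text, corpus_freq))
    return url in results   # caveats: near-duplicates, engine time-outs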
Advantages & disadvantages
Statistically sound under the induced weight.
Biases induced by the random queries:
Query bias: favors content-rich pages in the language(s) of the lexicon
Ranking bias [solution: use conjunctive queries & fetch all results]
Query restriction bias: the engine might not deal properly with an 8-word conjunctive query
Malicious bias: sabotage by the engine
Operational problems: time-outs, failures, engine inconsistencies, index modification.

#2: Random IP addresses
Find a web server at the given (randomly generated) IP address
If there is one:
Collect all pages from the server
From these, choose a page at random
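A minimal sketch of the sampling step, using only the standard socket library; real studies must also handle virtual hosting, robots.txt, and non-standard ports:

import random, socket

def random_ip() -> str:
    # Draw a uniformly random IPv4 address.
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip: str, timeout: float = 1.0) -> bool:
    # Probe port 80; many valid servers will still be missed (firewalls, etc.).
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False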
Advantages & disadvantages
Advantages
Clean statistics
Independent of crawling strategies
Disadvantages
Many hosts might share one IP, or not accept requests
No guarantee all pages are linked to root page, and thus can be collected.
The power law for #pages/host generates a bias towards sites with few pages.
Conclusions
No sampling solution is perfect.
Lots of new ideas ...
...but the problem is getting harder
Quantitative studies are fascinating and a good research problem
The web-graph: storage
Paolo Ferragina
Dipartimento di Informatica Università di Pisa
Reading 20.4
Definition
Directed graph G = (V,E)
V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT links)
Three key properties:
Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1

The In-degree distribution
[Plots: in-degree distributions of an Altavista crawl (1999) and the WebBase crawl (2001)]
In-degree follows a power-law distribution:
Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1
This is true also for: out-degree, sizes of components, ...
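As an aside, a common way to estimate such an exponent from observed degrees is the standard discrete maximum-likelihood approximation α ≈ 1 + n / Σ ln(k_i / (k_min − ½)); a sketch, with toy data:

import math

def powerlaw_alpha(degrees, k_min=1):
    # `degrees` is a list of observed in-degrees; the power law is assumed
    # to hold for degrees >= k_min.
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)

print(powerlaw_alpha([1, 1, 1, 2, 1, 3, 10, 2, 1, 5]))  # ~1.75 on this toy sample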
Definition (cont.)
Two more key properties:
Locality: usually most of the hyperlinks (about 80%) point to other URLs on the same host.
Similarity: pages close in lexicographic order tend to share many outgoing links.
A Picture of the Web Graph
[Adjacency-matrix plot: a dot at (i,j) when page i links to page j]
21 million pages, 150 million links
URL-sorting
[Figure: URLs sorted lexicographically; pages of the same host (e.g. Stanford, Berkeley) cluster together]
Front-compression (front-coding) of the URLs and adjacent storage of their ids
Encoding of positive/negative ints
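A minimal sketch of front-coding over a lexicographically sorted URL list (toy URLs, hypothetical): each URL is stored as the length of the prefix shared with the previous URL, plus the remaining suffix.

def front_code(sorted_urls):
    out, prev = [], ""
    for url in sorted_urls:
        lcp = 0  # length of the longest common prefix with the previous URL
        while lcp < min(len(prev), len(url)) and prev[lcp] == url[lcp]:
            lcp += 1
        out.append((lcp, url[lcp:]))
        prev = url
    return out

urls = ["www.unipi.it/", "www.unipi.it/di/", "www.unipi.it/di/people"]
print(front_code(urls))
# [(0, 'www.unipi.it/'), (13, 'di/'), (16, 'people')]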
Copy-lists: close nodes share many successors
Uncompressed adjacency list
Each bit of the copy-list tells whether the corresponding successor of the reference x is also a successor of y;
The reference index is chosen in [0,W] so as to give the best compression.
Adjacency list with copy-lists (exploits similarity)
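A small sketch of the copy-list idea, with `succ_x` the successor list of the reference node and `succ_y` the list being compressed (assumed in-memory lists of ids):

def copy_list(succ_x, succ_y):
    sx, sy = set(succ_x), set(succ_y)
    bits = [1 if v in sy else 0 for v in succ_x]    # one bit per successor of x
    extras = [v for v in succ_y if v not in sx]     # successors of y not copied
    return bits, extras

bits, extras = copy_list(succ_x=[13, 15, 16, 17, 18], succ_y=[13, 15, 19])
print(bits, extras)  # [1, 1, 0, 0, 0] [19]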
Reference chains are possibly limited (to bound decompression time)
Copy-blocks = RLE(copy-list)
Adjacency list with copy-lists.
The first copy-block is 0 if the copy-list starts with 0;
Each RLE length is decremented by one (for all blocks except possibly the first, which may be 0);
The last block is omitted (its length is known from the out-degree Outd).
Adjacency list with copy-blocks (RLE on the bit sequences)
[Example copy-blocks from the figure: 1,3 and 3]
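One consistent reading of these conventions as a sketch (here only the blocks after the first are decremented, since the first may legitimately be 0):

def copy_blocks(copy_list):
    # RLE: lengths of maximal runs; blocks conventionally start with a run of
    # 1s, so a leading 0-length block is emitted if the list starts with 0.
    runs = [0] if copy_list and copy_list[0] == 0 else []
    i = 0
    while i < len(copy_list):
        j = i
        while j < len(copy_list) and copy_list[j] == copy_list[i]:
            j += 1
        runs.append(j - i)
        i = j
    runs = runs[:-1]                       # last block omitted (known from Outd)
    return runs if not runs else [runs[0]] + [r - 1 for r in runs[1:]]

print(copy_blocks([1, 1, 1, 0, 0, 1, 0]))  # runs 3,2,1,(1) -> emitted as [3, 1, 0]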
Extra-nodes: Intervals and Residuals
Adjacency list with copy-blocks.
Consecutivity in the extra-nodes is exploited:
Intervals: encoded by their left extreme and length; the interval length is decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals, or w.r.t. the source for the first one
Worked figures from the example: 0 = (15-15)*2 (positive); 2 = (23-19)-2 (jump >= 2); 600 = (316-16)*2; 12 = (22-16)*2 (positive); 3018 = 3041-22-1 (difference)
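A minimal sketch of the splitting step, echoing the figures of the example (runs of consecutive ids shorter than Lmin stay as residuals):

def intervals_and_residuals(extra, l_min=2):
    intervals, residuals, run = [], [], []
    for v in extra:
        if run and v == run[-1] + 1:
            run.append(v)                 # extend the current run of consecutive ids
        else:
            if len(run) >= l_min:
                intervals.append((run[0], len(run)))   # (left extreme, length)
            else:
                residuals.extend(run)
            run = [v]
    if len(run) >= l_min:
        intervals.append((run[0], len(run)))
    else:
        residuals.extend(run)
    return intervals, residuals

print(intervals_and_residuals([15, 16, 17, 22, 316, 317, 3041]))
# ([(15, 3), (316, 2)], [22, 3041])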
These techniques are implemented in a Java and a C++ library, achieving ≈3 bits/edge.