• Non ci sono risultati.

Web search engines

N/A
N/A
Protected

Academic year: 2021

Condividi "Web search engines"

Copied!
55
0
0

Testo completo

(1)

Web search engines

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

(2)

Two main difficulties

The Web:

Size: more than tens of billions of pages

Language and encodings: hundreds…

Distributed authorship: SPAM, format-less,…

Dynamic: in one year 35% survive, 20% untouched

The User:

Query composition: short (2.5 terms avg) and imprecise

Query results: 85% users look at just one result-page

Several needs: Informational, Navigational, Transactional

Extracting “significant data” is difficult !!

Matching “user needs” is difficult !!

(3)

Evolution of Search Engines

First generation -- use only on-page, web-text data

Word frequency and language

Second generation -- use off-page, web-graph data

Link (or connectivity) analysis

Anchor-text (How people refer to a page)

Third generation -- answer “the need behind the query”

Focus on “user need”, rather than on query

Integrate multiple data-sources

Click-through data

1995-1997 AltaVista, Excite, Lycos, etc

1998: Google

Fourth generation  Information Supply

[Andrei Broder, VP emerging search tech, Yahoo! Research]

Google, Yahoo, MSN, ASK,………

(4)
(5)
(6)
(7)

2009-12

2009

(8)
(9)
(10)

This is a search engine!!!

(11)

II° generation III° generation

IV° generation

(12)

Quality of a search engine

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 8

(13)

Is it good ?

How fast does it index

Number of documents/hour

(Average document size)

How fast does it search

Latency as a function of index size

Expressiveness of the query language

(14)

Measures for a search engine

All of the preceding criteria are measurable

The key measure: user happiness

…useless answers won’t make a user happy

(15)

Happiness: elusive to measure

Commonest approach is given by the relevance of search results

How do we measure it ?

Requires 3 elements:

1.

A benchmark document collection

2.

A benchmark suite of queries

3.

A binary assessment of either Relevant or

Irrelevant for each query-doc pair

(16)

Evaluating an IR system

Standard benchmarks

TREC: National Institute of Standards and Testing (NIST) has run large IR testbed for many years

Other doc collections: marked by human experts, for each query and for each doc,

Relevant or Irrelevant

 On the Web everything is more complicated since we

cannot mark the entire corpus !!

(17)

General scenario

Relevant

Retrieved collection

(18)

Precision: % docs retrieved that are relevant

[issue “junk”

found]

Precision vs. Recall

Relevant

Retrieved collection

Recall: % docs relevant that are retrieved

[issue “info” found]

(19)

How to compute them

Precision: fraction of retrieved docs that are relevant

Recall: fraction of relevant docs that are retrieved

Precision P = tp/(tp + fp)

Recall R = tp/(tp + fn)

Relevant Not Relevant Retrieved tp

(true positive)

fp

(false positive)

Not

Retrieved fn

(false negative)

tn

(true negative)

(20)

Some considerations

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases

(21)

Precision-Recall curve

We measures Precision at various levels of Recall

Note: it is an AVERAGE over many queries

precision

recall x x

x

x

(22)

A common picture

precision

recall x x

x

x

(23)

F measure

Combined measure

(weighted harmonic mean)

:

People usually use balanced F

1

measure

i.e., with  = ½ thus 1/F = ½ (1/P + 1/R)

Use this if you need to optimize a single measure that balances precision and recall.

R P

F 1

) 1

1 (

1

  

(24)

The web graph: properties

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 19.1 and 19.2

(25)

The Web’s Characteristics

Size

1 trillion of pages is available (Google 7/08)

50 billion static pages

5-40K per page => terabytes & terabytes

Size grows every day!!

Change

8% new pages, 25% new links change weekly

Life time of about 10 days

(26)

The Bow Tie

(27)

SCC SCC

WCC WCC

Some definitions

Weakly connected components

(WCC)

Set of nodes such that from any node can go to any node via an undirected path.

Strongly connected components

(SCC)

Set of nodes such that from any node can go to any node via a directed path.

(28)

Find the CORE

Iterate the following process:

Pick a random vertex v

Compute all nodes reached from v: O(v)

Compute all nodes that reach v: I(v)

Compute SCC(v):= I(v) ∩ O(v)

Check whether it is the largest SCC

If the CORE is about ¼ of the vertices, after 20 iterations, Pb to

not find the core < 1% (given that the graph is available).

(29)

Compute SCCs

Classical Algorithm:

1)

DFS(G)

2)

Transpose G in G

T

3)

DFS(G

T

) following vertices in decreasing order of the time their visit ended at step 1.

4)

Every tree is a SCC.

DFS is hard to compute on disk: no locality

(30)

DFS

DFS(u:vertex)

color[u]=GRAY

d[u] time  time +1 foreach v in succ[u] do

if (color[v]=WHITE) then p[v] u

DFS(v) endFor

color[u] BLACK

f[u]  time  time + 1

Classical Approach

main(){

foreach vertex v do color[v]=WHITE endFor

foreach vertex v do

if (color[v]==WHITE) DFS(v);

endFor }

(31)

Semi-External DFS

Key observation: If bit-array fits in internal memory than a DFS takes |V| + |E|/B disk accesses.

Bit array of nodes (visited or not)

Array of successors

Stack of the DFS-recursion

(32)

Observing Web Graph

We do not know which percentage of it we know

The only way to discover the graph structure of the web is via large scale crawls

Warning: the picture might be distorted by

Size limitation of the crawl

Crawling rules

Perturbations of the "natural" process of birth and death of nodes and links

(33)

Why is it interesting?

Largest artifact ever conceived by the human

Exploit its structure of the Web for

Crawl strategies

Search

Spam detection

Discovering communities on the web

Classification/organization

Predict the evolution of the Web

Sociological understanding

(34)

Many other large graphs…

Physical network graph

V = Routers

E = communication links

The “cosine” graph (undirected, weighted)

V = static web pages

E = semantic distance between pages

Query-Log graph (bipartite, weighted)

V = queries and URL

E = (q,u) u is a result for q, and has been clicked by some user who issued q

Social graph (undirected, unweighted)

V = users

E = (x,y) if x knows y (facebook, address book, email,..)

(35)

The size of the web

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 19.5

(36)

What is the size of the web ?

Issues

The web is really infinite

Dynamic content, e.g., calendar

Static web contains syntactic duplication, mostly due to mirroring (~30%)

Some servers are seldom connected

Who cares?

Media, and consequently the user

Engine design

(37)

What can we attempt to measure?

The relative sizes of search engines

Document extension: e.g. engines index pages not yet crawled, by indexing anchor-text.

Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)

The coverage of a search engine relative to

another particular crawling process.

(38)

A B = (1/2) * Size A A B = (1/6) * Size B

(1/2)*Size A = (1/6)*Size B

Size A / Size B =

(1/6)/(1/2) = 1/3 Sample URLs randomly from A

Check if contained in B and vice versa

A B

Each test involves: (i) Sampling URL (ii) Checking URL

Relative Size from Overlap Given two engines A and B

Sec. 19.5

(39)

Sampling URLs

Ideal strategy: Generate a random URL and check for containment in each index.

Problem: Random URLs are hard to find!

Approach 1: Generate a random URL surely contained in a given search engine

Approach 2: Random walks or random IP

addresses

(40)

#1: Random URL in SE via random queries

Generate random query:

Lexicon: 400,000+ words from a web crawl

Conjunctive Queries: w

1

and w

2

e.g., vocalists AND rsi

Get 100 result URLs from engine A

Choose a random URL as the candidate to check for presence in search engine B (next slide)

This induces a probability weight W(p) for each page.

Conjecture: W(SE

A

) / W(SE

B

) ~ |SE

A

| / |SE

B

|

(41)

URL checking

Download D at address URL.

Get list of words.

Use 8 low frequency words as AND query to B

Check if D is present in result set.

Problems:

Near duplicates

Engine time-outs

Is 8-word query good enough?

(42)

Advantages & disadvantages

Statistically sound under the induced weight.

Biases induced by random query

Query bias:

Favors content-rich pages in the language(s) of the lexicon

Ranking bias [

Solution: Use conjunctive queries & fetch all]

Query restriction bias:

engine might not deal properly with 8 words conjunctive query

Malicious bias:

Sabotage by engine

Operational Problems:

Time-outs, failures, engine inconsistencies, index modification.

(43)

#2: Random IP addresses

Find a web server at the given IP address

If there’s one

Collect all pages from server

From this, choose a page at random

(44)

Advantages & disadvantages

Advantages

Clean statistics

Independent of crawling strategies

Disadvantages

Many hosts might share one IP, or not accept requests

No guarantee all pages are linked to root page, and thus can be collected.

Power law for # pages/hosts generates bias towards

sites with few pages.

(45)

Conclusions

No sampling solution is perfect.

Lots of new ideas ...

....but the problem is getting harder

Quantitative studies are fascinating and a

good research problem

(46)

The web-graph: storage

Paolo Ferragina

Dipartimento di Informatica Università di Pisa

Reading 20.4

(47)

Definition

Directed graph G = (V,E)

V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN & no OUT) Three key properties:

Skewed distribution: Pb that a node has x links is

1/x

, ≈ 2.1

(48)

The In-degree distribution

Altavista crawl, 1999 WebBase Crawl 2001

Indegree follows power law distribution

k

k

u 1

] )

( degree -

in

Pr[  

 2.1

This is true also for: out-degree, size components,...

(49)

Definition

Directed graph G = (V,E)

V = URLs, E = (u,v) if u has an hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:

Skewed distribution: Pb that a node has x links is

1/x

, ≈ 2.1

Locality: usually most of the hyperlinks point to other URLs on

the same host (about 80%).

Similarity: pages close in lexicographic order tend to share many outgoing lists

(50)

A Picture of the Web Graph

i

j

21 millions of pages, 150millions of links

(51)

URL-sorting

Stanford

Berkeley

(52)

Front-compression of URLs and adjacent storage of ids

Front-coding

Encoding positive/negative ints:

(53)

Copy-lists: close nodes sharemany successors

Uncompressed adjacency list

Each bit of the copy-list informs whether the corresponding successor of y is also a successor of the reference x;

The reference index is chosen in [0,W] that gives the best compression.

Adjacency list with copy lists

(similarity)

Reference chains

possibly limited

(54)

Copy-blocks = RLE(Copy-list)

Adjacency list with copy lists.

The first copy block is 0 if the copy list starts with 0;

Each RLE-length is decremented by one for all blocks The last block is omitted (we know the length from Outd);

Adjacency list with copy blocks

(RLE on bit sequences)

1,3

(55)

3

Extra-nodes: Intervals and Residuals

Adjacency list with copy blocks.

Consecutivity in extra-nodes

0 = (15-15)*2

(positive)

2 = (23-19)-2

(jump >= 2)

600 = (316-16)*2 12 = (22-16)*2

(positive)

3018 = 3041-22-1

(difference)

Intervals: encoded by left extreme and length Int. length: decremented by Lmin = 2

Residuals: differences between residuals, or the source (the first)

This is a Java and C++ lib

(≈3 bits/edge)

Riferimenti

Documenti correlati

The aim of this study was to investigate associations between five lifestyle factors and risk of cancer- cardiometabolic multimorbidity defined as developing subsequently at least

In fase di esecuzione della query, quando viene immesso il valore del parametro, Access verifica che tale valore sia consistente con il dominio del parametro ESEMPIO. Uno o

diethylstilbestrol-induced RACK1 expression, also consistent with similar result observed in diethylstilbestrol-induced programming of prostate differentiation where flutamide

collisions are generated by tuples yielding the same hash function result with different attribute value. A local sort and join is performed into each bucket Very fast

Back queue request  Select a front queue randomly, biasing towards higher queues.

Back queue request  Select a front queue randomly, biasing towards higher queues.

 The k-gram index finds terms based on a query consisting of k-grams (here k=2).

We conclude that the implicit feedback generated from clicks both within result sets and between result sets in a query chain shows reasonable agree- ment with the explicit judgments