• Non ci sono risultati.

The Structure of the Web The Structure of the Web

N/A
N/A
Protected

Academic year: 2021

Condividi "The Structure of the Web The Structure of the Web"

Copied!
31
0
0

Testo completo

(1)

The Structure of the Web The Structure of the Web

Moreno Marzolla

Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

http://www.moreno.marzolla.name/

(2)

Complex Systems 2

Copyright © 2014, Moreno Marzolla, Università di Bologna, Italy (http://www.moreno.marzolla.name/teaching/CS2013/)

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA). To view a copy of this license, visit http://creativecommons.org/licenses/by- sa/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San

Francisco, California, 94105, USA.

Image credits: the Web

(3)

Complex Systems 3

A Brief History of the Web

Vannevar Bush and the Memex

Memex was an hypotetical electro-mechanical system

capable of storing information and links between items on (punched) microfilms

Vannevar Bush, “As We May Think”, The Atlantic Monthly, July 1945

Memex was never implemented

– Nevertheless, all basic ideas on hypertexts were there

http://www.unc.edu/~elese/diglib/index.htm http://www.ibiblio.org/pioneers/bush.html

Slide credit: Fabio Vitali

(4)

Complex Systems 4

A Brief History of the Web

Ted Nelson and Project Xanadu(R)

In the '60s, Ted Nelson proposed an

integrated system (Project Xanadu®) for publishing documents.

– Bi-directional, “unbreakable” links

Transclusion (embedding) of arbitrary pieces of documents within other documents

– Copyright management

– Everyone can publish, mix, annotate any document

Some concepts from Xanadu® were later used, in a very simplified form, for the

World Wide Web

http://www.nndb.com/peopl e/132/000026054/

Slide credit: Fabio Vitali

(5)

Complex Systems 5

A Brief History of the Web

Douglas Engelbart

Douglas Engelbart built an innovative system for text editing, video conferencing, collaborative text editing based on mouse, windows and most of the things we know (and use) today

The system was demonstrated during a presentation in 1968 (“the mother of all demos”)

– https://www.youtube.com/watch?

v=VScVgXM7lQQ&list=PL76DBC8D6718B8FD3

Slide credit: Fabio Vitali

(6)

Complex Systems 6

What is the Web today?

A way for sharing (publishing) information

– A collection of independent Web Servers

Each Web Server hosts a number of resources (mostly, but not only, HTML pages)

– Each resource is identified by a URI

A way for browsing the shared information

– Web browsers can render HTML pages and their embedded content

– Browsers allow users to navigate the Web following

hyperlinks

(7)

Complex Systems 7

A set of Web resources

My CV:

I work at the

University of Bologna

UniBO Web Page:

People

Cell processor resources at IBM

research

cellsvm.pdf

HTML page

Other resource

People at UniBO:

...

Moreno Marzolla ...

List of my publications and my CV

Publications:

Optimized training of support vector machines

for the Cell processor

(8)

Complex Systems 8

A set of Web resources

List of my publications and my CV

My CV:

I work at the

University of Bologna

Publications:

Optimized training of support vector machines

for the Cell processor

UniBO Web Page:

People

Cell processor resources at IBM

research

cellsvm.pdf

People at UniBO:

...

Moreno Marzolla ...

Hyperlink

(9)

Complex Systems 9

Hyperlinks

Elements of HTML pages can be annotated with

hyperlinks pointing to other resources (including other HTML pages)

This organization is given for granted today, but is both inspired and non-obvious

– Many different ways or organizing data catalogs: by name, bu topic, hierarchical...

– Hyperlinks allow non-linear knowledge structuring, such that

logical relationships in the text become first-class citizens

(10)

Complex Systems 10

Precursors of hypertext:

Citation networks

Smith 2008

Green and Jones 2006

Brown

2005 Gray, Hop and

Long 2005

Short and Smith

1996 Karl and John

1993

The network of citations among

scientific papers form a directed graph

Unlike the Web, links

tend to point strictly

backward in time

(11)

Complex Systems 11

Precursors of hypertest:

cross-references in a encyclopedias

Nash

Equilibrium Game

Theory

John Forbes Nash

A Beautiful Mind (film)

RAND

Conspiracy Theory

NASA

Ron Howard Apollo 13

(film)

Some cross-references

from wikipedia

(12)

Complex Systems 12

Back to the Web

Our simple idea of the Web made of HTML pages and hyperlinks was mostly correct when the Web was

born, but is too simplicistic today

In the beginning, Web servers had a passive role, just providing static content

Now some links can be associated to dynamic actions

– “Buy it now”

– “Add to shopping cart”

– “Update my calendar”

– “Upload my photo”

(13)

Complex Systems

That's what we

13

are interested in

Two kind of links

Navigational

– Allow navigation from one page to another

Transactional

– Perform actions

– E.g., “buy it now” will

charge your credit cart

start shipping of the product to your home address

move you to a separate Web page containing a purchase receipt

The goal of “buy it now” is not just to move you to the receipt page

A lot of content on the Web has a transactional nature, but is linked together by a navigational “backbone”

– It is reachable via relatively stable Web pages connected to

each other by more traditional navigational links.

(14)

Complex Systems 14

www.moreno.marzolla.name

/index.php

/publications/ /teaching/

(15)

Complex Systems 15

The Web as a directed graph

(16)

Complex Systems 16

The Web as a directed graph

A (directed)

path

(17)

Complex Systems 17

The Web as a directed graph

Strongly Connected Components

(SCCs)

(18)

Complex Systems 18

Strongly Connected Components

We say that a strongly connected component (SCC) in a directed graph is a subset of the nodes such that:

– (i) every node in the subset has a path to every other node;

and

– (ii) the subset is not part of some larger set with the property

that every node can reach every other.

(19)

Complex Systems 19

Computing SCCs

Algorithm SCC(G,x) identifies the strongly connected component containing node x in time O(n+m)

It is possible to identify all strongly connected components in time O(nm+n

2

)

– Actually you can do that in O(n+m) using a more complex algorithm (e.g., Tarjan or Gabow algorithms)

algorithm SCC(Graph G, node x)→list of nodes L := empty list of nodes

(1) Execute DFS(G, x) marking visited nodes (2) Compute the transposed graph G

T

(invert direction of all edges in G)

(3) Execute DFS(G

T

, x), inserting in L all visited nodes which were marked during step (1)

return L

(20)

Complex Systems 20

A Picture of the Web

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S.

Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Graph structure of the Web, Computer Networks 33, 1—6 (Jun. 2000), 309—320

– Based on a Web crawl performed by AltaVista on may 1999

– The crawl is based on a large set of starting points

accumulated over time from various sources, including voluntary submissions

– The crawl proceeds in roughly a BFS manner

– 203 million URLs

– 1466 million links

(21)

Complex Systems 21

A “picture” of the Web (in 2000)

A. Broder et al., “Graph Structure of the Web”, http://www9.org/w9cdrom/160/160.html

(22)

Complex Systems 22

Single “giant” SCC

It contains about 56 million nodes

Why a SCC component?

– Many search engines contain a large directory of Web sites

– From most Web sites

there are links back to the

search engines

(23)

Complex Systems 23

Why a single SCC?

Suppose there are two SCC, say A and B

– It would be possible to merge A and B into a single SCC if there is just a link from A to B, and another from B to A

– If A and B are sufficiently large, there is a high probability that such links exist

SCC A SCC B

(24)

Complex Systems 24

Graph structure of the Web

From the “bow tie” picture alone we have a high-level view of the structure of the Web

We now dig a bit deeper into the Web graph

– In/Out degree distribution

– Connectivity distribution

– Diameter

– Resiliency to edge removal

(25)

Complex Systems 25

Degree distribution

A. Broder et al., “Graph Structure of the Web”, http://www9.org/w9cdrom/160/160.html

In- and Out-degrees follow a power law

P[X = k] ~ 1/k

α

for some α>1

α ≈ 2.1 α ≈ 2.72

(26)

Complex Systems 26

Why power law?

One might ask why in/out degree distribution obeys the power law P[X = k] ~ 1/k

α

Rich-gets-richer effect

A new node X joins the network

X connects with L of existing nodes

Node X connects to node Y with probability proportional to the degree of Y

– Thus, nodes with more edges will likely “attract” new connections

The resulting network is called scale-free network

(27)

Complex Systems 27

Connectivity analysis

Weakly connected

(undirected) components Strongly connected

components

α ≈ 2.5

WCC containing 91% of Web pages

Giant SCC 56 million nodes, 28% of Web pages

(28)

Complex Systems 28

Other properties

k 1000 100 10 5 4 3

Size (millions) 177 167 105 59 41 15

Size of the largest surviving weak component when links to pages with in-degree at least k are removed from the graph.

(Direct) diameter of the SCC: >28

Maximum finite shortest path length: >503

With 75% probability, there is no directed path from a random start node to a random end node

The average directed path length is 16

(29)

Complex Systems 29

Exponential vs scale-free network

R. Albert, H. Jeong, A.-L. Barabási, Error and attack tolerance of complex networks, Nature 406, 378—482 (2000).

Exponential networks are

homogeneous: all nodes have the same average number of links

Scale-free networks are not

homogeneous: few nodes are

highly connected, most nodes

have just few links

(30)

Complex Systems 30

Change in diameter d as a function of

the fraction f of removed nodes

SF

SF

E

(31)

Complex Systems 31

Cluster size distribution for various f

f = fraction of failed/attacked

nodes

Riferimenti

Documenti correlati

We allow for discontinuous coefficients and random initial condition and, under suitable assumptions, we prove that in the limit as the number of particles grows to infinity

./ The algorithm is based on a non-linear Lagrangian framework in which direct and adjoint equations are solved by demigrating and migrating in the frequency domain respectively. ./

As an application, we prove tightness under diffusive rescaling for general pinning and wetting models based on random

• Class diagrams from a software perspective. These show the classes in the software and how they interrelate. • Sequence diagrams for common scenarios. A valuable approach is

The level initial income is negatively correlated with economic growth... Debt Maturity and the Financial Crisis.. Introduction. Factors that may worsen

Tacelli: Elliptic operators with unbounded diffusion and drift coefficients in L p spaces, Advances in Differential Equations, 19 n. Spina, Scale invariant elliptic operators

Arbitrage, Convex cone, Equivalent martingale measure, Equivalent probability measure with given marginals, Finitely additive probability, Fundamental theorem of asset

(The path from US News College Rankings to a class blog and back in Figures 13.5 and 13.6 suggests a concrete example for how this happens.) Thus, all these pages can mutually reach