Method for analyzing web space data


(12) United States Patent

Montangero et al.

US008117227B2

(10) Patent No.: US 8,117,227 B2

(45) Date of Patent: Feb. 14, 2012

(54) METHOD FOR ANALYZING WEB SPACE DATA

(75) Inventors: Simone Montangero, Pisa (IT); Marco Furini, Melara (IT)

(73) Assignee: Scuola Normale Superiore Di Pisa, Pisa (IT)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 230 days.

(21) Appl. No.: 12/635,004

(22) Filed: Dec. 10, 2009

(65) Prior Publication Data

US 2011/0145215A1 Jun. 16, 2011

(51) Int. Cl.

G06F 7/00 (2006.01)

G06F 17/30 (2006.01)

(52) U.S. Cl. ... 707/769; 707/709

(58) Field of Classification Search ... 707/769

See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

2007/0100875 A1 * 5/2007 Chi et al. ... 707/102

2008/0033587 A1 * 2/2008 Kurita et al. ... 700/100

2010/0185641 A1 * 7/2010 Brazier et al. ... 707/758

OTHER PUBLICATIONS

Yun Chi et al., "Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions", NEC Laboratories America, 2006.

X. Ni et al., "Exploring in the Weblog Space by Detecting Informative and Affective Articles", Dept. of Computer Science etc., May 2007, pp. 281-290.

Yang Liu et al., "ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs", Dept. of Computer Science etc., Jul. 2007.

T. Fukuhara et al., "Analyzing Concerns of People Using Weblog Articles and Real World Temporal Data", Research Institute of Science and etc., May 2005.

N. S. Glance et al., "BlogPulse: Automated Trend Discovery for Weblogs", Intelliseek Applied Research Center, May 2004.

D. Gruhl et al., "How to Build a WebFountain: An Architecture for Very Large-scale Text Analytics", IBM Systems Journal, 2004, pp. 64-77.

D. Gruhl et al., "Information Diffusion Through Blogspace", IBM Research, May 2004, pp. 491-501.

S. Morinaga et al., "Mining Product Reputations on the Web", NEC Corporation, 2007, pp. 341-349.

R. Agrawal, "Mining Newsgroups Using Networks Arising From Social Behavior", IBM Almaden Research Center, Mar. 3, 2007, pp. 529-535.

A. S. Sachrajda et al., "Fractal Conductance Fluctuations in a Soft-Wall Stadium and a Sinai Billiard", Inst. For Microstructural Sciences, 1984, pp. 1948-1951.

* cited by examiner

Primary Examiner: Apu Mofiz

Assistant Examiner: Mohammad Rahman

(74) Attorney, Agent, or Firm: Dennison, Schultz & MacDonald

(57)

ABSTRACT

A method for analyzing data from the web that determines the importance that a chosen subject has in society, e.g., subject matter relating to a concert, a scientific discovery, a football match, a person, a corporation, a brand, or a car, and that analyzes such data in a way that can represent the entire society better than the known techniques. The method according to the invention can avoid malicious alterations and is able to measure and detect the temporal relations among all the web resources that talk about a particular topic or subject matter.

10 Claims, 7 Drawing Sheets


U.S. Patent    Feb. 14, 2012    Sheet 1 of 7    US 8,117,227 B2

[FIG. 1: flowchart of the method. Blocks, in order: 101 Choice of a subject (S); 102 Collection of web resources; 103a Calculation of the number of web resources; 103b Classification of the web resources; 104 Generation of the time-series; 105 Split of the time-series in time windows; 106 Quantification of the level of correlation of each window; 107 Estimation of average number of web resources; 108 Computation of trend index.]

[Drawing sheets 3 to 7 of 7: plots over time of the number of web resources and of the TrendIndex. Legible series labels: Barack Obama, John McCain, Facebook, MySpace. Legible figure labels: Fig. 10, Fig. 11.]


METHOD FOR ANALYZING WEB SPACE DATA

FIELD OF THE INVENTION

The present invention relates to a method for analyzing web space data, i.e. data from the internet web, in order to find the interests of society, in particular for communication strategies, marketing analysis, business investment, sociological activities, product planning, or targeted advertising.

BACKGROUND OF THE INVENTION

As is well known, understanding social concerns and opinions is of high importance in several scenarios and in many decision-making processes. Nowadays, society's thoughts are mainly investigated by posing a series of questions to a selected sample of the population. Questions such as "What do you think about that brand?", "Did you like the commercial aired during the Super Bowl?", "Do you watch that TV show?" and "Do you buy that product?" are commonly used by polling organizations to figure out society's opinions.

Recently, the widespread use of the web as a way of conveying personal opinions has prompted researchers to propose methods that aim at understanding society through web space analysis.

The rationale behind the use of web data to find the interests of society is that, thanks to the set of Internet technologies grouped under the label Web 2.0, the web is more and more a significant representation of our society: it is a modern version of the Ancient Greek Agora, where people gathered to carry out commercial and administrative activities, to discuss politics and philosophy, to participate in social and religious events, and to understand and influence society.

Web 2.0 technologies like blogs, podcasts, and wikis are so important in today's society that they are affecting its morphology by creating new spaces of freedom, giving voice to any opinion, easing interpersonal relationships, and encouraging the creation of collaborating collectivities.

The revolution of Web 2.0 is that it potentially transforms every user from a mere passive reader into an active modern citizen, regardless of ethnicity, gender, or walk of life. Using Web 2.0 technologies, people can meet virtually to share knowledge, conduct business, discuss different topics, socialize, and even influence society. Society and the web are so strongly linked that they affect each other. When something happens in society, it is very likely that a few seconds later someone writes about it in the webspace: for example, more and more people consider the web the first place to look for news, and when a product is released, the Blogosphere, which is made up of all the blogs and their interconnections, is the place where it is discussed. On the other hand, the web might influence society by providing several communication tools and easy access to information. For instance, in May 2007 a post on a blog reported that Apple was delaying the "iPhone" and "Leopard OS". Although this post turned out to be a false alarm, during the period in which the news was considered to be true, Apple's stock was negatively affected.

In the literature, different proposals exploit the society-web relation so as to find out society's interests, such as people's concerns, Hollywood stars' notoriety, politicians' popularity, or consumers' opinions. These proposals are based on the idea that when you see something interesting, e.g., on TV, on the web, or at the movie theater, you usually talk about it with friends, and if people talk about it and spread the word, there will be several ongoing conversations about the topic. The more people converse on the same subject matter, the more the topic is considered in society. By treating the Blogosphere as the place where modern conversations happen, these methods compute the number of ongoing discussions about a specific topic and use this number as an indication of the importance of the topic in society.

Commercial products like Google Trends, BlogPulse, Trendpedia, and Blogmeter, just to name a few, also exploit the webspace to analyze human society. These tools assume that the more people use web search engines to look for a particular topic, or the more people discuss a particular topic in the Blogosphere, the more the topic is popular, important, or simply discussed in society.

A criticism of these approaches is that they may help in understanding what is going on in the web, but they might be misleading or might even represent a distorted view of society. There are two main concerns about these methods.

First, the results represent a part of society rather than the whole. In fact, being based on the Blogosphere, these methods analyze a portion of society composed of tens of millions of users who share information and exchange personal opinions, a portion usually described as composed of technologically advanced people. Without doubt, the Blogosphere offers great commercial value and provides new business opportunities in areas such as product surveys, customer relationships, and marketing, but compared to the 700 million web users, the Blogosphere represents a very small portion of the web, and therefore of society.

The second concern relates to relying solely on the magnitude of the search volume in web search engines, or of keywords in the Blogosphere: it is easy to maliciously alter the results, as one can write software that automatically and periodically issues web searches, or posts blog messages, so as to make a brand, a website, or a politician appear more popular than they really are.

Recently, many proposals in the literature have focused on using web data to understand social opinions and/or concerns, and many commercial blog sites and web search engines have introduced services that try to give an indication of public opinion.

In the literature, much research work is being conducted on the Blogosphere, as blogs are much more dynamic than traditional web pages:

Chi et al. analyze the Blogosphere and propose a trend analysis technique based on the singular value decomposition.

Ni et al. propose a machine learning method for classifying informative and affective articles inside the Blogosphere.

Liu et al. study the predictive power of opinions and sentiments expressed in blogs, in order to predict product sales performance.

Fukuhara et al. describe a system that counts the number of blog articles containing a specific word so as to understand the concerns of people.

Glance et al. propose a mechanism to discover trends inside the Blogosphere by using data mining techniques.

Gruhl et al. use the volume of blogs or link structures to predict the trend of product sales.

Morinaga et al. present an approach that automatically mines consumer opinions with respect to given products, in order to facilitate customer relationship management.

Agrawal et al. and Gamon et al. have also conducted research in opinion mining for marketing purposes.

Commercial blog sites and web search engines are also offering services that aim at understanding society through web data analysis.

The Webfountain project uses web mining techniques for


Google Trends charts how often a particular search term is entered relative to the total search volume across various regions of the world, and in various languages.

All the known methods based on the simple magnitude of the results, whether in the Blogosphere, web search engines, or the entire web space, are misleading and provide different, and sometimes controversial, understandings of society.

SUMMARY OF THE INVENTION

It is therefore a feature of the present invention to provide a method for analyzing data from the web that can find out the importance that a "subject" has in society, e.g., subject matter relating to a concert, a scientific discovery, a football match, a person, a corporation, a brand, or a car.

It is also a feature of the present invention to provide a method for analyzing data from the web that can represent the entire society better than the known techniques.

It is another feature of the present invention to provide a method for analyzing data from the web that can avoid malicious alterations.

It is a particular feature of the present invention to provide a method for analyzing data from the web that is able to measure and detect the temporal relations among all the web resources that talk about a particular topic or subject matter.

These and other features are accomplished with a generally computer-based method, according to the invention, for analyzing data from the web comprising the steps of:

choosing a determined subject or topic (S), said topic (S) being identified by at least one keyword;

collecting data, or web resources, from the web that mention said determined topic (S) at successive instants t, two successive instants t being separated by an interval of time of determined length d;

counting the number W(S) of said web resources that mention said determined topic (S) at each instant t;

generating a time-series of consecutive measures of the number of said web resources, said time-series representing said number W(S) of web resources as a function of time;

splitting said time-series into a plurality of consecutive time windows of determined length Z, with Z ≥ d, in such a way that each time window comprises at least one web resource among said web resources;

applying a determined technique to said plurality of time windows for quantifying, for at least one time window among said time windows, the level of correlations Lc existing in the web resources W(S) of a same time window T and/or for characterizing the structure of a so defined signal;

estimating, for each time window, the average number WM(S) of said web resources W(S) that mention said topic (S);

computing, for each time window, a trend index by combining said average number of said web resources WM(S) with said level of correlations Lc;

repeating said computing step of said trend index for all said time windows, generating a sequence of trend indexes which show how the opinions that society has on a topic S changed over time.

After said step of counting the number W(S) of said web resources, a step of classifying the web resources by means of their space locations can also be provided. In particular, the step of classifying the web resources can be carried out by means of an IP address, the last page update, or any other property that can be identified by means of a "selection rule" (translated into a "regular expression" in software), also associated with Web 2.0 properties (such as being part of a blog, a given community, etc.) and with semantic web properties.
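As an illustration of a "selection rule" translated into a regular expression, the following is a minimal sketch in Python. The rule labels, the patterns, and the choice of matching against the resource URL are assumptions made for the example; the text does not specify them.

```python
import re
from collections import Counter

# Hypothetical selection rules: each label maps to a regular expression
# applied to the URL of a web resource. Labels and patterns are illustrative.
SELECTION_RULES = {
    "blog": re.compile(r"\bblog\b"),
    "italy": re.compile(r"\.it(/|:|$)"),
}

def classify(url):
    """Return the sorted labels whose selection rule matches the URL."""
    return sorted(label for label, rule in SELECTION_RULES.items()
                  if rule.search(url))

def count_by_class(urls):
    """Count how many web resources fall into each class."""
    counts = Counter()
    for url in urls:
        for label in classify(url):
            counts[label] += 1
    return dict(counts)
```

A rule based on an IP address or on the last page update would follow the same pattern, with the regular expression applied to that property instead of the URL.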

In particular, the "level of correlation" is the quantity of cross-correlation of a signal with itself. The mathematical tools to compute cross-correlations may be used, for example, to find repeating patterns, such as the presence of a periodic signal which has been buried under noise.
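The idea of correlating a signal with itself can be sketched with a plain normalized autocorrelation. This is only an illustration of the concept, not the technique the method prefers (Fractal analysis); the function name and the normalization are assumptions.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of `series` at a positive `lag`.

    Values near 1 mean the signal repeats itself after `lag` samples;
    values near 0 mean no linear relation at that lag.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var
```

For a signal that alternates every sample, the autocorrelation is positive at lag 2 and negative at lag 1, which is how a periodic pattern shows up in this measure.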

Advantageously, said determined technique applied to said plurality of time windows can be selected from the group comprised of:

Fractal analysis;

Fourier transform;

Wavelet analysis;

Entropy analysis.

In particular, said determined technique applied to said plurality of time windows is an Entropy analysis.

Preferably, said determined technique applied to said plurality of time windows is Fractal analysis. In this case, the level of correlations of each time window is expressed in terms of a fractal dimension D.

Advantageously, after said step of applying a determined technique to said plurality of time windows, a step of post-processing the resulting data is also provided.

In particular, the above described method detects and measures the temporal relations among the web resources and uses Fractal analysis to retrieve the correlations among the web resources. The results obtained from applying fractal analysis are combined with the number of web documents in order to compute an index, i.e. the Trend Index, that aims to give an indication of the interest that society has in a specific topic.

In particular, the computing step comprises a step of associating a different weight to said average number of said web resources WM(S) and to said fractal dimension D, said associating step being carried out by selecting a parameter α in the range between 0 and 1 depending on the importance to assign to said average number of web resources WM(S) and to said fractal dimension D respectively.

Advantageously, the estimating step of said average number of web resources of a determined time window is carried out applying the following equation:

$$W_{M_i}(S) = \frac{1}{T_i} \sum_{j=1}^{T_i} W_j^i(S) \qquad \text{(I)}$$

where $T_i$ denotes the $i$-th time window and also its length, and $W_{M_i}(S)$ is the average number of web resources $W_j^i(S)$ in the $i$-th time window.

Advantageously, the computing step of said trend index is carried out applying the following equation:

$$\mathrm{Trend}_i^Z(S) = \alpha \cdot \log W_{M_i}(S) + (1 - \alpha) \cdot D_i(S) \qquad \text{(II)}$$

where $D_i(S)$ is the fractal dimension of the $i$-th time window and $\alpha$ is a parameter comprised in the range between 0 and 1.
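The trend index of one window, i.e. α times the logarithm of the window's average count plus (1 − α) times the window's fractal dimension, can be sketched as follows. The base of the logarithm is not stated in the text; base 10 is assumed here, and the function and parameter names are hypothetical.

```python
import math

def trend_index(window_counts, fractal_dimension, alpha=0.5):
    """Trend index of one time window.

    window_counts     -- samples W(S) collected inside the window
    fractal_dimension -- D for the same window, expected between 1 and 2
    alpha             -- weight between 0 and 1 given to the count term
    Assumption: the logarithm is taken in base 10.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if not window_counts:
        raise ValueError("a time window needs at least one sample")
    average = sum(window_counts) / len(window_counts)
    if average <= 0:
        raise ValueError("the average count must be positive")
    return alpha * math.log10(average) + (1 - alpha) * fractal_dimension
```

For example, a window with an average of 100 web resources and D = 1.5 gives 0.5 · 2 + 0.5 · 1.5 = 1.75 when α = 0.5.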

In particular, the collecting step of data from the web and said counting step of said number of web resources W(S) are automatically carried out by a computer program, or web crawler, which browses the web at said intervals of time of length d.

Advantageously, the computing step of said fractal dimension D comprises the steps of:

covering the curve of said time-series of data with a grid of square boxes of a determined side (L);

recording the number M(L) of boxes needed to cover said curve as a function of said box side (L);

computing said fractal dimension D applying the following equation:

$$D = -\lim_{L \to 0} \log_L M(L) \qquad \text{(III)}$$

In particular, the fractal dimension D is comprised in the range between 1 and 2, said fractal dimension D being equal to 1 when the web resources of a same time window create a regular system, and D being equal to 2 when the web resources of a same time window create a random system.
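The box-counting steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited reference: the curve is rescaled to the unit square, samples are joined by straight segments, and the limit for L going to 0 is replaced by a least-squares slope of log M(L) against log(1/L) over a few box sides L = 1/k. The grid sizes are arbitrary choices.

```python
import math

def box_count(values, k):
    """Boxes of side 1/k needed to cover the polyline through `values`
    (at least two samples), after rescaling the curve to the unit square."""
    n = len(values)
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    ys = [(v - lo) / span for v in values]

    def y_at(x):
        # Linear interpolation of the rescaled curve at abscissa x in [0, 1].
        pos = x * (n - 1)
        i = min(int(pos), n - 2)
        return ys[i] + (pos - i) * (ys[i + 1] - ys[i])

    total = 0
    for col in range(k):
        x0, x1 = col / k, (col + 1) / k
        # Curve values inside this column: both edges plus interior samples.
        inside = [y_at(x0), y_at(x1)]
        inside += [ys[i] for i in range(n) if x0 <= i / (n - 1) <= x1]
        top = min(int(max(inside) * k), k - 1)
        bottom = min(int(min(inside) * k), k - 1)
        total += top - bottom + 1
    return total

def fractal_dimension(values, sizes=(4, 8, 16, 32)):
    """Slope of log M(L) against log(1/L), with L = 1/k for k in `sizes`."""
    xs = [math.log(k) for k in sizes]
    ys = [math.log(box_count(values, k)) for k in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

With these choices a straight ramp yields an estimate close to 1 and a strongly irregular series an estimate close to 2, in line with the range stated above. The estimate depends on the grid sizes used, so it should be read as indicative only.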

In particular, two consecutive windows of said plurality of windows partially overlap.

Advantageously, the determined length Z is comprised in the range between 12 hours and 60 days.
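Splitting a time-series into consecutive, partially overlapping windows can be sketched as below. The window length and the step are expressed in numbers of samples, and the `step` parameter is an assumption of this example: the text states that consecutive windows may overlap but does not say by how much.

```python
def split_into_windows(series, window_len, step):
    """Split `series` into consecutive windows of `window_len` samples.

    A `step` smaller than `window_len` makes neighbouring windows overlap;
    a trailing stretch shorter than `window_len` is dropped.
    """
    if window_len < 1 or step < 1:
        raise ValueError("window_len and step must be positive")
    return [series[start:start + window_len]
            for start in range(0, len(series) - window_len + 1, step)]
```

With one sample per hour, a window of Z = 7 days corresponds to `window_len = 168`.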

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be shown with the following description of an exemplary embodiment thereof, exemplifying but not limitative, with reference to the attached drawings in which:

FIG. 1 shows a flowchart with the main steps of the method, according to the invention, for analyzing data from the web;

FIGS. 2A to 2D diagrammatically show the step, as provided by the method illustrated in FIG. 1, of splitting the time-series of web resources related to a determined topic into several overlapping and consecutive windows;

FIG. 3 diagrammatically shows a technique that can be applied for carrying out the step of computing the fractal dimension D of the web resources;

FIG. 4 shows a diagram reporting the number of web resources talking about the Democratic candidates Barack Obama and Hillary Clinton collected between December 2007 and May 2008;

FIG. 5 shows a diagram reporting the trend index, hereinafter also called TrendIndex, as it is computed by the method, according to the invention, on the basis of the web resources shown in FIG. 4;

FIG. 6 shows a diagram reporting the number of web resources talking about the candidates Barack Obama and John McCain collected from September 2008 to November 2008;

FIGS. 7 and 8 show two diagrams reporting the TrendIndex computed by the method, according to the invention, on the basis of the web resources shown in FIG. 6 for two different time windows;

FIG. 9 shows a diagram reporting the number of web resources talking about MySpace and Facebook collected from December 2007 to July 2008;

FIG. 10 shows a diagram reporting the TrendIndex computed by the method, according to the invention, on the basis of the web resources shown in FIG. 9 for a time window of 30 days;

FIG. 11 shows a diagram reporting the TrendIndex related to "MySpace" and "Facebook", with a time window of 7 days and α=0.5.

DESCRIPTION OF PREFERRED EXEMPLARY EMBODIMENTS

FIG. 1 shows the main steps of the method, according to the invention, for understanding society through web data analysis.

The method starts with the choice of a determined topic (S), identified by one or more keywords, block 101.

A collecting step, block 102, is then provided at successive instants t for collecting data, or web resources, mentioning the topic (S) from the web. In particular, two successive instants t are separated by an interval of time of determined length.

A counting step is also provided for counting the number of web resources W(S) that mention the topic (S) at each instant t, block 103a. After the step of counting the number W(S), a step of classifying the web resources by means of their space locations can also be provided, block 103b. In particular, the step of classifying the web resources can be carried out by means of an IP address, the last page update, or any other property that can be identified by means of a "selection rule" (translated into a "regular expression" in software), also associated with Web 2.0 properties (such as being part of a blog, a given community, etc.) and with semantic web properties.

Then a time-series of N consecutive periodic measures of the number of web resources is generated, block 104. In particular, the time-series represents the average number W(S) of web resources as a function of time.

The time-series is subsequently split into a plurality of consecutive time windows of determined length Z, block 105. In particular, the length Z of each time window is greater than the length of the interval of time between two successive instants at which the web resources are collected and counted, in order to have at least one web resource within each time window. In a preferred embodiment of the invention, the collecting and counting steps of the web resources are carried out by a web crawler, i.e. a computer program that automatically browses the web at each instant t.
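The counting step that turns collected resources into a time-series can be sketched as follows. This is not a crawler: it assumes the text of the resources collected at each instant t is already available in memory, and the keyword test is a plain case-insensitive substring match. Both function names and the input layout are hypothetical.

```python
def mentions_topic(text, keywords):
    """True if the text contains at least one keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

def build_time_series(snapshots, keywords):
    """Build the series of (instant, W(S)) pairs.

    snapshots -- list of (instant, texts) pairs ordered by time, where
                 `texts` holds the resources collected at that instant
    keywords  -- the keywords identifying the topic S
    """
    return [(instant, sum(1 for text in texts if mentions_topic(text, keywords)))
            for instant, texts in snapshots]
```

In a real deployment the snapshots would come from the web crawler visiting the web every d units of time; that part is outside the scope of this sketch.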

A determined technique is then applied to each time window of the time-series in order to compute, for each of them, a corresponding level of correlations Lc, block 106.

For example, if the technique applied to each time window is Fractal analysis, the level of correlations is expressed in terms of a fractal dimension D. In particular, the fractal dimension D indicates the level of correlation existing in the web resources W(S) of a same time window. The value of the fractal dimension D is comprised in the range between 1 and 2. In more detail, the value D of the fractal dimension is equal to 1 when the web resources of a same time window correspond to a regular system, while D is equal to 2 when the web resources of a same time window correspond to a random system. If the web resources of a same time window are correlated, then the value of the fractal dimension D is equal to 1.5. An example of a regular system is a single blogger who posts different messages about the same topic. Although the number of web resources talking about the topic smoothly increases, this increasing number does not reflect a growing interest in the topic by society. It simply represents a blogger who is very interested in the topic. An example of a random system is several web resources without any correlations that talk about the same topic (e.g., people who post messages about the same topic but do not relate to each other).

To compute the fractal dimension D it is possible to use the box counting algorithm as disclosed in "Fractal conductance fluctuations in a soft wall stadium and a Sinai billiard", Phys. Rev. Lett., 80:1948, 1998, in the name of A. S. Sachrajda et al. As illustrated in FIG. 3, the fractal dimension D of a signal is obtained by covering the curve of data 30 with a grid of square boxes 50 of size L². The number M(L) of boxes needed to cover the curve is recorded as a function of the box size L.

The fractal dimension D of the curve is then defined as:

$$D = -\lim_{L \to 0} \log_L M(L) \qquad \text{(III)}$$

If the value of D, as calculated with the equation, is equal to 1, then the curve is a straight line, as in the case of a regular system, whereas if D is equal to 2 the curve is a random curve. Indeed, a random curve eventually covers the whole plane uniformly. Any value of D between these integer values signals the fractality of the curve.

In another embodiment of the invention, the fractal dimension D is calculated by applying a different technique which uses rectangular boxes of size L×Δi, where Δi is the largest excursion of the curve in the region L.

Then, the number

A;

is computed.

For any curve there exists a region of box lengths Lmin<L<Lmax. Outside of this region either D=1 or D=2. The first equality (D=1) holds for L<Lmin and is due to the coarse grain artificially introduced by any discrete time series. The second equality (D=2) is obtained for L>Lmax and is due to the finite length of the analyzed time series. The boundaries Lmin, Lmax have to be chosen properly for any time series. Unfortunately, the selection of Lmin, Lmax is prone to errors, since non-optimal boundaries may be selected. However, as reported in the experimental results, this error does not really affect the computation of the TrendIndex.

The fractal dimension represents the temporal correlation of a sequence of given values. To better appreciate how this temporal correlation changes with time, the method, according to the present invention, considers the N collected samples of the time-series FWeb(S) as several overlapping and consecutive time windows of length Z (e.g., in our experiments we consider Z equal to 7 and 30 days).

Once the fractal dimension D has been calculated, a step of estimating the average number WM(S) of the web resources is provided. This step is carried out for estimating the average number of the web resources of a same time window, block 107.

The fractal dimension of each time window is then combined with the average number WM(S) of the web resources of the same time window for computing a TrendIndex, block 108. Repeating the latter step for all the time windows produces a sequence of trend indexes which show how the opinions that society has on a topic S have changed over time.

Fractal Analysis has been extensively employed in diverse scientific, sociological, and philosophical areas of research, and is used to describe physical, visual, acoustic, and chemical processes, and biological, weather, and financial systems. The importance of Fractal Analysis is that, given a sequence of values related to different time points (i.e., a time-series), it gives a fast insight into the "system" that generated the sequence of values.

In many scenarios, knowledge of the "system" brings considerable benefits: for instance, it may be useful to predict the near-future behaviour of earthquakes, or the stock market trend. The computation of the fractal dimension of a time series allows discerning whether the system that generated the sequence is regular or random. Roughly, a regular system produces smooth changes in the sequence of values, whereas a random system produces highly irregular changes in the sequence. In our scenario, the sequence of values is the time-series FWeb(S) and the system is composed of all the web resources that talk about topic S in the webspace. We consider temporal correlations among web resources to be of fundamental importance. In fact, the web is a time-evolving scenario where the number of web resources talking about a topic differs from time to time, and the more these web resources are temporally correlated, the more the topic reflects an interest of society.

In particular, correlations that survive long enough are likely to create a network of web resources. If this happens, the network will likely respond to subsequent stimuli (new events related to the topic) in a similar "correlated" way. Conversely, if the network is not sufficiently correlated, it will eventually vanish, and the response to subsequent stimuli will be negligible. By applying Fractal Analysis and by computing the Fractal Dimension D, it is possible to gain an insight into the amount of correlations present in the network of web resources, i.e. whether the system is regular, random, or something in between, as anticipated above. The present technique starts from the concept that anything between a regular and a random system means that the network of web resources that generated the sequence is correlated and thus will likely cause other people to become part of the network. As a result, more parts of society become interested in the topic. To clarify, let us consider a simple example: the success of a TV series. An ensemble of fans can be triggered by the pilot episode so as to form a group of people interested in the TV series. In this case, the group of fans is a correlated network, as they talk almost every day about the TV series. In fact, when a new episode is aired, this group of fans will easily be the first to talk about it, and it is likely that they will cause other people to become fans; this means that the network grows, as additional people become part of it.

Three examples of application of the method, according to the invention, for analyzing society through web data are illustrated hereafter. In particular, three well-known scenarios are presented: the 2008 USA primary elections, the 2008 USA Presidential elections, and MySpace vs Facebook.

In the following examples, Fractal analysis has been applied to a plurality of time windows for quantifying the level of correlations existing in the Web resources. The level of correlations Lc existing in the Web resources W(S) of a same time window is, therefore, expressed as a fractal Dimension D computed using the equation (III) above indicated. In particular, a value of the parameter α=0.5 has been chosen, which corresponds to weighting the importance of the correlations as much as one order of magnitude in the number of Web resources. A possible reasonable choice is that if, comparing two different topics, one appears more than ten times as often as the other, it should be recorded in any case as more influential. That is, during direct comparisons the correlations play a role only if the two topics have the same order of magnitude of Web resources. Other choices of the value of α are of course possible.
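As a worked illustration of this choice, assume a base-10 logarithm in equation (II) and a Fractal Dimension D lying between 1 (a smooth curve) and 2 (a curve that fills the plane); both are assumptions made here for illustration only. For two topics A and B observed in the same time window, the difference between their trend indexes can then be written as:

$$
\Delta = \alpha \, \log_{10}\frac{W_M(A)}{W_M(B)} + (1-\alpha)\,\bigl(D(A) - D(B)\bigr), \qquad |D(A) - D(B)| \le 1 .
$$

With α=0.5 the second term is at most 0.5 in magnitude, while the first term exceeds 0.5 as soon as W_M(A) is more than ten times W_M(B). Under these assumptions topic A is ranked higher whatever the correlations are, and the correlations can change the ranking only when the two averages lie within one order of magnitude of each other.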

The TrendIndex is computed using the equation (II) above indicated:

$$\mathrm{Trend}_i^z(S) = \alpha \cdot \log W_M^i(S) + (1-\alpha) \cdot D_i(S) \qquad \text{(II)}$$
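The combination expressed by equation (II) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `trend_index`, the base-10 logarithm and the sample values are assumptions introduced here.

```python
import math


def trend_index(avg_resources: float, fractal_dim: float, alpha: float = 0.5) -> float:
    """Combine the window average W_M and the fractal dimension D as in equation (II).

    The base of the logarithm is not stated in the text; base 10 is assumed here.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    if avg_resources <= 0:
        raise ValueError("the average number of Web resources must be positive")
    return alpha * math.log10(avg_resources) + (1.0 - alpha) * fractal_dim


# Hypothetical values for two topics observed in the same time window.
topic_a = trend_index(avg_resources=1000.0, fractal_dim=1.2)  # more resources, weaker correlations
topic_b = trend_index(avg_resources=500.0, fractal_dim=1.8)   # fewer resources, stronger correlations
print(topic_a < topic_b)  # True: within one order of magnitude, correlations decide
```

With these hypothetical numbers the topic with fewer Web resources obtains the higher index, because the two averages differ by less than a factor of ten and its fractal dimension is larger.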

Example 1

Web resources talking about US Presidential candidates have been collected from the second week of December 2007 to the second week of May 2008. FIG. 4 reports the number of Web resources talking about the Democratic candidates Barack Obama and Hillary Clinton. The number of Web resources has been detected every hour.

At first look, many more Web resources talked about Hillary Clinton (only since the beginning of April did the two candidates have a comparable number of Web resources). The presence of peaks can also be noted. By analyzing where these peaks happen, it can be noted that both candidates have peaks around primary election contests. Also, it is interesting to note that the number of Web resources increased a lot at the beginning of March. A reasonable explanation is that at the beginning of March, the candidate John McCain got the Republican nomination, and hence all the media attention began focusing mainly on the Democratic party.

Practically speaking, looking at this chart, Hillary Clinton should have won all the primary election contests, but she did not. An interesting analysis can be done by analyzing the volume of Web searches or the number of posts inside the Blogosphere, rather than the number of Web resources. The term “Barack Obama” has been entered in Web search engines many more times than the term “Hillary Clinton”, and many more posts of the Blogosphere talked about Barack Obama than about Hillary Clinton. Looking at these latter scenarios, Barack Obama should have won all the primary election contests, but he did not.

This shows that an analysis based on the simple magnitude of results (either Web resources, posts in the Blogosphere, or searches in Web search engines) may represent a distorted reality, and therefore may not be sufficient to find the society interests.

FIG. 5 reports the TrendIndex. The period up to the end of January saw Barack Obama winning primary election contests (Iowa and S. Carolina) and getting interesting results in others (New Hampshire, Nevada, and Florida). The majority of the media defined these results as unexpected. But looking at the TrendIndex, these results were not unexpected at all. In fact, in this period the TrendIndex related to Barack Obama was always higher than the one of Hillary Clinton, meaning that people discussed Barack Obama much more than Hillary Clinton.

A second interesting period to analyze is February. In that period several discussions focused on a possible withdrawal of Hillary Clinton from the Presidential race. The TrendIndex shows that in February the buzz around Hillary Clinton increased a lot, and was always higher than the TrendIndex related to Barack Obama. In the second half of March (when no primary election contests were scheduled), the buzz around the two candidates decreased. When the primary election contests began again, the buzz of both increased, with the one about Barack Obama higher than the one of Hillary Clinton (it is to be noted that at the beginning of June 2008, Hillary Clinton withdrew from the Presidential race, and Barack Obama became the Democratic nominee for President of the United States). In summary, while approaches based on the simple magnitude of results were not sufficient to find the society interests, the TrendIndex better represented what was going on in society.

Example 2

FIG. 6 reports the number of Web resources talking about Barack Obama and John McCain from mid September to November 3 (the period when the battle for the Presidency became interesting).

The number of Web resources talking about Barack Obama is higher than the number talking about John McCain. It is interesting to observe that on October 16, John McCain almost reached Barack Obama, and even passed him on October 22. Looking at what happened in society, we observe that on October 16, John McCain was a guest of the “David Letterman Show” and the video became very popular on video sharing sites like YouTube. On October 22, media talked a lot about rumors related to the campaign expenses of John McCain’s Vice-Presidential candidate. Also to be noted is the impact that a speech held in St. Louis on October 18 had on Barack Obama. Note also how, as of November 3, the difference between the two candidates is quite clear.

Looking at the number of Web searches, the term “Barack Obama” has been entered many more times than the term “John McCain”. In the Blogosphere, the two had a comparable number of posts until mid October, and since then the difference between the two is widening with many more posts talking about “Barack Obama”.

In summary, approaches based on the magnitude of results show that Barack Obama is clearly taking the lead over John McCain.

FIGS. 7-8 report the TrendIndex computed on time windows of 7 and 30 days, i.e. z=7 and z=30 in the equation (II), respectively.

FIG. 7 presents the TrendIndex computed on time windows of 7 days. It can be observed that in the last month there were several turnarounds, showing that one week the society is more interested in “Barack Obama” whereas the following week the interest goes to “John McCain”. This highlights the fact that the two candidates are discussed in the society in a comparable way and, depending on particular events (e.g., presence on a popular TV-show, rumors about personal things), the discussion moves from one candidate to the other.

FIG. 8 presents the TrendIndex computed on time windows of 30 days. Also in this investigation it can be observed that the distance between the two candidates fluctuates over time, and since October 28, the distance between the two candidates is widening.

Example 3

FIG. 9 reports the number of Web resources talking about MySpace and Facebook (the two most popular social networking sites) from the beginning of December 2007 to the end of July 2008. With the exception of the first half of December 2007, many more Web resources talk about MySpace than Facebook. Therefore, an analysis based on the simple magnitude of results would indicate that the society talks more about MySpace than Facebook. Since it is difficult to tell whether this is true or not, it is worth investigating both the number of Web searches and the Blogosphere.

Results obtained from analyzing the number of Web searches, in the same period, show that, beginning from mid April 2008, the term Facebook was entered in search engines much more often than the term MySpace. Therefore, methods based on the number of Web searches would indicate that the society talks more about Facebook.

Results obtained while analyzing the sole Blogosphere show that the keyword “MySpace” appears in many more posts than the keyword “Facebook”. The difference is considerable (“MySpace” appears around twice as often as the keyword “Facebook”), but beginning from September 2008, the two keywords appear in a similar number of posts (although “MySpace” has around 20% more posts than “Facebook”).

The comparison among the three analyses (the whole Web, the number of Web searches, and the Blogosphere) shows that methods based on the simple magnitude of results (either Web resources, posts in the Blogosphere, or searches in Web search engines) produce results that contradict each other, and therefore they are not suited to find the society interests.

FIG. 10 shows the TrendIndex computed on time windows of 30 days (i.e., z=30 in the Equation (II) used for computing the TrendIndex). The two curves show that, beginning from February, Facebook received increasing attention from the society (a reason can be the launch of the Spanish language version of Facebook), but in June, MySpace overtook Facebook (a possible reason is that in June MySpace redesigned the site with improved TV player and start page). To summarize, this analysis shows that the society talks about “MySpace” and “Facebook” with comparable frequency, and events like redesign of the Website or availability of new languages clearly show their effects on society.

To understand whether a particular event (e.g., a commercial, an article) produces effects on society or not, it is interesting to perform the analysis using a 7-day time window (FIG. 11). The high fluctuation of the two curves is due to the relatively short length of the observed time windows. A length of 7 days is effective to understand whether a particular event (e.g., a commercial, an article) produces effects on society or not. Results show that beginning from February, there were frequent turnarounds. Once again, the analysis shows that “MySpace” and “Facebook” receive comparable attention by society.

The foregoing description of a specific embodiment will so fully reveal the invention according to a conceptual point of view, so that others, by applying current knowledge, will be able to modify and/or adapt for various applications such an embodiment without further research and without parting from the invention, and it is therefore to be understood that such adaptations and modifications will have to be considered as equivalent to the specific embodiment. The means and the materials to realise the different functions described herein could have a different nature without, for this reason, departing from the field of the invention. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation.

What is claimed is:

1. Method for analyzing data from the Web comprising the steps of:

choosing a determined topic (S), said topic (S) being identified by at least one keyword;

collecting data, or Web resources, from the Web that mention said determined topic (S) at successive instants t, two successive instants t being separated by an interval of time of determined length d;

counting the number W(S) of said Web resources that mention said determined topic (S) at each instant t;

generating a time-series of consecutive measures of the number of said Web resources, said time-series representing said number W(S) of Web resources as a function of time;

splitting said time-series into a plurality of consecutive time windows of determined length z, with z≥d in such a way that each time window comprises at least one Web resource among said Web resources;

applying a determined technique to said plurality of time windows for quantifying, for at least one time window among said time windows, the level of correlations Lc existing in the Web resources W(S) of a same time window T;

estimating, for each time window, the average number WM(S) of said Web resources W(S) that mention said topic (S);

computing, for each time window, a trend index by combining said average number of said Web resources WM(S) with said level of correlations Lc;

repeating said computing step of said trend index for all said time windows generating a sequence of trend indexes which show how opinions that the society has on a topic S changed over time.

2. Method according to claim 1, wherein said determined technique applied to said plurality of time windows is selected from the group comprised of:

Fractal analysis;

Fourier transform;

Wavelet analysis;

Entropy analysis.

3. Method according to claim 1, wherein said determined technique applied to said plurality of time windows is Fractal analysis and said level of correlations Lc existing in the Web resources W(S) of a same time window T is expressed as a fractal Dimension D.

4. Method according to claim 1, wherein said combining step comprises a step of associating to said average number of said Web resources WM(S) and to said fractal dimension D a different weight, said associating step carried out selecting a parameter α in the range comprised between 0 and 1 depending on the importance to assign to said averaged number of Web resources WM(S) and to said fractal dimension D respectively.

5. Method according to claim 1, wherein said estimating step of said average number of Web resources of a determined time window is carried out applying the following equation:

$$W_M^i(S) = \frac{\sum_{j \in T_i} W_j^i(S)}{|T_i|} \qquad \text{(I)}$$

where $T_i$ is the i-th time window, $W_M^i(S)$ is the average number of Web resources $W_j^i(S)$ in the i-th time window and $|T_i|$ is the length of the i-th time window.
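The splitting and averaging steps can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names `split_into_windows` and `window_averages` are hypothetical, the window length is expressed as a number of samples, a trailing incomplete window is dropped, and the hourly counts are invented.

```python
from typing import List, Sequence


def split_into_windows(series: Sequence[float], window_len: int) -> List[List[float]]:
    """Split a time-series into consecutive, non-overlapping windows.

    window_len is the number of samples per window (the length z divided by
    the sampling interval d). Dropping a trailing partial window is an
    assumption made for this sketch.
    """
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    n_full = len(series) // window_len
    return [list(series[i * window_len:(i + 1) * window_len]) for i in range(n_full)]


def window_averages(series: Sequence[float], window_len: int) -> List[float]:
    """Average number of Web resources in each window, in the spirit of equation (I)."""
    return [sum(w) / len(w) for w in split_into_windows(series, window_len)]


# Hypothetical hourly counts covering two days, averaged over 1-day windows.
hourly_counts = [10] * 24 + [20] * 24
print(window_averages(hourly_counts, window_len=24))  # [10.0, 20.0]
```

With hourly sampling, a 7-day window would correspond to `window_len=168` and a 30-day window to `window_len=720`.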

6. Method according to claim 1, wherein said computing step of said trend index is carried out applying the following equation:

$$\mathrm{Trend}_i^z(S) = \alpha \cdot \log W_M^i(S) + (1-\alpha) \cdot D_i(S) \qquad \text{(II)}$$

wherein $D_i(S)$ is the fractal dimension of the time window and α is a parameter comprised in the range between 0 and 1.

7. Method according to claim 1, wherein said collecting step of data from the Web and said counting step of said number of Web resources W(S) are automatically carried out by a computer program, or Web crawler, which browses the Web at said intervals of time of length d.

8. Method according to claim 1, wherein said computing step of said fractal dimension D comprises the steps of:

covering the curve of said time-series of data with a grid of square boxes of a determined side (L);

recording the number M(L) of boxes needed to cover said curve as a function of said box;

computing said fractal dimension D applying the following equation:

$$D = -\lim_{L \to 0} \log_L M(L) \qquad \text{(III)}$$
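The box-counting steps above can be approximated numerically. The sketch below is one possible estimate, not the procedure of the invention: it rescales the curve into the unit square, counts the boxes touched by the piecewise-linear curve for a few grid sizes, and replaces the limit of equation (III) with a least-squares slope of log M(L) against log(1/L). The function name, the rescaling, the grid sizes and the test series are all assumptions.

```python
import math
import random
from typing import List, Sequence


def box_counting_dimension(values: Sequence[float],
                           grid_sizes: Sequence[int] = (2, 4, 8, 16, 32)) -> float:
    """Estimate the box-counting dimension of the curve of a time-series."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    low, high = min(values), max(values)
    span = high - low
    # Rescale into the unit square (assumption: the axis scaling is not specified).
    y = [(v - low) / span if span > 0 else 0.0 for v in values]

    def curve_at(x: float) -> float:
        """Linearly interpolate the curve at abscissa x in [0, 1]."""
        t = x * (n - 1)
        j = min(int(t), n - 2)
        return y[j] + (t - j) * (y[j + 1] - y[j])

    log_inv_side: List[float] = []
    log_boxes: List[float] = []
    for k in grid_sizes:  # k boxes per axis, so the box side is L = 1 / k
        boxes = 0
        for c in range(k):
            first = -((-c * (n - 1)) // k)   # first sample index inside column c
            last = ((c + 1) * (n - 1)) // k  # last sample index inside column c
            column = y[first:last + 1] + [curve_at(c / k), curve_at((c + 1) / k)]
            row_lo = min(int(min(column) * k), k - 1)
            row_hi = min(int(max(column) * k), k - 1)
            boxes += row_hi - row_lo + 1     # the curve is connected within a column
        log_inv_side.append(math.log(k))
        log_boxes.append(math.log(boxes))

    # Least-squares slope of log M(L) versus log(1/L).
    mx = sum(log_inv_side) / len(log_inv_side)
    my = sum(log_boxes) / len(log_boxes)
    num = sum((a - mx) * (b - my) for a, b in zip(log_inv_side, log_boxes))
    den = sum((a - mx) ** 2 for a in log_inv_side)
    return num / den


random.seed(0)
flat = [5.0] * 500                              # perfectly regular series
noisy = [random.random() for _ in range(2000)]  # uncorrelated random series
print(box_counting_dimension(flat) < box_counting_dimension(noisy))  # True
```

A constant series yields an estimate of 1, while an uncorrelated random series approaches 2; series with intermediate structure fall in between, which is the range that the description associates with correlated networks.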

9. Method according to claim 1, wherein two consecutive windows of said plurality of windows are partially overlapped.
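Partially overlapped windows can be obtained by advancing the window start by less than the window length. The following Python sketch shows the idea; the function name `sliding_windows` and its `step` parameter are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], window_len: int, step: int) -> List[List[float]]:
    """Return windows of window_len samples whose starts are step samples apart.

    A step smaller than window_len gives partially overlapped windows;
    a step equal to window_len gives consecutive, non-overlapping windows.
    """
    if window_len <= 0 or step <= 0:
        raise ValueError("window_len and step must be positive")
    last_start = len(series) - window_len
    return [list(series[s:s + window_len]) for s in range(0, last_start + 1, step)]


# Hypothetical series of ten samples, windows of four samples shifted by two.
print(sliding_windows(list(range(10)), window_len=4, step=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each window in this example shares half of its samples with the next one, which smooths the resulting sequence of trend indexes at the cost of computing more windows.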

10. Method according to claim 1, wherein said determined length z is comprised in the range between 12 hours and 60