(12) United States Patent
Montangero et al.
(10) Patent No.: US 8,117,227 B2
(45) Date of Patent: Feb. 14, 2012
(54) METHOD FOR ANALYZING WEB SPACE
DATA
(75) Inventors: Simone Montangero, Pisa (IT); Marco
Furini, Melara (IT)
(73) Assignee: Scuola Normale Superiore Di Pisa,
Pisa (IT)
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35
U.S.C. 154(b) by 230 days.
(21) Appl. No.: 12/635,004
(22) Filed: Dec. 10, 2009
(65) Prior Publication Data
US 2011/0145215A1 Jun. 16, 2011
(51) Int. Cl.
G06F 7/00 (2006.01) G06F 17/30 (2006.01)
(52) U.S. Cl. ... 707/769; 707/709
(58) Field of Classification Search ... 707/769
See application file for complete search history.
(56) References Cited
U.S. PATENT DOCUMENTS
2007/0100875 A1 * 5/2007 Chi et al. ... 707/102
2008/0033587 A1 * 2/2008 Kurita et al. ... 700/100
2010/0185641 A1 * 7/2010 Brazier et al. ... 707/758
OTHER PUBLICATIONS
Yun Chi et al., "Eigen-Trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions", NEC Laboratories America, 2006.
X. Ni et al., "Exploring in the Weblog Space by Detecting Informative and Affective Articles", Dept. of Computer Science etc., May 2007, pp. 281-290.
Yang Liu et al., "ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs", Dept. of Computer Science etc., Jul. 2007.
T. Fukuhara et al., "Analyzing Concerns of People Using Weblog Articles and Real World Temporal Data", Research Institute of Science and etc., May 2005.
N. S. Glance et al., "BlogPulse: Automated Trend Discovery for Weblogs", Intelliseek Applied Research Center, May 2004.
D. Gruhl et al., "How to Build a WebFountain: An Architecture for Very Large-scale Text Analytics", IBM Systems Journal, 2004, pp. 64-77.
D. Gruhl et al., "Information Diffusion Through Blogspace", IBM Research, May 2004, pp. 491-501.
S. Morinaga et al., "Mining Product Reputations on the Web", NEC Corporation, 2007, pp. 341-349.
R. Agrawal, "Mining Newsgroups Using Networks Arising From Social Behavior", IBM Almaden Research Center, Mar. 3, 2007, pp. 529-535.
A. S. Sachrajda et al., "Fractal Conductance Fluctuations in a Soft-Wall Stadium and a Sinai Billiard", Inst. For Microstructural Sciences, 1984, pp. 1948-1951.
* cited by examiner
Primary Examiner: Apu Mofiz
Assistant Examiner: Mohammad Rahman
(74) Attorney, Agent, or Firm: Dennison, Schultz & MacDonald
(57)
ABSTRACT
A method for analyzing data from the web that determines the importance that a chosen subject has in society, e.g., subject matter relating to a concert, a scientific discovery, a football match, a person, a corporation, a brand, or a car, and analyzes such data in a way that can represent the entire society better than the known techniques. The method according to the invention can avoid malicious alterations and is able to measure and detect the temporal relations among all the web resources that talk about a particular topic or subject matter.
10 Claims, 7 Drawing Sheets
[Sheet 1 of 7: FIG. 1, flowchart of the method. Blocks: choice of a subject (S), 101; collection of web resources; calculation of the number of web resources, 103a; classification of the web resources, 103b; generation of the time-series, 104; split of the time-series in time windows; quantification of the level of correlation of each window, 106; estimation of the average number of web resources, 107; computation of the trend index, 108.]
[Sheet 3 of 7: plot of the number of web resources versus time; time axis labelled December 2007 to April 2008.]
[Sheet 4 of 7: plot with the series "Barack Obama"; time axis labelled December 2007 to April 2008.]
[Sheet 5 of 7: plots with the series "Barack Obama" and "John McCain".]
[Sheet 6 of 7: plot of the number of web resources versus time, January to July; FIG. 10, TrendIndex for "Facebook" and "MySpace".]
[Sheet 7 of 7: FIG. 11, TrendIndex for "Facebook" and "MySpace".]
METHOD FOR ANALYZING WEB SPACE
DATA
FIELD OF THE INVENTION
The present invention relates to a method for analyzing web space data, i.e. data from the internet web, in order to find the interests of society, in particular for communication strategies, marketing analysis, business investment, sociological activities, product planning, or targeted advertising.
BACKGROUND OF THE INVENTION
As is well known, understanding social concerns and opinions is of high importance in several scenarios and in many decision-making processes. Nowadays, society's thoughts are mainly investigated by putting a series of questions to a selected sample of the population. Questions like "What do you think about that brand?", "Did you like the commercial aired during the Super Bowl?", "Do you watch that TV show?", and "Do you buy that product?" are commonly used by polling organizations to figure out society's opinions.
Recently, the widespread use of the web as a way of conveying personal opinions has prompted researchers to propose methods that aim at understanding society through web space analysis.
The rationale behind the use of web data to find society's interests is that, thanks to the set of Internet technologies grouped under the label Web 2.0, the web is more and more a significant representation of our society: it is a modern version of the Ancient Greek Agora, where people gathered to carry out commercial and administrative activities, to discuss politics and philosophy, to participate in social and religious events, and to understand and influence society.
Web 2.0 technologies like blogs, podcasts, and wikis are so important in today's society that they are affecting its morphology by creating new spaces of freedom, giving voice to any opinion, easing interpersonal relationships, and encouraging the creation of collaborating collectivities.
The revolution of Web 2.0 is that it potentially transforms every user from a mere passive reader into an active modern citizen, regardless of ethnicity, gender, or walk of life. Using Web 2.0 technologies, people can meet virtually to share knowledge, conduct business, discuss different topics, socialize, and even influence society. Society and the web are so strongly linked that they affect each other. When something happens in society, it is very likely that a few seconds later someone writes about it in the web space. For example, more and more people consider the web as the first place to look for news, and when a product is released, the Blogosphere, which is made up of all the blogs and their interconnections, is the place where it is discussed. On the other hand, the web might influence society by providing several communication tools and easy access to information. For instance, in May 2007 a post on a blog reported that Apple was delaying the "iPhone" and "Leopard OS". Although this post turned out to be a false alarm, during the period in which the news was considered to be true, Apple's stock was negatively affected.
In the literature, different proposals exploit the society-web relation so as to find out society's interests, such as people's concerns, Hollywood stars' notoriety, politicians' popularity, or consumers' opinions. These proposals are based on the idea that when you see something interesting, e.g., on TV, on the web, or at the movie theater, you usually talk about it with friends, and if people talk about it and spread the word, there will be several ongoing conversations about the topic. The more people converse on the same subject matter, the more the topic is considered in society. By treating the Blogosphere as the place where modern conversations happen, these methods compute the number of ongoing discussions about a specific topic and use this number as an indication of the importance of the topic in society.
Commercial products like Google Trends, BlogPulse, Trendpedia, and Blogmeter, just to name a few, also exploit the web space to analyze human society. These tools assume that the more people use web search engines to look for a particular topic, or the more people discuss a particular topic in the Blogosphere, the more the topic is popular, important, or simply discussed in society.
A criticism of these approaches is that they may help in understanding what is going on in the web, but they might be misleading or might even present a distorted view of society. There are two main concerns about these methods.
Firstly, the results represent a part of society rather than the whole of it. In fact, being based on the Blogosphere, these methods analyze a portion of society composed of tens of millions of users who share information and exchange personal opinions, a portion usually described as composed of technologically advanced people. Without doubt, the Blogosphere offers great commercial value and provides new business opportunities in areas such as product surveys, customer relationships, and marketing, but compared to the 700 million web users, the Blogosphere represents a very small portion of the web, and therefore of society.
The second criticism relates to the use of the sole magnitude of the volume of searches in web search engines, or of keywords in the Blogosphere: it is easy to maliciously alter the results, as one can write software that automatically and periodically issues web searches, or posts blog messages, so as to make a brand, a website, or a politician appear more popular than they really are.
Recently, many proposals in the literature have focused on using web data to understand social opinions and/or concerns, and many commercial blog sites and web search engines have introduced services that try to give an indication of public opinion.
In the literature, much research work is being conducted on the Blogosphere, as blogs are much more dynamic than traditional web pages:
Chi et al. analyze the Blogosphere and propose a trend analysis technique based on the singular value decomposition.
Ni et al. propose a machine learning method for classifying informative and affective articles inside the Blogosphere.
Liu et al. study the predictive power of opinions and sentiments expressed in blogs, in order to predict product sales performance.
Fukuhara et al. describe a system that counts the number of blog articles containing a specific word so as to understand the concerns of people.
Glance et al. propose a mechanism to discover trends inside the Blogosphere by using data mining techniques.
Gruhl et al. use the volume of blogs or link structures to predict the trend of product sales.
Morinaga et al. present an approach that automatically mines consumer opinions with respect to given products, in order to facilitate customer relationship management.
Agrawal et al. and Gamon et al. have also conducted research in opinion mining for marketing purposes.
Commercial blog sites and web search engines are also offering services that aim at understanding society through web data analysis.
The WebFountain project uses web mining techniques for
Google Trends charts how often a particular search term is entered relative to the total search volume across various regions of the world, and in various languages.
All the known methods based on the simple magnitude of the results, whether in the Blogosphere, web search engines, or the entire web space, are misleading and provide different, and sometimes controversial, understandings of society.
SUMMARY OF THE INVENTION
It is therefore a feature of the present invention to provide a method for analyzing data from the web that can find out the importance that a "subject" has in society, e.g., a subject matter relating to a concert, a scientific discovery, a football match, a person, a corporation, a brand, or a car.
It is also a feature of the present invention to provide a method for analyzing data from the web that can represent the entire society better than the known techniques.
It is another feature of the present invention to provide a method for analyzing data from the web that can avoid malicious alterations.
It is a particular feature of the present invention to provide a method for analyzing data from the web that is able to measure and detect the temporal relations among all the web resources that talk about a particular topic or subject matter.
These and other features are accomplished with a generally computer-based method, according to the invention, for analyzing data from the web comprising the steps of:
choosing a determined subject or topic (S), said topic (S) being identified by at least one keyword;
collecting data, or web resources, from the web that mention said determined topic (S) at successive instants t, two successive instants t being separated by an interval of time of determined length d;
counting the number W(S) of said web resources that mention said determined topic (S) at each instant t;
generating a time-series of consecutive measures of the number of said web resources, said time-series representing said number W(S) of web resources as a function of time;
splitting said time-series into a plurality of consecutive time windows of determined length Z, with Z ≥ d, in such a way that each time window comprises at least one web resource among said web resources;
applying a determined technique to said plurality of time windows for quantifying, for at least one time window among said time windows, the level of correlations Lc existing in the web resources W(S) of a same time window T and/or for characterizing the structure of a so defined signal;
estimating, for each time window, the average number WM(S) of said web resources W(S) that mention said topic (S);
computing, for each time window, a trend index by combining said average number of said web resources WM(S) with said level of correlations Lc;
repeating said computing step of said trend index for all said time windows, generating a sequence of trend indexes which show how the opinions that society has on a topic S changed over time.
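The collecting, counting, and time-series generation steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `build_time_series`, the record format of (timestamp, URL) pairs, and the choice of counting distinct URLs per sampling interval of length d are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_time_series(records, start, d, n):
    """Return n counts W(S), one per sampling instant start + k*d.

    records: iterable of (timestamp, url) pairs for resources that
    mention the topic S (hypothetical crawler output).
    """
    seen = defaultdict(set)
    for timestamp, url in records:
        k = (timestamp - start) // d  # index of the sampling interval
        if 0 <= k < n:
            seen[k].add(url)  # a set ignores repeated sightings of one URL
    return [len(seen[k]) for k in range(n)]


start = datetime(2008, 1, 1)
hour = timedelta(hours=1)
records = [
    (start, "http://example.org/a"),
    (start + timedelta(minutes=30), "http://example.org/b"),
    (start + timedelta(minutes=45), "http://example.org/a"),  # duplicate
    (start + hour, "http://example.org/c"),
]
series = build_time_series(records, start, hour, n=3)
```

Counting distinct URLs is only one possible reading of "number of web resources"; a count reported directly by a search service at each instant t would fit the same list-of-counts output.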
After said step of counting the number W(S) of said web resources, a step of classifying the web resources by means of their space locations can also be provided. In particular, the step of classifying the web resources can be carried out by means of an IP address, the last page update, or any other property that can be identified by means of a "selection rule" (translated into a "regular expression" in software), also associated with Web 2.0 properties (such as being part of a blog, a given community, etc.) and with semantic web properties (semantic web).
In particular, the "level of correlation" is the quantity of cross-correlation of a signal with itself. The mathematical tools to compute cross-correlations can be used, for example, to find repeating patterns, such as the presence of a periodic signal which has been buried under noise.
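As an illustration of the cross-correlation of a signal with itself, the sketch below computes a normalized autocorrelation at a given lag in plain Python. It is only an example of the general idea; the function name `autocorrelation`, the synthetic signal, and the noise level are assumptions, and this estimator is not one of the techniques the method selects among.

```python
import random


def autocorrelation(x, lag):
    """Normalized correlation of the series x with itself shifted by lag."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x)
    if variance == 0:
        return 0.0
    covariance = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return covariance / variance


# A period-4 pattern buried under small uniform noise (fixed seed).
rng = random.Random(0)
signal = [v + rng.uniform(-0.3, 0.3) for v in [0, 1, 0, -1] * 8]

at_period = autocorrelation(signal, 4)       # strongly positive
at_half_period = autocorrelation(signal, 2)  # strongly negative
```

A clear positive peak at the lag equal to the period, and a negative value at half the period, is how a hidden periodic component shows up.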
Advantageously, said determined technique applied to said plurality of time windows can be selected from the group comprised of:
Fractal analysis;
Fourier transform;
Wavelet analysis;
Entropy analysis.
In particular, said determined technique applied to said plurality of time windows is an Entropy analysis.
Preferably, said determined technique applied to said plurality of time windows is Fractal analysis. In this case, each of said levels of correlations of each time window is expressed in terms of a Fractal dimension D.
Advantageously, after said step of applying a determined technique to said plurality of time windows, a step of post-processing the resulting data is also provided.
In particular, the above described method detects and measures the temporal relations among the web resources and uses Fractal analysis to retrieve the correlations among the web resources. The results obtained from applying fractal analysis are combined with the number of web documents in order to compute an index, i.e. the Trend Index, that aims to give an indication of the interest that society has in a specific topic.
In particular, the computing step comprises a step of associating a different weight with said average number of said web resources WM(S) and with said fractal dimension D, said associating step being carried out by selecting a parameter α in the range between 0 and 1 depending on the importance to assign to said average number of web resources WM(S) and to said fractal dimension D respectively.
Advantageously, the estimating step of said average number of web resources of a determined time window is carried out applying the following equation:
$$W_M^i(S) = \frac{\sum_{j \in T_i} W_j^i(S)}{T_i} \qquad \text{(I)}$$
where the sum runs over the i-th time window $T_i$, $W_M^i(S)$ is the average number of web resources $W_j^i(S)$ in the i-th time window, and $T_i$ in the denominator is the length of the i-th time window.
Advantageously, the computing step of said trend index is carried out applying the following equation:
$$\mathrm{Trend}_i^Z(S) = \alpha \cdot \log W_M^i(S) + (1-\alpha) \cdot D_i(S) \qquad \text{(II)}$$
where $D_i(S)$ is the fractal dimension of the i-th time window and α is a parameter in the range between 0 and 1.
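Equations (I) and (II) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the window length is taken to be its number of samples, and the logarithm is assumed to be base 10 (the text does not state the base, but elsewhere speaks of an order of magnitude in the number of web resources).

```python
import math


def window_average(window):
    """Equation (I): average number of web resources in one time window."""
    return sum(window) / len(window)


def trend_index(window, fractal_dimension, alpha=0.5):
    """Equation (II): alpha * log(W_M) + (1 - alpha) * D for one window.

    The base-10 logarithm is an assumption made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    w_m = window_average(window)
    if w_m <= 0:
        raise ValueError("a window must contain at least one web resource")
    return alpha * math.log10(w_m) + (1.0 - alpha) * fractal_dimension


example = trend_index([100, 100, 100], fractal_dimension=1.5, alpha=0.5)
```

With an average of 100 resources and D = 1.5, the two terms contribute 0.5 × 2 and 0.5 × 1.5 respectively.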
In particular, the collecting step of data from the web and said counting step of said number of web resources W(S) are automatically carried out by a computer program, or web crawler, which browses the web at said intervals of time of length d.
Advantageously, the computing step of said fractal dimension D comprises the steps of:
covering the curve of said time-series of data with a grid of square boxes of a determined side (L);
recording the number M(L) of boxes needed to cover said curve as a function of said box side (L);
computing said fractal dimension D applying the following equation:
$$D = -\lim_{L \to 0} \log_L M(L) \qquad \text{(III)}$$
In particular, the fractal dimension D is in the range between 1 and 2, said fractal dimension D being equal to 1 when the web resources of a same time window create a regular system, and equal to 2 when the web resources of a same time window create a random system.
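The box-counting steps above can be sketched as follows. This is a hedged illustration, not the method of the invention: the limit in equation (III) is replaced by a least-squares slope of log M(L) against log(1/L) over a small, assumed range of grid sizes, the curve is rescaled to the unit square, and the polyline is sampled densely rather than intersected exactly. The function name and its parameters are hypothetical.

```python
import math
import random


def box_counting_dimension(series, grid_sizes=(4, 8, 16, 32), oversample=64):
    """Estimate D as the slope of log M(L) versus log(1/L), with L = 1/k."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two samples")
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    # Rescale the curve to the unit square and sample each segment densely.
    points = []
    for i in range(n - 1):
        for s in range(oversample):
            t = s / oversample
            x = (i + t) / (n - 1)
            y = (series[i] + t * (series[i + 1] - series[i]) - lo) / span
            points.append((x, y))
    points.append((1.0, (series[-1] - lo) / span))
    logs = []
    for k in grid_sizes:
        # Boxes of side L = 1/k touched by the curve.
        boxes = {(min(int(x * k), k - 1), min(int(y * k), k - 1)) for x, y in points}
        logs.append((math.log(k), math.log(len(boxes))))
    mean_x = sum(a for a, _ in logs) / len(logs)
    mean_y = sum(b for _, b in logs) / len(logs)
    num = sum((a - mean_x) * (b - mean_y) for a, b in logs)
    den = sum((a - mean_x) ** 2 for a, _ in logs)
    return num / den


line_dim = box_counting_dimension(list(range(100)))
rng = random.Random(0)
noise_dim = box_counting_dimension([rng.random() for _ in range(400)])
```

A straight line yields an estimate close to 1, while an uncorrelated random series fills most of the plane and yields an estimate much closer to 2. The chosen grid sizes play the role of the range of box lengths over which the scaling is measured.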
In particular, two consecutive windows of said plurality of windows are partially overlapped.
Advantageously, the determined length Z is in the range between 12 hours and 60 days.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be shown with the following description of an exemplary embodiment thereof, exemplifying but not limitative, with reference to the attached drawings in which:
FIG. 1 shows a flowchart with the main steps of the method, according to the invention, for analyzing data from the web;
FIGS. 2A to 2D diagrammatically show the step, as it is provided by the method illustrated in FIG. 1, of splitting the time-series of web resources related to a determined topic into several overlapping and consecutive windows;
FIG. 3 diagrammatically shows a technique that can be applied for carrying out the step of computing the fractal dimension D of the web resources;
FIG. 4 shows a diagram reporting the number of web resources talking about the Democratic candidates Barack Obama and Hillary Clinton collected from December 2007 to May 2008;
FIG. 5 shows a diagram reporting the trend index, hereinafter also called TrendIndex, as it is computed by the method, according to the invention, on the basis of the web resources shown in FIG. 4;
FIG. 6 shows a diagram reporting the number of web resources talking about the candidates Barack Obama and John McCain collected from September 2008 to November 2008;
FIGS. 7 and 8 show two diagrams reporting the TrendIndex computed by the method, according to the invention, on the basis of the web resources shown in FIG. 6 for two different time windows;
FIG. 9 shows a diagram reporting the number of web resources talking about MySpace and Facebook collected from December 2007 to July 2008;
FIG. 10 shows a diagram reporting the TrendIndex computed by the method, according to the invention, on the basis of the web resources shown in FIG. 9 for a time window of 30 days;
FIG. 11 shows a diagram reporting the TrendIndex related to "MySpace" and "Facebook", with a time window of 7 days and α=0.5.
DESCRIPTION OF PREFERRED EXEMPLARY EMBODIMENTS
FIG. 1 shows the main steps of the method, according to the invention, for understanding society through web data analysis.
The method starts with the choice of a determined topic (S), identified by one or more keywords, block 101.
A collecting step, block 102, is then provided at successive instants t for collecting data, or web resources, mentioning the topic (S) from the web. In particular, two successive instants t are separated by an interval of time of determined length.
A counting step is also provided for counting the number of web resources W(S) that mention the topic (S) at each instant t, block 103a. After the step of counting the number W(S), a step of classifying the web resources by means of their space locations can also be provided, block 103b. In particular, the step of classifying the web resources can be carried out by means of an IP address, the last page update, or any other property that can be identified by means of a "selection rule" (translated into a "regular expression" in software), also associated with Web 2.0 properties (such as being part of a blog, a given community, etc.) and with semantic web properties (semantic web).
Then a time-series of N consecutive periodic measures of the number of web resources is generated, block 104. In particular, the time-series represents the average number W(S) of web resources as a function of time.
The time-series is subsequently split into a plurality of consecutive time windows of determined length Z, block 105. In particular, the length Z of each time window is greater than the length of the intervals of time between two successive instants at which the web resources are collected and counted, in order to have at least one web resource within each time window. In a preferred embodiment of the invention, the collecting and counting steps of the web resources are carried out by a web crawler, i.e. a computer program that automatically browses the web at each instant t.
A determined technique is then applied to each time window of the time-series in order to compute, for each of them, a corresponding level of correlations Lc, block 106.
For example, if the technique applied to each time window is Fractal analysis, the level of correlations is expressed in terms of a fractal dimension D. In particular, the fractal dimension D indicates the level of correlation existing in the web resources W(S) of a same time window. The value of the fractal dimension D is in the range between 1 and 2. In more detail, the value D of the fractal dimension is equal to 1 when the web resources of a same time window correspond to a regular system, while D is equal to 2 when the web resources of a same time window correspond to a random system. If the web resources of a same time window are correlated, then the value of the fractal dimension D is equal to 1.5. An example of a regular system is a single blogger who posts different messages about the same topic. Although the number of web resources talking about the topic smoothly increases, this increasing number does not reflect a growth of interest in the topic by society. It simply represents a blogger who is very interested in the topic. An example of a random system is several web resources without any correlations that talk about the same topic (e.g., people who post messages about the same topic but do not relate to each other).
To compute the fractal dimension D it is possible to use the box counting algorithm as disclosed in "Fractal conductance fluctuations in a soft wall stadium and a Sinai billiard", Phys. Rev. Lett., 80:1948, 1998, by A. S. Sachrajda et al. As illustrated in FIG. 3, the fractal dimension D of a signal is obtained by covering the curve of data 30 with a grid of square boxes 50 of size L². The number M(L) of boxes needed to cover the curve is recorded as a function of the box size L.
The fractal dimension D of the curve is then defined as:
$$D = -\lim_{L \to 0} \log_L M(L) \qquad \text{(III)}$$
If the value of D, as calculated with the equation, is equal to 1, then the curve is a straight line, as is the case for a regular system, whereas if D is equal to 2 the curve is a random curve. Indeed, a random curve eventually covers the whole plane uniformly. Any value of D between these integer values is a signal of the fractality of the curve.
In another embodiment of the invention, the fractal dimension D is calculated by applying a different technique which uses rectangular boxes of size L×Δi, where Δi is the largest excursion of the curve in the region L.
Then, the number
$$M(L) = \frac{\sum_i \Delta_i}{L}$$
is computed.
For any curve, a region of box lengths Lmin < L < Lmax exists. Outside of this region either D=1 or D=2. The first equality (D=1) holds for L < Lmin and is due to the coarse grain artificially introduced by any discrete time series. The second equality (D=2) is obtained for L > Lmax and is due to the finite length of the analyzed time series. The boundaries Lmin, Lmax have to be chosen properly for any time series. Unfortunately, the selection of Lmin, Lmax is prone to errors since non-optimal boundaries may be selected. However, as reported in the experimental results, this error does not really affect the computation of the TrendIndex. The fractal dimension represents the temporal correlation of a sequence of given values. To better appreciate how this temporal correlation changes with time, the method, according to the present invention, considers the N collected samples of the time-series FWeb(S) as several overlapping and consecutive time windows of length Z (e.g., in our experiments we consider Z equal to 7 and 30 days).
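Splitting the N collected samples into overlapping, consecutive windows can be sketched as below. The function name `sliding_windows` and the `step` parameter are assumptions for illustration; the text states that windows overlap but does not state by how much. With hourly samples, a window of Z = 7 days would correspond to `window_len=168`.

```python
def sliding_windows(series, window_len, step):
    """Split a time-series into consecutive windows of window_len samples.

    Successive windows start step samples apart, so they overlap whenever
    step is smaller than window_len. Trailing samples that do not fill a
    whole window are dropped.
    """
    if window_len <= 0 or step <= 0:
        raise ValueError("window_len and step must be positive")
    last_start = len(series) - window_len
    return [list(series[i:i + window_len]) for i in range(0, last_start + 1, step)]


overlapping = sliding_windows(list(range(6)), window_len=4, step=1)
disjoint = sliding_windows(list(range(8)), window_len=4, step=4)
```

Setting `step` equal to `window_len` gives non-overlapping windows; a smaller `step` gives a smoother sequence of per-window values at the cost of more computation.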
Once the fractal dimension D has been calculated, a step of estimation of the average number WM(S) of the web resources is provided. This step is carried out for estimating the average number of the web resources of a same time window, block 107.
The fractal dimension of each time window is then combined with the average number WM(S) of the web resources of the same time window for computing a TrendIndex, block 108. Iterating the latter step over all the time windows produces a sequence of trend indexes which show how the opinions that society has on a topic S have changed over time.
Fractal Analysis has been extensively employed in diverse scientific, sociological, and philosophical areas of research, and is used to describe physical, visual, acoustic, and chemical processes, and biological, weather, and financial systems. The importance of Fractal Analysis is that, given a sequence of values related to different time points (i.e., a time-series), it gives a fast insight into the "system" that generated the sequence of values.
In many scenarios, the knowledge of the "system" brings considerable benefits: for instance, it may be useful to predict the near-future behaviour of earthquakes, or the stock market trend. The computation of the fractal dimension of a time-series allows discerning whether the system that generated the sequence is regular or random. Roughly, a regular system produces smooth changes in the sequence of values, whereas a random system produces highly irregular changes in the sequence. In our scenario, the sequence of values is the time-series FWeb(S) and the system is composed of all the web resources that talk about topic S in the web space. We regard temporal correlations among web resources as being of fundamental importance. In fact, the web is a time-evolving scenario where the number of web resources talking about a topic differs from time to time, and the more these web resources are temporally correlated, the more the topic reflects an interest of society.
In particular, correlations that survive long enough are likely to create a network of web resources. If this happens, the network will likely respond to subsequent stimuli (new events related to the topic) in a similar "correlated" way. Conversely, if the network is not sufficiently correlated, it will eventually vanish and disappear, and the response to subsequent stimuli will be negligible. By applying Fractal Analysis and by computing the Fractal Dimension D, it is possible to gain an insight into the amount of correlations present in the network of web resources, i.e. whether the system is regular, random, or something in between, as anticipated above. The present technique starts from the concept that anything between a regular and a random system means that the network of web resources that generated the sequence is correlated, and thus it will likely cause other people to become part of the network. As a result, more parts of society are interested in the topic. To clarify, let us consider a simple example: the success of a TV series. An ensemble of fans can be triggered by the pilot episode so as to form a group of people interested in the TV series. In this case, the group of fans is a correlated network, as they talk almost every day about the TV series. In fact, when a new episode is aired this group of fans will easily be the first to talk about it, and it is likely that they will cause other people to become fans; this means that the network grows, as additional people become part of the network.
Three examples of application of the method, according to the invention, for analyzing society through web data are illustrated hereafter. In particular, three well-known scenarios are presented: the 2008 USA primary elections, the 2008 USA Presidential elections, and MySpace vs Facebook.
In the following examples, Fractal analysis has been applied to a plurality of time windows for quantifying the level of correlations existing in the web resources. The level of correlations Lc existing in the web resources W(S) of a same time window is, therefore, expressed as a fractal dimension D computed using the equation (III) indicated above. In particular, a value of the parameter α=0.5 has been chosen, which corresponds to weighting the importance of the correlations as much as one order of magnitude in the number of web resources. A reasonable rationale is that if, when comparing two different topics, one appears more than ten times as often as the other, it should be recorded in any case as more influential. That is, in direct comparisons the correlations play a role only if the two topics have the same order of magnitude of web resources. Other choices of the value of α are of course possible.
The TrendIndex is computed using the equation (II) indicated above:
$$\mathrm{Trend}_i^Z(S) = \alpha \cdot \log W_M^i(S) + (1-\alpha) \cdot D_i(S)$$
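A short worked example may help to see why α=0.5 balances one order of magnitude in the number of web resources against the full range of D. The figures below are invented for illustration, and the logarithm is assumed to be base 10, which the text does not state explicitly. Consider a topic A with an average of 10,000 web resources and D = 1, and a topic B with an average of 1,000 web resources and D = 2:

$$\mathrm{Trend}(A) = 0.5 \cdot \log_{10} 10^4 + 0.5 \cdot 1 = 2.0 + 0.5 = 2.5$$
$$\mathrm{Trend}(B) = 0.5 \cdot \log_{10} 10^3 + 0.5 \cdot 2 = 1.5 + 1.0 = 2.5$$

Under these assumptions a tenfold difference in the number of web resources exactly offsets the largest possible difference in D (from 1 to 2). A topic that appears more than ten times as often therefore always obtains the higher index, while for topics of the same order of magnitude the fractal dimension decides the comparison.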
Example 1
Web resources talking about US Presidential candidates have been collected from the second Week of December 2007 to the second Week of May 2008. FIG. 4 reports the number of
Web resources talking about the Democratic candidates Barack Obama and Hillary Clinton. The number of Web resources has been detected every hour.
At first look, many more Web resources talked about Hillary Clinton (only from the beginning of April did the two candidates have a comparable number of Web resources). The presence of peaks can also be noted. Analyzing where these peaks occur shows that both candidates have peaks around primary election contests. It is also interesting to note that the number of Web resources increased considerably at the beginning of March. A reasonable explanation is that at the beginning of March the candidate John McCain secured the Republican nomination, and hence media attention began focusing mainly on the Democratic party.
Practically speaking, judging from this chart, Hillary Clinton should have won all the primary election contests, but she did not. An interesting comparison can be made by analyzing the volume of Web searches or the number of posts inside the Blogosphere, rather than the number of Web resources. The term “Barack Obama” has been entered in Web search engines many more times than the term “Hillary Clinton”, and many more posts in the Blogosphere talked about Barack Obama than about Hillary Clinton. Judging from these latter scenarios, Barack Obama should have won all the primary election contests, but he did not.
This shows that an analysis based on the simple magnitude of results (either Web resources, posts in the Blogosphere, or searches in Web search engines) may represent a distorted reality, and therefore may not be sufficient to find society's interests.
FIG. 5 reports the TrendIndex. In the period up to the end of January, Barack Obama won primary election contests (Iowa and S. Carolina) and obtained interesting results in others (New Hampshire, Nevada, and Florida). The majority of the media defined these results as unexpected. But looking at the TrendIndex, these results were not unexpected at all. In fact, in this period the TrendIndex related to Barack Obama was always higher than that of Hillary Clinton, meaning that people discussed Barack Obama much more than Hillary Clinton.
A second interesting period to analyze is February. In that period several discussions focused on a possible withdrawal of Hillary Clinton from the Presidential race. The TrendIndex shows that in February the buzz around Hillary Clinton increased considerably, and was always higher than the TrendIndex related to Barack Obama. In the second half of March (when no primary election contests were scheduled), the buzz around the two candidates decreased. When the primary election contests began again, the buzz around both increased, with the one about Barack Obama higher than the one about Hillary Clinton (it should be noted that at the beginning of June 2008, Hillary Clinton withdrew from the Presidential race, and Barack Obama became the Democratic nominee for President of the United States). In summary, while approaches based on the simple magnitude of results were not sufficient to find society's interests, the TrendIndex better represented what was going on in society.
Example 2
FIG. 6 reports the number of Web resources talking about Barack Obama and John McCain from mid September to November 3 (the period when the battle for the Presidency became interesting).
The number of Web resources talking about Barack Obama is higher than the number talking about John McCain. It is interesting to observe that on October 16, John McCain
almost reached Barack Obama, and even passed him on October 22. Looking at what happened in society, we observe that on October 16, John McCain was a guest of the “David Letterman Show” and the video became very popular on video sharing sites like YouTube. On October 22, the media talked a lot about rumors related to the campaign expenses of John McCain’s vice-presidential candidate. Also worth noting is the impact that a speech held in St. Louis on October 18 had on Barack Obama. Note also how, as of November 3, the difference between the two candidates is quite clear.
Looking at the number of Web searches, the term “Barack Obama” has been entered many more times than the term “John McCain”. In the Blogosphere, the two had a comparable number of posts until mid October, and since then the difference between the two has been widening, with many more posts talking about “Barack Obama”.
In summary, approaches based on the magnitude of results show that Barack Obama is clearly taking the lead over John McCain.
FIGS. 7-8 report the TrendIndex computed on time windows of 7 and 30 days, i.e. Z = 7 and Z = 30 in equation (II), respectively.
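How a series of counts might be split into time windows of 7 or 30 days, and averaged within each window as the method requires, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the counts are assumed to be sampled hourly, the windows are assumed to be consecutive and non-overlapping, the window length is measured in number of samples, and an incomplete trailing window is dropped. The function name `window_averages` and the sample data are hypothetical.

```python
def window_averages(counts, window_days, samples_per_day=24):
    """Average the counts inside consecutive windows of `window_days` days.

    Assumptions: hourly sampling by default, non-overlapping windows,
    and an incomplete trailing window is discarded.
    """
    size = window_days * samples_per_day
    if size <= 0:
        raise ValueError("window length must be positive")
    averages = []
    for start in range(0, len(counts) - size + 1, size):
        window = counts[start:start + size]
        averages.append(sum(window) / len(window))
    return averages


# Two hypothetical weeks of hourly counts: a quiet week, then a busy one.
hourly_counts = [1] * (7 * 24) + [3] * (7 * 24)
weekly = window_averages(hourly_counts, 7)     # one average per 7-day window
monthly = window_averages(hourly_counts, 30)   # empty: fewer than 30 days of data
```

Shorter windows react to individual events and fluctuate more, while longer windows smooth those fluctuations, which is the trade-off visible when the 7-day and 30-day results are compared.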
FIG. 7 presents the TrendIndex computed on time windows of 7 days. It can be observed that in the last month there were several turnarounds, showing that one week society is more interested in “Barack Obama” whereas the following week the interest goes to “John McCain”. This highlights the fact that the two candidates are discussed in society in a comparable way and that, depending on particular events (e.g., presence on a popular TV show, rumors about personal matters), the discussion moves from one candidate to the other. FIG. 8 presents the TrendIndex computed on time windows of 30 days. In this investigation too, it can be observed that the distance between the two candidates fluctuates over time, and since October 28, the distance between the two candidates has been widening.
Example 3
FIG. 9 reports the number of Web resources talking about MySpace and Facebook (the two most popular social networking sites) from the beginning of December 2007 to the end of July 2008. With the exception of the first half of December 2007, many more Web resources talk about MySpace than Facebook. Therefore, an analysis based on the simple magnitude of results would indicate that society talks more about MySpace than Facebook. Since it is difficult to tell whether this is true or not, it is worth investigating both the number of Web searches and the Blogosphere.
Results obtained from analyzing the number of Web searches in the same period show that, beginning from mid April 2008, the term Facebook was entered in search engines much more often than the term MySpace. Therefore, methods based on the number of Web searches would indicate that society talks more about Facebook.
Results obtained by analyzing the Blogosphere alone show that the keyword “MySpace” appears in many more posts than the keyword “Facebook”. The difference is considerable (“MySpace” appears about twice as often as the keyword “Facebook”), but beginning from September 2008, the two keywords appear in a similar number of posts (although “MySpace” has around 20% more posts than “Facebook”).
The comparison among the three analyses (the whole Web, the number of Web searches, and the Blogosphere) shows that methods based on the simple magnitude of results (either Web resources, posts in the Blogosphere, or searches in Web search engines) produce results that contradict each other, and therefore they are not suited to find society's interests. FIG. 10 shows the TrendIndex computed on time windows of
30 days (i.e., Z = 30 in Equation (II) used for computing the TrendIndex). The two curves show that, beginning from February, Facebook received increasing attention from society (one reason may be the launch of the Spanish language version of Facebook), but in June MySpace overtook Facebook (a possible reason is that in June MySpace redesigned the site with an improved TV player and start page). To summarize, this analysis shows that society talks about “MySpace” and “Facebook” with comparable frequency, and that events like a redesign of the Website or the availability of new languages clearly show their effects on society.
To understand whether a particular event (e.g., a commercial, an article) produces effects on society or not, it is interesting to perform the analysis using a 7-day time window (FIG. 11). The high fluctuation of the two curves is due to the relatively short length of the observed time windows. A length of 7 days is effective for detecting such effects. Results show that, beginning from February, there were frequent turnarounds. Once again, the analysis shows that “MySpace” and “Facebook” receive comparable attention from society.
The foregoing description of a specific embodiment will so fully reveal the invention according to a conceptual point of view, so that others, by applying current knowledge, will be able to modify and/or adapt for various applications such an embodiment without further research and without parting from the invention, and it is therefore to be understood that such adaptations and modifications will have to be considered as equivalent to the specific embodiment. The means and the materials to realise the different functions described herein could have a different nature without, for this reason, departing from the field of the invention. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation.
What is claimed is:
1. Method for analyzing data from the Web comprising the steps of:
choosing a determined topic (S), said topic (S) being identified by at least one keyword;
collecting data, or Web resources, from the Web that mention said determined topic (S) at successive instants t, two successive instants t being separated by an interval of time of determined length d;
counting the number W(S) of said Web resources that mention said determined topic (S) at each instant t;
generating a time-series of consecutive measures of the number of said Web resources, said time-series representing said number W(S) of Web resources as a function of time;
splitting said time-series into a plurality of consecutive time windows of determined length Z, with Z ≥ d in such a way that each time window comprises at least one Web resource among said Web resources;
applying a determined technique to said plurality of time windows for quantifying, for at least one time window among said time windows, the level of correlations Lc existing in the Web resources W(S) of a same time window T;
estimating, for each time window, the average number WM(S) of said Web resources W(S) that mention said topic (S);
computing, for each time window, a trend index by combining said average number of said Web resources WM(S) with said level of correlations Lc;
repeating said computing step of said trend index for all said time windows generating a sequence of trend indexes which show how opinions that the society has on a topic S changed over time.
2. Method according to claim 1, wherein said determined technique applied to said plurality of time windows is selected from the group comprised of:
Fractal analysis;
Fourier transform;
Wavelet analysis;
Entropy analysis.
3. Method according to claim 1, wherein said determined technique applied to said plurality of time windows is Fractal analysis and said level of correlations Lc existing in the Web resources W(S) of a same time window T is expressed as a fractal dimension D.
4. Method according to claim 1, wherein said combining step comprises a step of associating to said average number of said Web resources WM(S) and to said fractal dimension D a different weight, said associating step carried out selecting a parameter α in the range comprised between 0 and 1 depending on the importance to assign to said averaged number of Web resources WM(S) and to said fractal dimension D respectively.
5. Method according to claim 1, wherein said estimating step of said average number of Web resources of a determined time window is carried out applying the following equation:
$W_M^i(S) = \frac{\sum_j W_j^i(S)}{|T_i|}$ (I)
where $T_i$ is the i-th time window, $W_M^i(S)$ is the average number of Web resources $W_j^i(S)$ in the i-th time window and $|T_i|$ is the length of the i-th time window.
6. Method according to claim 1, wherein said computing step of said trend index is carried out applying the following equation:
$\mathrm{Trend}_i^Z(S) = \alpha \cdot \log W_M^i(S) + (1 - \alpha) \cdot D_i(S)$ (II)
wherein $D_i(S)$ is the fractal dimension of the time window and α is a parameter comprised in the range between 0 and 1.
7. Method according to claim 1, wherein said collecting step of data from the Web and said counting step of said number of Web resources W(S) are automatically carried out by a computer program, or Web crawler, which browses the Web at said intervals of time of length d.
8. Method according to claim 1, wherein said computing step of said fractal dimension D comprises the steps of:
covering the curve of said time-series of data with a grid of square boxes of a determined side (L);
recording the number M(L) of boxes needed to cover said curve as a function of said box;
computing said fractal dimension D applying the following equation:
$D = -\lim_{L \to 0} \log_L M(L)$ (III)
9. Method according to claim 1, wherein two consecutive windows of said plurality of windows are partially overlapped.
10. Method according to claim 1, wherein said determined length Z is comprised in the range between 12 hours and 60