Chronological corpora curve clustering: From scientific corpora construction to knowledge dynamics discovery through word life-cycles clustering

Method Article

Matilde Trevisani*, Arjuna Tuzzi

Department of Economics, Business, Mathematics and Statistics (DEAMS), University of Trieste, Italy
Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA), University of Padova, Italy

ABSTRACT

The aim of this procedural method is to construct well-founded corpora of scientific literature and, hence, to track the evolution of knowledge fields from the reconstruction and clustering of words' life-cycles. The method contains:

- an original selection process of relevant keywords, involving the identification of relevant stems and stem n-grams through matching with item lists of relevant glossaries;
- several types of normalization of the temporal trajectories of word raw frequencies;
- a properly customized clustering of word life-cycles, with an extensive graphical investigation of the best candidates for the cluster number, to unveil the important dynamics and decipher the history of a scientific field.

© 2018 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

ARTICLE INFO

Method name: Chronological corpora curve clustering
Keywords: Diachronic corpora, Functional data analysis, Normalization, Cluster number selection
Article history: Received 23 March 2018; Accepted 10 November 2018; Available online 19 November 2018

Specifications Table

Subject area: Computer Science
More specific subject area: Computational linguistics
Method name: Chronological corpora curve clustering

* Corresponding author.

E-mail address: matilde.trevisani@deams.units.it (M. Trevisani).

https://doi.org/10.1016/j.mex.2018.11.010

2215-0161/© 2018 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Contents lists available at ScienceDirect

MethodsX

Name and reference of original method: M. Trevisani, A. Tuzzi [1], Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories, Knowledge-Based Systems 146 (2018) 129-141.

Method details

Given a knowledge field of interest, the procedural method consists of two main phases:

I An information retrieval process that, starting from a large corpus of texts retrieved from scientific articles published over a lengthy period by a selection of premier journals of the field, leads to an effective representation of the corpus by a lexical contingency table reporting the frequencies over time of all relevant keywords.

II A statistical learning process that, through four stages:

- normalization of time trajectories of word (raw) frequencies, chosen according to the different aspects of word life-cycles to be highlighted;
- filtering time trajectories of word (normalized) frequencies, interpreted as functional data (FD) and thus represented as smooth functions;
- curve clustering (CC) to discover important macro-dynamics latent to word micro-histories;
- interpretation by expert opinion to decipher detected dynamics,

leads to a reading (or readings) of the history of the knowledge field.

We adopt a basis function approach to filtering with a B-spline basis system. Moreover, we take a distance-based approach to CC and use a k-means algorithm for FD combined with an appropriate metric for measuring distance between curves.

Related work

The method aims at composing a history of a field of knowledge by a distant reading of the scientific literature available through an articles database. This objective situates our method within the various approaches for science mapping, which have drawn much attention in recent years. However, the main methodologies developed in bibliometrics, scientometrics, informetrics and related fields, though partly sharing similar purposes, are substantively different from our proposal and cannot answer our particular question effectively.

Topic modelling aims at detecting topics, i.e. thematic groups, in collections of documents. Moreover, when documents exhibit a temporal ordering, it enables the discovery of topic trends. Latent Dirichlet Allocation (LDA), the most widespread topic model, is a probabilistic generative process that models each document as a mixture of topics, where each topic corresponds to a multinomial distribution over words [2]. Topics over time can be detected by modelling time jointly with word co-occurrence patterns for topic discovery [3,4]. A further extension of LDA incorporates both the temporal ordering and the authorship information of documents to improve the topic discovery process [5]. Topic modelling connects to scientometrics or, more generally, to quantitative methods for mapping knowledge domains from scientific article databases. These methods are based on term and/or citation co-occurrences in documents, possibly observed over time in order to reconstruct a field's evolution [6,7]. Recent developments of co-citation network-based analyses build a dynamic scientific map via overlapping authors across fields [8] or via communities of authors working on semantically related topics at the same time [9].

[...] possibly observed over time, while our work considers word co-occurrence solely in time, as our primary focus is the temporal evolution of words. Then, more importantly, topic-centered methods focus first on the structure of science and on detecting topics and then on tracking their evolution, whereas our approach focuses first on tracing life cycles of words and then on detecting important dynamics of temporally homogeneous groups of words in order to decipher the history of a knowledge field. As a consequence, in topic-centered methods, words that represent the same topic (as they appear together in documents) may have an irreconcilable temporal evolution, whereas, in our approach, different themes, research fields and schools of thought can in principle be represented within the same group of words. Moreover, in topic-centered methods, topic evolution can only be a roadmap, i.e., an abstract description (the average evolution of words grouped by co-occurrence) of basic movements over time. Additionally, the abstract definition of topics is subject to continuous destruction and reconstruction over time, making topic tracking a fragile and questionable artefact. Conversely, in our approach, the detected dynamics really represent temporal patterns of words, e.g., essentially increasing, decreasing or constant trends, trends with an isolated peak for briefly faddish words, or roughly bell-shaped trends for words which had a golden age and then disappeared.

Finally, our choice of specific statistical tools is underpinned by the literature as follows. The basis function approach is the most widely used for representing FD, and B-splines are a very flexible basis system for non-periodic FD [12]. Moreover, B-splines enable us to recognise continuous and regular curves, and hence more easily interpretable shapes. Upstream, we decided on a distance-based approach to CC, as one of our objectives was to set up an exploratory and mostly automated procedure. In fact, the procedure is called upon to look for interesting patterns to be submitted to experts who can potentially formulate new hypotheses and research questions. This eminently exploratory task requires the procedure to be fast and relatively easy to use and understand, even by non-statisticians in interdisciplinary groups involved in research projects. Having opted for distance-based methods, we note that k-means type clustering algorithms have been widely applied to FD, especially when combined with the finite basis expansion approach. Other strategies which extend the classical k-means algorithm to FD are essentially based on functional principal components. However, they are recent extensions, rarely used and, thus, less justifiable as the basis for our explorative approach (some interesting overviews of strategies for clustering FD are provided by [13] and [14]).

Procedure

I – Compiling and pre-processing the corpus

Corpus design and compilation

0 Selection of data sources, i.e. choice of outstanding journals able to cover the main topics and represent the temporal evolution of the knowledge field.

1 Text harvesting, i.e. downloading of available information on articles (authors, title/abstract/full text, number, issue, volume) from journal archives, to constitute the corpus. Texts under consideration may consist of titles or abstracts or full texts of the articles. The corpus is typically organized into subcorpora, i.e. collections of texts sharing the same time reference, thus generating a sequence of text sets associated with chronological points on the time axis.

2 Tokenization of the corpus, i.e. identification of all words (sequences of letters isolated by means of separators). The corpus contains a finite set of different words (i.e. word-types) that represents its vocabulary (or word list). A word-token is a particular occurrence of a word-type and the number of occurrences is the word-type frequency.

Preparation of textual data

3 Stemming, i.e. transformation of words into stems by means of Porter's stemming algorithm [15].

5 Tagging keywords, i.e. identification of all relevant statistical keywords (stems and stem-segments) by matching the (stemmed) vocabulary of the corpus with the (stemmed) list of items retrieved from relevant glossaries of the knowledge field. The tagging procedure assigns a label to all vocabulary items that are included in glossaries.

6 Thresholding, i.e. selection of all keywords with frequencies at least equal to an opportunely fixed threshold.

Finally, the corpus is represented by a keywords × documents/time-points contingency table containing the frequencies of the selected keywords (by row) along the time-points (by column) of the considered period.
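The tagging and thresholding steps, and the resulting contingency table, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tokens are taken to be already stemmed, only single stems (not stem-segments) are matched, and the function name `build_keyword_table` and the toy data are hypothetical, not part of the authors' software.

```python
from collections import Counter

def build_keyword_table(subcorpora, glossary, threshold):
    """Return {keyword: [frequency at each time-point]}.

    subcorpora: list with one entry per time-point, each a list of stemmed tokens
                (assumption: stemming was done upstream).
    glossary:   set of stemmed glossary items (single stems only in this sketch).
    threshold:  minimum total frequency for a keyword to be kept.
    """
    counts = [Counter(tokens) for tokens in subcorpora]
    vocabulary = set().union(*counts)            # all word-types in the corpus
    tagged = vocabulary & set(glossary)          # tagging: vocabulary items found in glossaries
    table = {w: [c[w] for c in counts] for w in sorted(tagged)}
    # thresholding: keep keywords whose total frequency reaches the threshold
    return {w: row for w, row in table.items() if sum(row) >= threshold}

# Toy data (hypothetical): three time-points of stemmed title words.
subcorpora = [
    ["model", "test", "rural"],
    ["model", "model", "regress"],
    ["test", "model"],
]
glossary = {"model", "test", "regress"}
table = build_keyword_table(subcorpora, glossary, threshold=2)
```

In this toy run, `rural` is dropped because it is not a glossary item and `regress` because its total frequency is below the threshold; rows of `table` are keywords and columns are time-points.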

Stemming can be carried out by the Porter Stemmer available online (http://textanalysisonline.com/nltk-porter-stemmer) or, alternatively, within the R software environment [17] by the wordStem routine of the SnowballC library. We use the Taltac software [18] for tagging, though it can be equivalently performed by any software enabling the comparison between two lists (e.g., Excel).

II – Statistical learning

Normalization

A chronological corpus is typically characterized by the following features.

(i) The size of subcorpora (number of texts and their size in word-tokens) may vary greatly over time.
(ii) The large number of rare events (LNRE) property of textual data, i.e. a large number of word-types having a quite low probability of occurring. This property implies:

- the total frequency (or popularity) of individual words in the entire corpus is greatly variable;
- the frequency spectrum by time-point is highly asymmetric;
- sparsity, i.e. many cells of the contingency table have small counts or are empty.

In the section Method validation, features (ii) are evident from the plot of the original word trajectories (Fig. 2). Classification of words according to their popularity highlights the great disparity of curve amplitude between high-frequency and low-frequency words (VH, H, L and VL classes are identified by colour intensity in Fig. 2) and the 0-level curve sections characterizing rare words.

From the foregoing, normalization of raw frequencies is necessary to properly reconstruct and compare the temporal evolution of words.

Several types of normalization are shown in Table 1 (which is an excerpt of Table A.2 in [1]). A normalization by column (c1, c2, c3 or c4) is necessary to adjust for the uneven document dimension across time (i). A normalization by row (r1, r2 or r3) allows word trajectories to be compared by timing (synchrony) regardless of height (popularity) (ii). A double normalization (both by row and by column) (d) serves to fix both (i) and (ii).

In the section Method validation, the calculation of a specific double normalization (d1) is shown.
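As a rough illustration of the single normalizations named in Table 1, the following Python sketch applies a column-sum normalization and the three row normalizations (row sum, z-score by row, max row frequency) to a small keyword table. The function names are hypothetical, the use of the population standard deviation for the z-score is an assumption, and the exact definitions used by the authors are those of Table A.2 in [1].

```python
from statistics import mean, pstdev

def normalize_by_column_sum(table):
    """Divide each cell by its column (time-point) total."""
    col_sums = [sum(col) for col in zip(*table)]
    return [[v / s if s else 0.0 for v, s in zip(row, col_sums)] for row in table]

def normalize_by_row_sum(table):
    """r1: divide each cell by its row (word) total."""
    out = []
    for row in table:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def zscore_by_row(table):
    """r2: centre and scale each row (population standard deviation assumed)."""
    out = []
    for row in table:
        m, s = mean(row), pstdev(row)
        out.append([(v - m) / s if s else 0.0 for v in row])
    return out

def normalize_by_row_max(table):
    """r3: divide each cell by the maximum frequency of its row."""
    out = []
    for row in table:
        top = max(row)
        out.append([v / top if top else 0.0 for v in row])
    return out

# Toy keyword table (hypothetical): 2 words observed at 3 time-points.
raw = [[1, 2, 1],
       [1, 0, 1]]
by_col = normalize_by_column_sum(raw)
r1 = normalize_by_row_sum(raw)
r2 = zscore_by_row(raw)
r3 = normalize_by_row_max(raw)
```

Row normalizations put every trajectory on a comparable scale, so that words are compared by the timing of their rises and falls rather than by their overall popularity.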

Filtering

In our method, the time trajectory of word frequencies is viewed as a proxy of word diffusion and vitality, i.e. of the word life-cycle. We therefore adopt a functional data analysis (FDA) approach under which the time trajectory of word frequencies constitutes a functional datum assumed to be a realization of an underlying continuous function representing the word temporal evolution.

Table 1
Excerpt of the normalization plan from Table A.2 in [1].

Normalization by col:   Subcorpus             Matrix
by row                  #titles   #tokens     col sum (p)   col max freq
row sum                 d         d           d1            d              r1
z-score by row          d         d           d             d              r2
max row freq            d         d           d             d              r3

Let $y_i = \{y_{ij}\}$ be the functional observation of word $i$, consisting of the set of (normalized) frequencies at time-points $j = 1, \ldots, T$, for each $i = 1, \ldots, N$, and let $x_i(t)$ be the underlying continuous function representing the word temporal development. The following choices are taken for filtering $x_i(t)$ from $y_i$.

We adopt the basis function approach for representing FD as smooth functions, where $x_i(t)$ is expressed as a finite linear combination of basis functions [12]. We consider B-spline bases, which are piecewise polynomials joined smoothly at the interior nodes. Lastly, we place knots (the values of $t$ at which adjacent segments are joined) at each time-point of observation.

As regards the estimation, we adopt the roughness penalty approach for smoothing FD, where the estimate of $x_i$ is the one optimizing the bias-variance trade-off by tuning the smoothing parameter $\lambda$. We consider the generalized cross validation (GCV) criterion for selecting the optimal smoothing by varying the spline order $m$ ($m$ from 1 to 8) as well as the roughness penalty order $r$ (besides the standard $r = m-2$, $r = 2$, for $m > 3$, $r = 1$, for $m > 2$, finally, $r = 0$) [1].

In the section Method validation, the optimal smoothing selection is illustrated for the case of d1 normalized data (Fig. 4).

The calculation is carried out within the fda library in R and an ad-hoc developed routine.
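For reference, the quantities involved in the filtering step can be written out in their standard textbook form [12]. The symbols $K$ (number of basis functions), $\phi_k$ (B-spline basis functions), $c_{ik}$ (coefficients), $t_j$ (the $j$-th time-point) and $D^r$ (the $r$-th derivative operator) are notation introduced here for illustration, and the GCV expression is given for a single curve; how the criterion is aggregated over the $N$ curves is not specified in this section.

```latex
% Basis expansion of the smooth curve for word i
x_i(t) = \sum_{k=1}^{K} c_{ik}\,\phi_k(t)

% Penalized least-squares criterion with roughness penalty of order r
\mathrm{PENSSE}_{\lambda}(x_i) = \sum_{j=1}^{T} \bigl[\, y_{ij} - x_i(t_j) \bigr]^2
  + \lambda \int \bigl[ D^{r} x_i(t) \bigr]^2 \, dt

% Generalized cross validation, with df(\lambda) the effective degrees of freedom
\mathrm{GCV}(\lambda) = \left( \frac{T}{T - df(\lambda)} \right)
  \left( \frac{\mathrm{SSE}(\lambda)}{T - df(\lambda)} \right)
```

Larger values of $\lambda$ give smoother curves with fewer effective degrees of freedom; GCV balances the residual sum of squares against this loss of flexibility.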

Curve clustering

We adopt a distance-based method to CC where the distance between curves is approximated by using the discretely observed evaluation points of the estimated curves $x_i(t)$ [13].

The following choices are taken for clustering:

- k-means algorithm;
- several options for distance: besides the conventional distances (Euclidean or Manhattan, among others), other options can be taken from the broad range of dissimilarity measures set out to perform clustering of time series [19];
- for each cluster number (k from 2 to an opportune range maximum), 20 re-runs from different initial configurations set through the k-means++ seeding method;
- the best candidates for the cluster number are identified by pooling the ratings from a large number of clustering quality criteria (about 50, see [20] and [21]). In more detail, in order:
  - a ranking of cluster numbers is computed for each quality index,
  - all the rankings are pooled and, for each cluster number, the frequencies of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) are calculated,
  - an ordered set of best candidates for the cluster number is retrieved from a qualitative inspection of the graphical representation of the frequencies of being in the first four top positions for each cluster number (see Fig. 6 in the section Method validation for illustration).
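The pooling of rankings described in the last item can be sketched as follows. This is a minimal illustration, not the authors' R code: the function name `top_position_frequencies` is hypothetical, the two index names merely stand for the roughly 50 criteria, and the scores are made-up toy values. The only facts relied on are that Silhouette is better when higher and Davies-Bouldin when lower.

```python
def top_position_frequencies(index_scores, higher_is_better, top=4):
    """Count how often each cluster number k is ranked 1st, 2nd, ..., `top`-th.

    index_scores:     {index_name: {k: score}}
    higher_is_better: {index_name: bool}, the direction in which each index improves
    Returns {k: [freq of top-1, ..., freq of top-`top`]}.
    """
    ks = sorted({k for scores in index_scores.values() for k in scores})
    freq = {k: [0] * top for k in ks}
    for name, scores in index_scores.items():
        # rank cluster numbers from best to worst for this index
        ranked = sorted(scores, key=lambda k: scores[k], reverse=higher_is_better[name])
        for position, k in enumerate(ranked[:top]):
            freq[k][position] += 1
    return freq

# Toy scores (hypothetical values) for cluster numbers 2..6 under two indices.
index_scores = {
    "silhouette":     {2: 0.60, 3: 0.50, 4: 0.40, 5: 0.10, 6: 0.55},
    "davies_bouldin": {2: 0.90, 3: 1.20, 4: 0.70, 5: 1.50, 6: 0.80},
}
higher_is_better = {"silhouette": True, "davies_bouldin": False}
freq = top_position_frequencies(index_scores, higher_is_better)
```

Plotting each `freq[k]` as a stacked bar gives the kind of display that is then inspected qualitatively to pick the candidate cluster numbers.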

Clustering results obtained with the cluster numbers selected as the best candidates are then presented to subject-matter experts, who may guide the work towards further analyses.

R contains several k-means implementations as well as libraries for computing clustering quality criteria. Our procedure uses the kml routine [21], which is designed specifically for longitudinal data and provides various efficient methods of k-means initialization. clusterCrit [20] and kml [21] are the packages used to gather the large basket of quality criteria considered by our method. These include measures of within-cluster homogeneity, e.g., Ball-Hall, Banfeld-Raftery, C-index, Marriot, Scott-Symons; of between-cluster separation, e.g., Rubin, Scott, Ratkowsky-Lance; and of their combination, e.g., Calinski-Harabasz, Davies-Bouldin, Dunn and its generalizations, Gamma, Hartigan, McClain, PBM, Point-Biserial, Ray-Turi, SD, Silhouette, Friedman, Xie-Beni, Tau; as well as measures of similarity between the empirical within-cluster distribution and distributional shapes such as the Gaussian distribution, e.g., BIC, AIC and their variants.
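To make the distance-based clustering step concrete, here is a plain-Python sketch of k-means on curves evaluated at common time-points, with k-means++ seeding and repeated restarts. It is not the kml implementation used by the authors: it assumes the Euclidean distance, the function names are hypothetical, and it keeps the restart with the smallest within-cluster sum of squares as one simple selection rule.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two curves sampled at the same points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(curves, k, rng):
    """k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance from the nearest chosen center."""
    centers = [list(rng.choice(curves))]
    while len(centers) < k:
        d2 = [min(sq_dist(c, ctr) for ctr in centers) for c in curves]
        if sum(d2) == 0:                      # every curve coincides with a center
            centers.append(list(rng.choice(curves)))
        else:
            centers.append(list(rng.choices(curves, weights=d2, k=1)[0]))
    return centers

def kmeans(curves, k, n_restarts=20, max_iter=100, seed=0):
    """Return (within-cluster sum of squares, labels, centers) of the best restart."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centers = kmeanspp_seeds(curves, k, rng)
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda j: sq_dist(c, centers[j]))
                          for c in curves]
            if new_labels == labels:          # assignments stable: converged
                break
            labels = new_labels
            for j in range(k):
                members = [c for c, lab in zip(curves, labels) if lab == j]
                if members:                   # keep the old center if a cluster empties
                    centers[j] = [sum(v) / len(members) for v in zip(*members)]
        wss = sum(sq_dist(c, centers[lab]) for c, lab in zip(curves, labels))
        if best is None or wss < best[0]:
            best = (wss, labels, centers)
    return best

# Toy curves (hypothetical): two clearly separated groups, three time-points each.
curves = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0],
          [5.0, 5.0, 5.0], [5.1, 5.0, 4.9], [4.9, 5.0, 5.1]]
wss, labels, centers = kmeans(curves, k=2)
```

Each row of `curves` would in practice hold the values of a smoothed word trajectory at the evaluation points, and the cluster centers are the mean patterns of the groups.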

Method validation

Corpus design and compilation

0 The American Statistical Association (ASA) represents the world's largest community of statisticians and the Journal of the ASA (JASA) has long been considered the world's premier review in its field. Established in 1888, JASA, which has two predecessors (Publications of the ASA, 1888–1912, Quarterly Publications of the ASA, 1912–1921), is one of the oldest and most prestigious statistical journals.

1 Download from journal archives of available information for all issues, which refer to 12,577 items published in the period 1888–2012 (125 years, from Volume No. 1, Issue No. 1, to Volume No. 107, Issue No. 500, since at the very beginning the volumes of the ASA's journals were biennial). Titles of articles are the text considered in this study.

2 After discarding items that are not articles (e.g., List of publications, News) or do not include content words (e.g., Comment, Rejoinder), the corpus includes 10,077 titles and is composed of 7746 word-types and 87,060 word-tokens.

Preparation of textual data

3 After stemming, 4834 different stems are obtained (e.g., the word-types model, models, modeling, and modelling are replaced with the same stem model).

4 All potentially relevant stem-segments are identified (e.g., model select, hierarch model, loglinear model) and included in the word list.

5 Relevant statistical keywords (e.g. stemmed words: statist, model, test, distribut, analysi, regress, probabl; and sequences of stemmed words: time seri, regress model, conting tabl, confid interv, maximum likelihood estim, analysi of varianc, normal distribut) are tagged by matching the stemmed vocabulary of the corpus with a stemmed list (over 12,700 unique entries) including all non-redundant entries of six Statistics glossaries:

1 ISI - International Statistical Institute;
2 OECD - Organisation for Economic Cooperation and Development;
3 Statistics.com - Institute for Statistics Education;
4 StatSoft Inc.;
5 University of California, Berkeley;
6 University of Glasgow.

6 After fixing the threshold at 10, 900 keywords are finally selected.

Fig. 1. Excerpt of the 900 (words) × 107 (volumes) table from the corpus of titles of papers published by the ASA's journals

In the end, the corpus yields a 900 (words) × 107 (time-points/volumes) contingency table (Fig. 1).

Normalization

For illustration, we choose to transform the data (Fig. 2) by the double normalization d1 (Table 1), which is equivalent to calculating a χ² distance between the original word profiles if the Euclidean distance is used as the measure of dissimilarity.

Let $n_{ij}$ be the raw frequency of word $i$ at time-point/volume $j$, $n_{i\cdot}$ the $i$-th row sum, $n_{\cdot j}$ the $j$-th column sum and $n$ the matrix total of the corpus table. Then, the d1 normalized frequency is computed as:

$$y_{ij} = \frac{n_{ij}}{n_{i\cdot}\,\sqrt{n_{\cdot j}/n}}$$

($n_{\cdot j}/n$ is the $j$-th column mass in correspondence analysis).
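The d1 formula and its stated link with the χ² distance can be checked numerically. The sketch below is an illustration under the assumption that no row or column of the table sums to zero; the function names and the toy table are hypothetical.

```python
from math import sqrt

def d1_normalize(table):
    """y_ij = n_ij / (n_i. * sqrt(n_.j / n)); assumes no zero row or column sums."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]      # n_.j / n
    return [[v / (r * sqrt(m)) for v, m in zip(row, col_mass)]
            for row, r in zip(table, row_sums)]

def chi2_distance(table, i, k):
    """Chi-square distance between the row profiles of words i and k."""
    n = sum(sum(row) for row in table)
    col_mass = [sum(col) / n for col in zip(*table)]
    prof_i = [v / sum(table[i]) for v in table[i]]        # n_ij / n_i.
    prof_k = [v / sum(table[k]) for v in table[k]]
    return sqrt(sum((a - b) ** 2 / m for a, b, m in zip(prof_i, prof_k, col_mass)))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy table (hypothetical): 3 words observed at 3 time-points.
raw = [[4, 2, 2],
       [1, 3, 0],
       [0, 1, 7]]
y = d1_normalize(raw)
gap = abs(euclidean(y[0], y[1]) - chi2_distance(raw, 0, 1))
```

Here `gap` is zero up to floating-point rounding, which is the sense in which the Euclidean distance on d1 normalized trajectories reproduces the χ² distance between the original word profiles.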

Note that this double normalization produces a somewhat reversed asymmetry (low-frequency words tend to dominate in amplitude over high-frequency words; see the inversion of color intensity in Fig. 3). This is mainly due to the greater sparsity of low-frequency keywords across time.

Filtering

Optimal smoothing for d1 normalized data is achieved with spline order $m = 3$ and smoothing parameter $\lambda = 10^{1.75}$ (df = 7.4) under a roughness penalty of order $r = 1$ (Fig. 4).

A sample of curves fitted by the optimal smoothing is shown in Fig. 5, from the word with the highest root mean square (RMS) residual (rural) to the word with the lowest RMS residual (model).

Fig. 2. Word trajectories (original data): y-axis represents the word raw frequency for each volume; x-axis represents the volume publication year; line color identifies the word frequency class (Very Low, Low, High and Very High denote equal-frequency intervals of word total frequency in the entire corpus). An example of word trajectory has been superimposed for each

Curve clustering

Curves are partitioned by means of the k-means algorithm combined with the Euclidean distance, with cluster number k ranging from 2 to 26 and 20 re-runs for each k.

A set of 49 quality criteria is then computed in order to identify the best candidates for the cluster number.

Fig. 4. Smoothing selection: overview of $\log_{10}\lambda$, effective degrees of freedom (df), sum of squared errors (SSE) and GCV by varying order $m$ and roughness penalty order $r$ (PENr). Optimal smoothing is obtained by minimizing GCV. d1 normalization.

Visual representation of the rating for the cluster number shows (Fig. 6) that:

(i) partitions into two/three clusters are the best rated,
(ii) partitions with a cluster number close to the maximum of the considered range (24–26) have also been frequently selected in the highest positions,
(iii) in the range of more interesting cluster numbers (neither too low nor too high), the most selected in the top four positions is 6, second is 4, third is 19 (the eye should be guided both by the bar height, corresponding to the cumulated frequency of being in the top four, and by the color composition, informing on the position level).

Note that the final set of best candidates for the cluster number is the output of R code that essentially mimics a qualitative rating purely based on graphical inspection.

Fig. 5. Optimal smoothing fit: a selection of fitted curves ordered according to decreasing root mean square (RMS) residual. Fit of a smoothing spline of order m = 3, with PEN1, to d1 normalized data.

Fig. 6. Cluster number selection: frequency of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) for

Fig. 7. Clustering: best partition into 6 groups. d1 normalization.

Solution (i) (the two-group partition) reflects the substantial bifurcation of the historical period around the sixties, when Statistics was born as an autonomous discipline (see [1] and [22] for explanation). Solution (ii) (25/26-group partition), on the one hand, may reflect the lack of a defined structure and parsimonious grouping but, on the other, may be a failure due to the standard assumption underlying many quality criteria of normally distributed data, hence of compact and convex clusters. That said, we choose to investigate the most interesting solution (iii), that is, the set of cluster numbers neither too small nor too large, and to subject them to the scrutiny of experts.

Here, we illustrate the best partition found with the cluster number ranked first, that is k = 6. The graphical output shows the groups all together with the cluster mean patterns (Fig. 7), and individually (Fig. 8). Note that, in order to make the reading easier, stems have been replaced with the singular noun or, in case this is not present in the corpus, with the typical word related to the stem. Moreover, in order to make the identification of possible subsequent phases in the knowledge field evolution easier, individual clusters have been chronologically ordered. The dynamics found are then examined and, if considered interesting, eventually interpreted by subject matter experts. A possible reading of the history of Statistics on the basis of the illustrated findings is offered in [1].

Acknowledgements

This study was supported by the University of Padova, fund CPDA145940 "Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature" (P.I. Arjuna Tuzzi, 2014).

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at doi: https://doi.org/10.1016/j.mex.2018.11.010.

References

[1] M. Trevisani, A. Tuzzi, Learning the evolution of disciplines from scientific literature: a functional clustering approach to normalized keyword count trajectories, Knowl. Based Syst. 146 (2018) 129–141, doi: http://dx.doi.org/10.1016/j.knosys.2018.01.035.

[2] D.M. Blei, A.Y. Ng, M. Jordan, Latent Dirichlet Allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.

[3] T.L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. 101 (Suppl. 1) (2004) 5228–5235.

[4] D.M. Blei, J.D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, (2006), pp. 113–120.

[5] L. Bolelli, Ş. Ertekin, C.L. Giles, Topic and trend detection in text collections using latent dirichlet allocation, Advances in Information Retrieval, ECIR 2009, Lect. Notes Comput. Sci. 5478 (2009) 776–780.

[6] D. Chavalarias, J.-P. Cointet, Phylomemetic patterns in science evolution – the rise and fall of scientific fields, PLoS One 8 (2) (2013) e54847.

[7] K.W. Boyack, R. Klavans, Including cited non-source items in a large-scale map of science: What difference does it make? J. Informetr. 8 (3) (2014) 569–580.

[8] X. Sun, K. Ding, Y. Lin, Mapping the evolution of scientific fields based on cross-field authors, J. Informetr. 10 (3) (2016) 750–761.

[9] F. Osborne, G. Scavo, E. Motta, Identifying diachronic topic-based research communities by clustering shared research trajectories, European Semantic Web Conference (2014) 114–129.

[10] W. Ding, C. Chen, Dynamic topic detection and tracking: a comparison of HDP, C-word, and cocitation methods, J. Assoc. Inf. Sci. Technol. 65 (10) (2014) 2084–2097, doi: http://dx.doi.org/10.1002/asi.23134.

[11] Y. Zhang, H. Chen, J. Lu, G. Zhang, Detecting and predicting the topic change of Knowledge-based Systems: a topic-based bibliometric analysis from 1991 to 2016, Knowl. Based Syst. 133 (Suppl. C) (2017) 255–268, doi: http://dx.doi.org/10.1016/j.knosys.2017.07.011.

[12] J. Ramsay, B.W. Silverman, Functional Data Analysis (Springer Series in Statistics), Springer, 2005, doi: http://dx.doi.org/10.1007/b98888.

[13] J. Jacques, C. Preda, Functional data clustering: a survey, Adv. Data Anal. Classif. 8 (3) (2014) 231–255, doi: http://dx.doi.org/10.1007/s11634-013-0158-y.

[14] J.L. Wang, J.M. Chiou, H.G. Mueller, Functional data analysis, Annu. Rev. Stat. Appl. 3 (1) (2016) 257–295, doi: http://dx.doi.org/10.1146/annurev-statistics-041715-033624.

[16] A. Morrone, Temi generali e temi specifici dei programmi di governo attraverso le sequenze di discorso, in: M. Villone, A. Zuliani (Eds.), L'attività dei governi della Repubblica italiana (1948–1994), Il Mulino, Bologna, 1996, pp. 351–369.

[17] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017.

[18] S. Bolasco, Taltac2.10. Sviluppi, esperienze ed elementi essenziali di analisi automatica dei testi, LED, Milano, 2010.

[19] P. Montero, J. Vilar, Tsclust: An R package for time series clustering, J. Stat. Softw. 62 (1) (2014) 1–43, doi: http://dx.doi.org/10.18637/jss.v062.i01.

[20] B. Desgraupes, clusterCrit: Clustering Indices, R package version 1.2.7, (2016).

[21] C. Genolini, X. Alacoque, M. Sentenac, C. Arnaud, kml and kml3d: R Packages to Cluster Longitudinal Data, J. Stat. Softw. 65 (4) (2015) 1–34, doi: http://dx.doi.org/10.18637/jss.v065.i04.
