Method Article
Chronological corpora curve clustering: From scientific corpora construction to knowledge dynamics discovery through word life-cycles clustering
Matilde Trevisani*, Arjuna Tuzzi
Department of Economics, Business, Mathematics and Statistics (DEAMS) of University of Trieste, Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA) of University of Padova, Italy
ABSTRACT
The aim of this procedural method is to construct well-founded corpora of scientific literature and, hence, to track the evolution of knowledge fields through the reconstruction and clustering of words’ life-cycles. The method contains:
- an original selection process of relevant keywords involving the identification of relevant stems and stem n-grams through matching with item lists of relevant glossaries;
- several types of normalization of temporal trajectories of word raw frequencies;
- a properly customized clustering of word life-cycles, with an extensive graphical investigation of the best candidates for cluster number, to unveil the important dynamics and decipher the history of a scientific field.
© 2018 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
ARTICLE INFO
Method name: Chronological corpora curve clustering
Keywords: Diachronic corpora, Functional data analysis, Normalization, Cluster number selection
Article history: Received 23 March 2018; Accepted 10 November 2018; Available online 19 November 2018
Specifications Table
Subject area                             Computer Science
More specific subject area               Computational linguistics
Method name                              Chronological corpora curve clustering
Name and reference of original method    M. Trevisani, A. Tuzzi [1], Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories, Knowledge-Based Systems 146 (2018) 129-141.

* Corresponding author.
E-mail address: matilde.trevisani@deams.units.it (M. Trevisani).
MethodsX, https://doi.org/10.1016/j.mex.2018.11.010
Method details
Given a knowledge field of interest, the procedural method consists of two main phases:
I An information retrieval process that, starting from a large corpus of texts retrieved from scientific articles published over a lengthy period by a selection of premier journals of the field, leads to an effective representation of the corpus by a lexical contingency table reporting the frequencies over time of all relevant keywords.
II A statistical learning process that, through four stages:
- normalization of time trajectories of word (raw) frequencies, chosen according to the different aspects of word life-cycles to be highlighted;
- filtering time trajectories of word (normalized) frequencies, interpreted as functional data (FD) and thus represented as smooth functions;
- curve clustering (CC) to discover important macro-dynamics latent to word micro-histories;
- interpretation by expert opinion to decipher detected dynamics,
leads to a reading (or readings) of the history of the knowledge field.
We adopt a basis function approach to filtering with a B-spline basis system. Moreover, we take a distance-based approach to CC and use a k-means algorithm for FD combined with an appropriate metric for measuring distance between curves.
Related work
The method aims at composing a history of a field of knowledge by a distant reading of scientific literature available through an articles database. This objective situates our method within the various approaches for science mapping, which have drawn much attention in recent years. However, the main methodologies developed in bibliometrics, scientometrics, informetrics and related fields, though partly sharing similar purposes, are substantively different from our proposal and cannot answer our particular question effectively.
Topic modelling aims at detecting topics, i.e. thematic groups, in collections of documents. Moreover, when documents exhibit a temporal ordering, it enables the discovery of topic trends. Latent Dirichlet Allocation (LDA), the most widespread topic model, is a probabilistic generative process that models each document as a mixture of topics where each topic corresponds to a multinomial distribution over words [2]. Topics over time can be detected by modelling time jointly with word co-occurrence patterns for topic discovery [3,4]. A further extension of LDA incorporates both the temporal ordering and the authorship information of documents to improve the topic discovery process [5]. Topic modelling connects to scientometrics or, more in general, to quantitative methods for mapping knowledge domains from scientific article databases. These methods are based on term and/or citation co-occurrences in documents, possibly observed over time in order to reconstruct a field’s evolution [6,7]. Recent developments of co-citation network-based analyses build a dynamic scientific map via overlapping authors across fields [8] or via communities of authors working on semantically related topics at the same time [9].
[...] possibly observed over time, while our work considers word co-occurrence solely in time, as our primary focus is the temporal evolution of words. Then, more importantly, topic-centered methods focus first on the structure of science and on detecting topics and then on tracking their evolution, whereas our approach focuses first on tracing life cycles of words and then on detecting important dynamics of temporally homogeneous groups of words in order to decipher the history of a knowledge field. As a consequence, in topic-centered methods, words that represent the same topic (as they appear together in documents) may have an irreconcilable temporal evolution, whereas, in our approach, different themes, research fields and schools of thought can in principle be represented within the same group of words. Moreover, in topic-centered methods, topic evolution can only be a roadmap, i.e., an abstract description (the average evolution of words grouped by co-occurrence) of basic movements over time. Additionally, the abstract definition of topics is subject to continuous destruction and reconstruction over time, making topic tracking a fragile and questionable artefact. Conversely, in our approach, the detected dynamics really represent temporal patterns of words, e.g., essentially increasing, decreasing or constant trends, trends with an isolated peak for briefly faddish words, or roughly bell-shaped trends for words which had a golden age and then disappeared.
Finally, our choice of specific statistical tools is underpinned by the literature as follows. The basis function approach is the most widely used for representing FD, and B-splines are a very flexible basis system for non-periodic FD [12]. Moreover, B-splines enable us to recognise continuous and regular curves, and hence more easily interpretable shapes. Upstream, we decided on a distance-based approach to CC, as one of our objectives was to set up an exploratory and mostly automated procedure. In fact, the procedure is called upon to look for interesting patterns to be submitted to experts who can potentially formulate new hypotheses and research questions. This eminently exploratory task requires the procedure to be fast and relatively easy to use and understand even by non-statisticians in interdisciplinary groups involved in research projects. Having opted for distance-based methods, we note that k-means type clustering algorithms have been widely applied to FD, especially when combined with the finite basis expansion approach. Other strategies which extend the classical k-means algorithm to FD are essentially based on functional principal components. However, they are recent extensions, rarely used and, thus, less justifiable as the basis for our explorative approach (some interesting overviews of strategies for clustering FD are provided by [13] and [14]).
Procedure
I – Compiling and pre-processing the corpus
Corpus design and compilation
0 Selection of data sources, i.e. choice of outstanding journals able to cover the main topics and represent the temporal evolution of the knowledge field.
1 Text harvesting, i.e. downloading of available information on articles (authors, title/abstract/full text, number, issue, volume) from journal archives, to constitute the corpus. Texts under consideration may consist of titles or abstracts or full texts of the articles. The corpus is typically organized into subcorpora, i.e. collections of texts sharing the same time reference, thus generating a sequence of text sets associated with chronological points on the time axis.
2 Tokenization of the corpus, i.e. identification of all words (sequences of letters isolated by means of separators). The corpus contains a finite set of different words (i.e. word-types) that represents its vocabulary (or word list). A word-token is a particular occurrence of a word-type and the number of occurrences is the word-type frequency.
Preparation of textual data
3 Stemming, i.e. transformation of words into stems by means of Porter’s stemming algorithm [15].
5 Tagging keywords, i.e. identification of all relevant statistical keywords (stems and stem-segments) by matching the (stemmed) vocabulary of the corpus with the (stemmed) list of items retrieved from relevant glossaries of the knowledge field. The tagging procedure assigns a label to all vocabulary items that are included in glossaries.
6 Thresholding, i.e. selection of all keywords with frequencies at least equal to a suitably fixed threshold.
Finally, the corpus is represented by a keywords × documents/time-points contingency table containing the frequencies of the selected keywords (by row) along the time-points (by column) of the considered period.
Stemming can be carried out by the Porter Stemmer available online (http://textanalysisonline.com/nltk-porter-stemmer) or, alternatively, within the R software environment [17] by the wordStem routine of the SnowballC library. We use Taltac software [18] for tagging, though it can be equivalently performed by any software enabling the comparison between two lists (e.g., Excel).
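As a minimal sketch of the tagging and thresholding steps and of the resulting table, the Python fragment below counts tagged keywords per time-point and applies the frequency threshold. It assumes that the texts are already tokenized and stemmed and that keywords are single stems (matching stem-segments would need an extra pass over token sequences). The function name `contingency_table` and the toy data are hypothetical and are not part of the Taltac/R workflow described above.

```python
from collections import Counter

def contingency_table(subcorpora, keywords, threshold):
    """Count keyword occurrences per time-point and keep frequent keywords.

    subcorpora: dict mapping a time-point label to a list of stemmed tokens
    keywords:   set of tagged (stemmed) keywords taken from the glossaries
    threshold:  minimum total frequency for a keyword to be retained
    """
    time_points = sorted(subcorpora)
    # One counter per time-point, restricted to tagged keywords
    counts = {t: Counter(tok for tok in subcorpora[t] if tok in keywords)
              for t in time_points}
    # Total frequency of each keyword over the whole corpus
    totals = Counter()
    for c in counts.values():
        totals.update(c)
    # Thresholding: keep keywords whose total frequency reaches the threshold
    selected = sorted(k for k in keywords if totals[k] >= threshold)
    # Rows are keywords, columns are time-points
    table = {k: [counts[t][k] for t in time_points] for k in selected}
    return time_points, table

# Toy data (hypothetical): three yearly subcorpora of stemmed tokens
subcorpora = {
    1990: ["model", "test", "model", "data"],
    1991: ["model", "regress", "test"],
    1992: ["regress", "model", "sampl"],
}
keywords = {"model", "test", "regress", "sampl"}
time_points, table = contingency_table(subcorpora, keywords, threshold=2)
```

In this toy example the stem "sampl" occurs only once and is dropped by the threshold, while "data" is never counted because it is not a tagged keyword.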
II – Statistical learning
Normalization
A chronological corpus is typically characterized by the following features.
(i) Size of subcorpora (number of texts and their size in word-tokens) may vary greatly over time.
(ii) The large number of rare events (LNRE) property of textual data, i.e. a large number of word-types having a quite low probability of occurring. This property implies:
- total frequency (or popularity) of individual words in the entire corpus is greatly variable;
- frequency spectrum by time-point is highly asymmetric;
- sparsity, i.e. many cells of the contingency table have small counts or are empty.
In the section Method validation, features (ii) are evident from the plot of the original word trajectories (Fig. 2). Classification of words according to their popularity highlights the great disparity of curve amplitude between high-frequency and low-frequency words (VH, H, L and VL classes are identified by colour intensity in Fig. 2) and the 0-level curve sections characterizing rare words.
From the foregoing, normalization of raw frequencies is necessary to properly reconstruct and compare the temporal evolution of words.
Several types of normalization are shown in Table 1 (which is an excerpt of Table A.2 in [1]). A normalization by column (c1, c2, c3 or c4) is necessary to adjust for the uneven document dimension across time (i). A normalization by row (r1, r2 or r3) allows word trajectories to be compared by timing (synchrony) regardless of height (popularity) (ii). A double (both by row and column) normalization (d) serves to fix both (i) and (ii).
In the section Method validation, the calculation of a specific double normalization (d1) is shown.
Filtering
In our method, the time trajectory of word frequencies is viewed as a proxy of word diffusion and vitality, i.e. of word life-cycle. Then, we adopt a functional data analysis (FDA) approach under which the time trajectory of word frequencies constitutes a functional datum assumed to be a realization of an underlying continuous function representing the word temporal evolution.
Table 1
Excerpt of the normalization plan from Table A.2 in [1].

Normalization    by col:  Subcorpus             Matrix
by row                    # titles   # tokens   col sum (p)   col max freq
row sum                   d          d          d1            d              r1
z-score by row            d          d          d             d              r2
max row freq              d          d          d             d              r3
Let $y_i = \{y_{ij}\}$ be the functional observation of word $i$, consisting of the set of (normalized) frequencies at time-points $j = 1, \dots, T$, for each $i = 1, \dots, N$, and $x_i(t)$ the underlying continuous function representing the word temporal development. The following choices are taken for filtering $x_i(t)$ from $y_i$.
We adopt the basis function approach for representing FD as smooth functions, where $x_i(t)$ is expressed as a finite linear combination of basis functions [12]. We consider B-spline bases, which are piecewise polynomials joined smoothly at the interior nodes. Lastly, we place knots (the values of $t$ at which adjacent segments are joined) at each time-point of observation.
As regards the estimation, we adopt the roughness penalty approach for smoothing FD, where the estimate of $x_i$ is the one optimizing the bias-variance trade-off by tuning the smoothing parameter $\lambda$. We consider the generalized cross validation (GCV) criterion for selecting the optimal smoothing by varying spline order m (m from 1 to 8) as well as roughness penalty order r (besides the standard r = m-2, r = 2, for m > 3, r = 1, for m > 2, finally, r = 0) [1]. In the section Method validation, the optimal smoothing selection is illustrated for the case of d1 normalized data (Fig. 4).
The calculation is carried out within the fda library in R and an ad-hoc developed routine.
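For reference, the roughness penalty criterion and the GCV score can be written in the standard form found in the FDA literature [12]. This is a sketch in the notation of this section, and the exact variant implemented in [1] (for example, how the criterion is pooled over the N curves) may differ. Here $D^{r}$ denotes the derivative of order r, SSE the sum of squared errors of the fit, and $\mathrm{df}(\lambda)$ the effective degrees of freedom, i.e. the trace of the smoother matrix.

```latex
\[
\mathrm{PENSSE}_{\lambda}(x_i)
  = \sum_{j=1}^{T} \bigl[ y_{ij} - x_i(t_j) \bigr]^{2}
  + \lambda \int \bigl[ D^{r} x_i(t) \bigr]^{2} \, dt ,
\]
\[
\mathrm{GCV}(\lambda)
  = \left( \frac{T}{T - \mathrm{df}(\lambda)} \right)
    \left( \frac{\mathrm{SSE}}{T - \mathrm{df}(\lambda)} \right).
\]
```

A larger $\lambda$ gives more weight to the penalty and hence a smoother curve with fewer effective degrees of freedom; the selected smoothing is the combination of m, r and $\lambda$ with the smallest GCV.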
Curve clustering
We adopt a distance-based method to CC where the distance between curves is approximated by using the discretely observed evaluation points of the estimated curves $x_i(t)$ [13].
The following choices are taken for clustering:
- k-means algorithm;
- several options for distance: besides the conventional distances (Euclidean or Manhattan, among others), other options can be taken from the broad range of dissimilarity measures set out to perform clustering of time series [19];
- for each cluster number (k from 2 to a suitable range maximum), 20 re-runs from different initial configurations set through the k-means++ seeding method;
- the best candidates for cluster number are identified by pooling the ratings from a large number of clustering quality criteria (about 50, see [20] and [21]). In more detail, in order:
  - a ranking of cluster numbers is computed for each quality index,
  - all the rankings are pooled and, for each cluster number, the frequencies of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) are calculated,
  - an ordered set of best candidates for cluster number is retrieved from a qualitative inspection of the graphical representation of the frequencies of being in the first four top positions for each cluster number (see Fig. 6 in section Method validation for illustration).
Clustering results obtained with the cluster numbers selected as the best candidates are then presented to experts (of the subject matter), who may suggest further analyses.
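The pooling of rankings described in the list above can be sketched as follows. This is an illustrative Python fragment, not the authors' R code: the function name `top_position_frequencies`, the orientation flags and the toy scores are hypothetical, and the final qualitative choice among candidates is left to graphical inspection as in the text.

```python
import numpy as np

def top_position_frequencies(scores, higher_is_better, ks, top=4):
    """Pool rankings of cluster numbers over several quality criteria.

    scores:           array of shape (n_criteria, n_k), one row per criterion
    higher_is_better: one boolean per criterion giving its orientation
    ks:               list of candidate cluster numbers (columns of scores)
    Returns a dict mapping each k to [top-1 count, ..., top-`top` count].
    """
    scores = np.asarray(scores, dtype=float)
    freq = {k: [0] * top for k in ks}
    for row, hib in zip(scores, higher_is_better):
        oriented = -row if hib else row           # smaller value = better rank
        order = np.argsort(oriented, kind="stable")
        for pos in range(min(top, len(ks))):
            freq[ks[order[pos]]][pos] += 1        # k ranked at position pos+1
    return freq

# Toy example (hypothetical values): three criteria, five cluster numbers
ks = [2, 3, 4, 5, 6]
scores = [
    [0.9, 0.5, 0.7, 0.2, 0.8],   # criterion A, to be maximized
    [1.0, 3.0, 0.5, 4.0, 0.8],   # criterion B, to be minimized
    [10, 30, 20, 5, 40],         # criterion C, to be maximized
]
higher_is_better = [True, False, True]
freq = top_position_frequencies(scores, higher_is_better, ks)
```

Each list in `freq` corresponds to one stacked bar of a plot such as Fig. 6: the sum of its entries is the cumulated frequency of being in the top four, and the individual entries give the composition by position.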
R contains several k-means implementations as well as libraries for computing clustering quality criteria. Our procedure uses the kml routine [21], which is designed specifically for longitudinal data and which provides various efficient methods of k-means initialization. clusterCrit [20] and kml [21] are the packages used to gather the large basket of quality criteria considered by our method. These include measures of within-cluster homogeneity, e.g., Ball-Hall, Banfield-Raftery, C-index, Marriot, Scott-Symons; of between-cluster separation, e.g., Rubin, Scott, Ratkowsky-Lance; and of their combination, e.g., Calinski-Harabasz, Davies-Bouldin, Dunn and its generalizations, Gamma, Hartigan, McClain, PBM, Point-Biserial, Ray-Turi, SD, Silhouette, Friedman, Xie-Beni, Tau; as well as measures of similarity between the empirical within-cluster distribution and distributional shapes such as the Gaussian distribution, e.g., BIC, AIC and their variants.
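To make the clustering step concrete, the following Python sketch runs k-means with k-means++ seeding on curves evaluated at a common grid of time-points, using the squared Euclidean distance. It is a plain NumPy illustration and not the kml routine. Keeping the re-run with the lowest within-cluster sum of squares is an assumption made here for simplicity, and all function names and the toy curves are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: new centers are drawn with probability
    proportional to the squared distance from the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        total = d2.sum()
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

def kmeans(X, k, rng, n_iter=100):
    """Lloyd iterations from a k-means++ start; returns labels, centers, WSS."""
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Empty clusters keep their previous center
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = d2[np.arange(len(X)), labels].sum()
    return labels, centers, wss

def best_of_reruns(X, k, n_runs=20, seed=0):
    """Assumption: among n_runs re-runs, keep the partition with lowest WSS."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_runs)), key=lambda r: r[2])

# Toy curves (hypothetical): increasing, decreasing and flat trends on 10 points
t = np.linspace(0.0, 1.0, 10)
noise = np.random.default_rng(1)
curves = np.vstack([
    t + 0.01 * noise.standard_normal((5, 10)),
    1.0 - t + 0.01 * noise.standard_normal((5, 10)),
    3.0 + 0.01 * noise.standard_normal((5, 10)),
])
labels, centers, wss = best_of_reruns(curves, k=3)
```

Replacing the squared Euclidean distance with another dissimilarity would require changing both the assignment step and the way cluster centers are updated.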
Method validation
Corpus design and compilation
0 The American Statistical Association (ASA) represents the world’s largest community of statisticians and the Journal of the ASA (JASA) has long been considered the world’s premier review in its field. Established in 1888, JASA, which has two predecessors (Publications of the ASA, 1888–1912, Quarterly Publications of the ASA, 1912–1921), is one of the oldest and most prestigious statistical journals.
1 Download from journal archives of available information for all issues that refer to 12,577 items published in the period 1888–2012 (125 years, from Volume No. 1, Issue No. 1, to Volume No. 107, Issue No. 500, since at the very beginning the volumes of the ASA’s journals were biennial). Titles of articles are the text considered in this study.
2 After discarding items that are not articles (e.g., List of publications, News) or do not include content words (e.g., Comment, Rejoinder), the corpus includes 10,077 titles and is composed of 7746 word-types and 87,060 word-tokens.
Preparation of textual data
3 After stemming, 4834 different stems are obtained (e.g., the word-types model, models, modeling, and modelling are replaced with the same stem model).
4 All potentially relevant stem-segments are identified (e.g., model select, hierarch model, loglinear model) and included in the word list.
5 Relevant statistical keywords (e.g. stemmed words: statist, model, test, distribut, analysi, regress, probabl; and sequences of stemmed words: time seri, regress model, conting tabl, confid interv, maximum likelihood estim, analysi of varianc, normal distribut) are tagged by matching the stemmed vocabulary of the corpus with a stemmed list (over 12,700 unique entries) including all non-redundant entries of six Statistics glossaries:
  1 ISI - International Statistical Institute;
  2 OECD - Organisation for Economic Cooperation and Development;
  3 Statistics.com - Institute for Statistics Education;
  4 StatSoft Inc.;
  5 University of California, Berkeley;
  6 University of Glasgow.
6 With the threshold fixed at 10 occurrences, 900 keywords are finally selected.
Fig. 1. Excerpt of the 900 (words) × 107 (volumes) table from the corpus of titles of papers published by the ASA’s journals.
At the end, the corpus gives rise to a 900 (words) × 107 (time-points/volumes) contingency table (Fig. 1).
Normalization
For illustration, we choose to transform data (Fig. 2) by the double normalization d1 (Table 1), which is equivalent to calculating a χ² distance between original word profiles if the Euclidean distance is used as the measure of dissimilarity.
Let $n_{ij}$ be the raw frequency of word $i$ at time-point/volume $j$, $n_{i\cdot}$ the $i$-th row sum, $n_{\cdot j}$ the $j$-th column sum and $n$ the matrix total of the corpus table. Then, the d1 normalized frequency is computed as:
\[
y_{ij} = \frac{n_{ij}}{n_{i\cdot}\,\sqrt{n_{\cdot j}/n}}
\]
($n_{\cdot j}/n$ is the $j$-th column mass in correspondence analysis).
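The d1 formula and its link with the χ² distance can be checked numerically with a short Python sketch. The count matrix is a hypothetical toy example and the function name `d1_normalize` is an illustrative choice; the code assumes that no row or column of the table sums to zero, which thresholding guarantees for rows.

```python
import numpy as np

def d1_normalize(counts):
    """Double normalization d1: row profile divided by the square root
    of the column mass, y_ij = n_ij / (n_i. * sqrt(n_.j / n))."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                                     # matrix total
    row_sums = counts.sum(axis=1, keepdims=True)         # n_i.
    col_mass = counts.sum(axis=0, keepdims=True) / n     # n_.j / n
    return counts / (row_sums * np.sqrt(col_mass))

# Toy keywords x time-points table (hypothetical counts)
counts = np.array([[2, 1, 1],
                   [1, 1, 0],
                   [0, 1, 1]])
y = d1_normalize(counts)

# Chi-square distance between the row profiles of the first two words
profiles = counts / counts.sum(axis=1, keepdims=True)
col_mass = counts.sum(axis=0) / counts.sum()
chi2_dist = np.sqrt((((profiles[0] - profiles[1]) ** 2) / col_mass).sum())

# Euclidean distance between the same two words after d1 normalization
euclid_dist = np.linalg.norm(y[0] - y[1])
```

The two distances coincide because each squared difference of d1 values equals the squared difference of row profiles divided by the column mass.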
Note that this double normalization produces a somewhat reversed asymmetry (low-frequency words tend to dominate high-frequency words in amplitude; see the inversion of color intensity in Fig. 3). This is mainly due to the greater sparsity of low-frequency keywords across time.
Filtering
Optimal smoothing for d1 normalized data is achieved with spline order m = 3 and smoothing parameter $\lambda = 10^{1.75}$ (df = 7.4) under a roughness penalty of order r = 1 (Fig. 4). A sample of curves fitted by the optimal smoothing is shown in Fig. 5, from the word with the highest root mean square (RMS) residual (rural) to the word with the lowest RMS residual (model).
Fig. 2. Word trajectories (original data): y-axis represents the word raw frequency for each volume; x-axis represents the volume publication year; line color identifies the word frequency class (Very Low, Low, High and Very High denote equal-frequency intervals of word total frequency in the entire corpus). An example of word trajectory has been superimposed for each
Curve clustering
Curves are partitioned by means of the k-means algorithm combined with the Euclidean distance, with cluster number k ranging from 2 to 26 and 20 reruns for each k.
A set of 49 quality criteria is then computed in order to identify the best candidates for cluster number.
Fig. 4. Smoothing selection: overview of log10 λ, effective degrees of freedom (df), sum of squared errors (SSE) and GCV by varying order m and roughness penalty order r (PENr). Optimal smoothing is obtained by minimizing GCV. d1 normalization.
Visual representation of the rating for the cluster number shows (Fig. 6) that:
(i) partitions into two/three clusters are the best rated,
(ii) partitions with a cluster number close to the maximum of the considered range (24–26) have also been frequently selected in the highest positions,
(iii) in the range of more interesting cluster numbers (neither too low nor too high), the most selected in the top four positions is 6, second is 4, third is 19 (the eye should be guided both by the bar height, corresponding to the cumulated frequency of being in the top four, and by the color composition, informing on the position level).
Note that the final set of best candidates for cluster number is the output of an R code that essentially mimics a qualitative rating purely based on a graphical inspection.
Fig. 5. Optimal smoothing fit: a selection of fitted curves ordered according to decreasing root mean square (RMS) residual. Fit of a smoothing spline of order m = 3, with PEN1, to d1 normalized data.
Fig. 6. Cluster number selection: frequency of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) for
Fig. 7. Clustering: best partition into 6 groups. d1 normalization.
Solution (i) (the two-group partition) reflects the substantial bifurcation of the historical period around the sixties, when Statistics was born as an autonomous discipline (see [1] and [22] for explanation). Solution (ii) (25/26-group partition), on the one hand, may reflect the lack of a defined structure and parsimonious grouping but, on the other, may be a failure due to the standard assumption underlying many quality criteria of normally distributed data, hence of compact and convex clusters. That said, we choose to investigate the most interesting solution (iii), that is the set of cluster numbers neither too small nor too large, and to subject them to the scrutiny of experts.
Here, we illustrate the best partition found with the cluster number ranked first, that is k = 6. The graphical output shows the groups all together with the cluster mean patterns (Fig. 7), and individually (Fig. 8). Note that, in order to make the reading easier, stems have been replaced with the singular noun or, in case this is not present in the corpus, with the typical word related to the stem. Moreover, in order to make the identification of possible subsequent phases in the knowledge field evolution easier, individual clusters have been chronologically ordered. The dynamics found are then examined and, if considered interesting, eventually interpreted by subject matter experts. A possible reading of the history of Statistics on the basis of the illustrated findings is offered in [1].
Acknowledgements
This study was supported by the University of Padova, fund CPDA145940 “Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature” (P.I. Arjuna Tuzzi, 2014).
Appendix A. Supplementary data
Supplementary material related to this article can be found, in the online version, at doi: https://doi.org/10.1016/j.mex.2018.11.010.
References
[1] M. Trevisani, A. Tuzzi, Learning the evolution of disciplines from scientific literature: a functional clustering approach to normalized keyword count trajectories, Knowl. Based Syst. 146 (2018) 129–141, doi: http://dx.doi.org/10.1016/j.knosys.2018.01.035.
[2] D.M. Blei, A.Y. Ng, M. Jordan, Latent Dirichlet Allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[3] T.L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. 101 (Suppl. 1) (2004) 5228–5235.
[4] D.M. Blei, J.D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, (2006), pp. 113–120.
[5] L. Bolelli, Ş. Ertekin, C.L. Giles, Topic and trend detection in text collections using latent Dirichlet allocation, Advances in Information Retrieval, ECIR 2009, Lect. Notes Comput. Sci. 5478 (2009) 776–780.
[6] D. Chavalarias, J.-P. Cointet, Phylomemetic patterns in science evolution – the rise and fall of scientific fields, PLoS One 8 (2) (2013) e54847.
[7] K.W. Boyack, R. Klavans, Including cited non-source items in a large-scale map of science: What difference does it make? J. Informetr. 8 (3) (2014) 569–580.
[8] X. Sun, K. Ding, Y. Lin, Mapping the evolution of scientific fields based on cross-field authors, J. Informetr. 10 (3) (2016) 750–761.
[9] F. Osborne, G. Scavo, E. Motta, Identifying diachronic topic-based research communities by clustering shared research trajectories, European Semantic Web Conference (2014) 114–129.
[10] W. Ding, C. Chen, Dynamic topic detection and tracking: a comparison of HDP, C-word, and cocitation methods, J. Assoc. Inf. Sci. Technol. 65 (10) (2014) 2084–2097, doi: http://dx.doi.org/10.1002/asi.23134.
[11] Y. Zhang, H. Chen, J. Lu, G. Zhang, Detecting and predicting the topic change of Knowledge-based Systems: a topic-based bibliometric analysis from 1991 to 2016, Knowl. Based Syst. 133 (Suppl. C) (2017) 255–268, doi: http://dx.doi.org/10.1016/j.knosys.2017.07.011.
[12] J. Ramsay, B.W. Silverman, Functional Data Analysis (Springer Series in Statistics), Springer, 2005, doi: http://dx.doi.org/10.1007/b98888.
[13] J. Jacques, C. Preda, Functional data clustering: a survey, Adv. Data Anal. Classif. 8 (3) (2014) 231–255, doi: http://dx.doi.org/10.1007/s11634-013-0158-y.
[14] J.L. Wang, J.M. Chiou, H.G. Mueller, Functional data analysis, Annu. Rev. Stat. Appl. 3 (1) (2016) 257–295, doi: http://dx.doi.org/10.1146/annurev-statistics-041715-033624.
[16] A. Morrone, Temi generali e temi specifici dei programmi di governo attraverso le sequenze di discorso, in: M. Villone, A. Zuliani (Eds.), L’attività dei governi della Repubblica italiana (1948–1994), Il Mulino, Bologna, 1996, pp. 351–369.
[17] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017.
[18] S. Bolasco, Taltac2.10. Sviluppi, esperienze ed elementi essenziali di analisi automatica dei testi, LED, Milano, 2010.
[19] P. Montero, J. Vilar, TSclust: An R package for time series clustering, J. Stat. Softw. 62 (1) (2014) 1–43, doi: http://dx.doi.org/10.18637/jss.v062.i01.
[20] B. Desgraupes, clusterCrit: Clustering Indices, R package version 1.2.7, (2016).
[21] C. Genolini, X. Alacoque, M. Sentenac, C. Arnaud, kml and kml3d: R Packages to Cluster Longitudinal Data, J. Stat. Softw. 65 (4) (2015) 1–34, doi: http://dx.doi.org/10.18637/jss.v065.i04.