Chronological corpora curve clustering: From scientific corpora construction to knowledge dynamics discovery through word life-cycles clustering


Method Article


Matilde Trevisani*, Arjuna Tuzzi

Department of Economics, Business, Mathematics and Statistics (DEAMS) of University of Trieste, Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA) of University of Padova, Italy

ABSTRACT

The aim of this procedural method is to construct well-founded corpora of scientific literature and, hence, to track the evolution of knowledge fields from the reconstruction and clustering of words' life-cycles. The method contains:

- an original selection process of relevant keywords involving the identification of relevant stems and stem n-grams through a matching with item lists of relevant glossaries;
- several types of normalization of temporal trajectories of word raw frequencies;
- a properly customized clustering of word life-cycles, with an extensive graphical investigation of the best candidates for cluster number, to unveil the important dynamics and decipher the history of a scientific field.

ARTICLE INFO

Method name: Chronological corpora curve clustering

Keywords: Diachronic corpora, Functional data analysis, Normalization, Cluster number selection

Article history: Received 23 March 2018; Accepted 10 November 2018; Available online 19 November 2018

Specifications Table

Subject area                  Computer Science

More specific subject area    Computational linguistics

Method name                   Chronological corpora curve clustering

* Corresponding author.

E-mail address: matilde.trevisani@deams.units.it (M. Trevisani).

https://doi.org/10.1016/j.mex.2018.11.010

creativecommons.org/licenses/by-nc-nd/4.0/).

Contents lists available at ScienceDirect

MethodsX


Name and reference of original method    M. Trevisani, A. Tuzzi [1] Learning the evolution of disciplines from scientific literature: A functional clustering approach to normalized keyword count trajectories, Knowledge-Based Systems 146 (2018) 129–141.

Method details

Given a knowledge field of interest, the procedural method consists of two main phases:

I An information retrieval process that, starting from a large corpus of texts retrieved from scientific articles published over a lengthy period by a selection of premier journals of the field, leads to an effective representation of the corpus by a lexical contingency table reporting the frequencies over time of all relevant keywords.

II A statistical learning process that, through four stages:

- normalization of time trajectories of word (raw) frequencies, chosen according to the different aspects of word life-cycles to be highlighted;
- filtering time trajectories of word (normalized) frequencies, interpreted as functional data (FD) and thus represented as smooth functions;
- curve clustering (CC) to discover important macro-dynamics latent to word micro-histories;
- interpretation by expert opinion to decipher detected dynamics,

leads to a reading (or readings) of the history of the knowledge field.

We adopt a basis function approach to filtering with a B-spline basis system. Moreover, we take a distance-based approach to CC and use a k-means algorithm for FD combined with an appropriate metric for measuring distance between curves.

Related work

The method aims at composing a history of a field of knowledge by a distant reading of scientific literature available through an articles database. This objective situates our method within the various approaches for science mapping, which have drawn much attention in recent years. However, the main methodologies developed in bibliometrics, scientometrics, informetrics and related fields, though partly sharing similar purposes, are substantively different from our proposal and cannot answer our particular question effectively.

Topic modelling aims at detecting topics, i.e. thematic groups, in collections of documents. Moreover, when documents exhibit a temporal ordering, it enables the discovery of topic trends. Latent Dirichlet Allocation (LDA), the most widespread topic model, is a probabilistic generative process that models each document as a mixture of topics where each topic corresponds to a multinomial distribution over words [2]. Topics over time can be detected by modelling time jointly with word co-occurrence patterns for topic discovery [3,4]. A further extension of LDA incorporates both the temporal ordering and the authorship information of documents to improve the topic discovery process [5]. Topic modelling connects to scientometrics or, more in general, to quantitative methods for mapping knowledge domains from scientific article databases. These methods are based on term and/or citation co-occurrences in documents, possibly observed over time in order to reconstruct a field's evolution [6,7]. Recent developments of co-citation network-based analyses build a dynamic scientific map via overlapping authors across fields [8] or via communities of authors working on semantically related topics at the same time [9].


possibly observed over time, while our work considers word co-occurrence solely in time, as our primary focus is the temporal evolution of words. Then, more importantly, topic-centered methods focus first on the structure of science and on detecting topics and then on tracking their evolution, whereas our approach focuses first on tracing life cycles of words and then on detecting important dynamics of temporally homogeneous groups of words in order to decipher the history of a knowledge field. As a consequence, in topic-centered methods, words that represent the same topic (as they appear together in documents) may have an irreconcilable temporal evolution, whereas, in our approach, different themes, research fields and schools of thought can in principle be represented within the same group of words. Moreover, in topic-centered methods, topic evolution can only be a roadmap, i.e., an abstract description (the average evolution of words grouped by co-occurrence) of basic movements over time. Additionally, the abstract definition of topics is subject to continuous destruction and reconstruction over time, making topic tracking a fragile and questionable artefact. Conversely, in our approach, the detected dynamics really represent temporal patterns of words, e.g., essentially increasing, decreasing or constant trends, trends with an isolated peak for briefly faddish words, or roughly bell-shaped trends for words which had a golden age and then disappeared.

Finally, our choice of specific statistical tools is underpinned by the literature as follows. The basis function approach is the most widely used for representing FD, and B-splines are a very flexible basis system for non-periodic FD [12]. Moreover, B-splines enable us to recognise continuous and regular curves, and hence more easily interpretable shapes. Upstream, we decided on a distance-based approach to CC, as one of our objectives was to set up an exploratory and mostly automated procedure. In fact, the procedure is called upon to look for interesting patterns to be submitted to experts who can potentially formulate new hypotheses and research questions. This eminently exploratory task requires the procedure to be fast and relatively easy to use and understand even by non-statisticians in interdisciplinary groups involved in research projects. Within distance-based methods, k-means type clustering algorithms have been widely applied to FD, especially when combined with the finite basis expansion approach. Other strategies which extend the classical k-means algorithm to FD are essentially based on functional principal components. However, they are recent extensions, rarely used and, thus, less justifiable as the basis for our explorative approach (some interesting overviews of strategies for clustering FD are provided by [13] and [14]).

Procedure

I – Compiling and pre-processing the corpus

Corpus design and compilation

0 Selection of data sources, i.e. choice of outstanding journals able to cover main topics and represent the temporal evolution of the knowledge field.

1 Text harvesting, i.e. downloading of available information on articles (authors, title/abstract/full text, number, issue, volume) from journal archives, to constitute the corpus. Texts under consideration may consist of titles or abstracts or full texts of the articles. The corpus is typically organized into subcorpora, i.e. collections of texts sharing the same time reference, thus generating a sequence of text sets associated with chronological points on the time axis.

2 Tokenization of the corpus, i.e. identification of all words (sequences of letters isolated by means of separators). The corpus contains a finite set of different words (i.e. word-types) that represents its vocabulary (or word list). A word-token is a particular occurrence of a word-type and the number of occurrences is the word-type frequency.

Preparation of textual data

3 Stemming, i.e. transformation of words into stems by means of Porter's stemming algorithm [15].


5 Tagging keywords, i.e. identification of all relevant statistical keywords (stems and stem-segments) by matching the (stemmed) vocabulary of the corpus with the (stemmed) list of items retrieved from relevant glossaries of the knowledge field. The tagging procedure assigns a label to all vocabulary items that are included in glossaries.

6 Thresholding, i.e. selection of all keywords with frequencies at least equal to a suitably fixed threshold.

Finally, the corpus is represented by a keywords × documents/time-points contingency table containing the frequencies of the selected keywords (by row) along the time-points (by column) of the considered period.

Stemming can be carried out by the Porter Stemmer available online (http://textanalysisonline.com/nltk-porter-stemmer) or, alternatively, within the R software environment [17] by the wordStem routine of the SnowballC library. We use Taltac software [18] for tagging, though it can be equivalently performed by any software enabling the comparison between two lists (e.g., Excel).
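The tokenization, tagging and thresholding steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Taltac/R workflow used by the authors: the function name `build_contingency_table`, its parameters and the toy titles are hypothetical, the stemmer is left as a pluggable function (identity by default, standing in for a Porter stemmer), and only single-stem keywords are handled.

```python
import re
from collections import Counter


def build_contingency_table(subcorpora, glossary, threshold, stem=lambda w: w):
    """Return {keyword: [frequency at each time-point]}.

    subcorpora: list of (time_label, list_of_texts) in chronological order
    glossary:   iterable of single-word glossary items (assumption of this sketch)
    threshold:  minimum total frequency required to keep a keyword
    stem:       stemming function; identity here, a placeholder for Porter's stemmer
    """
    glossary_stems = {stem(item.lower()) for item in glossary}

    counts_per_time = []
    for _, texts in subcorpora:
        counts = Counter()
        for text in texts:
            # Tokenization: sequences of letters isolated by separators.
            tokens = re.findall(r"[a-z]+", text.lower())
            counts.update(stem(tok) for tok in tokens)
        counts_per_time.append(counts)

    # Total frequency of each stem over the whole corpus.
    total = Counter()
    for counts in counts_per_time:
        total.update(counts)

    # Tagging (glossary match) followed by thresholding.
    keywords = sorted(
        k for k in total if k in glossary_stems and total[k] >= threshold
    )
    return {k: [counts[k] for counts in counts_per_time] for k in keywords}


# Toy example with two time-points (hypothetical titles).
subcorpora = [
    ("1990", ["A model for regression", "Testing the model"]),
    ("1991", ["Regression models and tests"]),
]
table = build_contingency_table(
    subcorpora, ["model", "regression", "test"], threshold=2
)
```

With the identity stemmer, "models" and "tests" are not merged with "model" and "test"; passing a real stemming function as `stem` would change the counts accordingly.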

II – Statistical learning

Normalization

A chronological corpus is typically characterized by the following features.

(i) Size of subcorpora (number of texts and their size in word-tokens) may vary greatly over time.

(ii) The large number of rare events (LNRE) property of textual data, i.e. a large number of word-types having a quite low probability of occurring. This property implies:

- total frequency (or popularity) of individual words in the entire corpus is greatly variable;
- frequency spectrum by time-point is highly asymmetric;
- sparsity, i.e. many cells of the contingency table have small counts or are empty.

In the section Method validation, features (ii) are evident from the plot of the original word trajectories (Fig. 2). Classification of words according to their popularity highlights the great disparity of curve amplitude between high-frequency and low-frequency words (VH, H, L and VL classes are identified by colour intensity in Fig. 2) and the 0-level curve sections characterizing rare words.

From the foregoing, normalization of raw frequencies is necessary to properly reconstruct and compare the temporal evolution of words.

Several types of normalization are shown in Table 1 (which is an excerpt of Table A.2 in [1]). A normalization by column (c1, c2, c3 or c4) is necessary to adjust for the uneven document dimension across time (i). A normalization by row (r1, r2 or r3) allows word trajectories to be compared by timing (synchrony) regardless of height (popularity) (ii). A double (both by row and column) normalization (d) serves to fix both (i) and (ii).

In the section Method validation, the calculation of a specific double normalization (d1) is shown.
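To make the effect of the row normalizations concrete, the following Python sketch applies the three row options labelled r1, r2 and r3 in Table 1 to a toy table. It is a minimal illustration under assumptions: the function name `normalize_rows` is hypothetical, the z-score uses the population standard deviation (the source does not say which variant is used), and every row is assumed to have a positive total and non-constant values.

```python
import math


def normalize_rows(table, kind):
    """Normalize each row (word trajectory) of a frequency table.

    kind: 'r1' divide by the row sum,
          'r2' z-score by row (population standard deviation, an assumption),
          'r3' divide by the maximum row frequency.
    """
    out = []
    for row in table:
        if kind == "r1":
            total = sum(row)
            out.append([v / total for v in row])
        elif kind == "r2":
            mean = sum(row) / len(row)
            sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
            out.append([(v - mean) / sd for v in row])
        elif kind == "r3":
            peak = max(row)
            out.append([v / peak for v in row])
        else:
            raise ValueError(f"unknown normalization: {kind}")
    return out


# Two words with the same timing but very different popularity.
raw = [
    [1, 2, 1],     # low-frequency word
    [10, 20, 10],  # high-frequency word
]
r1 = normalize_rows(raw, "r1")
r2 = normalize_rows(raw, "r2")
r3 = normalize_rows(raw, "r3")
```

In this toy case all three row normalizations map the two trajectories onto the same curve, which is the sense in which words are compared by synchrony regardless of popularity.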

Filtering

In our method, the time trajectory of word frequencies is viewed as a proxy of word diffusion and vitality, i.e. of word life-cycle. Then, we adopt a functional data analysis (FDA) approach under which the time trajectory of word frequencies constitutes a functional datum assumed to be a realization of an underlying continuous function representing the word temporal evolution.

Table 1

Excerpt of the normalization plan from Table A.2 in [1].

Normalization       by col: Subcorpus        by col: Matrix
by row              #titles    #tokens       col sum (p)    col max freq    (none)
row sum             d          d             d1             d               r1
z-score by row      d          d             d              d               r2
max row freq        d          d             d              d               r3


Let $y_i = \{y_{ij}\}$ be the functional observation of word $i$, consisting of the set of (normalized) frequencies at time-points $j = 1, \ldots, T$, for each $i = 1, \ldots, N$, and $x_i(t)$ the underlying continuous function representing the word temporal development. The following choices are taken for filtering $x_i(t)$ from $y_i$.

We adopt the basis function approach for representing FD as smooth functions where $x_i(t)$ is expressed as a finite linear combination of basis functions [12]. We consider B-spline bases, which are piecewise polynomials joined smoothly at the interior nodes. Lastly, we place knots (the values of $t$ at which adjacent segments are joined) at each time-point of observation.
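As an illustration of what a B-spline basis with knots at every time-point looks like, the sketch below evaluates the basis functions with the Cox-de Boor recursion in plain Python. It is not the fda routine used in the method: the function names `augmented_knots` and `bspline_basis` are hypothetical, and the knot vector is built under the usual convention of repeating the boundary knots so that a basis of order m (polynomial degree m - 1) on T time-points has T + m - 2 functions.

```python
def augmented_knots(time_points, order):
    """Repeat the two boundary knots so that each appears `order` times."""
    first, last = time_points[0], time_points[-1]
    return [first] * (order - 1) + list(time_points) + [last] * (order - 1)


def bspline_basis(t, knots, order):
    """Values at t of all B-spline basis functions of the given order."""
    n_intervals = len(knots) - 1
    # Order 1: indicator functions of the half-open knot intervals.
    basis = [
        1.0 if knots[i] <= t < knots[i + 1] else 0.0 for i in range(n_intervals)
    ]
    if t == knots[-1]:
        # Close the last non-empty interval on the right.
        for i in range(n_intervals - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Cox-de Boor recursion: raise the order one step at a time.
    for k in range(2, order + 1):
        new = []
        for i in range(len(knots) - k):
            left_den = knots[i + k - 1] - knots[i]
            right_den = knots[i + k] - knots[i + 1]
            left = (t - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (
                (knots[i + k] - t) / right_den * basis[i + 1]
                if right_den > 0
                else 0.0
            )
            new.append(left + right)
        basis = new
    return basis


# Five observation time-points, spline order m = 3 (piecewise quadratics).
time_points = [1, 2, 3, 4, 5]
knots = augmented_knots(time_points, order=3)
values_mid = bspline_basis(2.5, knots, order=3)
values_end = bspline_basis(5, knots, order=3)
```

At any point of the observation interval the basis values are non-negative and sum to one, and only `order` of them are non-zero, which is what makes the resulting curves locally controlled and easy to interpret.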

As regards the estimation, we adopt the roughness penalty approach for smoothing FD where the estimate of $x_i$ is the one optimizing the bias-variance trade-off by tuning the smoothing parameter $\lambda$. We consider the generalized cross validation (GCV) criterion for selecting the optimal smoothing by varying spline order $m$ ($m$ from 1 to 8) as well as roughness penalty order $r$ (besides the standard $r = m - 2$: $r = 2$ for $m > 3$, $r = 1$ for $m > 2$ and, finally, $r = 0$) [1].

In the section Method validation, the optimal smoothing selection is illustrated for the case of d1 normalized data (Fig. 4).

The calculation is carried out within the fda library in R and an ad-hoc developed routine.
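For readers less familiar with the roughness penalty approach, the quantities involved can be written out as follows. This is the textbook formulation found in [12], given here for a single curve and offered as a reading aid rather than a transcription of the authors' routine (which may, for instance, aggregate the criterion over all curves). The symbols $\phi_k$, $c_{ik}$, $K$, $\Phi$, $R$ and $S_\lambda$ are notation introduced here for illustration.

```latex
% Basis expansion: K = T + m - 2 B-splines of order m with knots at the T time-points
x_i(t) = \sum_{k=1}^{K} c_{ik}\,\phi_k(t)

% Penalized least squares criterion with a roughness penalty of order r
\mathrm{PENSSE}_\lambda(x_i) = \sum_{j=1}^{T} \bigl[y_{ij} - x_i(t_j)\bigr]^2
    + \lambda \int \bigl[D^r x_i(t)\bigr]^2 \, dt

% Closed-form solution, with \Phi_{jk} = \phi_k(t_j) and
% R_{kl} = \int D^r\phi_k(t)\, D^r\phi_l(t)\, dt
\hat{c}_i = (\Phi^\top \Phi + \lambda R)^{-1} \Phi^\top y_i,
\qquad
S_\lambda = \Phi (\Phi^\top \Phi + \lambda R)^{-1} \Phi^\top

% Effective degrees of freedom and generalized cross validation
\mathrm{df}(\lambda) = \operatorname{trace}(S_\lambda),
\qquad
\mathrm{GCV}(\lambda) = \frac{T}{T - \mathrm{df}(\lambda)} \cdot
    \frac{\mathrm{SSE}(\lambda)}{T - \mathrm{df}(\lambda)}
```

Larger values of $\lambda$ shrink $\mathrm{df}(\lambda)$ and give smoother curves at the cost of a larger SSE; GCV balances the two, which is the bias-variance trade-off referred to above.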

Curve clustering

We adopt a distance-based method to CC where the distance between curves is approximated by using the discretely observed evaluation points of the estimated curves $x_i(t)$ [13].

The following choices are taken for clustering:

- k-means algorithm;
- several options for distance: besides the conventional distances (Euclidean or Manhattan, among others), other options can be taken from the broad range of dissimilarity measures set out to perform clustering of time series [19];
- for each cluster number (k from 2 to a suitable range maximum), 20 re-runs from different initial configurations set through the k-means++ seeding method;
- the best candidates for cluster number are identified by pooling the ratings from a large number of clustering quality criteria (about 50, see [20] and [21]). In more detail, in order:
  - a ranking of cluster numbers is computed for each quality index,
  - all the rankings are pooled and, for each cluster number, the frequencies of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) are calculated,
  - an ordered set of best candidates for cluster number is retrieved from a qualitative inspection of the graphical representation of the frequencies of being in the first four top positions for each cluster number (see Fig. 6 in section Method validation for illustration).

Clustering results obtained with the cluster numbers selected as the best candidates are then presented to experts (of the subject matter), who may suggest further analyses.
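The clustering step (k-means on discretized curves, k-means++ seeding, repeated re-runs) can be sketched in plain Python as follows. This is a didactic stand-in for the kml routine, not its implementation: the function names, the default of 20 restarts taken from the text, the squared Euclidean distance and the toy curves are all assumptions of this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two discretized curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(curves, k, rng):
    """k-means++ seeding: later centers are drawn far from earlier ones."""
    centers = [list(rng.choice(curves))]
    while len(centers) < k:
        d2 = [min(sq_dist(c, ctr) for ctr in centers) for c in curves]
        if sum(d2) == 0:
            # Every curve coincides with a center: fall back to a uniform draw.
            centers.append(list(rng.choice(curves)))
        else:
            centers.append(list(rng.choices(curves, weights=d2, k=1)[0]))
    return centers


def kmeans(curves, k, rng, max_iter=100):
    """One k-means run; returns (labels, centers, within-cluster sum of squares)."""
    centers = kmeans_pp_init(curves, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(x, centers[c])) for x in curves
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [x for x, lab in zip(curves, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sq_dist(x, centers[lab]) for x, lab in zip(curves, labels))
    return labels, centers, wss


def best_of_restarts(curves, k, n_restarts=20, seed=0):
    """Keep the run with the smallest within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans(curves, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[2])


# Toy curves: three rising and three falling trajectories.
curves = [
    [0.0, 1.0, 2.0], [0.0, 1.1, 2.1], [0.1, 1.0, 2.2],
    [2.0, 1.0, 0.0], [2.1, 1.0, 0.0], [2.0, 0.9, 0.1],
]
labels, centers, wss = best_of_restarts(curves, k=2)
```

Replacing `sq_dist` with another dissimilarity changes only the assignment step; the mean-based center update is, strictly speaking, tied to the Euclidean case.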

R contains several k-means implementations as well as libraries for computing clustering quality criteria. Our procedure uses the kml routine [21], which is designed specifically for longitudinal data and which provides various efficient methods of k-means initialization. clusterCrit [20] and kml [21] are the packages used to gather the large basket of quality criteria considered by our method. These include measures of within-cluster homogeneity, e.g., Ball-Hall, Banfield-Raftery, C-index, Marriott, Scott-Symons; of between-cluster separation, e.g., Rubin, Scott, Ratkowsky-Lance; and of their combination, e.g., Calinski-Harabasz, Davies-Bouldin, Dunn and its generalizations, Gamma, Hartigan, McClain, PBM, Point-Biserial, Ray-Turi, SD, Silhouette, Friedman, Xie-Beni, Tau; as well as measures of similarity between the empirical within-cluster distribution and distributional shapes such as the Gaussian distribution, e.g., BIC, AIC and their variants.

Method validation


Corpus design and compilation

0 The American Statistical Association (ASA) represents the world's largest community of statisticians and the Journal of the ASA (JASA) has long been considered the world's premier journal in its field. Established in 1888, JASA, which has two predecessors (Publications of the ASA, 1888–1912, Quarterly Publications of the ASA, 1912–1921), is one of the oldest and most prestigious statistical journals.

1 Download from journal archives of available information for all issues that refer to 12,577 items published in the period 1888–2012 (125 years, from Volume No. 1, Issue No. 1, to Volume No. 107, Issue No. 500, since at the very beginning the volumes of the ASA's journals were biennial). Titles of articles are the text considered in this study.

2 After discarding items that are not articles (e.g., List of publications, News) or do not include content words (e.g., Comment, Rejoinder), the corpus includes 10,077 titles and is composed of 7746 word-types and 87,060 word-tokens.

Preparation of textual data

3 After stemming, 4834 different stems are obtained (e.g., the word-types model, models, modeling and modelling are replaced with the same stem model).

4 All potentially relevant stem-segments are identified (e.g., model select, hierarch model, loglinear model) and included in the word list.

5 Relevant statistical keywords (e.g. stemmed words: statist, model, test, distribut, analysi, regress, probabl; and sequences of stemmed words: time seri, regress model, conting tabl, confid interv, maximum likelihood estim, analysi of varianc, normal distribut) are tagged by matching the stemmed vocabulary of the corpus with a stemmed list (over 12,700 unique entries) including all non-redundant entries of six Statistics glossaries:

  1 ISI - International Statistical Institute;
  2 OECD - Organisation for Economic Cooperation and Development;
  3 Statistics.com - Institute for Statistics Education;
  4 StatSoft Inc.;
  5 University of California, Berkeley;
  6 University of Glasgow.

6 After fixing the threshold at 10, 900 keywords are finally selected.
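Steps 4 and 5 can be illustrated with a short Python sketch that enumerates contiguous stem sequences (stem-segments) in each title and keeps those found in a stemmed glossary. The function names `stem_ngrams` and `tag_keywords`, the maximum segment length and the toy glossary are hypothetical; in particular, this sketch also counts a single stem when it occurs inside a longer matched segment, which may differ from the counting rule applied by the authors.

```python
from collections import Counter


def stem_ngrams(stems, max_n):
    """All contiguous sequences of 1..max_n stems, joined by single spaces."""
    return [
        " ".join(stems[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(stems) - n + 1)
    ]


def tag_keywords(titles_stems, glossary_stems, max_n=3):
    """Count the stems and stem-segments that match a stemmed glossary."""
    counts = Counter()
    for stems in titles_stems:
        counts.update(g for g in stem_ngrams(stems, max_n) if g in glossary_stems)
    return counts


# Toy stemmed titles and a toy stemmed glossary (both hypothetical).
titles_stems = [
    ["maximum", "likelihood", "estim", "for", "time", "seri"],
    ["model", "select", "in", "time", "seri"],
]
glossary_stems = {
    "time seri", "maximum likelihood estim", "model select", "estim", "model",
}
keyword_counts = tag_keywords(titles_stems, glossary_stems)
```

Applying the same matching separately to each volume gives the rows of the keywords × time-points table, to which the frequency threshold is then applied.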

Fig. 1. Excerpt of the 900 (words) × 107 (volumes) table from the corpus of titles of papers published by the ASA's journals


At the end, the corpus originates a 900 (words) × 107 (time-points/volumes) contingency table (Fig. 1).

Normalization

For illustration, we choose to transform data (Fig. 2) by the double normalization d1 (Table 1), which is equivalent to calculating a $\chi^2$ distance between original word profiles if the Euclidean distance is used as measure of dissimilarity.

Let $n_{ij}$ be the raw frequency of word $i$ at time-point/volume $j$, $n_{i\cdot}$ the $i$-th row sum, $n_{\cdot j}$ the $j$-th column sum and $n$ the matrix total of the corpus table. Then, the d1 normalized frequency is computed as:

$$y_{ij} = \frac{n_{ij}}{n_{i\cdot}\sqrt{n_{\cdot j}/n}}$$

($n_{\cdot j}/n$ is the $j$-th column mass in correspondence analysis).
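The stated equivalence can be checked numerically. The Python sketch below computes the d1 normalization from the definition above and compares the Euclidean distance between two d1 normalized rows with the $\chi^2$ distance between the corresponding row profiles. The function names and the toy table are hypothetical, and the sketch assumes that no row or column of the table sums to zero.

```python
import math


def d1_normalize(table):
    """y_ij = n_ij / (n_i. * sqrt(n_.j / n)) for every cell of the table."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / n for j in range(n_cols)]
    return [
        [row[j] / (row_sums[i] * math.sqrt(col_mass[j])) for j in range(n_cols)]
        for i, row in enumerate(table)
    ]


def chi2_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    n = sum(sum(row) for row in table)
    n_cols = len(table[0])
    col_mass = [sum(row[j] for row in table) / n for j in range(n_cols)]
    profile_i = [v / sum(table[i]) for v in table[i]]
    profile_k = [v / sum(table[k]) for v in table[k]]
    return math.sqrt(
        sum((profile_i[j] - profile_k[j]) ** 2 / col_mass[j] for j in range(n_cols))
    )


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 3 (words) x 3 (time-points) frequency table.
counts = [
    [4, 0, 1],
    [10, 20, 10],
    [0, 3, 3],
]
y = d1_normalize(counts)
gap = abs(euclidean(y[0], y[1]) - chi2_distance(counts, 0, 1))
```

The two distances agree up to floating-point error because dividing each profile entry by the square root of the column mass moves the $\chi^2$ weights inside the squared difference.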

Note that this double normalization produces a somewhat reversed asymmetry (low-frequency words tend to dominate in amplitude over high-frequency words, see the inversion of color intensity in Fig. 3). This is mainly due to a greater sparsity of low-frequency keywords across time.

Filtering

Optimal smoothing for d1 normalized data is achieved with spline order $m = 3$ and smoothing parameter $\lambda = 10^{1.75}$ (df = 7.4) under a roughness penalty of order $r = 1$ (Fig. 4).

A sample of curves fitted by the optimal smoothing is shown in Fig. 5, from the word with highest root mean square (RMS) residual (rural) to the word with lowest RMS residual (model).

Fig. 2. Word trajectories (original data): y-axis represents the word raw frequency for each volume; x-axis represents the volume publication year; line color identifies the word frequency class (Very Low, Low, High and Very High denote equal-frequency intervals of word total frequency in the entire corpus). An example of word trajectory has been superimposed for each


Curve clustering

Curves are partitioned by means of the k-means algorithm combined with the Euclidean distance, with cluster number k ranging from 2 to 26 and 20 re-runs for each k.

A set of 49 quality criteria is then computed in order to identify the best candidates for cluster number.

Fig. 4. Smoothing selection: overview of $\log_{10}\lambda$, effective degrees of freedom (df), sum of square errors (SSE) and GCV by varying order m and roughness penalty order r (PENr). Optimal smoothing is obtained by minimizing GCV. d1 normalization.


Visual representation of the rating for the cluster number shows (Fig. 6) that:

(i) partitions into two/three clusters are the best rated,

(ii) partitions with a cluster number close to the maximum of the considered range (24–26) have also been frequently selected in the highest positions,

(iii) in the range of more interesting cluster numbers (neither too low nor too high), the most selected in the top four positions is 6, second is 4, third is 19 (the eye should be guided both by the bar height, corresponding to the cumulated frequency of being in the top four, and by the color composition, informing on the position level).

Note that the final set of best candidates for cluster number is the output of an R code that essentially mimics a qualitative rating purely based on a graphical inspection.
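The pooling of rankings behind Fig. 6 can be sketched in Python as follows. This is not the R code mentioned above, only an illustration of the counting it relies on: the function names, the criterion labels and the toy rankings are hypothetical, and each criterion is assumed to supply cluster numbers already ordered from best to worst.

```python
from collections import Counter


def top_position_frequencies(rankings, depth=4):
    """For each position 1..depth, count how often each cluster number holds it.

    rankings: {criterion name: [cluster numbers ordered from best to worst]}
    """
    freq = {pos: Counter() for pos in range(1, depth + 1)}
    for ranked in rankings.values():
        for pos, k in enumerate(ranked[:depth], start=1):
            freq[pos][k] += 1
    return freq


def cumulated_top(freq):
    """Cumulated frequency of being in any of the top positions (bar height)."""
    total = Counter()
    for counter in freq.values():
        total.update(counter)
    return dict(total)


# Toy rankings from three hypothetical quality criteria.
rankings = {
    "criterion_a": [2, 6, 4, 3],
    "criterion_b": [6, 2, 4, 19],
    "criterion_c": [2, 6, 19, 4],
}
freq = top_position_frequencies(rankings)
bar_heights = cumulated_top(freq)
```

Here `freq[1]` gives the top-1 counts (the bottom color band of a bar) and `bar_heights` the total bar height for each cluster number, the two features the eye is guided by in the graphical inspection.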

Fig. 5. Optimal smoothing fit: a selection of fitted curves ordered according to decreasing root mean square (RMS) residual. Fit of a smoothing spline of order m = 3, with PEN1, to d1 normalized data.

Fig. 6. Cluster number selection: frequency of being ranked first (top-1), second (top-2), third (top-3) and fourth (top-4) for


Fig. 7. Clustering: best partition into 6 groups. d1 normalization.


Solution (i) (the two-group partition) reflects the substantial bifurcation of the historical period around the sixties, when Statistics was born as an autonomous discipline (see [1] and [22] for explanation). Solution (ii) (25/26-group partition), on the one hand, may reflect the lack of a defined structure and parsimonious grouping but, on the other, may be a failure due to the standard assumption underlying many quality criteria that data are normally distributed and hence that clusters are compact and convex. That said, we choose to investigate the most interesting solution (iii), that is the set of cluster numbers neither too small nor too large, and to subject them to the scrutiny of experts.

Here, we illustrate the best partition found with the cluster number ranked first, that is k = 6. The graphical output shows the groups all together with the cluster mean patterns (Fig. 7), and individually (Fig. 8). Note that, in order to make the reading easier, stems have been replaced with the singular noun or, in case this is not present in the corpus, with the typical word related to the stem. Moreover, in order to make the identification of possible subsequent phases in the knowledge field evolution easier, individual clusters have been chronologically ordered. The dynamics found are then examined and, if considered interesting, eventually interpreted by subject matter experts. A possible reading of the history of Statistics on the basis of the illustrated findings is offered in [1].

Acknowledgements

This study was supported by the University of Padova, fund CPDA145940 “Tracing the History of Words. A Portrait of a Discipline Through Analyses of Keyword Counts in Large Corpora of Scientific Literature” (P.I. Arjuna Tuzzi, 2014).

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.mex.2018.11.010.

References

[1] M. Trevisani, A. Tuzzi, Learning the evolution of disciplines from scientific literature: a functional clustering approach to normalized keyword count trajectories, Knowl. Based Syst. 146 (2018) 129–141, doi:http://dx.doi.org/10.1016/j.knosys.2018.01.035.
[2] D.M. Blei, A.Y. Ng, M. Jordan, Latent Dirichlet Allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[3] T.L. Griffiths, M. Steyvers, Finding scientific topics, Proc. Natl. Acad. Sci. 101 (Suppl. 1) (2004) 5228–5235.
[4] D.M. Blei, J.D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, (2006), pp. 113–120.
[5] L. Bolelli, Ş. Ertekin, C.L. Giles, Topic and trend detection in text collections using latent dirichlet allocation, Advances in Information Retrieval, ECIR 2009, Lect. Notes Comput. Sci. 5478 (2009) 776–780.
[6] D. Chavalarias, J.-P. Cointet, Phylomemetic patterns in science evolution – the rise and fall of scientific fields, PLoS One 8 (2) (2013) e54847.
[7] K.W. Boyack, R. Klavans, Including cited non-source items in a large-scale map of science: What difference does it make? J. Informetr. 8 (3) (2014) 569–580.
[8] X. Sun, K. Ding, Y. Lin, Mapping the evolution of scientific fields based on cross-field authors, J. Informetr. 10 (3) (2016) 750–761.
[9] F. Osborne, G. Scavo, E. Motta, Identifying diachronic topic-based research communities by clustering shared research trajectories, European Semantic Web Conference (2014) 114–129.
[10] W. Ding, C. Chen, Dynamic topic detection and tracking: a comparison of HDP, C-word, and cocitation methods, J. Assoc. Inf. Sci. Technol. 65 (10) (2014) 2084–2097, doi:http://dx.doi.org/10.1002/asi.23134.
[11] Y. Zhang, H. Chen, J. Lu, G. Zhang, Detecting and predicting the topic change of Knowledge-based Systems: a topic-based bibliometric analysis from 1991 to 2016, Knowl. Based Syst. 133 (Suppl. C) (2017) 255–268, doi:http://dx.doi.org/10.1016/j.knosys.2017.07.011.
[12] J. Ramsay, B.W. Silverman, Functional Data Analysis (Springer Series in Statistics), Springer, 2005, doi:http://dx.doi.org/10.1007/b98888.
[13] J. Jacques, C. Preda, Functional data clustering: a survey, Adv. Data Anal. Classif. 8 (3) (2014) 231–255, doi:http://dx.doi.org/10.1007/s11634-013-0158-y.
[14] J.L. Wang, J.M. Chiou, H.G. Mueller, Functional data analysis, Annu. Rev. Stat. Appl. 3 (1) (2016) 257–295, doi:http://dx.doi.org/10.1146/annurev-statistics-041715-033624.


[16] A. Morrone, Temi generali e temi specifici dei programmi di governo attraverso le sequenze di discorso, in: M. Villone, A. Zuliani (Eds.), L'attività dei governi della Repubblica italiana (1948–1994), Il Mulino, Bologna, 1996, pp. 351–369.
[17] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2017.
[18] S. Bolasco, Taltac 2.10. Sviluppi, esperienze ed elementi essenziali di analisi automatica dei testi, LED, Milano, 2010.
[19] P. Montero, J. Vilar, Tsclust: An R package for time series clustering, J. Stat. Softw. 62 (1) (2014) 1–43, doi:http://dx.doi.org/10.18637/jss.v062.i01.
[20] B. Desgraupes, clusterCrit: Clustering Indices, R package version 1.2.7, (2016).
[21] C. Genolini, X. Alacoque, M. Sentenac, C. Arnaud, kml and kml3d: R Packages to Cluster Longitudinal Data, J. Stat. Softw. 65 (4) (2015) 1–34, doi:http://dx.doi.org/10.18637/jss.v065.i04.