• Non ci sono risultati.

Wheeler graphs: A framework for BWT-based data structures

N/A
N/A
Protected

Academic year: 2021

Condividi "Wheeler graphs: A framework for BWT-based data structures"

Copied!
12
0
0

Testo completo

(1)

Contents lists available atScienceDirect

Theoretical

Computer

Science

www.elsevier.com/locate/tcs

Wheeler

graphs:

A

framework

for

BWT-based

data

structures

Travis Gagie

a

,

Giovanni Manzini

b

,

c

,

,

Jouni Sirén

d

aDiegoPortalesUniversityandCEBIB,Santiago,Chile bUniversityofEasternPiedmont,Alessandria,Italy cIIT-CNR,Pisa,Italy

dWellcomeTrustSangerInstitute,Cambridge,UK

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory: Received3March2017

Receivedinrevisedform3June2017 Accepted6June2017

Availableonline27June2017 Keywords:

Compresseddatastructures Burrows–Wheelertransform Patternmatching

ThefamousBurrows–WheelerTransform(BWT)wasoriginallydefined forasinglestring butvariationshavebeendevelopedforsetsofstrings,labeled trees,deBruijngraphs,etc. Inthispaperweproposeaframeworkthatincludesmanyofthesevariationsandthatwe hopewillsimplifythesearchformore.

WefirstdefineWheelergraphs andshowtheyhaveapropertywecallpathcoherence.We showthatifthestatediagramofafinite-stateautomatonisaWheelergraphthen,byits pathcoherence,wecanorderthenodessuchthat,foranystring,thenodesreachablefrom theinitialstateorstatesbyprocessingthatstringareconsecutive.Thismeansthateven iftheautomatonisnon-deterministic,wecanstill storeitcompactlyandprocessstrings withitquickly.

We thenrederiveseveralvariationsoftheBWTby designingstraightforwardfinite-state automata forthe relevant problemsand showingthat theirstatediagramsare Wheeler graphs.

©2017TheAuthors.PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

TheBurrows–WheelerTransformation(BWT)hasaverypeculiarhistory.Firstconceivedin1983,itwas publishedonly elevenyearslaterinatechnicalreport[9],presumablybecauseitwassoinnovativethatthefirstreviewerswerenotableto graspitsfullsignificance.Afewyearslater,thecompressionalgorithm bzip2 basedontheBWTbecamepopular,challenging gzip’sdominance,thankstothefinelyengineeredimplementationofJulian Seward[46](the verysamecomputer scientist whogaveusalsotheinvaluabletool Valgrind[47]).

Afterits introductionasa compression tool,interest in theBWTwas rekindledwhen manyresearchers realizedthat, amongthe differenttechniquesdiscovered attheturn ofthecentury fordesigning compressedindexes [19,29,34],those basedon theBWTare probablythesimplestandmostspaceefficient [15,43].Afterthisrealization, inthelast tenyears wehavewitnessedan unusualphenomenonincomputerscience:variantsoftheBWThavebeenproposedandappliedto moreandmorecomplexobjects:fromtrees,tographs,to alignments.Thesevariantsareclearly relatedtotheBWTeven

ThisworkwassupportedbyAcademyofFinlandgrant268324,FONDECYT grant1171058,KITE-DiSITproject,INdAM-GNCSProject“Efficientalgorithms andtechniquesfortheorganization,managementandanalysisofbiologicalBigData”,andtheWellcomeTrustgrant[098051].

*

Correspondingauthor.

E-mailaddresses:travis.gagie@udp.cl(T. Gagie),

giovanni.manzini@uniupo.it

(G. Manzini),

jouni.siren@iki.fi

(J. Sirén). http://dx.doi.org/10.1016/j.tcs.2017.06.016

0304-3975/©2017TheAuthors.PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).

(2)

68 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

Fig. 1. Aneight-nodeWheelergraph.Node1hasin-degree0;edgeslabeled a entersinnodes2,3,4;edgeslabeled b innodes5,6;edgeslabeled c in nodes7,8.

if some ofthem nolonger havethe two main features ofthe original BWT,namelyofbeing invertibleandof “helping” compression.

AtthispointitisnaturaltoaskwhetherwehaveapproachedtheBWTastheblind menapproachedtheelephant(see, e.g., [45]), withonetouching aleg andthinking theelephant islikea tree,anothertouching thetrunk andthinkingit is likea snake,andyetanothertouching thetailandthinking itislikearope.Wedonotdisparageprevioussurveysofthe BWTandrelateddatastructures,such as[1,35,42],sincewe toohavespent yearstryingtomakesense ofoursometimes disparate impressionsofthe it.Withoutpretending togive a completeanswer,in thispaperwe propose aunifyingview formanydifferentBWTvariants.Somewhatsurprisinglywegetourunifyingview consideringtheNondeterministicFinite Automatarelatedtodifferentpatternmatchingproblems.Weshowthatthestategraphsassociatedtotheseautomatahave commonpropertiesthatwesummarizewiththeconceptofWheelergraphs.1

Using thenotion of a Wheeler graph,we show that it is possible to process strings efficiently, e.g., in linear time if the alphabet is constant, even ifthe automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactlyrepresentedandtraversedusinguptothreearrayswithadditionaldatastructuressupportingefficientrankand selectoperations. Itturnsout thatthesearrayscoincide with,oraresubstantiallyequivalent to,theoutputofmanyBWT variantsdescribedintheliterature.

We believeourunifyingview canhelpresearchersdevelopnewBWTvariantsandnewindexingdatastructures. How-ever, westress that not everyBWT-related datastructure fitsour framework:forexample Gangulyetal.’s parameterized BWT[25] andtheindexfororder-preservingmatchingin[23].Therefore,wehope thatourcontributionwillspurfurther researchresultinginawidervisionofthefascinatingfieldoriginatedbytheseminalworkofBurrowsandWheeler. 2. Definitionsandbasicresults

Consider adirected edge-labeledgraphG suchthateach edgeislabeledby acharacterfromantotally-ordered alpha-bet A.Weuse

todenotetheorderingamongA’selements.Labelsontheedgesleavingagivennodearenotnecessarily distinct, andtherecanbe multipleedgeslinkingthesamepairofnodes(forsimplicitywe stillusethetermgraph rather

thanthemoreformallycorrectmulti-graph).

Definition1.G is a Wheelergraph if there is an ordering of the nodes such that nodes within-degree 0 precede those with positivein-degree and, foranypairofedges e

= (

u

,

v

)

ande

= (

u

,

v

)

labeled a and a respectively,the following monotonicitypropertieshold:

a

a

⇒

v

<

v

,

(a

=

a

)

∧ (

u

<

u

)

⇒

v

v

.

(1)

AnexampleofaWheelergraphisshowninFig. 1.Asanimmediateconsequenceof(1),alledgesenteringagivennode musthavethesamelabel.WenowshowthatWheelergraphsalsopossessthefollowingproperty:

Definition2.G ispathcoherent ifthereisatotalorderofthenodessuchthatforanyconsecutiverange

[

i

,

j

]

ofnodesand string

α

,thenodesreachable fromthose in

[

i

,

j

]

in

|

α

|

steps byfollowingedges whoselabels for

α

whenconcatenated, themselvesformaconsecutiverange.

Lemma3.IfG isaWheelergraphbyanordering

π

onitsnodesthenitispathcoherentby

π

.

Proof. SupposeG isaWheelergraphby

π

.Consideraconsecutiverange

[

i

,

j

]

ofnodesandlet

[

i

,

j

]

bethesmallestrange that contains all thenodes reachable fromthose in

[

i

,

j

]

in one step by followingedges labeled with some charactera.

Byourchoice of

[

i

,

j

]

,both iand j arereachablefromnodesin

[

i

,

j

]

inonestepby followingedgeslabeled a.Byour

1 OnmanyoccasionsMikeBurrowsstatedthat,asreportedalsoin[9],theoriginalideaoftheBWTisduetoDavidWheeler.Wethereforedecidedto namethisgraphclassafterthispioneerofcomputerscience.

(3)

definitionofaWheelergraph,nodeswithin-degree 0precedethose withpositive in-degree,suchasi,so everynode in

[

i

,

j

]

hasatleastoneincomingedge.

Assume some node v strictly betweeni and j hasan incomingedge labeled a

=

a.Since i

<

v we havea

a,by ourdefinitionofaWheelergraphandmodustollens;similarly,since v

<

jwe havea

a,thusobtaining acontradiction. Itfollowsthattheedgesarrivingatnodesin

[

i

,

j

]

arealllabeled a.Furthermore,sincethelabelsareequal,bythesecond implicationin(1)andmodustollens wegetthatanyedgewithdestinationstrictlybetweeniand jmustoriginatein

[

i

,

j

]

. Itfollowsthatthenodesreachableinonestepfromthosein

[

i

,

j

]

byfollowingedges labeled a aretheonesin

[

i

,

j

]

, whichisaconsecutiverange.Foranystring

α

,therefore,thenodesreachablein

|

α

|

stepsfromthosein

[

i

,

j

]

byfollowing edgeswhoselabelsform

α

,themselvesformaconsecutiverange,byinductiononthelengthof

α

.

2

Inthelatersectionsofthispaper,weexploretheimplicationsofpathcoherence.Intheremainderofthissectionwefirst showit ispossibletoobtainafastandcompactrepresentationofa Wheelergraph,then sketchwhyWheelergraphscan achievecompression.Wepointoutthat,evenwithoutexplicitlydefiningWheelergraphs,manyresearchershaveimplicitly consideredthemwhilestudyinghowtoimplementefficientlyBWTvariantsandhowtoboundtheirspaceusage,andgiven resultsanalogoustotheoneswegivenow.

Aplain,edge-by-edgerepresentationofalabeled graphwithn nodesande edgesuses

(

e

(

log n

+

log

|

A

|))

bits.Given aWheelergraphG,letx1

<

x2

<

· · · <

xn denotetheorderedsetofnodes.Fori

=

1

,

. . . ,

n let



i andki denoterespectively

theout-degreeandin-degreeofnodexi.Definethebinaryarraysoflengthe

+

n

O

=

011 021

· · ·

0n1

,

I

=

0k11 0k21

· · ·

0kn1

.

(2)

Notethat O (resp. I)consistsoftheconcatenatedunaryrepresentationsoftheout-degrees(resp. in-degrees).LetLi denote

themultisetoflabelsontheedges exitingfromxi arrangedinan arbitraryorder,andlet L

[

1

..

e

]

denotetheconcatenation L

=

L1L2

· · ·

Ln.Byconstruction,

|

Li

| = 

i andthereisa naturalone-to-one correspondencebetweenthe 0’sin O andthe

charactersinL.Forexample,forthegraphofFig. 1itis

O

=

0001 001 01 1 001 001 001 01

L

=

aab ac b ac bc bc a

whiletheindegreearrayisI

=

101001001001001001001.Finally,letC

[

1

..

|

A

|]

denotethearraysuchthatC

[

c

]

isthenumber ofedgeswithlabelsmallerthanc.Forsimplicity,assumeevery distinctcharacterlabelssomeedge;otherwise,westorea bitvector of

|

A

|

bitsmarking thecharactersthatdo labeledges andworkwiththat subset.Withthisassumption, C

[

c

] <

C

[

c

+

1

]

andxjhasallincomingedgeslabeledc ifandonlyifC

[

c

]

<



ijki

C

[

c

+

1

]

.

Given an array Z we use the standard notation rankc

(

Z

,

i

)

to denote the numberof occurrences of c in Z

[

1

,

i

]

, and

selectc

(

Z

,

j

)

todenotethepositionofthe j-thc in Z .Forsimplicityweassumerankc

(

Z

,

0

) =

0.Thefollowingtwoproperties

arestraightforwardtoprove;indeed,similarpropertieshavebeenestablishedinmanypapersdealingwithBWT-relateddata structures.

1. Theout-degree



iofnode xiisselect1

(

O

,

i

)

select1

(

O

,

i

1

)

.Thelabelsintheedgesleavingxi areL

[

wi

− 

i

+

1

,

wi

]

wherewi

=

select1

(

O

,

i

)

i.

2. If node xi hasone ormore outgoingedges labeled c, then thelargest index j such that

(

xi

,

xj

)

has labelc can be

computedasfollows.Letwi bedefinedasaboveandlethi

=

rankc

(

L

,

wi

)

.Ifweorderedgesbylabelwithtiesbroken

by origin,

(

xi

,

xj

)

is the hi-thedge labeled c. Since there are C

[

c

]

edges with a label smaller than c, we have j

=

1

+

rank1

(

I

,

select0

(

I

,

hi

+

C

[

c

]))

.Tofindthesmallerindices j suchthat

(

xi

,

xj

)

haslabelc,wedecrementhiandrepeat

thisprocedure,untilhi

=

rankc

(

L

,

wi

− 

i

)

+

1.

Withasymmetric reasoning,given xj andc wecanestablishwhetherthereexists anedge labeledc enteringin xj,and,

ifthisisthecase,all indicesi suchthat

(

xi

,

xj

)

haslabel c.Using standarddatastructurestorepresentcompactly arrays

supportingrankandselectqueries[43],wecanestablishthefollowingresult.

Lemma4.Itispossibletorepresentann-node,e-edgeWheelergraphwithlabelsoverthealphabet A in2

(

e

+

n

) +

elog

|

A

| +

|

A

|

log e

+

o

(

n

+

elog

|

A

|)

bits.Therepresentationsupportstheforwardandbackwardtraversingoftheedgesin

O (

log

|

A

|)

time.

2

We canoftenreduce even furtherthe spaceusedto representWheelergraphs. Forexample,suppose G is aWheeler graph;VkisthesetofnodesofG reachableinexactlyk stepsfromsomeothernodes;S isasetofstringsoflengthk such

thatevery nodeinVk isreachableink stepsbyfollowingedgeswhoselabelswhenconcatenatedformastringinS;and f is afunctionassigningto eachnode v

Vk astrings

S such that v isreachableink stepsbyfollowingedgeswhose

labelswhenconcatenatedforms.ConsiderthelistLkoflabelsofedgesoriginatinginVkinorderbyoriginwithtiesbroken

bylabel.

ByLemma 3,fors

S allthenodesv

Vksuchthat f

(

v

)

=

s formaconsecutiveinterval.Therefore,wecanpartition Lk

into

|

S

|

consecutiveintervalssuchthateachintervalLsistheconcatenationofthelabelsofedgesleavingnodesv suchthat f

(

v

) =

s; Lsisemptyfors

/

S.Itfollowsthatifk

≤ (

1



)

lgσ

|

Vk

|

,where



>

0 isaconstantand

σ

= |

A

|

isthesizeofthe

(4)

70 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

Fig. 2. Deterministic (left) and non-deterministic (right) finite automata accepting all substrings of ABRACADABRA. All states are accepting.

alphabetoflabels,thenwecanstoreLk in



sAk

|

Ls

|

H0

(

Ls

)

+

o

(

|

Vk

|)

bits,whereH0

(

Ls

)

isthe0th-orderempiricalentropy

of Ls.Informally,thismeansthat ifknowinghowwe cangettoa nodeink steps tellsusalotaboutthelabelsthatare

likelytobeontheedges leavingthatnode,thenwe cancompress Lkwell. Insome waysthisgeneralizesthewell-known

analyses[14,39]ofthecompressionachievableusingtheBurrows–WheelerTransformonasinglestring. 3. NFAs,WheelergraphsandFM-indexes

Suppose we haveneverheardthewords “SuffixArray” or“Burrows–Wheeler Transform”butwe wantto builda data structure supporting efficientqueries forthe substringsof,say, s

=

ABRACADABRA. Asimplesolution wouldbe tobuild the DeterministicFinite Automaton (DFA)acceptingall substringsof s,see Fig. 2(left). Althoughthere aretechniques to compactly representDFAs, such avenue willlikely lead toa data structure muchlarger than the

O (

n log

|

A

|)

bitsof the originaltext.

Amorespaceeconomicalalternative istobuildtheNondeterministicFinite Automaton(NFA)forthesamesetof sub-strings: Ithasalinearstructure,seeFig. 2(right),butbecauseofnondeterminism,processingofpatternsappearstobe a difficult task.It turnsout thatthisdifficulty isonlyamatter ofperspective:traversing theDFAbecomes muchsimplerif werecognizethatitisaWheelergraphindisguise.

The NFA of Fig. 2 is already a labeled directed graph. To make it a Wheeler graph we only need to eliminate the



-transitions(which violatetheproperty thatall theincomingedges toanode havethesamelabel), makeallthe states initial(sothelanguageisnotchanged),anddefineanorderingofthenodessuchthatthemonotonicityproperties(1)hold. Tothisend,weassociatetoeachnodeoftheNFAtheprefixofs thattakesustherewithoutan



-transitionandorderthe NFAnodesaccordingtotheright-to-leftlexicographic rankofsuchstrings,seeFig. 3(leftandcenter).Withthisdefinition the NFAwithout



-transitionsis aWheelergraph.Oncethenodeordering isestablishedwe candiscardtheprefixesand identifyeachnodewithitsrankintheordering,seeFig. 3(right).

IntheWheelergraphderivedfromtheNFAofFig. 3everynodehasout-degree1exceptfromnode 5,correspondingto thelongestprefix,thathasout-degreezero.Hence,intherepresentationoftheWheelergraphdescribedinSect.2,wecan getridofthebitarray O andinsteadinsertinposition5ofL asymbolnotoccurringelsewhereins,say$.Sinceallnodes havein-degree1exceptnode0(ignoringthesourcelessedgesindicatingthatallnodesareinitial)thereisnoneedtostore thebitarrayI either.Summingup,wecanrepresentandnavigatetheWheelergraphusingthestringL

=

ABDBC$RRAAAA, enriched withdatastructuresforrank/select operations,andthe arrayC

[

1

..|

A

|]

.Sincethestring L coincideswiththelast columnoftheBWTmatrixofsR,thestring s reversed,andC is awell-known representationofthefirst columnofsuch

matrix,wehaveestablishedthefollowingresult.

Lemma5.Givenastrings,letNFA

(

s

)

denotethecorrespondingNFAdescribedabove.TheFM-indexofsRisacompactrepresentation ofNFA

(

s

)

.TheLast-to-FirstandFirst-to-LastmapsoftheFM-indexcoincideswiththenavigationoperationsinNFA

(

s

)

.

2

(5)

Fig. 3. FromtheNFA(left)totheWheelergraph(center)totheFM-index(right).Allstatesareinitialandacceptingandnumberedontherighttoshow thatthediagramisaWheelergraph.

TheattentivereadermayaskwhytheWheelergraphrepresentstheFM-indexofsR whilehistoricallytheFM-indexwas definedfors.ThereasonisthattheFM-indexwas derivedfromtheBWTofs.The consequenceisthat thesearchofthe patternmustbe done right-to-left:the infamous backward-search procedure[20].Here, we havedefinedtheNFA sothat thesearchisdoneleft-to-rightandthroughtheWheelergraphwehavethereforeobtainedtheFM-indexofsR.

Lemma 5 provides anew perspective onFM-indexes that maybe interesting in its ownright and, moreimportantly, suggestsawaywecan systematicallygeneralize theideas behind FM-indexestoindexingother kindsofdatathansingle strings.Usingtheconceptofpathcoherence,wenowproveamorepowerfulversionofthislemma:

Theorem6.Considerafinite-stateautomatonoveranalphabetA without



-transitionsandwitheitheroneinitialstateorallstates initial.IfitsstatediagramisaWheelergraphwithn nodesande edgesthenwecanstoreitin2

(

e

+

n

)

+

elog

|

A

|

+ |

A

|

log e

+

o

(

n

+

elog

|

A

|)

bitssuchthat,foranystring

α

,in

O (|

α

|

log

|

A

|)

timewecancomputethesetofstatesreachablefromtheinitialstateson

α

. Proof. Supposethestate diagramisa Wheelergraphbyan ordering

π

onits nodesso,byLemma 3,itispath coherent by

π

.Sinceeitheronestateisinitialorallare,theinitialstatesformaconsecutiverangein

π

.ByDefinition 2,thenodes reachable fromtheinitial statesoneach prefix of

α

formaconsecutive interval.We canuse Lemma 4to mapfromthe intervalforeachprefixtotheintervalforthenextprefixin

O (

log

|

A

|)

time.

2

Theorem 6meansthat ifwehaveaproblemwecansolvewithafinite-stateautomaton,theneveniftheautomatonis non-deterministic wemaystill be abletoimplementitwithout theusual blowupsinspaceorquerytime. Intherestof thispaperweshowthatmanydatastructuresbasedonvariantsoftheBWTfitintothisframework.

WenoteasanasidethatFM-indexesusuallyincludeasampledsuffixarraywhichpermitsustolocateeachoccurrence ofapatternintheindexedstring.Extendingthisideatoautomataseems straightforward,bysamplingthenodes,butwe donotexplorethatinthispaper.

4. OtherWheelergraphsindisguise 4.1. Multi-stringBWTandpermutermindex

The first naturalgeneralization ofthe BWTis to extendit to acollection ofstrings. Thiswas done forthe first time in[36–38]wheretheauthorsdescribedareversiblemulti-stringtransformationinspiredbytheBWT,andshowedits effec-tivenessfordatacompressionandformeasuringstringsimilarity.

Inthecontextofpatternmatching,givenacollectionofstringss1

,

. . . ,

sd,inadditiontosubstringqueriesitisconvenient

to offer alsothe possibility ofprefix/suffix queries. In a prefix/suffix query,given two substring

α,

β

we want tofind all strings si such that si isprefixed by

α

andsuffixedby

β

.As first observed in [21,22] withthe Compressed Permuterm

Index,such queries can be solved by addinga specialsymbol $ to each stringandsearching forthe pattern

β

$

α

inside a circular versionof each si$. Fig. 4(left) showstheNFA supporting circularqueries forthestrings AT$, HOT$, HAT$. To

each nodewe naturally associatea circularshiftofoneofthe strings; ifwe sortthenodesaccordingto theright-to-left lexicographicorderofsuchshiftstheresultinggraphisaWheelergraph,seeFig. 4(right).

Since each node in the Wheeler graph has in-degree and out-degree 1, we can represent it with the approach of

Lemma 4, usingonly the arrays L and C . Note that the array L, L

=

AHHTTAOT$$$ in the example of Fig. 4, coincides withthemulti-stringBWTdefinedintermsofthe cyclicshiftsin[38].Instead,theCompressedPermutermIndexisbuilt computingthesingle-string BWToftheconcatenation s1$s2

· · ·

$sd thatdoesnot supportnaturally thesearch forcircular

(6)

72 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

Fig. 4. Left:Afiniteautomatonacceptingallcircularsubstringsof AT$, HOT$, HAT$;allstatesareinitialandaccepting.Right:Sortingthestatesaccording totheassociatedcircularshiftsgivesusaWheelergraph.

Fig. 5. Afiniteautomatonacceptingallsubstringsof AAC, ABA, ACAA, BA and BC.Allstatesareinitialandacceptingandnumberedtoshowthatthe diagramisaWheelergraph.

applying theso-called jump2end function.Morerecentalgorithms usingcircularpatternsearch allmake useofthecyclic multi-stringBWT[2,7,8,30].

4.2. XBWTandtrierepresentation

Ifwe are onlyinterested inthe substringsofacollection s1

,

. . . ,

sd, andnotincircular matches,asmaller automaton

thantheoneconsideredintheprevioussectionistheonederivedfromthetrie datastructure,seeFig. 5.Followingthelead from[16–18],weassociatetoeachnodethestringformedbythelabelsinthenode-to-rootupwardpath,andweorderall nodesaccordingtothelexicographicrankofsuchstrings.

The resultinggraphisaWheeler graphwiththe samenumberofnodesastheoriginal trie.Ina trieall nodesexcept theroothavein-degree1sointherepresentationofLemma 4we donotneedtostore thein-degreebinary array.Hence therepresentationconsistsoftheout-degreebinaryarray O ,thelabelsarray L andthecountarrayC .ForthetrieinFig. 5

wehave

O

=

001 0001 01 1 1 1 01 001 01 01 1

L

=

AB ABC C A AC A A

This representation isessentially equivalent to the XBWT(eXtended BWT) introduced in [16] to representlabeled trees, withthearraysO and L correspondingrespectivelyto

S

last and

S

α in[16].Notethatwecanrestrictthesearchtoprefixes

bysettingtherootastheonlystartstate,andwecanrestrictthesearchtosuffixesbysettingasacceptingstatesonlythose correspondingtooneoftheinputstringswithoutaffectingourrepresentation.

4.3. deBruijngraphs

Bowe,Onodera,SadakaneandShibuya[6](seealso,e.g.,[5,10,33]) extendedtheXBWTfromtreestodeBruijn graphs, whicharewidelyusedinbioinformaticsfordenovoassembly,readcorrection,identifyinggeneticvariationsinapopulation,

(7)

Fig. 6. Afiniteautomatonacceptingallprefixesoflengthatleast3ofstringsinwhichtheonlytriplesare AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA andinwhich BAAA and CABA donotoccur.State1isinitial,allstatesatdistance3fromtheinitialstateareaccepting.Statesarenumberedtoshowthat thediagramisaWheelergraph.

C C T C – A – A A C C C – C T C A A A C C C C T C C A – A A C A C C T C C A A A C A C C T T – A T A A C – C C T T A T A – A C C C T – – – – A A C C C – – – C T A A C C

Fig. 7. Left:Analignmentofstrings CCTCAAACC, CCTCCAAACA, CCTTATAAC,and CCTAACC.Thealignmenthasbeenseparatedintocommonand non-commonregions.Right:ThealignmenttransformedforFMA.Thesuffixesmovedtothenon-commonregionsare CT forthefirstcommonregionand AC forthesecondone.

andotherapplications.Akth-orderde Bruijngraphforastringorsetofstrings containsa nodeforeachdistinct k-tuple

thatoccursinthosestrings,andan edge

(

u

,

v

)

ifthereisa

(

k

+

1

)

-tupleinthestringsthatstarts withu andendswithv

(whichimpliesthatv canbeobtainedfromu bydeletingu’sfirstcharacterandappendingacharacter).

TogetherwiththeGCSA,describedinSection4.5,Boweetal.’srepresentationisreallytheprototypicalBWT-baseddata structurethatrequiresustothinkintermsofgraphsinsteadofstrings.Themaindifferencebetweenrepresentingalabeled treeandrepresentingadeBruijngraphisthat,inagraph,nodescanhavein-degreemorethan 1,butthiscanbehandled usingthein-degreearray I asdescribedinLemma 4.

We canalso rederive Bowe etal.’srepresentationfrom Theorem 6: we build a finite-stateautomaton that acceptsall prefixesoflengthatleastk ofstringscontaining onlyk-tuples and

(

k

+

1

)

-tuplesfroma givenlist,by buildingatrie for thek-tuplesandthenaddingedgesconnectingtheleavesappropriately.Fig. 6showsanexampleforthetriples AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA, withtheedges for BAAA and CABA missing:thisautomaton acceptsall prefixesof lengthatleast3ofstringsinwhichtheonlytriplesare AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA andinwhich BAAA and CABA donotoccur.Bynumberingeachnodeaccordingtothelexicographicrankofthestringlabeling thenode-to-root path,thestatediagramisimmediatelyseentobeaWheelergraph.

4.4. FM-indexofalignment

LetG beaWheelergraphundernodeordering

π

.Ifallincomingedgestonodesinrange

[

i

,

j

]

havethesamelabel,we canmergetherangeintoasinglenode v,andtheresultinggraphG willstillbeaWheelergraph.IfgraphG hasanedge withlabela fromnodeu toanodeinrange

[

i

,

j

]

,graphGwillhaveanedge

(

u

,

v

)

withlabela.Similarly,ifgraphG has

anedgewithlabelafromanodeinrange

[

i

,

j

]

tonode w,graphG willhaveanedge

(

v

,

w

)

withlabel a.

Ifwehaveamulti-stringBWT,wecantransformitsWheelergraphintoamorecompactrepresentationofthestringsby mergingrangesofnodes.Thiscompactrepresentationmaycontainfalsepositives:pathlabelsthatcombinesubstringsfrom differentoriginal strings. The FM-indexofalignment (FMA) [40,41] hasan efficientprocedure fordetecting false positives, basedoncarefullychoosingthenodestomerge.

We start with an alignment of the strings, and separate the alignment into common regions Xi shared between all

strings,andnon-common regions Yi wheresome stringsaredifferent. Weassume thateach commonregionhasaproper

suffix Zi that doesnot occur anywhere else in the strings. Ifno such suffix exists, we merge the common region into

the neighboringnon-common regions. We transformthe alignment intwo steps. First we move the suffixes Zi intothe

followingnon-commonregionsYi.Thenwejustifyeachnon-commonregiontotheright,anduse XiandYitodenotethe

commonandnon-commonregionsinthetransformedalignment.SeeFig. 7foranexample.

The transformed alignment guides us in merging the Wheeler graph of the multi-string BWT into the FM-index of alignment. As we are buildingan FM-index for the original strings, we need to build a Wheeler graph for the reverse strings.Wemergenodescorrespondingtoalignedpositions,ifthenodesa) havein-degree 0;b) areina commonregion; orc) correspondtothesamesuffixofthenon-commonregion.SeeFig. 8foranexample.

(8)

74 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

Fig. 8. Top:Wheelergraphofthemulti-stringBWTofthetransformedalignmentin

Fig. 7

.Nodeorderingisbasedonthelabelsinthenodes.Graynodes correspondtothecommonregions.Bottom:WheelergraphoftheFMAforthesamealignment.Thecolorsofeachedgemarktheoriginalstringsthat crosstheedge.Thelabelsofnon-mergednodesarethesameasinthetopgraphforconvenience.(Forinterpretationofthecolorsinthisfigure,thereader isreferredtothewebversionofthisarticle.)

ByLemma 3,theWheelergraphofamulti-stringBWTispathcoherent.ToseethatthegraphremainsaWheelergraph aftermerging,wenotethat:

1. Sourcenodeswithin-degree0formaconsecutiverangebyDefinition 1.

2. AcommonregionXiisenteredeitherfromthesourcenodesorfromthefollowingnon-commonregionYi.Inthelatter case, thesetofnodesreachablewithstring ZR

i isthesetofnodeswithoutgoingedgestothecommonregion Xi.By Lemma 3,thesenodesformaconsecutiverange.Asthecommonregionisenteredfromaconsecutiverangeofnodes, we see by iterating Definition 2 that nodes corresponding to aligned positions in the region also form consecutive ranges.

3. Bytheprevious twocases,thenon-commonregionisenteredfromaconsecutiverangeofnodes.ByDefinition 2,we seethatthenodescorrespondingtoasuffixoftheregionformconsecutiveranges.

False positivesare paths that donot correspond to anyof theoriginal strings. Theycan occur when the pathcomes from anon-common region Yi+1,enters a commonregion Xi+1, andexitsinto anothernon-common region Yi. Assume that foreachnode v wehavestoredthesetofstrings Sv passingthrough thenode.Tocheckforfalsepositives, wetake

theintersectionofsets Sv overallnodesv onthepath.Iftheintersectionisnon-empty,itcontainsthestringscompatible

withthepath.If

(

u

,

v

)

isthe onlyoutgoingedgefromnode u,we have Su

Sv.Similarly, if

(

u

,

v

)

istheonlyincoming

edgetonode v,wehaveSv

Su.Henceitisenoughtotaketheintersectiona)atthefinalnode;andb)atnodesu,where

thepathcrossesanedge

(

u

,

v

)

toanode v withmultipleincomingedges.ItisalsoenoughtostorethesetSu explicitlyif

a)nodeu hasmultipleornooutgoingedges;orb)Su

=

Sv fortheonlyoutgoingedge

(

u

,

v

)

.

Let

α

beastring,letV α bethesetofnodesreachablebyfollowingpathswithlabel

α

,andlet Sα bethesetofstrings compatiblewiththepaths.BecausethegraphisaDFA,thesetofreachablenodesismonotonicallydecreasing:

|

V αa

| ≤ |

V α

|

and

|

Vaα

| ≤ |

V α

|

foranycharactera

A.If Zi isamoved substring,





VZR ia



 ≤

1 foranycharacter a

A,asthepaths can onlyendattherightmostnodeofthecommonregion Xi.Ifset VZR

ia isnon-empty,set SZiRa containsallstrings,asa Zi is

asubstringofallstrings.

When we search in FMA, we move from V α to V αa. We update the set of compatible strings S lazily. Initially the

set contains all strings. If

|

V α

| >

1, there cannot be false positives. Either none of the paths enters a common region from anon-common region,orall such paths start within Zi andenter Xi through every incomingedge. Ineithercase,

=



vVαSv. If

|

V αa

| =

1 and the node v

V αa has multiple incoming edges, there is a risk of false positives. We thereforeupdateS

S



vV

αSv.WhenthesearchfinishesatV α,thesetofcompatiblestringsisS



vVαSv. 4.5. GCSA

While notall NFAs haveanordering ofnodesthatmakes themWheeler graphs,wecan useWheelergraphs toindex pathlabelsoflengthuptok inarbitrarygraphs[49].ThisistheideabehindGeneralizedCompressedSuffixArrays(GCSAs). WeemphasizethatWheelergraphscanbelargerthanthegraphstheyareusedtoindex.

(9)

Fig. 9. Top:ADFA.Second:A3rd-orderde BruijngraphforpathlabelsintheDFA.NodelabelsindicatenodeorderingintheWheelergraph.Wealsoshow thek-tuplescorrespondingtoeachnode(inreversetomatchthesortingorder)andthemapping f tothenodesoftheDFA.Third:Aprunedde Bruijn graphfortheDFA.Nodelabelsarecompatiblewiththesecondgraph.Graycolorindicatesnodeswherethemappingmustbestoredexplicitly.Bottom: A pathgraphbasedona4th-orderde Bruijngraph.

Definition7.LetG beanNFA,letGbeaWheelergraph,let

α

beastring,andletV α andVα be thesetsofnodesreachable with

α

inG andG,respectively.GraphG isakth-orderpathgraph ofG forak

>

0,ifthereisafunction f fromnodesof

GtosetsofnodesofG suchthat V α

=



vVα f

(

v

)

forall

|

α

| ≤

k.

We can usea kth-order de Bruijngraph ofpath labels ingraph G as a kth-order path graphof G. SeeFig. 9 foran example. Function f has the same role as the suffix array.As withthe sets of strings in FMA, we must store set f

(

u

)

explicitly,if a)node u hasmultipleorno outgoingedges;orb) we cannot derive f

(

u

)

from f

(

v

)

fortheonly outgoing edge

(

u

,

v

)

.WeoftennumberthenodesoftheNFAinawaysuchthat f

(

u

) = {

x

1

|

x

f

(

v

)}

,if

(

u

,

v

)

istheonlyoutgoing edgefromu andtheonlyincomingedgetov.

Withlargevaluesofk,de Bruijngraphsoftenhavemanyredundantnodeswhenweusethemaspathgraphs.If

|

α

| ≤

k,

thenodes Vα reachable with

α

inthede Bruijn grapharetheoneswith

α

asasuffix ofthecorresponding k-tuples.By

Lemma 3,thenodesformaconsecutiverange.If f

(

v

) =

f

(

v

)

forallv

,

v

Vα , wecanmergethenodeswithoutaffecting reachability.GCSA2[49]usessuchprunedde Bruijngraphs tosavespacewhenindexingpathlabelsinanNFA.SeeFig. 9for anexample.

The originalGCSA [50]considers a differentscenario.Instead ofindexingpaths oflength up tok inan arbitraryNFA, it indexespaths ofarbitrarylength in an acyclicDFA.Conceptually, GCSAconstruction searchesfor ade Bruijn graph G

thatisequivalent totheinputgraphG asaDFA,andthenusesthatde Bruijngraphasaninfinite-order pathgraphof G. Consecutive ranges of nodesare merged if f

(

v

) =

f

(

v

)

for all nodes v and v in the range,even ifthe ranges do not correspondtosharedsuffixesofthek-tuples.SeeFig. 9(bottom)foranexample.

4.6. PBWT,wavelettreesandwaveletmatrices

Another remarkablevariant of the BWT is Durbin’spositional BWT [12] (PBWT), which he introduced for haplotype matchingbuthasalsobeenusedforreconstructingancestralrecombinationgraphs[48].Givend stringsofequallengththe PBWTsupportsthematchingofsubstringsstartingataspecifiedpositioninthestrings.

(10)

76 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

Fig. 10. A standard representation of a wavelet tree and its representation as a Wheeler graph.

Fig. 11. A standard representation of a wavelet matrix and its representation as a Wheeler graph with edge labels over a larger alphabet.

WecanviewthePBWTasaWheelergraphwiththesamestructureasthatofthemulti-stringBWT.Ifthemulti-string BWThasanedge

(

u

,

v

)

withlabela fromtextpositioni topositioni

+

1,PBWTuses

(

i

+

1

,

a

)

asthelabel.Becausealledges frompositioni gotopositioni

+

1,thepositionalcomponentisalwaysi

+

1 foralloutgoingedgesinrange

[(

i

1

)

d

+

1

,

id

]

. Hence, we caninferthepositionfromtherangeandstore justthecharactera inthesuccinctrepresentation(Lemma 4). A generalization [44] ofPBWTrelaxes therequirements.Instead ofstoring stringsof equallength,we store thelabels of paths inagraph.Ifedge

(

u

,

v

)

intheWheelergraphcorrespondstoedge

(

u

,

v

)

withlabela intheunderlyinggraph,we use

(

v

,

a

)

asthelabeloftheedge.Aswecannolongerinferthepositionalcomponent,wehavetostoreitexplicitly.

Although introduced forcompletely differentapplications, the PBWT bearsa striking resemblanceto a datastructure calleda waveletmatrix,which Claude,Navarro andOrdóñez [11]introduced asa version ofthewavelettree [28] better suited forlarge alphabets.This suggeststhateven wavelettrees andmatricescan beviewed asWheelergraphs. Forthe sake ofbrevity, weassume thereaderis familiarwiththesedatastructureswith“myriadvirtues” [13,24],andnote only that,whilebuildingawavelettreeisanalogoustoamost-significant-bit-firstradixsort,buildingeitheraPBWTorawavelet matrixisanalogoustoaleast-significant-bit-firstradixsort.

ConsiderthewavelettreeshownontheleftinFig. 10.Sinceitisatree,wecanuseourconstructionfromSubsection4.2

to build an FSA (withone initial state)that acceptsall thebinary strings that are root-to-leaf paths inthe wavelettree. This doesnot store all theinformation inthe wavelet tree,however, sowe turnthe graphfrom an FSA intoa directed multi-graph, asshownonthe rightinFig. 10,whichisstill aWheelergraph.Thisgraphhassomeinteresting properties: e.g.,theorderoftheoutgoingedgesfromeachnodeencodethebitvectorstoredatthatnodeinthewavelettree;asinthe wavelettree,foreachinternalnode v excepttheroot, v’sin-degreeandout-degreeareequalanditsin-degreeisequalto thesumofthein-degreesofitsleafdescendants.Itisbeyondthescopeofthispapertoinvestigatethisrepresentationof wavelettreesasadatastructure,butitseems likeaninterestingdirectionforfutureresearch thatgoesbeyondthereach ofTheorem 6.

Now considerthewaveletmatrixshowontheleft inFig. 11.WecanturnitintoaWheelergraphif,asinthewavelet tree,wetransformitintoadirectedmulti-graphandthen,asinthePBWT,werenamethelabelc oneachedgewiththepair

(

i

,

c

)

,wherei isthelevelofthestartingnode,asshownontherightinFig. 11.Again,theWheelergraphclearlyencodes thewaveletmatrix,butweleaveasfutureworkinvestigatingthepossiblebenefitsofthisalternativerepresentation.

(11)

5. Conclusionsandfuturework

We have defined Wheeler graphs to try to capture the ideas behind many of the variants of the BWT,and given a frameworkfordevelopingnewvariantsby solvingproblemswithfinite-state automatawhosestatediagramsareWheeler graphs.Wedidnotpursuethetopichere,butWheelergraphsseemabletocapturepropertiesalsoofstringtransformations notrelatedtopatternmatching:byreversingtheinequalityu

<

uin(1)wegetaslightlydifferentnotionofWheelergraph thatcanbeusedtosuccinctlyrepresentthevariantoftheBWTdefinedintermsofthealternatinglexicographicorder[26]

(seealso[27]foranancestorofthemulti-string BWT).

There has been so much work involving the BWT, however, that it would be surprising if one idea could subsume it all, and indeed some BWT-related results, such as the positional BWT, seemingly cannot be reasonably modeled by finite-state automata, evenif they canstill be viewed asWheeler graphs; other BWT-related results,such asGanguly et al.’sparameterizedBWT[25] andtheindexfororder-preservingmatchingin[23],seemnoteventobebasedonWheeler graphs atall.We havenot yeteven consideredbidirectional FM-indexes[3,31,32] andbidirectionalBWT-basedde Bruijn graphs[4].

Apartfromapplying andextending ourframework, we hope to develop algorithms to recognizeWheeler graphs effi-ciently,andtocharacterizeclassesoffinite-statediagramsthatareWheelergraphsorcanbeexpandedslightlytobecome Wheelergraphswithoutchangingthelanguageacceptedbytheautomata(althoughcharacterizingallsuchstate diagrams mightbedifficult).Inthisregardweobservethatnotallregularlanguageshavefinite-stateautomatawhosestatediagrams areWheelergraphs:e.g.,inanyfinite-stateautomatonforthelanguage

(

axb

)

|(

cxd

)

,theremustbedisjointpathsforaxkb

andcxkd andbothendsofalledgeswithlabelx mustbeinthesameorder,sotheremustbeseparate nodesforxib and xid forallvaluesof i.

Finally,wehope ournewperspective onBWTvariantsmakes themmoreaccessibletocomputer scientistsfromareas outsidestringalgorithmsanddatastructures.

References

[1]D.Adjeroh,T.Bell,A.Mukherjee,TheBurrows–WheelerTransform:DataCompression,SuffixArrays,andPatternMatching,SpringerScience&Business Media,2008.

[2]M.J.Bauer,A.J.Cox,G.Rosone,LightweightalgorithmsforconstructingandinvertingtheBWTofstringcollections,Theoret.Comput.Sci.483(2013) 134–148.

[3]D.Belazzougui,F.Cunial,J.Kärkkäinen,V.Mäkinen,VersatilesuccinctrepresentationsofthebidirectionalBurrows–Wheelertransform,in:European SymposiumonAlgorithms,Springer,2013,pp. 133–144.

[4]D.Belazzougui,T.Gagie,V.Mäkinen,M.Previtali,S.J.Puglisi,Bidirectionalvariable-orderdeBruijngraphs,in:LatinAmericanSymposiumonTheoretical Informatics,Springer,2016,pp. 164–178.

[5]T.Beller,E.Ohlebusch,EfficientconstructionofacompresseddeBruijngraph forpan-genomeanalysis,in:AnnualSymposiumonCombinatorial PatternMatching,Springer,2015,pp. 40–51.

[6]A.Bowe,T.Onodera,K.Sadakane,T.Shibuya,SuccinctdeBruijngraphs,in:Int.WorkshoponAlgorithmsinBioinformatics,Springer,2012,pp. 225–235. [7]N.R. Brisaboa, A.Cerdeira-Pena, A.Fariña, G. Navarro, Acompact RDFstoreusing suffix arrays,in: SPIRE, in:LNCS, vol. 9309, Springer,2015,

pp. 103–115.

[8]N.R.Brisaboa,A.Fariña,D.Galaktionov,M.A.Rodríguez,Compacttriprepresentationovernetworks,in:SPIRE,in:LNCS,vol. 9954,2016,pp. 240–253. [9]M.Burrows,D.Wheeler,ABlock-SortingLosslessDataCompressionAlgorithm,Tech.Rep.124,DigitalEquipmentCorporation,1994.

[10]R.Chikhi,A.Limasset,S.Jackman,J.T.Simpson,P.Medvedev,OntherepresentationofdeBruijngraphs,J.Comput.Biol.22 (5)(2015)336–352. [11]F.Claude,G.Navarro,A.Ordónez,Thewaveletmatrix:anefficientwavelettreeforlargealphabets,Inform.Sci.47(2015)15–32.

[12]R.Durbin,EfficienthaplotypematchingandstorageusingthepositionalBurrows–Wheelertransform(PBWT),Bioinformatics30 (9)(2014)1266–1272. [13]P.Ferragina,R.Giancarlo,G.Manzini,Themyriadvirtuesofwavelettrees,Inform.andComput.207(2009)849–866.

[14]P.Ferragina,R.Giancarlo,G.Manzini,M.Sciortino,Boostingtextualcompressioninoptimallineartime,J.ACM52(2005)688–713. [15]P.Ferragina,R.González,G.Navarro,R.Venturini,Compressedtextindexes:fromtheorytopractice,ACMJ.Exp.Algorithmics13(2009).

[16]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Structuringlabeledtreesforoptimalsuccinctness,andbeyond,in:Proc.46thIEEESymposiumon FoundationsofComputerScience,FOCS’05,2005,pp. 184–193.

[17]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,CompressingandsearchingXMLdataviatwozips,in:Proc.15thInternationalWorldWideWeb Conference,WWW’06,2006,pp. 751–760.

[18]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Compressingandindexinglabeledtrees,withapplications,J.ACM57(2009).

[19]P.Ferragina,G.Manzini,Opportunisticdatastructureswithapplications,in:Proc.41stIEEESymp.onFound.ofComputerScience,2000,pp. 390–398. [20]P.Ferragina,G.Manzini,Indexingcompressedtext,J.ACM52 (4)(2005)552–581.

[21]P.Ferragina,R.Venturini,Compressedpermutermindex,in:SIGIR,ACM,2007,pp. 535–542.

[22]P.Ferragina,R.Venturini,Thecompressedpermutermindex,ACMTrans.Algorithms7 (1)(2010)10:1–10:21.

[23]T.Gagie,G.Manzini,R.Venturini,Anencodingfororder-preservingmatching,in:Proc.25thAnnualEuropeanSymposiumonAlgorithms,ESA2017, DagstuhlPublishing,Germany,2017.

[24]T.Gagie,G.Navarro,S.J.Puglisi,Newalgorithmsonwavelettreesandapplicationstoinformationretrieval,Theoret.Comput.Sci.426(2012)25–41. [25]A.Ganguly,R.Shah,S.V.Thankachan,pBWT:achievingsuccinctdatastructuresforparameterizedpatternmatchingandrelatedproblems,in:Proc.

28thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’17,SIAM,2017,pp. 397–407.

[26] I.M.Gessel,A.Restivo,C.Reutenauer,Abijectionbetweenwordsandmultisetsofnecklaces,EuropeanJ.Combin.33 (7)(2012)1537–1546,

http://

dx.doi.org/10.1016/j.ejc.2012.03.016.

[27]I.M.Gessel,C.Reutenauer,Countingpermutationswithgivencyclestructureanddescentset,J.Combin.TheorySer.A64 (2)(1993)189–215. [28]R.Grossi,A.Gupta,J.S.Vitter,High-orderentropy-compressedtextindexes,in:Proc.14thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’03,

SIAM,2003,pp. 841–850.

[29]R.Grossi, J.S.Vitter,Compressedsuffixarraysandsuffixtreeswith applicationstotextindexingand stringmatching,in:Proc.ofthe32ndACM SymposiumonTheoryofComputing,2000,pp. 397–406.

(12)

78 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78

[31]G.Kucherov,K.Salikhov,D.Tsur,Approximatestringmatchingusingabidirectionalindex,Theoret.Comput.Sci.638(2016)145–158.

[32]T.W.Lam,R.Li,A.Tam,S.Wong,E.Wu,S.-M.Yiu,Highthroughputshortreadalignmentviabi-directionalBWT,in:IEEEInternationalConferenceon BioinformaticsandBiomedicine,BIBM’09,IEEE,2009,pp. 31–36.

[33]D.Li,C.-M.Liu,R.Luo,K.Sadakane,T.-W.Lam,Megahit:anultra-fastsingle-nodesolutionforlargeandcomplexmetagenomicsassemblyviasuccinct de Bruijngraph,Bioinformatics31 (10)(2015)1674–1676.

[34]V.Mäkinen,Compactsuffixarray, in:Proc.ofthe11thSymposiumonCombinatorialPatternMatching,in:LNCS,vol. 1848,Springer-Verlag,2000, pp. 305–319.

[35]V.Mäkinen,D.Belazzougui,F.Cunial,A.I.Tomescu,Genome-ScaleAlgorithmDesign,CambridgeUniversityPress,2015.

[36]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,Anextensionoftheburrowswheelertransformandapplicationstosequencecomparisonanddata compression,in:CPM,in:LNCS,vol. 3537,Springer,2005,pp. 178–189.

[37]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,AnextensionoftheBurrows–Wheelertransform,Theoret.Comput.Sci.387 (3)(2007)298–312. [38]S.Mantaci,A.Restivo,M.Sciortino,AnextensionoftheBurrows–Wheelertransformtok words,in:DCC,IEEEComputerSociety,2005,p. 469. [39]G.Manzini,AnanalysisoftheBurrows–Wheelertransform,J.ACM48 (3)(2001)407–430.

[40]J.C.Na,H.Kim,S.Min,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignmentwithgaps,arXiv:1606.03897,2016.

[41]J.C.Na,H.Kim,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignment:acompressedindexforsimilarstrings,Theoret.Comput. Sci.638(2016)159–170.

[42]G.Navarro,CompactDataStructures:APracticalApproach,CambridgeUniversityPress,2016. [43]G.Navarro,V.Mäkinen,Compressedfull-textindexes,ACMComput.Surv.39 (1)(2007)2.

[44]A.M.Novak,E.Garrison,B.Paten,AgraphextensionofthepositionalBurrows–Wheelertransformanditsapplications,in:InternationalWorkshopon AlgorithmsinBioinformatics,Springer,2016,pp. 246–256.

[45]J.G.Saxe,ThePoemsofJohnGodfreySaxe,Houghton,MifflinandCo.,1881. [46] J.Seward,The bzip2homepage,

http://www.bzip.org

,1996.

[47] J.Seward,The valgrind homepage,

http://valgrind.org

,2000.

[48]V.Shchur,R.Durbin,TreeconsistentPBWTandtheirapplicationtoreconstructingancestralrecombinationgraphsanddemographicinference,in: PosterPresentedatthe19thAnnualInternationalConferenceonResearchinComputationalMolecularBiology(RECOMB),2015.

[49]J.Sirén,Indexingvariationgraphs,in:Proc.19thMeetingonAlgorithmEngineeringandExperiments,ALENEX’17,SIAM,2017,pp. 13–27.

[50]J.Sirén,N.Välimäki,V.Mäkinen,Indexinggraphsforpathquerieswithapplicationsingenomeresearch,IEEE/ACMTrans.Comput.Biol.Bioinform. 11 (2)(2014)375–388.

Figura

Fig. 2. Deterministic (left) and non-deterministic (right) finite automata accepting all substrings of ABRACADABRA
Fig. 3. From the NFA (left) to the Wheeler graph (center) to the FM-index (right). All states are initial and accepting and numbered on the right to show that the diagram is a Wheeler graph.
Fig. 4. Left: A finite automaton accepting all circular substrings of AT$, HOT$, HAT$; all states are initial and accepting
Fig. 6. A finite automaton accepting all prefixes of length at least 3 of strings in which the only triples are AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA and in which BAAA and CABA do not occur
+4

Riferimenti

Documenti correlati

It remains to show that the partition exists when n is multiple of 8 and n ≥ 16.. For any non-negative integer x, we have a partition of {1

Solution proposed by Roberto Tauraso, Dipartimento di Matematica, Universit`a di Roma “Tor Vergata”, via della Ricerca Scientifica, 00133 Roma,

Let S(n, k) be the Stirling number of the second kind, i.e., the number of ways to partition a set of n objects into k

Solution proposed by Roberto Tauraso, Dipartimento di Matematica, Universit`a di Roma “Tor Vergata”, via della Ricerca Scientifica, 00133 Roma,

[r]

[r]

Thus c stays inside D and since it reaches the maximal distance from the boundary, it belongs to

Doney, One-sided Local Large Deviation and Renewal Theorems in the Case of Infinite Mean, Probab. Erickson, Strong renewal theorems with infinite