Contents lists available atScienceDirect
Theoretical
Computer
Science
www.elsevier.com/locate/tcs
Wheeler
graphs:
A
framework
for
BWT-based
data
structures
✩
Travis Gagie
a,
Giovanni Manzini
b,
c,
∗
,
Jouni Sirén
daDiegoPortalesUniversityandCEBIB,Santiago,Chile bUniversityofEasternPiedmont,Alessandria,Italy cIIT-CNR,Pisa,Italy
dWellcomeTrustSangerInstitute,Cambridge,UK
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Articlehistory: Received3March2017
Receivedinrevisedform3June2017 Accepted6June2017
Availableonline27June2017 Keywords:
Compresseddatastructures Burrows–Wheelertransform Patternmatching
ThefamousBurrows–WheelerTransform(BWT)wasoriginallydefined forasinglestring butvariationshavebeendevelopedforsetsofstrings,labeled trees,deBruijngraphs,etc. Inthispaperweproposeaframeworkthatincludesmanyofthesevariationsandthatwe hopewillsimplifythesearchformore.
WefirstdefineWheelergraphs andshowtheyhaveapropertywecallpathcoherence.We showthatifthestatediagramofafinite-stateautomatonisaWheelergraphthen,byits pathcoherence,wecanorderthenodessuchthat,foranystring,thenodesreachablefrom theinitialstateorstatesbyprocessingthatstringareconsecutive.Thismeansthateven iftheautomatonisnon-deterministic,wecanstill storeitcompactlyandprocessstrings withitquickly.
We thenrederiveseveralvariationsoftheBWTby designingstraightforwardfinite-state automata forthe relevant problemsand showingthat theirstatediagramsare Wheeler graphs.
©2017TheAuthors.PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
TheBurrows–WheelerTransformation(BWT)hasaverypeculiarhistory.Firstconceivedin1983,itwas publishedonly elevenyearslaterinatechnicalreport[9],presumablybecauseitwassoinnovativethatthefirstreviewerswerenotableto graspitsfullsignificance.Afewyearslater,thecompressionalgorithm bzip2 basedontheBWTbecamepopular,challenging gzip’sdominance,thankstothefinelyengineeredimplementationofJulian Seward[46](the verysamecomputer scientist whogaveusalsotheinvaluabletool Valgrind[47]).
Afterits introductionasa compression tool,interest in theBWTwas rekindledwhen manyresearchers realizedthat, amongthe differenttechniquesdiscovered attheturn ofthecentury fordesigning compressedindexes [19,29,34],those basedon theBWTare probablythesimplestandmostspaceefficient [15,43].Afterthisrealization, inthelast tenyears wehavewitnessedan unusualphenomenonincomputerscience:variantsoftheBWThavebeenproposedandappliedto moreandmorecomplexobjects:fromtrees,tographs,to alignments.Thesevariantsareclearly relatedtotheBWTeven
✩ ThisworkwassupportedbyAcademyofFinlandgrant268324,FONDECYT grant1171058,KITE-DiSITproject,INdAM-GNCSProject“Efficientalgorithms andtechniquesfortheorganization,managementandanalysisofbiologicalBigData”,andtheWellcomeTrustgrant[098051].
*
Correspondingauthor.E-mailaddresses:travis.gagie@udp.cl(T. Gagie),
giovanni.manzini@uniupo.it
(G. Manzini),jouni.siren@iki.fi
(J. Sirén). http://dx.doi.org/10.1016/j.tcs.2017.06.0160304-3975/©2017TheAuthors.PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
68 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 1. Aneight-nodeWheelergraph.Node1hasin-degree0;edgeslabeled a entersinnodes2,3,4;edgeslabeled b innodes5,6;edgeslabeled c in nodes7,8.
if some ofthem nolonger havethe two main features ofthe original BWT,namelyofbeing invertibleandof “helping” compression.
AtthispointitisnaturaltoaskwhetherwehaveapproachedtheBWTastheblind menapproachedtheelephant(see, e.g., [45]), withonetouching aleg andthinking theelephant islikea tree,anothertouching thetrunk andthinkingit is likea snake,andyetanothertouching thetailandthinking itislikearope.Wedonotdisparageprevioussurveysofthe BWTandrelateddatastructures,such as[1,35,42],sincewe toohavespent yearstryingtomakesense ofoursometimes disparate impressionsofthe it.Withoutpretending togive a completeanswer,in thispaperwe propose aunifyingview formanydifferentBWTvariants.Somewhatsurprisinglywegetourunifyingview consideringtheNondeterministicFinite Automatarelatedtodifferentpatternmatchingproblems.Weshowthatthestategraphsassociatedtotheseautomatahave commonpropertiesthatwesummarizewiththeconceptofWheelergraphs.1
Using thenotion of a Wheeler graph,we show that it is possible to process strings efficiently, e.g., in linear time if the alphabet is constant, even ifthe automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactlyrepresentedandtraversedusinguptothreearrayswithadditionaldatastructuressupportingefficientrankand selectoperations. Itturnsout thatthesearrayscoincide with,oraresubstantiallyequivalent to,theoutputofmanyBWT variantsdescribedintheliterature.
We believeourunifyingview canhelpresearchersdevelopnewBWTvariantsandnewindexingdatastructures. How-ever, westress that not everyBWT-related datastructure fitsour framework:forexample Gangulyetal.’s parameterized BWT[25] andtheindexfororder-preservingmatchingin[23].Therefore,wehope thatourcontributionwillspurfurther researchresultinginawidervisionofthefascinatingfieldoriginatedbytheseminalworkofBurrowsandWheeler. 2. Definitionsandbasicresults
Consider adirected edge-labeledgraphG suchthateach edgeislabeledby acharacterfromantotally-ordered alpha-bet A.Weuse
≺
todenotetheorderingamongA’selements.Labelsontheedgesleavingagivennodearenotnecessarily distinct, andtherecanbe multipleedgeslinkingthesamepairofnodes(forsimplicitywe stillusethetermgraph ratherthanthemoreformallycorrectmulti-graph).
Definition1.G is a Wheelergraph if there is an ordering of the nodes such that nodes within-degree 0 precede those with positivein-degree and, foranypairofedges e
= (
u,
v)
ande= (
u,
v)
labeled a and a respectively,the following monotonicitypropertieshold:a
≺
a⇒
v<
v,
(a
=
a)
∧ (
u<
u)
⇒
v≤
v.
(1)AnexampleofaWheelergraphisshowninFig. 1.Asanimmediateconsequenceof(1),alledgesenteringagivennode musthavethesamelabel.WenowshowthatWheelergraphsalsopossessthefollowingproperty:
Definition2.G ispathcoherent ifthereisatotalorderofthenodessuchthatforanyconsecutiverange
[
i,
j]
ofnodesand stringα
,thenodesreachable fromthose in[
i,
j]
in|
α
|
steps byfollowingedges whoselabels forα
whenconcatenated, themselvesformaconsecutiverange.Lemma3.IfG isaWheelergraphbyanordering
π
onitsnodesthenitispathcoherentbyπ
.Proof. SupposeG isaWheelergraphby
π
.Consideraconsecutiverange[
i,
j]
ofnodesandlet[
i,
j]
bethesmallestrange that contains all thenodes reachable fromthose in[
i,
j]
in one step by followingedges labeled with some charactera.Byourchoice of
[
i,
j]
,both iand j arereachablefromnodesin[
i,
j]
inonestepby followingedgeslabeled a.Byour1 OnmanyoccasionsMikeBurrowsstatedthat,asreportedalsoin[9],theoriginalideaoftheBWTisduetoDavidWheeler.Wethereforedecidedto namethisgraphclassafterthispioneerofcomputerscience.
definitionofaWheelergraph,nodeswithin-degree 0precedethose withpositive in-degree,suchasi,so everynode in
[
i,
j]
hasatleastoneincomingedge.Assume some node v strictly betweeni and j hasan incomingedge labeled a
=
a.Since i<
v we havea≺
a,by ourdefinitionofaWheelergraphandmodustollens;similarly,since v<
jwe havea≺
a,thusobtaining acontradiction. Itfollowsthattheedgesarrivingatnodesin[
i,
j]
arealllabeled a.Furthermore,sincethelabelsareequal,bythesecond implicationin(1)andmodustollens wegetthatanyedgewithdestinationstrictlybetweeniand jmustoriginatein[
i,
j]
. Itfollowsthatthenodesreachableinonestepfromthosein[
i,
j]
byfollowingedges labeled a aretheonesin[
i,
j]
, whichisaconsecutiverange.Foranystringα
,therefore,thenodesreachablein|
α
|
stepsfromthosein[
i,
j]
byfollowing edgeswhoselabelsformα
,themselvesformaconsecutiverange,byinductiononthelengthofα
.2
Inthelatersectionsofthispaper,weexploretheimplicationsofpathcoherence.Intheremainderofthissectionwefirst showit ispossibletoobtainafastandcompactrepresentationofa Wheelergraph,then sketchwhyWheelergraphscan achievecompression.Wepointoutthat,evenwithoutexplicitlydefiningWheelergraphs,manyresearchershaveimplicitly consideredthemwhilestudyinghowtoimplementefficientlyBWTvariantsandhowtoboundtheirspaceusage,andgiven resultsanalogoustotheoneswegivenow.
Aplain,edge-by-edgerepresentationofalabeled graphwithn nodesande edgesuses
(
e(
log n+
log|
A|))
bits.Given aWheelergraphG,letx1<
x2<
· · · <
xn denotetheorderedsetofnodes.Fori=
1,
. . . ,
n leti andki denoterespectively
theout-degreeandin-degreeofnodexi.Definethebinaryarraysoflengthe
+
nO
=
011 021· · ·
0n1,
I=
0k11 0k21· · ·
0kn1.
(2)Notethat O (resp. I)consistsoftheconcatenatedunaryrepresentationsoftheout-degrees(resp. in-degrees).LetLi denote
themultisetoflabelsontheedges exitingfromxi arrangedinan arbitraryorder,andlet L
[
1..
e]
denotetheconcatenation L=
L1L2· · ·
Ln.Byconstruction,|
Li| =
i andthereisa naturalone-to-one correspondencebetweenthe 0’sin O andthecharactersinL.Forexample,forthegraphofFig. 1itis
O
=
0001 001 01 1 001 001 001 01L
=
aab ac b ac bc bc awhiletheindegreearrayisI
=
101001001001001001001.Finally,letC[
1..
|
A|]
denotethearraysuchthatC[
c]
isthenumber ofedgeswithlabelsmallerthanc.Forsimplicity,assumeevery distinctcharacterlabelssomeedge;otherwise,westorea bitvector of|
A|
bitsmarking thecharactersthatdo labeledges andworkwiththat subset.Withthisassumption, C[
c] <
C[
c+
1]
andxjhasallincomingedgeslabeledc ifandonlyifC[
c]
<
i≤jki≤
C[
c+
1]
.Given an array Z we use the standard notation rankc
(
Z,
i)
to denote the numberof occurrences of c in Z[
1,
i]
, andselectc
(
Z,
j)
todenotethepositionofthe j-thc in Z .Forsimplicityweassumerankc(
Z,
0) =
0.Thefollowingtwopropertiesarestraightforwardtoprove;indeed,similarpropertieshavebeenestablishedinmanypapersdealingwithBWT-relateddata structures.
1. Theout-degree
iofnode xiisselect1
(
O,
i)
−
select1(
O,
i−
1)
.Thelabelsintheedgesleavingxi areL[
wi−
i+
1,
wi]
wherewi
=
select1(
O,
i)
−
i.2. If node xi hasone ormore outgoingedges labeled c, then thelargest index j such that
(
xi,
xj)
has labelc can becomputedasfollows.Letwi bedefinedasaboveandlethi
=
rankc(
L,
wi)
.Ifweorderedgesbylabelwithtiesbrokenby origin,
(
xi,
xj)
is the hi-thedge labeled c. Since there are C[
c]
edges with a label smaller than c, we have j=
1
+
rank1(
I,
select0(
I,
hi+
C[
c]))
.Tofindthesmallerindices j suchthat(
xi,
xj)
haslabelc,wedecrementhiandrepeatthisprocedure,untilhi
=
rankc(
L,
wi−
i)
+
1.Withasymmetric reasoning,given xj andc wecanestablishwhetherthereexists anedge labeledc enteringin xj,and,
ifthisisthecase,all indicesi suchthat
(
xi,
xj)
haslabel c.Using standarddatastructurestorepresentcompactly arrayssupportingrankandselectqueries[43],wecanestablishthefollowingresult.
Lemma4.Itispossibletorepresentann-node,e-edgeWheelergraphwithlabelsoverthealphabet A in2
(
e+
n) +
elog|
A| +
|
A|
log e+
o(
n+
elog|
A|)
bits.TherepresentationsupportstheforwardandbackwardtraversingoftheedgesinO (
log|
A|)
time.2
We canoftenreduce even furtherthe spaceusedto representWheelergraphs. Forexample,suppose G is aWheeler graph;VkisthesetofnodesofG reachableinexactlyk stepsfromsomeothernodes;S isasetofstringsoflengthk such
thatevery nodeinVk isreachableink stepsbyfollowingedgeswhoselabelswhenconcatenatedformastringinS;and f is afunctionassigningto eachnode v
∈
Vk astrings∈
S such that v isreachableink stepsbyfollowingedgeswhoselabelswhenconcatenatedforms.ConsiderthelistLkoflabelsofedgesoriginatinginVkinorderbyoriginwithtiesbroken
bylabel.
ByLemma 3,fors
∈
S allthenodesv∈
Vksuchthat f(
v)
=
s formaconsecutiveinterval.Therefore,wecanpartition Lkinto
|
S|
consecutiveintervalssuchthateachintervalLsistheconcatenationofthelabelsofedgesleavingnodesv suchthat f(
v) =
s; Lsisemptyfors/
∈
S.Itfollowsthatifk≤ (
1−
)
lgσ|
Vk|
,where>
0 isaconstantandσ
= |
A|
isthesizeofthe70 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 2. Deterministic (left) and non-deterministic (right) finite automata accepting all substrings of ABRACADABRA. All states are accepting.
alphabetoflabels,thenwecanstoreLk in
s∈Ak|
Ls|
H0(
Ls)
+
o(
|
Vk|)
bits,whereH0(
Ls)
isthe0th-orderempiricalentropyof Ls.Informally,thismeansthat ifknowinghowwe cangettoa nodeink steps tellsusalotaboutthelabelsthatare
likelytobeontheedges leavingthatnode,thenwe cancompress Lkwell. Insome waysthisgeneralizesthewell-known
analyses[14,39]ofthecompressionachievableusingtheBurrows–WheelerTransformonasinglestring. 3. NFAs,WheelergraphsandFM-indexes
Suppose we haveneverheardthewords “SuffixArray” or“Burrows–Wheeler Transform”butwe wantto builda data structure supporting efficientqueries forthe substringsof,say, s
=
ABRACADABRA. Asimplesolution wouldbe tobuild the DeterministicFinite Automaton (DFA)acceptingall substringsof s,see Fig. 2(left). Althoughthere aretechniques to compactly representDFAs, such avenue willlikely lead toa data structure muchlarger than theO (
n log|
A|)
bitsof the originaltext.Amorespaceeconomicalalternative istobuildtheNondeterministicFinite Automaton(NFA)forthesamesetof sub-strings: Ithasalinearstructure,seeFig. 2(right),butbecauseofnondeterminism,processingofpatternsappearstobe a difficult task.It turnsout thatthisdifficulty isonlyamatter ofperspective:traversing theDFAbecomes muchsimplerif werecognizethatitisaWheelergraphindisguise.
The NFA of Fig. 2 is already a labeled directed graph. To make it a Wheeler graph we only need to eliminate the
-transitions(which violatetheproperty thatall theincomingedges toanode havethesamelabel), makeallthe states initial(sothelanguageisnotchanged),anddefineanorderingofthenodessuchthatthemonotonicityproperties(1)hold. Tothisend,weassociatetoeachnodeoftheNFAtheprefixofs thattakesustherewithoutan
-transitionandorderthe NFAnodesaccordingtotheright-to-leftlexicographic rankofsuchstrings,seeFig. 3(leftandcenter).Withthisdefinition the NFAwithout
-transitionsis aWheelergraph.Oncethenodeordering isestablishedwe candiscardtheprefixesand identifyeachnodewithitsrankintheordering,seeFig. 3(right).
IntheWheelergraphderivedfromtheNFAofFig. 3everynodehasout-degree1exceptfromnode 5,correspondingto thelongestprefix,thathasout-degreezero.Hence,intherepresentationoftheWheelergraphdescribedinSect.2,wecan getridofthebitarray O andinsteadinsertinposition5ofL asymbolnotoccurringelsewhereins,say$.Sinceallnodes havein-degree1exceptnode0(ignoringthesourcelessedgesindicatingthatallnodesareinitial)thereisnoneedtostore thebitarrayI either.Summingup,wecanrepresentandnavigatetheWheelergraphusingthestringL
=
ABDBC$RRAAAA, enriched withdatastructuresforrank/select operations,andthe arrayC[
1..|
A|]
.Sincethestring L coincideswiththelast columnoftheBWTmatrixofsR,thestring s reversed,andC is awell-known representationofthefirst columnofsuchmatrix,wehaveestablishedthefollowingresult.
Lemma5.Givenastrings,letNFA
(
s)
denotethecorrespondingNFAdescribedabove.TheFM-indexofsRisacompactrepresentation ofNFA(
s)
.TheLast-to-FirstandFirst-to-LastmapsoftheFM-indexcoincideswiththenavigationoperationsinNFA(
s)
.2
Fig. 3. FromtheNFA(left)totheWheelergraph(center)totheFM-index(right).Allstatesareinitialandacceptingandnumberedontherighttoshow thatthediagramisaWheelergraph.
TheattentivereadermayaskwhytheWheelergraphrepresentstheFM-indexofsR whilehistoricallytheFM-indexwas definedfors.ThereasonisthattheFM-indexwas derivedfromtheBWTofs.The consequenceisthat thesearchofthe patternmustbe done right-to-left:the infamous backward-search procedure[20].Here, we havedefinedtheNFA sothat thesearchisdoneleft-to-rightandthroughtheWheelergraphwehavethereforeobtainedtheFM-indexofsR.
Lemma 5 provides anew perspective onFM-indexes that maybe interesting in its ownright and, moreimportantly, suggestsawaywecan systematicallygeneralize theideas behind FM-indexestoindexingother kindsofdatathansingle strings.Usingtheconceptofpathcoherence,wenowproveamorepowerfulversionofthislemma:
Theorem6.Considerafinite-stateautomatonoveranalphabetA without
-transitionsandwitheitheroneinitialstateorallstates initial.IfitsstatediagramisaWheelergraphwithn nodesande edgesthenwecanstoreitin2
(
e+
n)
+
elog|
A|
+ |
A|
log e+
o(
n+
elog|
A|)
bitssuchthat,foranystringα
,inO (|
α
|
log|
A|)
timewecancomputethesetofstatesreachablefromtheinitialstatesonα
. Proof. Supposethestate diagramisa Wheelergraphbyan orderingπ
onits nodesso,byLemma 3,itispath coherent byπ
.Sinceeitheronestateisinitialorallare,theinitialstatesformaconsecutiverangeinπ
.ByDefinition 2,thenodes reachable fromtheinitial statesoneach prefix ofα
formaconsecutive interval.We canuse Lemma 4to mapfromthe intervalforeachprefixtotheintervalforthenextprefixinO (
log|
A|)
time.2
Theorem 6meansthat ifwehaveaproblemwecansolvewithafinite-stateautomaton,theneveniftheautomatonis non-deterministic wemaystill be abletoimplementitwithout theusual blowupsinspaceorquerytime. Intherestof thispaperweshowthatmanydatastructuresbasedonvariantsoftheBWTfitintothisframework.
WenoteasanasidethatFM-indexesusuallyincludeasampledsuffixarraywhichpermitsustolocateeachoccurrence ofapatternintheindexedstring.Extendingthisideatoautomataseems straightforward,bysamplingthenodes,butwe donotexplorethatinthispaper.
4. OtherWheelergraphsindisguise 4.1. Multi-stringBWTandpermutermindex
The first naturalgeneralization ofthe BWTis to extendit to acollection ofstrings. Thiswas done forthe first time in[36–38]wheretheauthorsdescribedareversiblemulti-stringtransformationinspiredbytheBWT,andshowedits effec-tivenessfordatacompressionandformeasuringstringsimilarity.
Inthecontextofpatternmatching,givenacollectionofstringss1
,
. . . ,
sd,inadditiontosubstringqueriesitisconvenientto offer alsothe possibility ofprefix/suffix queries. In a prefix/suffix query,given two substring
α,
β
we want tofind all strings si such that si isprefixed byα
andsuffixedbyβ
.As first observed in [21,22] withthe Compressed PermutermIndex,such queries can be solved by addinga specialsymbol $ to each stringandsearching forthe pattern
β
$α
inside a circular versionof each si$. Fig. 4(left) showstheNFA supporting circularqueries forthestrings AT$, HOT$, HAT$. Toeach nodewe naturally associatea circularshiftofoneofthe strings; ifwe sortthenodesaccordingto theright-to-left lexicographicorderofsuchshiftstheresultinggraphisaWheelergraph,seeFig. 4(right).
Since each node in the Wheeler graph has in-degree and out-degree 1, we can represent it with the approach of
Lemma 4, usingonly the arrays L and C . Note that the array L, L
=
AHHTTAOT$$$ in the example of Fig. 4, coincides withthemulti-stringBWTdefinedintermsofthe cyclicshiftsin[38].Instead,theCompressedPermutermIndexisbuilt computingthesingle-string BWToftheconcatenation s1$s2· · ·
$sd thatdoesnot supportnaturally thesearch forcircular72 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 4. Left:Afiniteautomatonacceptingallcircularsubstringsof AT$, HOT$, HAT$;allstatesareinitialandaccepting.Right:Sortingthestatesaccording totheassociatedcircularshiftsgivesusaWheelergraph.
Fig. 5. Afiniteautomatonacceptingallsubstringsof AAC, ABA, ACAA, BA and BC.Allstatesareinitialandacceptingandnumberedtoshowthatthe diagramisaWheelergraph.
applying theso-called jump2end function.Morerecentalgorithms usingcircularpatternsearch allmake useofthecyclic multi-stringBWT[2,7,8,30].
4.2. XBWTandtrierepresentation
Ifwe are onlyinterested inthe substringsofacollection s1
,
. . . ,
sd, andnotincircular matches,asmaller automatonthantheoneconsideredintheprevioussectionistheonederivedfromthetrie datastructure,seeFig. 5.Followingthelead from[16–18],weassociatetoeachnodethestringformedbythelabelsinthenode-to-rootupwardpath,andweorderall nodesaccordingtothelexicographicrankofsuchstrings.
The resultinggraphisaWheeler graphwiththe samenumberofnodesastheoriginal trie.Ina trieall nodesexcept theroothavein-degree1sointherepresentationofLemma 4we donotneedtostore thein-degreebinary array.Hence therepresentationconsistsoftheout-degreebinaryarray O ,thelabelsarray L andthecountarrayC .ForthetrieinFig. 5
wehave
O
=
001 0001 01 1 1 1 01 001 01 01 1L
=
AB ABC C A AC A AThis representation isessentially equivalent to the XBWT(eXtended BWT) introduced in [16] to representlabeled trees, withthearraysO and L correspondingrespectivelyto
S
last andS
α in[16].Notethatwecanrestrictthesearchtoprefixesbysettingtherootastheonlystartstate,andwecanrestrictthesearchtosuffixesbysettingasacceptingstatesonlythose correspondingtooneoftheinputstringswithoutaffectingourrepresentation.
4.3. deBruijngraphs
Bowe,Onodera,SadakaneandShibuya[6](seealso,e.g.,[5,10,33]) extendedtheXBWTfromtreestodeBruijn graphs, whicharewidelyusedinbioinformaticsfordenovoassembly,readcorrection,identifyinggeneticvariationsinapopulation,
Fig. 6. Afiniteautomatonacceptingallprefixesoflengthatleast3ofstringsinwhichtheonlytriplesare AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA andinwhich BAAA and CABA donotoccur.State1isinitial,allstatesatdistance3fromtheinitialstateareaccepting.Statesarenumberedtoshowthat thediagramisaWheelergraph.
C C T C – A – A A C C C – C T C A A A C C C C T C C A – A A C A C C T C C A A A C A C C T T – A T A A C – C C T T A T A – A C C C T – – – – A A C C C – – – C T A A C C
Fig. 7. Left:Analignmentofstrings CCTCAAACC, CCTCCAAACA, CCTTATAAC,and CCTAACC.Thealignmenthasbeenseparatedintocommonand non-commonregions.Right:ThealignmenttransformedforFMA.Thesuffixesmovedtothenon-commonregionsare CT forthefirstcommonregionand AC forthesecondone.
andotherapplications.Akth-orderde Bruijngraphforastringorsetofstrings containsa nodeforeachdistinct k-tuple
thatoccursinthosestrings,andan edge
(
u,
v)
ifthereisa(
k+
1)
-tupleinthestringsthatstarts withu andendswithv(whichimpliesthatv canbeobtainedfromu bydeletingu’sfirstcharacterandappendingacharacter).
TogetherwiththeGCSA,describedinSection4.5,Boweetal.’srepresentationisreallytheprototypicalBWT-baseddata structurethatrequiresustothinkintermsofgraphsinsteadofstrings.Themaindifferencebetweenrepresentingalabeled treeandrepresentingadeBruijngraphisthat,inagraph,nodescanhavein-degreemorethan 1,butthiscanbehandled usingthein-degreearray I asdescribedinLemma 4.
We canalso rederive Bowe etal.’srepresentationfrom Theorem 6: we build a finite-stateautomaton that acceptsall prefixesoflengthatleastk ofstringscontaining onlyk-tuples and
(
k+
1)
-tuplesfroma givenlist,by buildingatrie for thek-tuplesandthenaddingedgesconnectingtheleavesappropriately.Fig. 6showsanexampleforthetriples AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA, withtheedges for BAAA and CABA missing:thisautomaton acceptsall prefixesof lengthatleast3ofstringsinwhichtheonlytriplesare AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA andinwhich BAAA and CABA donotoccur.Bynumberingeachnodeaccordingtothelexicographicrankofthestringlabeling thenode-to-root path,thestatediagramisimmediatelyseentobeaWheelergraph.4.4. FM-indexofalignment
LetG beaWheelergraphundernodeordering
π
.Ifallincomingedgestonodesinrange[
i,
j]
havethesamelabel,we canmergetherangeintoasinglenode v,andtheresultinggraphG willstillbeaWheelergraph.IfgraphG hasanedge withlabela fromnodeu toanodeinrange[
i,
j]
,graphGwillhaveanedge(
u,
v)
withlabela.Similarly,ifgraphG hasanedgewithlabelafromanodeinrange
[
i,
j]
tonode w,graphG willhaveanedge(
v,
w)
withlabel a.Ifwehaveamulti-stringBWT,wecantransformitsWheelergraphintoamorecompactrepresentationofthestringsby mergingrangesofnodes.Thiscompactrepresentationmaycontainfalsepositives:pathlabelsthatcombinesubstringsfrom differentoriginal strings. The FM-indexofalignment (FMA) [40,41] hasan efficientprocedure fordetecting false positives, basedoncarefullychoosingthenodestomerge.
We start with an alignment of the strings, and separate the alignment into common regions Xi shared between all
strings,andnon-common regions Yi wheresome stringsaredifferent. Weassume thateach commonregionhasaproper
suffix Zi that doesnot occur anywhere else in the strings. Ifno such suffix exists, we merge the common region into
the neighboringnon-common regions. We transformthe alignment intwo steps. First we move the suffixes Zi intothe
followingnon-commonregionsYi.Thenwejustifyeachnon-commonregiontotheright,anduse XiandYitodenotethe
commonandnon-commonregionsinthetransformedalignment.SeeFig. 7foranexample.
The transformed alignment guides us in merging the Wheeler graph of the multi-string BWT into the FM-index of alignment. As we are buildingan FM-index for the original strings, we need to build a Wheeler graph for the reverse strings.Wemergenodescorrespondingtoalignedpositions,ifthenodesa) havein-degree 0;b) areina commonregion; orc) correspondtothesamesuffixofthenon-commonregion.SeeFig. 8foranexample.
74 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 8. Top:Wheelergraphofthemulti-stringBWTofthetransformedalignmentin
Fig. 7
.Nodeorderingisbasedonthelabelsinthenodes.Graynodes correspondtothecommonregions.Bottom:WheelergraphoftheFMAforthesamealignment.Thecolorsofeachedgemarktheoriginalstringsthat crosstheedge.Thelabelsofnon-mergednodesarethesameasinthetopgraphforconvenience.(Forinterpretationofthecolorsinthisfigure,thereader isreferredtothewebversionofthisarticle.)ByLemma 3,theWheelergraphofamulti-stringBWTispathcoherent.ToseethatthegraphremainsaWheelergraph aftermerging,wenotethat:
1. Sourcenodeswithin-degree0formaconsecutiverangebyDefinition 1.
2. AcommonregionXiisenteredeitherfromthesourcenodesorfromthefollowingnon-commonregionYi.Inthelatter case, thesetofnodesreachablewithstring ZR
i isthesetofnodeswithoutgoingedgestothecommonregion Xi.By Lemma 3,thesenodesformaconsecutiverange.Asthecommonregionisenteredfromaconsecutiverangeofnodes, we see by iterating Definition 2 that nodes corresponding to aligned positions in the region also form consecutive ranges.
3. Bytheprevious twocases,thenon-commonregionisenteredfromaconsecutiverangeofnodes.ByDefinition 2,we seethatthenodescorrespondingtoasuffixoftheregionformconsecutiveranges.
False positivesare paths that donot correspond to anyof theoriginal strings. Theycan occur when the pathcomes from anon-common region Yi+1,enters a commonregion Xi+1, andexitsinto anothernon-common region Yi. Assume that foreachnode v wehavestoredthesetofstrings Sv passingthrough thenode.Tocheckforfalsepositives, wetake
theintersectionofsets Sv overallnodesv onthepath.Iftheintersectionisnon-empty,itcontainsthestringscompatible
withthepath.If
(
u,
v)
isthe onlyoutgoingedgefromnode u,we have Su⊆
Sv.Similarly, if(
u,
v)
istheonlyincomingedgetonode v,wehaveSv
⊆
Su.Henceitisenoughtotaketheintersectiona)atthefinalnode;andb)atnodesu,wherethepathcrossesanedge
(
u,
v)
toanode v withmultipleincomingedges.ItisalsoenoughtostorethesetSu explicitlyifa)nodeu hasmultipleornooutgoingedges;orb)Su
=
Sv fortheonlyoutgoingedge(
u,
v)
.Let
α
beastring,letV α bethesetofnodesreachablebyfollowingpathswithlabelα
,andlet Sα bethesetofstrings compatiblewiththepaths.BecausethegraphisaDFA,thesetofreachablenodesismonotonicallydecreasing:|
V αa| ≤ |
V α|
and
|
Vaα| ≤ |
V α|
foranycharactera∈
A.If Zi isamoved substring,VZR ia≤
1 foranycharacter a∈
A,asthepaths can onlyendattherightmostnodeofthecommonregion Xi.Ifset VZRia isnon-empty,set SZiRa containsallstrings,asa Zi is
asubstringofallstrings.
When we search in FMA, we move from V α to V αa. We update the set of compatible strings S lazily. Initially the
set contains all strings. If
|
V α| >
1, there cannot be false positives. Either none of the paths enters a common region from anon-common region,orall such paths start within Zi andenter Xi through every incomingedge. Ineithercase, Sα=
v∈VαSv. If|
V αa| =
1 and the node v∈
V αa has multiple incoming edges, there is a risk of false positives. We thereforeupdateS←
S∩
v∈VαSv.WhenthesearchfinishesatV α,thesetofcompatiblestringsisS
∩
v∈VαSv. 4.5. GCSAWhile notall NFAs haveanordering ofnodesthatmakes themWheeler graphs,wecan useWheelergraphs toindex pathlabelsoflengthuptok inarbitrarygraphs[49].ThisistheideabehindGeneralizedCompressedSuffixArrays(GCSAs). WeemphasizethatWheelergraphscanbelargerthanthegraphstheyareusedtoindex.
Fig. 9. Top:ADFA.Second:A3rd-orderde BruijngraphforpathlabelsintheDFA.NodelabelsindicatenodeorderingintheWheelergraph.Wealsoshow thek-tuplescorrespondingtoeachnode(inreversetomatchthesortingorder)andthemapping f tothenodesoftheDFA.Third:Aprunedde Bruijn graphfortheDFA.Nodelabelsarecompatiblewiththesecondgraph.Graycolorindicatesnodeswherethemappingmustbestoredexplicitly.Bottom: A pathgraphbasedona4th-orderde Bruijngraph.
Definition7.LetG beanNFA,letGbeaWheelergraph,let
α
beastring,andletV α andVα be thesetsofnodesreachable withα
inG andG,respectively.GraphG isakth-orderpathgraph ofG forak>
0,ifthereisafunction f fromnodesofGtosetsofnodesofG suchthat V α
=
v∈Vα f(
v)
forall|
α
| ≤
k.We can usea kth-order de Bruijngraph ofpath labels ingraph G as a kth-order path graphof G. SeeFig. 9 foran example. Function f has the same role as the suffix array.As withthe sets of strings in FMA, we must store set f
(
u)
explicitly,if a)node u hasmultipleorno outgoingedges;orb) we cannot derive f(
u)
from f(
v)
fortheonly outgoing edge(
u,
v)
.WeoftennumberthenodesoftheNFAinawaysuchthat f(
u) = {
x−
1|
x∈
f(
v)}
,if(
u,
v)
istheonlyoutgoing edgefromu andtheonlyincomingedgetov.Withlargevaluesofk,de Bruijngraphsoftenhavemanyredundantnodeswhenweusethemaspathgraphs.If
|
α
| ≤
k,thenodes Vα reachable with
α
inthede Bruijn grapharetheoneswithα
asasuffix ofthecorresponding k-tuples.ByLemma 3,thenodesformaconsecutiverange.If f
(
v) =
f(
v)
forallv,
v∈
Vα , wecanmergethenodeswithoutaffecting reachability.GCSA2[49]usessuchprunedde Bruijngraphs tosavespacewhenindexingpathlabelsinanNFA.SeeFig. 9for anexample.The originalGCSA [50]considers a differentscenario.Instead ofindexingpaths oflength up tok inan arbitraryNFA, it indexespaths ofarbitrarylength in an acyclicDFA.Conceptually, GCSAconstruction searchesfor ade Bruijn graph G
thatisequivalent totheinputgraphG asaDFA,andthenusesthatde Bruijngraphasaninfinite-order pathgraphof G. Consecutive ranges of nodesare merged if f
(
v) =
f(
v)
for all nodes v and v in the range,even ifthe ranges do not correspondtosharedsuffixesofthek-tuples.SeeFig. 9(bottom)foranexample.4.6. PBWT,wavelettreesandwaveletmatrices
Another remarkablevariant of the BWT is Durbin’spositional BWT [12] (PBWT), which he introduced for haplotype matchingbuthasalsobeenusedforreconstructingancestralrecombinationgraphs[48].Givend stringsofequallengththe PBWTsupportsthematchingofsubstringsstartingataspecifiedpositioninthestrings.
76 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 10. A standard representation of a wavelet tree and its representation as a Wheeler graph.
Fig. 11. A standard representation of a wavelet matrix and its representation as a Wheeler graph with edge labels over a larger alphabet.
WecanviewthePBWTasaWheelergraphwiththesamestructureasthatofthemulti-stringBWT.Ifthemulti-string BWThasanedge
(
u,
v)
withlabela fromtextpositioni topositioni+
1,PBWTuses(
i+
1,
a)
asthelabel.Becausealledges frompositioni gotopositioni+
1,thepositionalcomponentisalwaysi+
1 foralloutgoingedgesinrange[(
i−
1)
d+
1,
id]
. Hence, we caninferthepositionfromtherangeandstore justthecharactera inthesuccinctrepresentation(Lemma 4). A generalization [44] ofPBWTrelaxes therequirements.Instead ofstoring stringsof equallength,we store thelabels of paths inagraph.Ifedge(
u,
v)
intheWheelergraphcorrespondstoedge(
u,
v)
withlabela intheunderlyinggraph,we use(
v,
a)
asthelabeloftheedge.Aswecannolongerinferthepositionalcomponent,wehavetostoreitexplicitly.Although introduced forcompletely differentapplications, the PBWT bearsa striking resemblanceto a datastructure calleda waveletmatrix,which Claude,Navarro andOrdóñez [11]introduced asa version ofthewavelettree [28] better suited forlarge alphabets.This suggeststhateven wavelettrees andmatricescan beviewed asWheelergraphs. Forthe sake ofbrevity, weassume thereaderis familiarwiththesedatastructureswith“myriadvirtues” [13,24],andnote only that,whilebuildingawavelettreeisanalogoustoamost-significant-bit-firstradixsort,buildingeitheraPBWTorawavelet matrixisanalogoustoaleast-significant-bit-firstradixsort.
ConsiderthewavelettreeshownontheleftinFig. 10.Sinceitisatree,wecanuseourconstructionfromSubsection4.2
to build an FSA (withone initial state)that acceptsall thebinary strings that are root-to-leaf paths inthe wavelettree. This doesnot store all theinformation inthe wavelet tree,however, sowe turnthe graphfrom an FSA intoa directed multi-graph, asshownonthe rightinFig. 10,whichisstill aWheelergraph.Thisgraphhassomeinteresting properties: e.g.,theorderoftheoutgoingedgesfromeachnodeencodethebitvectorstoredatthatnodeinthewavelettree;asinthe wavelettree,foreachinternalnode v excepttheroot, v’sin-degreeandout-degreeareequalanditsin-degreeisequalto thesumofthein-degreesofitsleafdescendants.Itisbeyondthescopeofthispapertoinvestigatethisrepresentationof wavelettreesasadatastructure,butitseems likeaninterestingdirectionforfutureresearch thatgoesbeyondthereach ofTheorem 6.
Now considerthewaveletmatrixshowontheleft inFig. 11.WecanturnitintoaWheelergraphif,asinthewavelet tree,wetransformitintoadirectedmulti-graphandthen,asinthePBWT,werenamethelabelc oneachedgewiththepair
(
i,
c)
,wherei isthelevelofthestartingnode,asshownontherightinFig. 11.Again,theWheelergraphclearlyencodes thewaveletmatrix,butweleaveasfutureworkinvestigatingthepossiblebenefitsofthisalternativerepresentation.5. Conclusionsandfuturework
We have defined Wheeler graphs to try to capture the ideas behind many of the variants of the BWT,and given a frameworkfordevelopingnewvariantsby solvingproblemswithfinite-state automatawhosestatediagramsareWheeler graphs.Wedidnotpursuethetopichere,butWheelergraphsseemabletocapturepropertiesalsoofstringtransformations notrelatedtopatternmatching:byreversingtheinequalityu
<
uin(1)wegetaslightlydifferentnotionofWheelergraph thatcanbeusedtosuccinctlyrepresentthevariantoftheBWTdefinedintermsofthealternatinglexicographicorder[26](seealso[27]foranancestorofthemulti-string BWT).
There has been so much work involving the BWT, however, that it would be surprising if one idea could subsume it all, and indeed some BWT-related results, such as the positional BWT, seemingly cannot be reasonably modeled by finite-state automata, evenif they canstill be viewed asWheeler graphs; other BWT-related results,such asGanguly et al.’sparameterizedBWT[25] andtheindexfororder-preservingmatchingin[23],seemnoteventobebasedonWheeler graphs atall.We havenot yeteven consideredbidirectional FM-indexes[3,31,32] andbidirectionalBWT-basedde Bruijn graphs[4].
Apartfromapplying andextending ourframework, we hope to develop algorithms to recognizeWheeler graphs effi-ciently,andtocharacterizeclassesoffinite-statediagramsthatareWheelergraphsorcanbeexpandedslightlytobecome Wheelergraphswithoutchangingthelanguageacceptedbytheautomata(althoughcharacterizingallsuchstate diagrams mightbedifficult).Inthisregardweobservethatnotallregularlanguageshavefinite-stateautomatawhosestatediagrams areWheelergraphs:e.g.,inanyfinite-stateautomatonforthelanguage
(
ax∗b)
|(
cx∗d)
,theremustbedisjointpathsforaxkbandcxkd andbothendsofalledgeswithlabelx mustbeinthesameorder,sotheremustbeseparate nodesforxib and xid forallvaluesof i.
Finally,wehope ournewperspective onBWTvariantsmakes themmoreaccessibletocomputer scientistsfromareas outsidestringalgorithmsanddatastructures.
References
[1]D.Adjeroh,T.Bell,A.Mukherjee,TheBurrows–WheelerTransform:DataCompression,SuffixArrays,andPatternMatching,SpringerScience&Business Media,2008.
[2]M.J.Bauer,A.J.Cox,G.Rosone,LightweightalgorithmsforconstructingandinvertingtheBWTofstringcollections,Theoret.Comput.Sci.483(2013) 134–148.
[3]D.Belazzougui,F.Cunial,J.Kärkkäinen,V.Mäkinen,VersatilesuccinctrepresentationsofthebidirectionalBurrows–Wheelertransform,in:European SymposiumonAlgorithms,Springer,2013,pp. 133–144.
[4]D.Belazzougui,T.Gagie,V.Mäkinen,M.Previtali,S.J.Puglisi,Bidirectionalvariable-orderdeBruijngraphs,in:LatinAmericanSymposiumonTheoretical Informatics,Springer,2016,pp. 164–178.
[5]T.Beller,E.Ohlebusch,EfficientconstructionofacompresseddeBruijngraph forpan-genomeanalysis,in:AnnualSymposiumonCombinatorial PatternMatching,Springer,2015,pp. 40–51.
[6]A.Bowe,T.Onodera,K.Sadakane,T.Shibuya,SuccinctdeBruijngraphs,in:Int.WorkshoponAlgorithmsinBioinformatics,Springer,2012,pp. 225–235. [7]N.R. Brisaboa, A.Cerdeira-Pena, A.Fariña, G. Navarro, Acompact RDFstoreusing suffix arrays,in: SPIRE, in:LNCS, vol. 9309, Springer,2015,
pp. 103–115.
[8]N.R.Brisaboa,A.Fariña,D.Galaktionov,M.A.Rodríguez,Compacttriprepresentationovernetworks,in:SPIRE,in:LNCS,vol. 9954,2016,pp. 240–253. [9]M.Burrows,D.Wheeler,ABlock-SortingLosslessDataCompressionAlgorithm,Tech.Rep.124,DigitalEquipmentCorporation,1994.
[10]R.Chikhi,A.Limasset,S.Jackman,J.T.Simpson,P.Medvedev,OntherepresentationofdeBruijngraphs,J.Comput.Biol.22 (5)(2015)336–352. [11]F.Claude,G.Navarro,A.Ordónez,Thewaveletmatrix:anefficientwavelettreeforlargealphabets,Inform.Sci.47(2015)15–32.
[12]R.Durbin,EfficienthaplotypematchingandstorageusingthepositionalBurrows–Wheelertransform(PBWT),Bioinformatics30 (9)(2014)1266–1272. [13]P.Ferragina,R.Giancarlo,G.Manzini,Themyriadvirtuesofwavelettrees,Inform.andComput.207(2009)849–866.
[14]P.Ferragina,R.Giancarlo,G.Manzini,M.Sciortino,Boostingtextualcompressioninoptimallineartime,J.ACM52(2005)688–713. [15]P.Ferragina,R.González,G.Navarro,R.Venturini,Compressedtextindexes:fromtheorytopractice,ACMJ.Exp.Algorithmics13(2009).
[16]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Structuringlabeledtreesforoptimalsuccinctness,andbeyond,in:Proc.46thIEEESymposiumon FoundationsofComputerScience,FOCS’05,2005,pp. 184–193.
[17]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,CompressingandsearchingXMLdataviatwozips,in:Proc.15thInternationalWorldWideWeb Conference,WWW’06,2006,pp. 751–760.
[18]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Compressingandindexinglabeledtrees,withapplications,J.ACM57(2009).
[19]P.Ferragina,G.Manzini,Opportunisticdatastructureswithapplications,in:Proc.41stIEEESymp.onFound.ofComputerScience,2000,pp. 390–398. [20]P.Ferragina,G.Manzini,Indexingcompressedtext,J.ACM52 (4)(2005)552–581.
[21]P.Ferragina,R.Venturini,Compressedpermutermindex,in:SIGIR,ACM,2007,pp. 535–542.
[22]P.Ferragina,R.Venturini,Thecompressedpermutermindex,ACMTrans.Algorithms7 (1)(2010)10:1–10:21.
[23]T.Gagie,G.Manzini,R.Venturini,Anencodingfororder-preservingmatching,in:Proc.25thAnnualEuropeanSymposiumonAlgorithms,ESA2017, DagstuhlPublishing,Germany,2017.
[24]T.Gagie,G.Navarro,S.J.Puglisi,Newalgorithmsonwavelettreesandapplicationstoinformationretrieval,Theoret.Comput.Sci.426(2012)25–41. [25]A.Ganguly,R.Shah,S.V.Thankachan,pBWT:achievingsuccinctdatastructuresforparameterizedpatternmatchingandrelatedproblems,in:Proc.
28thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’17,SIAM,2017,pp. 397–407.
[26] I.M.Gessel,A.Restivo,C.Reutenauer,Abijectionbetweenwordsandmultisetsofnecklaces,EuropeanJ.Combin.33 (7)(2012)1537–1546,
http://
dx.doi.org/10.1016/j.ejc.2012.03.016.[27]I.M.Gessel,C.Reutenauer,Countingpermutationswithgivencyclestructureanddescentset,J.Combin.TheorySer.A64 (2)(1993)189–215. [28]R.Grossi,A.Gupta,J.S.Vitter,High-orderentropy-compressedtextindexes,in:Proc.14thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’03,
SIAM,2003,pp. 841–850.
[29]R.Grossi, J.S.Vitter,Compressedsuffixarraysandsuffixtreeswith applicationstotextindexingand stringmatching,in:Proc.ofthe32ndACM SymposiumonTheoryofComputing,2000,pp. 397–406.
78 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
[31]G.Kucherov,K.Salikhov,D.Tsur,Approximatestringmatchingusingabidirectionalindex,Theoret.Comput.Sci.638(2016)145–158.
[32]T.W.Lam,R.Li,A.Tam,S.Wong,E.Wu,S.-M.Yiu,Highthroughputshortreadalignmentviabi-directionalBWT,in:IEEEInternationalConferenceon BioinformaticsandBiomedicine,BIBM’09,IEEE,2009,pp. 31–36.
[33]D.Li,C.-M.Liu,R.Luo,K.Sadakane,T.-W.Lam,Megahit:anultra-fastsingle-nodesolutionforlargeandcomplexmetagenomicsassemblyviasuccinct de Bruijngraph,Bioinformatics31 (10)(2015)1674–1676.
[34]V.Mäkinen,Compactsuffixarray, in:Proc.ofthe11thSymposiumonCombinatorialPatternMatching,in:LNCS,vol. 1848,Springer-Verlag,2000, pp. 305–319.
[35]V.Mäkinen,D.Belazzougui,F.Cunial,A.I.Tomescu,Genome-ScaleAlgorithmDesign,CambridgeUniversityPress,2015.
[36]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,Anextensionoftheburrowswheelertransformandapplicationstosequencecomparisonanddata compression,in:CPM,in:LNCS,vol. 3537,Springer,2005,pp. 178–189.
[37]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,AnextensionoftheBurrows–Wheelertransform,Theoret.Comput.Sci.387 (3)(2007)298–312. [38]S.Mantaci,A.Restivo,M.Sciortino,AnextensionoftheBurrows–Wheelertransformtok words,in:DCC,IEEEComputerSociety,2005,p. 469. [39]G.Manzini,AnanalysisoftheBurrows–Wheelertransform,J.ACM48 (3)(2001)407–430.
[40]J.C.Na,H.Kim,S.Min,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignmentwithgaps,arXiv:1606.03897,2016.
[41]J.C.Na,H.Kim,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignment:acompressedindexforsimilarstrings,Theoret.Comput. Sci.638(2016)159–170.
[42]G.Navarro,CompactDataStructures:APracticalApproach,CambridgeUniversityPress,2016. [43]G.Navarro,V.Mäkinen,Compressedfull-textindexes,ACMComput.Surv.39 (1)(2007)2.
[44]A.M.Novak,E.Garrison,B.Paten,AgraphextensionofthepositionalBurrows–Wheelertransformanditsapplications,in:InternationalWorkshopon AlgorithmsinBioinformatics,Springer,2016,pp. 246–256.
[45]J.G.Saxe,ThePoemsofJohnGodfreySaxe,Houghton,MifflinandCo.,1881. [46] J.Seward,The bzip2homepage,
http://www.bzip.org
,1996.[47] J.Seward,The valgrind homepage,
http://valgrind.org
,2000.[48]V.Shchur,R.Durbin,TreeconsistentPBWTandtheirapplicationtoreconstructingancestralrecombinationgraphsanddemographicinference,in: PosterPresentedatthe19thAnnualInternationalConferenceonResearchinComputationalMolecularBiology(RECOMB),2015.
[49]J.Sirén,Indexingvariationgraphs,in:Proc.19thMeetingonAlgorithmEngineeringandExperiments,ALENEX’17,SIAM,2017,pp. 13–27.
[50]J.Sirén,N.Välimäki,V.Mäkinen,Indexinggraphsforpathquerieswithapplicationsingenomeresearch,IEEE/ACMTrans.Comput.Biol.Bioinform. 11 (2)(2014)375–388.