 Research
 Open access
PFP-FM: an accelerated FM-index
Algorithms for Molecular Biology volume 19, Article number: 15 (2024)
Abstract
FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate that it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count over all state-of-the-art methods with a moderate increase in memory. The source code for \(\texttt{PFP-FM}\) is available at https://github.com/AaronHong1024/afm.
Introduction
The FM-index [3] is one of the most famous data structures in bioinformatics, having been applied to countless applications in the analysis of biological data. Due to the long-term impact of this data structure, Burrows, Ferragina, and Manzini earned the 2022 ACM Paris Kanellakis Theory and Practice Award. It is the data structure behind important read aligners such as Bowtie [4] and BWA [5], which take one or more reference genomes, build the FM-index for these genomes, and use the resulting index to find short exact alignments between a set of reads and the reference(s), which can then be extended to approximate matches [4, 5]. Briefly, the FM-index consists of a sample of the suffix array (denoted as SA) and the Burrows–Wheeler transform (BWT) array. Given an input string S and a query pattern Q, count queries, which answer the number of times Q appears in S, can be efficiently supported using the BWT. To locate these occurrences, a sampling of SA is used. Together these components let the FM-index efficiently support both count and locate queries. We mathematically define the SA and BWT in the next section.
There has been a plethora of research on reducing the size of the FM-index (see, e.g., [6,7,8]) and on speeding up queries. The basic query, count, returns the number of times a pattern Q appears in the indexed text S, but usually requires at least \(|Q|\) random accesses to the BWT of S, which are usually much slower than the subsequent computations we perform on the information those accesses return. More specifically, a count query for Q uses \(\text{rank}\) queries at \(|Q|\) positions in the BWT; if we answer these using a single wavelet tree for the whole BWT, then we may use a random access for every level we descend in the wavelet tree, or \(\Omega (|Q| \log \sigma )\) random accesses in all, where \(\sigma \) is the size of the alphabet; if we break the BWT into blocks and use a separate wavelet tree for each block [7], we may need only one or a few random accesses per \(\text{rank}\) query, but the total number of random accesses is still likely to be \(\Omega (|Q|)\). As far back as 2007, Ferragina and Fischer [1] addressed compressed indexes' reliance on random access and demonstrated that word-based indexes perform fewer random accesses than character-based indexes: "The space reduction of the final word-based suffix array impacts also in their query time (i.e. less random access binary-search steps!), being faster by a factor of up to 3."
Thus, one possibility for reducing random accesses when searching genomic data is to break the sequences up into words or phrases. In light of this insight, Deng, Hon, Köppl and Sadakane [2] in 2022 applied a grammar [9] that factorizes S into phrases based on the leftmost S-type (LMS) suffixes [10]. (Crescenzi et al. [11] proposed a similar idea earlier.) Unfortunately, one round of that LMS parsing leads to phrases that are generally too short, so they obtained a speedup only when Q was thousands of characters long. The open problem was how to control the length of the phrases with respect to the input, so that longer phrases would allow a greater reduction in random accesses.
Here, we apply the concept of prefix-free parsing to the problem of accelerating count in the FM-index. Prefix-free parsing uses a rolling hash to first select a set of strings (referred to as trigger strings) that define a parse of the input string S; i.e., the prefix-free parse is a parsing of S into phrases that begin and end at a trigger string and contain no other trigger string. All unique phrases are lexicographically sorted and stored in the dictionary of the prefix-free parse, which we denote as \(\text{D}\). The prefix-free parse can then be stored as an ordered list of the phrases' ranks in \(\text{D}\). Hence, prefix-free parsing breaks the input sequence up into phrases whose length is controllable through the selection of the trigger strings. This leads to a more flexible acceleration than Deng et al. [2] obtained.
We assume that we have an input string S of length n. Now suppose we build an FM-index for S, an FM-index for the parse \(\text{P}\), and a bitvector \(\text{B}\) of length n with 1s marking characters in the BWT of S that immediately precede phrase boundaries in S, i.e., that immediately precede a trigger string. We note that all the 1s are bunched into at most as many runs as there are distinct trigger strings in S. Also, as long as the ranks of the phrases are in the same lexicographic order as the phrases themselves, we can use a rank query on the bitvector to map from the interval in the BWT of S for any pattern starting with a trigger string to the corresponding interval in the BWT of \(\text{P}\), and vice versa with a select query. This means that, given a query pattern Q, we can backward search for Q character by character in the FM-index for S until we hit the left end of the rightmost trigger string in Q, then map into the BWT of \(\text{P}\) and backward search for Q phrase by phrase until we hit the left end of the leftmost trigger string in Q, then map back into the BWT of S and finish backward searching character by character again.
We implement this method, which we refer to as \(\texttt{PFP-FM}\), and extensively compare it against the FM-index implementation in sdsl [12], RLCSA [13], RLFM [6, 14], and FIGISS [2], using sets of SARS-CoV-2 genomes taken from the NCBI website and the Genome Reference Consortium Human Build 38, with varying query string lengths. When we compared \(\texttt{PFP-FM}\) to the FM-index in sdsl on 100,000 SARS-CoV-2 genomes, \(\texttt{PFP-FM}\) was able to perform between 2.1 and 2.8 times more queries. In addition, \(\texttt{PFP-FM}\) was between 64.38% and 74.12%, 59.22% and 78.23%, and 49.10% and 90.70% faster than FIGISS, RLCSA, and RLFM, respectively, on 100,000 SARS-CoV-2 genomes. We also evaluated the performance of \(\texttt{PFP-FM}\) on the Genome Reference Consortium Human Build 38, and witnessed that it was between 3.86 and 7.07, 2.92 and 18.07, and 10.14 and 25.46 times faster than RLCSA, RLFM, and FIGISS, respectively. With respect to construction, \(\texttt{PFP-FM}\) had the fastest construction time for all SARS-CoV-2 datasets and was the second fastest for the Genome Reference Consortium Human Build 38. All methods used less than 60 GB of memory for construction on the SARS-CoV-2 datasets, making the construction feasible on any entry-level commodity server, even for the 100,000-genome SARS-CoV-2 dataset. Construction for the Genome Reference Consortium Human Build 38 required between 26 GB and 71 GB across all methods, with our method using the most memory. In summary, we developed and implemented a method for accelerating the FM-index, achieving an acceleration between 2 and 25 times, with the greatest acceleration witnessed for longer patterns. Thus, accelerated FM-index methods, such as the one developed in this paper, are highly applicable to finding very long matches (125 to 1,000 characters in length) between query sequences and reference databases.
As reads get longer and more accurate (e.g., Nanopore data), we will soon be able to align long reads to reference databases with efficiency that surpasses traditional FM-index-based alignment methods. The source code is publicly available at https://github.com/AaronHong1024/afm.
Preliminaries
Basic definitions
A string S of length n is a finite sequence of symbols \(S = S[0..n-1] = S[0] \cdots S[n-1]\) over an alphabet \(\Sigma = \{c_1, \ldots , c_{\sigma } \}\). We assume that the symbols can be unambiguously ordered. We denote by \(\varepsilon \) the empty string, and the length of S as \(|S|\). Given a string S, we denote the reverse of S as rev(S), i.e., \(rev(S) = S[n-1] \cdots S[0]\).
We denote by S[i..j] the substring \(S[i] \cdots S[j]\) of S starting at position i and ending at position j, with \(S[i..j] = \varepsilon \) if \(i > j\). For a string S and \(0 \le i < n\), S[0..i] is called the ith prefix of S, and \(S[i..n-1]\) is called the ith suffix of S. We call a prefix S[0..i] of S a proper prefix if \(0 \le i < n-1\). Similarly, we call a suffix \(S[i..n-1]\) of S a proper suffix if \(0< i < n\).
Given a string S, a symbol \(c \in \Sigma \), and an integer i, we define \(S.\text{rank}_c(i)\) (or simply rank if the context is clear) as the number of occurrences of c in \(S[0..i-1]\). We also define \(S.\text{select}_c(i)\) as \(\min (\{ j-1 \mid S.\text{rank}_c(j) = i \}\cup \{ n \})\), i.e., the position in S of the ith occurrence of c in S if it exists, and n otherwise. For a bitvector \(B[0..n-1]\), that is, a string over \(\Sigma = \{0,1\}\), to ease notation we will refer to \(B.\text{rank}_1(i)\) and \(B.\text{select}_1(i)\) as \(B.\text{rank}(i)\) and \(B.\text{select}(i)\), respectively.
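For concreteness, these two operations can be prototyped in a few lines of Python. This is a naive linear-time sketch for illustration only; succinct data structures such as those in sdsl-lite answer both queries in constant time on bitvectors.

```python
def rank(S, c, i):
    # S.rank_c(i): number of occurrences of c in S[0..i-1]
    return S[:i].count(c)

def select(S, c, i):
    # S.select_c(i): position in S of the i-th occurrence of c,
    # or len(S) if c occurs fewer than i times
    count = 0
    for j, ch in enumerate(S):
        if ch == c:
            count += 1
            if count == i:
                return j
    return len(S)
```

For example, with S = "abracadabra", rank(S, "a", 5) is 2 and select(S, "a", 3) is 5.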
SA, BWT, and backward search
We denote the suffix array [15] of a given string \(S[0..n-1]\) as \(\text{SA}_S\), and define it to be the permutation of \(\{0,\ldots ,n-1\}\) such that \(S[\text{SA}_S[i]..n-1]\) is the ith lexicographically smallest suffix of S. We refer to \(\text{SA}_S\) as \(\text{SA}\) when it is clear from the context. For technical reasons, we assume that the last symbol of the input string is \(S[n-1] = \$\), which does not occur anywhere else in the string and is smaller than any other symbol.
We consider the matrix \(\text{W}\) containing all sorted rotations of S, called the BWT matrix of S, and let F and L be the first and the last column of the matrix. The last column defines the BWT array, i.e., \(\text{BWT} = \textsf{L}\). Now let \(\textsf{C}[c]\) be the number of suffixes starting with a character smaller than c. We define the LF-mapping as \(\text{LF}(i,c) = \textsf{C}[c] + \textsf{BWT}.\text{rank}_c(i)\) and \(\text{LF}(i) = \text{LF}(i,\textsf{BWT}[i])\). With the LF-mapping, it is possible to reconstruct the string S from its BWT. It is in fact sufficient to set an iterator \(s = 0\) and \(S[n-1] = \$\), and for each \(i = n-2, \ldots , 0\) set \(S[i] = \text{BWT}[s]\) and \(s = \text{LF}(s)\). The LF-mapping can also be used to support count by performing the backward search, which we now describe.
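The reconstruction procedure above can be sketched as follows. This is a naive Python illustration: the BWT is built by explicitly sorting all rotations, which is only feasible for small examples.

```python
def bwt(S):
    # naive BWT: last column of the lexicographically sorted rotations of S;
    # S must end with a unique sentinel '$' smaller than every other symbol
    n = len(S)
    rotations = sorted(S[i:] + S[:i] for i in range(n))
    return "".join(r[-1] for r in rotations)

def invert_bwt(L):
    # rebuild S from its BWT via the LF-mapping, right to left
    n = len(L)
    C = {c: sum(1 for x in L if x < c) for c in set(L)}   # the C array
    occ, LF = {}, []
    for c in L:                     # LF(i) = C[L[i]] + rank_{L[i]}(i)
        LF.append(C[c] + occ.get(c, 0))
        occ[c] = occ.get(c, 0) + 1
    S = ["$"] * n
    s = 0                           # row 0 is the rotation beginning with '$'
    for i in range(n - 2, -1, -1):
        S[i] = L[s]
        s = LF[s]
    return "".join(S)
```

For example, bwt("abracadabra$") is "ard$rcaaaabb", and invert_bwt recovers "abracadabra$" from it.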
Given a query pattern Q of length m, the backward search algorithm consists of m steps that preserve the following invariant: at the ith step, p stores the position of the first row of \(\text{W}\) prefixed by \(Q[i..m-1]\) while q stores the position of the last row of \(\text{W}\) prefixed by \(Q[i..m-1]\). To advance from i to \(i-1\), with \(c = Q[i-1]\), we use the LF-mapping on p and q: \(p = \textsf{C}[c] + \textsf{BWT}.\text{rank}_c(p)\) and \(q = \textsf{C}[c] + \textsf{BWT}.\text{rank}_c(q + 1) - 1\).
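The backward-search loop can be sketched as follows, again as a naive Python illustration in which rank is computed by scanning; a real FM-index answers rank with a wavelet tree or similar structure.

```python
def backward_search(L, Q):
    # L is the BWT of S; returns the inclusive interval (p, q) of rows of
    # the BWT matrix prefixed by Q, or None if Q does not occur in S
    C = {c: sum(1 for x in L if x < c) for c in set(L)}
    rank = lambda c, i: L[:i].count(c)       # BWT.rank_c(i), by scanning
    p, q = 0, len(L) - 1
    for c in reversed(Q):                    # one step per pattern character
        if c not in C:
            return None
        p = C[c] + rank(c, p)
        q = C[c] + rank(c, q + 1) - 1
        if p > q:
            return None
    return p, q

def count(L, Q):
    interval = backward_search(L, Q)
    return 0 if interval is None else interval[1] - interval[0] + 1
```

For example, with L = "ard$rcaaaabb" (the BWT of "abracadabra$"), count(L, "abra") returns 2.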
FM-index and count queries
Given a query string \(Q[0..m-1]\) and an input string \(S[0..n-1]\), two fundamental queries are: (1) count, which counts the number of occurrences of Q in S; (2) locate, which finds the location of each of these occurrences in S. Ferragina and Manzini [3] showed that, by combining \(\text{SA}\) with the BWT, both count and locate can be efficiently supported. Briefly, backward search on the BWT is used to find the lexicographic range of the occurrences of Q in S; the size of this range is the answer to count. The SA values within this range are the positions where these occurrences are in S.
Prefix-free parsing
As previously mentioned, Prefix-Free Parsing (PFP) takes as input a string \(S[0..n-1]\) and positive integers w and p, and produces a parse of S (denoted as \(\text{P}\)) and a dictionary (denoted as \(\text{D}\)) of all the unique substrings (or phrases) of the parse. Parameter w defines the length of the trigger strings and parameter p influences the rolling-hash function. We briefly go over the algorithm for producing this dictionary \(\text{D}\) and parse \(\text{P}\). First, we assume there exist two symbols, say \(\#\) and \(\$\), which are not contained in \(\Sigma \) and are lexicographically smaller than any symbol in \(\Sigma \). Next, we let T be an arbitrary set of length-w strings over \(\Sigma \) and call it the set of trigger strings. As mentioned before, we assume that \(S[n-1]=\$\) and consider S to be cyclic, i.e., for all i, \(S[i] = S[i \bmod n]\). Furthermore, we assume that \(\$S[0..w-2]=S[n-1..n+w-2] \in T\), i.e., the substring of length w that begins with \(\$\) is a trigger string.
We let the dictionary \(\text{D} = \{d_1, \ldots, d_{|\text{D}|}\}\) be a (lexicographically sorted) maximum set of substrings of S such that the following holds for each \(d_i\): (i) exactly one proper prefix of \(d_i\) is contained in T, (ii) exactly one proper suffix of \(d_i\) is contained in T, and (iii) no other substring of \(d_i\) is in T. These properties allow the SA and BWT to be constructed, since the lexicographic placement of each rotation of the input string can be identified unambiguously from D and P [16,17,18]. An important consequence of the definition is that \(\text{D}\) is prefix-free, i.e., for any \(i \ne j\), \(d_i\) cannot be a prefix of \(d_j\).
Since we assumed \(S[n-1..n+w-2] \in T\), we can construct \(\text{D}\) by scanning \(S' = \$S[0..n-2]S[n-1..n+w-2]\) to find all occurrences of trigger strings, and adding to \(\text{D}\) each substring of \(S'\) that starts and ends at a trigger string, inclusive of the starting and ending trigger string. We can also construct the list of occurrences of the phrases of \(\text{D}\) in \(S'\), which defines the parse \(\text{P}\).
We choose T via a Karp-Rabin fingerprint f on strings of length w. We slide a window of length w over \(S'\) and, for each length-w substring r of \(S'\), include r in T if and only if \(f(r) \equiv 0 \pmod p\) or \(r=S[n-1..n+w-2]\). Let \(0=s_0< \cdots < s_{k-1}\) be the positions in \(S'\) such that for any \(0 \le i < k\), \(S'[s_i..s_i+w-1] \in T\). The dictionary is \(\text{D} = \{ S'[s_i..s_{i+1}+w-1] \mid i=0,\ldots , k-2\}\), and the parse is defined to be the sequence of lexicographic ranks in \(\text{D}\) of the substrings \(S'[s_0..s_{1}+w-1], \ldots , S'[s_{k-2}..s_{k-1}+w-1]\).
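The construction of \(\text{D}\) and \(\text{P}\) can be sketched as follows. This is a Python illustration; the function h stands in for the Karp-Rabin fingerprint f, and any integer-valued window hash exhibits the same structural behaviour.

```python
def prefix_free_parse(S, w, p, h):
    # Sketch of PFP. S must end with '$'; h stands in for the Karp-Rabin
    # fingerprint f on length-w windows (any integer-valued hash works here).
    assert S[-1] == "$"
    n = len(S)
    Sp = "$" + S[:n - 1] + "$" + S[:w - 1]   # S' = $ S[0..n-2] S[n-1..n+w-2]
    special = "$" + S[:w - 1]                # the trigger string containing $
    # positions s_0 < ... < s_{k-1} of trigger-string occurrences in S'
    pos = [i for i in range(len(Sp) - w + 1)
           if h(Sp[i:i + w]) % p == 0 or Sp[i:i + w] == special]
    # each phrase runs from one trigger string to the next, inclusive of both
    phrases = [Sp[pos[i]:pos[i + 1] + w] for i in range(len(pos) - 1)]
    D = sorted(set(phrases))                 # lexicographically sorted dictionary
    rank = {d: r for r, d in enumerate(D)}
    P = [rank[ph] for ph in phrases]         # parse: ranks of the phrases in D
    return D, P
```

Overlap-concatenating the phrases of the parse (dropping the first w characters of every phrase after the first) reconstructs \(S'\) exactly, and the returned dictionary is prefix-free.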
As an example, suppose we have \(S = \mathtt{AGACGACT\#AGATACT\#AGATTCGAGACGAC\$}\) with \(w = 2\), where the trigger strings occurring in \(S'\) are \(\texttt{AC}\), \(\texttt{TC}\) and \(\mathtt{\$A}\). It follows that we have \( \text{D} = \{ \mathtt{\$AGAC}, \mathtt{AC\$A}, \texttt{ACGAC}, \mathtt{ACT\#AGATAC}, \mathtt{ACT\#AGATTC}, \texttt{TCGAGAC}\} \) and \(\text{P} = 0, 2, 3, 4, 5, 2, 1\). In our experiments, as illustrated in Fig. 3, we observed that increasing the value of w usually decreases the average phrase length, while increasing p usually increases the average phrase length. We note, however, that these trends may vary in real-world applications.
Methods
We now make use of the prefix-free parsing reviewed in section "Prefix-free parsing" to build a word-based FM-index in which the lengths of the phrases can be controlled via the parameters w and p. We first describe the components of our data structure, and then explain how it supports count queries more efficiently than the standard FM-index.
Data structure design
It is easiest to explain our two-level design with an example, so consider a text S of length \(n = 41\) that is terminated by a special end-of-string character $ lexicographically less than the rest of the alphabet. Suppose we parse S using \(w = 2\) and a Karp-Rabin hash function such that the trigger strings occurring in S are AA, CG and TA. We consider S as cyclic, and we have \(\$S[0..w-2] = \texttt{\$T}\) as a special trigger string, so the dictionary \(\text{D}\) is
with the phrases sorted in lexicographic order. (Recall that phrases consecutive in S overlap by \(w = 2\) characters.) If we start parsing at the $, then the prefixfree parse for S is
where each element (or phrase ID) in \(\text{ P }\) is the lexicographic rank of the phrase in \(\text{ D }\).
Next, we consider the BWT matrix for \(\text{P}\). Figure 1 illustrates the BWT matrix of P for our example. We note that since there is only one $ in S, there is only one 0 in \(\text{P}\); we can regard this 0 as the end-of-string character for (a suitable rotation of) P corresponding to $ in S. If we take the ith row of this matrix and replace the phrase IDs by the phrases themselves, collapsing overlaps, then we get the lexicographically ith cyclic shift of S that starts with a trigger string, as shown on the right of the figure. This is one of the key insights that we will use later on.
Lemma 1
The lexicographic order of the rotations of \(\text{P}\) corresponds to the lexicographic order of their corresponding rotations of S.
Proof
The characters of \(\text{P}\) are the phrase IDs, which act as metacharacters. Since the metacharacters inherit the lexicographic order of their underlying phrases, and due to the prefix-freeness of the phrases, the suffix array of \(\text{P}\) permutes the metacharacters of \(\text{P}\) in the same way as the suffix array of S permutes the phrases of S. This means that the order of the phrases in the \(\text{BWT}\) of S is the same as the order of the phrase IDs in the \(\text{BWT}\) of \(\text{P}\). \(\square \)
Next, we let \(B[0..n-1]\) be a bitvector marking these cyclic shifts' lexicographic ranks among all cyclic shifts of S, i.e., where they are among the rows of the BWT matrix of S. Figure 2 shows the SA, BWT matrix and BWT of S, together with B; we highlight the BWT (the last column of the matrix) in red, and the cyclic shifts from Fig. 1 are highlighted in blue. We note that B contains at most one run of 1's for each distinct trigger string in S, so it is usually highly run-length compressible in practice. While this compressibility is partly due to the generally small size of \(\text{D}\) observed in repetitive data, it is also influenced by the limited number of distinct trigger strings. For example, for 200 copies of chromosome 19 the size of \(\text{D}\) was 0.16 GB, and with 2200 additional copies the size of \(\text{D}\) was still less than 1 GB, i.e., 0.56 GB [19]. This suggests a compact representation, especially in repetitive sequences, where there are fewer distinct strings of length w and approximately a 1/p fraction of these strings will be trigger strings.
In addition to the bitvector, we store a hash function h on phrases and a map M from the hashes of the phrases in \(\text{D}\) to those phrases' lexicographic ranks, which are their phrase IDs; M returns \(\textrm{NULL}\) when given any other key, where \(\textrm{NULL}\) semantically represents an invalid ID. Therefore, in total, we build the FM-index for S, the FM-index for \(\text{P}\), the bitvector B marking the cyclic rotations, the hash function h on the phrases and the map M. For our example, suppose
and \(M (x) = \textrm{NULL}\) for any other value of x.
If we choose the range of h to be reasonably large, then we can still store M in space proportional to the number of phrases in \(\text{D}\) with a reasonably small constant coefficient and evaluate \(M(h(\cdot ))\) in constant time with high probability, while the probability that \(M(h(\gamma )) \ne \textrm{NULL}\) for any particular string \(\gamma \) not in \(\text{D}\) is negligible. This means that in practice we can use \(M(h(\cdot ))\) as a membership dictionary for \(\text{D}\), and need not store \(\text{D}\) itself.
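As an illustration, h and M can be sketched as follows in Python. The names kr_hash and build_M are hypothetical, and the Mersenne-prime modulus is a stand-in for the large range of h discussed above; the sketch uses the example dictionary from section "Prefix-free parsing".

```python
MOD = (1 << 89) - 1   # a large Mersenne prime; stand-in for the range of h

def kr_hash(s, base=256):
    # polynomial (Karp-Rabin style) fingerprint of s modulo MOD
    v = 0
    for c in s:
        v = (v * base + ord(c)) % MOD
    return v

def build_M(D):
    # map from the fingerprints of the phrases in D to their lexicographic
    # ranks (their phrase IDs); any other key yields None, our NULL
    table = {kr_hash(d): i for i, d in enumerate(sorted(D))}
    return lambda s: table.get(kr_hash(s))
```

With D = {$AGAC, AC$A, ACGAC, ACT#AGATAC, ACT#AGATTC, TCGAGAC}, build_M(D) maps ACGAC to 2 and returns None for strings outside D (with high probability in general; here the strings are short enough that the fingerprint is collision-free).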
Query support
Next, given the data structure defined above, we describe how to support count queries for a given pattern Q. We begin by parsing Q using the same Karp-Rabin hash we used to parse S, implying that we will find all the same trigger strings as before, and possibly additional ones that did not occur in S. However, we will not consider Q to be cyclic, nor assume an end-of-string symbol that would ensure that Q starts and ends with a trigger string.
If Q is a substring of S, then, since Q contains the same trigger strings as its corresponding occurrence in S, the sequence of phrases induced by the trigger strings in Q must be a substring of the sequence of phrases of S. Together with the prefix and suffix of Q that are a suffix and prefix of the phrases in S to the left and right of the shared phrases, we call this the partial encoding of Q, defined formally as follows.
Definition 1
(partial encoding) Given a substring S[i..j] of S, the partial encoding of S[i..j] is defined as follows: If no trigger string occurs in S[i..j], then the partial encoding of S[i..j] is simply S[i..j] itself. Otherwise, the partial encoding of S[i..j] is the concatenation of: (1) the shortest prefix \(\alpha \) of S[i..j] that does not start with a trigger string and ends with a trigger string, followed by (2) the sequence of phrase IDs of phrases completely contained in S[i..j], followed by (3) the shortest suffix \(\beta \) of S[i..j] that begins with a trigger string and does not end with a trigger string.
So the partial encoding partitions S[i..j] into a prefix \(\alpha \), a list of phrase IDs, and a suffix \(\beta \). If S[i..j] begins (respectively ends) with a trigger string, then \(\alpha \) (respectively \(\beta \)) is the empty string.
Parsing Q can be done in time linear in the length of Q.
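This linear-time scan can be sketched as follows in Python; is_trigger and phrase_id are illustrative stand-ins for the Karp-Rabin window test and for \(M(h(\cdot ))\), respectively.

```python
def partial_encoding(Q, w, is_trigger, phrase_id):
    # Split Q into (alpha, phrase IDs, beta) as in Definition 1. is_trigger
    # tests a length-w window, and phrase_id plays the role of M(h(.)),
    # returning None for phrases not in D; both are illustrative stand-ins.
    pos = [i for i in range(len(Q) - w + 1) if is_trigger(Q[i:i + w])]
    if not pos:
        return Q, [], ""                     # no trigger string occurs in Q
    alpha = "" if pos[0] == 0 else Q[:pos[0] + w]
    beta = "" if pos[-1] == len(Q) - w else Q[pos[-1]:]
    ids = [phrase_id(Q[pos[i]:pos[i + 1] + w]) for i in range(len(pos) - 1)]
    if any(x is None for x in ids):
        return None                          # a complete phrase is not in D
    return alpha, ids, beta
```

Note that, as in Definition 1, \(\alpha \) ends at the right end of the first trigger string and \(\beta \) starts at the left end of the last one, so each overlaps the adjacent complete phrase by w characters.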
Lemma 2
We can represent M with a data structure taking space (in words) proportional to the number of distinct phrases in \(\text{D}\). Given a query pattern Q, this data structure returns \(\textrm{NULL}\) with high probability if Q contains a complete phrase that does not occur in S. Otherwise (all complete phrases of Q occur in S), it returns the partial encoding of Q. In either case, this query takes \(O(|Q|)\) time.
Proof
We keep the Karp-Rabin (KR) hashes of the phrases in \(\text{D}\), with the range of the KR hash function mapping to \([1..n^3]\) so the hashes each fit in \(O(\log n)\) bits. We also keep a constant-time map (implemented as a hash table with a hash function that is perfect for the phrases in \(\text{D}\)) from the KR hashes of the phrases in \(\text{D}\) to their IDs, which returns \(\textrm{NULL}\) given any value that is not a KR hash of a phrase in \(\text{D}\). We set M to be this map composed with the KR hash function.
Given Q, we scan it to find the trigger strings in it, and convert it into a sequence of substrings consisting of: (a) the prefix \(\alpha \) of Q ending at the right end of the first trigger string in Q; (b) a sequence of PFP phrases, each starting and ending with a trigger string with no trigger string in between; and (c) the suffix \(\beta \) of Q starting at the left end of the last trigger string in Q.
We apply M to every complete phrase in (b). If M returns \(\textrm{NULL}\) for any complete phrase in (b), then that phrase does not occur in S, so we return \(\textrm{NULL}\); otherwise, we return \(\alpha \), the sequence of phrase IDs M returned for the phrases in (b), and \(\beta \).
Notice that, if a phrase in Q is in \(\text{D}\), then M will map it to its lexicographic rank in \(\text{D}\); otherwise, the probability that the KR hash of any particular phrase in Q but not in \(\text{D}\) collides with the KR hash of a phrase in \(\text{D}\) is at most \(n / n^3 = 1 / n^2\). It follows that, if Q contains a complete phrase that does not occur in S, then we return \(\textrm{NULL}\) with high probability; otherwise, we return Q's partial encoding. \(\square \)
Corollary 1
If we allow \(O(|Q|)\) query time with high probability, then we can modify M to always report \(\textrm{NULL}\) when Q contains a complete phrase not in S.
Proof
We augment each Karp-Rabin (KR) hash stored in the hash table with the actual characters of its phrase, so that we can check, character by character, whether a matched phrase of Q is indeed in \(\text{D}\). In case of a collision, we recompute the KR hashes of the phrases in \(\text{D}\) and rebuild the hash table. This is possible since we are free to choose different Karp-Rabin fingerprints for the phrases in \(\text{D}\). \(\square \)
Continuing from our example above where the trigger strings are AA, CG and TA, suppose we have a query pattern Q,
we can compute the parse of Q to obtain the following parse string
Next, we use \(M(h(\cdot ))\) to map the complete phrases of this parse of Q to their phrase IDs, which are their \(\text{rank}\) values in \(\text{D}\). If any complete phrase maps to \(\textrm{NULL}\), then we know Q does not occur in S. Using our example, we have the partial encoding
Next, we consider all possible cases of matching Q with an occurrence in S. First, we consider the case that the last substring \(\beta \) in our parse of Q ends with a trigger string, which implies that it is a complete phrase. Here, we can immediately start backward searching for the parse of Q in the FM-index for \(\text{P}\). Next, if \(\beta \) is not a complete phrase, then we backward search for \(\beta \) in the FM-index for S. If this backward search for \(\beta \) returns nothing, then we know Q does not occur in S. If the backward search for \(\beta \) returns an interval in the BWT of S that is not contained in the BWT interval for a trigger string, then \(\beta \) does not start with a trigger string, so \(Q = \beta \) and we are done backward searching for Q.
Finally, we consider the case when \(\beta \) is a proper prefix of a phrase and the backward search for \(\beta \) returns a \(\text{ BWT}_S\) interval contained in the \(\text{ BWT}_S\) interval for a trigger string. In our example, \(\beta = \texttt {TAT}\) and our backward search for \(\beta \) in the FMindex for S returns the interval \(\text{ BWT}_S [31..32]\), which is the interval for the trigger string TA. Next, we use B to map the interval for \(\beta \) in the \(\text{ BWT}_S\) to the interval in the \(\text{ BWT}_{\text{ P }}\) that corresponds to the cyclic shifts of S starting with \(\beta \).
Lemma 3
We can store, in space (in words) proportional to the number of distinct trigger strings in S, a data structure B with which:

(1) given the lexicographic range of suffixes of S starting with a string \(\beta \) such that \(\beta \) starts with a trigger string and contains no other trigger string, in \(O(\log \log n)\) time we can find the lexicographic range of suffixes of \(\text{P}\) starting with phrases that start with \(\beta \);

(2) given a lexicographic range of suffixes of \(\text{P}\) such that the corresponding suffixes of S all start with the same trigger string, in \(O(\log \log n)\) time we can find the lexicographic range of those corresponding suffixes of S.
Proof
Let \(B[0..n-1]\) be a bitvector with 1s marking the lexicographic ranks of suffixes of S starting with trigger strings. There are at most as many runs of 1s in B as there are distinct trigger strings in S, so we can store B in space proportional to that number and support rank and select operations on it in \(O(\log \log n)\) time (e.g., with the data structure of [20]).
If \(\text{ BWT}_S [i..j]\) contains the characters immediately preceding, in S, occurrences of a string \(\beta \) that starts with a trigger string and contains no other trigger strings, then \(\text{ BWT}_{\text{ P }} [B.\text{ rank}_1 (i)..B.\text{ rank}_1 (j)]\) contains the phrase IDs immediately preceding, in \(\text{ P }\), the IDs of phrases starting with \(\beta \).
If \(\text{ BWT}_{\text{ P }} [i..j]\) contains the phrase IDs immediately preceding, in \(\text{ P }\), suffixes of \(\text{ P }\) such that the corresponding suffixes of S all start with the same trigger string, then \(\text{ BWT}_S [B.\text{ select}_1 (i + 1).. B.\text{ select}_1 (j + 1)]\) contains the characters immediately preceding the corresponding suffixes of S.
The correctness follows from Lemma 1. \(\square \)
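The two mappings of Lemma 3 can be sketched as follows in Python, with B as a plain list and rank/select computed by scanning; as noted in the proof, a run-length-compressed representation supports both operations in \(O(\log \log n)\) time.

```python
def s_to_p(B, i, j):
    # BWT_S[i..j] -> BWT_P[B.rank_1(i)..B.rank_1(j)], for an interval of
    # rows of the BWT matrix of S that are all marked in B
    rank1 = lambda k: sum(B[:k])             # B.rank_1(k), by scanning
    return rank1(i), rank1(j)

def p_to_s(B, i, j):
    # BWT_P[i..j] -> BWT_S[B.select_1(i+1)..B.select_1(j+1)]
    def select1(k):                          # position of the k-th 1 in B
        c = 0
        for pos, bit in enumerate(B):
            c += bit
            if c == k:
                return pos
        return len(B)
    return select1(i + 1), select1(j + 1)
```

The two functions are inverses on intervals of marked rows, which is exactly the round trip the count algorithm performs.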
Continuing with our example, mapping \(\text{ BWT}_S [31..32]\) to \(\text{ BWT}_{\text{ P }}\) yields the following interval:
as shown in Fig. 1. Starting from this interval in \(\text{BWT}_{\text{P}}\), we now backward search in the FM-index for \(\text{P}\) for the sequence of complete phrase IDs in the parse of Q. In our example, this interval is \(\text{BWT}_{\text{P}}[4..5]\) and the sequence of complete phrase IDs is \(2\ 4\ 3\ 1\).
If this backward search in the FMindex for \(\text{ P }\) returns nothing, then we know Q does not occur in S. Otherwise, it returns the interval in \(\text{ BWT}_{\text{ P }}\) corresponding to cyclic shifts of S starting with the suffix of Q that starts with Q’s first complete phrase. In our example, if we start with \(\text{ BWT}_{\text{ P }} [4..5]\) and backward search for \(2\ 4\ 3\ 1\) then we obtain \(\text{ BWT}_{\text{ P }} [2]\), which corresponds to the cyclic shift
of S that starts with the suffix
of Q that is parsed into \(2, 4, 3, 1, \texttt {TAT}\).
To finish our search for Q, we use B to map the interval in \(\text{ BWT}_{\text{ P }}\) to the corresponding interval in the \(\text{ BWT}_S\), which is the interval of rows in the BWT matrix for S which start with the suffix of Q we have sought so far. In our example, we have that \(\text{ BWT}_{\text{ P }} [2]\) maps to
We note that our examples contain BWT intervals with only one entry because our example is so small, but in general they are longer. If the first substring \(\alpha \) in our parse of Q is a complete phrase then we are done backward searching for Q. Otherwise, we start with this interval in \(\text{ BWT}_S\) and backward search for \(\alpha \) in the FMindex for S, except that we ignore the last w characters of \(\alpha \) (which we have already sought, as they are also contained in the next phrase in the parse of Q).
In our example, \(\alpha = \texttt {CAGAA}\) so, starting with \(\text{ BWT}_S [2]\) we backward search for \(\texttt {CAG}\), which returns the interval \(\text{ BWT}_S [14]\). As shown in Fig. 2,
indeed starts with
This concludes our explanation of count.
To conclude, we give some intuition as to why we expect our two-level FM-index to be faster in practice than standard backward search. Following the reasoning of Deng et al. [2], on the one hand, standard backward search takes time linear in the length of Q and usually uses at least one random access to the BWT of S per character in Q. On the other hand, prefix-free parsing Q, like the LMS-parsing of Deng et al., takes linear time but does not use random access to S or the BWT of S; backward search in the FM-index of S is the same as standard backward search, but we use it only for the first and last substrings in the parse of Q. Backward search in the FM-index for P is likely to use about \(\lg |\text{D}|\) random accesses for each complete phrase in the parse of Q: the BWT of \(\text{P}\) is over an effective alphabet whose size is the number of phrases in \(\text{D}\), so a balanced wavelet tree supporting rank on that BWT has depth about \(\lg |\text{D}|\), and we use at most about one random access per level of the tree.
In summary, if we can find settings of the prefix-free parsing parameters w and p such that

1. most query patterns span several phrases,

2. most phrases in those patterns are fairly long, and

3. \(\lg |\text{ D }|\) is significantly smaller than those phrases’ average length,

then the extra cost of parsing Q should be more than offset by using fewer random accesses.
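To make the roles of w and p concrete, here is a simplified sketch of prefix-free parsing. The hash function and parameter choices are our own illustrative assumptions; real PFP additionally wraps the text with sentinel characters and collects the distinct phrases into the dictionary \(\text{ D }\) and their ranks into the parse \(\text{ P }\). The sketch only shows how phrase boundaries arise: a length-w window slides over the text, and positions whose window hashes to 0 modulo p end a phrase, so a larger p yields longer phrases on average.

```python
# Illustrative sketch of prefix-free parsing (PFP) phrase boundaries.

def pfp_phrases(text, w, p):
    """Split text into overlapping phrases; consecutive phrases share a
    w-character trigger string, so larger p gives longer phrases on average."""
    def h(window):
        # simple polynomial hash as a stand-in for a rolling Karp-Rabin hash
        v = 0
        for ch in window:
            v = (v * 256 + ord(ch)) % p
        return v

    phrases = []
    start = 0
    for i in range(len(text) - w + 1):
        if i > 0 and h(text[i:i + w]) == 0:
            phrases.append(text[start:i + w])  # phrase ends with trigger string
            start = i                          # next phrase begins at trigger
    phrases.append(text[start:])               # final phrase
    return phrases
```

Because consecutive phrases overlap by exactly w characters, the text is recoverable as the first phrase followed by each later phrase with its first w characters dropped; this overlap is what makes the parse well defined regardless of where a query pattern starts.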
Results
We implemented our algorithm and measured its performance against all known competing methods. We ran all experiments on a server with an AMD EPYC 75F3 CPU running Red Hat Enterprise Linux 7.7 (64-bit, kernel 3.10.0). The compiler was g++ version 12.2.0. Running time and memory usage were recorded with the Snakemake benchmark facility [21]. We set a memory limit of 128 GB and a time limit of 24 h.
Datasets We used the following datasets. First, we considered sets of SARS-CoV-2 genomes taken from the NCBI website: three collections of 25,000, 50,000, and 100,000 genomes obtained via EMBL-EBI’s COVID-19 data portal, each collection a superset of the previous one. We denote these as SARS25k, SARS50k, and SARS100k. Next, we considered a single human reference genome, downloaded from NCBI, which we denote as GRCh38. We report the size of each dataset as the number of characters it contains, denoted n, in Table 1.
Implementation We implemented our method in C++11 using the sdsl-lite library [12] and extended the prefix-free parsing implementation of Oliva, whose source code is publicly available at https://github.com/marco-oliva/pfp. The source code for \(\texttt {PFP-FM}\) is available at https://github.com/AaronHong1024/afm.
Competing methods We compared \(\texttt {PFP-FM}\) against the following methods: the standard FM-index from the sdsl-lite library [12], RLCSA [13], RLFM [6, 14], Bowtie [4], Bowtie2 [22, 23] and FIGISS [2]. The source code of RLCSA and FIGISS is publicly available, while RLFM is provided only as an executable. We performed the comparison by selecting 1000 strings of the specified length at random from the FASTA file containing all the genomes, performing the count operation on each query pattern, and measuring the time used by each method under consideration.
As a side note, FIGISS and RLCSA only support count queries where the query string is provided in an input text file; more specifically, the original FIGISS implementation treats the entire contents of a file as a single pattern. To overcome this limitation, we modified the source code to enable processing multiple query patterns within a single file. In addition to the time required for answering count, we measured the time and memory required to construct each data structure.
Acceleration versus baseline
In this subsection, we compare \(\texttt {PFP-FM}\) against the standard FM-index in sdsl, varying the window size (w), the modulo value (p), and the length of the query pattern. We calculated the number of count queries performed per CPU second by \(\texttt {PFP-FM}\) versus the standard FM-index, and generated heatmaps illustrating this ratio for query pattern lengths of 125, 250, 500, and 1000, for both the SARS-CoV-2 collections and the GRCh38 human reference genome. Figure 3 shows the resulting heatmaps for SARS100K. As shown in this figure, \(\texttt {PFP-FM}\) was between 2.178 and 2.845 times faster than the standard FM-index with the optimal values of w and p. In particular, optimal performance gains of 2.6, 2.3, 2.2, and 2.9 were observed for pattern lengths of 125, 250, 500, and 1000, respectively, with corresponding (w, p) pairs of (6, 50), (6, 30), (8, 50), and (8, 50). Additionally, as depicted in Fig. 4, which focuses on the GRCh38 dataset, \(\texttt {PFP-FM}\) was between 1.672 and 2.943 times faster than the standard FM-index when optimal values of w and p were used. For pattern lengths of 125, 250, 500, and 1000, the acceleration factors achieved by \(\texttt {PFP-FM}\) were 1.96, 1.81, 2.45, and 2.94, with (w, p) pairs of (4, 50), (8, 50), (4, 50), and (6, 40), respectively. As detailed in Table 1, these outcomes were obtained with comparable memory usage and construction time.
Results on SARS-CoV-2 genomes
We used the optimal parameters obtained from the previous experiment for this section. We constructed the index with these parameters for each SARS-CoV-2 dataset and measured the time needed to perform 1000 count queries with all competing methods and \(\texttt {PFP-FM}\). We illustrate the results of this experiment in Fig. 5, where \(\texttt {PFP-FM}\) consistently exhibits the lowest time consumption. For the SARS25K dataset, the time consumption of FIGISS was between 451% and 568% higher than that of our method, while the time consumption of RLCSA and RLFM was between 780% and 1598%, and between 842% and 1705%, higher than that of \(\texttt {PFP-FM}\), respectively. FIGISS outperforms RLFM and RLCSA on the SARS25k dataset; however, on the larger datasets FIGISS and RLCSA converge in performance, with neither method substantially better than the other. In addition, on the larger datasets, RLFM performed better than RLCSA and FIGISS when the query pattern length was 125 or 250, but was slower for the other query lengths. Hence, it is clear that \(\texttt {PFP-FM}\) accelerates count over all state-of-the-art methods.
The gap in performance between \(\texttt {PFP-FM}\) and the competing methods increased with the dataset size. For SARS50K, FIGISS, RLCSA and RLFM were between 3.65 and 13.44, 3.65 and 16.08, and 4.25 and 12.39 times slower, respectively. For SARS100K, FIGISS, RLCSA and RLFM were between 2.81 and 3.86, 2.45 and 4.59, and 1.96 and 10.75 times slower, respectively.
Next, we consider the time and memory required for construction, which are given in Table 1. Our experiments revealed that all methods used less than 60 GB of memory on all SARS-CoV-2 datasets; \(\texttt {PFP-FM}\) used the most memory, with a peak of 54 GB on the SARS100K dataset. Yet \(\texttt {PFP-FM}\) exhibited the fastest construction time for the FM-index across all datasets, and this gap grew with the size of the dataset. More specifically, for the SARS100K dataset, \(\texttt {PFP-FM}\) used 71.04%, 65.81%, and 73.41% less time than the other methods. In summary, \(\texttt {PFP-FM}\) significantly accelerated count and had the fastest construction time, and all methods used less than 60 GB, which is available on most commodity servers. In comparison with Bowtie [4] and Bowtie 2 [22, 24], our study finds a notable trade-off on highly repetitive datasets: while Bowtie and Bowtie2 are more memory-efficient, they require significantly more processing time. In our experiments, Bowtie required at least ten times longer to construct the index than our approach. We note that these methods have significantly greater capabilities than ours, so this comparison should be approached with caution.
We observe in these experiments that \(\texttt {PFP-FM}\) has a moderately larger index than the other methods. This can be attributed to \(\texttt {PFP-FM}\) storing the FM-index for both the string \(\texttt {S}\) and the parse \(\text{ P }\), the suffix array (\(\text{ SA }\)) for \(\texttt {S}\), and a bitvector B; of these, the \(\text{ SA }\) contributes the most to the overall size. To mitigate this, we substituted the \(\text{ SA }\) with a compressed suffix array (\(\texttt {CSA}\)), as provided by sdsl-lite [12]. This substitution significantly reduces the index size; we designate the resulting approach \(\texttt {PFP-FM-CSA}\). As indicated in Table 1, on the SARS-CoV-2 datasets \(\texttt {PFP-FM-CSA}\) requires more memory and time for construction, but it reduces the index size to one third of that of the original \(\texttt {PFP-FM}\).
We next assessed the query times of \(\texttt {PFP-FM-CSA}\) in comparison with the other methods, shown in Fig. 6. For the GRCh38 dataset, we did not record data because the computation time of \(\texttt {PFP-FM-CSA}\) exceeded two days. On the other datasets, \(\texttt {PFP-FM-CSA}\) showed superior performance, outpacing all other methods; relative to \(\texttt {PFP-FM}\), it was faster by 39.31% and 29.28%.
Thus, in conclusion, we see a trade-off between memory usage and construction time, and note that this work contributes to the growing interest in indexing data structures in bioinformatics. As novel construction algorithms for compressed suffix arrays are developed, they can be integrated into our method.
Results on human reference genome
After measuring the time and memory required to construct the data structures of all methods on the GRCh38 dataset, we observed that \(\texttt {PFP-FM}\) exhibited the second fastest construction time but used the most construction space (71 GB vs. 26 GB to 45 GB). More specifically, \(\texttt {PFP-FM}\) constructed the index between 1.25 and 1.6 times faster than FIGISS and RLFM.
Next, we compare the performance of \(\texttt {PFP-FM}\) against the other methods by performing 1000 count queries on this dataset, and illustrate the results in Fig. 7. Our findings demonstrate that \(\texttt {PFP-FM}\) consistently outperforms all other methods. RLCSA performs better than RLFM and FIGISS when the pattern length is over 125, but is still 3.9, 6.2, 3.4, and 7.1 times slower than \(\texttt {PFP-FM}\). Meanwhile, RLFM exhibits a steady increase in time usage, and is 2.9, 14.2, 12.8, and 18.07 times slower than \(\texttt {PFP-FM}\). It is worth noting that the FIGISS grammar is less efficient on non-repetitive datasets, as demonstrated by Akagi et al. [25], which explains its worse performance on GRCh38 than on the SARS100K dataset: FIGISS is 10.1, 25.5, 13.6, and 14.8 times slower than \(\texttt {PFP-FM}\). These results are in line with our previous results, and demonstrate that \(\texttt {PFP-FM}\) has competitive construction memory and time while achieving a significant acceleration. Additionally, Bowtie is relatively more efficient on non-repetitive datasets; despite its minimal memory requirements and smaller index size, however, its query processing remains notably time-consuming.
Conclusion and future work
Hong et al. [26] recently gave a method for computing LZ77 parses quickly using PFP and, at least in theory, we can do the same. The key idea is that storing a data structure supporting range-minimum queries (RMQs) over the suffix array makes an FM-index for the string S partially persistent: to check whether a pattern Q occurs in \(S [0..i - 1]\) for some given i, we can search for Q in the FM-index for S, perform an RMQ over the suffix-array interval for Q, and check that the smallest value j such that \(S [j..j + |Q| - 1] = Q\) satisfies \(j + |Q| - 1 \le i - 1\). In fact, if we store the FM-index for the reverse of S, use range-maximum queries instead of range-minimum queries, and check the suffix array after searching for each character of Q, then we can efficiently find the longest prefix of Q that occurs in \(S [0..i - 1]\), which allows us to compute the LZ77 parse of S efficiently. We are now working to determine how this approach compares to Hong et al.’s.
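The persistence check described above can be sketched in a few lines. This is a brute-force illustration under our own naming assumptions: a real implementation would obtain the suffix-array interval from backward search in the FM-index and answer the minimum with a succinct RMQ structure, rather than building a naive suffix array and scanning the interval.

```python
from bisect import bisect_left, bisect_right

def suffix_array(s):
    """Naive O(n^2 log n) suffix array; a stand-in for a real construction."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurs_in_prefix(s, q, i):
    """Does q occur entirely inside s[0..i-1]?

    Finds the suffix-array interval of suffixes starting with q, takes the
    minimum starting position j in that interval (an RMQ structure would
    answer this in O(1)), and checks j + |q| <= i."""
    sa = suffix_array(s)
    suffixes = [s[j:] for j in sa]
    lo = bisect_left(suffixes, q)
    hi = bisect_right(suffixes, q + "\x7f")  # assumes alphabet below 0x7f (fine for DNA)
    if lo >= hi:
        return False                         # q does not occur in s at all
    j = min(sa[lo:hi])                       # leftmost occurrence of q
    return j + len(q) <= i
```

For example, in \(\texttt {GATTAGATA}\) the pattern \(\texttt {TA}\) first occurs at position 3, so it lies inside the prefix of length 5 but not inside the prefix of length 4.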
Notice that, when the query pattern Q starts and ends with trigger strings, we can perform the whole search in the FM-index for the parse P and need not use the FM-index for the string S at all. (See also the related discussion in the arXiv preprint [27].) In fact, we are now working to replace the FM-index for S by other data structures in all cases. If Q starts with a trigger string but does not end with one, then instead of searching in the FM-index for S, we can find the lexicographic range of phrases in the dictionary \(\text{ D }\) starting with the suffix of Q that starts at the start of the rightmost trigger string; once we have that range, we can begin the search in the FM-index for P with the corresponding range in the BWT of P. Finally, if Q neither starts nor ends with a trigger string, then we can use 2-dimensional range reporting on a grid with a point (x, y) whenever the colexicographically xth phrase in \(\text{ D }\) appears in P before the lexicographically yth suffix of P (with the phrase and suffix overlapping at the trigger string). Specifically, we

1. find the lexicographic range of phrases in \(\text{ D }\) starting with the suffix of Q that starts at the start of the rightmost trigger string,

2. start a search in the FM-index for P from the corresponding range in the BWT of P,

3. find the colexicographic range of phrases in \(\text{ D }\) ending with the prefix of Q that ends at the end of the leftmost trigger string, and

4. use 2-dimensional range search on the grid to find all the substrings T of S in which the prefix of T ending at the end of T’s leftmost trigger string matches the corresponding prefix of Q, and the suffix of T starting at the start of T’s leftmost trigger string matches the corresponding suffix of Q, meaning that T matches Q.
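The 2-dimensional range reporting primitive that step 4 relies on can be illustrated in miniature. Each point (x, y) pairs a colexicographic phrase rank with a lexicographic suffix rank; a query asks for all points inside an axis-aligned rectangle given by the two ranges from steps 1 and 3. Real indexes answer such queries with a wavelet tree or range tree; this brute-force version, with made-up points, only shows the query semantics.

```python
# Toy 2-dimensional range reporting: report all points in a rectangle.

def range_report(points, x_range, y_range):
    """Return, sorted, the points (x, y) with x in x_range and y in y_range
    (both ranges inclusive)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return sorted((x, y) for (x, y) in points
                  if x_lo <= x <= x_hi and y_lo <= y <= y_hi)
```

A wavelet tree over the same point set answers each such query in time roughly logarithmic in the grid size per reported point, which is what makes this approach to replacing the FM-index for S plausible in practice.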
Availability of data and materials
In our experiments, two datasets were employed: GRCh38 and SARS-CoV-2. The GRCh38 dataset can be accessed at https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/, while the SARS-CoV-2 data are available at https://www.ncbi.nlm.nih.gov/nuccore/1798174254.
References
Ferragina P, Fischer J. Suffix arrays on words. In: Ma B, Zhang K, editors. Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching (CPM). London: Springer; 2007. p. 328–39.
Deng JJ, Hon WK, Köppl D, Sadakane K. FM-indexing grammars induced by suffix sorting for long patterns. In: Proceedings of the IEEE Data Compression Conference (DCC). Snowbird: IEEE; 2022. p. 63–72.
Ferragina P, Manzini G. Indexing compressed text. J ACM. 2005;52:552–81.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint. 2013. https://doi.org/10.48550/arXiv.1303.3997.
Mäkinen V, Navarro G. Succinct suffix arrays based on run-length encoding. In: Apostolico A, Crochemore M, Park K, editors. Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM). Jeju Island: Springer; 2005. p. 45–56.
Gog S, Kärkkäinen J, Kempa D, Petri M, Puglisi SJ. Fixed block compression boosting in FM-indexes: theory and practice. Algorithmica. 2019;81:1370–91.
Gagie T, Navarro G, Prezza N. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J ACM. 2020;67(1):1–54.
Nunes DSN, Louza FA, Gog S, Ayala-Rincón M, Navarro G. A grammar compression algorithm based on induced suffix sorting. In: 2018 Data Compression Conference (DCC). Snowbird: IEEE; 2018. p. 42–51. https://doi.org/10.1109/DCC.2018.00012.
Nong G, Zhang S, Chan WH. Two efficient algorithms for linear time suffix array construction. IEEE Trans Comput. 2011;60(10):1471–84. https://doi.org/10.1109/TC.2010.188.
Crescenzi P, Lungo AD, Grossi R, Lodi E, Pagli L, Rossi G. Text sparsification via local maxima. Theor Comput Sci. 2003;304(1–3):341–64.
Gog S, Beller T, Moffat A, Petri M. From theory to practice: plug and play with succinct data structures. In: Proceedings of the 13th Symposium on Experimental Algorithms (SEA). Copenhagen: Springer; 2014. p. 326–37.
Sirén J. Compressed suffix arrays for massive data. In: Karlgren J, Tarhio J, Hyyrö H, editors. Proceedings of the 16th International Symposium on String Processing and Information Retrieval (SPIRE). Berlin: Springer; 2009. p. 63–74.
Mäkinen V, Navarro G. Run-length FM-index. In: Proceedings of the DIMACS Workshop “The Burrows–Wheeler Transform: Ten Years Later”. 2004. p. 17–19.
Manber U, Myers EW. Suffix arrays: a new method for on-line string searches. SIAM J Comput. 1993;22(5):935–48.
Boucher C, Gagie T, Kuhnle A, Langmead B, Manzini G, Mun T. Prefix-free parsing for building big BWTs. Algorithms Mol Biol. 2019;14(1):13.
Boucher C, Gagie T, Kuhnle A, Manzini G. Prefix-free parsing for building big BWTs. In: Algorithms in Bioinformatics (WABI); 2018. https://doi.org/10.1186/s13015-019-0148-5.
Mun T, Kuhnle A, Boucher C, Gagie T, Langmead B, Manzini G. Matching reads to many genomes with the r-index. J Comput Biol. 2020;27(4):514–8. https://doi.org/10.1089/cmb.2019.0316.
Oliva M, Gagie T, Boucher C. Recursive prefix-free parsing for building big BWTs. In: 2023 Data Compression Conference (DCC). Snowbird: IEEE; 2023. p. 62–70.
Golynski A. Optimal lower bounds for rank and select indexes. In: Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). Springer; 2006. p. 370–81. https://doi.org/10.1007/11786986_33.
Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, et al. Sustainable data analysis with Snakemake. F1000Research. 2021;10:33.
Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics. 2018;35(3):421–32. https://doi.org/10.1093/bioinformatics/bty648.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Akagi T, Köppl D, Nakashima Y, Inenaga S, Bannai H, Takeda M. Grammar index by induced suffix sorting. In: Lecroq T, Touzet H, editors. Proceedings of the 28th International Symposium on String Processing and Information Retrieval (SPIRE). Cham: Springer; 2021. p. 85–99.
Hong A, Rossi M, Boucher C. LZ77 via prefix-free parsing. In: Proceedings of the Symposium on Algorithm Engineering and Experiments (ALENEX). Philadelphia: SIAM; 2023. p. 123–34.
Deng JJ, Hon WK, Köppl D, Sadakane K. FM-indexing grammars induced by suffix sorting for long patterns. arXiv preprint; 2021.
Funding
The funding details for the authors are as follows: 1. Aaron Hong and Marco Oliva: Supported by the NIH/NHGRI grant R01HG011392 to Ben Langmead. Supported by the NSF/BIO grant DBI2029552 to Christina Boucher. 2. Dominik Köppl: Supported by the JSPS KAKENHI Grant Numbers JP21K17701, JP22H03551, and JP23H04378. 3. Hideo Bannai: Supported by the JSPS KAKENHI Grant Number JP20H04141. 4. Christina Boucher: Supported by the NIH/NHGRI grant R01HG011392 to Ben Langmead. Supported by the NSF/BIO grant DBI2029552 to Christina Boucher. Supported by the NSF/SCH grant INT2013998 to Christina Boucher. Supported by the NIH/NIAID Grant R01AI14180 to Christina Boucher. 5. Travis Gagie: Supported by the NIH/NHGRI grant R01HG011392 to Ben Langmead. Supported by the NSERC grant RGPIN071852020 to Travis Gagie. Supported by the NSF/BIO grant DBI2029552 to Christina Boucher.
Author information
Authors and Affiliations
Contributions
AH and MO implemented the software, downloaded the data, and ran all the experiments. CB and TG oversaw the software development and experiments. AH, MO, TG, and CB drafted the paper. All authors helped write and edit the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hong, A., Oliva, M., Köppl, D. et al. PFP-FM: an accelerated FM-index. Algorithms Mol Biol 19, 15 (2024). https://doi.org/10.1186/s13015-024-00260-8