Fast phylogenetic inference from typing data

Background Microbial typing methods are commonly used to study the relatedness of bacterial strains. Sequence-based typing methods are a gold standard for epidemiological surveillance due to the inherent portability of sequence and allelic profile data, fast analysis times and their capacity to create common nomenclatures for strains or clones. This led to development of several novel methods and several databases being made available for many microbial species. With the mainstream use of High Throughput Sequencing, the amount of data being accumulated in these databases is huge, storing thousands of different profiles. On the other hand, computing genetic evolutionary distances among a set of typing profiles or taxa dominates the running time of many phylogenetic inference methods. It is important also to note that most of genetic evolution distance definitions rely, even if indirectly, on computing the pairwise Hamming distance among sequences or profiles. Results We propose here an average-case linear-time algorithm to compute pairwise Hamming distances among a set of taxa under a given Hamming distance threshold. This article includes both a theoretical analysis and extensive experimental results concerning the proposed algorithm. We further show how this algorithm can be successfully integrated into a well known phylogenetic inference method, and how it can be used to speedup querying local phylogenetic patterns over large typing databases.


Introduction
The evolutionary relationships between different species or taxa are usually inferred through known phylogenetic analysis techniques. Some of these techniques rely on the inference of phylogenetic trees, which can be computed from DNA or Protein sequences, or from allelic profiles where the sequences of defined loci are abstracted to categorical indexes. The most popular method is Mul-tiLocus sequence typing (MLST) [1] that typically uses seven 450-700 bp fragments of housekeeping genes for a given species. Phylogenetic trees are also used in other contexts, such as to understand the evolutionary history of gene families, to allow phylogenetic foot-printing, to trace the origin and transmission of infectious diseases, or to study the co-evolution of hosts and parasites [2,3].
In traditional phylogenetic methods, the process of phylogenetic inference starts with a multiple alignment of the sequences under study that is then corrected using models of DNA or Protein evolution. Tree-building methodologies can then be applied on the resulting distance matrix. These methods rely on some distancebased analysis of sequences or profiles [4].
Distance-based methods for phylogenetic analysis rely on a measure of genetic evolution distance, which is often defined directly or indirectly from the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches. A first step of these methods is to compute this distance between all pairs of sequences. The simplest approach is to use the Hamming distance, also known as observed p-distance, defined as the number of positions at which two aligned sequences differ. Note that the Hamming distance between two

Open Access
Algorithms for Molecular Biology *Correspondence: aplf@ist.utl.pt 3 INESC-ID Lisboa, Rua Alves Redol 9, 1000-029 Lisboa, Portugal Full list of author information is available at the end of the article sequences underestimates their true evolutionary distance and, thus, a correction formula based on some model of evolution is often used [2,4]. Although distance-based methods not always produce the best tree for the data, usually they also incorporate an optimality criterion into the distance model for getting more plausible phylogenetic reconstructions, such as the minimum evolution criterion [5], the least squares criterion [6] or the clonal complexes expansion and diversification [7]. Nevertheless, this category of methods are much faster than Maximum likelihood or Bayesian inference methods [8], making them excellent choices for the primary analysis of large data sets.
Most of the distance-based methods are agglomerative methods. They start with each sequence being a singleton cluster and, at each step, they join two clusters. The iterative process stops when all sequences are part of a single cluster, resulting in a phylogenetic tree. At each step the candidate pair is selected taking into account the distance among clusters as well as the optimality criterion chosen to adjust it.
The computation of a distance matrix (2D array containing the pairwise distances between the elements of a set) is a common first step for distance-based methods, such as eBURST [9], goeBURST [10], Neighbor Joining [11] and UPGMA [12]. This particular step dominates the running time of most methods, taking �(md 2 ) time in general, d being the number of sequences or profiles and m the length of each sequence or profile. For largescale datasets this running time may be quite problematic. And nowadays, with the mainstream use of High Throughput Sequencing, the amount of data being accumulated in typing databases is huge. It is common to find databases storing thousands of different profiles for a single microbial species, with each profile having thousands of loci [13,14].
However, depending on application, on the underlying model of evolution and on the optimality criterion, it may not be strictly necessary to be aware of the complete distance matrix. There are methods that continue to provide optimal solutions without a complete matrix. For such methods, one may still consider a truncated distance matrix and several heuristics, combined with final local searches through topology rearrangements, to improve the running time [6]. The goeBURST algorithm, one of our use cases in this article, is an example of a method that can work with truncated distance matrices by construction, i.e., one needs only to know which pairs are at Hamming distance at most k.

Our results
We propose here an average-case O(md)-time and O(md) -space algorithm to compute the pairs of sequences, among d sequences of length m, that are at distance at most k, when k < (m−k−1)·log σ log md , where σ is the size of the sequences alphabet. We support our result with both a theoretical analysis and an experimental evaluation on synthetic and real datasets of different data types (MLST, cgMLST, wgMLST and SNP). We further show that our method improves goeBURST, and that we can use it to speedup querying local phylogenetic patterns over large typing databases.
A preliminary version of this paper was presented at the Workshop on Algorithms in Bioinformatics (WABI) 2017 [15].

Closest pairs in linear time
Let P be the set of profiles (or sequences) each of length m, defined over an integer alphabet , (i.e., � = {1, . . . , m O(1) }), with d = |P| and σ = |�|. Let also H : P × P → {0, . . . , m} be the function such that H(u, v) is the Hamming distance between profiles u, v ∈ P. Given an integer threshold 0 < k < m, the problem is to compute all pairs u, v ∈ P such that H(u, v) ≤ k, and the corresponding H(u, v) value, faster than the �(md 2 ) time required to compute naïvely the complete distance matrix for the d profiles of length m.
We address this problem by indexing all profiles P using the suffix array (denoted by SA) and the longest common prefix (denoted by LCP) array [16]. We rely also on a range minimum queries (RMQ) data structure [17,18] over the LCP array (denoted by RMQ LCP ). The problem is then solved in three main steps: 1. Index all profiles using the SA data structure. 2. Enumerate all candidate profile pairs given the maximum Hamming distance k. 3. Verify each candidate profile pair by checking if the associated Hamming distance is no more than k. Table 1 summarizes the data structures and strategies followed in each step. Profiles are concatenated and indexed using SA. Depending on the strategy to be used, we further process the SA and build the LCP array and pre-process it for fast RMQ. This allows for enumerating candidate profile pairs and computing distances faster. In what follows, we detail the above steps and show how the data structures are used to improve the overall running time.  13:4 Step 1: Profile indexing Profiles are concatenated and indexed in an SA in O(md) time and space [19,20]. Let us denote this string by s. Since we only need to compute the distances between profiles that are at Hamming distance at most k, we can conceptually split each profile into k non-overlapping blocks of length L = ⌊ m k+1 ⌋ each. It is then folklore knowledge that if two profiles are within distance k, they must share at least one such block of length L. Our approach is based on using the SA of s to efficiently identify matching blocks among profile pairs. This lets us quickly filter in candidate profile pairs and filter out the ones that can never be part of the output.

Step 2: Candidate profile pairs enumeration
The candidate profile pairs enumeration step provides the pairs of profiles that do not differ in more than k positions, but it may include spurious pairs. Since SA is an ordered structure, a simple solution is to use a binary search approach. For each block of each profile, we can obtain in O(L log n) time, where n = md, all the suffixes that have that block as a prefix. If a given match is not aligned with the initial block, i.e., it does not occur at the same position in the respective profile, then it should be discarded. Otherwise, a candidate profile pair is reported. This searching procedure is done in O(dkL log n) = O(n log n) time.
Another solution relies on computing the LCP array: the longest common prefix between each pair of consecutive elements within the SA. This information can also be computed in O(n) time and space [21]. Since SA is an ordered structure, for the contiguous suffixes s i , s i+1 , s i+2 of s, with 0 ≤ i < n − 2, we have that the common prefix between s i and s i+1 is at least as long as the common prefix of s i and s i+2 . By construction, it is possible to get the position of each suffix in the corresponding profile in constant time. Then, we cluster the corresponding profiles of contiguous pairs if they have an LCP value greater than or equal to L and they are also aligned. This clustering procedure can be done in O(kd 2 ) time.

Step 3: Pairs verification
After getting the set of candidate profile pairs, a naïve solution would be to compute the distance for each pair of profiles by comparing them in linear time, i.e., O(m) time. However, if we compute the LCP array of s, we can then perform a sequence of O(k) RMQ over the LCP array for checking if a pair of profiles is at distance at most k. These RMQ over the LCP array correspond to longest common prefix queries between a pair of suffixes of s. Since after a linear-time pre-processing over the LCP array, RMQ can be answered in constant time per query [17], we obtain a faster approach for computing the distances. This alternative approach takes O(k) time to verify each candidate profile pair instead of O(m) time.

Average-case analysis
Algorithm 1 below details the solution based on LCP clusters; and Theorem 1 shows that this algorithm runs in linear time on average using linear space. We rely here on well-known results concerning the linear-time construction of the SA [19,20] and the LCP array [21], as well as the linear-time pre-processing for the RMQ data structure [18].
In what follows, LCP[i], i > 0, stores the length of the longest common prefix of suffixes s i−1 and s i of s, and RMQ LCP (i, j) returns the index of the smallest element in the subarray LCP[i . . . j] in constant time [18]. We rely also on some auxiliary subroutines; let L = ⌊ m k+1 ⌋: , the starting position of the suffix s i within a profile. Then this subroutine returns ℓ/L if ℓ is multiple of L, and −1 otherwise. HD( p i , p j , ℓ) Given two profiles p i and p j which share a substring of length L, starting at index ℓL, this subroutine computes the minimum of k and the Hamming distance between p i and p j . This subroutine relies on RMQ LCP to find matches between p i and p j and, hence, it runs in O(k) time since it can terminate after k mismatches.

Theorem 1 Given d profiles of length m each over an integer alphabet of size
required for constructing the SA and the LCP arrays for s and the RMQ data structure over the LCP array is O(md).
Let us denote by B the total number of blocks over s and by L the block length. We set L = ⌊ m k+1 ⌋ and thus we have that B = d⌊ m L ⌋. Let us also denote by C a maximal set of indices over x satisfying the following: 1. The length of the longest common prefix between any two suffixes of s starting at these indices is at least L; 2. both of these suffixes start at the starting position of a block; Algorithm 1: Algorithm using LCP clusters.
1 Input: A set P of d profiles of length m each; an integer threshold 0 < k < m.
2 Output: The set X of distinct pairs of profiles that are at Hamming distance at most k, i.e., be the string of length n = md obtained after concatenating the d profiles, and L = m k+1 . Construct the SA S for s, the LCP array for s and RMQ LCP . Initialize a hash table HT to track verified pairs. 4 Candidate pairs enumeration: Pairs enumeration: 3. and both indices correspond to the starting position of the ith block in their profiles.
This can be done in O(md) time using the LCP array (lines [7][8][9][10][11][12][13][14][15][16][17]. Processing all such sets C (lines 21-27) requires total time where PROC i,j is the time required to process a pair i, j of elements of a set C, and Pairs is the sum of |C| 2 over all such sets C. We have that PROC i,j = O(k) by using RMQ over the LCP array. Additionally, by the stated assumption on the d profiles, the expected value for Pairs is no more than Bd σ L : we have B blocks in total and each block can only match at most d other blocks by the conditions above. Hence, the algorithm requires on average the following running time Let us analyze this further to obtain the relevant condition on k. We have the following: Since 0 < k < m by hypothesis, we have the following: By some simple rearrangements we have that: Consequently, in the case when the algorithm requires O(md) time on average. The extra space usage is clearly O(md).

Use case 1: goeBURST algorithm
The distance matrix computation is a main step in distance-based methods for phylogenetic inference. This step dominates the running time of most methods, taking �(md 2 ) time, for d sequences of length m, since it must compute the distance among all sequence pairs. But for some methods, or when we are only interested in local phylogenies for sequences or profiles of interest, one does not need to know all pairwise distances for reconstructing a phylogenetic tree. The problem addressed in this article was motivated by the goeBURST algorithm [10], our use case 1. goeBURST is one of such methods for which one must know only the pairs of sequences that are at Hamming distance at most k. The solution proposed here can however be extended to other distancebased phylogenetic inference methods, that rely directly or indirectly on Hamming distance computations. Note that most methods either consider the Hamming distance or its correction accordingly to some formula based on some model of evolution [2,4]. In both cases we must start by computing the Hamming distance among sequences, but not necessarily all of them [6]. The underlying model of goeBURST is as follows: a given genotype increases in frequency in the population as a consequence of a fitness advantage or of random genetic drift, becoming a founder clone in the population; and this increase is accompanied by a gradual diversification of that genotype, by mutation and recombination, forming a cluster of phylogenetic closely-related strains. This diversification of the "founding" genotype is reflected in the appearance of genetic profiles differing only in one housekeeping gene sequence from this genotype-single locus variants (SLVs). Further diversification of those SLVs will result in the appearance of variations of the original genotype with more than one difference in the allelic profile, e.g., double and triple locus variants (DLVs and TLVs).
The problem solved by goeBURST can be stated as a graphic matroid optimization problem and, hence, it follows a classic greedy approach [22]. Given the maximum Hamming distance k, we can define a graph G = (V , E), where V = P (set of profiles) and The main goal of goe-BURST is then to compute a minimum spanning forest for G taking into account the distance H and a total order on links. It starts with a forest of singleton trees (each sequence/profile is a tree). Then it constructs the optimal forest by adding links connecting profiles in different trees in increasing order accordingly to the total order, similarly to what is done in the Kruskal's algorithm [23]. In the current implementation, a total order for links is implicitly defined based on the distance between sequences, on the number of SLVs, DLVs, TLVs, on the occurrence frequency of sequences, and on the assigned sequence identifier. With this total order, the construction of the tree consists of building a minimum spanning forest in a graph [23], where each sequence is a node and the link weights are defined by the total order. By construction, the pairs at distance δ will be joined before the pairs at distance δ + 1.

Use case 2: querying typing databases
A related problem is querying typing databases for similar typing profiles. Given a set P of d profiles of length m each, a profile u not necessarily in P but with the same length m as those in P, and k such that 0 < k < m, the problem is to find all profiles v ∈ P such that H (u, v) ≤ k . One may be also interested on local phylogenetic patterns, but those can be inferred from found profiles using for instance the goeBURST algorithm.
Once we define the value for k, we can address this problem as follows. We index all d profiles in the database as before in linear time O(md), and given a query profile u, we enumerate all candidate profiles v. We then verify as before all candidate pairs and we return only those satisfying H(u, v) ≤ k.
For indexing set P, we make use of the suffix tree data structure. The suffix tree T (x) of a string x is a compact trie representing all suffixes of x. It is known that the suffix tree of a string of length n, over an integer alphabet, can be computed in time and space O(n) [24]. For integer alphabets, in order to access the children of an explicit node of the suffix tree by the first letter of their edge label in O(1) time, we make use of perfect hashing [25].
By using the suffix tree we find candidate matches through forward search: spelling blocks of u from the root. Specifically, given the k + 1 non-overlapping blocks of length L = ⌊ m k+1 ⌋ of u, we search (without reporting) for each one of them in O(L) time. Since we have k + 1 blocks, it takes O(kL) = O(m) time to search for all k + 1 blocks of u. Finally, we can verify and report all candidate profiles v ∈ P as detailed in Algorithm 2.
Although, in the worst case, Algorithm 2 runs in time O(md + m log md), as we may have d matches at most, we can prove a similar average case as in Theorem 1. By the stated assumption on the profiles, the expected value for the number of profiles matching u is no more than B σ L : we have B blocks in total and each block can only match at most one other block in u (since they must be aligned; line 8).

Algorithm 2:
Moreover, since we are not relying on the LCP array in this case (profile u is not indexed), the verification step (line 12) takes O(m) time using letter comparisons. Hence, the algorithm requires on average the following running time Let us analyze this further to obtain the relevant condition on k. We have the following: By some simple rearrangements we have that: Consequently, in the case when the algorithm requires O(m) time on average. This algorithm was implemented using a suffix array and then integrated in INNUENDO Platform, which is publicly available [24]. The INNUENDO Platform is an infrastructure that provides the required framework for data analyses from bacterial raw reads sequencing data quality insurance to the integration of epidemiological data and visualization. As such, rapid methods for classification and search for closely related strains are a necessity for quick navigation through the platform database entries. More information about the project can be found at its website [25].
As a starting point and for the purpose of this study, a subset of 2312 wgMLST profiles of Escherichia coli retrieved from Enterobase [13] were included in the INNUENDO database as well as their ancillary data and predefined core-genome cluster classification. Two tabseparated files containing the wgMLST and cgMLST profiles for the E. coli strains were also created to allow storing information on the currently available profiles and for updating with profiles that will become available upon the platform analyses.
One of two index files are used depending on the type of search we want to perform: classification or search for k-closest. The cgMLST index file is used for strain classification, which relies on a nomenclature designed for the cgMLST profiles. As such, and since a pre-classification was performed on the database of E. coli strains, we continued using it for comparison purposes. However, when searching for the k-closest profiles, we take into consideration all targets available in the wgMLST profiles using the wgMLST index file for a higher discriminatory power. Each time a new profile is generated from the platform, it requires classification. The INNUENDO Platform performs the classification step based on the approach described in our "Use case 2: querying typing databases" with a given maximum of k differences over core genes. It uses the cgMLST index file for the search since the classification is constructed based on those number of loci. If the method returns at least one match, it classifies the new profile with the classification of the closest. If not, a new classification is assigned. A new entry is then added to the INNUENDO database as well as to the cgMLST and wgMLST profiles files and the index files are updated.
In the case of the search for the k-closest, it is useful to define the input data for visualization methods according to a defined number of differences on close strains. For each profile used as input for the search, the method searches for the k-closest strains considering at most k differences among all wgMLST loci. Since duplicate matches can occur between the profiles used for each search, the final file used as input for the visualization methods is the intersection of the results of the k-closest profiles between each input strain. The set of strain identifiers are then used to query the INNUENDO database to get the profiles and ancillary data to be sent to PHY-LOViZ Online [26] for further analysis, namely with the goeBURST algorithm.
The drawback of using this method for classification and search is the need for rebuilding the index each time there is a new profile, which will depend on the number of profile entries on the database. Nevertheless, the number of updates is rather smaller compared to the number of queries and the index can be build in the background, with search functionalities still using the old index during the process. In our implementation, the index and related data structures are serialized in secondary memory and they are accessed by mapping them into memory. The implementation of the underlying tool is made publicly available [27].
The above described approaches in combination with the features offered by the INNUENDO Platform allow microbiologists to quickly and efficiently search for strains close to their strain of interest, allowing a more targeted, focused and simple visualization of results.

Experimental evaluation
We evaluated the proposed approach to compute the pairs of profiles at distance at most k using both real and synthetic datasets. We used real datasets obtained through different typing schemas, namely whole-genome multi-locus sequence typing (wgMLST) data, coregenome multi-locus sequence typing (cgMLST) data, and single-nucleotide polymorphism (SNP) data. Table 2 summarizes the real datasets used. We should note that wgMLST and cgMLST datasets contain sequences of integers, where each column corresponds to a locus and different values in the same column denote different alleles. Synthetic datasets comprise sets of binary sequences of variable length, uniformly sampled, allowing us to validate our theoretical findings. We implemented both versions described above in the C programming language: one based on binary search over the SA; and another one based on finding clusters in the LCP array. Since allelic profiles can be either string of letters or sequences of integers, we relied on libdivsufsort library [28] and qsufsort code [29,30], respectively. For RMQ over the LCP array, we implemented a fast well-known solution that uses constant time per query and linearithmic space for pre-processing [17].
All tests were conducted on a machine running Linux, with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40 GHz (8 cores, cache 32 KB/4096 KB) and with 32 GB of RAM. All binaries where produced using GCC 5.3 with full optimization enabled.

Synthetic datasets
We first present results with synthetic data for different values of d, m and k. All synthetic sequences are binary sequences uniformly sampled. Results presented in this section were averaged over ten runs and for five different sets of synthetic data.
The bound proved in Theorem 1 was verified in practice. For k satisfying the conditions in Theorem 1, the running time of our implementation grows almost linearly with n, the size of the input. We can observe in Fig. 1 a growth slightly above linear. Since we included the time for constructing the SA, the LCP array and the RMQ data structure, with the last one in linearithmic time, that was expected.
We also tested our method for values of k exceeding the bound shown in Theorem 1. For d = m = 4096 and a binary alphabet, the bound for k given in Theorem 1 is no more than ⌊m/(2 log m)⌋ = 170. For k above this bound we expect that proposed approaches are no longer competitive with the naïve approach. As shown in Fig. 2, for k > 250 and k > 270 respectively, both limits above the predicted bound, the running time for both computing pairwise distances by finding lower and higher bounds in the SA, and by processing LCP based clusters, becomes slower than the running time of the naïve approach.
In Fig. 3 we have the running time as a function of the number d of profiles, for different values of m and for k satisfying the bound given in Theorem 1. The running time for the naïve approach grows quadratically with d, while it grows linearly for both computing pairwise distances by finding lower and higher bounds in the SA, and by processing LCP based clusters. Hence, for synthetic data, as described by Theorem 1, the result holds.

Real datasets
For each dataset in Table 2, we ranged the threshold k accordingly and compared the approaches discussed in "Methods" section with the naïve approach that computes the distance for all sequence pairs. Results are provided in Table 3.
In most cases, the approach based on the LCP clusters is the fastest up to two orders of magnitude compared to the naïve approach. As expected, in the case when data are not uniformly random, our method works reasonably well for smaller values of k than the ones implied by the bound in Theorem 1. As an example, the upper bound on k for C. jejuni would be around 200, but the running time for the naïve approach is already better for k = 64. We should note however that the number of candidate profile pairs at Hamming distance at most k is much higher than the expected number when data are uniformly random. This tells us that we can design a simple hybrid scheme that chooses a strategy (naïve or the proposed method) depending on the nature of the input data. It seems also to point out clustering effects on profile dissimilarities,   which we may exploit to improve our results. We leave both tasks as future work for the full version of this article.
We incorporated the approach based on finding lower and higher bounds in the SA in the implementation of goeBURST algorithm, discussed in "Methods" section. We did not incorporate the approach based on the LCP clusters as the running time did not improve much as observed above. Since running times are similar to those reported in Table 3, we discuss only the running time for C. jejuni. We need only to index the input once. We can then use the index in the different stages of the algorithm and for different values of k. In the particular case of goeBURST, we use the index twice: once for computing the number of neighbors at a given distance, used for untying links according to the total order discussed in the description of goeBURST algorithm in methods section, and a second time for enumerating pairs at distance below a given threshold. Note that the goeBURST algorithm does not aim to link all nodes, but to identify clonal complexes (or connected components) for a given threshold on the distance among profiles [10]. In the case of C. jejuni dataset, and for k = 52, the running time is around 36 s, while the naïve approach takes around 115 s, yielding a threefold speedup. In this case we get several connected components, i.e., several trees, connecting the most similar profiles. We provide the tree for the largest component in Fig. 4, where each node represents a profile. The nodes are colored according to one of the loci for which profiles in this cluster differ. Note that this tree is optimal with respect to the criterion used by the goeBURST algorithm, not being affected by the threshold on the distance. In fact, since this problem is a graphic matroid, the trees found for a given threshold will be always subtrees of the trees found for larger thresholds [22]. Comparing this tree with other inference methods is beyond the scope of this article; the focus here was on the faster computation of an optimal tree under this model.
In many studies, the computation of trees based on pairwise distances below a given threshold, usually small compared with the total number of loci, combined with ancillary data, such as antibiotic resistance and host information, allows microbiologists to uncover evolution patterns and study the mechanisms underlying the transmission of infectious diseases [31].

Conclusions
Most distance-based phylogenetic inference methods rely directly or indirectly on Hamming distance computations. The computation of a distance matrix is a common first step for such methods, taking �(md 2 ) time in general, with d being the number of sequences or profiles and m the length of each sequence or profile. For largescale datasets this running time may be problematic; however, for some methods, we can avoid to compute allpairs distances [6].
We addressed this problem when only a truncated distance matrix is needed, i.e., one needs to know only which pairs are at Hamming distance at most k. This problem was motivated by the goeBURST algorithm [10], which relies on a truncated distance matrix by construction.  4 The tree inferred for the largest connected component found with k = 52 for the C. jejuni dataset. Image produced by PHYLOViZ [35]