The majority of multiple sequence alignment (MSA) methods use some form of progressive alignment [1–7]. In progressive alignment, the usual first step is to compute a pair-wise distance matrix, which is then used to build a so-called guide tree that determines the order in which the input sequences are aligned. Computing the distance matrix requires *N*(*N* - 1)/2 pair-wise comparisons, *N* being the number of sequences. Construction of the guide tree usually has an additional time complexity of *O*(*N*^{2}) to *O*(*N*^{3}), depending on the algorithm used and its implementation. The complexity of these steps can become prohibitive when *N* becomes very large, e.g. when *N* is in the tens of thousands. Very few multiple alignment programs can handle datasets of this size, with MUSCLE and MAFFT being the most familiar [6, 7]. Some of the most accurate multiple sequence alignment methods can only routinely handle sequences numbering in the hundreds [4, 8, 9]. The explosive growth in the number of sequences coming from genomic studies means that the ability to cluster and align greater numbers of sequences is becoming ever more important. For example, the Ribosomal Database Project [10] Release 10 consists of more than a million sequences.

In order to make very large guide trees, the first issue is the sheer number of distance calculations. For example, with 100,000 sequences, we need to compute approximately 5 billion distances to construct a complete distance matrix as needed by standard implementations of Neighbor-Joining [11] or UPGMA [12]. Even if the sequences are short and pair-wise distances can be calculated relatively quickly, say at a rate of 5000 s^{-1}, this still requires on the order of 1 million seconds (11.57 days) of CPU time. Merely storing the distance matrix is also difficult, as it takes up on the order of 20 GB of disk space and/or memory.
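The figures in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `full_matrix_cost` is hypothetical, and the 4 bytes per stored distance (single precision, upper triangle only) is an assumption chosen because it is consistent with the 20 GB estimate quoted above.

```python
def full_matrix_cost(n_seqs, rate_per_sec=5000, bytes_per_distance=4):
    """Estimate the cost of a complete pair-wise distance matrix.

    rate_per_sec is the comparison rate quoted in the text;
    bytes_per_distance=4 is an assumption (single-precision floats).
    """
    n_pairs = n_seqs * (n_seqs - 1) // 2        # N(N - 1)/2 comparisons
    seconds = n_pairs / rate_per_sec            # CPU time at the given rate
    days = seconds / 86400                      # 86,400 seconds per day
    gigabytes = n_pairs * bytes_per_distance / 1e9
    return n_pairs, seconds, days, gigabytes


pairs, seconds, days, gigabytes = full_matrix_cost(100_000)
print(f"{pairs:,} distances, {seconds:,.0f} s ({days:.2f} days), {gigabytes:.1f} GB")
```

Because the number of pairs grows quadratically, doubling the number of sequences roughly quadruples both the time and the storage.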

There are some shortcuts that can be taken to reduce the number of distance calculations needed for clustering. For example, a recent paper by Katoh and Toh [13] introduced the PartTree heuristic, which rapidly builds a very rough guide tree from an initial small number of seed sequences, using a very fast 6-mer pair-wise distance function and a divisive clustering algorithm with an average time complexity of *O*(*N* log *N*). This algorithm was incorporated into the MAFFT suite of multiple sequence alignment programs [14]. The authors reported that this heuristic allowed the rapid clustering and alignment of approximately 60,000 sequences in only a few minutes. When used for progressive alignment, this considerable gain in speed came at a cost of several percent in alignment accuracy, as benchmarked on the Pfam database of aligned protein families [15].

In this study, we look at data embedding methods [16, 17] for rapidly calculating guide trees. Our goal is to associate the sequences with a set of vectors in some *t*-dimensional *embedding space*. Embedding is done in such a way that the positions of the vectors in the space reflect the relationships between the original sequences as well as possible. Once a set of sequences has been embedded, the distances between the vectors are much faster and cheaper to calculate than distances computed using typical sequence alignment methods, which require *O*(*L*^{2}) to *O*(*L*^{3}) time, *L* being the sequence length [18].

Several methods for embedding biological sequences have already been applied to protein sequences. For example, the Linial-London-Rabinovich (LLR) algorithm [16] draws a number of subsets of sequences at random from the input dataset. Each individual sequence in the dataset is then associated with a vector whose elements are the distances between that sequence and the reference subsets (here, 'distance' is defined to be the minimum distance between sequence and subset). The number and size of the reference subsets depend only on *N*, the number of sequences, such that each embedded vector has dimensionality *t* = (log_{2}*N*)^{2}. This algorithm was reported to offer close distance preservation in the embedded space, and was successfully applied to 38,000 sequences from the Swiss-Prot database [19], revealing many natural biological groupings. However, the original implementation meant that *O*(*N*^{2}) pair-wise distances had to be computed. SparseMap [17] was proposed as a heuristic variant of LLR that is applied in much the same way as the original but includes shortcuts that speed up the embedding process, reducing the number of pair-wise distances that have to be computed from *O*(*N*^{2}) to *O*(*Nt*).
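The LLR-style construction described above can be illustrated with a minimal sketch. It is not the published implementation: the toy `kmer_distance` function is a stand-in for a real pair-wise sequence distance, and the subset schedule (log_{2}*N* subset sizes that double at each step, with log_{2}*N* random subsets per size) is an assumption chosen so that the dimensionality comes out as *t* = (log_{2}*N*)^{2}. No attempt is made to reproduce the SparseMap heuristics.

```python
import math
import random


def kmer_distance(a, b, k=2):
    """Toy distance (assumption): 1 - Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def llr_style_embed(seqs, dist=kmer_distance, seed=0):
    """Embed each sequence as its minimum distance to random reference subsets.

    Assumed schedule: m = round(log2 N) subset sizes (2, 4, 8, ...),
    with m random subsets per size, giving t = m * m coordinates.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(seqs)
    m = max(1, round(math.log2(n)))
    subsets = []
    for i in range(1, m + 1):
        size = min(2 ** i, n)
        for _ in range(m):
            subsets.append(rng.sample(seqs, size))
    # One coordinate per subset: the minimum distance from the sequence to it.
    return [[min(dist(s, r) for r in subset) for subset in subsets]
            for s in seqs]


seqs = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTGACCAA",
        "TTGACCAT", "GGGCCCAA", "GGGCCCAT", "ACGAACGT"]
vectors = llr_style_embed(seqs)
print(len(vectors), "vectors of dimension", len(vectors[0]))
# Distances in the embedded space are plain Euclidean distances.
print(math.dist(vectors[0], vectors[1]), math.dist(vectors[0], vectors[5]))
```

Without a fixed seed, each run would draw different reference subsets and so produce a different embedding.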

The reference groups in both LLR and SparseMap are generated randomly, meaning that a different embedding is found after each run. For testing purposes, the average result from several runs should therefore be considered when comparing methods. When applying UPGMA to the outputs from SparseMap embeddings and using these clusterings as guide trees for multiple alignments, we found (results not shown) considerable differences between runs, and these differences increase as more divergent sequences are included. For these reasons we introduced SeedMap [20], a simplification of SparseMap that uses the same reference sequences in every run and adds some heuristics to further increase speed. SeedMap was found to be capable of producing very fast embeddings of datasets numbering in the tens of thousands of sequences.

In this paper we look at the use of variations on SeedMap specifically for making guide trees for multiple alignment. We name the resulting method mBed and make it available with routines for sequence input and options for the output of embedded vectors or guide trees. This area of application demands high speed and moderate memory use if the method is to be used routinely by biologists. Thus, we have tried to find a method that is as simple and fast as possible while losing as little accuracy as possible compared to the use of a full distance matrix. We test accuracy using standard multiple alignment benchmarking methods [21, 22]. We demonstrate the accuracy of mBed guide trees by comparing these to randomised guide trees and to guide trees directly calculated by ClustalW [5]. We also compare the accuracy of the guide trees to those from MAFFT and PartTree [7, 13]. We demonstrate the scalability of the method by applying it to a set of 380,000 tRNA sequences. Finally, we show a useful by-product of the embedding process: ordinations of large numbers of sequences can easily be generated using Principal Coordinates Analysis (PCoA/PCOORD) or Multi-Dimensional Scaling (MDS) [23].