Core column prediction for protein multiple sequence alignments
 Dan DeBlasio^{1, 2}Email authorView ORCID ID profile and
 John Kececioglu^{1}
DOI: 10.1186/s1301501701023
© The Author(s) 2017
Received: 18 January 2017
Accepted: 28 February 2017
Published: 19 April 2017
Abstract
Background
In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in socalled core columns of the goldstandard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known threedimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted.
Results
We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearestneighbor classification from machine learning, except we transform nearestneighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a largescale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other columnconfidence estimators from the literature, and affords a substantial boost in alignment accuracy.
Keywords
Multiple sequence alignment Core blocks Alignment accuracy Accuracy estimation Parameter advising Machine learning RegressionBackground
The accuracy of a multiple sequence alignment computed on a benchmark set of input sequences is usually measured with respect to a reference alignment that represents the goldstandard alignment of the sequences. For protein sequences, reference alignments are often determined by structural superposition of the known threedimensional structures of the proteins in the benchmark. The accuracy of a computed alignment is then defined to be the fraction of pairs of residues aligned in the socalled core columns of the reference alignment that are also present in columns of the computed alignment. Core columns represent those in the reference that are deemed to be reliable, which can be objectively defined as those columns containing a residue from every input sequence such that the pairwise distances between these residues in the structural superposition of the proteins are all within some threshold (typically a few angstroms). In short, given a known reference alignment whose columns are labeled as either core or noncore, we can determine the accuracy of any other computed alignment of its proteins by evaluating the fraction of aligned residue pairs from these core columns that are recovered.
For a column in a computed alignment, we can also define the coreness value for the column to be the fraction of its aligned residue pairs that are in core columns of the reference alignment. (Note this definition of column coreness is fully objective when core columns are identified through automated superposition of known protein structures, as done for example in the PALI [1] benchmark suite.) A coreness value of 1 means the column of the computed alignment corresponds to a core column of the reference alignment.

select the best result from an ensemble of different aligners, naturally yielding a new ensemble aligner, which can be far more accurate than any of its individual aligners [5].

mask out unreliable regions of the alignment before computing an evolutionary tree, to boost the quality of phylogeny reconstruction; or

improve an alignment accuracy estimator by concentrating its evaluation function on columns of higher predicted coreness, thereby boosting the performance of parameter advising.
In this paper, we develop for the first time a column coreness predictor for protein multiple sequence alignments. Our approach to predicting coreness is similar in some respects to nearestneighbor classification from machine learning, except we transform nearestneighbor distance into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a largescale linear programming problem. We leverage our new coreness predictor to yield an improved alignment accuracy estimator, and evaluate its performance by applying the improved estimator to the task of parameter advising in multiple sequence alignment.
Related work
To our knowledge, this is the first fully general attempt to directly predict the coreness of columns in computed protein alignments. Tools are available that assess the quality of columns in a multiple alignment, and can be categorized into: (a) those that only identify columns as unreliable, for removal from further analysis; and (b) those that compute a column quality score, which can be thresholded. Tools that simply mask unreliable columns of an alignment include GBLOCKS [6], TrimAL [7], and ALISCORE [8]. Popular qualityscore tools are Noisy [9], ZORRO [10], TCS [11], and GUIDANCE [12].
Our experiments compare our coreness predictor to TCS and ZORRO: the most recent tools that provide quality scores, as opposed to masking columns. Among the other qualityscore tools listed above, Noisy has been shown to be dominated by GUIDANCE, which is in turn dominated by ZORRO. (GUIDANCE also requires four or more sequences, which excludes many benchmarks.) Below we briefly summarize the approaches behind TCS and ZORRO.
TCS (short for “transitive consistency score”) extends an earlier approach of COFFEE [13]. For a pair i, j of residues that are aligned in a column and that come from sequences A and B, the support for aligned pair i, j is the sum of the scores of all pairwise alignments of every other sequence C versus A and B, where the pairwise alignments involving C are constrained to align i and j to a common residue of C, and where this sum is normalized so support is in the range [0, 1]. The TCS score for a column is then the average support of its aligned residue pairs.
ZORRO uses an evolutionary tree over the alignment’s sequences to determine a weight for each sequence pair. The length of each edge in the tree is apportioned among the sequence pairs whose tree paths include that edge; the total amount of edge length apportioned to a given sequence pair yields a weight for that pair, where these weights also take into account both an estimate of the evolutionary distance between sequences (estimated by the length of the tree path between them), and the correlation between sequence pairs (estimated by the length of overlap in the paths between the pairs). The ZORRO score for a column is then the weighted sum, over the column’s aligned residue pairs, of the probability of emitting the residue pair’s amino acids by a pair hidden Markov model, times the weight of the residue pair’s corresponding sequence pair.
In contrast to the quality scores of TCS and ZORRO, we directly predict column coreness. Our approach is also not dependent on the choice of an alignment scoring scheme as in TCS, or the choice of hidden Markov model emission probabilities as in ZORRO.
Plan of the paper
"Learning a coreness predictor" section describes how we learn our coreness predictor. We then explain how we use predicted coreness to improve accuracy estimation for protein alignments. "Assessing the coreness predictor" section evaluates our approach to coreness prediction by applying the improved accuracy estimator to alignment parameter advising. Finally, we conclude and give directions for further research.
Learning a coreness predictor
To describe how we learn a column coreness predictor, we first discuss our representation of alignment columns, and our grouping of consecutive columns into window classes. We then present our regression function for predicting coreness, which transforms the nearestneighbor distance from a window to a class into a coreness value. Following this we explain how to learn the window distance function by solving a largescale linear programming problem. Finally we show that the resulting window distances satisfy the triangle inequality, which enables the use of data structures for metricspace nearestneighbor search when evaluating the regression function.
Representing alignment columns
The information used by our coreness predictor, beyond the multiple sequence alignment itself, is an annotation of its protein sequences by predicted secondary structure (which can be obtained in a preprocessing step by running the sequences through a standard protein secondary structure prediction tool such as PSIPRED [14]). When inputting a column from such an annotated alignment to our coreness predictor, we need a column representation that, while capturing the association of amino acids and predicted secondary structure types, is also independent of the number of sequences in the column. This is necessary as our predictor will be trained on example alignments of particular sizes, yet the resulting predictor must apply to alignments with arbitrary numbers of sequences.
Let \(\Sigma \) be the 20letter amino acid alphabet, and \(\Gamma = \{\alpha , \beta , \gamma \}\) be the secondary structure alphabet, corresponding respectively to types \(\alpha \) helix, \(\beta \) strand, and other (also called coil). We encode the association of an amino acid \(c \in \Sigma \) with its predicted secondary structure type \(s \in \Gamma \) by an ordered pair (c, s) that we call a state, from the set \(Q = (\Sigma \times \Gamma ) \,\cup \, \{\xi \}\). Here \(\xi = (\texttt {},\texttt {})\) is the gap state, where the dash symbol \(\hbox{`'}\not \in \Sigma \) is the alignment gap character.
Classes of column windows
In protein benchmarks, a column of a reference alignment is labeled core if the residues in that column are all sufficiently close in the structural superposition of the known threedimensional structures of the proteins. The folded structure around a residue is not simply a function of the amino acid of the residue itself, or its secondary structure type, but is also a function of nearby residues in the protein. Consequently, to predict the coreness of a column in a computed alignment, we need contextual information from nearby columns of the alignment. We gather this additional context around a column by forming a window of consecutive columns centered on the given column. Formally, a window W of width \({w \ge 1}\) is a sequence of \(2w \!+\! 1\) consecutive column profiles \(C_{w} \cdots C_{1} C_0 C_{+1} \cdots C_{+w}\) centered around profile \(C_0\).
We define the following set of window classes \(\mathcal C\), depending on whether the columns in a labeled training window are known to be core or noncore in the reference alignment. (When later extracting training windows from a computed alignment that has a known reference alignment, we will label a column in a computed alignment as core iff its true coreness value—namely, the fraction of its residue pairs that are in core columns of the reference alignment—is above a fixed threshold.) We denote a column labeled core by C, and a column labeled noncore by N. For window width \(w \!=\! 1\) (which has three consecutive columns), such labeled windows correspond to strings of length 3 over alphabet \(\{\texttt {C}, \texttt {N}\}\). The three classes of core windows are CCC, CCN, NCC; the three classes of noncore windows are CNN, NNC, NNN. (A window is considered core or noncore depending on the label of its center column. We exclude windows NCN and CNC, as these almost never occur in reference alignments.) Together these six classes comprise set \(\mathcal C\). We call the five classes with at least one core column C in the window, structured classes; the one class with no core columns is the unstructured class, denoted by \(\bot = \texttt {NNN}\).
The coreness regression function
We learn a coreness predictor by fitting a regression function that first measures the similarity between a column’s window and training examples of windows with known coreness, and then transforms this similarity into a coreness value.
 (1)
(Find distance to closest class) Across all labeled training windows, in all structured window classes, find the training window that has smallest classspecific distance to W. Call this closest window V, its class c, and their distance \(\delta = d_c(V,W)\).
 (2)
(Transform distance to coreness) If class c is a core class, return the coreness value given by transform function \(f_\text {core}(\delta )\). Otherwise, return value \(f_\text {non}(\delta )\).
We next explain how we efficiently find distance \(\delta \), and then describe the transform functions f.
Finding the distance to a class
To find the distance of a window W to a class c, we need to find the nearest neighbor of W among the set of training windows \(T_c\) in class c, namely \(\mathop {\mathrm {argmin}}_{V \in T_c} \bigl \{ d_c(V,W) \bigr \}\). Finding the nearest neighbor through exhaustive search by explicitly evaluating \(d_c(V,W)\) for every window V can be expensive when \(T_c\) is large (and cannot be avoided in the absence of exploitable properties of function \(d_c\)).
When the distance function is a metric, for which the key property is the triangle inequality (namely that \(d(x,z) \,\le \, d(x,y) + d(y,z)\) for any three objects x, y, z), faster nearest neighbor search is possible. In this situation, in a preprocessing step we can first build a data structure over the set \(T_c\), which then allows us to perform faster nearest neighbor searches on \(T_c\) for any query window W. One of the best data structures for nearest neighbor search under a metric is the cover tree of Beygelzimer, Kakade and Langford [16]. Theoretically, cover trees permit nearest neighbor searches over a set of n objects in \(O(\log n)\) time, after constructing a cover tree in \(O(n \log n)\) time, assuming that the intrinsic dimension of the set under metric d has a socalled bounded expansion constant [16]. (For actual data, the expansion constant can be exponential in the intrinsic dimension.) In our experiments, for nearest neighbor search we use the recentlydeveloped dispersion tree data structure of Woerner and Kececioglu [17], which in extensive testing on scientific data is significantly faster in practice than cover trees.
We build a separate dispersion tree for each structured window class \(c \in {\mathcal C}  \{\bot \}\) over its training set \(T_c\) using its distance function \(d_c\) in a preprocessing step. To find the nearest neighbor to window W over all training windows \({\mathcal T} = \bigcup _{c} T_c\) we then perform a nearest neighbor search with W on the dispersion tree for each structured class c, and merge these \({\mathcal C}1\) search results by picking the one with smallest distance to W.
Transforming distance to coreness
 (a)
take the examples whose nearest neighbor is from one of the three core classes,
 (b)
sort these examples by their observed nearestneighbor distance,
 (c)
at each observed distance \(\delta \), collect all \(k \!\ge \! 1\) examples whose distance equals \(\delta \), the \(\ell \) successive examples whose distance is below \(\delta \), and the \(\ell \) successive examples above \(\delta \), where count \(\ell \) is fixed for the fitting process, and
 (d)
compute the average truecoreness value of these \(k + 2\ell \) examples, and associate this average value with distance \(\delta \).
To predict coreness for a window from a computed alignment, again we (1) find its nearestneighbor distance \(\delta \) among all training windows from structured classes, and (2) transform this distance to coreness by returning \(f_\text {core}(\delta )\) if the nearest neighbor is from a core class and \(f_\text {non}(\delta )\) otherwise.
Learning the distance function by linear programming
We now describe the linear program used to learn the distance functions on column windows. Again we divide the window classes \(\mathcal C\) into two categories: the structured classes, containing windows centered on core columns, or centered on noncore columns that are flanked on at least one side by core columns; and the unstructured class, containing windows of only noncore columns. We again denote this unstructured class of completely noncore windows by \(\bot \in {\mathcal C}\). The linear program learns a classspecific distance function \(d_c\) for each structured window class \(c \,\in \, {\mathcal C}  \{\bot \}\).

If \(\delta \,\le \, \tau \), the central column of window W is declared to be “core” or “noncore” depending on whether structured class c is respectively core or noncore.

Otherwise, window W is deemed to be in the unstructured noncore class \(\bot \), and its central column is declared “noncore.”
To construct the linear program, we partition the training set \(\mathcal T\) of labeled windows by window class: subset \(T_c \subseteq {\mathcal T}\) contains all training windows of class \(c \in {\mathcal C}\). We then form a smaller training sample \({S_c \subseteq T_c}\) for each class c by choosing a random subset of \(T_c\) with a specified cardinality \(S_c\).
For a sample training window \(W \in S_c\) we identify other windows \(V \in T_c\) from the same class c in the full training set that are close to W (under a default distance \(\widetilde{d}_c\)). We call these close windows V from the same class c, targets. Similarly for \(W \in S_c\) we identify other windows \(U \in T_b\) from a different class \(b \ne c\) in the full training set that are also close to W (under \(\widetilde{d}_b\)). We call these other close windows U from a different class b, impostors (paralleling the terminology of Weinberger and Saul [19]).
We call these sets of windows that are close to a given window W the neighborhood \({\mathcal N}_c(W,i)\) of W for a structured class \(c \,\in \, {\mathcal C}  \{\bot \}\), which denotes the set of inearestneighbors to W (not including W) from training set \(T_c\) under the classspecific default distance function \(\widetilde{d}_c\). (The default distance function that we use in our experiments is described later.)

pulls in targets \(V \,\in \, {\mathcal N}_c(W,i)\), by making \(d_c(V,W)\) small, and

pushes away impostors \({U \,\in \, {\mathcal N}_b(W,i)}\) for \(b \ne c\), by making \(d_b(U,W)\) large.
In the target neighborhood \({\mathcal N}_c(W,k)\) above, count k specifies the number of targets for each sample window W. In our experiments we use a small number of targets, with \(k = 2\) or 3.
We also have impostor constraints for each completely noncore window \({W \in T_\bot }\) and each core window \(V \in {\mathcal N}_b(W,\ell )\) from each structured core class b (as we do not want W to be considered core), which are of the same form as inequalities (3) and (4) above.
In the impostor neighborhood \({\mathcal N}_b(W,\ell )\) above, count \(\ell \) specifies the number of impostors for each sample window W. We use a large number of impostors \(\ell \approx 100\) in our experiments. Having a single impostor error variable \(f_W\) per sample window W (versus a target error variable \(e_{VW}\) for every W and target V) allows us to use a large count \(\ell \) while still keeping the number of variables in the linear program tractable.
To summarize, the variables of the linear program are the substitution scores \(\sigma _{c,i}(p,q)\), the error variables \(e_{VW}\) and \(f_W\), and the threshold variable \(\tau \). For n total training sample windows, k targets per sample window, m window classes of width w, and aminoacid alphabet size s, this is \(\Theta (k n + s^2 w m)\) total variables. The main constraints are the target constraints, impostor constraints, and triangle inequality constraints. For \(\ell \) impostors per sample window, this is \({\Theta \bigl ( (k + \ell m)n + s^3 w m \bigr )}\) total constraints. We ensure that solving the linear program is tractable by controlling the number k of targets, the number \(\ell \) of impostors, and the total size n of the training sample.
Ensuring the triangle inequality
We now show that the distance functions obtained by solving the above linear program obey the triangle inequality.
Theorem 1
(Triangle Inequality on Window Distances) The class distance functions \(d_c\) obtained by solving the linear program satisfy the triangle inequality.
Proof
In short, \({d_c(U,W) \,\le \, d_c(U,V) + d_c(V,W)}\) for all windows U, V, W, so the triangle inequality holds on distances \(d_c\). \(\square \)
Since window distances satisfy the triangle inequality, we can use fast data structures for metricspace nearestneighbor search to evaluate the coreness predictor.
Applying coreness to accuracy estimation
The Facet alignment accuracy estimator [3] is a linear combination of efficientlycomputable feature functions of an alignment that are positively correlated with true accuracy. As mentioned earlier, the true accuracy of a computed alignment is measured only with respect to core columns of the reference alignment. We leverage our coreness predictor to improve the Facet estimator by: (1) creating a new feature function that attempts to directly estimate true accuracy, and (2) concentrating the evaluation of existing feature functions on columns with high predicted coreness.
Creating a new coreness feature
Our new feature function on alignments, which we call Predicted Alignment Coreness, is similar to the socalled totalcolumn score sometimes used to measure alignment accuracy. Predicted Alignment Coreness counts the number of columns in the alignment that are predicted to be core, by taking a window W around each column, and counting the number of windows whose predicted coreness value \(\chi (W)\) exceeds a threshold \(\kappa \). This count of predicted core columns in the given alignment is normalized by an estimate of the number of true core columns in the unknown reference alignment of the sequences.

aggregate measures of the lengths of sequences in \(\mathcal S\), namely their minimum, mean, and maximum length;

averages over all pairs of sequences in \(\mathcal S\) of the ratio of their longestcommonsubsequence length divided by an aggregate measure of the lengths of the pair of sequences (which can be viewed as forms of “percent identity”);

averages over all pairs of sequences of the ratio of their difference in sequence length divided by an aggregate length measure (forms of “percent indel”); and

averages over all pairs of sequences of the ratio of aggregate length measures for the pair (forms of “relative indel”).
The final fitted function \(L(\mathcal {S})\) that we use for the new Predicted Alignment Coreness feature is given later.
Augmenting existing features by coreness

Secondary Structure Blockiness \(F_\texttt {BL}\) uses secondary structure predictions on the alignment’s proteins obtained from PSIPRED [14], and returns the maximum total score of an optimal packing of secondary structure blocks in the alignment, normalized by the total number of residue pairs in the alignment’s columns, where: a block is an interval of columns together with a subset of the sequences such that all residues in the block have the same secondary structure prediction, a packing is a set of blocks whose column intervals are all disjoint, and the score of a block is the total number of pairs of residues within the columns in the block. (So an optimal packing maximizes the number of pairs of residues in the alignment’s columns that are covered by blocks of consistent predicted secondary structure.) We create a new augmented feature \(F'_\texttt {BL}\) by weighting the number of residue pairs for a column by the column’s predicted coreness value.

Secondary Structure Identity \(F_\texttt {SI}\) is the fraction of residue pairs in columns of the computed alignment that share the same predicted secondary structure. We create a new feature \(F'_\texttt {SI}\) by weighting counts of column residue pairs by their column’s predicted coreness.

Amino Acid Identity \(F_\texttt {AI}\) is the fraction of column residue pairs that share the same aminoacid equivalence class. The augmented feature \(F'_\texttt {AI}\) weights residue pairs by their column’s predicted coreness.

Average Substitution Score \(F_\texttt {AS}\) is the average BLOSUM62 score [20] of all column residue pairs, with BLOSUM similarity scores scaled to the range [0, 1]. The augmented feature \(F'_\texttt {AS}\) weights this average by the column’s predicted coreness.

Secondary Structure Agreement \(F_\texttt {SA}\) uses predicted secondary structure confidences from PSIPRED (the confidence that a residue is in each of the three secondary structure states) to estimate the probability that each column residue pair shares the same secondary structure state, in a weighted window centered on each pair, and averages these estimates over all pairs.

Gap Open Density \(F_\texttt {GO}\) is the fraction of gap characters (‘’) in the alignment that start a run of such characters.

Gap Extension Density \(F_\texttt {GE}\) is the fraction of alignment entries that are gap characters (‘’).
Assessing the coreness predictor
We evaluate our new approach to coreness prediction, and its use in accuracy estimation for alignment parameter advising, through experiments on a collection of protein multiple sequence alignment benchmarks. A full description of the benchmarks, and the universe of parameter choices for parameter advising, is given in [3].
Briefly, the benchmarks in our experiments consist of reference alignments of protein sequences largely induced by structurally aligning their known threedimensional folded structures. We use the BENCH benchmark suite of Edgar [21], supplemented by a selection from the PALI benchmark suite of Balaji et al. [1]. Our full benchmark collection consists of 861 reference alignments.
We use 12fold crossvalidation to assess both column classification with our coreness predictor, and parameter advising with our augmented accuracy estimator. To correct for the overabundance of easytoalign benchmarks when assessing parameter advising, we bin the benchmarks according to difficulty, measured by the true accuracy of their alignment computed by the Opal aligner [22, 23] under its default parameter setting. We ensure folds are balanced in their representation of benchmarks from all difficulty bins. For each fold, we generate a training set and testing set of example alignments by running Opal on each benchmark for each parameter choice from a fixed universe of 243 parameter settings.
Constructing the coreness predictor
We first discuss results on learning the distance functions for the coreness predictor, and then discuss results on fitting its transform functions.
Learning the distance functions
To keep the size manageable of the linear program that we solve to learn the window class distance functions \(d_c\), we use a training sample of 2000 total windows representing the structured classes. We find targets and impostors for windows from the training sample by performing nearestneighbor searches in training sets that for each fold have 4000 windows from each structured window class. For each window from the training sample, the linear program uses 2 targets, and 150 impostors from each window class. For testing the accuracy of our learned coreness predictor, we use testing sets of 2000 total windows representing all classes (including the unstructured class). The windows for these training samples, training sets, and testing sets are drawn from corresponding training and testing example alignments.
We form the initial sets of targets and impostors for the linear program by either: (1) performing nearestneighbor searches using a default distance function whose positional substitution score is a convex combination of (a) the VTML200 substitution score on the states’ amino acids (transformed to a dissimilarity value in the range [0, 1]), and (b) the identity function on the states’ secondary structure types, with positions weighted so the center column has twice the weight of its flanking columns; or (2) randomly sampling windows from the appropriate classes to choose targets and impostors, as further discussed below.
Core column classifier areaunderthecurve (AUC) for training and testing data with iterated distance learning
Iteration  

1  2  3  4  5  6  7  8  9  10  
Training  88.7  94.3  99.2  99.4  99.5  99.7  99.6  99.7  99.4  99.7 
Testing  84.7  81.3  80.8  80.0  83.4  82.5  81.2  80.9  80.3  81.6 
Note that the training AUC steadily increases for the first four iterations, and then oscillates around a high plateau. This does not translate, however, into an improvement in the testing AUC, which actually drops and then oscillates at a much lower level.
While iterating distance learning markedly improves this core column classifier on the training examples, it is overfitting, and does not generalize well to testing examples. This may be due to the smaller training sample and training sets used to reduce the time for solving the linear program.
Interestingly, we found that using random examples from appropriate window classes for the target and impostor sets led to much better generalization. Specifically, this achieved a training and testing AUC of 86.0 and 88.6, respectively. Accordingly, in the remainder of the paper we use distance functions obtained by solving the linear program with random target and impostor sets, and no iteration, when assessing results on parameter advising.
Transforming distance to coreness
Interestingly, when a column’s window is sufficiently far away from all structured classes (including core and noncore classes), the green \(f_\text {core}\) and \(f_\text {non}\) logistic curves both converge to a predicted coreness around 33% (which roughly agrees with the blue and red empirical average coreness curves).
Improving parameter advising
A parameter advisor and has two components: (1) an accuracy estimator, which estimates the accuracy of a computed alignment, and (2) an advisor set, which is a set of candidate assignments of values to the aligner’s parameters. The advisor picks the choice of parameter values from the advisor set for which the aligner yields the computed alignment of highest estimated accuracy.
In our parameter advising experiments, we assess the true accuracy of the multiple sequence alignment tool Opal [22, 23] combined with an advisor that uses the accuracy estimator Facet [3] (the best estimator for parameter advising in the literature), augmented by our new coreness predictor as well as by two other columnquality tools: TCS [11] and ZORRO [10]. We compare these advising results against prior approaches using for the estimator both the original unmodified Facet as well as TCS (the nextbest estimator for parameter advising in the literature). We also compare against augmenting Facet by true coreness, which represents the unattainable limit reached by a perfect coreness predictor.
These experiments focus on advising for the Opal aligner, as it is an ideal test bed for studying parameter advising: in contrast to other aligners, at each node of the guide tree during progressive alignment, Opal computes subalignments that are optimal with respect to the given parameter choice for the sumofpairs scoring function with affine gap costs [24].

estimatorindependent oracle sets, which are learned for a conceptual oracle advisor that has access to true accuracy for its estimator, by solving an integer linear program to achieve optimal advising accuracy on training data; and

estimatoraware greedy sets, which are learned for a specific concrete estimator by a greedy approximation algorithm that guarantees nearoptimal training accuracy, and which tend to perform better than oracle sets in practice when used with their concrete estimator.
The augmented Facet estimator
We use our coreness predictor to modify the Facet accuracy estimator by including the new Predicted Alignment Coreness feature, and augmenting existing feature functions by coreness. We learned coefficients for these feature functions, as well as all the features originally in Facet, using the differencefitting technique described in [3].

the new feature: Predicted Alignment Coreness \(F_\texttt {AC}\) ;

two features augmented by our coreness predictor: Secondary Structure Identity \(F'_\texttt {SI}\) , and Secondary Structure Blockiness \(F'_\texttt {BL}\) ; and

five original unaugmented features: Secondary Structure Agreement \(F_\texttt {SA}\) , Secondary Structure Identity \(F_\texttt {SI}\) , Secondary Structure Blockiness \(F_\texttt {BL}\) , Gap Extension Density \(F_\texttt {GE}\) , and Gap Open Density \(F_\texttt {GO}\) .
The coreness feature normalizer
The normalizer function \(L(\mathcal {S})\) used in Predicted Alignment Coreness (feaure \(F_\texttt {AC}\)), gives an estimate of the number of core columns in the unknown reference alignment of sequences \(\mathcal S\), and is a linear combination of basic measures of \(\mathcal S\), as described earlier. Optimal coefficients for this linear combination are learned by minimizing its \(L_1\)norm with the true number of core columns in training benchmarks.

\(\ell _\text {min}, \, \ell _\text {avg}, \, \ell _\text {max}\) are respectively the minimum, average, and maximum sequence lengths in \(\mathcal S\);

\(p_\text {min}, \, p_\text {avg}, \, p_\text {max}\), similar to percent identity measures, are the longestcommonsubsequence length for each pair of sequences normalized by respectively the minimum, average, and maximum sequence length for the pair, averaged over all pairs;

\(q_\text {min}, \, q_\text {avg}\) are quotients of respectively the minimum or average sequence length for each pair of sequences divided by the maximum length for the pair, averaged over all pairs; and

\(r_\text {min}, \, r_\text {avg}\) are ratios of the difference in sequence lengths for each pair of sequences divided by respectively the minimum and average sequence length for the pair, averaged over all pairs.
Performance on parameter advising

the augmented Facet estimator (“Facet/predicted”),

the original unaugmented Facet estimator (“Facet/none”),

Facet augmented by TCS (“Facet/TCS”),

Facet augmented by ZORRO (“Facet/ZORRO”), and

Facet augmented by true coreness (“Facet/true”).

TCS (the nextbest estimator for advising in the literature).
Performance when expanding the parameter choices
Figures 6 and 7 show parameter advising performance using oracle and greedy advisor sets, respectively. In both figures, the horizontal axis is advisor set cardinality (the number of different parameter choices available to the advisor), while the vertical axis is advising accuracy for testing folds (the true accuracy on testing benchmarks of the aligner combined with the parameter advisor), uniformly averaged across bins. The curves show performance with the Opal aligner [22, 23]. For reference, the default alignment accuracy for three other popular aligners, MAFFT [27], MUSCLE [28], and Clustal Omega [29], is also shown with dashed horizontal lines.
Figure 6 shows that on oracle advisor sets, Facet/predicted compared to Facet/none boosts the average accuracy of parameter advising by more than 3%. This increase is in addition to the improvement of Facet over TCS.
Figure 7 shows that on greedy advisor sets, Facet/predicted boosts advising accuracy as well: for example, at cardinality 7, by more than 1%. (Note that accuracies for the greedy set curves are already higher than for oracle sets.) Up to cardinality 7, the accuracy for Facet/predicted is about halfway between Facet/none and Facet/true (the unattainable perfect coreness predictor). Interestingly, Facet/TCS and Facet/ZORRO actually have worse accuracy than Facet/none.
Performance when generalizing to new data
While with greedy advisor sets, using predicted coreness to augment Facet does boost advising accuracy, a larger improvement might be realized by pairing with a better approach to finding estimatoraware advisor sets than greedy setfinding. The boost in accuracy we observe may actually be limited by the methods we are using to find advisor sets for the improved estimator. As an indication, Fig. 8 contrasts average training and testing accuracy for advising with Facet/predicted on greedy sets: specifically, the accuracy using greedy sets that were learned on training benchmarks when these same sets are applied to testing benchmarks. The upper dashed curve is average training advising accuracy, while the lower solid curve is testing accuracy. The drop between these curves indicates greedy setfinding is overfitting to training data, and not generalizing well to testing data. With better generalization, we might also continue to get improved performance at set sizes beyond cardinality 7, where greedy advisor sets currently plateau.
Performance within difficulty bins
Advising accuracy within difficulty bins for greedy sets of cardinality 7 is shown earlier in Fig. 2. In this bar chart, for the bin at each difficulty on the horizontal axis, advising accuracy averaged over just the benchmarks in the bin is shown on the vertical axis. The final chart on the right gives accuracy averaged across all bins. On difficult benchmarks, Facet/predicted boosts the accuracy of Facet/none by more than 3%. Note also how close Facet/predicted is to Facet/true: the advising accuracy is already quite close to what could be achieved augmenting with a perfect coreness predictor.
Conclusion
Further research
A key issue left to explore is how to improve the generalization of distancefunction learning and greedy advisorset learning. Currently both tend to overfit to training data, resulting in a loss of testing accuracy. One way to address overfitting in distance learning would be to lower the number of substitutionscore parameters in the learned distance functions by using reduced proteinsequence alphabets with aminoacid equivalence classes, which should aid generalization.
Another very promising research direction is to apply the improved accuracy estimator to ensemble multiple sequence alignment [5], where the estimator is used to pick the alignment output by an ensemble of sequence aligners. Any improvement in the estimator should yield further accuracy boosts for ensemble alignment.
Declarations
Authors' contributions
Conceptualization of the research, data analysis, and writing of the manuscript by JK and DD. Coding and implementation by DD. Work of DD originally performed at the University of Arizona. Both authors read and approved the final manuscript.
Acknowledgements
The authors thank the reviewers for their helpful comments. An earlier conference version of this paper appeared as [30].
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
The benchmarks used for training and testing, as well as the original version of the Facet estimator, are available at http://facet.cs.arizona.edu.
Funding
JK and DD were supported at the University of Arizona by US National Science Foundation Grant IIS1217886 to JK. DD was also partially supported at Carnegie Mellon University by NSF Grant CCF1256087, NSF Grant CCF131999, NIH Grant R01HG007104, and Gordon and Betty Moore Foundation Grant GBMF4554, to Carl Kingsford. Open access charges for this paper were covered by the University of Arizona Open Access Publishing Fund.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Balaji S, Sujatha S, Kumar SSC, Srinivasan N. PALI—a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res. 2001;29(1):61–5.View ArticlePubMedPubMed CentralGoogle Scholar
 DeBlasio DF, Wheeler TJ, Kececioglu JD. Estimating the accuracy of multiple alignments and its use in parameter advising. In: Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB); 2012. pp. 45–59.
 Kececioglu J, DeBlasio D. Accuracy estimation and parameter advising for protein multiple sequence alignment. J Comput Bio. 2013;20(4):259–79.View ArticleGoogle Scholar
 DeBlasio DF. Parameter Advising for Multiple Sequence Alignment. PhD dissertation. Department of Computer Science, The University of Arizona; 2016.
 DeBlasio D, Kececioglu J. Ensemble multiple sequence alignment via advising. In: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACMBCB). 2015; pp. 452–461. doi:10.1145/2808719.2808766.
 Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17(4):540–52.View ArticlePubMedGoogle Scholar
 CapellaGutierrez S, SillaMartinez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in largescale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3.View ArticlePubMedPubMed CentralGoogle Scholar
 Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B. Parametric and nonparametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front Zool. 2010;7(10):1–10.Google Scholar
 Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF. Noisy: identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol. 2008;3(1):7.View ArticlePubMedPubMed CentralGoogle Scholar
 Wu M, Chatterji S, Eisen JA. Accounting for alignment uncertainty in phylogenomics. PLoS ONE. 2012;7(1):30288.View ArticleGoogle Scholar
 Chang JM, Tommaso PD, Notredame C. TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol. 2014;31:1625–37.View ArticlePubMedGoogle Scholar
 Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 2015;43(W1):7–14.View ArticleGoogle Scholar
 Notredame C, Holm L, Higgins DG. COFFEE: an objective function for multiple sequence alignments. Bioinformatics. 1998;14(5):407–22.View ArticlePubMedGoogle Scholar
 Jones DT. Protein secondary structure prediction based on positionspecific scoring matrices. J Mol Biol. 1999;292(2):195–202.View ArticlePubMedGoogle Scholar
 Durbin R, Eddy SR, Krogh A, Mitchison G. Biological Sequence Analysis: Probablistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1998.View ArticleGoogle Scholar
 Beygelzimer A, Kakade S, Langford J. Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning (ICML); 2006.
 Woerner A, Kececioglu J. Faster metricspace nearestneighbor search using dispersion trees. In preparation. 2017.
 Jones E, Oliphant T, Peterson P, et al. SciPy: open source scientific tools for Python; 2001. http://www.scipy.org/.
 Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res. 2009;10:207–44.Google Scholar
 Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA. 1992;89(22):10915–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Edgar RC. BENCH. drive5.com/bench. 2009.
 Wheeler TJ, Kececioglu JD. Multiple alignment by aligning alignments. Bioinformatics. 2007;23(13):559–68 (Proceedings of ISMB).View ArticleGoogle Scholar
 Wheeler TJ, Kececioglu JD. Opal: software for sumofpairs multiple sequence alignment. 2012. opal.cs.arizona.edu.
 Kececioglu J, Starrett D. Aligning alignments exactly. In: Proceedings of the 8th Conference on Research in Computational Molecular Biology (RECOMB); 2004. pp. 85–96.
 DeBlasio DF, Kececioglu JD. Learning parameter sets for alignment advising. In: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACMBCB); 2014. pp. 230–9.
 DeBlasio DF, Kececioglu JD. Learning parameteradvising sets for multiple sequence alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics; 2017. (To appear)
 Katoh K, Kuma KI, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005;33(2):511–8.View ArticlePubMedPubMed CentralGoogle Scholar
 Edgar RC. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 2004;5(113):1–19.Google Scholar
 Sievers F, et al. Fast, scalable generation of highquality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7(1):1–539.Google Scholar
 DeBlasio D, Kececioglu J. Predicting core columns of protein multiple sequence alignments for improved parameter advising. In: Proceedings of the 16th Workshop on Algorithms in Bioinformatics (WABI); 2016. pp. 77–89.