TMRS: an algorithm for computing the time to the most recent substitution event from a multiple alignment column

Background  As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed. Results  We present algorithms to estimate the evolutionary time (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t_{\text {MRS}}$$\end{document}tMRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t_{\text {MRS}}$$\end{document}tMRS values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sigma$$\end{document}σ of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t_{\text {MRS}}$$\end{document}tMRS by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations. Conclusions  We have presented algorithms to estimate the evolutionary time (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$t_{\text {MRS}}$$\end{document}tMRS) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species. Electronic supplementary material The online version of this article (10.1186/s13015-019-0158-3) contains supplementary material, which is available to authorized users.


Background
As sequenced genomes continue to accumulate, a very rich source for discovering detailed evolutionary information grows. The UCSC genome browser provides multiple genome alignments for 100 vertebrate species, including humans (the multiz100way track) [1][2][3].
In previous decades, multiple DNA alignments are often used to reconstruct species trees and ancestral nucleotide states [4] and many algorithms and softwares are developed for such purposes. Some of the most used algorithms include Neighbor-Joining algorithm [5] and maximal likelihood method [4] and Bayesian Markov chain Monte Carlo method [6]. These algorithms usually assume evolutionary models that each nucleotide stochastically mutates over evolutionary time, and output the most consistent phylogenetic tree from possible (2n − 3)!! rooted or (2n − 5)!! unrooted trees for n-species. On the other hand, since the species tree of 100 vertebrates of multiz100way are basically resolved from the previous studies [7], finding functional genomic sites rather than determining the phylogenetic tree is becoming more important application as the use of multiple genomic alignments in recent years.
As it is difficult to visually inspect functional regions from 100-species alignments, computing genome-wide summary statistics is very important. Measuring the strength of negative or positive selection is among the most popular analyses for screening functional regions of genomes [8][9][10][11][12][13][14]. These statistics are computed using probabilistic models that model the stochastic processes of DNA mutations along phylogenetic species trees, which are used in tree reconstruction [4][5][6], and detect genomic regions that show smaller or larger mutation rates using likelihood ratio tests or similar probabilistic computations.
Such statistics have advantages over simpler statistics that do not assume a particular evolutionary model, such as nucleotide frequency of alignment columns and pairwise mismatch rates. By using a phylogenetic tree, we can appropriately count the number of ancestral mutations that are widespread within extant species. Further, stochastic processes can account for multiple nucleotide mutations whose effects are not negligible when we study evolutionarily distant species. However, only conservation/divergence measures are not sufficient to extract all evolutionarily important events from potential 4 100 nucleotide patterns of a 100-species alignment column.
In this study, we develop algorithms to compute three statistics, t MRS , σ , and q, for each column of a multiple genome alignment based on an evolutionary model that is similar to those described above. t MRS is the evolutionary time to the most recent substitution event that occurred along the lineage of a given target species in the phylogenetic tree. Since the confidence in estimated t MRS values varies markedly among alignment columns depending on gap fractions and complexity of nucleotide patterns (see Fig. 1 for explanation), we also compute the standard deviation σ of t MRS . Further, we compute the probability q that there is no mutation in the target lineage because the estimated t MRS value has no meaning in such cases. By filtering out sites with non negligible probability of nucleotide conservation over the entire target lineage based on q, we can remove highly conserved sites. By comparing t MRS with speciation time points, we can categorize sites by the groups of species that share mutation effects with the target species. Such detailed information is difficult to obtain from conservation measures. Our algorithms can be a very useful tool for screening the genomic sites that may have been involved in the acquisition of unique biological features by target species.
In the next section, we describe our algorithms to compute t MRS and data processing procedures. We first explain the t MRS algorithm on a single edge of phylogenetic tree, and then generalize it to account for the entire tree. The algorithms for computing σ and q are described in Additional file 1 as they are very similar to that of t MRS . In the result section, we empirically show the correctness of our algorithms by posterior sampling of mutation history. We also show that our algorithm is fast enough to be applied to the entire human genome, and that t MRS statistic is very different from other statistics to detect evolutionary conservation/divergence of genomic sites. We then apply our algorithms to the multiz100way dataset and investigate distributions of t MRS in different genomic contexts. In particular, we investigate the correlation between t MRS distribution on the bidirectionally transcribed enhancers and tissue specificity of enhancer activities and found that brain-specific transcribed enhancers are threefold enriched in the density of t MRS that located in the human lineage.

Method
We first derive formulas for t MRS and other variables for an edge of a phylogenetic tree, and then describe how to generalize them into statistics for the entire phylogenetic tree.

Single edge case
A continuous-time Markov model for nucleotide sequence evolution can be defined by a differential equation that determines the time evolution of the probability of observing each nucleotide: where p(a|b, t) represents the probability of observing base a at time t conditioned on base b being observed at time zero; Nuc = {A, C, G, T } = {1, 2, 3, 4} represents the set of nucleotides; δ ij represents the Kronecker delta, which is 1 if i = j and is 0 otherwise; and R = {R ij } represents the substitution rate matrix. The solution is given by a matrix exponential, which can be numerically computed by using the eigenvalue decomposition of rate matrix R = U U −1 ( � = diag( 1 , . . . , 4 ) ) as follows [15], Similar to the scalar exponential function, a matrix exponential has an infinite product representation, where Q = (I + tR/N ) . The matrix Q satisfies the condition of a transition matrix of a discrete Markov process for sufficiently large N, and our formula for t MRS can be derived via this connection to the discrete model. In the second equation, � N (a, b) is the set of all paths X along discrete time points 0, . . . , N such that X = {X k ∈ Nuc|k = 0, . . . , N , X N = a, X 0 = b} . Then, the summand of the second equation can be interpreted as the probability of substitution history P(X N , X N −1 , . . . , X 1 |X 0 ) . In the discrete model, the random variable T MRS that represents the time to the most recent substitution is given by where I(·) is the indicator function. Note that in the first equation, we define T MRS = t if path X has no substitution at all. In the second line, we used , and the two terms in the second line mostly cancel out to give the third line. Then, the expected value t MRS of T MRS is given by where Q D is the diagonal part of Q.
In order to take the continuum limit ( N → ∞ ), we use formulas such as where R D is the diagonal part of rate matrix R. By using these formulas, t MRS can be computed using the following formulas where R = U U −1 and � = diag( 1 , . . . , 4 ) is an eigenvalue decomposition of rate matrix R. The formulas for the standard deviation σ of T MRS and probability q of no substitution can be derived in similar manners and given by The derivation of each above formula is described in Additional file 1.

Strand symmetric rate matrix
Let a c be the complementary nucleotide of nucleotide a. A rate matrix R is strand symmetric if it satisfies R a c b c = R ab for all a, b ∈ Nuc [16]. Strand non-symmetric rate matrices such as the general time reversible (GTR)  In the left figure, we expect the last substitution occurred between node x and y, and t MRS will be around t 1 to t 1 + t 2 . In the middle figure, the pattern of alignment column is not simple, and the state of node y can be either A or G. Therefore, the inferred t MRS will have a large variance between t 1 to t 1 + t 2 + t 3 . In the right figure, there is an ambiguous nucleotide in the column. In such cases, the inferred t MRS value is the same as that inferred from only three species, and the confidence will accordingly be lower than when all four nucleotides are known model generally produce different posterior expectation values if we take the complement of an alignment column. Since there is no specific strand direction in intergenic regions and the existence of two different expectation values for a single genomic site complicates the downstream analyses, we use the most general, 6-parameter strand symmetric rate matrix. Table 1 shows the parametrization of rate matrices of strand symmetric model and GTR model. The parameters are optimized together with the edge lengths of the phylogenetic tree using the maximum likelihood method. We optimize the parameters using a LBFGSB gradient descent package [17], where we compute the gradient of likelihood function exactly using a inside-outside algorithms as described in Refs. [4,18,19].
where Z(Y ) represents the likelihood of alignment column Y, π represents the equilibrium distribution for rate matrix R, Y (L ′ ) represents the partial alignment column for a subset of leaf nodes L ′ ∈ L ( L : the set of all leaf nodes), a = Y (c 0 ) , b k represents the sibling node of c k−1 with parent node c k , t n represents the edge length between node n and its parent node, L(n) the descendant leaves of node n, and X n represents the random variable that represents the nucleotide type at node n. The inside variable α(n, i) = P(Y (L(n))|X n = i) represents the probability of emitting partial alignment column Y (L(n)) given the state at node n is fixed to i. See Fig. 2 for the relations between tree nodes and dynamic programming variables. Because the range of integration is localized only in the kth edge in the above equation, we can compute t MRS using a dynamic programming algorithm (Algorithm 1). In

Phylogenetic tree case
To extend our algorithm to phylogenetic trees, we specify a target species that corresponds to a leaf node of a tree and consider the path from the leaf node to the root node. Each internal node along the path corresponds to the last common ancestor (concestor) [20] of the target species and some extant species. Let C = c 0 , . . . , c M be the set of concestors with c M being the root node and c 0 being the leaf node of the target for convenience. Further, let s i be the fraction of path length between the leaf and c k , let s kl = (s l − s k ) , and let t be the total path length from the target leaf to the root. Then, the corresponding formula of Eq. 1 is obtained by dividing the integration range into sub-intervals between neighboring concestors and inserting the probabilities {γ k } that emit partial alignment columns that are descendants of the sister branch of each concestor (see Fig. 2 ] ji represents the probability of transition j ← i after time t without any substitution. κ(i, j) is defined by β(n, i) = P(Y (L\L(n)), X Pa(n) = i) is called an outside variable and represents the probability of emitting alignment nucleotides other than the descendants L(n) of node n with a constraint that the state of the parent node Pa(n) is fixed to i. The inside and outside variables are computed by using the inside and outside algorithms [4,19] resembling the use of forward-backward algorithms in linear hidden Markov models. α D (c k , i) is the probability that emits the partial alignment column Y (L(n)) with no substitution along the target lineage up to concestor node c k , given the state of node n is fixed to i. Similar algorithms can be derived for the standard deviation σ and the probability of no mutation q as described in Additional file 1.

Alignment gaps and ambiguous characters
We treat gap and ambiguous nucleotide characters of non-target leaves as missing characters; we sum the probabilities of all possible nucleotide patterns in computation. Then, the probability condition indicates that the estimated values are the same as those computed from the reduced phylogenetic tree and alignment columns after removal of gaps and ambiguous characters and the corresponding edges in the tree. This increases the standard deviation σ of estimates t MRS . On the other hand, we do not consider the sites if the character of the target is a gap or an ambiguous character.

Software availability
We implemented our algorithms in the C++ language. The resulting software ('TMRS') is available at our website [21].

Dataset and data processing
We downloaded the MAF-formatted Multiz100way multiple alignment files from the UCSC genome browser site, which consists of multiple genome alignments of 100 vertebrate species, including the human genome version hg38. We also downloaded the phylogenetic tree data from the PhyloP track, whose edge lengths are trained using fourfold degenerate (4d) sites of RefSeq genes under the general time reversible model. We used the topology of the PhyloP phylogenetic tree as it is, and trained only the edge lengths of the tree as well as the rate parameters of the strand symmetric model. For this, we collected alignment columns at human 4d sites based on gene annotations of the Ref-Gene track from the UCSC site, following Siepel et al. [8] and Pollard et al. [9]. The reason for using 4d sites is the higher quality of alignments and higher coverage of distant species in the alignments [8,9], though they may be subject to various evolutionary constraints. In order to investigate the uncertainty of trained parameters, we randomly sampled 100 sets of 4d sites from about three million 4d sites in the human genome such that each has a given number of sites, ranging from 1 to 10 5 . We generated an alignment of concatenated genomic alignment columns, and trained parameters based on the maximum likelihood method [22], using the LBFGS-B gradient descent package [17].
For studying differences in t MRS distributions among genes, we sampled 100,000 alignment columns from intergenic, CDS, 3′UTR, and 5′UTR sequences based on 'Gencode v24 Basic' track gene models from the UCSC site [3].
Anderson et al. [23] identified genomic elements called transcribed enhancers in human and other genomes, where short RNAs are produced by bidirectional transcription as a result of chromatin openings. From the FANTOM5 enhancer atlas site [24], we downloaded the coordinates of transcribed enhancers and the list of tissue and cell specific enhancers where bidirectional transcription occurs in a tissue and/or cell-specific manner.

Parameter optimization and performance tests
We trained rate matrix {R ij } and tree edge lengths {t k } from genomic multiple alignment columns sampled from 4d sites. We trained 100 sets of parameters with random initial points from 100 sets of random-sampled alignment columns. Figure 3 (top left) shows the distributions of pairwise relative differences of trained parameters θ = {R ij }, {t k } for each number of alignment columns. Here, the relative difference between two parameters θ 1 and θ 2 is defined by |θ 1 − θ 2 |/max(|θ 1 |, |θ 2 |) with |v| being the Euclidean norm. It shows the trained parameter converges very well as increasing the number of alignment columns. Figure 3 (top right) shows the Pearson correlation coefficient with the tree edge lengths provided in the PhyloP track of the UCSC genome browser, which was computed using the general time reversible model [9]. It shows concordant tree edge lengths (correlation coefficient > 0.9 ) are learned despite the differences in rate matrix models. Figure 3 (bottom) shows the distributions of the tree path lengths from the leaf node of humans to its concestor nodes using parameters trained with 100,000 alignment columns. As the variance among training sets is very small, we use their mean values as the times to concestors before the present and do not consider the widths of distributions. Table 2 shows the mean rate matrix and equilibrium distribution. The average transition-transversion rate ratio is about 2.7 in this model (see Section 6 in Additional file 1 for the computation). In the following results, we use 100 sets of parameters that are trained from 100,000 alignment columns and take averages of t MRS , σ , and q computed for each parameter set. In Fig. 4, we compared ( t MRS , σ , q) computed by our algorithms with the corresponding values obtained from posterior sampling of mutation histories along the phylogenetic tree in order to numerically check the correctness of our algorithms. It shows the relative errors between two values monotonically decrease as the sample size and the fineness of discretization increases. Table 3 shows the runtimes of our C++ implementation. We used a single ES-2670 v3 2.3 GHz core as the computational platform. As the t MRS , σ , and q values of each alignment columns are independently computable, our algorithms can deal with the entire human genome with reasonable time using a compute cluster.

Comparison with other statistical measures
To show the significance of our algorithms, we compared the accuracy with two possible methods of estimating t MRS and q. The first method (termed 'reconstruction') uses the ancestral reconstruction. In this method, we first set the nucleotide state of each concestor node c k to the base a c k with the maximal posterior probability: Then, we return the middle point of the edge between nodes c k−1 and c k as t MRS where c k is the most recent concestor whose reconstructed nucleotides differ from that of the target species a c k = Y (c 0 ) . We set q = 1 if there is no such c k and we set q = 0 otherwise. The second method (termed 'alignment') to infer t MRS only considers nucleotides of extant species: we return the middle point of the edge between nodes c k−1 and c k as t MRS where c k is the most recent concestor such that partial alignment column Y (L(c k ) , which are descendants of c k , contain different nucleotide from the target nucleotides . Similarly to the 'reconstruction' method, We set q = 1 if there is no such c k and we set q = 0 otherwise.
To compare the accuracy of our algorithm with these approximate algorithms, we simulated evolutionary history and alignment column of base mutation using forward simulation using the phylogenetic model of the previous section. We masked nucleotide positions where there are gap or ambiguous characters in sampled multiz100way alignments in order to imitate the gap patterns of real alignments. Details of the simulation algorithm is described in Section 6 of Additional file 1. As a result, we obtained 100,000 alignment columns of 100 species with 'true' annotation of t MRS and q ∈ {0, 1}. Figure 5a shows accuracies of predicting the absence of mutation along the target lineage. The x-axis is the fraction of positives in the dataset which was controlled by varying threshold of q. Since the 'reconstruction' and 'alignment' methods assign only binary q values, only a single point is plotted for each. The y-axis represents the ratio of false positives in all the positive predictions (i.e.

Table 2 Trained rate matrix and equilibrium distribution
The elements of rate matrix R Symmetric and its equilibrium frequency π are shown. Parameter variables correspond to matrix R Symmetric in Table 1. We averaged the parameters optimized using 100,000 sampled alignment columns in the 4d sites. Due to the symmetry of rate matrix, complementary nucleotides have the same equilibrium frequency

Substitution type Parameter Rate
Nucleotide Equilibrium frequency π  False Discovery Rate, FDR). It shows that FDR monotonically decreases with decreasing q threshold, indicating the correctness of our algorithm for q. It also indicates that the accuracies of absence call of reconstruction and alignment methods are similar to that of our algorithm with positive fraction 0.5 and 1.0, respectively. Figure 5b shows the mean errors of predicted t MRS relative to the total length of target lineage for each positive fraction. The error mostly decreases with stricter thresholds for our method, while reconstruction and alignment methods show more than 10% errors on average. Table 4 shows numerical values of FDR and mean error for several q threshold. Since the mean error of t MRS is less than 5% of the total length of target lineage, we will use threshold q = 0.01 in the analyses in the following sections. Table 5 shows the comparison of t MRS and other statistical measures computed from genomic alignments. We used the same alignment columns in the previous paragraph but with filtering with threshold q < 1 for true q values. For this dataset, we computed Spearman's correlation coefficients with true t MRS and other indicators:

Table 4 Effects of filtering by probability q of no mutation
We computed a few statistical measures for the simulation dataset obtained by forward sampling of base mutation history. The first column represents the threshold values q threshold . 'positive fraction' represents the fraction of alignment columns with q < q threshold . 'FDR for no mutation' represents the fraction of the alignment columns that have no mutation along the target lineage but satisfy q < q threshold . '% error of t MRS ' represents the mean % error of estimated t MRS relative to the total edge length of the target lineage  t MRS (q < 0.01, 0.1, 1) represents our algorithms with a few filtering criteria of q. 'reconstruction' and 'alignment' are the approximate methods described above with filtering based on q values computed by their respective method. 'entropy' represents the information entropy of base frequency of alignment column. 'pairwise' represents the ratio of the number of identical bases in n(n − 1)/2 possible base pairs of n bases in the alignment column. 'phastcons' represents the posterior probability of conserved region computed by PhastCons [8]. 'phylop' represents the negative p-value of conservation computed by PhyloP [9]. 'gerp' represents the estimated number of 'rejected mutations' compute by Gerp++ [10]. The table shows small correlation of conservation measures (entropy, pairwise, phastcons, phylop, gerp) with t MRS and very high correlation of estimated t MRS with strict filtering criterion q < 0.01 . It shows our algorithms can accurately extract distinct evolutionary information which is difficult to extract with previous conservation measures.

Genomic distribution of t MRS
We computed the time to the most recent substitution t MRS , its standard deviation σ , and the probability q that there is no substitution for alignment columns uniformly sampled from the human genome. The scatter plot of t MRS and q values (Fig. 6 (top left)) shows the probability of no substitution q tends to increase with increasing t MRS . However, the distribution is broad depending on the nucleotide patterns of alignment columns, and a nonzero fraction of sites have deep ancestral substitutions (i.e., large t MRS and small q) within the Homo-Vertebrate lineage. The scatter plot of σ and q ( Fig. 6 (right)) shows that the probability of no substitution q is very small if  Figure 6 (bottom left) shows the density of q for each annotated genomic region. Compared to Intergenic, Intron, 3′UTR, and 5′UTR, CDS regions have a large fraction of sites with a high probability of no mutation, indicating many ancestral nucleotides that were fixed before the appearance of the vertebrate concestor. Since computed t MRS values have less meaning if q is large, we filtered out sites with q > 0.01 and plotted the distributions of t MRS values for the remaining sites (Fig. 6  (bottom right)). There are several peaks because some sites are guaranteed to experience the last substitution between specific interval of concestors. All regions have the highest peak around t MRS ∼ 0.1 , which is between the Simiiformes and Primate concestors. CDS regions have a large peak around t MRS ∼ 0.36 , which corresponds to between the Eutheria and Theria concestors.

Concestor interval of the last substitution event
We are generally interested in the substitutions that are associated with the evolution of unique features in the species that inherited them. In this respect, we want to know in which interval between two speciation events (i.e., between two concestor nodes) each t MRS is located. In order to simplify the presentation, we reduced the concestor nodes from the full 19 concestors of the PhyloP tree to eight as shown in Table 6 and Fig. 7 in the following analyses of concestor intervals. Since the estimated t MRS values can have a large standard deviation σ , we consider intervals between all pairs of concestor nodes: Homo-Hominoidea, Homo-Mammalia, Mammalia-Vertebrata, etc. Then, we assign a concestor interval I to a site if q < 0.01 and if I is the smallest interval that contains a confidence interval [t MRS − 2σ , t MRS + 2σ ] . Only about 4% of sites were assigned to any concestor interval by this method. Figure 8 (top) shows the frequency distribution of genomic sites that are assigned to some concestor interval, which shows that many sites are assigned to concestor intervals Hominoidea-Euarchontoglires, Hominoidea-Eutheria, or Homo-Euarchontoglires. Figure 8 (bottom) shows the same frequency distributions for each category of annotated genomic regions. The distributions, except that of CDS, are similar to each other. On the other hand, CDS regions have many deep ancestral intervals.

Tissue-concestor interval correlations for transcribed enhancers
Andersson et al. [23] identified genomic elements called transcribed enhancers in the human genome and other genomes where short RNAs are produced by bidirectional transcription as a result of chromatin opening. They showed transcribed enhancers often overlap with protein-binding marks such as ChIP-seq peaks or protein-binding motifs. They are also enriched in disease-associated single nucleotide polymorphisms (SNPs). Many transcribed enhancers are tissue-specific in that bidirectional transcription of short RNAs occurs frequently in specific tissues. They showed the expressions of a number of genes are well explained by those of a few transcribed enhancers upstream of the genes. Thus, we can see tissue specific enhancer activities for these transcribed enhancers. In the  Table 6 for the numerical values of evolutionary time  enhancer atlas site [24], tissue-specific enhancers are annotated by using the UBERON tissue anatomy ontology and Cell Ontology [25,26]. For example, 41 diverse tissues were assigned to 10-1335 differentiallyexpressed enhancers (see Additional file 1: Table S2) [24]. Using these data, we studied the tissue and concestor interval of the last substitution event as an example of screening evolutionarily important events that affected life designs of extant organisms. We computed (t MRS , σ , q) for each site of the transcribed enhancer regions, filtered out the sites with q > 0.01 , and associated concestor intervals as described above. For each concestor interval, we list the enhancers that contain sites associated with the interval. We used the hypergeometric test to determine if the sites corresponding to a specific concestor interval are significantly enriched for the enhancers transcribed in a specific tissue type. Table 7 shows tissues that have the top five most significant p-values for some concestor interval (more details are discussed in Section 7 of Additional file 1). We find that the brain and Homo-Vertebrata interval association has the most significant p-value and Homo-Vertebrata sites are enriched threefold in brain-associated enhancers relative to expectations. The second tissue was meninx, which is also associated with the nervous system (Table 7). Figure 9 shows a few sampled alignment columns in a brain-specific enhancer, which are assigned to the Homo-Vertebrata interval. Alignment columns that have three or more nucleotides suggest there are some substitutions along the Homo-Vertebrata lineage, but the patterns of nucleotide types and the number of gaps makes it difficult to determine at what time point the substitution occurs. Thus, the assigned intervals are the most ambiguous for these alignment columns.

Tissue-concestor interval correlations for genes
We studied the correlation between the tissue-specificity and concestor intervals for genes in a similar manner as for transcribed enhancers. See Section 9 in Additional file 1 for detailed description of the method. Table 8 shows the top three tissues that have genes with sites corresponding to specific concestor intervals are shown. Within each tissue, the top three concestor intervals are shown. As compared to the corresponding Table 7 for transcribed enhancers, deeply ancestral intervals appear in the table, indicating the high level of conservation of exonic sequences. On the other hand, fold enrichment of concestor intervals are smaller than in transcribed enhancers which make it more difficult to infer the impact of the most recent mutations on the life design of extant species than in the case of transcribed enhancers.

Conclusions
We have developed algorithms to infer the time t MRS to most recent substitution in the lineage from a given target species to the root of a phylogenetic tree. In order to filter out highly conserved sites and ambiguous sites where the confidence of estimated t MRS is low, we also compute the probability q of no mutation and the standard deviation σ of t MRS . We computed these variables efficiently using dynamic programming algorithms on the phylogenetic tree such that the algorithms can be applied to multiple genomic alignments with 100 species. We have empirically checked the correctness of our algorithms by posterior sampling of mutation histories on the tree. Our algorithms are exact under the assumptions of the model: genome evolution follows a site-independent continuous-time Markov process along the phylogenetic tree. Our results also depend on the quality of Multiz alignment, which was debated previously [27]. Although alignment errors can be Table 7 Tissue-concestor interval correlation for transcribed enhancers The top three tissues that have transcribed enhancers with sites corresponding to specific concestor intervals are shown. Within each tissue, the top three concestor intervals are shown. The sorting order is based on Z-scores that are based on the hypergeometric test and indicate the significance of enrichment of specific concestor intervals in tissue-specific enhancers. ' − log 10 (p-value) ' is the minus log10 p-value of the test computed using the phyper() function in the R programming language. 'Enrichment' is the fold enrichment within the concestor interval relative to the expected occurrence by random sampling. 'Observed' is the number of transcribed enhancers that have both attributes of the Tissue and Interval columns. see Section 8 in Additional file 1 about the use of Z-scores for ranking tissues less influential if the corresponding leaf nodes are far from the target lineage, the incomplete coverage of sequenced genomes directly affects the number of sites whose t MRS can be determined with confidence. We expect that the number of sites with confident t MRS value will increase as the coverage of genome sequences improve in the future.
We have applied our tool to 100-species multiple genome alignments with human target and obtained a frequency spectrum of concestor intervals that categorized the time points at which the last substitutions occurred. Furthermore, we studied the correlation between the frequency of concestor intervals  Fig. 9 Examples of alignment columns. The figure shows the examples of alignment columns that include the concestor interval Homo-Vertebrata in the transcribed enhancer regions and show brain-specific RNA transcription. The y-axis represents nine example alignment columns and x-axis represents nucleotides of each column, in which gaps, ambiguous nucleotides, and unaligned regions are shown as blank. The species are aligned such that it conforms phylogenetic trees and sorted such that species more evolutionarily distant from humans are placed on the right Table 8 Tissue-concestor interval correlation for genes The top three tissues that have genes with sites corresponding to specific concestor intervals are shown. Within each tissue, the top three concestor intervals are shown. The sorting order is based on Z-scores that are based on the hypergeometric test and indicate the significance of enrichment of specific concestor intervals in tissue-specific genes. ' − log 10 (p-value) ' is the minus log10 p-value of the test computed using the phyper() function in the R programming language. 'Enrichment' is the fold enrichment within the concestor interval relative to the expected occurrence by random sampling. 'Observed' is the number of Entrez genes that have both attributes of the tissue and interval columns