In this section we develop a dynamic programming (DP) algorithm to solve the optimization variant of PQ-Tree Search (Problem 5). Our algorithm receives as input an instance of PQ-Tree Search \((T, S, h, d_T, d_S)\), where h is a substitution scoring function. Our default assumption is that deletions are not penalized, and therefore \(\delta\) (the deletion cost function) is not given as input. The case where deletions are penalized is described in Sect. 2 of Additional file 1. The output of the algorithm is a one-to-one mapping, \({\mathcal {M}}{}\), that yields the best (highest scoring) derivation of T to a substring of S with up to \(d_T\) deletions from T and up to \(d_S\) deletions from the substring, and the score of that derivation. With a minor modification, the output can be extended to include a one-to-one mapping for every substring of S and the derivations that they yield.
Brief overview
On a high level, our algorithm consists of three components: the main algorithm, and two other algorithms that are used as procedures by the main algorithm. Apart from an initialization phase, the crux of the main algorithm is a loop that traverses the given PQ-tree, T. For each internal node x, it calls one of the two other algorithms: P-mapping (given in " P-node mapping: the algorithm" section) and Q-mapping. These algorithms find and return the best derivations from the subtree of T rooted in x, T(x), to substrings of S, based on the type of x (P-node or Q-node). So, the main algorithm solves PQ-Tree Alignment for all substrings of S that start at a specific index. Then, the scores of the derivations are stored in the DP table. The outline of the algorithm is exemplified in Fig. 4.
We now give a brief informal description of the main ideas behind our P-mapping and Q-mapping algorithms. Our P-mapping algorithm is inspired by an algorithm described by van Bevern et al. [43] to solve the Job Interval Selection problem. Our problem differs from theirs mainly in its control of deletions. Intuitively, in the P-mapping algorithm we consider the task at hand as a packing problem, where every child of x is a set of intervals, each corresponding to a different substring. The objective is to pack non-overlapping intervals such that for every child of x at most one interval is packed. Then, the algorithm greedily selects a child \(x'\) of x and decides either to pack one of its intervals (and which one) or to pack none (in which case \(x'\) is deleted). The Q-mapping algorithm is similar to the classical problem of sequence alignment with bounded gaps and therefore will not be elaborated in the paper. It is deferred to the supplementary material (see Sect. 1, Additional file 1).
In the following sections, we describe the main algorithm ("The main algorithm" section) and the P-mapping algorithm ("P-node mapping" section). Afterwards, the time complexity of the algorithm is analyzed and compared to that of a naïve algorithm ("Complexity analysis of the PQ-tree search aalgorithm" section). The modifications necessary for penalizing deletions are deferred to the supplementary material (see Sect. 2, Additional file 1).
The main algorithm
We now delve into more technical details. The algorithm (whose pseudocode is given in 1) constructs a 4-dimensional DP table \({\mathcal {A}}{}\) of size \(m' \times n \times d_T+1 \times d_S+1\), where \(m'=m+m_p+m_q\) is the number of nodes in T. The purpose of an entry of the DP table, \({\mathcal {A}}[x,i,k_T,k_S]\), is to hold the highest score of a derivation of the subtree T(x) to a substring \(S'\) of S starting at index i with \(k_T\) deletions from T(x) and \(k_S\) deletions from \(S'\). Note that we abuse notation and use a node x of T also as an index for the DP table entries that refer to x. If no such derivation exists, \({\mathcal {A}}[x,i,k_T,k_S] = -\infty\). Addressing \({\mathcal {A}}{}\) with some of its indices given as dots, e.g. \({\mathcal {A}}[x, i,\cdot ,\cdot ]\), refers to the subtable of \({\mathcal {A}}{}\) that is comprised of all entries of \({\mathcal {A}}{}\) whose first two indices are x and i. Some entries of the DP table define illegal derivations, namely, derivations for which the number of deletions are inconsistent with the start index i, the derived node and S. For example, such are derivations that have more deletions from the string than there are characters in the derived string. These entries are called invalid entries and their value is defined as \(-\infty\) throughout the algorithm. Formally, an entry \({\mathcal {A}}[x,i,k_T,k_S]\) is invalid if one of the following is true: \(k_T > \mathsf {span}(x)\), \(k_S > L(x,k_T,k_S)\), \(E(x,i,k_S,k_T) > n\), or \(L(x,k_T,k_S) < 0\).
The algorithm first initializes the entries of \({\mathcal {A}}{}\) that are meant to hold scores of derivations of the leaves of T to every possible substring of S using the following rule. For every \(0\le k_S\le d_S\) and every \(x\in \mathsf {leaves}(root_T)\), do:
-
1
\({\mathcal {A}}[x,i,1,k_S] = 0\)
-
2
\({\mathcal {A}}[x,i,0,k_S] = \displaystyle {\max _{\begin{array}{c} i'=i,...,i+k_S \end{array}}}h(x,S[i'])\)
We remark that this initialization rule can be replaced by initializing \({\mathcal {A}}[x,i,0,0]\) with h(x, S[i]) and for every \(k_T\ne 0\) and \(k_S\ne 0\) initializing \({\mathcal {A}}[x,i,k_T,k_S]\) with \(-\infty\). Nonetheless, we use the former initialization rule because it does not change the time complexity of the algorithm while helping keep notations and proofs simpler.
After the initialization, all other entries of \({\mathcal {A}}{}\) are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in ascending order over every index i, that can be a start index for the substring of S derived from T(x) (the possible values of i are explained in the next paragraph). For every x and i, use the algorithm for Q-mapping or P-mapping according to the type of x. Both algorithms receive the same input: a substring \(S'\) of S, the node x, its children \(x_{1},\dots ,x_{\gamma }\), the collection of possible derivations of the children (denoted by \({\mathcal {D}}{}\)), which have already been computed and stored in \({\mathcal {A}}{}\) (as will be explained ahead) and the deletion arguments \(d_T,d_S\). Intuitively, the substring \(S'\) is the longest substring of S starting at index i that can be derived from T(x) given \(d_T\) and \(d_S\). After being called, both algorithms return a set of derivations of T(x) to a prefix of \(S'=S[i:e]\) and their scores. The set holds the highest scoring derivation for every \(E(x,i,d_T,0) \le e \le E(x,i,0,d_S)\) and for every legal deletion combination \(0\le k_T\le d_T\), \(0\le k_S \le d_S\).
Next, we explain the possible values of i and the definition of \(S'\) more formally. To this end, recall the length function given in "Preliminaries" section, \(L(x,k_T,k_S) \doteq \mathsf {span}(x) - k_T + k_S\). Thus, on the one hand, a substring of maximum length is obtained when there are no deletions from the tree and \(d_S\) deletions from the string. Hence, \(S'=S[i:E(x,i,0,d_S)]\). On the other hand, a shortest substring is obtained when there are \(d_T\) deletions from the tree and none from the string. Then, the length of the substring is \(L(x,d_T,0) = \mathsf {span}(x)-d_T\). Hence, the index i runs between 1 and \(n-(\mathsf {span}(x)-d_T)+1\).
We now turn to address the aforementioned input collection \({\mathcal {D}}{}\) in more detail. Formally, it contains the best scoring derivations of every child \(x'\) of x to every substring of \(S'\) with up to \(d_T\) and \(d_S\) deletions from the tree and string, respectively. It is produced from the entries \({\mathcal {A}}[x', i', k_T, k_S]\) (where each entry gives one derivation) for all \(k_T\) and \(k_S\), and all \(i'\) between i and the end index of \(S'\), i.e. \(i\le i' \le E(x,i,0,d_S)\). For the efficiency of the Q-mapping and P-mapping algorithms, the derivations in \({\mathcal {D}}{}\) are grouped by their root (\(\mu .v\)) and arranged in descending order with respect to their end point (\(\mu .e\)). This does not increase the time complexity of the algorithm, as this ordering is received by previous calls to the Q-mapping and P-mapping algorithms.
In the final stage of the main algorithm, when the DP table is full, the score of a best derivation is the maximum of \(\{{\mathcal {A}}[root_T,i,k_T,k_S] : k_T\le d_T\), \(k_S \le d_S\), \(1\le i\le n-(\mathsf {span}(root_T)-k_T)+1\}\). We remark that by tracing back through \({\mathcal {A}}{}\) the one-to-one mapping that yielded this derivation can be found.
P-node mapping
Before describing the P-mapping algorithm, we set up some useful terminology.
P-node mapping: terminology
We first define the notion of a partial derivation. In the P-mapping algorithm, the derivation of the input node x is built by considering subsets U of its children. With respect to such a subset U, a derivation \(\mu\) of x is built as if x had only the children in U, and is called a partial derivation. Formally, \(\mu\) is a partial derivation of a node x if \(\mu .v=x\) and there is a subset of children \(U'\subseteq \mathsf {children}(x)\) such that the two following conditions are true. First, for every \(u\in U'\) all the leaves in T(u) are neither mapped nor deleted under \(\mu\) - that is, there is no mapping pair \((\ell ,y) \in \mu .o\) such that \(\ell \in \mathsf {leaves}(u)\). Second, for every \(v \in \mathsf {children}(x)\setminus U'\) the leaves in T(v) are either mapped or deleted under \(\mu\). For every \(u \in U'\), we say that u is ignored under \(\mu\). Notice that any derivation is a partial derivation, where the set of ignored nodes (\(U'\) above) is empty. Since all derivations that are computed in a single call to the P-mapping algorithm have the same start point i, it can be omitted (for brevity) from the end point function: thus, we denote \(E_{I}(x,k_T,k_S)\doteq L(x,k_T,k_S)\). Also, for a set U of nodes, we define \(L(U,k_T,k_S) \doteq \sum _{x\in U}\mathsf {span}(x)+k_S-k_T\) and accordingly \(E_{I}(U,k_T,k_S)\doteq L(U,k_T,k_S)\).
We now define certain collections of derivations with common properties (such as having the same numbers of deletions and end point).
Definition 2
The collection of all the derivations of every node \(u \in U\) to suffixes of \(S'[1:E_{I}(U,k_T,k_S)]\) with exactly \(k_T\) deletions from the tree and exactly \(k_S\) deletions from the string is denoted by \({\mathcal {D}}{(U,k_T,k_S)}\).
Definition 3
The collection of all the best derivations from the nodes in U to suffixes of \(S'[1:E_{I}(U,k_T,k_S)]\) with up to \(k_T\) deletions from the tree and up to \(k_S\) deletions from the string is denoted by \({\mathcal {D}}_\le {(U,k_T,k_S)}\). Specifically, for every node \(u \in U\), \(k'_T\le k_T\) and \(k'_S\le k_S\), the set \({\mathcal {D}}_\le {(U,k_T,k_S)}\) holds only one highest scoring derivation of u to a suffix of \(S'[1:E_{I}(U,k_T,k_S)]\) with \(k'_T\) and \(k'_S\) deletions from the tree and string, respectively.
It is important to distinguish between these two definitions. First, the derivations in \({\mathcal {D}}{(U,k_T,k_S)}\) have exactly \(k_T\) and \(k_S\) deletions, while the derivations in \({\mathcal {D}}_\le {(U,k_T,k_S)}\) have up to \(k_T\) and \(k_S\) deletions. Second, in \({\mathcal {D}}{(U,k_T,k_S)}\) there can be several derivations that differ only in their score and in the one-to-one mapping that yields them, while in \({\mathcal {D}}_\le {(U,k_T,k_S)}\), there is only one derivation for every node \(u\in U\) and deletion combination pair \((k'_T,k'_S)\). Note that the end points of all of the derivations are equal.
Definition 2 is used for describing the content of an entry of the DP table, where the focus is on the collection of all the derivations of x to \(S'\) with exactly \(k_T\) and \(k_S\) deletions, \({\mathcal {D}}{(\{x\},k_T,k_S)}\). For simplicity, the abbreviation \({\mathcal {D}}{(u,k_T,k_S)} = {\mathcal {D}}{(\{u\},k_T,k_S)}\) is used. In every step of the P-mapping algorithm, a different set of derivations of the children of x is examined, thus, Definition 3 is used for \(U \subseteq \mathsf {children}(x)\). In addition, the set of derivations \({\mathcal {D}}{}\) that is received as input to the algorithms can be described using Definition 3 as can be seen in Eq. 1 below. In this equation, the union is over all \(U\subseteq \mathsf {children}(x)\) because in this way the derivations of all the children of x with every possible end point are obtained (in contrast to having only \(U=\mathsf {children}(x)\), which results in the derivations of all the children of x with the end point \(E_{I}(\mathsf {children}(x),k_T,k_S)\)).
$$\begin{aligned} {\mathcal {D}}{} = \bigcup _{U \subseteq \mathsf {children}(x)}\bigcup _{k_T\le d_T}\bigcup _{k_S\le d_S} {\mathcal {D}}_\le {(U,k_T,k_S)} \end{aligned}$$
(1)
In the P-mapping algorithm for \(C \subseteq \mathsf {children}(x)\), the notation \(x^{(C )}\) is used to indicate that the node x is considered as if its only children are the nodes in C (the nodes in \(\mathsf {children}(x)\setminus C\) are ignored). Consequentially, the span of \(x^{(C )}\) is defined as \(\mathsf {span}(x^{(C )}) \doteq \sum _{c\in C}\mathsf {span}(c)\), and the set \({\mathcal {D}}(x^{(C )},k_T,k_S){}\) (in Definition 2 where \(U=\{x^{(C )}\}\)) now refers to a set of partial derivations. To use \(x^{(C )}\) to describe the base cases of the algorithm, let us define \(x^{(\emptyset )}\) (\(x^{(C )}\) for \(C=\emptyset\)) as a tree with no labeled leaves to map.
P-node mapping: the algorithm
Recall that the input consists of an internal P-node x, a string \(S'\), bounds on the number of deletions from the tree T and the string \(S'\), \(d_T\) and \(d_S\), respectively, and a set of derivations \({\mathcal {D}}{}\) (see Eq. 1). The output of the algorithm is \(\bigcup _{0\le k_T \le d_T}\bigcup _{0\le k_S \le d_S}{{\,\mathrm{arg\max }\,}}_{\mu \in {\mathcal {D}}(x,k_T,k_S){}}\mu .score\), which is the collection of the best scoring derivations of x to every possible prefix of \(S'\) having up to \(d_T\) and \(d_S\) deletions from the tree and string, respectively. Thus, there are \(O(d_T d_S)\) derivations in the output.
The algorithm (whose pseudocode is given in 2) constructs a 3-dimensional DP table \({\mathcal {P}}{}\), which has an entry for every \(0\le k_T \le d_T\), \(0\le k_S \le d_S\) and subset \(C\subseteq \mathsf {children}(x)\). The purpose of an entry \({\mathcal {P}}[C,k_T,k_S]\) is to hold the best score of a partial derivation in \({\mathcal {D}}(x^{(C )},k_T,k_S){}\), i.e. a partial derivation rooted in \(x^{(C )}\) to a prefix of \(S'\) with exactly \(k_T\) deletions from the tree and \(k_S\) deletions from the string. The children of x that are not in C are ignored (as defined in "P-node mapping: terminology" section) under the partial derivation stored by the DP table entry \({\mathcal {P}}[C,k_T,k_S]\), thus they are neither deleted nor counted in the number of deletions from the tree, \(k_T\). (They will be accounted for in the computation of other entries of \({\mathcal {P}}{}\).) Similarly to the main algorithm, some of the entries of \({\mathcal {P}}{}\) are invalid, and their value is defined as \(-\infty\). Formally, an entry \({\mathcal {P}}[C,k_T,k_S]\) is invalid if one of the following is true: \(k_T > \sum _{c \in C}\mathsf {span}(c)\), \(k_S > L(x^{(C )},k_T,k_S)\), \(L(x^{(C )},k_T,k_S) > \mathsf {len}(S')\), or \(L(x^{(C )},k_T,k_S) < 0\). Every entry \({\mathcal {P}}[C,k_T,k_S]\) for which \(L(x^{(C )},k_T,k_S)=0\) and \(k_S=0\) or for which \(C=\emptyset\) and \(k_T=0\) is initialized with 0. The first set of entries captures the case in which the derived substring is the empty string and thus no character can be deleted from it, i.e. \(k_S\) must equal 0. The second set of entries captures the case in which all the children of x are ignored, thus the value of \(k_T\) must be 0.
After the initialization, the remaining entries of \({\mathcal {P}}{}\) are calculated using the recursion rule in Eq. 2 below. The order of computation is ascending with respect to the size of the subsets C of the children of x, and for a given \(C\subseteq \mathsf {children}(x)\), the order is ascending with respect to the number of deletions from both tree and string.
$$\begin{aligned} {\mathcal {P}}[C,k_T,k_S] = \max {\left\{ \begin{array}{ll} {\mathcal {P}}[C,k_T,k_S-1] \\ \displaystyle \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S)} {\mathcal {P}}[C\setminus \{\mu .v\},k_T-\mu .del_T,k_S-\mu .del_S] + \mu .score \end{array}\right. } \end{aligned}$$
(2)
Intuitively, every entry \({\mathcal {P}}[C,k_T,k_S]\) defines some index \(e'\) of \(S'\) that is the end point of every partial derivation in \({\mathcal {D}}(x^{(C )},k_T,k_S){}\). Thus, \(S'[e']\) must be a part of any partial derivation \(\mu \in {\mathcal {D}}(x^{(C )},k_T,k_S){}\), so, either \(S'[e']\) is deleted under \(\mu\) or it is mapped under \(\mu\). The former option is captured by the first case of the recursion rule. If \(S'[e']\) is mapped under \(\mu\), then due to the hierarchical structure of T(x), it must be mapped under some derivation \(\mu '\) of one of the children of x that are in C. Thus, we receive the second case of the recursion rule.
We remark that the case of a node deletion is captured by the initialization and that adding the option of deleting a node in the recursion rule is therefore redundant.
Once the entire DP table is filled, a derivation of maximum score for every end point and deletion numbers combination can be found in \({\mathcal {P}}[\mathsf {children}(x), \cdot , \cdot ]\). Traversing the said subtable in a specific order guarantees the output derivations are ordered with respect to their end point without further calculations.
Complexity analysis of the PQ-Tree Search algorithm
In this section we compare the time complexity of the main algorithm (in "The main algorithm" section) to the naïve solution for PQ-Tree Search. The complexities of the two algorithms described before as well as the complexity of the Q-mapping algorithm are given in the followings lemmas. Lemma 1 and Lemma 2 are proven in "Time and space complexity of the PQ-tree search algorithm" section, and Lemma 3 is proven in Sect. 1.3 of Additional file 1.
Lemma 1
The algorithm in "The main algorithm" section takes \(O(n \gamma {d_T}^2 {d_S}^2 (m_p 2^{\gamma } + m_q))\) time and \(O(d_T d_S (m n + 2^\gamma ))\) space, where \(\gamma\) is the maximum degree of a node in T.
Lemma 2
The P-mapping algorithm takes \(O({d_T}^2 {d_S}^2 \gamma 2^\gamma )\) time and \(O(d_T d_S 2^\gamma )\) space.
Lemma 3
The Q-mapping algorithm takes \(O({d_T}^2 {d_S}^2 \gamma )\) time and \(O(d_T d_S \gamma )\) space.
From Lemma 1 it is proven that PQ-Tree Search has an FPT solution with the parameter \(\gamma\) (Theorem 1).
Theorem 1
PQ-Tree Search with parameter \(\gamma\) is FPT. Particularly, it has an FPT algorithm that runs in \(O^*(2^{\gamma })\) timeFootnote 1.
The naïve solution for PQ-Tree Search is to solve sequence alignment with bounded gaps for every substring of S versus every string \(S'\in C(T)\), so the naïve solution takes \(O(2^{m_q}{(\gamma !)}^{m_p} n m(d_T+d_S)d_T d_S)\) time. (A full description of the naïve algorithm and its complexity is given in Sect. 4 of Additional file 1.) Therefore, we conclude that the time complexity of our algorithm is substantially better, as exemplified by considering two complementary cases. One, when there are only P-nodes in T (i.e. \(m_p=O(m)\)), the naïve algorithm is super-exponential in \(\gamma\), and even worse, exponential in m, while ours is exponential only in \(\gamma\), and hence polynomial for any \(\gamma\) that is constant (or even logarithmic in the input size). Second, when there are only Q-nodes in T (i.e. \(m_q=O(m)\)), the naïve algorithm is exponential while ours is polynomial.