 Research
 Open access
 Published:
Approximate search for known gene clusters in new genomes using PQtrees
Algorithms for Molecular Biology volume 16, Article number: 16 (2021)
Abstract
Gene clusters are groups of genes that are colocally conserved across various genomes, not necessarily in the same order. Their discovery and analysis is valuable in tasks such as gene annotation and prediction of gene interactions, and in the study of genome organization and evolution. The discovery of conserved gene clusters in a given set of genomes is a well studied problem, but with the rapid sequencing of prokaryotic genomes a new problem is inspired. Namely, given an already known gene cluster that was discovered and studied in one genomic dataset, to identify all the instances of the gene cluster in a given new genomic sequence. Thus, we define a new problem in comparative genomics, denoted PQTree Search that takes as input a PQtree T representing the known gene orders of a gene cluster of interest, a genetogene substitution scoring function h, integer arguments \(d_T\) and \(d_S\), and a new sequence of genes S. The objective is to identify in S approximate new instances of the gene cluster; These instances could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by \(d_T\) and \(d_S\), respectively. We prove that PQTree Search is NPhard and propose a parameterized algorithm that solves the optimization variant of PQTree Search in \(O^*(2^{\gamma })\) time, where \(\gamma\) is the maximum degree of a node in T and \(O^*\) is used to hide factors polynomial in the input size. The algorithm is implemented as a search tool, denoted PQFinder, and applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQtrees. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.
Introduction
Recent advances in pyrosequencing techniques, combined with global efforts to study infectious diseases, yield huge and rapidlygrowing databases of microbial genomes [3, 4]. These big new data statistically empower genomiccontext based approaches to functional analysis: the biological principle underlying such analysis is that groups of genes that are located close to each other across many genomes often code for proteins that interact with one another, suggesting a common functional association. Thus, if the functional association and annotation of the clustered genes is already known in one (or more) of the genomes, this information can be used to infer functional characterization of homologous genes that are clustered together in another genome.
Groups of genes that are colocally conserved across many genomes are denoted gene clusters. The locations of the group of genes comprising a gene cluster in the distinct genomes are denoted instances. Gene clusters in prokaryotic genomes often correspond to (one or several) operons; those are neighbouring genes that constitute a single unit of transcription and translation. However, the order of the genes in the distinct instances of a gene cluster may not be the same.
The discovery (i.e. datamining) of conserved gene clusters in a given set of genomes is a well studied problem [5,6,7]. However, with the rapid sequencing of prokaryotic genomes a new problem is inspired. Namely, given an already known gene cluster that was discovered and studied in one genomic dataset, to identify all the instances of the gene cluster in a given new genomic sequence.
One exemplary application for this problem is the search for chromosomal gene clusters in plasmids. Plasmids are circular genetic elements that are harbored by prokaryotic cells where they replicate independently from the chromosome. They can be transferred horizontally and vertically, and are considered a major driving force in prokaryotic evolution, providing mutation supply and constructing new operons with novel functions [8], for example antibiotic resistance [9]. This motivates biologists to search for chromosomal gene clusters in plasmids, and to study structural variations between the instances of the found gene clusters across the two distinct replicons. However, in addition to the fact that plasmids evolve independently from chromosomes and in a more rapid pace [10], their sequencing, assembly and annotation involves a more noisy process [11].
To accommodate all this, the proposed search approach should be an approximate one, sensitive enough to tolerate some amount of genome rearrangements: transpositions and inversions, missing and intruding genes, and classification of genes with similar function to distinct orthology groups due to sequence divergence or convergent evolution. Yet, for the sake of specificity and search efficiency, we consider confining the allowed variations by two types of biological knowledge: (1) bounding the allowed rearrangement events considered by the search, based on some grammatical model trained specifically from the known gene orders of the gene cluster, and (2) governing the genetogene substitutions considered by the search by combining sequence homology with functionalannotation based semantic similarity.
Bounding the allowed rearrangement events. The PQtree [12] is a combinatorial data structure classically used to represent gene clusters [13]. A PQtree of a gene cluster describes its hierarchical inner structure and the relations between instances of the cluster succinctly, aids in filtering meaningful from apparently meaningless clusters, and also gives a natural and meaningful way of visualizing complex clusters. A PQtree is a rooted tree with three types of nodes: Pnodes, Qnodes and leaves. The children of a Pnode can appear in any order, while the children of a Qnode must appear in either lefttoright or righttoleft order. (In the special case when a node has exactly two children, it does not matter whether it is labeled as a Pnode or a Qnode.) Booth and Lueker [12], who introduced this data structure, were interested in representing a set of permutations over a set U, i.e. every member of U appears exactly once as a label of a leaf in the PQtree. We, on the other hand, allow each member of U to appear as a label of a leaf in the tree any nonnegative number of times. Therefore, we will henceforth use the term string rather than permutation when describing the gene orders derived from a given PQtree.
An example of a PQtree is given in Fig. 1. It represents a Phn gene cluster that encodes proteins that utilize phosphonate as a nutritional source of phosphorus in prokaryotes [14]. The biological assumptions underlying the representation of gene clusters as PQtrees is that operons evolve via progressive merging of suboperons, where the most basic units in this recursive operon assembly are colinearly conserved suboperons [15]. In the case where an operon is assembled from suboperons that are colinearly dependent, the conserved gene order could correspond, e.g., to the order in which the transcripts of these genes interact in the metabolic pathway in which they are functionally associated [16]. Thus, transposition events shuffling the order of the genes within this suboperon could reduce its fitness. On the other hand, inversion events, in which the genes participating in this suboperon remain colinearly ordered are accepted. This case is represented in the PQtree by a Qnode (marked with a rectangle). In the case where an operon is assembled from suboperons that are not colinearly codependent, convergent evolution could yield various orders of the assembled components [15]. This case is represented in the PQtree by a Pnode (marked with a circle). Learning the internal topology properties of a gene cluster from its corresponding gene orders and constructing a query PQtree accordingly, could empower the search to confine the allowed rearrangement operations so that colinear dependencies among genes and between suboperons are preserved.
Governing the genetogene substitutions. A prerequisite for gene cluster discovery is to determine how genes relate to each other across all the genomes in the dataset. In our experiment, genes are represented by their membership in Clusters of Orthologous Groups (COGs) [17], where the sequence similarity of two genes belonging to the same COG serves as a proxy for homology. Despite low sequence similarity, genes belonging to two different COGs could have a similar function, which would be reflected in the functional description of the respective COGs. Using methods from natural language processing [18], we compute for each pair of functional descriptions a score reflecting their semantic similarity. Combining sequence and functional similarity could increase the sensitivity of the search and promote the discovery of systems with related functions.
Our contribution and roadmap. We define two new problems in comparative genomics, denoted PQTree Search and PQTree Alignment (in "Preliminaries" section), where the second is a subproblem of the first. Both problems take as input a PQtree T (the query) representing the known gene orders of a gene cluster of interest, a genetogene substitution scoring function h, integer arguments \(d_T\) and \(d_S\), and a sequence of genes S (the target). The objective in PQTree Search is to identify an approximate instance \(S'\) of the gene cluster, such that \(S'\) is a substring of S. The objective of PQTree Alignment is to determine whether \(S'\) is an approximate instance of the gene cluster; An approximate instance could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by \(d_T\) and \(d_S\), respectively. We prove that both PQTree Search and PQTree Alignment are NPhard (Theorems 2, 3 in "PQtree search is NPhard" section).
We define optimization variants of PQTree Search and PQTree Alignment (in " Preliminaries" section) and propose an algorithm (in "A parameterized algorithm" section) that solves PQTree Search in \(O(n \gamma {d_T}^2 {d_S}^2 (m_p \cdot 2^{\gamma } + m_q))\) time, where n is the length of S, \(m_p\) and \(m_q\) denote the number of Pnodes and Qnodes in T, respectively, and \(\gamma\) denotes the maximum degree of a node in T. The proposed algorithm for PQTree Search solves PQTree Alignment for every substring of S. Thus, in the same time and space complexities, we can also report all approximate instances of T in S and not only the optimal one.
The algorithm is implemented as a search tool, denoted PQFinder. The code for the tool as well as all the data needed to reconstruct the results are publicly available on GitHub [2]. The tool is applied to search for instances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes; methods are given in "Methods and datasets" section. In our preliminary results (in "Results" section), we report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQtree. One of these results, coding for a heavy metal efflux pump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting new structural variants of known gene clusters.
Previous related works. Permutations on strings representing gene clusters have been studied earlier by [19,20,21,22,23]. PQtrees were previously applied in physical mapping [24, 25], as well as to other comparative genomics problems [26,27,28].
In Landau et al. [28] an algorithm was proposed for representation and detection of gene clusters in multiple genomes, using PQtrees: the proposed algorithm computes a PQtree of k permutations of length n in O(kn) time, and it is proven that the computed PQtree is the one with a minimum number of possible rearrangements of its nodes while still representing all k permutations. In the same paper, the authors also present a general scheme to handle gene multiplicity and missing genes in permutations. For every character that appears a times in each of the k strings, the time complexity for the construction of the PQtree, according to the scheme in that paper, is multiplied by an \(O((a!)^k)\) factor.
Additional applications of PQtrees to genomics were studied in [29,30,31], where PQtrees were considered to represent and reconstruct ancestral genomes.
However, as far as we know, searching for approximate instances of a gene cluster that is represented as a PQtree, in a given new string, is a new computational problem.
Semantic similarity measures between Gene Ontology (GO) terms [32] have been previously used in tasks such as protein function prediction [33, 34], functional enrichment analysis of gene expression datasets [35, 36], and proteinprotein interaction inference [37, 38]. In the context of gene cluster analysis, a recent study mined gene clusters that have common functional associations among seven amniote genomes, by measuring the GO term similarity of the respective genes [39]. However, the Gene Ontology Consortium provides annotations for only 41 prokaryotic genomes, while the dataset used in this study consists of 1487 prokaryotic genomes. Transferring GO annotations from the annotated genomes to the other genomes in our dataset using gene sequence similarity would lead to a limited gene coverage. Therefore, in this study we use COG functional descriptions to measure semantic similarity between genes.
Preliminaries
Let \(\Pi\) be an NPhard problem. In the framework of Parameterized Complexity, each instance of \(\Pi\) is associated with a parameter k, and the goal is to confine the combinatorial explosion in the running time of an algorithm for \(\Pi\) to depend only on k. Formally, \(\Pi\) is fixedparameter tractable (FPT) if any instance (I, k) of \(\Pi\) is solvable in time \(f(k)\cdot I^{{\mathcal {O}}(1)}\), where f is an arbitrary computable function of k. Nowadays, Parameterized Complexity supplies a rich toolkit to design or refute the existence of FPT algorithms [40,41,42].
PQtree: representing the pattern.
The possible reordering of nodes in a PQtree may create many equivalent PQtrees. In [12] two PQtrees T and \(T'\) are defined as equivalent (denoted \(T \equiv T'\)) if one tree can be obtained by legally reordering the nodes of the other; namely, randomly permuting the children of a Pnode, and reversing the children of a Qnode. To allow for deletions in the PQtrees, a generalization of that definition is given in Definition 1 below. Here, smoothing is a recursive process in which if by deleting leaves from a tree T, some internal node x of T is left without children, then x is also deleted, but its deletion is not counted (i.e. only leaf deletions are counted).
Definition 1
(QuasiEquivalence Between PQTrees) For any two PQtrees, T and \(T'\), the PQtree T is quasiequivalent to \(T'\) with a bound d, denoted \(T\succeq _d T'\), if \(T'\) can be obtained from T by (a) randomly permuting the children of some of the Pnodes of T, (b) reversing the children of some of the Qnodes of T, and (c) deleting up to d leaves from T and applying the corresponding smoothing. (The order of the operations does not matter.)
Figure 2 shows two equivalent PQtrees (\(T_1\) and \(T_2\)) that are each quasiequivalent with \(d=1\) to the third PQtree (\(T_3\)). The frontier of a PQtree T, denoted F(T), is the sequence of labels on the leaves of T read from left to right. For example, the frontier of the PQtree \(T_1\) in Fig. 2 is ABCDEFG. It is interesting to consider the set of frontiers of all the equivalent PQtrees, defined in [12] as a consistent set and denoted by \(C(T)=\{F(T'):T\equiv T'\}\). Intuitively, C(T) is the set of all leaf label sequences defined by the PQtree structure and obtained by legally reordering its nodes. Here, we generalize the consistent set definition to allow a bounded number of deletions from T, using quasiequivalence. Thus, the set of dBounded QuasiConsistent trees is denoted by \(C_{d}(T)=\{F(T'):T\succeq _{d} T'\}\).
Clearly \(C(T)=C_0(T)\), and so in a setting where \(d=0\) the former notation is used. For a node x of a PQtree T, the subtree of T rooted in x is denoted by T(x), the set of leaves in T(x) is denoted by \(\mathsf {leaves}(x)\), and the span of x is denoted by \(\mathsf {span}(x)\) and defined as \(\mathsf {leaves}(x)\).
Defining the problems
As a preface for the new problems defined ahead, consider the PQTree Membership problem defined in Problem 1 below, which stems from the definition of consistent set.
Problem 1
(PQTree Membership) Given a PQtree T and a string S, decide if \(S\in C(T)\).
When considering applications of PQtrees to comparative genomics, it is important to allow for insertion, deletion and substitution operations. Thus, a new problem named PQTree Alignment is defined. In what follows we give a decision variant of this problem (in Problem 2), and an optimization variant of this problem (in Problem 3). PQTree Alignment can be thought of as an extension of the PQTree Membership problem that allows insertions, deletions and substitutions of genes. Then, intuitively, given a PQtree T and a string \(S'\), the objective is to find a string \(S''\) such that \(S''\in C(T)\) and \(S''\) is the most similar to \(S'\), where similarity is measured as a sequence alignment score. To avoid confusion, the term insertion is not used, and instead two types of deletions are used: deletions form the PQtree and deletions form the string. In addition, in the rest of this paper, the term substitution is used to encompass both matches and mismatches between aligned genes.
Formally, an instance of the PQTree Alignment problem is a tuple of the form \((T, S', h, d_T, d_S)\), where T is a PQtree with m leaves, \(m_p\) Pnodes, \(m_q\) Qnodes, and every leaf x in T has a label \(\mathsf {label}(x)\)\(\in \Sigma _T\); \(S'=\sigma _1 \cdots \sigma _n \in {\Sigma _S^n}\) is a string of length n representing a sequence of genes; \(d_T\in {\mathbb {N}}\) specifies the number of allowed deletions from T; \(d_S\in {\mathbb {N}}\) specifies the number of allowed deletions from \(S'\); and h is a boolean substitution function, describing the possible substitutions between the leaf labels of T and the characters of the given string, \(S'\). The function h receives a pair \((\sigma _t, \sigma _s)\), where \(\sigma _t \in \Sigma _T\) is one of the labels of the leaves of T, and \(\sigma _s \in \Sigma _S\) is one of the characters of the given string \(S'\), and returns True if \(\sigma _t\) can be replaced with \(\sigma _s\), and False, otherwise. Considering the biological problem at hand, \(\Sigma _T\) and \(\Sigma _S\) are both sets of genes.
The objective of PQTree Alignment is to find a onetoone mapping \({\mathcal {M}}{}\) between the leaves of T and the characters of \(S'\), which comprises a set of pairs each having one of three forms: the substitution form, \((x,\sigma _s(\ell ))\), where x is a leaf in T, \(\sigma _s \in \Sigma _S\), \(h(\mathsf {label}(x),\sigma _s) = True\) and \(\ell \in \{1,\cdots {,}n\}\) is the index of the occurrence of \(\sigma _s\) in \(S'\) that is mapped to the leaf x; the character deletion form, \((\varepsilon , \sigma _s(\ell ))\), which marks the deletion of the character \(\sigma _s\in \Sigma _S\) at index \(\ell\) of \(S'\); the leaf deletion form, \((x, \varepsilon )\), which marks the deletion of x, a leaf node of T.
Applying the substitutions defined in \({\mathcal {M}}{}\) to \(S'\), resulting in the string \(S_{{\mathcal {M}}{}}\), is the process in which for every \((x,\sigma _s(\ell ))\in {\mathcal {M}}{}\), the character \(\sigma _s\) at index \(\ell\) of \(S'\) is deleted if \(x=\varepsilon\), and otherwise substituted by \(\mathsf {label}(x)\). This process is demonstrated in Fig. 3B. We say that \(S'\) is derived from T under \({\mathcal {M}}{}\) with \(d_T\) deletions from the tree and \(d_S\) deletions from the string, if \(d_T\) is equal to the number of pairs in \({\mathcal {M}}{}\) of the leaf deletion form \((x,\varepsilon )\), \(d_S\) is equal to the number of pairs in \({\mathcal {M}}{}\) of the character deletion form \((\varepsilon ,\sigma )\), and \(S_{{\mathcal {M}}{}}\in C_{d_T}(T)\). Thus, by definition, there is a PQtree \(T'\) such that \(F(T')=S_{{\mathcal {M}}{}}\) and \(T \succeq _{d_T} T'\). Note that the deletions of the nodes in T to obtain the nodes in \(T'\) are determined by \({\mathcal {M}}{}\). The conversion of T to \(T'\) as defined by the derivation is illustrated in Fig. 3A. The set of permutations and node deletions performed to obtain \(T'\) from T together with the substitutions and deletions from \(S'\) specified by \({\mathcal {M}}{}\) is named the derivation \(\mu\) of T to \(S'\). We also say that \({\mathcal {M}}{}\) yields the derivation \(\mu\).
Problem 2
(Decision PQTree Alignment) Given a string \(S'\) of length n, a PQtree T with m leaves, deletion bounds \(d_T,d_S \in {\mathbb {N}}{}\), and a boolean substitution function h between \(\Sigma _S\) and \(\Sigma _T\), decide if there is a onetoone mapping \({\mathcal {M}}{}\) that yields a derivation of T to \(S'\) with up to \(d_T\) and up to \(d_S\) deletions from T and \(S'\), respectively.
Notice that by setting both deletion bounds (\(d_T\) and \(d_S\)) to zero and defining \(h(\sigma _t, \sigma _s)=True\) if and only if \(\sigma _t =\sigma _s\), the PQTree Membership problem is obtained from PQTree Alignment. Also, if \(n < md_T\) or \(n>m+d_S\), then PQTree Alignment will return false.
To define an optimization version of the PQTree Alignment problem it is necessary to have a score for every possible substitution between the characters in \(\Sigma _T\) and the characters in \(\Sigma _S\). Hence, for this problem variant assume that h is a substitution scoring function, that is, \(h(\sigma _t,\sigma _s)\) for \(\sigma _t\in \Sigma _T, \sigma _s\in \Sigma _S\) is the score for substituting \(\sigma _s\) by \(\sigma _t\) in the derivation, and if \(\sigma _t\) cannot be substituted by \(\sigma _s\), then \(h(\sigma _t,\sigma _s) = \infty\). In addition, we need a cost function, denoted by \(\delta\), for the deletion of a character of \(S'\) and for the deletion of a leaf of T according to the label of the leaf. So, formally, an instance of the optimization variant of PQTree Alignment is \((T, S', h, \delta , d_T, d_S)\). The score of a derivation \(\mu\), denoted by \(\mu .score\), is the sum of scores of all operations (deletions from the tree, deletions from the string and substitutions) in \(\mu\). Now, instead of deciding whether there exists a onetoone mapping that yields a derivation of T to \(S'\), we can search for the onetoone mapping that yields the best derivation (if there exists such a derivation), i.e. a onetoone mapping for which \(\mu .score\) is the highest.
Problem 3
(Optimization PQTree Alignment) Given a string \(S'\) of length n, a PQtree T with m leaves, deletion bounds \(d_T,d_S\in {\mathbb {N}}{}\), a substitution scoring function h between \(\Sigma _S\) and \(\Sigma _T\), and a deletion cost function \(\delta\), return the onetoone mapping \({\mathcal {M}}{}\) that yields the highest scoring derivation of T to \(S'\) with up to \(d_T\) deletions from T and up to \(d_S\) deletions from \(S'\) (if such a mapping exists).
More generally, in our application the string represents a genome, which is a lot longer than the strings that can be derived from the given PQtree T. Thus, a new problem named PQTree Search is defined (in Problems 4, 5 below). Intuitively, in PQTree Search the objective is to find a substring of a given string S for which PQTree Alignment returns true (or returns the best score, in the optimization variant).
Formally, an instance of the PQTree Search problem is a tuple \((T, S, h, d_T, d_S)\), where \(T, h, d_T\) and \(d_S\) are defined as in PQTree Alignment, and S is defined as \(S'\) with the exception that the string S representing the input genome, can be of any length n (and not bounded by \(md_T\) and \(m+d_S\)). The objective of PQTree Search is to find a onetoone mapping \({\mathcal {M}}{}\) between the leaves of T and the characters of a substring \(S'\) of S that yields a derivation with up to \(d_T\) and up to \(d_S\) deletions from T and \(S'\), respectively. For \(1\le i\le j \le n\), \(S'=S[i:j] = \sigma _i...\sigma _j\) is a substring of S beginning at index i and ending at index j (inclusive). The substring \(S'\) is a prefix of S if \(S'=S[1:j]\) and it is a suffix of S if \(S'=S[i:n]\). In addition, we denote \(\sigma _i\), the \(i^{\text {th}}\) character of S, by S[i].
Now, we would like to acknowledge in the definition of a derivation that a derivation can be to a substring of the target string (as it is in the PQTree Search problem), rather than requiring that the full target string is derived, and add some related terms and notations. So, for a derivation \(\mu\) of T to \(S'=S[s:e]\), the following terms and notations (illustrated in Fig. 3) are given. The root of T (denoted \(root_T\)) is the node that \(\mu\) derives or the root of the derivation and it is denoted by \(\mu .v\). For abbreviation, we say that \(\mu\) is a derivation of \(\mu .v\). The substring \(S'\) is the string that \(\mu\) derives . We name s and e the start and end points of the derivation and denote them by \(\mu .s\) and \(\mu .e\), respectively. The onetoone mapping that yields \(\mu\) is denoted by \(\mu .o\). The number of deletions from the tree is denoted by \(\mu .del_T\). The number of deletions from the string is denoted by \(\mu .del_S\). In addition, if x is a leaf node in T and \((x,\sigma _s(\ell ))\in \mu .o\), then x is mapped to \(S[\ell ]\) under \(\mu\). The character \(S[\ell ]\) is said to be deleted under \(\mu\) if \((\varepsilon ,\sigma _s(\ell ))\in \mu .o\). If \(x \in T(\mu .v)\) is a leaf for which \((x,\varepsilon )\in \mu .o\), then x is deleted under \(\mu\). For an internal node x of T, if every leaf in T(x) is deleted under \(\mu\), then x is deleted under \(\mu\), and otherwise x is kept under \(\mu\). Notice that in PQTree Alignment all the derivations have the start point 1 (\(s=1\)) and the end point m (\(e=m\)). Given a node x and the numbers of deletions \(k_T\) and \(k_S\) of a derivation, the length of the derived string \(S'\) can be calculated using the following length function: \(L(x,k_T,k_S) \doteq \mathsf {span}(x)  k_T + k_S\).
We define two versions of the PQTree Search problem: a decision version (Problem 4) and an optimisation version (Problem 5).
Problem 4
(Decision PQTree Search) Given a string S of length n, a PQtree T with m leaves, deletion bounds \(d_T,d_S \in {\mathbb {N}}{}\), and a boolean substitution function h between \(\Sigma _S\) and \(\Sigma _T\), decide if there is a onetoone mapping \({\mathcal {M}}{}\) that yields a derivation of T to a substring \(S'\) of S with up to \(d_T\) and up to \(d_S\) deletions from T and \(S'\), respectively.
Problem 5
Given a string S of length n, a PQtree T with m leaves, deletion bounds \(d_T,d_S\in {\mathbb {N}}{}\), a substitution scoring function h between \(\Sigma _S\) and \(\Sigma _T\), and a deletion cost function \(\delta\), return the onetoone mapping \({\mathcal {M}}{}\) that yields the highest scoring derivation of T to a substring \(S'\) of S with up to \(d_T\) deletions from T and up to \(d_S\) deletions from \(S'\) (if such a mapping exists).
A parameterized algorithm
In this section we develop a dynamic programming (DP) algorithm to solve the optimization variant of PQTree Search (Problem 5). Our algorithm receives as input an instance of PQTree Search \((T, S, h, d_T, d_S)\), where h is a substitution scoring function. Our default assumption is that deletions are not penalized, and therefore \(\delta\) (the deletion cost function) is not given as input. The case where deletions are penalized is described in Sect. 2 of Additional file 1. The output of the algorithm is a onetoone mapping, \({\mathcal {M}}{}\), that yields the best (highest scoring) derivation of T to a substring of S with up to \(d_T\) deletions from T and up to \(d_S\) deletions from the substring, and the score of that derivation. With a minor modification, the output can be extended to include a onetoone mapping for every substring of S and the derivations that they yield.
Brief overview
On a high level, our algorithm consists of three components: the main algorithm, and two other algorithms that are used as procedures by the main algorithm. Apart from an initialization phase, the crux of the main algorithm is a loop that traverses the given PQtree, T. For each internal node x, it calls one of the two other algorithms: Pmapping (given in " Pnode mapping: the algorithm" section) and Qmapping. These algorithms find and return the best derivations from the subtree of T rooted in x, T(x), to substrings of S, based on the type of x (Pnode or Qnode). So, the main algorithm solves PQTree Alignment for all substrings of S that start at a specific index. Then, the scores of the derivations are stored in the DP table. The outline of the algorithm is exemplified in Fig. 4.
We now give a brief informal description of the main ideas behind our Pmapping and Qmapping algorithms. Our Pmapping algorithm is inspired by an algorithm described by van Bevern et al. [43] to solve the Job Interval Selection problem. Our problem differs from theirs mainly in its control of deletions. Intuitively, in the Pmapping algorithm we consider the task at hand as a packing problem, where every child of x is a set of intervals, each corresponding to a different substring. The objective is to pack nonoverlapping intervals such that for every child of x at most one interval is packed. Then, the algorithm greedily selects a child \(x'\) of x and decides either to pack one of its intervals (and which one) or to pack none (in which case \(x'\) is deleted). The Qmapping algorithm is similar to the classical problem of sequence alignment with bounded gaps and therefore will not be elaborated in the paper. It is deferred to the supplementary material (see Sect. 1, Additional file 1).
In the following sections, we describe the main algorithm ("The main algorithm" section) and the Pmapping algorithm ("Pnode mapping" section). Afterwards, the time complexity of the algorithm is analyzed and compared to that of a naïve algorithm ("Complexity analysis of the PQtree search aalgorithm" section). The modifications necessary for penalizing deletions are deferred to the supplementary material (see Sect. 2, Additional file 1).
The main algorithm
We now delve into more technical details. The algorithm (whose pseudocode is given in 1) constructs a 4dimensional DP table \({\mathcal {A}}{}\) of size \(m' \times n \times d_T+1 \times d_S+1\), where \(m'=m+m_p+m_q\) is the number of nodes in T. The purpose of an entry of the DP table, \({\mathcal {A}}[x,i,k_T,k_S]\), is to hold the highest score of a derivation of the subtree T(x) to a substring \(S'\) of S starting at index i with \(k_T\) deletions from T(x) and \(k_S\) deletions from \(S'\). Note that we abuse notation and use a node x of T also as an index for the DP table entries that refer to x. If no such derivation exists, \({\mathcal {A}}[x,i,k_T,k_S] = \infty\). Addressing \({\mathcal {A}}{}\) with some of its indices given as dots, e.g. \({\mathcal {A}}[x, i,\cdot ,\cdot ]\), refers to the subtable of \({\mathcal {A}}{}\) that is comprised of all entries of \({\mathcal {A}}{}\) whose first two indices are x and i. Some entries of the DP table define illegal derivations, namely, derivations for which the number of deletions are inconsistent with the start index i, the derived node and S. For example, such are derivations that have more deletions from the string than there are characters in the derived string. These entries are called invalid entries and their value is defined as \(\infty\) throughout the algorithm. Formally, an entry \({\mathcal {A}}[x,i,k_T,k_S]\) is invalid if one of the following is true: \(k_T > \mathsf {span}(x)\), \(k_S > L(x,k_T,k_S)\), \(E(x,i,k_S,k_T) > n\), or \(L(x,k_T,k_S) < 0\).
The algorithm first initializes the entries of \({\mathcal {A}}{}\) that are meant to hold scores of derivations of the leaves of T to every possible substring of S using the following rule. For every \(0\le k_S\le d_S\) and every \(x\in \mathsf {leaves}(root_T)\), do:

1
\({\mathcal {A}}[x,i,1,k_S] = 0\)

2
\({\mathcal {A}}[x,i,0,k_S] = \displaystyle {\max _{\begin{array}{c} i'=i,...,i+k_S \end{array}}}h(x,S[i'])\)
We remark that this initialization rule can be replaced by initializing \({\mathcal {A}}[x,i,0,0]\) with h(x, S[i]) and for every \(k_T\ne 0\) and \(k_S\ne 0\) initializing \({\mathcal {A}}[x,i,k_T,k_S]\) with \(\infty\). Nonetheless, we use the former initialization rule because it does not change the time complexity of the algorithm while helping keep notations and proofs simpler.
After the initialization, all other entries of \({\mathcal {A}}{}\) are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in ascending order over every index i, that can be a start index for the substring of S derived from T(x) (the possible values of i are explained in the next paragraph). For every x and i, use the algorithm for Qmapping or Pmapping according to the type of x. Both algorithms receive the same input: a substring \(S'\) of S, the node x, its children \(x_{1},\dots ,x_{\gamma }\), the collection of possible derivations of the children (denoted by \({\mathcal {D}}{}\)), which have already been computed and stored in \({\mathcal {A}}{}\) (as will be explained ahead) and the deletion arguments \(d_T,d_S\). Intuitively, the substring \(S'\) is the longest substring of S starting at index i that can be derived from T(x) given \(d_T\) and \(d_S\). After being called, both algorithms return a set of derivations of T(x) to a prefix of \(S'=S[i:e]\) and their scores. The set holds the highest scoring derivation for every \(E(x,i,d_T,0) \le e \le E(x,i,0,d_S)\) and for every legal deletion combination \(0\le k_T\le d_T\), \(0\le k_S \le d_S\).
Next, we explain the possible values of i and the definition of \(S'\) more formally. To this end, recall the length function given in "Preliminaries" section, \(L(x,k_T,k_S) \doteq \mathsf {span}(x)  k_T + k_S\). Thus, on the one hand, a substring of maximum length is obtained when there are no deletions from the tree and \(d_S\) deletions from the string. Hence, \(S'=S[i:E(x,i,0,d_S)]\). On the other hand, a shortest substring is obtained when there are \(d_T\) deletions from the tree and none from the string. Then, the length of the substring is \(L(x,d_T,0) = \mathsf {span}(x)d_T\). Hence, the index i runs between 1 and \(n(\mathsf {span}(x)d_T)+1\).
We now turn to address the aforementioned input collection \({\mathcal {D}}{}\) in more detail. Formally, it contains the best scoring derivations of every child \(x'\) of x to every substring of \(S'\) with up to \(d_T\) and \(d_S\) deletions from the tree and string, respectively. It is produced from the entries \({\mathcal {A}}[x', i', k_T, k_S]\) (where each entry gives one derivation) for all \(k_T\) and \(k_S\), and all \(i'\) between i and the end index of \(S'\), i.e. \(i\le i' \le E(x,i,0,d_S)\). For the efficiency of the Qmapping and Pmapping algorithms, the derivations in \({\mathcal {D}}{}\) are grouped by their root (\(\mu .v\)) and arranged in descending order with respect to their end point (\(\mu .e\)). This does not increase the time complexity of the algorithm, as this ordering is received by previous calls to the Qmapping and Pmapping algorithms.
In the final stage of the main algorithm, when the DP table is full, the score of a best derivation is the maximum of \(\{{\mathcal {A}}[root_T,i,k_T,k_S] : k_T\le d_T\), \(k_S \le d_S\), \(1\le i\le n(\mathsf {span}(root_T)k_T)+1\}\). We remark that by tracing back through \({\mathcal {A}}{}\) the onetoone mapping that yielded this derivation can be found.
Pnode mapping
Before describing the Pmapping algorithm, we set up some useful terminology.
Pnode mapping: terminology
We first define the notion of a partial derivation. In the Pmapping algorithm, the derivation of the input node x is built by considering subsets U of its children. With respect to such a subset U, a derivation \(\mu\) of x is built as if x had only the children in U, and is called a partial derivation. Formally, \(\mu\) is a partial derivation of a node x if \(\mu .v=x\) and there is a subset of children \(U'\subseteq \mathsf {children}(x)\) such that the two following conditions are true. First, for every \(u\in U'\) all the leaves in T(u) are neither mapped nor deleted under \(\mu\)  that is, there is no mapping pair \((\ell ,y) \in \mu .o\) such that \(\ell \in \mathsf {leaves}(u)\). Second, for every \(v \in \mathsf {children}(x)\setminus U'\) the leaves in T(v) are either mapped or deleted under \(\mu\). For every \(u \in U'\), we say that u is ignored under \(\mu\). Notice that any derivation is a partial derivation, where the set of ignored nodes (\(U'\) above) is empty. Since all derivations that are computed in a single call to the Pmapping algorithm have the same start point i, it can be omitted (for brevity) from the end point function: thus, we denote \(E_{I}(x,k_T,k_S)\doteq L(x,k_T,k_S)\). Also, for a set U of nodes, we define \(L(U,k_T,k_S) \doteq \sum _{x\in U}\mathsf {span}(x)+k_Sk_T\) and accordingly \(E_{I}(U,k_T,k_S)\doteq L(U,k_T,k_S)\).
We now define certain collections of derivations with common properties (such as having the same numbers of deletions and end point).
Definition 2
The collection of all the derivations of every node \(u \in U\) to suffixes of \(S'[1:E_{I}(U,k_T,k_S)]\) with exactly \(k_T\) deletions from the tree and exactly \(k_S\) deletions from the string is denoted by \({\mathcal {D}}{(U,k_T,k_S)}\).
Definition 3
The collection of all the best derivations from the nodes in U to suffixes of \(S'[1:E_{I}(U,k_T,k_S)]\) with up to \(k_T\) deletions from the tree and up to \(k_S\) deletions from the string is denoted by \({\mathcal {D}}_\le {(U,k_T,k_S)}\). Specifically, for every node \(u \in U\), \(k'_T\le k_T\) and \(k'_S\le k_S\), the set \({\mathcal {D}}_\le {(U,k_T,k_S)}\) holds only one highest scoring derivation of u to a suffix of \(S'[1:E_{I}(U,k_T,k_S)]\) with \(k'_T\) and \(k'_S\) deletions from the tree and string, respectively.
It is important to distinguish between these two definitions. First, the derivations in \({\mathcal {D}}{(U,k_T,k_S)}\) have exactly \(k_T\) and \(k_S\) deletions, while the derivations in \({\mathcal {D}}_\le {(U,k_T,k_S)}\) have up to \(k_T\) and \(k_S\) deletions. Second, in \({\mathcal {D}}{(U,k_T,k_S)}\) there can be several derivations that differ only in their score and in the onetoone mapping that yields them, while in \({\mathcal {D}}_\le {(U,k_T,k_S)}\), there is only one derivation for every node \(u\in U\) and deletion combination pair \((k'_T,k'_S)\). Note that the end points of all of the derivations are equal.
Definition 2 is used for describing the content of an entry of the DP table, where the focus is on the collection of all the derivations of x to \(S'\) with exactly \(k_T\) and \(k_S\) deletions, \({\mathcal {D}}{(\{x\},k_T,k_S)}\). For simplicity, the abbreviation \({\mathcal {D}}{(u,k_T,k_S)} = {\mathcal {D}}{(\{u\},k_T,k_S)}\) is used. In every step of the Pmapping algorithm, a different set of derivations of the children of x is examined, thus, Definition 3 is used for \(U \subseteq \mathsf {children}(x)\). In addition, the set of derivations \({\mathcal {D}}{}\) that is received as input to the algorithms can be described using Definition 3 as can be seen in Eq. 1 below. In this equation, the union is over all \(U\subseteq \mathsf {children}(x)\) because in this way the derivations of all the children of x with every possible end point are obtained (in contrast to having only \(U=\mathsf {children}(x)\), which results in the derivations of all the children of x with the end point \(E_{I}(\mathsf {children}(x),k_T,k_S)\)).
In the Pmapping algorithm for \(C \subseteq \mathsf {children}(x)\), the notation \(x^{(C )}\) is used to indicate that the node x is considered as if its only children are the nodes in C (the nodes in \(\mathsf {children}(x)\setminus C\) are ignored). Consequentially, the span of \(x^{(C )}\) is defined as \(\mathsf {span}(x^{(C )}) \doteq \sum _{c\in C}\mathsf {span}(c)\), and the set \({\mathcal {D}}(x^{(C )},k_T,k_S){}\) (in Definition 2 where \(U=\{x^{(C )}\}\)) now refers to a set of partial derivations. To use \(x^{(C )}\) to describe the base cases of the algorithm, let us define \(x^{(\emptyset )}\) (\(x^{(C )}\) for \(C=\emptyset\)) as a tree with no labeled leaves to map.
Pnode mapping: the algorithm
Recall that the input consists of an internal Pnode x, a string \(S'\), bounds on the number of deletions from the tree T and the string \(S'\), \(d_T\) and \(d_S\), respectively, and a set of derivations \({\mathcal {D}}{}\) (see Eq. 1). The output of the algorithm is \(\bigcup _{0\le k_T \le d_T}\bigcup _{0\le k_S \le d_S}{{\,\mathrm{arg\max }\,}}_{\mu \in {\mathcal {D}}(x,k_T,k_S){}}\mu .score\), which is the collection of the best scoring derivations of x to every possible prefix of \(S'\) having up to \(d_T\) and \(d_S\) deletions from the tree and string, respectively. Thus, there are \(O(d_T d_S)\) derivations in the output.
The algorithm (whose pseudocode is given in 2) constructs a 3dimensional DP table \({\mathcal {P}}{}\), which has an entry for every \(0\le k_T \le d_T\), \(0\le k_S \le d_S\) and subset \(C\subseteq \mathsf {children}(x)\). The purpose of an entry \({\mathcal {P}}[C,k_T,k_S]\) is to hold the best score of a partial derivation in \({\mathcal {D}}(x^{(C )},k_T,k_S){}\), i.e. a partial derivation rooted in \(x^{(C )}\) to a prefix of \(S'\) with exactly \(k_T\) deletions from the tree and \(k_S\) deletions from the string. The children of x that are not in C are ignored (as defined in "Pnode mapping: terminology" section) under the partial derivation stored by the DP table entry \({\mathcal {P}}[C,k_T,k_S]\), thus they are neither deleted nor counted in the number of deletions from the tree, \(k_T\). (They will be accounted for in the computation of other entries of \({\mathcal {P}}{}\).) Similarly to the main algorithm, some of the entries of \({\mathcal {P}}{}\) are invalid, and their value is defined as \(\infty\). Formally, an entry \({\mathcal {P}}[C,k_T,k_S]\) is invalid if one of the following is true: \(k_T > \sum _{c \in C}\mathsf {span}(c)\), \(k_S > L(x^{(C )},k_T,k_S)\), \(L(x^{(C )},k_T,k_S) > \mathsf {len}(S')\), or \(L(x^{(C )},k_T,k_S) < 0\). Every entry \({\mathcal {P}}[C,k_T,k_S]\) for which \(L(x^{(C )},k_T,k_S)=0\) and \(k_S=0\) or for which \(C=\emptyset\) and \(k_T=0\) is initialized with 0. The first set of entries captures the case in which the derived substring is the empty string and thus no character can be deleted from it, i.e. \(k_S\) must equal 0. The second set of entries captures the case in which all the children of x are ignored, thus the value of \(k_T\) must be 0.
After the initialization, the remaining entries of \({\mathcal {P}}{}\) are calculated using the recursion rule in Eq. 2 below. The order of computation is ascending with respect to the size of the subsets C of the children of x, and for a given \(C\subseteq \mathsf {children}(x)\), the order is ascending with respect to the number of deletions from both tree and string.
Intuitively, every entry \({\mathcal {P}}[C,k_T,k_S]\) defines some index \(e'\) of \(S'\) that is the end point of every partial derivation in \({\mathcal {D}}(x^{(C )},k_T,k_S){}\). Thus, \(S'[e']\) must be a part of any partial derivation \(\mu \in {\mathcal {D}}(x^{(C )},k_T,k_S){}\), so, either \(S'[e']\) is deleted under \(\mu\) or it is mapped under \(\mu\). The former option is captured by the first case of the recursion rule. If \(S'[e']\) is mapped under \(\mu\), then due to the hierarchical structure of T(x), it must be mapped under some derivation \(\mu '\) of one of the children of x that are in C. Thus, we receive the second case of the recursion rule.
We remark that the case of a node deletion is captured by the initialization and that adding the option of deleting a node in the recursion rule is therefore redundant.
Once the entire DP table is filled, a derivation of maximum score for every end point and deletion numbers combination can be found in \({\mathcal {P}}[\mathsf {children}(x), \cdot , \cdot ]\). Traversing the said subtable in a specific order guarantees the output derivations are ordered with respect to their end point without further calculations.
Complexity analysis of the PQTree Search algorithm
In this section we compare the time complexity of the main algorithm (in "The main algorithm" section) to the naïve solution for PQTree Search. The complexities of the two algorithms described before as well as the complexity of the Qmapping algorithm are given in the followings lemmas. Lemma 1 and Lemma 2 are proven in "Time and space complexity of the PQtree search algorithm" section, and Lemma 3 is proven in Sect. 1.3 of Additional file 1.
Lemma 1
The algorithm in "The main algorithm" section takes \(O(n \gamma {d_T}^2 {d_S}^2 (m_p 2^{\gamma } + m_q))\) time and \(O(d_T d_S (m n + 2^\gamma ))\) space, where \(\gamma\) is the maximum degree of a node in T.
Lemma 2
The Pmapping algorithm takes \(O({d_T}^2 {d_S}^2 \gamma 2^\gamma )\) time and \(O(d_T d_S 2^\gamma )\) space.
Lemma 3
The Qmapping algorithm takes \(O({d_T}^2 {d_S}^2 \gamma )\) time and \(O(d_T d_S \gamma )\) space.
From Lemma 1 it is proven that PQTree Search has an FPT solution with the parameter \(\gamma\) (Theorem 1).
Theorem 1
PQTree Search with parameter \(\gamma\) is FPT. Particularly, it has an FPT algorithm that runs in \(O^*(2^{\gamma })\) time^{Footnote 1}.
The naïve solution for PQTree Search is to solve sequence alignment with bounded gaps for every substring of S versus every string \(S'\in C(T)\), so the naïve solution takes \(O(2^{m_q}{(\gamma !)}^{m_p} n m(d_T+d_S)d_T d_S)\) time. (A full description of the naïve algorithm and its complexity is given in Sect. 4 of Additional file 1.) Therefore, we conclude that the time complexity of our algorithm is substantially better, as exemplified by considering two complementary cases. One, when there are only Pnodes in T (i.e. \(m_p=O(m)\)), the naïve algorithm is superexponential in \(\gamma\), and even worse, exponential in m, while ours is exponential only in \(\gamma\), and hence polynomial for any \(\gamma\) that is constant (or even logarithmic in the input size). Second, when there are only Qnodes in T (i.e. \(m_q=O(m)\)), the naïve algorithm is exponential while ours is polynomial.
Methods and datasets
Dataset and gene cluster generation. 1487 fully sequenced prokaryotic strains with COG ID annotations (see Additional file 2) were downloaded from GenBank (NCBI; ver 10/2012). Among these strains, 471 genomes included a total of 933 plasmids.
The gene clusters were generated using the tool CSBFinderS [44]. CSBFinderS was applied to all the genomes in the dataset after removing their plasmids, using parameters \(q=1\) (a colinear gene cluster is required to appear in at least one genome) and \(k=0\) (no insertions are allowed in a colinear gene cluster), resulting in 595,708 colinear gene clusters. Next, ignoring strand and gene order information, colinear gene clusters that contain the exact same COGs were united to form the generalized set of gene clusters. The resulting gene clusters were then filtered to 26,270 gene clusters that appear in more than 30 genomes.
Generation of PQTrees. The generation of PQtrees was performed using a program [45] that implements the algorithm described in [28] for the construction of a PQtree from a list of strings comprised from the same set of characters. In the case where a character appeared more than once in a training string, the PQtree with a minimum sized consistent set was chosen. The generated PQtrees varied in size and complexity. The length of their frontier ranged between 4 and 31, and the size of their consistent set ranged between 4 and 362, 880.
Implementation and performance. PQFinder is implemented in Java 1.8. The runs were performed on an Intel Xeon X5680 machine with 192 GB RAM. The time it took to run all plasmid genomes against one PQtree ranged between 5.85 seconds (for a PQtree with a consistent set of size 4) and 181.5 seconds (for a PQtree with a consistent set of size 362, 880). In total it took an hour and 47 minutes to run each one of the 779 PQtrees against each one of the 933 plasmids.
Substitution scoring function. The substitution scoring function reflects the distance between each pair of COGs, that is computed based on sentences describing the functional annotation of the COGs (e.g., “ABCtype sugar transport system, ATPase component”). The “Bag of Words model” was employed, where the functional description of each COG is represented by a sparse vector that is normalized to have a unit Euclidean norm. First, each COG description was tokenized and the occurrences of tokens in each description was counted and normalized using tfidf term weighting. Then, the cosine similarity between each two vectors was computed, resulting in similarity scores ranging between 0 and 1. The sentences describing COGs are short, therefore each word largely influences the score, even after the tfidf term weighting. Therefore, words that do not describe protein functions that were found in the top 30 most common words in the description of all COGs were used as stopwords. Two COGs with the same COG IDs were set to have a score of 1.1, and the substitution score between a gene with no COG annotation to any other COG was set to be − 0.1. Two COGs with a zero score were penalized to have a score of 0.2 and the deletion of a COG from the query PQtree or the target string was set to have a cost of zero.
Enrichment analysis. For each of the four variants in Fig. 5C, a hypergeometric test was performed to measure the enrichment of the corresponding variant in one of the classes in which it appears. A total of 10 pvalues were computed and adjusted using the Bonferroni correction; two pvalues were found significant (\(<0.05\)), reported in "Results" section.
Specificity score. We define a specificity score for a PQtree T of a gene cluster named Sscore, where a more specific tree yields a higher Sscore. Let \({\tilde{T}}\) be the least specific PQtree that could have been generated for the genes of the gene cluster based on which T was constructed. Namely, a PQtree that allows all permutations of said genes, has height 1 and is rooted in a Pnode whose children (being the leaves of \({\tilde{T}}\)) are the leaves of T. The Sscore of T is defined as \(\frac{C({\tilde{T}})}{C(T)}\). For a gene cluster of permutations (i.e. there are no duplications), the computation of C(T) is as described in Eq. 3, where the set of Pnodes in T is denoted by T.p.
For a gene cluster that has duplications, the set C(T) is generated to learn its size. Let \({\mathsf {a}}(\ell ,T)\) denote the number of appearances of the label \(\ell\) in the leaves of T and let \(\mathsf {labels}(T)\) denote the set of all the distinct labels of the leaves of T. So, the formula for \(C({\tilde{T}})\) is as in Eq. 4. Clearly, for T with no duplications \(C({\tilde{T}}) = F(T)!\).
Results
Chromosomal gene orders rearranged in plasmids
The labeling of each internal node of a PQtree as P or Q, is learned during the construction of the tree, based on some interrogation of the gene orders from which the PQtree is trained [28]. As a result, the set of strings that can be derived from a PQtree T, consists of two parts: (1) all the strings representing the known gene orders from which T was constructed, and (2) additional strings, denoted treeguided rearrangements, that do not appear in the set of gene orders constructing T, but can be obtained via rearrangement operations that are constrained by T. Thus, the treeguided rearrangements conserve the internal topology properties of the gene cluster, as learned from the corresponding gene orders during the construction of T, such that colinear dependencies among genes and between suboperons are preserved in the inferred gene orders.
In this section, we used the PQtrees constructed from chromosomal gene clusters, to examine whether treeguided rearrangements can be found in plasmids. The objective was to discover gene orders in plasmids that abide by a PQtree representing a chromosomal gene cluster, and differ from all the gene orders participating in the construction of the PQtree. PQtrees that are constructed from gene clusters that have only one gene order or gene clusters with less than four COGs cannot generate gene orders that differ from the ones participating in their construction. Therefore, only 779 out of 26,270 chromosomal gene clusters were used for the construction of query PQtrees (the generation of the chromosomal gene clusters is detailed in Sect. Methods and datasets). Using our tool PQFinder that implements the algorithm proposed for solving the PQTree Search problem, the query PQtrees were run against all plasmid genomes. This benchmark was run conservatively without allowing substitutions or deletions from the PQtree or from the target string. 380 of the query gene clusters were found in at least one plasmid. The instances of these gene clusters in plasmids are provided in the Supplementary Materials in [2] as a session file that can be viewed using the tool CSBFinderS [44].
Treeguided rearrangements were found among instances of 29 chromosomal gene clusters. The PQtrees corresponding to these gene clusters were sorted by a decreasing Sscore, where higher scores are given to a more specific tree (details in "Methods and datasets" section). In this setting, the higher the Sscore, the smaller the number of possible gene orders that can be derived from the respective PQtree. Interestingly, 21 out of these 29 gene clusters code for transporters, namely 20 importers (ABCtype transport systems) and one exporter (efflux pump). The 10 top ranking results are presented in Table 1.
We selected the third topranking PQtree in Table 1 for further analysis. This PQtree was constructed from seven gene orders of a gene cluster that encodes a heavy metal efflux pump. This gene cluster was found in the chromosomes of 79 genomes (represented by the seven distinct gene orders mentioned above) and in the plasmids of seven genomes. The instance of the chromosomal gene cluster identified as a treeguided rearrangement in plasmids was found in the strain Cupriavidus metallidurans CH34, isolated from an environment polluted with high concentrations of several heavy metals. This strain contains two large plasmids that confer resistance to a large number of heavy metals such as zinc, cadmium, copper, cobalt, lead, mercury, nickel and chromium. We hypothesize that the rearrangement event could have been caused by a heavy metal stress [46]. In the following section we will focus on this PQtree to further study its different variants in plasmids.
Finding approximate instances of an RND efflux pump
The heavy metal efflux pump examined in the previous section (corresponding to the third topranking PQtree in Table 1), was used as a PQFinder query and rerun against all the plasmids in our dataset in order to discover approximate instances of this gene cluster, possibly encoding remotely related variations of the efflux pump it encodes. This time, in order to increase sensitivity, a semantic substitution scoring function (described in Sect. Methods and datasets) was used, and the arguments were set to \(d_T=1\) (up to one deletion from the tree, representing missing genes) and \(d_S=3\) (up to three deletions from the plasmid, representing intruding genes). An instance of a gene cluster is accepted if it was derived from the corresponding PQtree with a score that is higher than 0.75 of the highest possible score attainable by the query. This search resulted in the detection of approximate instances of the query gene cluster in the plasmids of 24 genomes; These results are displayed in Figs. 5, 6, and Additional file 1: Figure S1.
Heavy metal efflux pumps are involved in the resistance of bacteria to a wide range of toxic metal ions [47] and they belong to the resistancenodulationcell division (RND) family. In Gramnegative bacteria, RND pumps exist in a tripartite form, comprised from an outermembrane protein (OMP), an inner membrane protein (IMP), and a periplasmic membrane fusion protein (MFP) that connects the other two proteins. In some cases, the genes of the RND pump are flanked with two regulatory genes that encode the factors of a twocomponent regulatory system comprising a sensor/histidine kinase (HK) and response regulator (RR) (Fig. 5B). This regulatory system responds to the presence of a substrate, and consequently enhances the expression of the efflux pump genes.
The PQtree of this gene cluster (Fig. 5A) shows that the COGs encoding the IMP and MFP proteins always appear as an adjacent pair, the OMP COG is always adjacent to this IMPMFP pair, and the HK and RR COGs appear as a pair downstream or upstream to the other COGs. COG3696, which encodes the IMP protein, is annotated as a heavy metal efflux pump protein, while the other COGs are common to all RND efflux pumps. Therefore, it is very likely that the respective gene cluster corresponds to a heavy metal RND pump. The absence of an additional periplasmic protein likely indicates that this gene cluster encodes a Czclike efflux pump that exports divalent metals such as the cobalt, zinc and cadmium exporter in Cupriavidus metallidurans [47] (Fig. 5C(1)).
PQFinder discovered instances of this gene cluster in the plasmids of 12 genomes (Fig. 5C(1) and D), and it is significantly enriched in the \(\beta\)proteobacteria class (hypergeometric pvalue= \(1.09\times 10^{5}\), Bonferroni corrected pvalue = \(1.09\times 10^{4}\)). In addition, three other variants of RND pumps were found as instances of the query gene cluster (Fig. 5C(2–4)). The plasmids of three genomes contained instances that were missing the COG corresponding to the OMP gene CzcC (Fig. 5C(2)). This could be caused by a low quality sequencing or assembly of these plasmids. An alternative possible explanation is that a Czclike efflux pump can still be functional without CzcC; a previous study showed that the deletion of CzcC resulted in the loss of cadmium and cobalt resistance, but most of the zinc resistance was retained [47].
Some instances identified by the query, found in the plasmids of six genomes, seem to encode a different heavy metal efflux pump (Figs. 5C(3), 6). This variant includes all COGs from the query, in addition to an intruding COG that encodes a periplasmic protein (CusF). This protein is a predicted copper usher that facilitates access of periplasmic copper towards the heavy metal efflux pump. Indeed, the genomic region of Cuslike efflux pumps that export monovalent metals, such as the silver and copper exporter in Escherichia coli, include this periplasmic protein, in contrast to the Czclike efflux pump [47]. This variant was found in the plasmids of six bacterial genomes belonging to the class \(\gamma\)proteobacteria (Fig. 5D). This gene cluster is significantly enriched in the \(\gamma\)proteobacteria class (hypergeometric pvalue= \(2.13\times 10^{4}\), Bonferroni corrected pvalue = \(2.13\times 10^{3}\)). Surprisingly, all of these strains, except for one, are annotated as human or animal pathogens. Interestingly, previous studies suggest that the host immune system exploits excess copper to poison invading pathogens [48], which can explain why these pathogens evolved copper efflux pumps.
Another variant of the pump, appearing in five genomes (Fig. 5C(4) and D), resulted from a substitution of the query IMP gene (COG3696) by a different IMP gene (COG0841) belonging to the multidrug efflux pump AcrABTolC. The AcrABTolC system, mainly studied in Escherichia coli, transports a diverse array of compounds with little chemical similarity [49]. AcrABTolC is an example of an intrinsic nonspecific efflux pump, which is widespread in the chromosomes of Gramnegative bacteria, and likely evolved as a general response to environmental toxins [50]. In this case, the query gene cluster and the identified variant share all COGs, except for the COGs encoding the IMP genes. The differing COGs are responsible for substrate recognition, which naturally differs between the two pumps, as one pump exports heavy metal while the other exports multiple drugs. When considering the functional annotation of these two COGs, we see that the query metal efflux pump COG encoding the IMP gene is annotated as “Cu/Ag efflux pump CusA”, while in the multidrug efflux pump the COG encoding the IMP gene is annotated as “Multidrug efflux pump subunit AcrB”. Thus, in spite of the difference in substrate specificity, the semantic similarity measure employed by PQFinder was able to reflect their functional similarity and allowed the substitution between them, while conferring to the structure of the PQtree.
PQtree search is NPhard
In this section we prove Theorem 2 by describing a reduction from the Job Interval Selection problem (JISP) to PQTree Search. This reduction also proves that PQTree Alignment is NPhard (Theorem 3).
Theorem 2
PQTree Search is NPhard.
Theorem 3
PQTree Alignment is NPhard.
Since its initial definition by Nakajima and Hakimi [51], JISP has seen several equivalent definitions [43, 52,53,54]. We use the following formulation for JISP k based on colors. Given \(\gamma\) ktuples of intervals on the real line, where the intervals of every ktuple have a different color i (\(1\le i \le \gamma\)), select exactly one interval of each color (ktuple) such that no two intervals intersect. The notation \(I_j^i\) is used to denote the interval that starts at \(s_{ij}\), ends at \(f_{ij}\) (i.e. the interval \([s_{ij},f_{ij}]\)) and has the color i (i.e. it is a part of the \(i^{\text {th}}\) ktuple).
JISP3 was shown to be NPcomplete by Keil [52]. Crama et al. [54] showed that JISP3 is NPcomplete even if all intervals are of length 2. We use these results to show that PQTree Search is NPhard.
The reduction. Let J be an instance of JISP3 where all intervals have a length of two. It is easy to see that shifting all intervals by some constant does not change the problem. Hence, assume that the leftmost starting interval starts at 1. Let L be the rightmost ending point of an interval, so the focus can be only on the segment [1, L] of the real line. Now, an instance of PQTree Search \((T,S,h,d_T,d_S)\) is constructed (an illustrated example is given in Fig. 7 below):

The PQtree T : The root node, \(root_T\), is a Pnode with \(3L23\gamma\) children: \(x_1, \cdots {,} x_{\gamma },y_1, \cdots {,} y_{3L24\gamma }\). The children of \(root_T\) are defined as follows: for every color \(1 \le i \le \gamma\), create a Qnode \(x_i\) with four children \(x_i^s, \ x_i^a, \ x_i^b, \ x_i^f\); for every index \(1\le i\le 3L23\gamma\), create a leaf \(y_i\).

The string S : Define \(S=\sigma _1 \sigma _a \sigma _b \sigma _2 \sigma _a \sigma _b \cdots \sigma _a \sigma _b \sigma _L\).

The substitution function h : For every interval of the color i, \(I_j^i=[s_{ij},f_{ij}]\), the function h returns True for the following pairs: \((x_i^s, \sigma _{s_{ij}})\), \((x_i^f, \sigma _{f_{ij}})\), \((x_i^a, \sigma _a)\) and \((x_i^b, \sigma _b)\). In addition, every leaf \(y_r\) can be substituted by every character of S, namely for every index \(1\le r\le 3L23\gamma\) and for every \(s \in \{a,b,1,\cdots ,L\}\) the function h returns True for the pair \((y_r, \sigma _s)\). For every other pair h returns False. For the optimization version of the problem, define a substitution scoring function \(h'\), such that \(h'(u,v) = 1\) if \(h(u,v) = True\) and \(h'(u,v) = \infty\) if \(h(u,v) = False\).

Number of deletions: Define \(d_T=0\) and \(d_S=0\), i.e. deletions are forbidden from both tree and string.
An example of the reduction is shown in Fig. 7. The JISP3 instance J is a collection of two 3tuples (one blue and one red) where each interval is of length 2 (Fig. 7A). Running the reduction algorithm on J yields the PQTree Search instance in Fig. 7B. The pairs that can be substituted (i.e. the pairs for which h returns True) are given by the lines connecting the leaves of the PQtree and the characters of the string S. The nodes and substitutable pairs created due to the blue and red intervals in the JISP3 instance are marked in blue and red, respectively. The substitutable pairs containing a y node are marked in gray. Note that the colors given in Fig. 7B are not a part of the PQTree Search instance, and are given for convenience.
Notice that in the reduction, the number of deletions is zero and the height of the tree is 2. Thus, these parameters cannot be used to design an FPT algorithm. In addition, notice that though the output of the reduction is referred as an instance of PQTree Search, it is also an instance of PQTree Alignment. Ahead the reduction is proven for PQTree Search, but the proof for PQTree Alignment is the same.
Proof
Correctness Let J be an instance of JISP3, and let \((T,S,h,d_T,d_S)\) be the output of the reduction on this instance. We prove that there exists a collection of intervals that is a solution for J if and only if there exists a onetoone mapping that is a solution to \((T,S,h,d_T,d_S)\).
One direction. Suppose that there exists a solution to the output instance of PQTree Search of the reduction, \((T,S,h,d_T,d_S)\). This solution is a onetoone mapping \({\mathcal {M}}{}\): for every \(1\le i\le \gamma\), a set of pairs of the form \((x_i^j,\sigma _k(\ell ))\) for \(j\in \{s,f,a,b\}\), and for every \(1\le r\le 3L23\gamma\), pairs of the form \((y_r,\sigma _k(\ell ))\) where \(k\in \{1,\cdots ,L,a,b\}\) and \(1\le \ell \le 3L2\). By the definition of PQTree Search, each \(x_i^j\), \(y_r\) and \(\sigma _k(\ell )\) appear in exactly one pair. Considering the mappings of the children of a node \(x_i\), they must be the following: \((x_i^s,\sigma _k(\ell ))\), \((x_i^a,\sigma _a(\ell +1))\), \((x_i^b,\sigma _b(\ell +2))\) and \((x_i^f,\sigma _{k+1}(\ell +3))\). To see this, observe that a node \(x_i^a\) must be mapped to \(\sigma _a\), because it is the only character by which it can be substituted under h. In the same way, a node \(x_i^b\) must be mapped to \(\sigma _b\). Because \(d_T=0\), \(d_S=0\) and due to the properties of a Qnode, once \(x_i^s\) is mapped to the character in index \(\ell\) (i.e. \((x_i^s,\sigma (\ell ))\in {\mathcal {M}}{}\)), \(x_i^a\) must be mapped to the character in index \(\ell +1\) or in index \(\ell 1\) (i.e. the adjacent character to the one to which \(x_i^s\) is mapped), then \(x_i^b\) must be mapped to the character in index \(\ell +2\) or \(\ell 2\), respectively, and \(x_i^f\) to \(\ell + 3\) or \(\ell 3\), respectively. Since \(\sigma _a\) is always the character preceding \(\sigma _b\) in S, \(x_i^b\) must be mapped to an index larger by one than the index mapped to \(x_i^a\). Hence, the children of the Qnode \(x_i\) are mapped from left to right.
Now, let us derive a solution for the original JISP3 instance from the solution to PQTree Search. For every 3tuple of color \(1\le i\le \gamma\), where \((x_i^s,\sigma _k(\ell ))\in\) \({\mathcal {M}}{}\), choose the interval \(I_k^i=[k,k+1]\) from the 3tuple of color i. For example, if a part of the solution for the PQTree Search instance in Fig. 7B is \(\{(x_1^s,\sigma _1(1)),\) \((x_1^a,\sigma _a(2)),\) \((x_1^b,\sigma _b(3)),\) \((x_1^f,\sigma _2(4))\} \subset\) \({\mathcal {M}}{}\), then \(I_1^1\) is the interval chosen for the first color (blue) in the derived solution for the JISP3 instance in Fig. 7.A. Note that \(I_k\) is indeed one of the intervals of color i, due to the definition of h, \(h(x_i^s,\sigma _k) = True\) and \(h(x_i^f,\sigma _{k+1})=True\) if and only if there is an interval of color i starting at k and ending at \(k+1\). Thanks to \({\mathcal {M}}{}\) being a onetoone mapping, the intervals do not intersect, and for every color there is only one interval chosen.
Second Direction. Let us prove that if there is a solution for the original instance of JISP3 J, then there is a solution for \((T,S,h,d_T,d_S)\). Let \({\mathcal {I}}=\{I_{j_1}^1,...,I_{j_\gamma }^\gamma \}\) be a solution of J such that \(I_{j_i}^i=[s_{ij_i},f_{ij_i}]\) is the interval chosen for the 3tuple of color i. First, the solution for the PQTree Search instance \((T,S,h,d_T,d_S)\) is constructed. For every \(1\le i\le \gamma\), insert the following pairs into \({\mathcal {M}}{}\): \((x_i^s,\sigma _{s_{ij_i}}(3s_{ij_i}2))\), \((x_i^a,\sigma _{a}(3s_{ij_i}1))\), \((x_i^b,\sigma _{b}(3s_{ij_i}))\), and \((x_i^f,\sigma _{f_{ij_i}}(3f_{ij_i}2))\). For example, if \(I_2^2\) is the interval chosen from the second (red) 3tuple in the solution of the JISP3 instance in Fig. 7.A, then the solution for the PQTree Search instance in Fig. 7B includes the pairs \(\{(x_2^s,\sigma _2(4)), (x_2^a,\sigma _a(5)), (x_2^b,\sigma _b(6)), (x_2^f,\sigma _3(7))\}\). Observe that only one pair was inserted for every leaf of T, and since no two intervals intersect, every index of S appears in only one pair in \({\mathcal {M}}{}\). Hence, a onetoone mapping between \(4\gamma\) leaves of T and \(4\gamma\) indices of S was defined, and \(3L4\gamma 2\) additional pairs need to be inserted to \({\mathcal {M}}{}\) in order to construct a solution for the PQTree Search instance. According to h, every node \(y_r\) (\(1\le r\le 3L23\gamma\)) can be mapped to every character \(\sigma _k\), so arbitrarily insert the pairs \((y_r,\sigma _{k_r}(\ell _r))\) to \({\mathcal {M}}{}\), such that no index or node appear in more than one pair. (It can be done because there are \(3L4\gamma 2\) y nodes and after mapping the 4 children of every one of the \(\gamma\) \(x_i\) nodes, \(3L4\gamma 2\) characters of S are left without a mapping). Thus, a onetoone mapping \({\mathcal {M}}{}\) between all the leaves of T and all the indices of S (i.e. no deletions from S and T) was defined, and it is left to prove that S can be derived from T under \({\mathcal {M}}{}\).
The children of a Qnode \(x_i\) from left to right are: \(x_i^s,x_i^a,x_i^b,x_i^f\), and so, because \(d_T=0\) and \(d_S=0\) (no deletions from both tree and string), they have to be mapped to consecutive indices of S; this is indeed the case according to our definition of \({\mathcal {M}}{}\). The mapping of every \(y_r\) is obviously also legal. Finally, \(root_T\) is a Pnode, so its children can be arranged in any order, and they are. This completes the proof of correctness of the reduction. \(\square\)
This concludes the proof of Theorem 2.
Correctness of our algorithms
In this section we prove the correctness of the PQTree Search algorithm ("Correctness of the main algorithm" section) and the Pmapping algorithm ("Correctness of the Pnode mapping algorithm" section). First, some definitions that are used in the proofs are given.
Addition and removal of a derivation. Given a partial derivation \(\mu\), which derives an internal node x, let us define the removal and addition of another derivation \(\eta\): \(\mathsf {remove}(\mu ,\eta )\) and \(\mathsf {add}(\mu ,\eta )\). To this end, we say that \(\eta\) is the derivation of \(x'\) under \(\mu\) if \(x'=\eta .v\in \mathsf {children}(\mu .v)\) and \(\eta .o \subseteq \mu .o\), i.e. the onetoone mapping that yields \(\eta\) is a subset of the onetoone mapping that yields \(\mu\).
Operation 1
The operation \(\mathsf {remove}(\mu ,\eta )\) is defined only if \(\eta\) is the derivation of \(\eta .v\) under \(\mu\) and if either \(\eta .e = \mu .e\) or \(\eta .s=\mu .s\) is true. The operation returns a new partial derivation \(\mu '\) of \(\mu .v\) that ignores the subtree \(T(\eta .v)\). If \(\eta .e = \mu .e\), then \(\mu '\) derives the string \(S[\mu .s:\eta .s1]\), and if \(\eta .s=\mu .s\), then \(\mu '\) derives the string \(S[\eta .e+1:\mu .e]\). In any case the number of deletions from the tree is \(\mu '.del_T= \mu .del_T \eta .del_T\) and from the string it is \(\mu '.del_S= \mu .del_S \eta .del_S\). Furthermore, \(\mu .o \setminus \eta .o\) is the onetoone mapping that yields \(\mu '\).
Operation 2
The operation \(\mathsf {add}(\mu ,\eta )\) is defined only if either \(\eta .s = \mu .e+1\) or \(\eta .e=\mu .s1\) is true and if \(\eta .v\in \mathsf {children}(x)\) and it is ignored under \(\mu\). The operation returns a new partial derivation \(\mu '\) of \(\mu .v\). The derivation of \(\eta .v\) under \(\mu '\) is \(\eta\), and the mapping or deletion of every other leaf or character in the string is defined the same as it was in \(\mu\). Consequentially, if \(\eta .s = \mu .e+1\), then \(\mu '\) derives the string \(S[\mu .s:\eta .e]\), and if \(\eta .e=\mu .s1\), then \(\mu '\) derives the string \(S[\eta .s:\mu .e]\). In any case, \(\mu '.del_T= \mu .del_T+ \eta .del_T\), \(\mu '.del_S= \mu .del_S+ \eta .del_S\) and the onetoone mapping that yields \(\mu '\) is \(\mu .o \cup \eta .o\).
Addition and removal of a deleted character. Given a partial derivation \(\mu\), which derives a string S, and an index i of S let us define the removal and addition of a deleted character: \(\mathsf {removeDel}(\mu , i)\) and \(\mathsf {addDel}(\mu ,i)\).
Operation 3
The operation \(\mathsf {removeDel}(\mu , i)\) is defined only if \(i=\mu .e\) or \(i=\mu .s\), and if S[i] is deleted under \(\mu\). The operation returns a partial derivation \(\mu '\) with \(\mu .del_S1\) deletions from the string. If \(i=\mu .e\), then \(\mu '\) derives the string \(S[\mu .s,\mu .e1]\), and if \(i=\mu .s\), then \(\mu '\) derives the string \(S[\mu .s+1,\mu .e]\). The onetoone mapping that yields \(\mu '\) is \(\mu .o \setminus \{(\varepsilon , S[i](i))\}\).
Operation 4
The operation \(\mathsf {addDel}(\mu ,i)\) is defined only if \(i=\mu .e+1\) or \(i=\mu .s1\). The operation returns a partial derivation \(\mu '\) with \(\mu .del_S+1\) deletions from the string. If \(i=\mu .e+1\), then \(\mu '\) derives the string \(S[\mu .s,\mu .e+1]\), and if \(i=\mu .s1\), then \(\mu '\) derives the string \(S[\mu .s1,\mu .e]\). The onetoone mapping that yields \(\mu '\) is \(\mu .o \cup \{(\varepsilon , S[i](i))\}\).
Correctness of the main algorithm
In this section we prove the correctness of the PQTree Search algorithm presented in "The main algorithm" section by proving Lemma 4. In this proof, the correctness of the Qmapping algorithm (described in Sect. 1 of Additional file 1) and of the Pmapping algorithm (described in "Pnode mapping: the algorithm" section) is assumed. In addition, the set of all derivations to \(S[i,E(x,i,k_T,k_S)]\) rooted in x that have exactly \(k_T\) deletions from the tree and exactly \(k_S\) deletions from the string is denoted by \({\mathcal {D}}_M(x,i,k_T,k_S){}\). Similarly to the notation in Definition 2, the \({\mathcal {D}}_M(x,i,k_T,k_S){}\) notation is used to represent the set of derivations whose score might be in \({\mathcal {A}}[x,i,k_T,k_S]\).
Lemma 4
At the end of the algorithm every entry \({\mathcal {A}}[x,i,k_T,k_S]\) of the DP table \({\mathcal {A}}{}\) holds the highest score of a derivation of \(S[i,E(x,i,k_T,k_S)]\) rooted in x that has \(k_S\) deletions from the string and \(k_T\) deletions from the tree, i.e. \({\mathcal {A}}[x,i,k_T,k_S] = \max _{\mu \in {\mathcal {D}}_M(x,i,k_T,k_S){}}\mu .score\)
Proof
We prove Lemma 4 by induction on the entries of \({\mathcal {A}}{}\) in the order described in the algorithm. Namely, for two entries \({\mathcal {A}}[x_1,i_1, k_{T_1},k_{S_1}]\) and \({\mathcal {A}}[x_2,i_2,k_{T_2},k_{S_2}]\), \({\mathcal {A}}[x_1,i_1, k_{T_1},k_{S_1}] < {\mathcal {A}}[x_2,i_2,k_{T_2},k_{S_2}]\) if and only if \(x_1\) appears before \(x_2\) in the postorder of T or both \(x_1=x_2\) and \(i_1<i_2\). If \(x_1=x_2\) and \(i_1=i_2\), then the order between the entries is chosen arbitrarily.
Base Case. The base case of the algorithm is the initialization of the DP table, where the entries \({\mathcal {A}}[x,i,k_T,k_S]\) for \(x\in \mathsf {leaves}(root)\) and \(k_T \in \{0,1\}\) are computed. When \(k_T=0\), there are no deletions from the tree. So, x must be mapped to some character \(S[\ell ]\) (\(i\le \ell \le E(x,i,0,k_S)\)). In this version of the algorithm the deletion of a character does not change the score of the derivation, so the maximal score of a derivation in \({\mathcal {D}}_M(x,i,0,k_S)\) is the maximum score of a mapping of x to some character \(S[\ell ]\) (\(i\le \ell \le E(x,i,0,k_S)\)), which is the initialization value of the entry \({\mathcal {A}}[x,i,0,k_S]\). When \(k_T=1\), there is one deletion from the tree. The derived subtree T(x) has one leaf, x, and so it must be the deleted leaf. All characters in the derived string, \(S[i:E(x,i,1,k_S)]\), must also be deleted. Deletions do not add to the score of the derivation, and so all the derivations in \({\mathcal {D}}_M(x,i,1,k_S)\) have a score of 0, which is the initialization value of \({\mathcal {A}}[x,i,1,k_S]\).
Induction Assumption. Assume that every entry \({\mathcal {A}}[x',i',k'_T,k'_S]\) such that \({\mathcal {A}}[x',i',k'_T,k'_S]\) \(< {\mathcal {A}}[x,i,k_T,k_S]\) holds the best score of a derivation from the set \({\mathcal {D}}_M(x',i',k'_T,k'_S)\). Namely, \({\mathcal {A}}[x',i',k'_T,k'_S] = \max _{\mu \in {\mathcal {D}}_M(x',i',k'_T,k'_S)}{\mu .score} = OPT(x',i',k'_T,k'_S)\).
Induction Step. For every internal node x and possible start index i, the algorithm fills the DP table entry \({\mathcal {A}}[x,i,k_T,k_S]\) according to the values returned from the Qmapping and Pmapping algorithms based on the type of x. The correctness of the Pmapping algorithm is proven in Section Correctness of the Pnode mapping algorithm, and the correctness of the Qmapping algorithm is proven in Section 1.2 of Additional file 1. Hence, it is only necessary to prove that the input the algorithms expect to receive is sent correctly from the main algorithm.
Both the Qmapping and Pmapping algorithms expect to receive the internal node which should be the root of all the output derivations, a substring \(S'\) of S, the deletion bounds \(d_T\) and \(d_S\), and a collection of the best scoring derivations of every child of x to every substring of \(S'\) with up to \(d_T\) and \(d_S\) deletions from the tree and string, respectively. By definition an entry in \({\mathcal {A}}[x,i,\cdot ,\cdot ]\) concerns the derivations of x with a start point i. The end point of the longest derivation of those derivations is \(E(x,i,0,d_S)\). Hence, the internal node sent to the Qmapping or Pmapping algorithm is x and the substring \(S'\) equals \(S[i,E(x,i,0,d_S)]\). The deletion bounds \(d_T\) and \(d_S\) are given as input to the main algorithm. Lastly, the best derivations of the children of x are stored in \({\mathcal {A}}{}\). Because a node \(x' \in \mathsf {children}(x)\) appears before x in the postorder of T, then for every \(i', k'_T, k'_S\), it holds that \({\mathcal {A}}[x',i',k'_T,k'_S] < {\mathcal {A}}[x,i,k_T,k_S]\), and from the induction assumption \({\mathcal {A}}[x',i',k'_T,k'_S]= OPT(x',i',k'_T,k'_S)\). So, indeed the expected input to the Qmapping and Pmapping algorithms is correct. This completes the proof. \(\square\)
Correctness of the Pnode mapping algorithm
In this section we prove the correctness of the Pmapping algorithm presented in "Pnode mapping: the algorithm" section by proving Lemma 5.
Lemma 5
At the end of the algorithm every entry of the DP table, \({\mathcal {P}}[C,k_T,k_S]\), holds the best score for a partial derivation of \(x^{(C )}\) to a prefix of \(S'\) with \(k_T\) deletions from the tree and \(k_S\) deletions from the string, i.e. \({\mathcal {P}}[C,k_T,k_S] = \max _{\mu \in {\mathcal {D}}(x^{(C )},k_T,k_S){}}\mu .score\)
Proof
We prove Lemma 5 by induction on the entries of \({\mathcal {P}}{}\) in the order described in the algorithm. Namely, for two entries \({\mathcal {P}}[C_1,k_{T_1},k_{S_1}]\) and \({\mathcal {P}}[C_2,k_{T_2},k_{S_2}]\), \({\mathcal {P}}[C_1,k_{T_1},k_{S_1}] < {\mathcal {P}}[C_2,k_{T_2},k_{S_2}]\) if and only if

\(C_1 < C_2\), or

\(C_1 = C_2\) and \(k_{S_1} < k_{S_2}\), or

\(C_1 = C_2\) and \(k_{S_1} = k_{S_2}\) and \(k_{T_1} < k_{T_2}\)
If \(C_1 \ne C_2\), \(C_1 = C_2\), \(k_{S_1} = k_{S_2}\) and \(k_{T_1} = k_{T_2}\) are all satisfied, then the order between the entries is chosen arbitrarily.
Base Cases. There are two types of base cases, as described in the initialization of the DP table.

1
\(L(x^{(C )},k_T,k_S)=0\) and \(k_S=0\): Let \(\mu\) be a derivation of \(x^{(C )}\) with \(k_T\) and \(k_S\) deletions. By definition, \(\mu\) derives an empty string, i.e. there are no characters to map to the leaves of the subtrees rooted in the nodes in C. Hence, every child of x that is considered (the nodes in C) must be deleted under \(\mu\). All the nodes in C can be deleted if the sum of their spans is equal to the allowed number of deletions in \(\mu\) (that is, \(k_T\)). From the definition of \(L(x^{(C )},k_T,k_S)=0\) and the fact that \(k_S=0\), we obtain that indeed \(k_T = \sum _{c\in C}\mathsf {span}(c)\). Every child node of x that is kept under \(\mu\) adds to the score of the derivation of x, but there are none in this case. In addition, every deletion from the subtree T(x) adds nothing to the score (in the penaltyfree version of the algorithm). Hence, the score of \(\mu\) must equal 0.

2
\(C=\emptyset\) and \(k_T=0\): In this case all of the children of x are ignored, so there are no leaves to map. Hence, every character of the derived string should be deleted. Note that the derived string is \(S'[1:E_{I}(x^{(C )},k_T,k_S)]\), and its length is \(L(x^{(C )},k_T,k_S) = \sum _{c\in C}\mathsf {span}(c)  k_T + k_S = \sum _{c\in \emptyset }\mathsf {span}(c)  0 + k_S = k_S\). So, the number of deletions from the string in this case is exactly the number needed to delete all the characters in the derived string.
Induction Assumption. Assume that every table entry \({\mathcal {P}}[C',k'_T,k'_S]\) such that \({\mathcal {P}}[C',k'_T,k'_S]\) \(< {\mathcal {P}}[C,k_T,k_S]\) holds the best score of a derivation in \({\mathcal {D}}(C',k'_T,k'_S)\). Namely, \({\mathcal {P}}[C',k'_T,k'_S]\) \(= \max _{\mu \in {\mathcal {D}}(C',k'_T,k'_S)}{\mu .score} = OPT(C',k'_T,k'_S)\).
Induction Step. Towards the proof of the step, we prove the following Eq. 5:
 \(\le\)::

Let \(\mu ^* \in {\mathcal {D}}(x^{(C )},k_T,k_S)\) be a derivation such that \(\mu ^*.score = OPT(C,k_T,k_S)\), and let \(e_c=E_{I}(x^{(C )},k_T,k_S)\). By definition, \(\mu ^*\) is a derivation of \(x^{(C )}\) to the string \(S'[1:e_c]\). In a derivation every character of the derived string is either deleted or it is a part of a substring derived from one of the children of x. So, either \(S'[e_c]\) is deleted under \(\mu ^*\), or it is mapped under some derivation of a child of \(x^{(C )}\), to a substring \(S'[i:e_c]\) (for an index \(0<i\le e_c\)). First, if the former is true, then by removing the deletion of \(S'[e_c]\) from \(\mu ^*\) (\(\mathsf {removeDel}(\mu ^*, E_{I}(x^{(C )}, k_T, k_S))\)) a derivation \(\mu ' \in {\mathcal {D}}(x^{(C )},k_T,k_S1)\) is obtained. The derivation \(\mu '\) derives the string \(S'[1:E_{I}(x^{(C )},k_T,k_S1)] = S'[1:e_c  1]\). So, the following Eq. 6 is true.
$$\begin{aligned} \begin{aligned} \mu ^*.score&= \mu '.score\\&\le OPT(C,k_T,k_S1)\\&\le \max (OPT(C,k_T,k_S1),\\&\displaystyle \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}} {OPT(C\setminus \{\mu .v\},k_T\mu .del_T,k_S\mu .del_S) + \mu .score}) \end{aligned} \end{aligned}$$(6)Note that even if there is a penalty cost for deletions, the cost for the deletion of \(S'[e_c]\) (i.e. \(\Delta (S'[e_c])\)) is constant in this setting. So, for two derivations \(\eta , \eta ' \in {\mathcal {D}}(x^{(C )},k_T,k_S1)\) if \(\eta .score \le \eta '.score\) then \(\eta .score \Delta (S'[e_c]) \le \eta '.score \Delta (S'[e_c])\). Hence, the conclusion from Eq. 6 is still true. Second, if the latter is true, then there is a node \(y \in C\) for which there is a derivation \(\mu _y \in {\mathcal {D}}{}\) such that \(\mu _y.e = e_c\) and \(\mu .y\) is the derivation of y under \(\mu ^*\). For \(\mu ^*\) to be a legal derivation, \(\mu _y\) must be in \({\mathcal {D}}_\le (C,k_T,k_S){}\). Hence, \(\mu _y.score \le \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}}{\mu .score}\). Furthermore, by removing \(\mu _y\) from \(\mu ^*\), \(\mathsf {remove}(\mu ^*,\mu .y)\), the obtained partial derivation, \(\mu '\), is of \(x^{(C \setminus \{y\})}\) to \(S'[1:\mu _y.s1]\) with \(k_T\mu _y.del_T\) deletions from the tree and \(k_S\mu _y.del_S\) from the string. Thus, \(\mu ' \in {\mathcal {D}}(x^{(C \setminus \{y\})},k_T\mu _y.del_T, k_S\mu _y.del_S)\), and so \(\mu '.score \le OPT(C\setminus \{y\}, k_T\mu _y.del_T, k_S\mu _y.del_S)\). Note that indeed \(\mu _y.s = 1 + E_{I}(x^{(C \setminus \{y\})},k_T\mu _y.del_T, k_S\mu _y.del_S)\), as can be seen in the following Eq. 7.
$$\begin{aligned} \begin{aligned} \mu _y.s&= e_c  L(y,\mu _y.del_T, \mu _y.del_S) + 1 \\&= \sum _{c\in C}{\mathsf {span}(c)} + k_S  k_T  (\mathsf {span}(y) + \mu _y.del_S \mu _y.del_T) + 1 \\&= \sum _{c\in C\setminus \{y\}}{\mathsf {span}(c)} + k_S\mu _y.del_S (k_T\mu _y.del_T) + 1 \\&= E_{I}(x^{(C \setminus \{y\})},k_T\mu _y.del_T, k_S\mu _y.del_S) + 1 \end{aligned} \end{aligned}$$(7)By combining our conclusions about \(\mu _y\) and \(\mu '\) together, we obtain the following Eq. 8.
$$\begin{aligned} \begin{aligned} \mu ^*.score&= \mu '.score + \mu _y.score \\&\le OPT(C\setminus \{y\},k_T\mu _y.del_T, k_S\mu _y.del_S) + \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}}{\mu .score} \\&\le \displaystyle \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}} {OPT(C\setminus \{\mu .v\},k_T\mu .del_T,k_S\mu .del_S) + \mu .score} \\&\le \max (OPT(C, k_T, k_S1), \\&\displaystyle \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}} {OPT(C\setminus \{\mu .v\},k_T\mu .del_T,k_S\mu .del_S) + \mu .score}) \end{aligned} \end{aligned}$$(8)  \(\ge\)::

Let \(\mu ^*\) be a derivation such that Eq. 9 holds, and let \(e_c=E_{I}(x^{(C )},k_T,k_S)\).
$$\begin{aligned} \begin{aligned} \mu ^*.score =&\max (OPT(C, k_T, k_S1), \\&\max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}} OPT(C\setminus \{\mu .v\},k_T\mu .del_T,k_S\mu .del_S) + \mu .score) \end{aligned} \end{aligned}$$(9)So, either \(\mu ^*.score = OPT(C, k_T, k_S1)\), or \(\mu ^*.score = \displaystyle \max _{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}}\) \(OPT(C\setminus \{\mu .v\},\) \(k_T\mu .del_T,k_S\mu .del_S) + \mu .score\). First, if the former is true, let \(\eta \in {\mathcal {D}}(x^{(C )},k_T,k_S1)\) be a derivation with \(\eta .score = OPT(C, k_T, k_S1)\). By definition, \(\eta\) derives the substring \(S'[1:E_{I}(x^{(C )},k_T,k_S1)]\). Adding to \(\eta\) the deletion of \(S'[e_c]\), \(\mathsf {addDel}(\eta ,e_c)\), results in a derivation \(\eta '\) of \(x^{(C )}\) to the string \(S'[1:e_c]\) with \(k_T\) deletions from the tree and \(k_S\) deletions from the string. The string \(S'[1:e_c]\) is equal to the concatenation of \(S'[1:E_{I}(x^{(C )},k_T,k_S1)]\) and \(S'[e_c]\). So, \(\eta ' \in {\mathcal {D}}(x^{(C )},k_T,k_S){}\), and thus \(\eta '.score \le OPT(C, k_T, k_S)\). The derivation \(\eta '\) was constructed such that \(\mu ^*.score = \eta '.score\), so \(\mu ^*.score \le OPT(C, k_T, k_S)\). Second, if the latter is true, then let \(\eta ^* = {{\,\mathrm{arg\max }\,}}_{\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}}OPT(C\setminus \{\mu .v\}, k_T  \mu .del_T, k_S  \mu .del_S) + \mu .score\). Adding \(\eta ^*\) to a partial derivation \(\eta \in {\mathcal {D}}(x^{(C \setminus \{\eta ^*.v\})},k_T  \eta ^*.del_T, k_S  \eta ^*.del_S)\), \(\mathsf {add}(\eta ,\eta ^*)\), results in a partial derivation, \(\eta '\), with \(k_T\eta ^*.del_T+ \eta ^*.del_T= k_T\) deletions from the tree and \(k_S\eta ^*.del_S+ \eta ^*.del_S= k_S\) deletions from the string, that takes into account the children of x that are in \(C\setminus \{\eta ^*.v\} \cup \{\eta ^*.v\} = C\). It is a legal partial derivation since \(\eta ^*\) derives the node \(\eta ^*.v\) that is not in \(C\setminus \{\eta ^*.v\}\) to a string that does not intersect with the string derived by \(\eta\). The string that is derived by \(\eta\) is \(S'[\eta .s:\eta .e]\) and it does not intersect with the string derived by \(\eta ^*\) (\(S'[\eta ^*.s:\eta ^*.e]\)). That is because \(\eta .e + 1 = \eta ^*.s\), as can be seen similarly to Eq. 7. So, \(\eta ' \in {\mathcal {D}}(x^{(C )},k_T,k_S)\), and thus \(\eta '.score \le OPT(C, k_T, k_S)\). The partial derivation \(\eta '\) was constructed such that \(\mu ^*.score = \eta '.score\), so \(\mu ^*.score \le OPT(C, k_T, k_S)\).
From the induction assumption, \({\mathcal {P}}[C, k_T, k_S1] = OPT(C, k_T, k_S1)\) and for every \(\mu \in {\mathcal {D}}_\le (C,k_T,k_S)\), \({\mathcal {P}}[C\setminus \{\mu .v\}, k_T  \mu .del_T, k_S  \mu .del_S] = OPT(C\setminus \{\mu .v\}, k_T  \mu .del_T, k_S  \mu .del_S)\). Thus, from Eq. 5, it follows that \({\mathcal {P}}[C, k_T, k_S] = OPT(C, k_T, k_S)\). This completes the proof. \(\square\)
Time and Space Complexity of the PQTree Search Algorithm
In this section the complexity of the main algorithm for PQTree Search as well as the complexity of the Pmapping algorithm are proven.
Time and Space Complexity of the Main Algorithm
Here we prove Lemma 1.
Proof
The number of leaves in the PQtree T is m, hence there are O(m) nodes in the tree, i.e the size of the first dimension of the DP table, \({\mathcal {A}}{}\), is O(m). In the algorithm description ("Pnode mapping" section) a bound for the possible start indices of substrings derived from nodes in T is given (for a node x, the start index i runs between 1 and \(n(\mathsf {span}(x)d_T)+1\)). The node with the largest span in T is the root which has a span of m. The root is mapped to the longest substring when there are \(d_S\) deletions from the string. Hence, the size of the second dimension of \({\mathcal {A}}{}\) is \(\Omega (n(m+d_S)+1) = \Omega (n)\) (given that \(d_S<< n\)). The nodes with the smallest spans are the leaves, which have a span of 1, hence the size of the second dimension of \({\mathcal {A}}{}\) is O(n). The third and fourth dimensions of \({\mathcal {A}}{}\) are of size \(d_T+1\) and \(d_S+1\), respectively. In total, the DP table \({\mathcal {A}}{}\) is of size \(O(d_T d_S m n)\).
In the initialization step \(O(d_S m n)\) entries of \({\mathcal {A}}{}\) are computed in \(O(d_S)\) time each. This holds because there are m leaves and O(n) start indices for every string of length \(k_S\le d_S\), and it takes \(O(d_S)\) time to compute the max function. There are also \(O((d_T1) d_S m n)\) entries of \({\mathcal {A}}{}\) that are computed in O(1) time each. These are the entries initialized with the 0 and \(\infty\) values. This results in a \(O((d_T + d_S) d_S m n)\) time initialization step which can be reduced to \(O(d_T d_S m n)\) by using the replacement initialization rule mentioned in " Pnode mapping" section, though they are both negligible. The Pmapping algorithm is called for every Pnode in T and every possible start index i, i.e. the Pmapping algorithm is called \(O(n m_p)\) times. Similarly, the Qmapping algorithm is called \(O(n m_q)\) times. Thus, it takes \(O(n\ (m_p \cdot \text {Time(Pmapping)} + m_q \cdot \text {Time(Qmapping)}))\) time to fill the DP table. In the final stage of the algorithm the maximum over the entries corresponding to every combination of deletion numbers and start index (\(0\le k_T\le d_T\), \(0\le k_S \le d_S, 1\le i\le n(\mathsf {span}(x)d_T)+1\)) is computed. So, it takes \(O(d_T d_S n)\) time to find a derivation with maximum score. Tracing back through the DP table to find the actual mapping does not increase the time complexity.
From Lemma 2, the Pmapping algorithm takes \(O(\gamma 2^{\gamma } {d_T}^2 {d_S}^2)\) time and \(O(d_T d_S 2^\gamma )\) space, and from Lemma 3, the Qmapping algorithm takes \(O(\gamma {d_T}^2 {d_S}^2)\) time and \(O(d_T d_S \gamma )\) space. Thus, in total, our algorithm runs in \(O(n (m_p \cdot \gamma 2^{\gamma } {d_T}^2 {d_S}^2 + m_q \cdot \gamma {d_T}^2 {d_S}^2)) = O(n \gamma {d_T}^2 {d_S}^2 (m_p \cdot 2^{\gamma } + m_q))\) time. Adding to the space required for the main DP table the space required for the Pmapping algorithm (the space needed for the Qmapping algorithm is insignificant with respect to the Pmapping algorithm) results in a total space complexity of \(O(d_T d_S m n) + O(d_T d_S 2^\gamma ) = O(d_T d_S (m n + 2^\gamma ))\). This completes the proof. \(\square\)
Time and Space Complexity of the PNode Mapping Algorithm
Here we prove Lemma 2.
Proof
The most space consuming part of the algorithm is the 3dimensional DP table. The first dimension, C, can be any subset of the set \(\mathsf {children}(x)\), and therefore it is of size \(2^{\mathsf {children}(x)} = 2^\gamma\). The size of the second and third dimensions (i.e. \(k_T\) and \(k_S\)) are \(d_T+1\) and \(d_S+1\), respectively. Hence, the space of the DP algorithm is \(O(d_T d_S 2^\gamma )\).
The algorithm has three parts: initialization, filling the DP table, and returning the derivations in the required order. The most time consuming calculation required in the initialization is the calculation of \(L(x^{(C )},k_T,k_S)\). It requires summing the spans of all nodes in C. This calculation will also be required in the second part of the algorithm. To avoid the repetitive calculations, it is performed once for every \((C,k_T,k_S)\) tuple and the results are saved. This requires \(O(d_T d_S 2^{\mathsf {children}(x)}) = O(d_t d_S 2^\gamma )\) space (for this is the number of such tuples). Each value is calculated in \(O(\mathsf {children}(x)) = O(\gamma )\) time. Hence, the calculation of all the \(L(x^{(C )},k_T,k_S)\) values (and thus all the \(E_{I}(x^{(C )},k_T,k_S)\) values) takes \(O(d_T d_S \gamma 2^\gamma )\) time and \(O(d_T d_S 2^\gamma )\) space. The second part of the algorithm is done by calculating the value of every entry in the \(O(d_T d_S 2^\gamma )\) entries of \({\mathcal {P}}{}\), using the recursion rule in Eq. 2. The first line among the rule takes O(1) time, since it involves looking in another entry of \({\mathcal {P}}{}\) and basic computations. The second line of the rule involves going over all derivations \(\mu \in {\mathcal {D}}_\le (C,k_T,k_S){}\). Namely, going over all derivations with a specific end point, which derives a node in C and has no more than a specific number of deletions from the tree and string (i.e. \(\mu .e=E_{I}(C,k_T,k_S)\), \(\mu .v \in C\), \(\mu .del_T\le k_T\) and \(\mu .del_S\le k_S\)). The number of deletions from the tree and string are bounded by \(d_T\) and \(d_S\), respectively, and the number of nodes in C is bounded by the number of children of x, \(\gamma\). Hence, the time to calculate one entry of \({\mathcal {P}}{}\) is \(O(d_T d_S \gamma )\). In total, the second part of the algorithm takes \(O({d_T}^2 {d_S}^2 \gamma 2^\gamma )\) time. Finally, to construct the returned set of derivations, the algorithm goes over every deletion combination \(k_T,k_S\) once, i.e. it takes \(O(d_T d_S)\) time. In total, the algorithm takes \(O({d_T}^2 {d_S}^2 \gamma 2^\gamma ) + O(d_T d_S \gamma 2^\gamma ) + O(d_T d_S) = O({d_T}^2 {d_S}^2 \gamma 2^\gamma )\) time. \(\square\)
Conclusions
In this paper, we defined two new problems in comparative genomics, denoted PQTree Search and PQTree Alignment, where the second is a subproblem of the first. Both problems take as input a PQtree T representing the known gene orders of a gene cluster of interest, a genetogene substitution scoring function h, integer arguments \(d_T\) and \(d_S\), and a sequence of genes S. The objective in PQTree Search is to identify an approximate instance \(S'\) of the gene cluster, such that \(S'\) is a substring of S. The objective of PQTree Alignment is to determine whether \(S'\) is an approximate instance of the gene cluster; An approximate instance could vary from the known gene orders by genome rearrangements that are constrained by T, by gene substitutions that are governed by h, and by gene deletions and insertions that are bounded from above by \(d_T\) and \(d_S\), respectively.
We proved that the PQTree Search and the PQTree Alignment problems are NPhard and proposed a parameterized algorithm that solves PQTree Search in \(O^*(2^{\gamma })\) time by solving PQTree Alignment for every substring of S. The parameter \(\gamma\) is the maximum degree of a node in T and \(O^*\) is used to hide factors polynomial in the input size.
The proposed algorithm was implemented as a publicly available tool and harnessed to search for treeguided rearrangements of chromosomal gene clusters in plasmids. We identified 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements are guided by the corresponding PQtree. A treeguided rearrangement event of one of these gene clusters, coding for a heavy metal efflux pump, was detected in a bacterial strain that was isolated from an environment polluted with several heavy metals. Thus, a future extension of this study could explore whether similar gene cluster rearrangement events are correlated with environmental stress or other bacterial adaptations.
The said gene cluster was further analysed to characterize its approximate instances in plasmids. An interesting variant of the analysed gene cluster, found among its approximate instances, corresponds to a copper efflux pump. It was found mainly in pathogenic bacteria, and likely constitutes a bacterial defense mechanism against the host immune response. These results exemplify how our proposed tool PQFinder can be harnessed to find meaningful variations of known biological systems that are conserved as gene clusters, and to explore their function and evolution.
Another interesting approach to perform a comparative analysis of gene clusters in chromosomes versus plasmids could theoretically be based on the alignment of PQtrees that represent the respective gene clusters. However, this will require denovo discovery of gene clusters in both chromosomes and plasmids—a task that is more challenging in plasmids than in chromosomes for the following two reasons. First, as it is more difficult to assemble plasmids than to assemble chromosomes, some of the plasmids may not be accurately reconstructed [11]. Second, the plasmid gene pool is more diverse and less conserved than the gene pool of chromosomes [55]. This motivated us to identify gene clusters in chromosomes and then to search for approximate treeguided rearrangements of these gene clusters in plasmids.
One of the downsides to using PQtrees to represent gene clusters is that very rare gene orders taken into account in the tree construction could greatly increase the number of allowed rearrangements and thus substantially lower the specificity of the PQtree. Thus, a natural continuation of our research would be to increase the specificity of the model by considering a stochastic variation of PQTree Search and PQTree Alignment. Namely, defining a PQtree in which the internal nodes hold the probability of each rearrangement, and adjusting the algorithms for PQTree Search and PQTree Alignment accordingly. In addition, future extensions of this work could also aim to increase the sensitivity of the model by incorporating gene orientation, and by taking into account gene duplications and genefusion events, which are typical events in gene cluster evolution.
Notes
The notation O* is used to hide factors polynomial in the input size.
Abbreviations
 NPhard::

Nondeterministic Polynomialtime Hard
 FPT::

Fixed Parameter Tractable
 JISP::

Job Interval Selection Problem
References
Zimerman GR, Svetlitsky D, Zehavi M, ZivUkelson M. Approximate search for known gene clusters in new genomes using pqtrees. In: 20th International workshop on algorithms in bioinformatics (WABI 2020). 2020. https://doi.org/10.4230/LIPIcs.WABI.2020.1. Schloss DagstuhlLeibnizZentrum für Informatik. https://drops.dagstuhl.de/opus/volltexte/2020/12790/
Zimerman GR. The PQFinder tool. www.github.com/GaliaZim/PQFinder
Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. Refseq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42(D1):553–9. https://doi.org/10.1093/nar/gkt1274.
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, et al. Patric, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):581–91. https://doi.org/10.1093/nar/gkt1099.
Böcker S, Jahn K, Mixtacki J, Stoye J. Computation of median gene clusters. J Comput Biol. 2009;16(8):1085–99. https://doi.org/10.1089/cmb.2009.0098.
He X, Goldwasser MH. Identifying conserved gene clusters in the presence of homology families. J Comput Biol. 2005;12(6):638–56. https://doi.org/10.1089/cmb.2005.12.638.
Winter S, Jahn K, Wehner S, Kuchenbecker L, Marz M, Stoye J, Böcker S. Finding approximate gene clusters with gecko 3. Nucleic Acids Res. 2016;44(20):9600–10. https://doi.org/10.1093/nar/gkw843.
Norris V, Merieau A. Plasmids as scribbling pads for operon formation and propagation. Res Microbiol. 2013;164(7):779–87. https://doi.org/10.1016/j.resmic.2013.04.003.
He S, Chandler M, Varani AM, Hickman AB, Dekker JP, Dyda F. Mechanisms of evolution in highconsequence drug resistance plasmids. mBio. 2016. https://doi.org/10.1128/mBio.0198716.
Eberhard WG. Evolution in bacterial plasmids and levels of selection. Q Rev Biol. 1990;65(1):3–22. https://doi.org/10.1086/416582.
Orlek A, Stoesser N, Anjum MF, Doumith M, Ellington MJ, Peto T, Crook D, Woodford N, Walker AS, Phan H, et al. Plasmid classification in an era of wholegenome sequencing: application in studies of antibiotic resistance epidemiology. Front Microbiol. 2017;8:182. https://doi.org/10.3389/fmicb.2017.00182.
Booth KS, Lueker GS. Testing for the consecutive ones property, interval graphs, and graph planarity using pqtree algorithms. J Comput Syst Sci. 1976;13(3):335–79. https://doi.org/10.1016/S00220000(76)800451.
Bergeron A, Gingras Y, Chauve C. Formal models of gene clusters. Bioinform Algorithms Tech Appl. 2008;8:177–202. https://doi.org/10.1002/9780470253441.ch8.
Metcalf WW, Wanner BL. Evidence for a fourteengene, phnC to phnP locus for phosphonate metabolism in escherichia coli. Gene. 1993;129(1):27–32. https://doi.org/10.1016/03781119(93)90692V.
Fondi M, Emiliani G, Fani R. Origin and evolution of operons and metabolic pathways. Res Microbiol. 2009;160(7):502–12. https://doi.org/10.1016/j.resmic.2009.05.001.
Wells JN, Bergendahl LT, Marsh JA. Operon gene order is optimized for ordered protein complex assembly. Cell Rep. 2016;14(4):679–85. https://doi.org/10.1016/j.celrep.2015.12.085.
Tatusov RL, Galperin MY, Natale DA, Koonin EV. The cog database: a tool for genomescale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28(1):33–6. https://doi.org/10.1093/nar/28.1.33.
Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Commun ACM. 1975;18(11):613–20. https://doi.org/10.1145/361219.361220.
Bergeron A, Corteel S, Raffinot M (2002) The algorithmic of gene teams. In: International workshop on algorithms in bioinformatics. Springer. p. 464–476 . https://doi.org/10.1007/3540457844_36.
Eres R, Landau G.M, Parida L (2003) A combinatorial approach to automatic discovery of clusterpatterns. In: International workshop on algorithms in bioinformatics. Springer. p. 139–150 . https://doi.org/10.1007/9783540397632_11.
Heber S, Stoye J (2001) Algorithms for finding gene clusters. In: International workshop on algorithms in bioinformatics. Springer. p. 252–263 . https://doi.org/10.1007/3540446966_20.
Schmidt T, Stoye J (2004) Quadratic time algorithms for finding common intervals in two and more sequences. In: combinatorial pattern matching. Springer. p. 347–358 . https://doi.org/10.1007/9783540278016_26.
Uno T, Yagiura M. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica. 2000;26(2):290–309. https://doi.org/10.1007/s004539910014.
Alizadeh F, Karp RM, Weisser DK, Zweig G. Physical mapping of chromosomes using unique probes. J Comput Biol. 1995;2(2):159–84. https://doi.org/10.1089/cmb.1995.2.159.
Christof T, Jünger M, Kececioglu J, Mutzel P, Reinelt G. A branchandcut approach to physical mapping of chromosomes by unique endprobes. J Comput Biol. 1997;4(4):433–47. https://doi.org/10.1089/cmb.1997.4.433.
Bérard S, Bergeron A, Chauve C, Paul C. Perfect sorting by reversals is not always difficult. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(1):4–16. https://doi.org/10.1145/1229968.1229972.
Bergeron A, Mixtacki J, Stoye J (2004) Reversal distance without hurdles and fortresses. In: annual symposium on combinatorial pattern matching. Springer. p. 388–399 . https://doi.org/10.1007/9783540278016_29.
Landau GM, Parida L, Weimann O. Gene proximity analysis across whole genomes via pq trees. J Comput Biol. 2005;12(10):1289–306. https://doi.org/10.1089/cmb.2005.12.1289.
Adam Z, Turmel M, Lemieux C, Sankoff D. Common intervals and symmetric difference in a modelfree phylogenomics, with an application to streptophyte evolution. J Comput Biol. 2007;14(4):436–45. https://doi.org/10.1089/cmb.2007.A005.
Bergeron A, Blanchette M, Chateau A, Chauve C (2004) Reconstructing ancestral gene orders using conserved intervals. In: international workshop on algorithms in bioinformatics. Springer. p. 14–25 . https://doi.org/10.1007/9783540302193_2.
Parida L. Using pq structures for genomic rearrangement phylogeny. J Comput Biol. 2006;13(10):1685–700. https://doi.org/10.1089/cmb.2006.13.1685.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25–9.
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, BenHur A, et al. A largescale evaluation of computational protein function prediction. Nat Methods. 2013;10(3):221–7.
Jiang Y, Oron TR, Clark WT, Bankapur AR, DAndrea D, Lepore R, Funk CS, Kahanda I, Verspoor KM, BenHur A, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016;17(1):1–19.
Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, MartinezCruz LA, Corrales FJ, Rubio A. Correlation between gene expression and go semantic similarity. IEEE/ACM Trans Comput Biol Bioinform. 2005;2(4):330–8.
Yang D, Li Y, Xiao H, Liu Q, Zhang M, Zhu J, Ma W, Yao C, Wang J, Wang D, et al. Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant go categories. Bioinformatics. 2008;24(2):265–71.
Cho YR, Hwang W, Ramanathan M, Zhang A. Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinform. 2007;8(1):265.
Zhang SB, Tang QR. Proteinprotein interaction inference based on semantic similarity of gene ontology terms. J Theor Biol. 2016;401:30–7.
Doerr D, Stoye J. A perspective on comparative and functional genomics. 2019;361–372.
Cygan M, Fomin FV, Kowalik L, Lokshtanov D, Marx D, Pilipczuk M, Pilipczuk M, Saurabh S. Parameterized algorithms. Cham: Springer; 2015. https://doi.org/10.1007/9783319212753.
Downey RG, Fellows MR. Fundamentals of parameterized complexity texts in computer science. Cham: Springer; 2013. https://doi.org/10.1007/9781447155591.
Fomin FV, Lokshtanov D, Saurabh S, Zehavi M. Kernelization: theory of parameterized preprocessing. England: Cambridge University Press; 2019.
van Bevern R, Mnich M, Niedermeier R, Weller M. Interval scheduling and colorful independent sets. J Sched. 2015;18(5):449–69. https://doi.org/10.1007/s1095101403985.
Svetlitsky D, Dagan T, ZivUkelson M. Discovery of multioperon colinear syntenic blocks in microbial genomes. Bioinformatics. 2020. https://doi.org/10.1093/bioinformatics/btaa503.
Gourevitch L. A program for PQtree construction. https://github.com/levgou/pqtrees
Vandecraen J, Chandler M, Aertsen A, Houdt RV. The impact of insertion sequences on bacterial genome plasticity and adaptability. Crit Rev Microbiol. 2017;43(6):709–30. https://doi.org/10.1080/1040841X.2017.1303661 (PMID: 28407717).
Nies DH. Effluxmediated heavy metal resistance in prokaryotes. FEMS Microbiol Rev. 2003;27(2–3):313–39.
Fu Y, Chang FMJ, Giedroc DP. Copper transport and trafficking at the hostbacterial pathogen interface. Acc Chem Res. 2014;47(12):3605–13.
Du D, Wang Z, James NR, Voss JE, Klimont E, OheneAgyei T, Venter H, Chiu W, Luisi BF. Structure of the AcrABTolC multidrug efflux pump. Nature. 2014;509(7501):512–5.
Sulavik MC, Houseweart C, Cramer C, Jiwani N, Murgolo N, Greene J, DiDomenico B, Shaw KJ, Miller GH, Hare R, et al. Antibiotic susceptibility profiles of escherichia coli strains lacking multidrug efflux pump genes. Antimicrob Agents Chemother. 2001;45(4):1126–36. https://doi.org/10.1128/AAC.45.4.11261136.2001.
Nakajima K, Hakimi SL. Complexity results for scheduling tasks with discrete starting times. J Algorithms. 1982;3(4):344–61. https://doi.org/10.1016/01966774(82)90030X.
Keil JM. On the complexity of scheduling tasks with discrete starting times. Oper Res Lett. 1992;12(5):293–5. https://doi.org/10.1016/01676377(92)90087J.
Spieksma FC. On the approximability of an interval scheduling problem. Journal of Scheduling. 1999;2(5):215–27. https://doi.org/10.1002/(SICI)10991425(199909/10)2:5¡215::AIDJOS27¿3.0.CO;2Y.
Spieksma FC, Crama Y. The complexity of scheduling short tasks with few starting times. Netherlands: Rijksuniversiteit Limburg. Vakgroep Wiskunde; 1992.
Norman A, Hansen LH, Sørensen SJ. Conjugative plasmids: vessels of the communal gene pool. Philos Trans R Soc B Biol Sci. 2009;364(1527):2275–89.
Zimerman GR, Svetlitsky D, Zehavi M, ZivUkelson M. Approximate search for known gene clusters in new genomes using PQtrees. 2020. arXiv: 2007.03589.
Acknowledgements
Many thanks to Lev Gourevitch for his excellent implementation of a PQtree builder. We also thank the anonymous WABI reviewers for their very helpful comments.
Funding
The research of G.R.Z. was partially supported by the Planning and Budgeting Committee of the Council for Higher Education in Israel and by the Frankel Center for Computer Science at Ben Gurion University. The research of G.R.Z. and M.Z. was partially supported by the Israel Science Foundation (Grant No. 1176/18). The research of G.R.Z., D.S. and M.Z.U. was partially supported by the Israel Science Foundation (grant no. 939/18).
Author information
Authors and Affiliations
Contributions
MZ and MZU initiated and guided the research. GRZ developed and implemented the presented algorithms, and participated in the application of the algorithms to the experimental benchmarks. DS developed the bioinformatic pipeline and performed the experiments and the data analysis. GRZ and DS wrote the paper, with guidance by MZ and MZU. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A conference version of this paper appeared in WABI2020 [1]
Supplementary Information
Additional file 1.
Supplementary material of the paper, including additional descriptions, proofs and figures.
Additional file 2.
A list of chromosomes and plasmids analysed in the main text.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zimerman, G.R., Svetlitsky, D., Zehavi, M. et al. Approximate search for known gene clusters in new genomes using PQtrees. Algorithms Mol Biol 16, 16 (2021). https://doi.org/10.1186/s13015021001909
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015021001909