Notation and problem formulation
Before proceeding, we shall introduce some notation and definitions that will prove useful throughout the article. The human genome is divided in 23 pairs of chromosomes, i.e., organized structures of DNA that contain many genes, regulatory elements and other nucleotide sequences. When a nucleotide site of a specific chromosome region shows a variability within a population of individuals then it is called a Single Nucleotide Polymorphism (SNP). Specifically, a site is considered a SNP if for a minority of the population a certain nucleotide is observed (called the minor allele) while for the rest of the population a different nucleotide is observed (the major allele). The minor allele, or mutant type[24], is generally encoded as ‘1’, as opposed to the major allele, or wild type[24], generally encoded as ‘0’. A haplotype is a set of alleles, or more formally, a string of length m over an alphabet Σ = {0,1}[25].
Given a setof n haplotypes, denote as the set of alleles and h
i
(s),, as the s-th allele of haplotype. Given two distinct haplotypes we denote as the subset of different alleles between h
i
and h
j
, as the distance between h
i
and h
j
, and we say that h
i
and h
j
are adjacent if. From a biological point of view, the adjacency between a pair of distinct haplotypes means that one of the two haplotypes evolved from the other by mutation of a specific SNP over time.
Consider a graph having a vertex for each haplotype inand an edge for each pair of adjacent haplotypes Then, a phylogeny T ofis a spanning tree of G, i.e., an acyclic subgraph of G in which a pair of vertices is adjacent in T if. It is worth noting that, according to the definition of the edge set E, in general a phylogeny ofmay not exist as the graph might not be connected. To always ensure the existence of a phylogeny for we introduce the set which consists of the minimum number of haplotypes that should be added toin such a way that, defined and, the graph is connected. From a biological point of view, the set represents the set of haplotypes that are ancestors of the observed ones but either had gone extinct or just were not observed in that sample (also called Steiner nodes).
Denote as a phylogeny of, as the edge set of, and as the length of the phylogeny, i.e., the sum of the distances, for all. Then, the problem of finding a phylogeny ofthat satisfies the parsimony criterion consists of solving the following optimization problem:
Problem 2
The Most Parsimonious Phylogeny Estimation Problem from SNP haplotypes (MPPEP-SNP).Given a setof n haplotypes having m alleles each, find the minimum cardinality haplotype set to be added toso that the phylogeny has minimum length.
If the haplotype setis empty, i.e., if is connected, then MPPEP-SNP can be solved in polynomial time as the minimum spanning tree is a (optimal) solution to the MPPEP-SNP. Similarly, if the input haplotype setsatisfies the perfect phylogeny condition i.e., the requirement that each allele changes only once throughout the optimal phylogeny (see[19]), then the MPPEP-SNP can be still solved in polynomial time[26–28]. Unfortunately, it is possible to prove that in the general case the MPPEP-SNP is-hard (see[1, 22]). In fact, the binary nature of the SNP haplotypes allows us to interpret a generic haplotype as a vertex of a m-dimensional unit hypercube, its s-th allele as the s-th coordinate of the vertex h
i
, and the set as the set of Steiner vertices of the unit hypercube. Hence the MPPEP-SNP can be seen as particular case of the Steiner tree problem in a graph, a notorious-hard combinatorial optimization problem[29].
Finding the optimal solutions to the MPPEP-SNP is fundamental to validating the parsimony criterion, i.e., to verify whether, for a given instance of MPPEP-SNP, the results predicted by the criterion match the experimental ones. Unfortunately, the-hardness of the MPPEP-SNP limits the size of the instances analyzable to the optimum, which in turn affects the ability to validate the parsimony criterion, hence the practical utility of the predictions themselves. In order to address this concern, in the following section we shall develop an integer programming model able to provide optimal solutions to real instances of the MPPEP-SNP.
A mixed integer programming model for the MPPEP-SNP
Let the set of potential vertices of a phylogeny ofand assume the convention to denote the n haplotypes inas the first n vertices in V. The first task in designing an integer programming model for the MPPEP-SNP that uses a polynomial-size number of variables consists of characterizing V, i.e., determining an upper and a lower bound on the cardinality of the set. In fact, observe that contains potentially an exponential number of haplotypes, namely all vertices of the unit hypercube that belong to the set. However, we can easily bound the cardinality of by means of the following approach. Consider the complete graph, where, and construct a minimum spanning tree of Denote as the set of edges (h
i
,h
j
) of. Then, an upper bound UB on the overall number of Steiner vertices of the optimal phylogeny can be obtained by computing the sum
Similarly, denote, a lower bound LB on the overall number of Steiner vertices of can be obtained as[30, 31]:
Denote u
i
, i ∈ V, as a decision variable equal to 1 if the i-th vertex of V is considered in the optimal solution to the MPEPP-SNP and 0 otherwise; as a decision variable equal to 1 if in the optimal solution to the MPPEP-SNP the s-th coordinate of the vertex u
i
, i ∈ V, is 1 and 0 otherwise; as a decision variable equal to 1 if in the optimal solution to the MPPEP-SNP the pair of distinct vertices i,j ∈ V has a change at s-th coordinate, and 0 otherwise; and y
ij
as a decision variable equal to 1 if the pair of distinct vertices i,j∈V is adjacent in the optimal solution to the MPPEP-SNP and 0 otherwise. Finally, let,, and Q = {1,2,…,n + LB}. Then, a valid formulation for the MPPEP-SNP is the following:
Formulation 1
Basic Model
(1a)
(1b)
(1c)
(1d)
(1e)
(1f)
(1g)
(1h)
(1i)
(1j)
(1k)
(1l)
(1m)
The objective function (1a) aims at minimizing the length of the optimal phylogeny. Constraints (1b) impose that the coordinates of the first n vertices in V are exactly the ones of the input haplotype set Constraints (1c) impose that the s-th coordinate of vertex u
i
, i ∈ V, can assume value 1 only if vertex u
i
is considered in the optimal solution to the problem. Constraints (1d)-(1e) force variables to be one if in the optimal solution to the problem there exist a pair of adjacent vertices i j ∈ V having a different value at the s-th coordinate. Constraints (1f) impose that in an optimal solution to the problem two distinct vertices i j ∈ V can be adjacent only if. Constraints (1g)-(1h) impose that in the optimal solution to the problem variable y
ij
may assume value 1 only if both vertices i and j are considered. Vice versa, constraints (1i) impose that if in the optimal solution to the problem a vertex u
i
, i ∈ V, is considered then at least one variable y
ij
must assume value 1. Constraints (1j) and (1k) impose the Generalized Subtour Elimination Constraints (GSEC)[23]. Finally, constraints (1l) impose that the first n + LB vertices in V have to be considered in the optimal solution to the problem.
Note that Formulation 1 can be easily extended to the case in which the haplotypes are represented by multi-character data, i.e., sequences over an alphabet Σ = {0,1,2,…,γ}. In fact, in such a case it is sufficient to replace each character c in the haplotype by a binary γ -vector ν such that the s-th coordinate of ν is equal to 1 if the character c is equal to s, s ∈ Σ, and 0 otherwise. For example, if the generic haplotype were, for example, the string 〈AACGT〉, then it could be represented as 〈1000 1000 0100 0010 0001〉.
Reducing model size
Formulation 1 is characterized by a polynomial number of variables and an exponential number of constraints. Its implementation can be performed by means of standard branch-and-cut approaches that use GSEC separation oracles such as those described in[32].
It is worth noting that variables and can be relaxed in Formulation1c)-(1e) and the convexity constraint (1f) suffice to guarantee their integrality in any optimal solution to the problem. Moreover, Formulation 1 can be reduced in size by removing those variables that are redundant or whose value is known in the optimal solution to the problem. For example, variables y
ij
can be removed from Formulation 1 as it is easy to realize that they are redundant. Similarly, all variables such that and d
ij
> 1 do not need to be defined as their value will be always zero for any and in any feasible solution to the problem. Variables u
i
, i ∈ Q, do not need to be declared as their value will be always 1 any feasible solution to the problem. Finally, variables,, can be removed as their value is univocally assigned by the input haplotype set The reduction process can be further combined with the preprocessing strategies described in[1] to obtain even smaller formulations. Such strategies allow one to remove alleles from the input haplotype setwithout altering the optimal solution to the problem. For example, suppose that the haplotype setis such that there exists an allele such that, for all; then it is easy to realize thatcan be removed fromas in any feasible solution to the problem the coordinate of any vertex in the phylogeny would be characterized by having x i ŝ = 1. A similar situation occurs when there exists an allele such that, for all. Analogously, suppose that the input haplotype setis characterized by equal alleles, i.e., by the existence of two alleles, say and, such that, for all. Then it is easy to realize that if one removes one of the two alleles from say, and multiplies the-th coordinate by 2 does not alter neither the structure nor the value of the optimal solution to the problem. Describing all the preprocessing techniques for shrinking the input haplotype setis beyond the scope of the present article. The interested reader will find a systematic discussion of this topic in[1].
By applying the previously cited reduction strategies to Formulation 1 and denotingas the set of alleles resulting from the application of the preprocessing strategies described in[1], ws as the number of alleles inequal to the s-th,, Z as the set for which variables are defined, R = {n + LB + 1,n + LB + 2,…,n + UB}, and, for any C ⊂ V, the following reduced formulation for the MPPEP-SNP can be obtained:
Formulation 2
Reduced Model
(2a)
(2b)
(2c)
(2d)
(2e)
(2f)
(2g)
(2h)
(2i)
(2j)
(2k)
(2l)
(2m)
(2n)
Note that in Formulation 2 variables and cannot be relaxed anymore.
Strengthening valid inequalities
By exploiting the integrality of variables u
i
,, and, a number of valid inequalities can be developed to strengthen Formulation 2.
Proposition 1
Constraints
(3)
are valid for Formulation 2.
Proof
In a feasible solution to the problem variable u
i
, i ∈ V∖(Q ∪ {n + UB}), can assume only value 0 or 1. If u
i
= 0, constraint (3) reduces to ui + 1 ≤ 0 which is trivially valid for Formulation 2. If u
i
= 1, constraint (3) reduces to ui + 1 ≤ 1 which is again valid. □
Constraints (3) impose an ordering on the variables u
i
, i ∈ R, so that vertex ui + 1 can be considered in the optimal solution to the problem only if vertex u
i
has been already considered.
Proposition 2
Constraints
(4)
are valid for Formulation 2.
Proof
In a feasible solution to the problem a vertex u
i
,, cannot be a terminal vertex. In fact, if such a condition held, a cheaper solution could be obtained by dropping u
i
from, contradicting the optimality of itself. Hence, the degree of any vertex in must be at least 2. Now, in a feasible solution to the problem variables u
i
∈ {0,1}. If u
i
= 0, constraint (4) reduces to
which is trivially valid. Vice versa, if u
i
= 1, constraint (4) reduces to
which is again valid for the above arguments. □
Proposition 3
Constraints
(5)
(6)
are valid for Formulation 2.
Proof
As observed in the previous proposition, in a feasible solution to the problem,, i,j∈Z, can assume only value 0 or 1. If, then constraint (5) (respectively constraint (6)) reduces to (respectively), which is trivially valid due to the integrality of variables. If, then either or. If then constraint (5), (respectively constraint (6)) reduces to (respectively), which is trivially valid. If then constraint (5) (respectively constraint (6)) reduces to (respectively), which is again valid. □
Similar arguments can be used to prove the following proposition:
Proposition 4
Constraints
(7)
(8)
are valid for Formulation 2.
Given an input haplotype setand a pair of non-adjacent haplotypes h
i
and h
j
, there exit multiple equivalent paths that may connect h
i
and h
j
in the unary hypercube. This characteristic constitutes the principal class of symmetries for the MPPEP-SNP and may lead to poor relaxations for the problem. For example, suppose that the input haplotype setis constituted by haplotypes h1 = 〈00〉 and h2 = 〈11〉. Then a possible solution to the instance may consist either of a star centered in haplotype h3 = 〈10〉 or a star centered in haplotype h3 = 〈01〉(see Figure1). Note that both solutions are feasible and optimal for the specific instance. A possible strategy to break this class of symmetries consists of imposing the following constraints:
Proposition 5
Constraints
(9)
(10)
are valid for Formulation 2.
Proof
The statement trivially follows from the integrality of variables and from constraints (2b). □
Constraints (9)-(10) impose an ordering on the coordinates of the vertices in by means of the smallest big-M possible, i.e., a power of 2. Note that the distinction between constraints (9) and (10) is necessary, as in principle vertices in R may not be needed in the optimal solution to the problem.
Proposition 6
Constraints
(11)
are valid for Formulation 2.
Proof
In a feasible solution to the problem, the sum,, i,j∈Z, can assume only value 0 or 1. If, constraint (11) reduces to which is trivially valid. Vice versa, If, constraint (11) reduces to which is again valid due to Propositions 1 and 2. □
Proposition 6 forces vertices in to be sorted according to a decreasing degree order. In this way, it is possible to avoid the occurrence of symmetric solutions to the problem differing just for the degree of the Steiner vertices (see e.g., Figure2).
Let and k ∈ V, k ∉ Q3. Then the following proposition holds:
Proposition 7
Constraints
(12)
are valid for Formulation 2.
Proof
In a feasible solution to the problem the path between two distinct haplotypes cannot be shorter than the distance. Hence, if the distance between h
i
and h
j
is greater or equal to three, vertices i and j cannot be adjacent to a same vertex k, i.e., only one of the two sums or can be equal to 1. □
Note that if k ∈ R then (12) can be strengthened by replacing the right-hand-side by u
k
. Moreover, Proposition 7 can be generalized as follows. Consider the sets, d ∈ {3,4,…,m}, C ⊂ V such that 2 ≤ |C| ≤ d − 1 and C ∩ Q
d
= ∅, and a path p that involves only vertices in C. Denote p
k
the k-th vertex in p. Then the following proposition holds:
Proposition 8
The family of constraints
(13)
called forbidden path constraints, are valid for Formulation 2.
Proof
Similarly to Proposition 7, in a feasible solution to the problem the path p between two distinct haplotypes cannot be shorter than the distance. Hence, if the distance between h
i
and h
j
is greater or equal to d, at most |C| vertices can belong to p. □