Designing minimal microbial strains of desired functionality using a genetic algorithm

Background The rational, in silico prediction of gene-knockouts to turn organisms into efficient cell factories is an essential and computationally challenging task in metabolic engineering. Elementary flux mode analysis in combination with constraint minimal cut sets is a particularly powerful method to identify optimal engineering targets, which will force an organism into the desired metabolic state. Given an engineering objective, it is theoretically possible, although computationally impractical, to find the best minimal intervention strategies. Results We developed a genetic algorithm (GA-MCS) to quickly find many (near) optimal intervention strategies while overcoming the above mentioned computational burden. We tested our algorithm on Escherichia coli metabolic networks of three different sizes to find intervention strategies satisfying three different engineering objectives. Conclusions We show that GA-MCS finds all practically relevant targets for any (non)-linear engineering objective. Our algorithm also found solutions comparable to previously published results. We show that for large networks optimal solutions are found within a fraction of the time used for a complete enumeration.


Background
The availability of high amount of biological data has led to the reconstruction of genome-scale metabolic networks for many organisms [1][2][3][4] which can be analysed and probed using mathematical and computational methods [5,6]. Prominent among these are constraint based modelling approaches which depend on the stoichiometry of the reactions. These include methods like flux balance analysis, FBA, [7] and elementary flux mode analysis, EFMA [8,9]. The major difference between these approaches is that FBA seeks particular flux solutions whereas EFMA seeks to describe the entire flux space by enumerating all its elementary and balanced pathways which are called elementary flux modes, EFMs. Thus, the complete set of EFMs describes all possible cellular states. The disadvantage is that enumerating all the EFMs of a metabolic network is computationally very demanding as the number of EFMs explodes with network size [10]. However, the ability to enumerate EFMs has been steadily improving [11][12][13][14].
An important application of an EFMA is the prediction of gene knockouts to turn wild-type organisms into efficient minimal cell factories [15]. The design of efficient cell factories is based on the concept of networks of minimal functionality. These are derived from wildtype metabolic networks by keeping typically very few, specifically selected metabolic functions, e.g., EFMs with high yields of products of interest, while diminishing all other unwanted (wildtype) functionality by appropriately selected gene/reaction knockouts. These interventions channel the available carbon flux towards the product of interest. Based on EFMA the concept of constrained minimal cut sets, cMCS can be used to redirect cellular resources towards the product of interest [16]. cMCS are minimal (reaction) knock-out strategies, that disable unwanted EFMs (e.g., low product yield/growth) while the desired EFMs (e.g., high product yield) are preserved.

Open Access
Algorithms for Molecular Biology *Correspondence: juergen.zanghellini@boku.ac.at In particular, cMCSs of minimal cardinality are important as these solutions minimize the experimental effort when knockouts are actually implemented in vivo. Several methods for the computation of cMCS based on a given EFM spectrum are known [16][17][18]. Alternatively, cMCS can also be calculated directly without first calculating EFMs [19][20][21]. However, in all these methods, explicit design criteria must be used (e.g. by providing boundaries for the desired minimal product yield). This is problematic in so far as a slight change in the design criteria might lead to large changes in the minimal cardinality of the cMCSs, i.e. the minimal number of required knockouts. For example, Trinh et al. [15] optimized E. coli for ethanol production with seven reaction knockouts. Jungreuthmayer et al. [22] on the other hand, were able to design a strain with identical key features and almost identical overall functionality, which required only five reaction knockouts.
If the EFMs are known it is theoretically possible but generally impractical to find all optimal partitions of EFMs and their corresponding cMCSs (of minimal cardinality). In a recent work Ruckerbauer et al. [23] approach this problem by first finding the smallest possible cMCS which contributes towards the engineering objective. Then cMCSs of higher cardinality are successively enumerated such that the engineering objective value is greater than or equal to that of the previous smaller cMCS. This circumvents the problem of large number of binning possibilities but will work, in a reasonable amount of time, only for small scale networks.
Here we present a novel approach which uses a genetic algorithm, GA to "evolve" near optimal solutions from starting sets of randomly partitioned modes. This results in minimal strains such that only that fraction of the total EFMs which contribute towards the design objective are active after deletion of the predicted cMCSs. This approach combines the simplicity of a GA with the power of EFMA and cMCS. The GA not only circumvents the manual partitioning of EFMs but also finds increasingly better solutions in a relatively short amount of time. This method can be used to satisfy not only traditional design objectives like product yield and growth but can also incorporate more complex design objectives like high growth-coupled product yield using minimal number of knockouts or even non-linear objectives.

Elementary flux modes, EFMs
The material balances in a metabolic network with m internal metabolites and r reactions in steady state can be represented by where N is the m × r stoichiometric matrix and v is a flux vector containing the fluxes through the network and v ∈ R r , i.e., v = (v 1 , . . . , v r ) T . The set of reactions can be partitioned based on thermodynamic constraints into sets of reversible and irreversible reactions. If Irrev is the index set of irreversible reactions, The support of the flux vector v can be defined as supp(v) = {j|v j � = 0}, which is the set of reaction indices in v with non-zero flux values. An EFM, e, is a flux vector v � = 0 which satisfies (1), (2) and a non-decomposability condition which states that, there is no non-trivial flux vector w satisfying (1), (2) and whose support is a proper subset of e, i.e., supp(w) ⊂ supp(e). The non-decomposability condition means that the removal of any supporting reaction in an EFM will block a steady state flux through it. The set of all EFMs of a network completely describes the entire metabolic capabilities of the network. Every possible flux through the network can be expressed as a non-negative weighted combination of EFMs without cancellation. This means that if the flux through a reaction is 0, then all the contributing EFMs necessarily will have 0 flux through that reaction. For more information on EFMs, see [24].
We will use the following notation henceforth, E = supp(e). Let E = {E 1 , . . . , E n } represent the full set of all n EFMs in support notation.

Constrained minimal cutsets, cMCSs
Suppose there are certain network states which need to be suppressed. These states can be represented by a set of EFMs T, where T ⊂ E. The problem then becomes one of "killing" all the EFMs in T. This can be done by "knocking-out" a cutset C of reactions which will "hit" all of T. That is, C will be a minimal cut set, MCS, if there is no proper subset B ⊂ C which satisfies (3) [25].
Suppose that in addition to network states which need to be suppressed, there are certain states which we need to preserve when knockouts are applied (e.g. biomass production and product formation). This can be done using the concept of cMCS [16]. The set of desired EFMs D corresponds to the network states to be preserved. Since in general it cannot be expected that an MCS will not hit any of D, we will say that we would like to have at least k EFMs untouched by an MCS where k ≤ | D |.
Given an MCS C, let the set of EFMs D C represent D ∈ D which survive after applying C,

An MCS which satisfies (3) and the following constraint is a cMCS
Thus an intervention problem is defined by a set of target EFMs T which need to be "killed" and a set of desired modes D of which at least k have to be "kept". Several methods to solve (6) are available [16][17][18]. Note that D ∪ T does not necessarily unite to the full set of EFMs since there could be EFMs which we do not want to either kill or keep but instead have a "don't care" status. However, we do not need to specify such an association since we will not operate on these EFMs. We will operate only on the EFMs we are interested in (D and T) and do not bother with what happens to the EFMs with "don't care" status because by definition it wouldn't matter to us if these EFMs survive or are killed.
In the following we describe a GA to solve the intervention problem (6). For simplicity our implementation partitions the complete set of EFMs into D and T and does not make use of the "don't care" option.

Methods
The EFM kill/keep problem Equation (6) allows to search for cMCS which keep certain EFMs and kill others. However, it is not intuitive which EFMs to keep and which to kill in order to minimize the cardinality of the cMCSs. Thus the question arises: What is the best partitioning of EFMs in order to reach a specific engineering objective? Even in a modest sized network, the possible combination of EFMs to keep or kill is very large. For example, in a small scale network with 5000 EFMs, the number of possible kill/keep combinations is 2 5000 . It is practically impossible to explore all points in such a large solution space. Therefore, it makes sense to utilize a program that finds the best set of EFMs to keep, and the corresponding cMCSs which will achieve this for a given an engineering objective [23]. We do this using a GA, the working of which is described below.

The genetic algorithm, GA
GAs are heuristics inspired by the theory of evolution, generally used when the extreme of the function cannot be analytically established or when it is impractical to search the whole solution space. GAs work on problems by encoding possible solutions into a population of individuals. These individuals are chromosome like data structures which are iteratively refined to "evolve" better solutions by applying strategies inspired by Darwinian evolution [26][27][28][29]. In our implementation each individual represents an intervention problem (6).  where w k ∈ [0, 1] is a freely adjustable GA parameter. s j i -values are assigned randomly but we provided for the possibility to pre-process EFMs such that EFMs with desirable characteristics have a higher chance of being 1. For example, suppose a cell is described by the following set of EFMs {E 1 , . . . , E 7 }, where only E 1 , E 3 and E 7 support product formation. If we want to optimize for product formation, we clearly do not want to keep the non-producer. So we choose s j i such that undesirable states never get selected. In our example possible randomly selected individuals could look like S 1 = {1, 0, 1, 0, 0, 0, 1}, S 2 = {1, 0, 1, 0, 0, 0, 0}, etc. while {1, 1, 1, 0, 0, 0, 1} would not be generated because it includes E 2 which we want to eliminate. This leads to a significant reduction in the search space. Finally, for each individual S i , cMCS are calculated using the MHScalculator [30].
GAs aim to proceed towards better solutions by evaluating each individual S i against a fitness function F and selecting the top-performers for procreation. The fitness function reflects the design objective since those are the traits we want to improve. In our implementation individuals are selected for mating using a fitness proportionate selection [31]. In addition, we use the concept of "elitism" where a pre-specified percentage of top-performers will propagate into the next generation without any modification as shown in Fig. 2c. This guarantees that the population's maximum fitness does not decrease. We use crossover, mutation [26,27], and random selection based on previous information about surviving EFMs to produce a new generation of individuals. These mechanisms are explained below.

Crossover
We take two parent individuals, S 1 and S 2 , and randomly exchange their elements to create two new offspring S 3 and S 4 . We implemented the following three standard types of crossovers. For 1point crossover, generate a random integer r c , 1 ≤ r c < n for each pair of parents, then the offspring of crossover are S 3 = {s 1 1 , ...s r c 1 , s r c +1 2 , ...s n 2 } and S 4 = {s 1 2 , ...s r c 2 , s r c +1 1 , ...s n 1 }, (see Fig. 2a). In 2point crossover, two random integers r c1 , r c2 , 1 ≤ r c1 < r c2 < n are generated for each pair of parents. The offspring in this scenario are S 3 = {s 1 , ...s n 2 }. In uniform crossover, for each EFM a random number 0 ≤ r j u < 1 is generated and the offspring are

Mutation
Given an individual S 1 and a random integer r, 1 ≤ r < n , the mutated individual is The absolute number of such random integers generated for each individual is given by ρr m , where r m is a freely adjustable GA parameter, the mutation rate, 0 ≤ r m < 1 and ρ the maximum number of EFMs with desirable characteristics, ρ ≤ n (see Fig. 2a).

Pattern-based individual generation
In addition to mutation and crossover we create new individuals based on the fittest patterns. For each individual S, whose corresponding intervention problem has solution(/s), we generate a "design pattern", which contains only the surviving EFMs, Given a binary individual S = 1010001, if only EFM 3 and 7 survive the intervention, the resulting pattern will be 0010001. Thus a pattern is a specific strain design for an intervention problem. A solvable intervention problem typically produces more than one solution. Therefore, one individual will usually have more than one pattern associated with it. Since the fitness depends on the surviving EFMs, each pattern will have its own fitness value. Thus one individual may be associated with more than one fitness value. Here, the fitness of an individual S is defined as the fitness of the fittest pattern P.
To create the new individuals, we start by weighting each EFM proportional to the number of times the EFM survived in all previous patterns. Let P t represent the entire set of patterns found until a given generation t. The weight w i t for an EFM E i is calculated by Next we generate a set of desired candidate EFMs by randomly selecting a random number of EFMs with nonvanishing w i t . Out of these desired candidate EFMs new individuals were composed by including those candidate EFMs for which a randomly selected number r i was not larger than the weight of the corresponding candidate EFM, 0 ≤ r i < max w t and max w t is the maximum of all such weights (see Fig. 2b), The number of individuals generated by this method can be controlled by the GA parameter 'new_S' , Table 1. It is a way to consider all good solutions obtained so far and ensures that more EFMs with desirable properties find their way into the set of desired EFMs. This helps the GA to reach the optimum faster. The GA stops after reaching a pre-specified number of generations or when the maximum fitness doesn't improve for a given number of generations, outputting all MCSs of minimal cardinality associated with each desired pattern. The schematic of the GA implemented and used here is shown in Fig. 1 along with a small illustratory example in Fig. 2.

Implementation
The GA was implemented in Perl http://www.perl.org/. cMCSs were calculated with MHScalculator which is an open source C-program that is freely available [30]. EFMs were calculated using the regEfmtool [13]. All runs were performed on a machine with the following specifications-2 CPUs, 12 cores, Intel Xeon X5650 2.67 GHz and an Ubuntu 14.04 LTS operating system, allowing the used programs to utilise 10 threads in parallel. Caching in form of look-up tables is employed to store previously obtained MCS, patterns and corresponding fitnesses, to avoid repetition of calculation. We also use tmpfs, a temporary file storage created on the RAM, for faster i/o on intermediate files. A general description of the parameters used for controlling the GA are shown in Table 1. Specific parameter values for the individual runs are shown in Table 2.

Validation
We ran the GA on an E. coli core model, M3, [15] and two smaller models, M1 and M2, which were derived from the parent model, M3, by removing several reactions. M3 describes the central carbon metabolism of E. coli including the uptake and utilization of several hexose and pentose sugars. Compared to M3, M2 is restricted to model only glucose utilization (all other carbon uptake relations were removed). Finally, M1, the smallest model of the three, describes glucose utilization under anaerobic conditions. The main topological properties of the three models are summarized in Table 3.

Results
Our aim is to design optimised E. coli strains for ethanol production. The optimization objectives considered in this study were ethanol yield (Y Etoh ), substrate specific productivity which is the product of normalised specific ethanol production and normalised biomass production [32] also called "efficiency" (η Etoh = Y Etoh × Y Biomass ), and an objective which considers both the yield and efficiency together. In all objectives, we favour solutions with low cardinalities (for details see Table 4).

Benchmarking
We tested the performance of the GA against the automatic partitioning method, APM developed by Ruckerbauer et al. [23] using the models M1, M2 and M3. The APM was selected for comparison, as for any given, linear engineering objective APM enumerates all optimal knockout strategies without requiring any manual interference. We tested for maximum efficiency and ethanol production using the fitness function F 1 and F 2 , respectively as given in Table 4. For the three models used we listed the main characteristics of the optimal solutions with respect to the fitness functions in Table 3. All simulations were run five times. In the following we reported averages over these five runs, unless otherwise stated.

Maximizing for efficiency
We used the fitness function F 2 (Table 4) with the parameters shown in Table 2 to optimize for efficiency. The GA was terminated when the fitness function remained unchanged for 15 generations.
The GA found all optimal solutions for the small model M1 (see Fig. 3a). In the bigger models M2 and M3 the GA did not find the best solutions but got within 3 and 1.2 % of the maximum fitness, respectively.
In M2 and within the selected runtime, the GA mostly found near optimal solutions (see Fig. 3b), and rarely converged to the optimal solution. In the case of M3 the GA got stuck in a local optimum (see Fig. 3c).
While the GA does not necessarily identify the absolute best solutions, it generally finds near-optimal solutions extremely quickly. In M2 and M3 near-optimal solutions are found in about 25 and 2.5 % of the time taken by the APM, respectively (see Fig. 3). Only in the small-scale model M1, which is easy to enumerated fully, the GA is slower than the APM.
Comparing the MCSs obtained with the GA to the ones obtained with the APM, as shown in Fig. 4a-c, reveals that our algorithm retrieves 100 % of all low cardinality MCS. The number drops with increasing MCS' cardinality. This behavior is expected as our fitness functions favors low cardinality solutions. Thus it is very unlikely that the GA will identify many high cardinality solutions. In fact, this explains the non-monotonic behavior of the line of maximum efficiency in Fig. 3c. Because the fitness function F 2 allows for a trade off between cardinality and maximum efficiency, the efficiency might decrease. Yet the fitness function still increases.

Maximizing for ethanol production
We used the fitness function F 1 (Table 4) with the parameters shown in Table 2 to maximise for ethanol yield. The GA was terminated when the fitness function remained unchanged for 15 generations.
Unlike the previous case, here our algorithm found all optimal solutions for all models. Also, we were faster than the APM in reaching the optimum for all models (see Fig. 3d-f ).
Again, like in the case of maximising for efficiency, the GA retrieves 100 % of lower cardinality MCSs (Fig. 4d-f ), Table 1 The GA parameters These parameters are used to control the running of the GA and also to get more specific results

threads
This parameter specifies the maximum number of threads to be used by the program and not many of the higher cardinality solutions, when compared to the solutions obtained using APM. This is a result of the fitness function, F 1 , which favors towards lower cardinality MCSs. The effect of this can be observed in Fig. 3d, e where the GA first finds higher cardinality solutions for the optimal ethanol yield and settles down to the lowest possible cardinality in subsequent generations.

Optimizing for a complex design
Although maximising for ethanol yield and efficiency, produces sub-optimal to optimal designs, these designs may not be the best to implement in vivo. For example, the EFMs which result in the maximum ethanol yield do not support growth. However, two of these EFMs provide maintenance energy. On the other hand, designs with maximum efficiency do not include maximum ethanol producing EFMs. It would be preferable to have a design which combines these features. We used the fitness function F 3 (Table 4) with the parameters shown in Table 2 to find optimal designs. A similar problem was looked at in [23] where the authors optimised M1 for efficiency while ensuring that at least one of the maximal ethanol producing EFMs survive in the final design. Their design included the most efficient ethanol producing EFMs as well as EFMs with maximum ethanol yield, achieved with an MCS cardinality of 6. A similar design was used by Trinh et al. [15] using 7 reaction knockouts. Our algorithm produces designs of similar functionality with MCSs of cardinality 5, Fig. 5b. Similar results were obtained for M2, and M3 as shown in Fig. 5d and f respectively, both with MCS cardinalities of 5. Also, our algorithm was very quick in finding these designs, taking a few minutes for M1 and M2 and a few hours for M3.

Conclusion
We have presented a method for the design of minimal microbial strains of desired functionality. The designs are minimal in the sense that only a few of the total number of pathways (EFMs) are active after deletion of the predicted cMCSs. Our GA uses the MHScalculator [30] to find cMCSs for a given set of desired and target EFMs. However, the optimal selection of such sets is non-intuitive. Hence, the aim was finding the best possible set of pathways which maximise a given engineering objective.
Another GA, called the OptGene method has been previously reported which finds reaction cuts to achieve a design objective [33]. This algorithm works by testing different combinations of reaction knockouts. In contrast, we test partitions of EFMs. Thus our search space is by orders of magnitude larger than theirs. OptGene finds many solutions too, but it cannot be guaranteed that these are minimal. Also, the knockout cardinalities are restricted to 1-10. Our approach is based on the concept of EFMs which enumerate all possible network states. OptGene however uses methods like FBA, MOMA [34], etc. to calculate the fitness which, unlike EFMs do not account for alternative pathways. Although methods which use FBA and MOMA predict optimal solutions, there is no guarantee that the predicted optimum will be achieved. In a similar vein, the method presented here has advantages over other methods which use a biased biological objective like OptKnock [35], RobustKnock [36] and tilting of the objective function [32].
Boghigian et al. [37] also use a GA and EFMs to design strains with higher product yields. Their approach however differs from the method presented  Fig. 1 Flowchart of the GA. The GA stops when a stopping condition is met, which here is if the number of generations reaches a prespecified maximum or if the maximum fitness remains unchanged for a pre-specified number of generations (Table 1) in this paper in a few major ways. First, the aim of the GA presented in [37] is to only improve product yields without considering the minimality of the knockouts.
Hence, in contrast to us their predicted knockouts are not guaranteed to be minimal. Second, the basic problems considered by both methods are different,  n1 is generated by randomly selecting from Si based on F and subjecting these to GA operations. n2 is generated by randomly selecting EFMs based on wt, which represents survival of corresponding EFMs in the previous generations. n3 is for elitism. A, B and C correspond to sections in the flowchart in Figure 1 with the same names.

Second generation and results
i

Table 3 Features of models used
Features of the networks on which the GA was tested. The maximum possible values for ethanol yield, Y Etoh and efficiency, η Etoh are presented. The minimal cardinality of MCSs which will force the network into these optimal values are also shown along with the total number of such MCSs and the number of EFMs which will survive after application of these MCSs. The corresponding fitness values, F i have been obtained using the fitness functions presented in Table 4 Model

M1 M2 M3
Model source [15] [15] [15] Growth    Table 4. M1, M2 and M3 indicates the associated models. The time axes in c and f is in logarithmic scale. Note that for the same objective with a lesser cMCS cardinality, the fitness will be higher Nair EFMs are one order of magnitude larger than that used in [37]. Tools which use EFMs to find intervention strategies include the MHScalculator [17,30] and a tool to calculate cMCSs as part of the CellNetAnalyzer, a MAT-LAB package providing comprehensive structural and functional analysis of biochemical networks [38]. These methods use EFMs and hence consider the entire metabolic landscape of the organism. The limitation of these methods is that the EFMs which must survive or be killed by an intervention have to be manually partitioned. . In e and f R_norm = R_GLCpts + R_MAN1 + R_TRA8 + R_TRA9 + R_TRA10