The goal of the experiment design task is to calculate a minimal set of epitopes needed to measure a given set of proteins in a complex mixture. The mixture is a tryptic digest of the whole proteome. It is assumed that the digest is complete (there are no missed or mis-cleavages) and that the proteome of the organism is fully elucidated. Another assumption is that the hypothetical antibody is specific to a given epitope and does not bind variations or modifications of the epitope.
The process is divided into a filtering pipeline, in which the search space is reduced, and an optimization step, in which the problem is formulated and solved.
Filter Pipeline
Starting from a proteome dataset (e.g. UniProt or IPI) that is defined as the background, an in-silico tryptic digest is obtained. The background dataset is assumed to hold information about all proteins found in the future sample.
Peptides must have certain properties to be detectable by a read-out method: the mass of the peptide has to be known and, in addition, mass spectrometers have limited resolution and mass range. Instead of encoding these limitations as optimization constraints, a filter pipeline is applied in which peptides and epitopes that do not match the criteria are removed.
Here, the digest of a proteome P is defined by a set of pairs D(P) = {(P_i, p_j)}, where p_j is the j-th peptide in protein P_i and p_j = a_1 a_2 ... a_n is an amino acid sequence written in the single-letter amino acid code. We define a peptide-antibody combination as a quadruple

c = (P_i, p_j, l, t)

Here, l defines the length of the epitope and t describes whether the terminus is n- or c-terminal. The set

C = {(P_i, p_j, l, t) | (P_i, p_j) ∈ D(P), l_min ≤ l ≤ l_max, t ∈ {n, c}}

contains all combinations for a given proteome, length range and both termini. This combination set is the raw start input for the filter pipeline. The quadruple is not needed for every filter, but for reasons of formal continuity we use this definition throughout the specification of the pipeline.
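As an illustration, the combination set C can be enumerated directly from a digest. The following sketch uses a toy digest and an illustrative length range of 3-6 residues; the function names are not taken from the paper.

```python
from itertools import product

def epitope(peptide, l, t):
    """Terminal subsequence of length l at the n- or c-terminus."""
    return peptide[:l] if t == "n" else peptide[-l:]

def combinations(digest, l_min=3, l_max=6):
    """Enumerate quadruples (protein, peptide, epitope length, terminus)
    from a digest given as {protein_id: [peptide, ...]}."""
    combos = []
    for protein, peptides in digest.items():
        for peptide in peptides:
            for l, t in product(range(l_min, l_max + 1), ("n", "c")):
                if l <= len(peptide):  # epitope cannot exceed the peptide
                    combos.append((protein, peptide, l, t))
    return combos

digest = {"P1": ["AYEQLGYR", "MKPTLK"], "P2": ["HLEILGYR"]}
C = combinations(digest)
print(len(C), epitope("AYEQLGYR", 4, "c"))  # 24 LGYR
```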
Knowing the weight of captured peptides is essential for the mass spectrometry read-out. Therefore, the 'unknown-positions' filter removes peptides containing unknown positions (symbol X), as their weight cannot be calculated.
The methionine filter removes combinations with epitopes containing methionine (symbol M), since chemical modifications of methionine may hamper the recognition of the target epitope by a binding molecule, especially an antibody.
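A minimal sketch of these two filters on the quadruple representation, assuming the data layout from the sketch above:

```python
def unknown_positions_filter(combos):
    # Drop combinations whose peptide contains an unknown residue (X);
    # the weight of such a peptide cannot be calculated for the read-out.
    return [c for c in combos if "X" not in c[1]]

def methionine_filter(combos):
    # Drop combinations whose epitope contains methionine (M), since a
    # modified M may prevent recognition by the binding molecule.
    return [(P, pep, l, t) for (P, pep, l, t) in combos
            if "M" not in (pep[:l] if t == "n" else pep[-l:])]
```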
The high abundant epitope filter removes combinations with epitopes which would capture a large number of peptides. An antibody affine to such an epitope would be cluttered with captured peptides and therefore rather insensitive. We define the subset C_e ⊂ C which contains all combinations whose epitope is e. If |C_e| is larger than 600, the epitope e is not considered for optimization.
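A possible implementation of this filter, again assuming the quadruple layout from the sketches above; the threshold of 600 peptides is the value given in the text:

```python
from collections import defaultdict

def high_abundant_epitope_filter(combos, max_peptides=600):
    # Group the combinations by epitope (the sets C_e) and drop every
    # epitope whose set would exceed max_peptides captured peptides.
    by_epitope = defaultdict(list)
    for (P, pep, l, t) in combos:
        by_epitope[pep[:l] if t == "n" else pep[-l:]].append((P, pep, l, t))
    return [c for group in by_epitope.values()
            if len(group) <= max_peptides for c in group]
```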
The weight filter removes combinations which share the same terminal epitope and have almost the same weight. Such peptides cannot be measured with a standard mass spectrometry read-out because the resulting peaks would overlap in the spectrum. A reasonable value for Δ_min is 2-10 Da for MALDI-TOF spectrometers. Rather than excluding the terminus from the optimization, this filter only marks the almost isobaric peptides as not identifiable by the combination of epitope and mass information. For example, the peptides AYEQLGYR and HLEILGYR could not be discriminated in the mass spectrum of a probe enriched with an antibody affine to the epitope LGYR: their masses differ by only 1.068 Da, which an instrument of typical resolution cannot resolve.
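The mass difference in this example can be reproduced with standard monoisotopic residue masses; the following sketch uses textbook values, not data from the paper:

```python
# Standard monoisotopic residue masses in Da; a peptide adds one water.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def mass(peptide):
    return sum(RESIDUE[a] for a in peptide) + WATER

delta = abs(mass("AYEQLGYR") - mass("HLEILGYR"))
print(round(delta, 3))  # 1.068 -- far below a Delta_min of 2-10 Da, so
# the two peptides sharing the epitope LGYR overlap in the spectrum
```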
The length filter removes combinations whose peptides do not fit into the detection range of the mass spectrometer. The detection range depends on the technical specifications of the instrument, but a range of 8-30 amino acids is a good rule of thumb.
Some proteins, such as actin or tubulin, occur with great abundance in the sample. Terminal epitopes of peptides from these proteins are unsuitable for immunoaffinity experiments for the same reasons explained for the high abundant epitope filter. In this last filter step an epitope stop list, generated from a hand-curated list of high-abundance proteins, is used to remove those epitopes from the list of combinations. As shown in Figure 2, the filters are usually applied in a specific order. While the methionine, unknown-positions and high abundant protein filters can be applied at any position in the pipeline, other filters are order-dependent. This is the case if a filter evaluates the expected peptide distribution C_e of an epitope e; such filters cannot be preceded by filters that change those distributions. The high abundant epitope filter must precede the weight filter, which must precede the length filter.
Through the application of this filter pipeline the preselection of epitopes is adjusted to the experimental setup and the problem dimension is significantly reduced.
The influence of the filters is shown in Table 1. While the unknown-positions filter and the methionine filter have a relatively small impact, the high abundant epitope filter and the weight filter remove a large number of combinations. The weight filter reduces the number of combinations by about 43%, while the number of epitopes is reduced by only 3%: it removes combinations that cannot contribute to the coverage (overlapping peaks), yet the corresponding antibody can still capture peptides that are detectable by the mass spectrometer.
Some antibodies ('robinson antibodies') capture only one peptide from one protein. If there is an antibody that captures more peptides from the same protein as well as from others, it is always better to choose that one over the 'robinson antibody'. Therefore all robinson antibodies are removed from the graph before the optimization starts.
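A small sketch of this pruning on the epitope-to-protein map; the guard that keeps a robinson antibody when no other epitope covers its protein is an added assumption here, so that every protein remains coverable:

```python
def remove_robinson_antibodies(covers):
    """covers: {epitope: set of covered proteins}. Drop epitopes that
    cover a single protein, unless they are its only cover."""
    kept = {}
    for a, proteins in covers.items():
        if len(proteins) == 1:
            p = next(iter(proteins))
            if any(b != a and p in q for b, q in covers.items()):
                continue  # dominated robinson antibody: remove it
        kept[a] = proteins
    return kept
```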
Protein set cover problem formulation
The bipartite graph G = (P ∪ A, E) is constructed by adding proteins and epitopes as vertices, and by connecting a protein node from the protein set P and an epitope node from the epitope set A if a corresponding combination appears in the filtered set:

E = {(P_i, a) | there is a combination (P_i, p_j, l, t) ∈ C whose epitope is a}
The problem is to select a minimal set of antibodies A_min ⊂ A so that every protein in P is covered by at least one epitope. The minimum set cover is a classical problem in computer science and complexity theory.
The set cover problem can be formulated as a decision problem asking whether a covering set of size k or less exists. This problem was shown to be NP-complete, and achieving approximation ratios is no easier than computing optimal solutions [12]. The optimization version, in which the smallest covering set has to be found, is NP-hard. It was shown that a greedy algorithm (see appendix) has an approximation ratio of

H(n) = Σ_{k=1}^{n} 1/k ≤ ln n + 1

where n is the size of the largest set [13].
This is the best achievable approximation ratio for the set cover problem [14]. In each step of this algorithm, the epitope in A covering the most yet uncovered proteins in P is added to the solution set L, until all proteins are covered.
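A compact Python version of the greedy algorithm (the appendix version may differ in details); the mapping `covers` from each epitope to the set of proteins it captures is an assumed data layout:

```python
def greedy_set_cover(covers, proteins):
    """covers: {epitope: set of proteins}; returns the solution set L."""
    uncovered = set(proteins)
    L = []
    while uncovered:
        # Select the epitope covering the most yet uncovered proteins.
        best = max(covers, key=lambda a: len(covers[a] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            raise ValueError("remaining proteins cannot be covered")
        L.append(best)
        uncovered -= gain
    return L

covers = {"LGYR": {"P1", "P2"}, "PTLK": {"P1"}, "EIER": {"P3"}}
print(greedy_set_cover(covers, {"P1", "P2", "P3"}))  # ['LGYR', 'EIER']
```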
Another approach to solving the set cover problem is to formulate it as a binary linear program. The binary decision variables s_a reflect the inclusion of an epitope a in the solution set. The number of selected epitopes forms the objective function:

min Σ_{a ∈ A} s_a

The linear program is subject to the constraint that every protein P_i has to be covered by one or more epitopes in the solution:

Σ_{a : (P_i, a) ∈ E} s_a ≥ 1 for all P_i ∈ P
This program can be solved with available solvers such as CPLEX or GLPK, which yields optimal solutions as long as the problem dimension is small.
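As a sketch, the binary program can be written with the open-source PuLP modeling library, which can hand the model to solvers such as GLPK or CPLEX; the data layout follows the greedy sketch above:

```python
from pulp import LpBinary, LpMinimize, LpProblem, LpVariable, lpSum

def set_cover_ilp(covers, proteins):
    prob = LpProblem("epitope_set_cover", LpMinimize)
    s = {a: LpVariable(f"s_{a}", cat=LpBinary) for a in covers}
    # Objective: minimize the number of selected epitopes.
    prob += lpSum(s.values())
    # Every protein must be covered by at least one selected epitope.
    for p in proteins:
        prob += lpSum(s[a] for a in covers if p in covers[a]) >= 1
    prob.solve()
    return [a for a in covers if s[a].value() == 1]
```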
To enhance the accuracy of the proteomics experiments, it would be beneficial to capture the same or multiple peptides from a protein with different binders. In addition, it is beneficial to include alternative binders in the experimental planning, in case the generation of a binder affine to a specific epitope fails. The multicovering problem (MCP) is a generalization of the set covering problem, and several algorithms have been proposed by Dobson [15], Hochbaum, Hall [16] and Rajagopalan [17]. These heuristics solve the problem of covering each protein twice or more. However, as it would be cost-prohibitive to double the number of binders, it is not possible to cover all target proteins more than once; this holds at least for proteins that are covered only by a single, very specific epitope. The following approach solves this pragmatic variant of the problem.
The greedy algorithm can be modified to increase the probability of selecting an epitope set that meets the multicoverage requirement for the target proteins. In this variant (see appendix) the scoring function combines two different optimization targets, minimality and redundancy, by summation into a one-dimensional multiobjective fitness function.
The function is a weighted sum of the number of proteins which are not yet covered, |{P_i ∈ P \ P_cov : (P_i, a) ∈ E}|, and the number of proteins which are covered again by this antibody, |{P_i ∈ P_cov : (P_i, a) ∈ E}|. E denotes the edge set in the bipartite graph and P_cov the set of already covered proteins. The influence of new and already covered proteins on the overall score of an epitope is weighted by the parameters s_mcov and s_cov:

score(a) = s_cov · |{P_i ∈ P \ P_cov : (P_i, a) ∈ E}| + s_mcov · |{P_i ∈ P_cov : (P_i, a) ∈ E}|
Still, the algorithm terminates with a total number of epitopes lower than or equal to the number of targets, because every added epitope is required to cover at least one new target protein.
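A sketch of the modified greedy; the score guard ensures that every selected epitope covers at least one new protein, and the default weights are placeholders rather than values from the paper:

```python
def greedy_multicover(covers, proteins, s_cov=1.0, s_mcov=0.5):
    """Greedy variant weighting new coverage (s_cov) against
    re-coverage (s_mcov); covers: {epitope: set of proteins}."""
    covered, L = set(), []
    candidates = dict(covers)
    while not set(proteins) <= covered:
        def score(a):
            new = len(candidates[a] - covered)
            again = len(candidates[a] & covered)
            # Epitopes covering nothing new must never be selected.
            return s_cov * new + s_mcov * again if new else float("-inf")
        best = max(candidates, key=score)
        if score(best) == float("-inf"):
            raise ValueError("remaining proteins cannot be covered")
        L.append(best)
        covered |= candidates.pop(best)
    return L
```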
The choice of the parameters s_mcov and s_cov has a high impact on the results and depends heavily on the size of the dataset. The number of epitopes with high capacity is considerably lower in small datasets than in large ones, so the probability that a protein can be covered more than once by different high-capacity epitopes is small. In large datasets the situation is the opposite: many epitopes have a very large capacity, possibly covering up to a few hundred peptides from many different proteins, so it is more probable that the sets of captured proteins overlap. In this configuration it is better to score innovation over redundancy. While this is intuitively clear, determining the best values analytically would take considerable effort. For large datasets, s_mcov should be chosen smaller than s_cov; for small datasets, s_mcov > s_cov.
Multiple coverage can be integrated into the integer program formulation by changing the coverage constraints to

Σ_{a : (P_i, a) ∈ E} s_a ≥ 2

for all proteins P_i that can be covered twice. However, this leads to the inclusion of elongated versions of already selected epitopes (e.g. IER and EIER) merely to satisfy the double coverage constraints.
This formulation requires that all proteins are multiply covered by the solution. A better formulation reads as follows: maximize the number of multiply covered proteins in a valid covering of all proteins, using a fixed number of epitopes. The objective function maximizes the number of proteins which are multi-covered:

max Σ_i S_i
If the binary variable S_i is set to one, protein i has to be covered at least twice. This is guaranteed by the following constraint:

Σ_{a : (P_i, a) ∈ E} s_a ≥ 1 + S_i for all P_i ∈ P
If S_i is selected, at least two covering epitopes have to be selected in order to satisfy the constraint. This requirement alone would be easily met just by picking two epitopes for each protein, so in order to get an optimal usage of the epitopes their number is restricted by an additional constraint:

Σ_{a ∈ A} s_a ≤ cost_max
Here cost_max denotes the maximum number of antibodies to be chosen; it has to be set by the user and may simply depend on the available funding for antibody generation or purchase. An upper bound for cost_max is the size of the optimal solution to the original multicover ILP, which already covers all proteins in the dataset twice or more. A lower bound is the minimal cost of the normal covering.
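Putting the three constraints together, a hedged PuLP sketch of this maximization variant could look as follows; the variable names mirror the text and the data layout is the same as in the earlier sketches:

```python
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum

def multicover_ilp(covers, proteins, cost_max):
    prob = LpProblem("max_multicover", LpMaximize)
    s = {a: LpVariable(f"s_{a}", cat=LpBinary) for a in covers}
    S = {p: LpVariable(f"S_{p}", cat=LpBinary) for p in proteins}
    # Objective: maximize the number of multiply covered proteins.
    prob += lpSum(S.values())
    for p in proteins:
        # Valid cover of every protein; two epitopes whenever S_p is set.
        prob += lpSum(s[a] for a in covers if p in covers[a]) >= 1 + S[p]
    # Budget: at most cost_max epitopes may be selected.
    prob += lpSum(s.values()) <= cost_max
    prob.solve()
    selected = [a for a in covers if s[a].value() == 1]
    twice = [p for p in proteins if S[p].value() == 1]
    return selected, twice
```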