Evolving DNA motifs to predict GeneChip probe performance
Algorithms for Molecular Biology volume 4, Article number: 6 (2009)
Abstract
Background
Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure expression of thousands of genes using millions of probes. We use correlations between measurements for the same gene across 6685 human tissue samples from NCBI's GEO database to indicate the quality of individual HG-U133A probes. Low correlation indicates a poor probe.
Results
Regular expressions can be automatically created from a Backus-Naur form (BNF) context-free grammar using strongly typed genetic programming.
Conclusion
The automatically produced motif is better at predicting poor DNA sequences than an existing human-generated regular expression (RE), suggesting that runs of Cytosine, runs of Guanine, and mixtures of the two should all be avoided.
Background
Typically Affymetrix GeneChips (e.g. HG-U133A) measure gene expression at eleven or more points along the gene. Individual measurements are given by short (25 base) DNA sequences, known as probes. These are complementary to corresponding locations in genes. Being complementary, the gene product (messenger RNA) preferentially binds to the probe, cf. Figure 1. Half a million different probes are placed on a slide in a square grid pattern. A fluorescent dye is used to measure how much mRNA is bound to each probe.
To a first approximation, the amount of mRNA produced by a gene should be the same no matter which part of the mRNA molecule is bound to a probe. Affymetrix groups probes into probesets. Each probeset targets a gene. Therefore probe measurements for the same probeset should be correlated. Figure 2 shows the 110 correlations for a probeset as a "heatmap" (yellow/lighter corresponds to greater consistency between pairs of probes). Figure 2 suggests that in Affymetrix probeset 200660_at two probes do not measure the gene as well as the other nine.
There are several biological reasons which might lead to probes on the same gene giving consistently unrelated readings (alternative splicing, alternative polyadenylation and 3'-5' degradation come to mind [1, 2]). However, these do not explain all of the many cases of poor correlation. In [3] we found some technological reasons. In particular, [3] showed that probes containing a large ratio of Guanine (G) to Adenine (A) bases are likely to perform badly. Subsequently we have found that runs of Gs (which will tend to have a high G/A ratio) also tend to indicate problem probes [4]. This has led us to ask if there are other sequences which might indicate badly behaved probes. We set up an artificial evolutionary system [5, 6] to create DNA motifs using a formal computer language grammar [7] to search for DNA sequences which indicate poor probes.
Grammars and Genetic Programming
Existing research on using grammars to constrain the artificial evolution of programs can be broadly divided into two: "Grammatical Evolution" [8], based largely in Ireland, and work in the Far East by Whigham [9, 10], Wong [11] and McKay [12].
Research in molecular biological computing includes Ross, who induced stochastic regular expressions from a number of grammars to classify proteins from their amino acid sequence [13]. Typically his grammars had eight alternatives. In Stockholm regular expressions have been evolved to search for similarities between proteins, again based on their amino acid sequences [14]. Brameier in Denmark used amino acid sequences to predict the location of proteins by applying a multi-classifier [15] linear genetic programming based approach [16] (although this can be done without a grammar [17]). A similar technique has also been applied to study microRNAs [18].
Results and Discussion
By the end of the first run (cf. Table 1 and Figure 3) genetic programming (GP) had evolved a probe performance predictor (see Figure 4) equivalent to GGGG|CGCC|G(G|C){4}|CCC. It clearly subsumes the previous rule (GGGG, [4]) but also covers other possibilities. Therefore it finds more poor probes.
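As a minimal sketch of how such a motif acts as a predictor, the evolved expression and the earlier GGGG rule can be applied to probe sequences with an ordinary regular expression library. Python's re module stands in for egrep here, and the four 25-base sequences are invented purely for illustration; they are not probes from the study.

```python
import re

# Motifs quoted in the text: the evolved one and the earlier hand-written rule.
EVOLVED = re.compile(r"GGGG|CGCC|G(G|C){4}|CCC")
HUMAN = re.compile(r"GGGG")

def predicted_poor(probe, motif):
    """Return True if the motif occurs anywhere in the probe sequence."""
    return motif.search(probe) is not None

# Hypothetical 25-base sequences, made up for this example only.
probes = {
    "run_of_g": "ATAT" + "GGGG" + "AT" * 8 + "A",
    "run_of_c": "ATATAT" + "CCC" + "AT" * 8,
    "mixed_gc": "ATAT" + "GCGCG" + "AT" * 8,
    "no_match": "AT" * 4 + "GC" + "AT" * 7 + "A",
}

for name, seq in probes.items():
    print(name, predicted_poor(seq, EVOLVED), predicted_poor(seq, HUMAN))
```

Only the first sequence is flagged by GGGG, whereas the evolved motif also flags the CCC run and the mixed G/C run, which is the sense in which it covers the earlier rule and more.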
Inevitably it will also incorrectly predict more high correlation probes as being poor. However its reduced performance on the good probes is more than offset by better performance on the poor probes. See Figure 5. On the last generation, it has a score of 856 (410 true neg + 446 true pos). (GGGG has a score of 776 = 195 + 581.)
The confusion matrix for the evolved regular expression on the whole of the training set (including the 6677 positive middling values which GP never saw) is at the top left of Table 2. As will be described in the methods section, ambiguous middling probes are not used during training, cf. also Figure 7. Nevertheless, to avoid giving an inflated, overly optimistic estimate of performance, we present results across the whole range of probe correlations. Its confusion matrix on the verification data is in the middle of Table 2. (The corresponding matrices for GGGG are given at the bottom of Table 2.) Unlike in many machine learning applications, there is no evidence of overfitting. Indeed the corresponding results for the test set (second matrix of each pair) are not significantly different (χ2, 3 dof) from those on the whole training set. The evolved regular expression picks up significantly more (χ2, 3 dof) (448 v. 209) poorly performing probes on the test set than the human-produced regular expression. Figure 6 shows the number of DNA probes matching the evolved motif against their average correlation with the rest of their probeset.
As is common in optimisation [20], almost all the run time is taken by the time to find out the performance score of the motifs. In our case, elapsed time is dominated by the command script which runs egrep -c. Typically this takes 8.5 ms per DNA motif. The time taken by gawk to process the BNF grammar, create new grammars, generate the regular expressions, etc., is negligible.
Discussion
Theoretical and empirical studies of GeneChips confirm that the behaviour of DNA probes tethered to a surface can be quite different from DNA behaviour in bulk solution. This is a new and difficult area, and in-depth experimental results from pure physics are lacking. Therefore experimental studies have concentrated on data gathered during normal operation of the chips.
Our automatically generated motif suggests that, in addition to Gs, Cs are important. Indeed the fact that only three consecutive Cs are predictive (whereas four Gs are needed) suggests that Cs are more important than Gs. It is known that in GeneChips DNA C-G RNA binds more strongly than DNA G-C RNA [21]. We are tempted to suggest that a CCC sequence on a DNA probe can act as a nucleation site encouraging the probe to bind to GGG on RNA. Indeed the evolved motif suggests that four Gs and mixtures of five Cs and Gs might also form nucleation sites.
The sequence CCC is too short to be specific to a particular gene. GeneChips are designed on the assumption that only RNA sequences which are complementary to the full length of the probe will be stable. However studies have shown that nonspecific targets can be bound to GeneChip probes for several hours even if held only by the nucleation site. This may be why probes with quite short runs of either Cs or Gs can be poorly correlated with others designed to measure the same gene.
Conclusion
Access to the raw results of thousands of GeneChips (each of which costs several hundreds of pounds) makes new forms of bioinformatic data mining possible.
Millions of correlations between probes in the same probeset, which should be measuring the same gene, show wide variation [22]. Automatically generated regular expressions confirm previous work [3, 4] that the DNA sequences from which the probes themselves are formed can indicate poor probe performance. Indeed several new motifs (e.g. CCC) which predict probe quality have been automatically found.
Linux code is available via ftp://cs.ucl.ac.uk/genetic/gp-code/RE_gp.tar
Methods
Preparation of Training Data
Previously we had downloaded thousands of experiments from NCBI's GEO [23], normalised them, excluded spatial defects and calculated the correlation between millions of pairs of probes [3, 24]. To exclude genes which are never expressed, we selected probesets where ten or more non-overlapping probe pairs had correlations of 0.8 or more. For each probe we use the median value of all 10 of its correlations with other members of its probeset (excluding those it overlaps). This gave 4118 probesets, which were evenly split into three to provide independent training, test and validation data.
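The per-probe summary described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: it assumes Pearson correlation, ignores the exclusion of overlapping probes, and the function names and toy intensities are hypothetical.

```python
from math import sqrt
from statistics import median

def pearson(x, y):
    """Pearson correlation of two equal-length lists of intensities."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def median_correlations(probeset):
    """Map each probe name to the median of its correlations with the other probes.

    probeset: dict of probe name -> list of intensities, one per sample.
    """
    result = {}
    for name, values in probeset.items():
        others = [pearson(values, v) for other, v in probeset.items() if other != name]
        result[name] = median(others)
    return result

# Toy probeset with three probes measured on four samples (made-up numbers).
demo = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
print(median_correlations(demo))
```

With eleven probes in a probeset, each probe has ten such correlations, and the median of those ten is the value used to judge it.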
Previously we found the "mismatch" probes were often poorly correlated with other measurements for the same gene [3]. Since this is known, we excluded them from this study.
As Figure 7 shows, correlation coefficients cover a wide range. Since we are using correlation only as an indication of how well a probe is working we decided to exclude the middle values from training and instead use probe pairs that were highly correlated (≥ 0.8) or were very poorly correlated (≤ 0.3). Of the 15,092 available training examples, there are 7,832 probes highly correlated with the rest of their probeset but only 583 poorly correlated. To avoid unbalanced training sets, every generation all 583 negative training examples are used and 583 positive examples are randomly chosen from the 7,832 positive examples. Training examples are available via http://bioinformatics.essex.ac.uk/users/wlangdon/RE_gp_training.tar.gz
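The per-generation balancing step can be sketched like this. The function name is hypothetical, the examples are placeholder strings rather than real probe sequences, and drawing the positives without replacement is an assumption; the text only says they are randomly chosen.

```python
import random

def balanced_sample(positives, negatives, rng):
    """Keep every negative example and draw an equal number of positives.

    Sampling is without replacement (an assumption of this sketch).
    """
    chosen = rng.sample(positives, k=len(negatives))
    return chosen, list(negatives)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
positives = [f"pos{i}" for i in range(7832)]  # stand-ins for highly correlated probes
negatives = [f"neg{i}" for i in range(583)]   # stand-ins for poorly correlated probes

# Called once per generation, so each generation sees a fresh set of positives.
pos_gen, neg_gen = balanced_sample(positives, negatives, rng)
print(len(pos_gen), len(neg_gen))
```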
Evolving Regular Expression Motifs
BNF grammar of Regular Expression
The BNF grammar used (cf. Figure 8) is an extension of that given by Cameron (http://www.cs.sfu.ca/people/Faculty/cameron/Teaching/384/99-3/regexp-plg.html). In particular, matching the beginning of strings (^) and the {n,m} form of Kleene closure are also supported. The BNF has been customised for DNA strings (i.e. <char> need only be A, C, G and T). Since various combinations of the start of string symbol, null strings and Kleene closure cause egrep to loop, care has been taken to ensure that the new BNF does not permit null strings after ^.
Brameier and Wiuf suggest that the traditional * and + forms of Kleene closure are not suitable for bioinformatic applications [18]. Instead they recommend the {n,m} form, which explicitly defines both lower (n) and upper (m) limits on the number of times the preceding symbol must occur. However, both {n,m} and traditional Kleene closures are used by evolved solutions. To avoid mutation.awk seeing "Hamming cliffs", the integer quantifiers used in the {n,m} are Gray coded [25]. Similarly the syntax groups together the chemically more similar Pyrimidines (T and C) and Purines (A and G).
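To illustrate why Gray coding removes Hamming cliffs, here is a small sketch using the standard reflected binary Gray code. Whether mutation.awk uses exactly this variant is an assumption; the function names are ours.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)

def from_gray(g):
    """Invert to_gray by folding in successive right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# In plain binary, 7 (0111) and 8 (1000) differ in all four bits, so a single
# bit-flip mutation cannot step between them. Their Gray codes differ in one bit.
for value in (7, 8):
    print(value, format(value, "04b"), format(to_gray(value), "04b"))
```

Because neighbouring integers always differ in exactly one bit of their Gray code, a one-bit mutation can always move a quantifier to an adjacent value.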
Our system supports full positive integer values for the BNF grammar rule minmaxquantifier; however, even modest values can lead egrep to hang the computer. Therefore n and m are limited to 1–9. Finally, egrep rejects {n,m} if m < n. This is handled by a semantic rule which removes ",m" from the motif when m is less than n.
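The effect of that semantic rule can be sketched as a post-processing step on the motif string. In the actual system the rule presumably acts while the grammar is expanded, so this regex-based repair (and the name repair_quantifiers) is only an illustration of the outcome.

```python
import re

# Matches an interval quantifier such as {4,2} and captures n and m.
_QUANT = re.compile(r"\{(\d+),(\d+)\}")

def repair_quantifiers(motif):
    """Rewrite {n,m} as {n} wherever m < n; leave valid quantifiers untouched."""
    def fix(match):
        n, m = int(match.group(1)), int(match.group(2))
        return f"{{{n}}}" if m < n else match.group(0)
    return _QUANT.sub(fix, motif)

print(repair_quantifiers("G{4,2}C{1,3}"))
```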
Using the BNF with Genetic Programming
For simplicity, the BNF is written so that grammar rules are either simple substitution rules (e.g. <minmaxquantifier>), rules with exactly two options (e.g. <RE>) or terminals (e.g. "*" and T). In BNF terms, a terminal is a symbol which cannot be substituted in the grammar. Therefore, unlike the BNF rules, it becomes part of the egrep regular expression. The simple substitution rules do not have any element of choice. They, like terminals, cannot be chosen as crossover points or targets for mutation. Their principal use is to enable the rules with options to be kept simple.
The binary choice rules are the active parts of the syntax. As they are always binary, each egrep regular expression created using the BNF has an equivalent binary string. Each bit in the string corresponds to a BNF rule with two options. The bit indicates which option should be invoked (cf. Figure 9). The BNF grammar is also used to give types to the choices. By using strong typing when creating new motifs from old ones we ensure not only that the new motif is syntactically correct but, since crossover respects the types, they also guide the evolutionary search [26].
Creating Random Motifs Using the BNF Grammar
The initial random population is created using ramped half-and-half [27]. It may help to think of this as applying the usual genetic programming ramped half-and-half algorithm to a binary tree (of choice nodes). We start from <start> (at the top of Figure 8) and recursively follow the BNF. However when we reach a rule with options we need to choose one. As in ramped half-and-half we keep track of how deep we are nested. If we have not reached the depth needed to terminate the recursion, we randomly choose one of the options. (As with other strongly typed GPs, if a chosen route through the syntax has no further choices to be made, we may be forced to terminate a recursive branch early.)
To terminate a recursion we choose the "simpler" option. Our BNF has been written so that the simpler option is always on the right. (This is flagged by RE in the rule name.) If there is no "simpler" choice, the choice is made randomly. This mechanism is also used for mutating existing regular expressions.
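A rough sketch of this initialisation on a toy grammar follows. The grammar is hypothetical (it is not the one in Figure 8), depth is counted per choice node, and the "full" half follows the usual definition of ramped half-and-half, which is an assumption about the implementation; the text itself only describes the random ("grow") choice.

```python
import random

# Hypothetical toy grammar: every rule has two options. Following the text, a rule
# with "RE" in its name keeps its simpler option on the right (second position).
CHOICES = {
    "<RE>": (["<base>", "<RE>"], ["<base>"]),
    "<base>": (["<purine>"], ["<pyrimidine>"]),
    "<purine>": (["A"], ["G"]),
    "<pyrimidine>": (["C"], ["T"]),
}

def random_bits(max_depth, rng, full=False):
    """Create one individual as a list of binary choices."""
    bits = []

    def expand(symbol, depth):
        options = CHOICES.get(symbol)
        if options is None:
            return  # terminal symbol: no choice to record
        if "RE" in symbol and depth >= max_depth:
            bit = 1  # depth limit reached: take the simpler option
        elif "RE" in symbol and full:
            bit = 0  # "full" method: keep recursing until the limit
        else:
            bit = rng.randrange(2)  # no simpler option, or "grow": choose randomly
        bits.append(bit)
        for child in options[bit]:
            expand(child, depth + 1)

    expand("<RE>", 0)
    return bits

def ramped_half_and_half(pop_size, min_depth, max_depth, rng):
    """Cycle through the depth range, alternating full and grow on each pass."""
    n_depths = max_depth - min_depth + 1
    population = []
    for i in range(pop_size):
        depth = min_depth + i % n_depths
        use_full = (i // n_depths) % 2 == 0
        population.append(random_bits(depth, rng, full=use_full))
    return population

population = ramped_half_and_half(8, 1, 4, random.Random(2))
print([len(individual) for individual in population])
```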
Although this may seem complex, gawk (Unix' free interpreted pattern scanning and processing language) can handle populations of a million motifs.
Creating New Motifs by Mixing BNF Grammars
Creating a new motif from two high scoring motifs is essentially subtree crossover [5] applied to the binary choice tree with the addition of strong type constraints [28]. This is implemented by scanning the grammar used to create the first parent for all the rules with two options. One of these is randomly chosen. For example, suppose the first parent starts <start> <RE> <union> and suppose <union> is chosen as the crossover point. For a grammatically correct child to be produced all that is necessary is that the crossover point chosen in the second parent should also be <union>. (There are complications to do with depth and size limits, which we shall ignore for the time being.) Therefore the second parent is scanned to find all occurrences of <union>. One of them is randomly chosen to be the second crossover point. (If there are none, this crossover is aborted and another initial crossover point is chosen. If we keep failing, eventually another pair of parents is chosen.)
Crossover is based on normal genetic programming (GP) subtree crossover, cf. [5, Figure 2.5]. The new child is created by copying the start of the first parent, excluding the subtree at the first parent's crossover point. Then genetic material from the subtree at the second parent's crossover point is added. Finally the remainder of the first parent is appended to the child. This is implemented by crossing over the binary choice trees to create a binary choice tree for the new child. Apart from issues of tree size and depth, we are guaranteed that the new binary choice tree will represent a valid DNA motif.
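The type-respecting crossover described above might be sketched as follows. The Node class, the rule names and the two small parent trees are hypothetical, and size and depth limits are ignored, as in the description.

```python
import copy
import random

class Node:
    """One binary choice: the rule it belongs to, the bit chosen, and nested choices."""
    def __init__(self, rule, bit, children=()):
        self.rule = rule
        self.bit = bit
        self.children = list(children)

def walk(node):
    """Yield nodes in pre-order, the order in which choices are made."""
    yield node
    for child in node.children:
        yield from walk(child)

def bits(node):
    """Flatten a choice tree into its equivalent binary string (as a list)."""
    return [n.bit for n in walk(node)]

def crossover(mum, dad, rng, max_tries=10):
    """Replace a random subtree of mum with a dad subtree rooted at the same rule."""
    for _ in range(max_tries):
        child = copy.deepcopy(mum)
        target = rng.choice(list(walk(child)))
        donors = [n for n in walk(dad) if n.rule == target.rule]
        if not donors:
            continue  # no matching type in dad: try another crossover point
        donor = copy.deepcopy(rng.choice(donors))
        target.bit, target.children = donor.bit, donor.children
        return child
    return None  # the caller would pick another pair of parents

# Two hypothetical parents built from rules <RE>, <base>, <purine>, <pyrimidine>.
mum = Node("<RE>", 1, [Node("<base>", 0, [Node("<purine>", 1)])])
dad = Node("<RE>", 0, [
    Node("<base>", 0, [Node("<purine>", 0)]),
    Node("<RE>", 1, [Node("<base>", 1, [Node("<pyrimidine>", 0)])]),
])

child = crossover(mum, dad, random.Random(3))
print(bits(mum), bits(dad), bits(child))
```

Because the donor subtree always starts at the same rule as the subtree it replaces, the child's binary string still decodes to a syntactically valid motif.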
The final step is to recursively trace through the BNF grammar. Each time we come to a rule with two options, we look at the next binary choice. If it is clear, we choose the first option. If it is set, we follow the second option. Each time a BNF terminal is encountered it is appended to the new regular expression. (If the BNF terminal is the null symbol, it is simply ignored.)
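The tracing step can be illustrated with a toy grammar. The grammar below is a hypothetical, much reduced stand-in for Figure 8, containing only substitution rules, two-option rules and terminals.

```python
# Hypothetical toy grammar: "sub" rules have no choice, "choice" rules have two options.
# Any symbol that is not a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "<start>": ("sub", ["<RE>"]),
    "<RE>": ("choice", (["<union>"], ["<concat>"])),
    "<union>": ("sub", ["<RE>", "|", "<concat>"]),
    "<concat>": ("choice", (["<pair>"], ["<base>"])),
    "<pair>": ("sub", ["<base>", "<concat>"]),
    "<base>": ("choice", (["<purine>"], ["<pyrimidine>"])),
    "<purine>": ("choice", (["A"], ["G"])),
    "<pyrimidine>": ("choice", (["C"], ["T"])),
}

def decode(bits):
    """Trace the grammar from <start>, consuming one bit per two-option rule."""
    stream = iter(bits)
    out = []

    def expand(symbol):
        rule = GRAMMAR.get(symbol)
        if rule is None:
            out.append(symbol)  # terminal: becomes part of the regular expression
            return
        kind, body = rule
        if kind == "choice":
            body = body[next(stream)]  # clear bit -> first option, set bit -> second
        for child in body:
            expand(child)

    expand("<start>")
    return "".join(out)

print(decode([0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]))
```

The eleven bits shown decode to GG|C: one bit for each two-option rule visited, in the order the recursion reaches them.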
Evaluating the Performance Score of the DNA Motifs
Each generation, a command file is generated which contains an egrep -c -v 'RE' command for each motif in the population. (RE is the motif, i.e. the regular expression.) The command is run on a file holding the DNA sequences of the 583 probes poorly correlated with the rest of their probeset. The same command is also run on a file holding the 583 positive probes selected for use in this generation. The score of the regular expression is based on the difference between the number of lines in the two files which match RE. Expressions which either match all probes or fail to match any are penalised by subtracting 583 from their score. See also Table 1. Implementation details can be found in [29].
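A sketch of the scoring, under assumptions: Python's re module stands in for egrep, and the score is taken to be poor probes matched plus good probes not matched, which is the form consistent with the figures quoted in the Results (856 = 410 + 446). The exact definition is in Table 1, which is not reproduced here, and the short sequences below are made up.

```python
import re

def score(motif, poor, good):
    """Score a motif on equal-sized lists of poor and good probe sequences."""
    pattern = re.compile(motif)
    hits_poor = sum(1 for seq in poor if pattern.search(seq))
    hits_good = sum(1 for seq in good if pattern.search(seq))
    value = hits_poor + (len(good) - hits_good)
    # Penalise motifs that match everything or nothing (583 in the paper,
    # which is the size of each file; here the length of the poor list).
    if hits_poor + hits_good in (0, len(poor) + len(good)):
        value -= len(poor)
    return value

# Tiny made-up example; real probes are 25 bases long.
poor = ["GGGGA", "ACCCA", "ATATA"]
good = ["ATATA", "TTTTT", "AGGGG"]

for motif in ("GGGG|CCC", "A|C|G|T", "GGGGGGGG"):
    print(motif, score(motif, poor, good))
```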
References
Stalteri MA, Harrison AP: Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips. BMC Bioinformatics. 2007, 8: 13-
Langdon WB, da Silva Camargo R, Harrison AP: Spatial Defects in 5896 HG-U133A GeneChips. Critical Assesment of Microarray Data. Edited by: Dopazo J, Conesa A, Al Shahrour F, Montener D. 2007, [Presented at EMERALD Workshop], Valencia, http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/langdon_camda2007.ps
Langdon WB: Evolving GeneChip Correlation Predictors on Parallel Graphics Hardware. 2008 IEEE World Congress on Computational Intelligence. Edited by: Wang J. 2008, 4152-4157. IEEE Computational Intelligence Society, Hong Kong: IEEE Press, http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/langdon_2008_CIGPU2.pdf
Upton GJ, Langdon WB, Harrison AP: G-spots cause incorrect expression measurement in Affymetrix microarrays. BMC Genomics. 2008, 9: 613-
Poli R, Langdon WB, McPhee NF: A field guide to genetic programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk 2008, [http://www.gp-field-guide.org.uk]. [With contributions by J. R. Koza]
Langdon WB: Genetic Programming and Data Structures. 1998, Kluwer
Langdon WB, Harrison AP: Evolving Regular Expressions for GeneChip Probe Performance Prediction. Parallel Problem Solving from Nature – PPSN X, Volume 5199 of LNCS. Edited by: Rudolph G, Jansen T, Lucas S, Poloni C, Beume N. 2008, 1061-1070. Dortmund: Springer
O'Neill M, Ryan C: Grammatical Evolution. IEEE Transactions on Evolutionary Computation. 2001, 5 (4): 349-358.
Whigham PA: Search Bias, Language Bias, and Genetic Programming. Genetic Programming 1996: Proceedings of the First Annual Conference. Edited by: Koza JR, Goldberg DE, Fogel DB, Riolo RL. 1996, 230-237. Stanford University, CA, USA: MIT Press, http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/whigham_1996_sblpGP.html
Whigham PA, Crapper PF: Time series Modelling Using Genetic Programming: An Application to Rainfall-Runoff Models. Advances in Genetic Programming 3. Edited by: Spector L, Langdon WB, O'Reilly UM, Angeline PJ. 1999, 89-104. MIT Press, http://www.cs.bham.ac.uk/~wbl/aigp3/ch05.pdf
Wong ML, Leung KS: Evolving Recursive Functions for the Even-Parity Problem Using Genetic Programming. Advances in Genetic Programming 2. Edited by: Angeline PJ, Kinnear, KE Jr. 1996, 221-240. MIT Press
McKay RI, Hoang TH, Essam DL, Nguyen XH: Developmental Evaluation in Genetic Programming: the Preliminary Results. Proceedings of the 9th European Conference on Genetic Programming, Volume 3905 of Lecture Notes in Computer Science. Edited by: Collet P, Tomassini M, Ebner M, Gustafson S, Ekárt A. 2006, 280-289. Budapest, Hungary: Springer
Ross BJ: The Evaluation of a Stochastic Regular Motif Language for Protein Sequences. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001). Edited by: Spector L, Goodman ED, Wu A, Langdon WB, Voigt HM, Gen M, Sen S, Dorigo M, Pezeshk S, Garzon MH, Burke E. 2001, 120-128. San Francisco, California, USA, http://www.cosc.brocku.ca/~bross/research/gp002.pdf
Handstad T, Hestnes AJH, Saetrom P: Motif kernel generated by genetic programming improves remote homology and fold detection. BMC Bioinformatics. 2007, 8 (23):
Langdon WB, Buxton BF: Evolving Receiver Operating Characteristics for Data Fusion. Genetic Programming, Proceedings of EuroGP'2001, Volume 2038 of LNCS. Edited by: Miller JF, Tomassini M, Lanzi PL, Ryan C, Tettamanzi AGB, Langdon WB. 2001, 87-96. Lake Como, Italy: Springer-Verlag, http://www.cs.ucl.ac.uk/staff/W.Langdon/ftp/papers/wbl_egp2001.ps.gz
Brameier M, Krings A, MacCallum RM: NucPred Predicting nuclear localization of proteins. Bioinformatics. 2007, 23 (9): 1159-1160.
Langdon WB, Banzhaf W: Repeated Sequences in Linear Genetic Programming Genomes. Complex Systems. 2005, 15 (4): 285-306.
Brameier M, Wiuf C: Ab initio identification of human microRNAs based on structure motifs. BMC Bioinformatics. 2007, 8: 478-
Langdon WB, Barrett SJ: Genetic Programming in Data Mining for Drug Discovery. Evolutionary Computing in Data Mining, Volume 163 of Studies in Fuzziness and Soft Computing. Edited by: Ghosh A, Jain LC. 2004, 211-235. Springer
Beyer HG: The Theory of Evolution Strategies. 2001, Springer
Naef F, Wijnen H, Magnasco M: Reply to "Comment on 'Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays' ". Physical Review E. 2006, 73 (6): 063902-
Langdon WB: A Map of Human Gene Expression. Tech Rep CES-486, Departments of Mathematical, Biological Sciences and Computing and Electronic Systems, University of Essex, Colchester, CO4 3SQ, UK. 2008, http://www.essex.ac.uk/dces/research/publications/technicalreports/2008/CES-486.pdf
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Research. 2007, D760-D765. 35 Database
Langdon WB, Upton GJG, da Silva Camargo R, Harrison AP: A Survey of Spatial Defects in Homo Sapiens Affymetrix GeneChips. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2009
Bäck T: Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. 1996, New York: Oxford University Press
Radcliff NJ: Genetic Set Recombination. Foundations of Genetic Algorithms 2. Edited by: Whitley LD. 1993, 203-219. Vail, Colorado, USA: Morgan Kaufmann
Koza JR: Genetic Programming: On the Programming of Computers by Natural Selection. 1992, MIT press
Montana DJ: Strongly Typed Genetic Programming. Evolutionary Computation. 1995, 3 (2): 199-230. http://vishnu.bbn.com/papers/stgp.pdf
Langdon WB, Harrison AP: Evolving Regular Expressions for GeneChip Probe Performance Prediction. Tech Rep CES-483, Computing and Electronic Systems, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, UK. 2008, http://www.essex.ac.uk/dces/research/publications/technicalreports/2008/CES-483.pdf
Competing interests
The authors are funded by the people of the United Kingdom.
Authors' contributions
All authors are equally responsible.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Langdon, W., Harrison, A. Evolving DNA motifs to predict GeneChip probe performance. Algorithms Mol Biol 4, 6 (2009). https://doi.org/10.1186/1748-7188-4-6
DOI: https://doi.org/10.1186/1748-7188-4-6