"Hook"-calibration of GeneChip-microarrays: Theory and algorithm

Background: The improvement of microarray calibration methods is an essential prerequisite for quantitative expression analysis. This issue requires the formulation of an appropriate model describing the basic relationship between the probe intensity and the specific transcript concentration in a complex environment of competing interactions, the estimation of the magnitude these effects and their correction using the intensity information of a given chip and, finally the development of practicable algorithms which judge the quality of a particular hybridization and estimate the expression degree from the intensity values. Results: We present the so-called hook-calibration method which co-processes the log-difference (delta) and -sum (sigma) of the perfect match (PM) and mismatch (MM) probe-intensities. The MM probes are utilized as an internal reference which is subjected to the same hybridization law as the PM, however with modified characteristics. After sequence-specific affinity correction the method fits the Langmuir-adsorption model to the smoothed delta-versus-sigma plot. The geometrical dimensions of this so-called hook-curve characterize the particular hybridization in terms of simple geometric parameters which provide information about the mean non-specific background intensity, the saturation value, the mean PM/MM-sensitivity gain and the fraction of absent probes. This graphical summary spans a metrics system for expression estimates in natural units such as the mean binding constants and the occupancy of the probe spots. The method is single-chip based, i.e. it separately uses the intensities for each selected chip. Conclusion: The hook-method corrects the raw intensities for the non-specific background hybridization in a sequence-specific manner, for the potential saturation of the probe-spots with bound transcripts and for the sequence-specific binding of specific transcripts. The obtained chip characteristics in combination with the sensitivity corrected probe-intensity values provide expression estimates scaled in natural units which are given by the binding constants of the particular hybridization.


Background
The basic mechanism underlying the functioning of DNA microarrays is that of hybridization. Hybridization is defined as the binding between complementary singlestranded nucleic acids. In the case of microarrays one strand is anchored at the surface and the second one is dis-solved in solution, referred to as probe and target, respectively. The experimental technique of detecting hybridized probes relies on the fluorescence intensity measurement to infer the transcript abundance specific for a selected gene. The relationship between transcript abundance and intensity is affected by parasitic effects owing to the "technical" variability of repeated measurements and systematic biases which disturb the one-to-one relationship between the input and the output quantity of the measurement [1].
The task of making estimates of the input quantity (transcript concentration) of a measurement from observations of its output (intensity) is called calibration. Calibration of microarray measurements thus aims at removing consistent and systematic sources of variations to allow mutual comparison of measurements acquired from different probes, arrays and experimental settings. Calibration is also called preprocessing because it usually constitutes the first step in the microarray analysis pipeline. It potentially influences the results of all subsequent steps of "higher-level" analyses as well as the biological interpretation of these results, and is therefore a crucial step in the processing of microarray data. The improvement of microarray calibration methods is an essential prerequisite for obtaining absolute expression estimates which in turn are required for the quantitative analysis of, e.g., transcriptional regulation.
Most of the established preprocessing methods rely on algorithms of mainly empirical nature based on the simple assumption of a linear signal response on the transcript concentration in the sample [2][3][4][5]. In the last years numerous studies on the physical background of microarray hybridization are published with the perspective of developing improved analysis algorithms [6][7][8][9][10][11][12]. For example it has been shown that the probes saturate at higher transcript concentrations which gives rise to a nonlinear relation between intensity and transcript concentration. Moreover, benchmark studies have indicated that the proper correction for non-specific background intensity contributions is presumably the most problematic preprocessing step with no satisfactory solution so far.
The immediate aim of most of these papers and also of our previous work [1,[13][14][15][16][17][18], has been to study the physical (and chemical) processes responsible for converting concentrations of specific target RNA of known sequences to measured fluorescence intensities after hybridization. However, the ultimate, still not-achieved aim of these physical approaches has been to provide scientists with feasible calibration methods which estimate absolute specific target concentrations in the presence of a complex non-specific background from fluorescence intensity data.
Proper calibration of microarray data includes several tasks: Firstly it requires the determination of the model describing the basic relationship between the probe intensity and the specific transcript concentration under consideration of relevant parasitic effects which should be straightened out.
Secondly, the magnitude of these effects should be estimated using the intensity information of a given chip or of a series of chips, and, thirdly, one needs practicable algorithms which judge the quality of a particular hybridization and estimate the expression degree from the intensity values.
Moreover, except MAS5 all popular preprocessing methods [2][3][4][5] rely on multichip-algorithms for calibration, i.e. they process a series of chips at once together to separate chip-and probe-level effects from each other. The obtained expression measures are consequently contextsensitive and require a minimum number of chips for appropriate data-processing (usually more than four). As a consequence the results are constricted to a particular series of chips, i.e. they depend on the particular selection of chips and require re-calculation upon adding or removing chips. The development of single chip calibration methods is therefore an important additional task to provide virtually context-insensitive expression measures which can be compared between chips and experimental series without reprocessing. This issue requires appropriate metrics for expression measures to enable direct comparison of data from different experiments in consistent units.
This paper addresses these tasks and presents a new singlechip calibration method for microarrays based on a physical model of hybridization. Our so-called hook-method provides a graphical summary of the hybridization characteristics of each microarray which directly transforms into a sort of natural metrics for intensity calibration with the potential to estimate expression values on an absolute scale. This metrics uses mismatched probes on Affymetrix GeneChip arrays as internal reference for judging the hybridization of the perfect matched probes over the whole potential concentration range.
In the first part of the paper we outline the calibration model and validate its relevance using single probe benchmark data. In the second part we apply the model to single chip data and describe the analysis algorithm step by step. Table 1 summarizes the essential notations and symbols used in the paper. Examples which illustrate the performance of the method are presented in the accompanying publication [19]. (1)   erage or occupancy), is directly related to the observed intensity, I p P * [14,18]. The proportionality constant, M c , specifies the maximum intensity referring to complete occupancy, Θ p P = 1, if all oligonucleotides of the respective probe spot on the given chip are dimerized. The minimum intensity referring to the absence of bound transcripts, Θ p P = 0, gives rise to the "optical" background intensity, O c . Throughout the paper we will consider only "net" intensities which have been corrected for the optical background before further analysis, , using, for example, the zone-algorithm provided by Affymetrix [23].
The surface coverage changes as a hyperbolic function of the "binding strength", X p P , which additively decomposes into contributions due to specific and non-specific hybridization Since the binding strengths follow the mass action law they are related to the concentration of specific and nonspecific transcripts, [S] p and [N] c , respectively, and to the respective effective association constants of duplex formation, K p P,h (h = S, N) (see [1] for details), The latter equation assumes that the large number of different non-specific RNA-fragments in the hybridization solution effectively acts like a single species with the common concentration [N] c for all probes of the chip [15,16]. Contrarily, the concentration of specific transcripts, [S] p , refers to a particular probe sequence, i.e., it represents a "single probe"-property. Microarrays of the GeneChiptype use so-called probe sets of several probes (usually N set = 11) for estimating the expression of each considered gene. One expects therefore that all probes of a set probe the same, common transcript concentration, i.e.
[S] set = [S] p for p ∈ set assuming that effects as alternative splicing have been appropriately considered during probe design.
The competitive two-species Langmuir adsorption isotherm (Eq. (1)) considers the effects of non-specific "background" hybridization and of saturation at small and large concentrations of specific transcripts, respectively. The maximum intensity at saturation, M c , depends on factors such as the number of oligonucleotides per probe spot (which in turn is related to the density of oligomers and to the spot size), the mean number of optical labels per bound target and the settings of the scanner. These factors affect the PM and MM nearly in the same fashion giving rise to virtually identical values of M c at complete saturation of the probe spots under equilibrium conditions (X p P >> 1) [16,18].
Recent studies report significantly higher limiting intensity values of the PM, compared with that of the MM, i.e. M PM > M MM [22]. They interpreted this result assuming a probe-dependent partial dissociation of the duplexes during the post-hybridization washing phase. Another, additional explanation might be the truncation of a considerable amount of the probe oligomers due to incomplete synthesis because this effect causes the asymptote-like flattening of the hybridization isotherms at intermediate and large transcript concentrations in a sequencedependent manner [1,9].
We will apply in the following analysis the special-case of the common intensity asymptote for all probes of the chip according to Eq. (1). Possible consequences of deviations from this assumption for the data analysis will be addressed in a separate study.

Matched and mismatched microarray probes
The probes on expression microarrays of the GeneChiptype are usually designed in a pairwise fashion. Each probe pair consists of 25-meric PM-and MM-probes where the PM-sequence is assumed to perfectly match a 25-meric section of the target gene. The MM-sequence differs from that of the PM by a single complementary mismatch in the centre of the sequence. The different middle bases of both probes of one pair cause different base pairings in the respective probe/target-duplexes and thus different binding constants (see below and [16]). Let us define the pairwise PM/MM ratio of the binding constants of specific and non-specific hybridization, respectively, which specify the noted effect of different base-pairings formed by the PM and MM. For example, the binding strength of the complementary Watson-Crick (WC) base-pairings in the middle of the specific duplexes of the PM exceeds that of the specific duplexes of the MM which form a weaker self-complementary mismatch at this position [15][16][17][18]. For the ratio of the specific binding constants one consequently obtains s p > 1. Contrarily, for the ratio of the non-specific binding constants one gets n p < 1 for purines (Adenine, Guanine) and n p > 1 for pyrimides (Thymine, Cytosine) in the middle of the PM sequence owing to the purine-pyrimidine asymmetry of Watson-Crick (WC) base-pair interactions in RNA/DNA duplexes [13,24]. Hence, the parameters s p and n p specify the PM/MM-affinity gain of a selected probe pair upon specific and non-specific binding, respectively. Both, PM and MM probes obey the hyperbolic adsorption isotherm, Eq. (1) [18]. With Eq. (4) one obtains for the binding strengths of the PM and MM probes Eq. (5) scales the intensity of the PM and MM probes as a function of the relative hybridization degree, This S/N-ratio, R, provides the specific binding strength of the PM in units of the non-specific one. It can serve as a relative measure of the expression degree because it is directly related to the concentration of specific transcripts, [S] p . It scales the expression degree in a probe-specific fashion.
Part a of Figure 1 shows the courses of the intensities of a typical PM/MM pair as a function of the parameter R (see Eq. (6)). The PM intensity sigmoidally increases from its minimum value, I p (R = 0), to I p (R = ∞) = M c , at small and large abscissa values, respectively. The respective probes referring to these limiting cases are either exclusively nonspecifically hybridized or completely saturated with surface coverages of Θ p PM (0) = X p PM,N /(1 + X p PM,N ) ≈ X p PM,N and Θ p PM (∞) = 1, respectively. The concentration and S/Nratio referring to the inflection point of the isotherm at half-way between these values are respectively. They specify the condition at which 50% of the free probes available in the absence of specific transcripts become occupied. The approximations at the righthand side of Eq. (7) refer to small X p PM,N << 1.
The MM intensity responds in a very similar fashion as that of the PM with increasing R (see part a of Figure 1). The limiting surface coverage of exclusively non-specifically hybridized MM probes at R = 0 is changed compared with that of the PM (see Eq. (4)), Θ p MM (0) = X p PM,N /(n p + X p PM,N ) ≈ X p PM,N /n p . The isotherm of the MM is clearly shifted to larger abscissa values in the intermediate Rrange owing to the smaller binding strength for specific hybridization (s p > 1, see above). For the inflection point of the isotherm one obtains in analogy to Eq. (7) which shows that the horizontal shift between the PMand MM-isotherms is log(s p ) and log(s p /n p ) in the logscale of [S] and R, respectively.
PM-and MM-probe intensities (part a), their log-difference (part b) and -sum (part c) as a function of the specific target concentration (S/N-ratio) Figure 1 PM-and MM-probe intensities (part a), their log-difference (part b) and -sum (part c) as a function of the specific target concentration (S/N-ratio). The curves are calculated using the Langmuir-model (Eq. (1)). The difference-plot reveals the hybridization regimes of non-specific (N), mixed (mix) and specific (S) hybridization, of saturation (sat) and the asymptotic (as) range. The parameters α and β used in Eq. (10) specify the data range spanned by the log-difference andsum.

The delta-and sigma-transformations
The MM probes were designed as reference for estimating the non-specific background contribution to the respective PM intensity [2,5,25]. The "simple" subtraction of the MM-intensity from that of the PM however partly failed as correction method because both probes differently respond to non-specific and specific hybridization due to their complementary middle bases which, for example, gives rise to negative PM-MM intensity differences [15].
According to the Langmuir model, the behaviour of the PM and MM can be understood on the basis of the same hybridization rules where both probe types however differ with respect to their effective association constants for probe/target dimerization (see above). The intensities of the PM and MM are consequently expected to correlate in a well defined fashion. This mutual relation is determined by the mismatch design of the reference probe, the particular probe sequences and by the concentrations of specific and of non-specific RNA fragments in the sample solution used for hybridization on the particular chip [18].
Let us empirically analyze the relation between the PM and MM signals in terms of two simple linear combinations of the log-intensities of a probe pair, namely their difference and average value, (log ≡ log 10 is the decadic logarithm). The intensity model predicts for this transformation (see Eqs. (1) and (5)) with the "start", "linear" and the "saturation terms" respectively. The limiting values of Σ(R) and Δ(R) in the absence of specific transcripts (R = 0) are In the limit of weak non-specific binding (X p P,N << 1) the o-terms vanish and the limiting Δand Σ-coordinates are given by their start values. The probe-specific exponents in Eq. (10)  Δ-vs-Σ plot of the PM-and MM-isotherms shown in Figure 1 (part a, see Eq. (9)) Figure 2 Δ-vs-Σ plot of the PM-and MM-isotherms shown in Figure 1 (part a, see Eq. (9)). The parameters α and β now specify the height and width of the Δ-vs-Σ trajectory of the PM/MMprobe pair. Part b shows the fraction of specific transcripts bound to PM-and MM-probes (x S P , P = PM, MM; and their log-mean, <...>) and the total fraction of occupied probe-oligomers (Θ P , and of their mean). Note that the fraction of specific transcripts steeply increases in the mix-range whereas the occupancy mainly increases in the sat-range (see Eqs. (17) and (16), respectively).
In summary, the hyperbolic intensity functions of the PM and MM can be transformed into Δ and Σ coordinates which are governed by essentially four parameters, the start values Δ p start ≅ Δ p (0) and Σ p start ≅ Σ p (0) and the exponents α p and β p . They were chosen to provide a simple geometrical interpretation of the Δ-vs-Σ trajectory in terms of its start-coordinates and its vertical and horizontal dimension with respect to the start values (see below and Figure 1 and Figure 2).

The hybridization regimes
Part b and c of Figure 1 show the transformed intensities taken from part a of the figure as a function of the parameter R = R p PM . The course of the log-difference, Δ p (R), can be roughly divided into five regimes which reflect different hybridization characteristics of the PM and the MM probes with increasing degree of specific hybridization (see Figure 1, part b): (1) N-regime: In the non-specific-regime, at small values R → 0, both, the PM and MM nearly exclusively hybridize with non-specific transcripts. Saturation can be typically neglected in this range (B p P ≈ 1, see Eqs. (10) and (11)). The limiting ordinate value for X p P,N << 1 estimates the ratio of the binding constants referring to the respective pair of complementary middle bases in the PM and MM sequences, Δ p (0) ≈ log n p (see Eq. (4)). We will use the approximation of weak non-specific binding throughout the paper.
(2) mix-regime: In the subsequent mixed-regime, both, specific and non-specific transcripts significantly contribute to the observed intensity of the probes. The log-difference Δ increases with increasing amount of specific transcripts. The positive slope of Δ p (R) implies ∂Δ/∂R ~ (1-10 -α ) > 0, and thus α p > 0 or equivalently s p > n p (see Eq. (12)). The increase of Δ p (R) consequently reflects the simple fact that the specific binding constant of the PM exceeds that of the respective MM, i.e., K p PM,S > K p MM,S , if one assumes K p PM,N ≈ K p MM,N (see below and Eq. (4)).
(3) S-regime: In the specific-regime the probes predominantly hybridize with specific transcripts. As a consequence, (4) sat-regime: In the saturation-regime the probes become progressively saturated with bound transcripts (B p P > 1). This effect first and foremost affects the PM due to their higher specific binding constant (see above). As a consequence Δ p starts to decrease.
(5) as-regime: At very large expression degrees both, PM and MM reach their maximum intensity upon complete saturation. In this asymptotic-regime the trajectory reaches the abscissa for R → ∞, Δ p (∞) ≈ 0 (see Eq. (10)).
The respective log-sum of the intensities, Σ p (R), is shown in part c of Figure 1. It varies in a similar, sigmoidal fashion as the individual log-intensities of the PM and MM (compare part a and c of Figure 1). Here the mix-, S-and sat-regimes merge into one region of increasing Σ whereas the N-and as-regimes provide the minimum and maximum values, and Σ p (∞) = log M c , respectively. With Eqs. (11) and (12) one obtains for the difference β p ≈ Σ p (∞) -Σ p (0) > 0 (14) β p specifies the span between the maximum and minimum Σ-values. The Σ-coordinate of the maximum of Δ p (R) at R = R max becomes Eqs. (15) and (14) provide , i.e., the maximum of Δ p (R) roughly bisects the total range of the Σ-coordinate.

The Δ-vs-Σ trajectory
In the next step we plot the transformed intensities into Δvs-Σ coordinates (see Figure 2, part a). This presentation, also known as M-vs-A plot (difference-vs-sum), reflects the binding isotherms of a PM/MM-probe pair. The obtained Δ-vs-Σ trajectory shows a characteristic curved shape with start-, end-and maximum-points referring to the S/N-ratios R = 0, R = ∞ and R = R max , respectively. They consequently define the N-, as-and S-hybridization regimes. The mix-and sat-regimes can be attributed to the increasing and decreasing parts of the Δ-vs-Σ trajectory, respectively.
The parameters α p and β p define the height and the width of the obtained Δ-vs-Σ curve (see also Eqs. (13) and (14)).
The Δand Σ-coordinates of the characteristic points depend on the PM/MM-ratios of the binding constants (see Eq. (4)), on the maximum intensity, Σ p (∞) = logM c , and on the mean intensity of the chemical background due to non-specific hybridization, Σ p (0) ∝ log(I p PM (0)) + log(I p MM (0)). Hence, the Δ-vs-Σ trajectory links the observed probe intensities with essential hybridization characteristics in terms of simple geometric parameters.

The horizontal scale of the Δ-vs-Σ trajectory
In the Appendix A we show that the difference between the actual Σ-coordinate and its "asymptotic-value", Σ p -Σ p (∞), estimates the mean probe coverage of the PM and MM probes whereas the difference between the Σ-coordinate and its "start value", Σ p -Σ p (0) characterizes the relation between the amount of specific and non-specific hybridization in terms of the fraction of specifically occupied binding sites of the respective probe spot Eqs. (16) and (17) provide mean values averaged over the respective PM/MM-probe pair. The "individual" coverages of the PM and MM probes, Θ p PM and Θ p MM , and the respective fraction of specifically hybridized oligomers, x p PM,S and x p MM,S , in addition depend on the relative Δcoordinates Δ-Δ(∞) and Δ-Δ(0), respectively (see Eqs. (42) and (45) in the Appendix A).
Part b of Figure 2 shows the surface coverage and the fraction of specifically occupied oligomers for the Δ-vs-Σ trajectory plotted in part a of the figure. Note that x P, S and Θ P exponentially scale with the coordinate differences Σ-Σ(0) and Σ-Σ(∞), respectively (see Eqs. (17) and (16), respectively).
Consequently, the fraction of specifically occupied probes steeply increases in the raising part of the Δ-vs-Σ trajectory (mix-regime) whereas the probe coverage steeply increases in its decaying part (sat-regime). The contribution of non-specific hybridization and/or the effect of saturation of a particular probe can be essentially neglected if the distance of its Σ-coordinate from the start and/or end points exceeds unity. Particularly, one obtains <x p S > > 0.9 for Σ-Σ(0) > 1 and < Θ p > < 0.1 for Σ(∞)-Σ < 1.
The horizontal shift between the PM-and MM-curves in part b of Figure 2 illustrates the "delayed response" of the MM with respect to the specific transcript concentration: The MM reach a certain ordinate-level of the surface coverage and of the fraction of specifically bound probes at larger abscissa values and thus at larger concentrations of specific transcript concentrations than the PM (see also Eqs. (7) and (8)).
The fraction of specifically bound probes directly transforms into the mean S/N-ratio of the PM and MM (see Appendix A and also Eq. (6) For intermediate abscissa values, Σ(0) + 1 < Σ < Σ(∞) -1, the occupancy of the probe spots (Eqs. (16) and (42)) provide an approximation of the binding strength of specific hybridization of the PM and MM probes (Θ p P ≈ X p P,S , see also Eq. (1)) and of their mean In summary, the position of a probe-point along the Σcoordinate estimates the hybridization degree of the respective probe spot in terms of relative concentration measures characterizing either the S/N-ratio (Eq. (18)), the relative occupancy of the probe oligomers with specific transcripts (Eq. (17)), their overall degree of occupancy of (Eq. (16)) and the specific binding strength of the considered probe pair (Eq. (19)).
The probe coverage (Eq. (16)) provides an additional interpretation of the horizontal dimensions of the Δ-vs-Σ trajectory: For the N-point one obtains with Σ = Σ(0) the Algorithms for Molecular Biology 2008, 3:12 http://www.almob.org/content/3/1/12 coverage due to non-specific hybridization, , because (almost) exclusively non-specific transcripts bind to the probes. Note that this "non-specific" coverage is exponentially related to the "width"-exponent, β p (see Eq. (14)), and thus to the horizontal distance between the N-and the aspoints. The remaining, not-occupied and thus free oligomers serve as potential binding sites for specific targets, i.e., . The horizontal dimension of the Δ-vs-Σ trajectory consequently specifies the maximum amount of free probes available for specific binding at R = 0 and thus the measurement range of the probe spots for estimating the expression degree. The narrowing of the model curves reflects the diminishing capacity of the respective probes for specific transcript binding. Figure 3 (part a) illustrates the narrowing of the Δ-vs-Σ trajectory upon increasing the non-specific background contribution. The special ideal case β = -∞ consequently refers to hybridization without non-specific background.

The vertical scale of the Δ-vs-Σ trajectory
The Δ-coordinate of a probe is directly related to the socalled discrimination score DS used by Affymetrix as a relative measure of the PM-MM intensity difference.
The discrimination score roughly estimates the fraction of the signal due to specific hybridization (see [26]). The approximation on the right hand side of Eq. (20) seems save for small values DS << 1.
The discrimination score serves as the basic parameter in the MAS5-algorithm to calculate the so-called detection call (DC) which judges the "presence" or "absence" of a gene. Hence, the vertical scale of the Δ-vs-Σ trajectory is related to the detection call: the higher the Δ p -value of a probe the higher the probability of the presence of the respective specific transcript in the hybridization solution.
We will discuss this point more in detail in the accompanying paper in connection with our alternative method for classifying the genes into present and absent ones (see below).
The vertical scale of the Δ-vs-Σ trajectory admits an additional interpretation in terms of different strengths of the base pairings of the PM and MM probes. Particularly, the Δ-coordinates of the N-and the S-points estimate the ratio of the binding constants of the PM and MM upon specific and non-specific hybridization according to Eqs. (4), (11) and (13). We have previously shown that the log-ratio of the binding constants of the PM and MM probes can be interpreted in terms of the effective free energy difference for duplex formation with the respective targets [15,16,18]. For the MM-design used for GeneChip expression arrays it roughly refers to the effective free energy change upon replacement of the Watson Crick (WC) pairing in the middle position of the probe/target duplexes with the respective self complementary (SC) pairing in the specific duplexes and with the complementary WC-pairing in the non-specific duplexes, respectively, i.e., Part a: Δ-vs-Σ trajectories upon decreasing contribution of the non-specific background Here Δε 13 WC-SC denotes the dimensionless free energy gain (given in units of the thermal energy, RT) upon replacements of the type B•b c → B c •b c (i.e. WC → SC) for the base B p = A, T, G, C at sequence position 13 of the probe (for example, C•g → G•g; upper case letters refer to the DNA-probe; lower case letters refer to the bound RNAfragment, b = a, u, g, c; the superscript "c" indicates the respective complement). Accordingly, Δε 13 WC-WC is the respective free energy change upon WC-reversals, Hence, the ordinate position of the starting point of the Δvs-Σ trajectory estimates the effective free energy change upon replacing the central base in complementary WCpairings, i.e. Δ p (0) ≈ -Δε WC-WC (B p ) (see Eqs. (11) and (21)). The relative ordinate value of the maximum is related to the respective free energy change upon replacing the central WC-pairing in the specific PM-duplexes by the respective SC-pairing in the MM-duplexes, i.e. Δ p (R max ) ≈ -Δε WC-SC . Figure 3 illustrates that the maximum height of the Δ-vs-Σ trajectory starts to decrease for relatively small widths referring to large strengths of non-specific hybridization (β p < 3) because saturation onsets almost in the mixrange. In such cases the observed vertical dimension of the trajectory potentially underestimates the height-parameter α p (see Eq. (13)) which however can be obtained by appropriate curve fitting using Eq. (10) (see below).
In summary, the Δ-vs-Σ trajectory spans a sort of natural or intrinsic metric system between distinctive points which characterizes the binding thermodynamics of the probes of the particular microarray. The horizontal dimension characterizes the measurement range of the respective probe whereas the vertical dimension reflects the free energy gain due to the change of the central base pairing in the respective duplexes of the PM and MM probes.

Δ-vs-Σ trajectories of individual probes
Each probe is characterized by its "individual" Δ-vs-Σ trajectory which describes the intensity change upon increasing content of S-transcripts in the range 0 ≤ R ≤ ∞. We used the Affymetrix HG-133 spiked-in data-set to study the Rdependence of selected probes http://www.affyme trix.com/support/technical/sample_data/datasets.affx. This data set was generated by Affymetrix to calibrate the observed intensities on the basis of known transcript concentrations. Particularly, transcripts referring to 42 selected genes were titrated with increased concentration onto a series of chips using the Latin-squares design. The non-specific background was taken into account by add-  (21) Upper panel: Δ-vs-Σ trajectories of six selected probe-pairs taken from the HG-U133 spiked-in experiment (Affymetrx) Figure 4 Upper panel: Δ-vs-Σ trajectories of six selected probe-pairs taken from the HG-U133 spiked-in experiment (Affymetrx). The sequences of the PM-probes are shown in the insertion below. The probe-pairs are numbered with increasing C and decreasing A content of the respective PM-probe (see of insertion on the right). The symbols are the experimental data. The lines are calculated using Eq. (10). Lower panel: Variability of the Δ-vs-Σ trajectories of all 498 spiked-in probes. The boxplot indicates the scattering of the start-and maximum-coordinates for R = 0 and R = R max , respectively. The Δ-vs-Σ trajectories are averaged over all probe-pairs with the middle base A, T, G and C of the PM, respectively. The variability due to the middle base refers to the 75%/25% quantiles of the distribution of the individual probe trajectories. The consideration of the nearest neighbours in terms of the middle triple extends this range: only the trajectories with the largest and smallest maximum value (for middle triples CCC and CGC, respectively) are shown (see also ref. [18]).
ing a HeLa-cell line extract to all hybridizations which does not contain the spiked-in transcripts.
Part a of Figure 4 shows the trajectories of six selected probes together with fits by means of Eq. (10) (compare curves and symbols). The probe-labels 1 to 6 are chosen to increase with increasing number of C and decreasing number of A per probe sequence (see Figure 4). The observed intensities and thus also the trajectories are functions of the binding constants for DNA/RNA duplex formation, which in turn depend on the sequences of the 25 meric probes. For example, the binding affinity of C•g WC-pairings exceeds that of A•u pairs in the hybrid duplexes. In general, the probes with a higher amount of cytosines are therefore expected to bind the RNA fragments more strongly than probes with a higher amount of adenines. Equation (3)  Indeed, the increase of the cytosine-content causes the narrowing of the trajectories by the shifting of their startpoint, Σ p (0), towards larger abscissa values at invariant Σ p (∞) = const., which is assumed to be constant across all probes because of their common maximum binding capacity. Note that the width of the trajectories and thus the binding strength of the non-specific background varies over about two orders of magnitude, logX p PM,N ≈ -4 to -2, for the six selected probes.
The Δ-coordinates of the starting-and maximum-points of the selected probe-trajectories show considerable variation without obvious correlation to their sequence characteristics. We calculated the trajectories of all spiked-in probes (~500) using the results of our previous analysis of the hybridization isotherms (see refs. [18] and [16] for details) to estimate the variance of the positions of their starting-and maximum-points. The boxplot in part b of Figure 4 visualizes the center and the width of the distributions of the obtained Δ p (0)-and Δ p (R max )-data in vertical and horizontal directions.
The respective coordinates of the individual probe-trajectories depend mainly on the particular probe pairings of the middle bases in the non-specific and specific duplexes, respectively (see Eqs. (11) - (13) and (21)). To filter out the underlying sequence effects we calculated "mean" trajectories for all probe pairs with a certain middle base (see Figure 4, part b). These middle-base related mean trajectories are shifted each to another in vertical direction according to C ≈ T > G ≈ A for the N-, and C > G ≈ T > A for the S-point, respectively. This systematic trend is in agreement with Eq.  (21)). Note that the specific binding constants of PM exceed that of the MM on the average by the factor of s ≈ 7 whereas for non-specific binding one obtains a mean ratio of n ≈ 1.2.
The comparison of the middle-base related trajectories with the width of the N-and S-boxes indicates that the systematic effect due to the middle-base explains the variability of the Δ p (0)-and Δ p (R max )-coordinates in the limits of their 25% and 75% quartiles. The consideration of the nearest neighbors of the middle base further broadens this range: For illustration we show the respective mean trajectories for the "middle triples" CCC and CGC which provide the strongest and weakest binding affinities among the 64 possible combinations of three adjacent bases, respectively (see [18] and [13]).
In summary, the transformed intensity data of individual probes are well described by the Δ-vs-Σ trajectories predicted by the Langmuir-isotherms. The presented data illustrate the probe-specific variability of the Δ-vs-Σ trajectories due to sequence effects. The positions of the startand maximum-points can be attributed to the differences between the PM and MM probe-sequences which affect the respective binding constants in a middle-base dependent fashion.

The hook-algorithm for single-chip calibration
The Δ-vs-Σ trajectories describe the behaviour of PM/MMprobe intensities as a function of the RNA-concentration on a relative scale. The analysis of probe-specific trajectories seems not applicable for probe data which are taken from a single chip because each probe pair refers exactly to only one concentration and thus to only one point along its "individual" Δ-vs-Σ trajectory. On the other hand, the large number of probes per chip (and the presence of considerable amounts of specific targets) lets us suggest that their hybridization degrees usually cover the whole potentially possible concentration range. Our idea is to characterize the performance of a particular hybridization experiment in terms of its mean Δ-vs-Σ trajectory averaged over all probes of one particular microarray in analogy to the "individual" Δ-vs-Σ trajectory of each single probe. The horizontal and vertical dimensions of this mean Δ-vs-Σ trajectory are expected to provide the hybridization metrics of the considered chip in terms of characteristic concentration measures (e.g. the mean level of non-specific background or the mean occupancy and saturation level of the probe spots) and of characteristic effective free energy differences due to the used mismatchdesign of the probe pairs.

The raw hook curve
To construct the mean Δ-vs-Σ trajectory we plot the PM-MM log-intensity difference of all probe pairs of a particular chip, Δ p , versus the set-averaged log intensity of the respective probe set, <Σ> set , in a first step (see Eq. (9) and part a of Figure 5). Additional set-averaging of the log-difference, <Δ> set , reduces the scatter width of the data points along the ordinate roughly by the factor of ~ √N set ~ 3 (see the yellow dots in Figure 5). In the next step we smoothed these data by calculating the moving average over a sliding window of N mov ≈ 100 subsequent probe sets along the abscissa to extract the mutual dependence between <Δ> set and <Σ> set . The resulting plot is called (raw-) hook-curve because of its typical shape (see part b of Figure 5). Each probe-set is characterized by its "hook"-coordinates, Σ hook = <Σ> p∈set and Δ hook = <<Δ p > p∈set > mov .
The shape of the hook curve basically agrees with that of the Δ-vs-Σ trajectory (compare with Figure 2). We attribute the rising and decaying part and the maximum inbetween to the mix-, sat-and S-regimes, respectively, which refer to the mean hybridization level of the respective probes on the chip. The hook-curve obviously does not reach the asymptotic as-regime with Δ(∞) ≈ 0. This result does not surprise because complete saturation is usually circumvented by reasonably chosen hybridization conditions. Otherwise the probes completely lose their sensitivity to detect changes of the transcript concentration [14].
At small abscissa values the hook curve starts with a virtually horizontal part (slope(N) < 0.1) which is separated from the subsequent mix-range (slope(mix) > 0.6) by a distinct break-point. In the appendix we show that the observed initial tiny slope can be explained by the relative strong correlation between the intensities of non-specifically hybridized PM-and MM-probes in the absence of specific transcripts (ρ > 0.7, see Eq. (49)) which in turn seems reasonable in view of the nearly equal sequences of both probe-types which suggest similar variations of the sequence-specific bindings strengths from probe pair to probe pair. On the other hand, for the mix-region the hybridization model predicts a considerably increased slope of slope(mix) ≈ 1.15·α p ≈ 0.9 > slope(N) ≈ 0.8 (see Eqs. (47) and (49) with·α > 0.8 and ρ > 0.7, respectively).
Hence, the break in the course of the hook-curve can be explained by the onset of specific hybridization for probes (and probe-sets) located on the right from this point.
Accordingly, the break-point between the N-and mixranges was used to classify the probe-sets into two subensembles: (i) "absent"-ones for Σ hook < Σ break , which are assumed to hybridize predominantly non-specifically owing to the absence of specific transcripts in the hybridization solution (R = 0); and (ii) "present"-ones for Σ hook > Σ break , which significantly hybridize specifically owing to the presence of specific transcripts (R > 0).
The exact position of the break point of a particular hook curve was estimated by a simple algorithm based on the joint least-squared fit of the Δ hook -data to a linear function for Σ hook < Σ break , and a quadratic function for Σ hook > Σ break (see Figure 5). The algorithm systematically varies the position of the break, Σ break finally returning the optimum value of Σ break which best fits the data.
Upper panel: Δ-vs-<Σ> set plot of the intensity data of one complete chip taken from the HG-U133 spiked-in experi-ment (Affymetrix): Probe intensity data (black dots), probe-set log-averaged data (light dots) and moving average over ~100 probe-sets (line) Figure 5 Upper panel: Δ-vs-<Σ> set plot of the intensity data of one complete chip taken from the HG-U133 spiked-in experiment (Affymetrix): Probe intensity data (black dots), probeset log-averaged data (light dots) and moving average over ~100 probe-sets (line). The latter plot is enlarged in the lower panel: It is called (raw) hook curve because of its characteristic shape. The N-, mix-, S-and sat-hybridization regimes are indicated (compare with Figure 2). The breakpoint is used to classify the probesets into absent and present ones (see text).

Sensitivity-corrected intensity-data and sensitivity profiles
The intensity of a probe and thus also its (Δ, Σ)-coordinates depend on the concentration of RNA-transcripts and on the binding constants for specific and non-specific hybridization as well (see Eq. (5)). These constants are functions of the respective probe sequences giving rise to the scattering of the individual probe intensities over oneto-three orders of magnitude [14]. This variability is not related to changes of the transcript concentration, and thus it introduces considerable uncertainty if one aims at interpreting the (Σ set hook , Δ set hook )-coordinates of a probe in terms of its hybridization degree. In the next step we therefore correct the intensities for probe specific effect according to where I P 0,p denotes the corrected intensity of probe p. The sequence-specific incremental contribution, δA p P (P = PM, MM), the so-called sensitivity of the probe, additively splits into two terms due to specific and non-specific hybridization, δA p P,N and δA p P,S , which are weighted by the fraction of non-specifically and specifically hybridized oligomers, and x p P,S = (1 -x p P,N ) with logL p P (0) ≈ logI p P (0) = <logI p P > p∈N + δA p P,N and L p P = Altogether, the intensity-correction requires four sets of such profiles, δε k P,h (b m ), referring to P = PM, MM and h = N, S, respectively. We used weighted multiple linear regression of the normalized intensity data taken from selected sub-ensembles of probe sets to estimate the required sensitivity-profiles (see Appendix C and [18]). These sub-ensembles were chosen to comprise probes which are predominantly hybridized with non-specific (p ∈ N) or specific (p ∈ S) transcripts for estimating δε k P,N (b m ) and δε k P,S (b m ), respectively. Probes of the former sub-ensemble are taken from the initial horizontal part of the hook curve which was assigned to the N-regime meeting the condition Σ p hook < Σ break (see Figure 5). The respective absent-probes are assumed to bind predominantly non-specific transcripts.
The second sub-ensemble of predominantly specifically hybridized probes (p ∈ S) was selected according to the condition Σ set hook > Σ 80% where Σ 80% defines a threshold referring to <x S > > 0.8, i.e. to probe spots with a fraction of specifically-hybridized oligomers of at minimum 80%. According to Eq. (17) the respective probes are selected to meet the "horizontal distances"-criterion with respect to the break-point (Σ 80% -Σ break ) > -log(0.2) ≈ 0.7. The somewhat arbitrary choice of <x S > > 0.8 turned out to be not very crucial with respect to the quality of the obtained correction. On one hand the higher the value of <x S > the smaller the residual contribution of non-specific hybridization in the selected sub-ensemble and the better the obtained sensitivity profiles characterize specific binding [16]. On the other hand, the number of probe sets in the sub-ensemble decreases with increasing <x S > giving rise to more noisy sensitivity profiles and thus to less precise correction terms [16]. The chosen value of <x S > >0.8 provides a good compromise (see Figure 5).
Typical SN-and NN-sensitivity profiles of the PM and MM referring to the N-and S-subsets are shown in Figure 6. Note their different shape, especially that of the S-profiles of the MM with the typical "dent" in the middle of the sequence. It is caused by the mismatched base pairing in the specific duplexes of the MM which give rise to changed interaction characteristics compared with WC-base pairings. We refer to our previous papers for the detailed discussion of the sensitivity profiles in terms of base-pair interactions in the respective probe-target duplexes [15,16,18]. In the present context it is important to discriminate between the four different sensitivity profiles to properly correct the intensity data for sequence specific effects.
Importantly, for perfect selection of the N-subset one expects the same sensitivity profiles for the PM and MM probes, since the nonspecific background bears no particular relationship to any of both kinds of probes. The profiles shown in Figure 6 confirm this prediction. The degree of similarity of the N-profiles of the PM and MM probes thus provides a criterion for the proper selection of the Nsubsets.

The corrected hook curve
In the next step, the corrected intensity values were transformed into (Δ 0,p , Σ 0,p )-coordinates using the corrected intensities in Eq. (9). Subsequently the (Δ 0,p , Σ 0,p )-data were set-averaged and smoothed in the same fashion as described above for the non-corrected intensities to get the sensitivity-corrected version of the hook-curve with the coordinates (Σ 0 hook , Δ 0 hook ).
The sensitivity-correction reduces the scattering of the probe-and probe-set-level intensity data about the smoothed hook-curve (see Figure 7). With the N-, mix-, Sand sat-hybridization regimes it essentially shows the same features as the uncorrected hook curve. The break between the N-and mix-regimes was again used to classify the probe sets into absent and present ones. The sensitivity correction affects the (Σ 0 hook , Δ 0 hook )-coordinates of The same as Figure 5 but for sensitivity-corrected intensity data (Eq. (22)). The upper panel shows the Δ-vs-<Σ > set plot of the sensitivity-corrected intensity data of one chip taken from the HG-U133 spiked-in experiment (Affymetrix) The lower panel shows the (corrected) hook-curve.
each probe set with possible consequences for its classification into absent and present ones. In a second iteration we therefore re-calculated the four sensitivity profiles on the basis of corrected hook-curve by applying the same criteria for probe selection as above, i.e., Σ 0 hook < Σ 0 break and Σ 0 hook > Σ 0 80% for the N-and S-profiles, respectively. Typically the re-calculated sensitivity profiles only marginally change compared with the profiles which were obtained on the basis of the uncorrected hook curve indicating the relative robustness of the chosen classification criteria.
The detailed comparison between both versions of the hook-curve reveals subtle differences especially of the initial N-regime: After correction it changes slope and, more importantly, considerably narrows in horizontal direction by roughly 50% (compare Figure 7 and Figure 5). The steeper slope of the N-region after correction reflects the reduced cross correlation between the corrected PM-and MM-probe level data (see appendix B). Typically, the coefficient of correlation decreases from values ρ ≈ 0.7-0.9 of the uncorrected data to ρ ≈ 0.4-0.7 after correction which gives rise to the change of the slope by the factor of 1.5 -3 (see Eq. (49)).
The narrowing of the N-range results from the partial removal of sequence effects which, to a large degree, cause the variability of probe signals [14]. Ideally, the correction of the intensities for sequence specific effects is expected to shrink the N-region into one point referring to one and the same corrected background intensity for all probes of the chip (see Eqs. (11) and (3)). The residual width of the N-region reflects the deficiency of the method due to at least three, essentially undistinguishable effects: Firstly, the "fit"-error of the position-dependent sensitivity model which only incompletely corrects the intensity data for sequence effects caused, e.g. by the specific folding of probe or target and/or bulk dimerization between different targets [1]; secondly, the "N-concentration" error due the simple assumption to consider all non-specific transcripts as one effective species (Eq. (3)) and thirdly, the "classification" error of the simple break-criterion which imperfectly distinguishes between "present" and "absent" probes due to its limited specificity.

Fit of the hybridization model
The corrected hook curve manifests the mean hybridization characteristics of the probes on the particular chip in sensitivity-corrected (Σ 0 hook , Δ 0 hook )-coordinates. It shows essentially the same properties as the Σ-vs-Δ trajectory of a single probe except the N-regime. For quantitative analysis we fit the hybridization model introduced above (Eq. (10)) to the corrected hook data in the mix-, S-, sat-and as-ranges (i.e. at Σ 0 hook > Σ 0 break , see Figure 8). The least-squares gradient descent algorithm searches for optimum values of α c , β c and Σ c (0) ≈ Σ 0 break and Δ c (0) ≈ <Σ 0 hook > p∈N .
Note that the break position Σ 0 break slightly deviates from the centre of the N-range <Σ 0 hook > p∈N because of the width of the N-regime (see Figure 7). We define the total width of the hook curve as .
Note that the obtained model parameters are now chipaveraged mean values in contrast to the single probe properties used above. We therefore substitute the probe-index "p" by the chip-index "c". Particularly, one gets the mean scales the concentration ratio of specific and non-specific transcripts with the respective ratio of chip-averaged mean binding constants of the PM for specific and non-specific binding. Note that this scaling is common for all probes of the chip in contrast to the probe-specific scaling of R p PM in Eq. (6).
In summary, the hook-curve analysis of the intensity data thus provides chip characteristics such as the mean contribution of the non-specific background, the asymptotic intensity maximum and the mean PM/MM-sensitivity gain, and, in addition, the expression degree on probe-set basis in terms of the S/N-ratio.

Signal distributions
Part a of Figure 9 shows the probability-density distribution to find a probe pair a at a given position of the hookabscissa, p(Σ 0 hook ) = ΔN/(N tot ·ΔΣ hook ·) (here ΔN/N tot is the fraction of probe-pairs found per abscissa interval ΔΣ hook ). About 40% of the probes fall in the N-range in this example and thus are classified "absent". We separately plot the density distribution of these absent probes as a function of their corrected log-intensities (part b of Figure 9). The obtained PM-and MM-probe-level data are The total distribution decays exponentially to a good approximation at Σ 0 hook > Σ c (0) in many cases (part a of Figure  9). Define the parameter , related to the S/N ratio R.
The exponential behaviour is approximately preserved with respect to this parameter in the mixed and S ranges since from Eq. The distribution can be empirically approximated by an exponential decay function for r-values greater than a certain threshold r' (and R' > R) In the intermediate range the signal is approximated by a constant with f 2 present = 1 -(f absent + f 1 present ) (f present = f 1 present + f 2 present ) whereas for r = 0 one gets the fraction of absent probes. The decay constant λ r defines the r-range which decreases the probability of the specific signal by one order of magnitude. It thus defines a characteristic S/ N-ratio (in logarithmic scale) of the chip which characterizes log-ratio of the mean specific and non-specific signals, or in other words, to which extend the specific signal exceeds the non-specific one. The S(pecific)/N(onspecific)-ratio thus can be also interpreted as a sort of S(ignal)/N(oise)-ratio of a given probe set. The value of the r-related decay constant λ r slightly exceeds that of the hook-related distribution owing to the larger range covered by the PMonly S/N-ratio (λ r = 1.0 versus λ = λ Σ = 0.75; compare part c and a of Figure 9).
An estimate of the mean decay constant can be obtained by simple averaging over the respective R-range λ ≈ ( log(R + 1) R > R' ·In 10) (29) Note that the exponential decay in Eq. (28) is equivalent with the power law, 10 -r/λr = (R+1) -1/λr , which has been shown to describe the probability distribution of expression values in a series of microarray experiments [28][29][30][31]. This power-law function is known as the Zipf's law, observed in many natural and social phenomena. The presence of such power-law function in principle prevents an intrinsic cut off point between "on" genes and "off" genes. The analysis of expression data in terms of their probability distribution therefore is of basic importance for judging the expression level in terms of global characteristics. The hook-transformation provides such characteristics in terms of the signal distribution along the Σ hook and/or r coordinates for the mix-, S-and sat-hybridization ranges.
Let us now consider the S/N-ratio of a particular probe at two conditions: (i) with x = 0, referring to the non-specific binding strength in the centre of the normal background distribution, X p PM,N ≈ 10 μ /M c : ; a n d (ii) with x ≠ 0, i.e. X p PM,N (x) ≈ 10 μ+x /M c and .
After transformation into logarithmic scale and combination of both situations we get log R(x) = log R p -x and for R > 1 to a good approximation r(x) ≈ r -x.
Now we combine the virtually independent background and signal distributions into the joint probability density function It refers to the logarithmic signal value r with the particular non-specific background contribution μ + x. Integration over the whole possible backgound range provides the probability for detecting the signal r for probe p, Note that the upper integration range is effectively restricted to x ≤ r because of p S (r-x,λ) = 0 for x > r (see Eq. (28)). Eqs. (30)-(31) define the convolution product of the background and specific-signal distributions. A similar approach was used in ref. [30] and partly also in the RMA-algorithm [29].

De-saturation of the signal and convolution based expression estimates
The hyperbolic intensity-response (Eq. (1)) can be linearized and transformed into the total binding strength using the asymptotic value M c : Figure 10 re-plots the hook curve shown in Figure 8 after linearization of the corrected intensities using Eq. (32).
The sat-range now levels off into the asymptotic value α c (compare with Figure 8). This "de-saturated" signal additively decomposes into contributions due to non-specific and specific hybridization (Eq. (2)). We now calculate the expression value as the weighted "glog"-mean of the specific signal, Particularly, Eq. (33) corrects the linearized signal (Eq. (32)) for the probe-specific background contribution using the mean background of the chip and the probespecific sensitivity for non-specific hybridization, (Eq. (23)). The use of the generalized logarithm, , for data transformation is advantageous in two respects: Firstly, it enables processing negative arguments which may appear due to data scattering and, secondly, it "stabilizes" the variance at small values of x provided that the transformation parameter c > 0 is properly chosen [32,33].
In Eq. (33) the glog-transformed data are averaged over the probability distribution representing the convolution product of the mutually independent probabilities of the non-specific Gaussian background and the specific expression signal (see previous section and [30]). This approach assumes that the probability distribution of each individual transcript is the same as the mean distribution averaged over all transcripts. Alternatively one can use the un-weighted kernel with p S = const. In this case Eq. (33) restricts the averaging to the background distribution.
In the next step, the obtained values are corrected for the sequence effects according to using the positional-dependent sensitivity model and the Z p P values instead of the log-intensities for parameterization (Eq. (50)). Then, the probe-level data are summarized for each probe set by calculating the Tukey-biweight median, [25], and finally by transforming them into linear scale to get the probe set-level expression degree, . Eq. (33) thus provides "PMonly" and "MMonly" estimates of the expression degree with P = PM and MM, respectively. Eq. (34) uses the background distribution of the PM and the conditional expectation of the bivariate normal distribution for the argument of the MM-background which gives rise to an effectively shrunken integration variable, (here ρ c is the coefficient of correlation, see appendix B). Note that this approach is to some degree similar to the GC-RMA algorithm [34] which however uses pseudo-MM representing subsets of PM-probes of the same GC-content as the considered PM. A second difference to Eq. (34) is that GC-RMA directly subtracts the (non-linear) MM-intensity without explicit consideration of the non-specific background.
In this context we stress the fact that the non-specific background intensity is rather a variable contribution which progressively decreases with increasing amount of specific binding than a constant (see appendix A and the figure shown there). The background correction of the intensity (as realized, e.g. by GC-RMA) neglects this trend and therefore it is expected to over-correct the signal in the Sand as-hybridization ranges. As a consequence, this approach underestimates the expression degree due to two reasons: Firstly, this over-correction of the background and, secondly, the neglect of saturation. On the other hand, the linearized signal used here (Eq. (32)) corrects the intensity for saturation and, in addition, it contains a constant background level corrected by simple subtraction in Eqs. (33) and (34).
The joint PM-MM-and especially the MMonly-expression measures are smaller than the PMonly estimates because of the smaller binding constants of the MM for specific binding. The former two measures can be scaled to values equivalent to the expression level of the PM according to (see also Eq. (24)), and finally transformed into the "set-averaged" binding strength in analogy with Eq. (32) Note that the PM-estimate exceeds the MM signal roughly by the factor s c ≈ 5 -10 at comparable variance of the residual background distribution (see Figure 9, part b). Note that also the S/N-ratio (Eq. (26)) transforms into an alternative estimate of the specific binding strength of the PM (see Eq. (25) and (26)) with Finally, the decay constant of the distribution of the specific signal relates to the logarithm of the S/N-ratio (see Eqs. (28) and (29)). One obtains the mean specific signal measured by the given chip as The characteristic expression index (or exponent of specific binding) φ c complements the respective exponents for non-specific hybridization β c and the characteristic S/ N-index λ r as the basic chip summary-characteristics of specific and non-specific binding.

Chip characteristics and expression estimates in natural units
Eqs. (36) and (37) provide probe set-estimates of the specific binding strength as a measure of the expression degree, which represents a dimensionless concentration measure in units of the mean specific binding constant of the chip. A value of S set = 1 consequently defines the condition of half-coverage to a good approximation (see Eq. (7)). Analogously, the horizontal dimensions of the hookcurve provide the non-specific binding strength as a measure of the non-specific background in units of the respective binding constant (see Eq. (25)), It specifies the mean occupancy of the probes in the absence of specific transcripts, Θ PM (0) ≈ X c PM,N .
Hence, the hook method measures both, the abundance of specific and non-specific transcripts in the hybridization solution in chip-related units such as the relative occupancy of the probe spots and the respective binding strengths.
The specific and non-specific signals are combined into the S/N-ratio, R (Eq. (26) degree in units of the non-specific background contribution in the particular chip experiment. The S/N-ratio is directly related to the hook-coordinates of a selected probe set and thus it can be roughly deduced by visual inspection of the particular hook curve (Eqs. (46) and (18)). Figure 8 illustrates the relation between the intrinsic expression measures and the hook coordinates for a typical microarray hybridization. The point of half coverage (Θ PM = 0.5, X PM, S ≈ 1) is found beyond the maximum in the sat-regime. Virtually no probe of the chosen chip meets this condition. Note that Θ PM scales with the distance relative to the end point referring to R → ∞ (see above). Vice versa, the mean fraction of specifically hybridized oligomers, x PM, S , scales with the distance relative to the N-point. It steeply increases in the mix-regime and reaches the conditions at which 50% of the bound transcripts belong to the specific ones, x PM, S = 0.5, at relatively small abscissa-values. Hence, specific hybridization starts to dominate over non-specific one always at the beginning of the mix-range. The S-range near the maximum of the hook-curve refers to probes with a 50 -100 fold excess of specific hybridization, R PM ≈ 50 -100. The width of the hook of about β = 2.7 is equivalent with the background strength of N c ≈ 10 -2.7 = 0.002 which in turn rescales the S/N-ratio into binding strengths, for example S c = 0.1-0.2 for R = 50-100. These rough estimation shows that the maximum of the hook is equivalent with the occupancy of the PM-spots in the order of 10 -25%.
The mean binding constants, K c PM,S and K c PM,N and thus also the used measures of specific and non-specific hybridization depend on the particular probe and chip design (e.g., the length of the probe-oligonucleotides, their density and the type of the mismatches used for the MM probes). Consequently they are specific for the used chip type, on one hand. On the other hand, also the conditions of a particular hybridization affect the K c PM,h because their values depend on all processes affecting the binding reaction between the probe-oligonucleotides and the targets. For example, the composition of a particular sample will affect K c PM,N and K c PM,S as well, because both constants depend on the extent of target-dimerization which is a function of the concentrations of the reacting species in the hybridization solution [1].

Summary and Conclusion
The improvement of microarray calibration methods in combination with the development of meaningful quality standards is an essential prerequisite for obtaining absolute expression estimates which in turn are required for the quantitative analysis of transcriptional regulation. In this publication we present a new method of microarray data analysis based on a physical model. This so-called hook method pre-processes the raw intensity data for further downstream analyses on one hand, and, on the other hand, provides chip characteristics with potential applications in hybridization quality control and array normalization.
The method is based on the Langmuir-hybridization model which provides a physically adequate and computationally feasible approach for microarray intensity calibration with the potency to improve existing methods. Our hook-calibration method uses this model together with the positional-dependent nearest-neighbour affinity correction. It is based on the linear transformation of the intensities of PM and MM probes from one chip into Δ-vs-Σ coordinates, probe-set averaging and smoothing. Here, the MM probes serve as an internal reference subjected essentially to the same hybridization law as the PM, however with modified characteristics. Figure 11 and Table 2 summarize the essential steps of the algorithm together with the output characteristics provided by the method.
The obtained hook-curve can be interpreted as a special representation of the binding isotherm where the explicit dependence of the probe intensities on the (usually unknown) transcript concentrations is replaced by the (experimentally available) relation between the PM-and the MM-probe intensities. It enables clear differentiation between different, well-defined regimes and it provides a set of chip summary characteristics which evaluate the performance of a given hybridization in terms simple parameters such as the mean non-specific background intensity, its saturation value, the mean PM/MM-sensitivity gain and the fraction of absent probes. The hook curve spans a natural metrics system for the expression estimates which reflects essential hybridization characteristics in terms of its geometric dimensions, width, height and "start"-coordinates.
The obtained single chip characteristics in combination with the sensitivity corrected probe-intensity values provide expression estimates scaled in natural units given by the binding constants of the particular hybridization. This way the method corrects the raw intensities for the nonspecific background hybridization in a sequence-specific manner, for the potential saturation of the probe-spots with bound transcripts and for the sequence-specific binding of specific transcripts.
In the accompanying publication we illustrate the performance and potential applications in terms of quality control and expression analysis using a series of selected chip-types, hybridization conditions and benchmark experiments [19].
The beta-version of the hook-program can be downloaded from http://www.izbi.de. The stand-alone JAVA program processes single-chips and chip-series in a batch-mode according to the scheme given in Figure 11 and Table 2. Chip and probe-set related characteristics such as expression degrees, hook-curves and sensitivity profiles are exported in html-as well as in tabular form and jpggraphics.

Derivation of Eqs. (16) -(18)
The relative Σand Δ-coordinates of the log-transformed probe intensities with respect to its asymptotic value at R = ∞ directly rescale into the respective log-sum and -difference of the surface coverage of the PM and MM probes (see Eqs. (1) and (9)) One obtains Eq. (16) for the mean coverage of the PM and MM with Eq. (41) and the definition, Θ ≡ 10 ΣlogΘ . Rearrangement provides the surface coverage of the PM and MM probes as a function of the relative Σand Δ-coordinates, The surface coverage splits into contributions due to nonspecific and specific transcripts, Θ P = Θ P,N + Θ P,S , with (see Eq. (2)). Both contributions are functions of the composition of the hybridization solution expressed in terms of the S/N-ratio R (see Eq. (5)), Eq. (43) shows that the specific coverage monotonously increases with the S/N-ratio whereas the non-specific "background" coverage remains virtually constant at R < 1/X PM, N for the PM (and at R < (n/(s·X PM, N ) for the MM, see Figure 12). Specific transcripts of large binding strength progressively replace the bound non-specific fragments with further increasing values of R. As a conse- (42)  Step Output (chip and probe-set characteristics) Eq.
Optical background (O, mean over all zones); the algorithm uses a scaled value with a scaling factor chosen between 0 and 1 2) Raw hook: Plot of the PM and MM probe intensity data into Δ-vs-Σ coordinates and smoothing over a sliding-window of ~100 probe sets. Classification into N-and S-probes using the breakpoint of the hook.
Raw hook curve (9) 3) Parameterization of the positional dependent sensitivity-model separately for the PM and MM in the N and S-ranges and correction of the intensities for probe-specific sensitivities.
Sensitivity profiles (optional SN, NN or NNN models) quence the non-specific coverage Θ P, N drops and finally disappears for R → ∞.
With Eq. (43) one can express the specific and non-specific coverage as follows The ratio of the "specific" and the total coverage defines the fraction of specifically hybridized probe-oligomers,  42) and (14)).

B. The slope of the hook-curve in the mix-and the Nranges
The initial part of the hook curve roughly divides into two segments of different slope. We assume that the probes from probe-sets with abscissa positions to the left from the break, Σ hook < Σ break , are predominantly hybridized with non-specific transcripts (N-range) whereas the probes from probe-sets with abscissa positions to the right from the break, Σ break < Σ hook , in addition, bind a certain amount of specific transcripts (mix-range). The slope of the hook curve in this mix-range reflects the change of the hook coordinates with increasing concentration of specific transcripts. From Eq. (10) one gets after differentiation with respect to the S/N-ratio R in the limit of R → 0, The initial slope of the mix range essentially depends on the "height"-parameter α c and thus on the PM/MM-ratio of the specific binding constants (see Eqs. (13) and (4)).
The slope of the hook-curve in the N-range can be rationalized in terms of cross-correlations between the non-specific background signals of the PM and MM probes referring to "absent" specific transcripts, i.e. R = 0. The variances of the hook-coordinates are given by   Eq. (42), part c). The total occupancy additively decomposes into the occupancy due to specific and non-specific hybridization (thin lines, see Eq. (43)). Note that the latter occupancy is not a constant but it depletes upon increasing total occupancy because specific binding progressively replaces non-specific one.
Here σ PM 2 , σ MM 2 and σ PM,MM denote the variance and covariance of the set-averaged log-intensities of the PM and MM probes in the N-range. The formula for the correlation coefficient ρ assumes to a good approximation (see Figure 9).
With Eq. (48) one obtains for the mean slope of the hookcurve in the N-range Hence, a tiny slope near zero indicates strongly correlated PM and MM signals (ρ~1) whereas the increase of slope(N) reflects "decoupling" of PM and MM intensities.

C. Fit of the position-dependent sensitivity profiles
The sensitivity terms δε k P,h (b m ) in Eq. (23) were estimated using the so-called probe sensitivity [14], which normalizes the log-intensities with respect to their log-mean over the probe set. Here the probe set was selected from the specific or non-specific sub-ensembles (h = N or S). P = PM, MM specifies the probe type. Insertion of Eq. (22) into (50) provides Here f k (b m ) is the probability to find the subsequence b m at sequence position k within the probes of a probe set. The weighted least squares algorithm fits Y(theory) to Y(exp) by optimizing the coefficients δε k P,h (b m ) using single value decomposition [35].