# "Hook"-calibration of GeneChip-microarrays: Theory and algorithm

- Hans Binder
^{1}Email author and - Stephan Preibisch
^{2}

**3**:12

https://doi.org/10.1186/1748-7188-3-12

© Binder and Preibisch; licensee BioMed Central Ltd. 2008

**Received: **27 May 2008

**Accepted: **29 August 2008

**Published: **29 August 2008

## Abstract

### Background:

The improvement of microarray calibration methods is an essential prerequisite for quantitative expression analysis. This issue requires the formulation of an appropriate model describing the basic relationship between the probe intensity and the specific transcript concentration in a complex environment of competing interactions, the estimation of the magnitude these effects and their correction using the intensity information of a given chip and, finally the development of practicable algorithms which judge the quality of a particular hybridization and estimate the expression degree from the intensity values.

### Results:

We present the so-called hook-calibration method which co-processes the log-difference (delta) and -sum (sigma) of the perfect match (PM) and mismatch (MM) probe-intensities. The MM probes are utilized as an internal reference which is subjected to the same hybridization law as the PM, however with modified characteristics. After sequence-specific affinity correction the method fits the Langmuir-adsorption model to the smoothed delta-versus-sigma plot. The geometrical dimensions of this so-called hook-curve characterize the particular hybridization in terms of simple geometric parameters which provide information about the mean non-specific background intensity, the saturation value, the mean PM/MM-sensitivity gain and the fraction of absent probes. This graphical summary spans a metrics system for expression estimates in natural units such as the mean binding constants and the occupancy of the probe spots. The method is single-chip based, i.e. it separately uses the intensities for each selected chip.

### Conclusion:

The hook-method corrects the raw intensities for the non-specific background hybridization in a sequence-specific manner, for the potential saturation of the probe-spots with bound transcripts and for the sequence-specific binding of specific transcripts. The obtained chip characteristics in combination with the sensitivity corrected probe-intensity values provide expression estimates scaled in natural units which are given by the binding constants of the particular hybridization.

## 1. Background

The basic mechanism underlying the functioning of DNA microarrays is that of hybridization. Hybridization is defined as the binding between complementary single-stranded nucleic acids. In the case of microarrays one strand is anchored at the surface and the second one is dissolved in solution, referred to as probe and target, respectively. The experimental technique of detecting hybridized probes relies on the fluorescence intensity measurement to infer the transcript abundance specific for a selected gene. The relationship between transcript abundance and intensity is affected by parasitic effects owing to the "technical" variability of repeated measurements and systematic biases which disturb the one-to-one relationship between the input and the output quantity of the measurement [1].

The task of making estimates of the input quantity (transcript concentration) of a measurement from observations of its output (intensity) is called calibration. Calibration of microarray measurements thus aims at removing consistent and systematic sources of variations to allow mutual comparison of measurements acquired from different probes, arrays and experimental settings. Calibration is also called preprocessing because it usually constitutes the first step in the microarray analysis pipeline. It potentially influences the results of all subsequent steps of "higher-level" analyses as well as the biological interpretation of these results, and is therefore a crucial step in the processing of microarray data. The improvement of microarray calibration methods is an essential prerequisite for obtaining absolute expression estimates which in turn are required for the quantitative analysis of, e.g., transcriptional regulation.

Most of the established preprocessing methods rely on algorithms of mainly empirical nature based on the simple assumption of a linear signal response on the transcript concentration in the sample [2–5]. In the last years numerous studies on the physical background of microarray hybridization are published with the perspective of developing improved analysis algorithms [6–12]. For example it has been shown that the probes saturate at higher transcript concentrations which gives rise to a non-linear relation between intensity and transcript concentration. Moreover, benchmark studies have indicated that the proper correction for non-specific background intensity contributions is presumably the most problematic preprocessing step with no satisfactory solution so far.

The immediate aim of most of these papers and also of our previous work [1, 13–18], has been to study the physical (and chemical) processes responsible for converting concentrations of specific target RNA of known sequences to measured fluorescence intensities after hybridization. However, the ultimate, still not-achieved aim of these physical approaches has been to provide scientists with feasible calibration methods which estimate absolute specific target concentrations in the presence of a complex non-specific background from fluorescence intensity data.

Proper calibration of microarray data includes several tasks: Firstly it requires the determination of the model describing the basic relationship between the probe intensity and the specific transcript concentration under consideration of relevant parasitic effects which should be straightened out.

Secondly, the magnitude of these effects should be estimated using the intensity information of a given chip or of a series of chips, and, thirdly, one needs practicable algorithms which judge the quality of a particular hybridization and estimate the expression degree from the intensity values.

Moreover, except MAS5 all popular preprocessing methods [2–5] rely on multichip-algorithms for calibration, i.e. they process a series of chips at once together to separate chip- and probe-level effects from each other. The obtained expression measures are consequently context-sensitive and require a minimum number of chips for appropriate data-processing (usually more than four). As a consequence the results are constricted to a particular series of chips, i.e. they depend on the particular selection of chips and require re-calculation upon adding or removing chips. The development of single chip calibration methods is therefore an important additional task to provide virtually context-insensitive expression measures which can be compared between chips and experimental series without reprocessing. This issue requires appropriate metrics for expression measures to enable direct comparison of data from different experiments in consistent units.

This paper addresses these tasks and presents a new single-chip calibration method for microarrays based on a physical model of hybridization. Our so-called hook-method provides a graphical summary of the hybridization characteristics of each microarray which directly transforms into a sort of natural metrics for intensity calibration with the potential to estimate expression values on an absolute scale. This metrics uses mismatched probes on Affymetrix GeneChip arrays as internal reference for judging the hybridization of the perfect matched probes over the whole potential concentration range.

Notations

Superscripts | Assigns main symbols to... |
---|---|

P = PM, MM | Probe type |

h = N, S | Non-specific or specific hybridization |

| Assigns main symbols to... |

p, c, set | Probe-, chip- or probe-set-level |

0 | Values after sequence correction |

| |

${\text{I}}_{\text{p}}^{\text{P}}$, M | Probe intensity, maximum intensity upon saturation and optical background |

${L}_{0,p}^{P}$, ${L}_{p}^{P,h}$ | Linearized (de-saturated) intensities |

S | Specific and non-specific signals |

${\Theta}_{\text{p}}^{\text{P}}$, ${\Theta}_{\text{p}}^{\text{P},\text{S}}$, ${\Theta}_{\text{p}}^{\text{P},\text{N}}$ | Probe occupancy: total and due to specific and non-specific hybridization |

x | Fraction of specific and of non-specific transcripts among the total amount of bound transcripts |

${\text{X}}_{\text{p}}^{\text{P}}$, ${\text{X}}_{\text{p}}^{\text{P},\text{S}}$, ${\text{X}}_{\text{p}}^{\text{P},\text{N}}$ | Binding strength: total and due to specific and non-specific hybridization |

[S] | Concentration of specific and non-specific transcript |

${\text{K}}_{\text{p}}^{\text{P},\text{S}}$, ${\text{K}}_{\text{p}}^{\text{P},\text{N}}$ | Binding constants of a probe for specific and non-specific transcripts |

s, n | PM/MM-sensitivity gain for specific and non-specific binding |

R | S/N-ratio |

Δ, Σ | Hook coordinates |

Δ | Starting values of the hook coordinates |

Δ(0), Σ(0) | |

Δ(∞), Σ(∞) | End values of the hook coordinates |

| Vertical and horizontal dimensions of the hook curve |

| Discrimination score |

${\epsilon}_{13}^{\text{P},\text{h}}$(B | Contribution of middle base B to the signal |

$\delta {\epsilon}_{\text{k}}^{\text{P},\text{h}}$(b | Positional dependent sensitivity propfile; k is the start position of sequence motiv b |

$\delta {\text{A}}_{\text{p}}^{\text{P}}$ | Sequence specific contribution to the intensity |

${\text{Y}}_{\text{p}}^{\text{P},\text{h}}$ | Sensitivity |

${p}_{p}^{N}$, ${p}_{c}^{S}$ | Density distributions of the non-specific and specific signals |

| Mean value, standard deviation and coefficient of PM/MM-correlation of the normal distributions of the non-specific background signals |

| decay lengths of the exponential distribution (referring to the S/N-ratio and to the Σ-signal) |

| Mean specific signal |

| |

g log(x) | Generalized logarithm |

<...> | Arithmetic mean |

## 2. Calibration model for microarray data

### The competitive two-species Langmuir model of microarray hybridization

Here the superscript denotes the probe-type (P = PM, MM). The indices "p" and "c" assign probe- and chip-specific parameters, respectively. The probe-index implies the chip specificity as well, i.e. p = p, c. This model predicts that the fraction of "occupied", i.e. dimerized oligonucleotides of a probe spot, Θ_{p}^{P} (also called surface probe coverage or occupancy), is directly related to the observed intensity, I_{p}^{P}* [14, 18]. The proportionality constant, M_{c}, specifies the maximum intensity referring to complete occupancy, Θ_{p}^{P} = 1, if all oligonucleotides of the respective probe spot on the given chip are dimerized. The minimum intensity referring to the absence of bound transcripts, Θ_{p}^{P} = 0, gives rise to the "optical" background intensity, O_{c}. Throughout the paper we will consider only "net" intensities which have been corrected for the optical background before further analysis, ${\text{I}}_{\text{p}}^{\text{P}}={\text{I}}_{\text{p}}^{\text{P}}\ast -{\text{O}}_{\text{c}}$, using, for example, the zone-algorithm provided by Affymetrix [23].

_{p}

^{P}, which additively decomposes into contributions due to specific and non-specific hybridization

_{p}and [N]

_{c}, respectively, and to the respective effective association constants of duplex formation, K

_{p}

^{P,h}(h = S, N) (see [1] for details),

The latter equation assumes that the large number of different non-specific RNA-fragments in the hybridization solution effectively acts like a single species with the common concentration [N]_{c} for all probes of the chip [15, 16]. Contrarily, the concentration of specific transcripts, [S]_{p}, refers to a particular probe sequence, i.e., it represents a "single probe"-property. Microarrays of the GeneChip-type use so-called probe sets of several probes (usually N_{set} = 11) for estimating the expression of each considered gene. One expects therefore that all probes of a set probe the same, common transcript concentration, i.e. [S]_{set} = [S]_{p} for p ∈ set assuming that effects as alternative splicing have been appropriately considered during probe design.

The competitive two-species Langmuir adsorption isotherm (Eq. (1)) considers the effects of non-specific "background" hybridization and of saturation at small and large concentrations of specific transcripts, respectively. The maximum intensity at saturation, M_{c}, depends on factors such as the number of oligonucleotides per probe spot (which in turn is related to the density of oligomers and to the spot size), the mean number of optical labels per bound target and the settings of the scanner. These factors affect the PM and MM nearly in the same fashion giving rise to virtually identical values of M_{c} at complete saturation of the probe spots under equilibrium conditions (X_{p}^{P} >> 1) [16, 18].

Recent studies report significantly higher limiting intensity values of the PM, compared with that of the MM, i.e. M^{PM} > M^{MM} [22]. They interpreted this result assuming a probe-dependent partial dissociation of the duplexes during the post-hybridization washing phase. Another, additional explanation might be the truncation of a considerable amount of the probe oligomers due to incomplete synthesis because this effect causes the asymptote-like flattening of the hybridization isotherms at intermediate and large transcript concentrations in a sequence-dependent manner [1, 9].

We will apply in the following analysis the special-case of the common intensity asymptote for all probes of the chip according to Eq. (1). Possible consequences of deviations from this assumption for the data analysis will be addressed in a separate study.

### Matched and mismatched microarray probes

_{p}> 1. Contrarily, for the ratio of the non-specific binding constants one gets n

_{p}< 1 for purines (Adenine, Guanine) and n

_{p}> 1 for pyrimides (Thymine, Cytosine) in the middle of the PM sequence owing to the purine-pyrimidine asymmetry of Watson-Crick (WC) base-pair interactions in RNA/DNA duplexes [13, 24]. Hence, the parameters s

_{p}and n

_{p}specify the PM/MM-affinity gain of a selected probe pair upon specific and non-specific binding, respectively. Both, PM and MM probes obey the hyperbolic adsorption isotherm, Eq. (1) [18]. With Eq. (4) one obtains for the binding strengths of the PM and MM probes

This S/N-ratio, R, provides the specific binding strength of the PM in units of the non-specific one. It can serve as a relative measure of the expression degree because it is directly related to the concentration of specific transcripts, [S]_{p}. It scales the expression degree in a probe-specific fashion.

_{p}(R = 0), to I

_{p}(R = ∞) = M

_{c}, at small and large abscissa values, respectively. The respective probes referring to these limiting cases are either exclusively non-specifically hybridized or completely saturated with surface coverages of Θ

_{p}

^{PM}(0) = X

_{p}

^{PM,N}/(1 + X

_{p}

^{PM,N}) ≈ X

_{p}

^{PM,N}and Θ

_{p}

^{PM}(∞) = 1, respectively. The concentration and S/N-ratio referring to the inflection point of the isotherm at half-way between these values are

respectively. They specify the condition at which 50% of the free probes available in the absence of specific transcripts become occupied. The approximations at the right-hand side of Eq. (7) refer to small X_{p}^{PM,N} << 1.

_{p}

^{MM}(0) = X

_{p}

^{PM,N}/(n

_{p}+ X

_{p}

^{PM,N}) ≈ X

_{p}

^{PM,N}/n

_{p}. The isotherm of the MM is clearly shifted to larger abscissa values in the intermediate R-range owing to the smaller binding strength for specific hybridization (s

_{p}> 1, see above). For the inflection point of the isotherm one obtains in analogy to Eq. (7)

which shows that the horizontal shift between the PM- and MM-isotherms is log(s_{p}) and log(s_{p}/n_{p}) in the log-scale of [S] and R, respectively.

### The delta- and sigma-transformations

The MM probes were designed as reference for estimating the non-specific background contribution to the respective PM intensity [2, 5, 25]. The "simple" subtraction of the MM-intensity from that of the PM however partly failed as correction method because both probes differently respond to non-specific and specific hybridization due to their complementary middle bases which, for example, gives rise to negative PM-MM intensity differences [15].

According to the Langmuir model, the behaviour of the PM and MM can be understood on the basis of the same hybridization rules where both probe types however differ with respect to their effective association constants for probe/target dimerization (see above). The intensities of the PM and MM are consequently expected to correlate in a well defined fashion. This mutual relation is determined by the mismatch design of the reference probe, the particular probe sequences and by the concentrations of specific and of non-specific RNA fragments in the sample solution used for hybridization on the particular chip [18].

_{10}is the decadic logarithm). The intensity model predicts for this transformation (see Eqs. (1) and (5))

_{p}

^{P,N}<< 1) the o-terms vanish and the limiting Δ- and Σ-coordinates are given by their start values. The probe-specific exponents in Eq. (10) are defined as

_{p}

^{start}≅ Δ

_{p}(0) and Σ

_{p}

^{start}≅ Σ

_{p}(0) and the exponents

*α*

_{p}and

*β*

_{p}. They were chosen to provide a simple geometrical interpretation of the Δ-vs-Σ trajectory in terms of its start-coordinates and its vertical and horizontal dimension with respect to the start values (see below and Figure 1 and Figure 2).

### The hybridization regimes

Part b and c of Figure 1 show the transformed intensities taken from part a of the figure as a function of the parameter R = R_{p}^{PM}. The course of the log-difference, Δ_{p}(R), can be roughly divided into five regimes which reflect different hybridization characteristics of the PM and the MM probes with increasing degree of specific hybridization (see Figure 1, part b):

(1) **N-regime**: In the non-specific-regime, at small values R → 0, both, the PM and MM nearly exclusively hybridize with non-specific transcripts. Saturation can be typically neglected in this range (B_{p}^{P} ≈ 1, see Eqs. (10) and (11)). The limiting ordinate value for X_{p}^{P,N} << 1 estimates the ratio of the binding constants referring to the respective pair of complementary middle bases in the PM and MM sequences, Δ_{p}(0) ≈ log n_{p} (see Eq. (4)). We will use the approximation of weak non-specific binding throughout the paper.

(2) **mix-regime**: In the subsequent mixed-regime, both, specific and non-specific transcripts significantly contribute to the observed intensity of the probes. The log-difference Δ increases with increasing amount of specific transcripts. The positive slope of Δ_{p}(R) implies ∂Δ/∂R ~ (1-10^{-α}) > 0, and thus *α*_{p} > 0 or equivalently s_{p} > n_{p} (see Eq. (12)). The increase of Δ_{p}(R) consequently reflects the simple fact that the specific binding constant of the PM exceeds that of the respective MM, i.e., K_{p}^{PM,S} > K_{p}^{MM,S}, if one assumes K_{p}^{PM,N} ≈ K_{p}^{MM,N} (see below and Eq. (4)).

**S-regime**: In the specific-regime the probes predominantly hybridize with specific transcripts. As a consequence, Δ

_{p}reaches a maximum at $\text{R}={\text{R}}_{\mathrm{max}}\approx {10}^{0.5\left({\alpha}_{\text{p}}+{\beta}_{\text{p}}\right)}$ with the ordinate value

This rough approximation assumes Δ_{p}(0) << 1 <*β*_{p} and R_{max} >> 1. At conditions of weak saturation Eq. (13) simplifies with 0.5 (*β*_{p} - *α*_{p}) >> 1 into ${\Delta}_{p}\left({R}_{\mathrm{max}}\right)\approx {\Delta}_{p}^{Linear}\left(\infty \right)={\Delta}_{p}(0)+{\alpha}_{p}>0$. At these conditions the height of the maximum directly provides the log-transformed PM/MM-ratio of the specific binding constants, *α*_{p}.

(4) **sat-regime**: In the saturation-regime the probes become progressively saturated with bound transcripts (B_{p}^{P} > 1). This effect first and foremost affects the PM due to their higher specific binding constant (see above). As a consequence Δ_{p} starts to decrease.

(5) **as-regime**: At very large expression degrees both, PM and MM reach their maximum intensity upon complete saturation. In this asymptotic-regime the trajectory reaches the abscissa for R → ∞, Δ_{p}(∞) ≈ 0 (see Eq. (10)).

The respective log-sum of the intensities, Σ_{p}(R), is shown in part c of Figure 1. It varies in a similar, sigmoidal fashion as the individual log-intensities of the PM and MM (compare part a and c of Figure 1). Here the mix-, S- and sat-regimes merge into one region of increasing Σ whereas the N- and as-regimes provide the minimum and maximum values, ${\Sigma}_{\text{p}}(0)\approx {\Sigma}_{\text{p}}^{\text{start}}$ and Σ_{p}(∞) = log M_{c}, respectively. With Eqs. (11) and (12) one obtains for the difference

*β*_{p} ≈ Σ_{p}(∞) - Σ_{p}(0) > 0

*β*

_{p}specifies the span between the maximum and minimum Σ-values. The Σ-coordinate of the maximum of Δ

_{p}(R) at R = R

_{max}becomes

Eqs. (15) and (14) provide ${\Sigma}_{\text{p}}\left({\text{R}}_{\mathrm{max}}\right)-{\Sigma}_{\text{p}}(0)\approx \frac{1}{2}\left({\Sigma}_{\text{p}}\left(\infty \right)-{\Sigma}_{\text{p}}(0)\right)>0$, i.e., the maximum of Δ_{
p
}(*R*) roughly bisects the total range of the Σ-coordinate.

### The Δ-vs-Σ trajectory

In the next step we plot the transformed intensities into Δ-vs-Σ coordinates (see Figure 2, part a). This presentation, also known as M-vs-A plot (difference-vs-sum), reflects the binding isotherms of a PM/MM-probe pair. The obtained Δ-vs-Σ trajectory shows a characteristic curved shape with start-, end- and maximum-points referring to the S/N-ratios R = 0, R = ∞ and R = R_{max}, respectively. They consequently define the N-, as- and S-hybridization regimes. The mix- and sat-regimes can be attributed to the increasing and decreasing parts of the Δ-vs-Σ trajectory, respectively.

The parameters *α*_{p} and *β*_{p} define the height and the width of the obtained Δ-vs-Σ curve (see also Eqs. (13) and (14)). The Δ- and Σ-coordinates of the characteristic points depend on the PM/MM-ratios of the binding constants (see Eq. (4)), on the maximum intensity, Σ_{p}(∞) = logM_{c}, and on the mean intensity of the chemical background due to non-specific hybridization, Σ_{p}(0) ∝ log(I_{p}^{PM}(0)) + log(I_{p}^{MM}(0)). Hence, the Δ-vs-Σ trajectory links the observed probe intensities with essential hybridization characteristics in terms of simple geometric parameters.

**The horizontal scale of the Δ-vs-Σ trajectory**

_{p}-Σ

_{p}(∞), estimates the mean probe coverage of the PM and MM probes

_{p}-Σ

_{p}(0) characterizes the relation between the amount of specific and non-specific hybridization in terms of the fraction of specifically occupied binding sites of the respective probe spot

Eqs. (16) and (17) provide mean values averaged over the respective PM/MM-probe pair. The "individual" coverages of the PM and MM probes, Θ_{p}^{PM} and Θ_{p}^{MM}, and the respective fraction of specifically hybridized oligomers, x_{p}^{PM,S} and x_{p}^{MM,S}, in addition depend on the relative Δ-coordinates Δ-Δ(∞) and Δ-Δ(0), respectively (see Eqs. (42) and (45) in the Appendix A).

Part b of Figure 2 shows the surface coverage and the fraction of specifically occupied oligomers for the Δ-vs-Σ trajectory plotted in part a of the figure. Note that x^{P, S} and Θ^{P} exponentially scale with the coordinate differences Σ-Σ(0) and Σ-Σ(∞), respectively (see Eqs. (17) and (16), respectively).

Consequently, the fraction of specifically occupied probes steeply increases in the raising part of the Δ-vs-Σ trajectory (mix-regime) whereas the probe coverage steeply increases in its decaying part (sat-regime). The contribution of non-specific hybridization and/or the effect of saturation of a particular probe can be essentially neglected if the distance of its Σ-coordinate from the start and/or end points exceeds unity. Particularly, one obtains <x_{p}^{S}> > 0.9 for Σ-Σ(0) > 1 and < Θ_{p}> < 0.1 for Σ(∞)-Σ < 1.

The horizontal shift between the PM-and MM-curves in part b of Figure 2 illustrates the "delayed response" of the MM with respect to the specific transcript concentration: The MM reach a certain ordinate-level of the surface coverage and of the fraction of specifically bound probes at larger abscissa values and thus at larger concentrations of specific transcript concentrations than the PM (see also Eqs. (7) and (8)).

For abscissa values Σ < Σ(∞) -1, Eq. (18) simplifies into log(<R> + 1) ≈ Σ-Σ(0). Hence, the Σ-axis nearly linearly scales with the logarithm of the mean S/N-ratio. For the S/N-ratio of the PM, this equation modifies into $\mathrm{log}({\text{R}}_{\text{p}}^{\text{PM}}+1)\approx \left(\Sigma -\Sigma (0)\right)+\frac{1}{2}\left(\Delta -\Delta (0)\right)$ (see Eq. (46) below), i.e., it depends in addition on the vertical coordinate of the Δ-vs-Σ trajectory.

_{p}

^{P}≈ X

_{p}

^{P,S}, see also Eq. (1)) and of their mean

In summary, the position of a probe-point along the Σ-coordinate estimates the hybridization degree of the respective probe spot in terms of relative concentration measures characterizing either the S/N-ratio (Eq. (18)), the relative occupancy of the probe oligomers with specific transcripts (Eq. (17)), their overall degree of occupancy of (Eq. (16)) and the specific binding strength of the considered probe pair (Eq. (19)).

*β*

_{p}(see Eq. (14)), and thus to the horizontal distance between the N- and the as-points. The remaining, not-occupied and thus free oligomers serve as potential binding sites for specific targets, i.e., $\u3008{\Theta}_{\text{p}}^{\text{free}}(\text{R}=0)\u3009=1-{10}^{-{\beta}_{\text{p}}}$. The horizontal dimension of the Δ-vs-Σ trajectory consequently specifies the maximum amount of free probes available for specific binding at R = 0 and thus the measurement range of the probe spots for estimating the expression degree. The narrowing of the model curves reflects the diminishing capacity of the respective probes for specific transcript binding. Figure 3 (part a) illustrates the narrowing of the Δ-vs-Σ trajectory upon increasing the non-specific background contribution. The special ideal case

*β*= -∞ consequently refers to hybridization without non-specific background.

### The vertical scale of the Δ-vs-Σ trajectory

The discrimination score roughly estimates the fraction of the signal due to specific hybridization (see [26]). The approximation on the right hand side of Eq. (20) seems save for small values DS << 1.

The discrimination score serves as the basic parameter in the MAS5-algorithm to calculate the so-called detection call (DC) which judges the "presence" or "absence" of a gene. Hence, the vertical scale of the Δ-vs-Σ trajectory is related to the detection call: the higher the Δ_{p}-value of a probe the higher the probability of the presence of the respective specific transcript in the hybridization solution. We will discuss this point more in detail in the accompanying paper in connection with our alternative method for classifying the genes into present and absent ones (see below).

Here Δ*ε*_{13}^{WC-SC} denotes the dimensionless free energy gain (given in units of the thermal energy, RT) upon replacements of the type B•b^{c} → B^{c}•b^{c} (i.e. WC → SC) for the base B_{p} = A, T, G, C at sequence position 13 of the probe (for example, C•g → G•g; upper case letters refer to the DNA-probe; lower case letters refer to the bound RNA-fragment, b = a, u, g, c; the superscript "c" indicates the respective complement). Accordingly, Δ*ε*_{13}^{WC-WC} is the respective free energy change upon WC-reversals, B•b^{c} → B^{c}•b (for example, C•g → G•c).

Hence, the ordinate position of the starting point of the Δ-vs-Σ trajectory estimates the effective free energy change upon replacing the central base in complementary WC-pairings, i.e. Δ_{p}(0) ≈ -Δ*ε*^{WC-WC}(B_{p}) (see Eqs. (11) and (21)). The relative ordinate value of the maximum is related to the respective free energy change upon replacing the central WC-pairing in the specific PM-duplexes by the respective SC-pairing in the MM-duplexes, i.e. Δ_{p}(R_{max}) ≈ -Δ*ε*^{WC-SC}.

Figure 3 illustrates that the maximum height of the Δ-vs-Σ trajectory starts to decrease for relatively small widths referring to large strengths of non-specific hybridization (*β*_{p} < 3) because saturation onsets almost in the mix-range. In such cases the observed vertical dimension of the trajectory potentially underestimates the height-parameter *α*_{p} (see Eq. (13)) which however can be obtained by appropriate curve fitting using Eq. (10) (see below).

In summary, the Δ-vs-Σ trajectory spans a sort of natural or intrinsic metric system between distinctive points which characterizes the binding thermodynamics of the probes of the particular microarray. The horizontal dimension characterizes the measurement range of the respective probe whereas the vertical dimension reflects the free energy gain due to the change of the central base pairing in the respective duplexes of the PM and MM probes.

### Δ-vs-Σ trajectories of individual probes

Each probe is characterized by its "individual" Δ-vs-Σ trajectory which describes the intensity change upon increasing content of S-transcripts in the range 0 ≤ R ≤ ∞. We used the Affymetrix HG-133 spiked-in data-set to study the R-dependence of selected probes http://www.affymetrix.com/support/technical/sample_data/datasets.affx. This data set was generated by Affymetrix to calibrate the observed intensities on the basis of known transcript concentrations. Particularly, transcripts referring to 42 selected genes were titrated with increased concentration onto a series of chips using the Latin-squares design. The non-specific background was taken into account by adding a HeLa-cell line extract to all hybridizations which does not contain the spiked-in transcripts.

_{p}

^{P,N}(and of logX

_{p}

^{PM,N}, see Eq. (3)) the decrease of the horizontal dimensions,

*β*

_{p}, of the respective probe-trajectory.

Indeed, the increase of the cytosine-content causes the narrowing of the trajectories by the shifting of their start-point, Σ_{p}(0), towards larger abscissa values at invariant Σ_{p}(∞) = const., which is assumed to be constant across all probes because of their common maximum binding capacity. Note that the width of the trajectories and thus the binding strength of the non-specific background varies over about two orders of magnitude, logX_{p}^{PM,N} ≈ -4 to -2, for the six selected probes.

The Δ-coordinates of the starting- and maximum-points of the selected probe-trajectories show considerable variation without obvious correlation to their sequence characteristics. We calculated the trajectories of all spiked-in probes (~500) using the results of our previous analysis of the hybridization isotherms (see refs. [18] and [16] for details) to estimate the variance of the positions of their starting- and maximum-points. The boxplot in part b of Figure 4 visualizes the center and the width of the distributions of the obtained Δ_{p}(0)- and Δ_{p}(R_{max})-data in vertical and horizontal directions.

The respective coordinates of the individual probe-trajectories depend mainly on the particular probe pairings of the middle bases in the non-specific and specific duplexes, respectively (see Eqs. (11) – (13) and (21)). To filter out the underlying sequence effects we calculated "mean" trajectories for all probe pairs with a certain middle base (see Figure 4, part b). These middle-base related mean trajectories are shifted each to another in vertical direction according to C ≈ T > G ≈ A for the N-, and C > G ≈ T > A for the S-point, respectively. This systematic trend is in agreement with Eq. (21) which predicts that the vertical positions of the N- and S-points are functions of the middle base of the respective probe sequences. The observed relations reflect the purine-pyrimidine asymmetry of binding strength of complementary WC-pairings at the N-point (i.e., Δ*ε*_{13}^{WC-WC}(B) ≠ Δ*ε*_{13}^{WC-WC}(B^{c})) and the higher stability of the WC-pairings compared with SC-mismatches at the S-point, Δ*ε*_{13}^{WC-SC}(B) (see Eq. (21)). Note that the specific binding constants of PM exceed that of the MM on the average by the factor of s ≈ 7 whereas for non-specific binding one obtains a mean ratio of n ≈ 1.2.

The comparison of the middle-base related trajectories with the width of the N- and S-boxes indicates that the systematic effect due to the middle-base explains the variability of the Δ_{p}(0)- and Δ_{p}(R_{max})-coordinates in the limits of their 25% and 75% quartiles. The consideration of the nearest neighbors of the middle base further broadens this range: For illustration we show the respective mean trajectories for the "middle triples" CCC and CGC which provide the strongest and weakest binding affinities among the 64 possible combinations of three adjacent bases, respectively (see [18] and [13]).

In summary, the transformed intensity data of individual probes are well described by the Δ-vs-Σ trajectories predicted by the Langmuir-isotherms. The presented data illustrate the probe-specific variability of the Δ-vs-Σ trajectories due to sequence effects. The positions of the start- and maximum-points can be attributed to the differences between the PM and MM probe-sequences which affect the respective binding constants in a middle-base dependent fashion.

## 3. The hook-algorithm for single-chip calibration

The Δ-vs-Σ trajectories describe the behaviour of PM/MM-probe intensities as a function of the RNA-concentration on a relative scale. The analysis of probe-specific trajectories seems not applicable for probe data which are taken from a single chip because each probe pair refers exactly to only one concentration and thus to only one point along its "individual" Δ-vs-Σ trajectory. On the other hand, the large number of probes per chip (and the presence of considerable amounts of specific targets) lets us suggest that their hybridization degrees usually cover the whole potentially possible concentration range. Our idea is to characterize the performance of a particular hybridization experiment in terms of its mean Δ-vs-Σ trajectory averaged over all probes of one particular microarray in analogy to the "individual" Δ-vs-Σ trajectory of each single probe. The horizontal and vertical dimensions of this mean Δ-vs-Σ trajectory are expected to provide the hybridization metrics of the considered chip in terms of characteristic concentration measures (e.g. the mean level of non-specific background or the mean occupancy and saturation level of the probe spots) and of characteristic effective free energy differences due to the used mismatch-design of the probe pairs.

### The raw hook curve

_{p}, versus the set-averaged log intensity of the respective probe set, <Σ>

_{set}, in a first step (see Eq. (9) and part a of Figure 5). Additional set-averaging of the log-difference, <Δ>

_{set}, reduces the scatter width of the data points along the ordinate roughly by the factor of ~ √N

_{set}~ 3 (see the yellow dots in Figure 5). In the next step we smoothed these data by calculating the moving average over a sliding window of N

_{mov}≈ 100 subsequent probe sets along the abscissa to extract the mutual dependence between <Δ>

_{set}and <Σ>

_{set}. The resulting plot is called (raw-) hook-curve because of its typical shape (see part b of Figure 5). Each probe-set is characterized by its "hook"-coordinates, Σ

^{hook}= <Σ>

_{p∈set}and Δ

^{hook}= <<Δ

_{p}>

_{p∈set}>

_{mov}.

The shape of the hook curve basically agrees with that of the Δ-vs-Σ trajectory (compare with Figure 2). We attribute the rising and decaying part and the maximum in-between to the mix-, sat- and S-regimes, respectively, which refer to the mean hybridization level of the respective probes on the chip. The hook-curve obviously does not reach the asymptotic as-regime with Δ(∞) ≈ 0. This result does not surprise because complete saturation is usually circumvented by reasonably chosen hybridization conditions. Otherwise the probes completely lose their sensitivity to detect changes of the transcript concentration [14].

At small abscissa values the hook curve starts with a virtually horizontal part (slope(N) < 0.1) which is separated from the subsequent mix-range (slope(mix) > 0.6) by a distinct break-point. In the appendix we show that the observed initial tiny slope can be explained by the relative strong correlation between the intensities of non-specifically hybridized PM- and MM-probes in the absence of specific transcripts (*ρ* > 0.7, see Eq. (49)) which in turn seems reasonable in view of the nearly equal sequences of both probe-types which suggest similar variations of the sequence-specific bindings strengths from probe pair to probe pair. On the other hand, for the mix-region the hybridization model predicts a considerably increased slope of slope(mix) ≈ 1.15·*α*_{p} ≈ 0.9 > slope(N) ≈ 0.8 (see Eqs. (47) and (49) with·*α* > 0.8 and *ρ* > 0.7, respectively). Hence, the break in the course of the hook-curve can be explained by the onset of specific hybridization for probes (and probe-sets) located on the right from this point.

Accordingly, the break-point between the N- and mix-ranges was used to classify the probe-sets into two sub-ensembles: (i) "absent"-ones for Σ^{hook} < Σ^{break}, which are assumed to hybridize predominantly non-specifically owing to the absence of specific transcripts in the hybridization solution (R = 0); and (ii) "present"-ones for Σ^{hook} > Σ^{break}, which significantly hybridize specifically owing to the presence of specific transcripts (R > 0).

The exact position of the break point of a particular hook curve was estimated by a simple algorithm based on the joint least-squared fit of the Δ^{hook}-data to a linear function for Σ^{hook} < Σ^{break}, and a quadratic function for Σ^{hook} > Σ^{break} (see Figure 5). The algorithm systematically varies the position of the break, Σ^{break} finally returning the optimum value of Σ^{break} which best fits the data.

### Sensitivity-corrected intensity-data and sensitivity profiles

_{set}

^{hook}, Δ

_{set}

^{hook})-coordinates of a probe in terms of its hybridization degree. In the next step we therefore correct the intensities for probe specific effect according to

where I^{P}_{0,p} denotes the corrected intensity of probe p. The sequence-specific incremental contribution, *δ* A_{p}^{P} (P = PM, MM), the so-called sensitivity of the probe, additively splits into two terms due to specific and non-specific hybridization, *δ* A_{p}^{P,N} and *δ* A_{p}^{P,S}, which are weighted by the fraction of non-specifically and specifically hybridized oligomers, ${x}_{p}^{P,N}=\mathrm{min}({L}_{p}^{P}(0)/{L}_{p}^{P},1)$ and x_{p}^{P,S} = (1 - x_{p}^{P,N}) with logL_{p}^{P}(0) ≈ logI_{p}^{P}(0) = <logI_{p}^{P}>_{p∈N} + *δ* A_{p}^{P,N} and L_{p}^{P} = I_{p}^{P}/(1-I_{p}^{P}/M_{c}), respectively. The brackets <...>_{p∈N} denote averaging over all probes from the N-range of the hook curve.

*δ*A

_{p}

^{P,S}and

*δ*A

_{p}

^{P,N}(P = PM, MM), respectively. They were estimated using either the so-called single-nucleotide (SN, m = 1), the nearest neighbour (NN, m = 2) or the next to nearest neighbour (NNN, m = 3) model. This approach additively decomposes the increments

*δ*A

_{p}

^{P,h}into positional and sequence-dependent sensitivity terms,

*δε*

_{k}

^{P,h}(b

_{m}) (m = 1,2,3) referring either to single nucleotides (b

_{1}= B), to adjacent duplets (b

_{2}= BB') or triplets (b

_{3}= BB'B"; B,B',B"' = A, T, G, C; see also [27]),

with *δ*_{k}(b_{m}, *ξ*^{P}_{k,m}) = 1 for b_{m} = *ξ*^{P}_{k,m} and *δ*_{k}(b_{m}, *ξ*^{P}_{k,m}) = 0, otherwise. Here *ξ*^{P}_{k,m} denotes a subsequence of m consecutive bases starting at position k of the respective probe sequence. Each set of sensitivity terms consequently comprises 25 × 4 = 100, 24 × 16 = 384 and 23 × 64 = 1472 parameter values for the SN, NN and NNN models, respectively.

Altogether, the intensity-correction requires four sets of such profiles, *δε*_{k}^{P,h}(b_{m}), referring to P = PM, MM and h = N, S, respectively. We used weighted multiple linear regression of the normalized intensity data taken from selected sub-ensembles of probe sets to estimate the required sensitivity-profiles (see Appendix C and [18]). These sub-ensembles were chosen to comprise probes which are predominantly hybridized with non-specific (p ∈ N) or specific (p ∈ S) transcripts for estimating *δε*_{k}^{P,N}(b_{m}) and *δε*_{k}^{P,S}(b_{m}), respectively. Probes of the former sub-ensemble are taken from the initial horizontal part of the hook curve which was assigned to the N-regime meeting the condition Σ_{p}^{hook} < Σ^{break} (see Figure 5). The respective absent-probes are assumed to bind predominantly non-specific transcripts.

The second sub-ensemble of predominantly specifically hybridized probes (p ∈ S) was selected according to the condition Σ_{set}^{hook} > Σ^{80%} where Σ^{80%} defines a threshold referring to <x^{S}> > 0.8, i.e. to probe spots with a fraction of specifically-hybridized oligomers of at minimum 80%. According to Eq. (17) the respective probes are selected to meet the "horizontal distances"-criterion with respect to the break-point (Σ^{80%} - Σ^{break}) > -log(0.2) ≈ 0.7. The somewhat arbitrary choice of <x^{S}> > 0.8 turned out to be not very crucial with respect to the quality of the obtained correction. On one hand the higher the value of <x^{S}> the smaller the residual contribution of non-specific hybridization in the selected sub-ensemble and the better the obtained sensitivity profiles characterize specific binding [16]. On the other hand, the number of probe sets in the sub-ensemble decreases with increasing <x^{S}> giving rise to more noisy sensitivity profiles and thus to less precise correction terms [16]. The chosen value of <x^{S}> >0.8 provides a good compromise (see Figure 5).

Importantly, for perfect selection of the N-subset one expects the same sensitivity profiles for the PM and MM probes, since the nonspecific background bears no particular relationship to any of both kinds of probes. The profiles shown in Figure 6 confirm this prediction. The degree of similarity of the N-profiles of the PM and MM probes thus provides a criterion for the proper selection of the N-subsets.

### The corrected hook curve

In the next step, the corrected intensity values were transformed into (Δ_{0,p}, Σ_{0,p})-coordinates using the corrected intensities in Eq. (9). Subsequently the (Δ_{0,p}, Σ_{0,p})-data were set-averaged and smoothed in the same fashion as described above for the non-corrected intensities to get the sensitivity-corrected version of the hook-curve with the coordinates (Σ_{0}^{hook}, Δ_{0}^{hook}).

_{0}

^{hook}, Δ

_{0}

^{hook})-coordinates of each probe set with possible consequences for its classification into absent and present ones. In a second iteration we therefore re-calculated the four sensitivity profiles on the basis of corrected hook-curve by applying the same criteria for probe selection as above, i.e., Σ

_{0}

^{hook}< Σ

_{0}

^{break}and Σ

_{0}

^{hook}> Σ

_{0}

^{80%}for the N- and S-profiles, respectively. Typically the re-calculated sensitivity profiles only marginally change compared with the profiles which were obtained on the basis of the uncorrected hook curve indicating the relative robustness of the chosen classification criteria.

The detailed comparison between both versions of the hook-curve reveals subtle differences especially of the initial N-regime: After correction it changes slope and, more importantly, considerably narrows in horizontal direction by roughly 50% (compare Figure 7 and Figure 5). The steeper slope of the N-region after correction reflects the reduced cross correlation between the corrected PM- and MM-probe level data (see appendix B). Typically, the coefficient of correlation decreases from values *ρ* ≈ 0.7–0.9 of the uncorrected data to *ρ* ≈ 0.4–0.7 after correction which gives rise to the change of the slope by the factor of 1.5 – 3 (see Eq. (49)).

The narrowing of the N-range results from the partial removal of sequence effects which, to a large degree, cause the variability of probe signals [14]. Ideally, the correction of the intensities for sequence specific effects is expected to shrink the N-region into one point referring to one and the same corrected background intensity for all probes of the chip (see Eqs. (11) and (3)). The residual width of the N-region reflects the deficiency of the method due to at least three, essentially undistinguishable effects: Firstly, the "fit"-error of the position-dependent sensitivity model which only incompletely corrects the intensity data for sequence effects caused, e.g. by the specific folding of probe or target and/or bulk dimerization between different targets [1]; secondly, the "N-concentration" error due the simple assumption to consider all non-specific transcripts as one effective species (Eq. (3)) and thirdly, the "classification" error of the simple break-criterion which imperfectly distinguishes between "present" and "absent" probes due to its limited specificity.

### Fit of the hybridization model

_{0}

^{hook}, Δ

_{0}

^{hook})-coordinates. It shows essentially the same properties as the Σ-vs-Δ trajectory of a single probe except the N-regime. For quantitative analysis we fit the hybridization model introduced above (Eq. (10)) to the corrected hook data in the mix-, S-, sat- and as-ranges (i.e. at Σ

_{0}

^{hook}> Σ

_{0}

^{break}, see Figure 8). The least-squares gradient descent algorithm searches for optimum values of

*α*

_{c},

*β*

_{c}and Σ

_{c}(0) ≈ Σ

_{0}

^{break}and Δ

_{c}(0) ≈ <Σ

_{0}

^{hook}>

_{p∈N}. Note that the break position Σ

_{0}

^{break}slightly deviates from the centre of the N-range <Σ

_{0}

^{hook}>

_{p∈N}because of the width of the N-regime (see Figure 7). We define the total width of the hook curve as ${\beta}_{c}={\Sigma}_{c}\left(\infty \right)-{\u3008{\Sigma}_{0}^{hook}\u3009}_{p\in N}$.

*α*

_{c}and

*β*

_{c}, characterize the vertical and horizontal dimensions of the corrected hook curve of chip "c" (see Figure 7 and Figure 8) which in turn are related to the mean binding characteristics averaged over all probes of the chip, i.e., $\mathrm{log}{\text{K}}_{\text{c}}^{\text{P},\text{h}}\approx {\u3008\mathrm{log}{\text{K}}_{\text{p}}^{\text{P},\text{h}}\u3009}_{\text{chip}}$ (P = PM, MM, h = N, S). The sensitivity-corrected S/N-ratio

scales the concentration ratio of specific and non-specific transcripts with the respective ratio of chip-averaged mean binding constants of the PM for specific and non-specific binding. Note that this scaling is common for all probes of the chip in contrast to the probe-specific scaling of R_{p}^{PM} in Eq. (6).

In summary, the hook-curve analysis of the intensity data thus provides chip characteristics such as the mean contribution of the non-specific background, the asymptotic intensity maximum and the mean PM/MM-sensitivity gain, and, in addition, the expression degree on probe-set basis in terms of the S/N-ratio.

### Signal distributions

_{0}

^{hook}) = ΔN/(N

_{tot}·ΔΣ

^{hook}·) (here ΔN/N

_{tot}is the fraction of probe-pairs found per abscissa interval ΔΣ

^{hook}). About 40% of the probes fall in the N-range in this example and thus are classified "absent". We separately plot the density distribution of these absent probes as a function of their corrected log-intensities (part b of Figure 9). The obtained PM- and MM-probe-level data are well described by normal distributions (P

_{p}

^{N}(x,

*σ*

_{c}) = N(

*μ*

_{c},

*σ*

_{c}) with $x\equiv {x}_{p}={\left(\mathrm{log}{I}_{0,p}^{P}-{\mu}_{c}^{P}\right)}_{p\in N}$) of with virtually the same width (

*σ*

^{PM}≈

*σ*

^{MM}) and slightly shifted centres, ${\mu}_{\text{c}}^{\text{PM}}>{\mu}_{\text{c}}^{\text{MM}}({\mu}_{c}^{P}={\u3008\mathrm{log}{I}_{0,p}^{P}\u3009}_{p\in N})$.

The total distribution decays exponentially to a good approximation at Σ_{0}^{hook} > Σ_{c}(0) in many cases (part a of Figure 9). Define the parameter $r\equiv \mathrm{log}{X}_{set}^{PM}-{\mu}^{PM}=\mathrm{log}(R+1)$, related to the S/N ratio R.

The exponential behaviour is approximately preserved with respect to this parameter in the mixed and S ranges since $r\approx \left({\Sigma}^{hook}-{\Sigma}_{c}(0)\right)+\frac{1}{2}\left({\Delta}^{hook}-{\Delta}_{c}(0)\right)$ from Eq. (18) (see part c of Figure 9). Note that r directly relates to the binding strength of specific hybridization in the relevant range of intermediate and large R-values, r ~ logR ~ log(X_{set}^{PM,S}). The initial value of the distribution at r = R = 0 provides the fraction of virtually absent probes, f^{absent}, whereas the area under the normalized distribution for r > 0 consequently gives the fraction of present probes, f^{present} = (1 - f^{absent}).

In the intermediate range the signal is approximated by a constant with f_{2}^{present} = 1 - (f^{absent} + f_{1}^{present}) (f^{present} = f_{1}^{present} + f_{2}^{present}) whereas for r = 0 one gets the fraction of absent probes. The decay constant *λ*_{r} defines the r-range which decreases the probability of the specific signal by one order of magnitude. It thus defines a characteristic S/N-ratio (in logarithmic scale) of the chip which characterizes log-ratio of the mean specific and non-specific signals, or in other words, to which extend the specific signal exceeds the non-specific one. The S(pecific)/N(onspecific)-ratio thus can be also interpreted as a sort of S(ignal)/N(oise)-ratio of a given probe set. The value of the r-related decay constant *λ*_{r} slightly exceeds that of the hook-related distribution owing to the larger range covered by the PMonly S/N-ratio (*λ*_{r} = 1.0 versus *λ* = *λ*_{Σ} = 0.75; compare part c and a of Figure 9).

An estimate of the mean decay constant can be obtained by simple averaging over the respective R-range

⟨*λ*⟩ ≈ (⟨log(*R* + 1)⟩_{R > R'}·In 10)

Note that the exponential decay in Eq. (28) is equivalent with the power law, 10^{-r/λ r}= (R+1)^{-1/λ r}, which has been shown to describe the probability distribution of expression values in a series of microarray experiments [28–31]. This power-law function is known as the Zipf's law, observed in many natural and social phenomena. The presence of such power-law function in principle prevents an intrinsic cut off point between "on" genes and "off" genes. The analysis of expression data in terms of their probability distribution therefore is of basic importance for judging the expression level in terms of global characteristics. The hook-transformation provides such characteristics in terms of the signal distribution along the Σ^{hook} and/or r coordinates for the mix-, S- and sat-hybridization ranges.

Let us now consider the S/N-ratio of a particular probe at two conditions: (i) with x = 0, referring to the non-specific binding strength in the centre of the normal background distribution, X_{p}^{PM,N} ≈ 10^{
μ
}/M_{c}:${R}_{p}={M}_{c}\cdot {X}_{p}^{PM,S}/{10}^{\mu}$; and (ii) with x ≠ 0, i.e. X_{p}^{PM,N}(x) ≈ 10^{μ+x}/M_{c} and $R(x)={M}_{c}\cdot {X}_{p}^{PM,S}/{10}^{x+\mu}$.

After transformation into logarithmic scale and combination of both situations we get log *R*(*x*) = log *R*_{
p
}- *x* and for R > 1 to a good approximation *r*(*x*) ≈ *r* - *x*.

*μ*+ x. Integration over the whole possible backgound range provides the probability for detecting the signal r for probe p,

Note that the upper integration range is effectively restricted to x ≤ r because of p^{S}(r-x,*λ*) = 0 for x > r (see Eq. (28)). Eqs. (30)–(31) define the convolution product of the background and specific-signal distributions. A similar approach was used in ref. [30] and partly also in the RMA-algorithm [29].

### De-saturation of the signal and convolution based expression estimates

_{c}:

*α*

_{c}(compare with Figure 8). This "de-saturated" signal additively decomposes into contributions due to non-specific and specific hybridization (Eq. (2)). We now calculate the expression value as the weighted "glog"-mean of the specific signal,

Particularly, Eq. (33) corrects the linearized signal (Eq. (32)) for the probe-specific background contribution using the mean background of the chip and the probe-specific sensitivity for non-specific hybridization, ${L}_{p}^{P,N}(x,{\mu}_{c}^{P})={10}^{\delta {A}_{p}^{P,N}+{\mu}_{c}^{P}+x}$ (Eq. (23)). The use of the generalized logarithm, $\text{g}\mathrm{log}(\text{x})\equiv \mathrm{log}\left(\frac{1}{2}(\text{x}+\sqrt{{\text{x}}^{2}+\text{c}})\right)$, for data transformation is advantageous in two respects: Firstly, it enables processing negative arguments which may appear due to data scattering and, secondly, it "stabilizes" the variance at small values of x provided that the transformation parameter c > 0 is properly chosen [32, 33].

In Eq. (33) the glog-transformed data are averaged over the probability distribution ${p}_{p}^{P}(x,r-x)$ representing the convolution product of the mutually independent probabilities of the non-specific Gaussian background and the specific expression signal (see previous section and [30]). This approach assumes that the probability distribution of each individual transcript is the same as the mean distribution averaged over all transcripts. Alternatively one can use the un-weighted kernel with p^{S} = const. In this case Eq. (33) restricts the averaging to the background distribution.

In the next step, the obtained values are corrected for the sequence effects according to ${Z}_{0,p}^{P}={Z}_{p}^{P}-\delta {A}_{p}^{P,S}$ using the positional-dependent sensitivity model and the Z_{p}^{P} values instead of the log-intensities for parameterization (Eq. (50)). Then, the probe-level data are summarized for each probe set by calculating the Tukey-biweight median, ${Z}_{0,set}^{P}=TB{({Z}_{0,p}^{P})}_{p\in set}$[25], and finally by transforming them into linear scale to get the probe set-level expression degree, ${L}_{set}^{P,S}={10}^{{Z}_{set}^{P}}$. Eq. (33) thus provides "PMonly" and "MMonly" estimates of the expression degree with P = PM and MM, respectively.

with $\Delta {L}_{0,p}=\left({L}_{0,p}^{PM}-{L}_{0,p}^{MM}\right)$ and $\Delta {L}_{p}^{N}(x,{\mu}_{c}^{P})=\left({L}_{p}^{PM,N}(x,{\mu}_{c}^{PM})-{L}_{p}^{MM,N}(x,{\mu}_{c}^{MM})\cdot {10}^{-x\cdot (1-\rho )}\right)$

Eq. (34) uses the background distribution of the PM and the conditional expectation of the bivariate normal distribution for the argument of the MM-background which gives rise to an effectively shrunken integration variable, ${\mu}_{c}^{MM}+x\to {\mu}_{c}^{MM}+{\rho}_{c}x$ (here *ρ*_{c} is the coefficient of correlation, see appendix B). Note that this approach is to some degree similar to the GC-RMA algorithm [34] which however uses pseudo-MM representing subsets of PM-probes of the same GC-content as the considered PM. A second difference to Eq. (34) is that GC-RMA directly subtracts the (non-linear) MM-intensity without explicit consideration of the non-specific background.

In this context we stress the fact that the non-specific background intensity is rather a variable contribution which progressively decreases with increasing amount of specific binding than a constant (see appendix A and the figure shown there). The background correction of the intensity (as realized, e.g. by GC-RMA) neglects this trend and therefore it is expected to over-correct the signal in the S- and as-hybridization ranges. As a consequence, this approach underestimates the expression degree due to two reasons: Firstly, this over-correction of the background and, secondly, the neglect of saturation. On the other hand, the linearized signal used here (Eq. (32)) corrects the intensity for saturation and, in addition, it contains a constant background level corrected by simple subtraction in Eqs. (33) and (34).

Note that the PM-estimate exceeds the MM signal roughly by the factor s_{c} ≈ 5 – 10 at comparable variance of the residual background distribution (see Figure 9, part b). For the coefficient of variation, CV ≈ *σ*/S, one gets roughly CV(PM) ≈ CV(PM-MM) << CV(MM). The MM consequently are expected to provide considerably less accurate expression estimates compared with the respective PMonly and PM-MM values.

The characteristic expression index (or exponent of specific binding) *φ*_{c} complements the respective exponents for non-specific hybridization *β*_{c} and the characteristic S/N-index *λ*_{r} as the basic chip summary-characteristics of specific and non-specific binding.

### Chip characteristics and expression estimates in natural units

_{set}= 1 consequently defines the condition of half-coverage to a good approximation (see Eq. (7)). Analogously, the horizontal dimensions of the hook-curve provide the non-specific binding strength as a measure of the non-specific background in units of the respective binding constant (see Eq. (25)),

It specifies the mean occupancy of the probes in the absence of specific transcripts, Θ^{PM}(0) ≈ X_{c}^{PM,N}.

Hence, the hook method measures both, the abundance of specific and non-specific transcripts in the hybridization solution in chip-related units such as the relative occupancy of the probe spots and the respective binding strengths.

The specific and non-specific signals are combined into the S/N-ratio, R (Eq. (26)), which provides the expression degree in units of the non-specific background contribution in the particular chip experiment. The S/N-ratio is directly related to the hook-coordinates of a selected probe set and thus it can be roughly deduced by visual inspection of the particular hook curve (Eqs. (46) and (18)). Figure 8 illustrates the relation between the intrinsic expression measures and the hook coordinates for a typical microarray hybridization. The point of half coverage (Θ^{PM} = 0.5, X^{PM, S} ≈ 1) is found beyond the maximum in the sat-regime. Virtually no probe of the chosen chip meets this condition. Note that Θ^{PM} scales with the distance relative to the end point referring to R → ∞ (see above). Vice versa, the mean fraction of specifically hybridized oligomers, x^{PM, S}, scales with the distance relative to the N-point. It steeply increases in the mix-regime and reaches the conditions at which 50% of the bound transcripts belong to the specific ones, x^{PM, S} = 0.5, at relatively small abscissa-values. Hence, specific hybridization starts to dominate over non-specific one always at the beginning of the mix-range. The S-range near the maximum of the hook-curve refers to probes with a 50 – 100 fold excess of specific hybridization, R^{PM} ≈ 50 – 100. The width of the hook of about *β* = 2.7 is equivalent with the background strength of N_{c} ≈ 10^{-2.7} = 0.002 which in turn rescales the S/N-ratio into binding strengths, for example S_{c} = 0.1–0.2 for R = 50–100. These rough estimation shows that the maximum of the hook is equivalent with the occupancy of the PM-spots in the order of 10 – 25%.

The mean binding constants, K_{c}^{PM,S} and K_{c}^{PM,N} and thus also the used measures of specific and non-specific hybridization depend on the particular probe and chip design (e.g., the length of the probe-oligonucleotides, their density and the type of the mismatches used for the MM probes). Consequently they are specific for the used chip type, on one hand. On the other hand, also the conditions of a particular hybridization affect the K_{c}^{PM,h} because their values depend on all processes affecting the binding reaction between the probe-oligonucleotides and the targets. For example, the composition of a particular sample will affect K_{c}^{PM,N} and K_{c}^{PM,S} as well, because both constants depend on the extent of target-dimerization which is a function of the concentrations of the reacting species in the hybridization solution [1].

## 4. Summary and Conclusion

The improvement of microarray calibration methods in combination with the development of meaningful quality standards is an essential prerequisite for obtaining absolute expression estimates which in turn are required for the quantitative analysis of transcriptional regulation. In this publication we present a new method of microarray data analysis based on a physical model. This so-called hook method pre-processes the raw intensity data for further downstream analyses on one hand, and, on the other hand, provides chip characteristics with potential applications in hybridization quality control and array normalization.

Hook method: algorithm and output characteristics

Step | Output (chip and probe-set characteristics) | Eq. |
---|---|---|

1) optical background correction using the Affymetrix zone-algorithm (see Ref. [26]). | Optical background (O, mean over all zones); the algorithm uses a scaled value with a scaling factor chosen between 0 and 1 | (1) |

2) Raw hook: Plot of the PM and MM probe intensity data into Δ-vs-Σ coordinates and smoothing over a sliding-window of ~100 probe sets. Classification into N- and S-probes using the breakpoint of the hook. | Raw hook curve | (9) |

3) Parameterization of the positional dependent sensitivity-model separately for the PM and MM in the N and S-ranges and correction of the intensities for probe-specific sensitivities. | Sensitivity profiles (optional SN, NN or NNN models) | (22), (23); App. C |

4) Corrected hook: re-iterate steps (2)–(3) with the corrected intensities to improve the sensitivity correction and the classification of the probes into absent and present ones | Corrected hook-curve Fraction of absent probes (%N), mean N-background level and width of the N-range | (27) |

5) Fit of the hook-equation to the mix-, S-and sat-ranges of the corrected hook curve and analysis of the probe-level hook coordinates | Maximum intensity (M | (10), (12), (24) – (26), (29); App. D, App. A |

6) Calculation of probe-set related expression estimates (alternatively PMonly or PM-MM) by the joint processing of the intensity data and selected chip characteristics which corrects for the non-specific background, sequence-specific sensitivity and saturation | Expression measures (L | (33) – (36) |

The obtained hook-curve can be interpreted as a special representation of the binding isotherm where the explicit dependence of the probe intensities on the (usually unknown) transcript concentrations is replaced by the (experimentally available) relation between the PM- and the MM-probe intensities. It enables clear differentiation between different, well-defined regimes and it provides a set of chip summary characteristics which evaluate the performance of a given hybridization in terms simple parameters such as the mean non-specific background intensity, its saturation value, the mean PM/MM-sensitivity gain and the fraction of absent probes. The hook curve spans a natural metrics system for the expression estimates which reflects essential hybridization characteristics in terms of its geometric dimensions, width, height and "start"-coordinates.

The obtained single chip characteristics in combination with the sensitivity corrected probe-intensity values provide expression estimates scaled in natural units given by the binding constants of the particular hybridization. This way the method corrects the raw intensities for the non-specific background hybridization in a sequence-specific manner, for the potential saturation of the probe-spots with bound transcripts and for the sequence-specific binding of specific transcripts.

In the accompanying publication we illustrate the performance and potential applications in terms of quality control and expression analysis using a series of selected chip-types, hybridization conditions and benchmark experiments [19].

The beta-version of the hook-program can be downloaded from http://www.izbi.de. The stand-alone JAVA program processes single-chips and chip-series in a batch-mode according to the scheme given in Figure 11 and Table 2. Chip and probe-set related characteristics such as expression degrees, hook-curves and sensitivity profiles are exported in html- as well as in tabular form and jpg-graphics.

## 5. Appendix

### A. Concentration scales of the Δ-vs-Σ-trajectory: Derivation of Eqs. (16) – (18)

^{ΣlogΘ}. Rearrangement provides the surface coverage of the PM and MM probes as a function of the relative Σ- and Δ-coordinates,

^{ P }= Θ

^{ P,N }+ Θ

^{ P,S }, with ${\Theta}^{P,h}=\frac{{X}^{P,h}}{1+{X}^{P}}$ (see Eq. (2)). Both contributions are functions of the composition of the hybridization solution expressed in terms of the S/N-ratio R (see Eq. (5)),

^{PM, N}for the PM (and at R < (n/(s·X

^{PM, N}) for the MM, see Figure 12). Specific transcripts of large binding strength progressively replace the bound non-specific fragments with further increasing values of R. As a consequence the non-specific coverage Θ

^{P, N}drops and finally disappears for R → ∞.

and Eq. (17) for the mean fraction, <x^{S}>. Here we make use of ${\Theta}^{\text{PM},\text{N}}(0)={10}^{-\beta -\frac{1}{2}\mathrm{log}\text{n}}$ and ${\Theta}^{\text{MM},\text{N}}(0)={10}^{-\beta +\frac{1}{2}\mathrm{log}\text{n}}$ (see Eqs. (42) and (14)).

and Eq. (18) for the respective average $\u3008\text{R}\u3009\equiv {10}^{0.5\left(\mathrm{log}{\text{R}}^{\text{PM}}+\mathrm{log}{\text{R}}^{\text{MM}}\right)}$.

### B. The slope of the hook-curve in the mix- and the N-ranges

^{hook}< Σ

^{break}, are predominantly hybridized with non-specific transcripts (N-range) whereas the probes from probe-sets with abscissa positions to the right from the break, Σ

^{break}< Σ

^{hook}, in addition, bind a certain amount of specific transcripts (mix-range). The slope of the hook curve in this mix-range reflects the change of the hook coordinates with increasing concentration of specific transcripts. From Eq. (10) one gets after differentiation with respect to the S/N-ratio R in the limit of R → 0,

The initial slope of the mix range essentially depends on the "height"-parameter *α*_{c} and thus on the PM/MM-ratio of the specific binding constants (see Eqs. (13) and (4)).

Here *σ*_{PM}^{2}, *σ*_{MM}^{2} and *σ*_{PM,MM} denote the variance and covariance of the set-averaged log-intensities of the PM and MM probes in the N-range. The formula for the correlation coefficient *ρ* assumes $<{\sigma}^{2}>\approx {\sigma}_{\text{PM}}^{2}\approx {\sigma}_{\text{MM}}^{2}$ to a good approximation (see Figure 9).

Hence, a tiny slope near zero indicates strongly correlated PM and MM signals (*ρ*~1) whereas the increase of slope(N) reflects "decoupling" of PM and MM intensities.

### C. Fit of the position-dependent sensitivity profiles

*δε*

_{k}

^{P,h}(b

_{m}) in Eq. (23) were estimated using the so-called probe sensitivity [14],

Here f_{k}(b_{m}) is the probability to find the subsequence b_{m} at sequence position k within the probes of a probe set. The weighted least squares algorithm fits Y(theory) to Y(exp) by optimizing the coefficients *δε*_{k}^{P,h}(b_{m}) using single value decomposition [35].

*δε*

_{k}

^{P,h}(BB') (k = 1...24) and the NNN model 64 profiles

*δε*

_{k}

^{P,h}(BB'B") (k = 1...23) per sub-ensemble (N or S) and probe type (PM or MM). To extract the basic trends for a qualitative discussion we reduce the number of data simply by transforming the NN- and NNN-profiles into single-base (SN)-profiles by appropriate averaging

### D. Fit of the Langmuir model to the Δ-vs-Σ data

Equation (53) thus returns the S/N-ratio R for the Σ_{0}^{hook}-coordinate of a probe set and a given set of model-parameters {*α*_{c}, *β*_{c}, Σ_{c}(0), Δ_{c}(0)}. The parametric equation for the hook-ordinate (Eq. (10)) then provides an estimate of Δ_{0}(R) referring to the respective probe set. The fit-algorithm minimizes the sum of least squares between the calculated and measured Δ-values, $\text{SSQ}={\displaystyle \sum _{\text{i}}{\text{w}}_{\text{i}}\cdot {\left({\Delta}_{\text{i},0}({\text{R}}_{\text{i}})-{\Delta}_{\text{i},0}^{\text{hook}}\right)}^{2}}$, by adjusting the parameters *α*_{c}, *β*_{c} and Σ_{c}(0) using an iterative gradient method with linearly decreasing step size. The ordinate value of the start-point was set to the break between the N- and mix-range, ${\Delta}_{\text{c}}(0)={\Delta}_{0}^{\text{break}}$ (see above).

We divide the Σ_{0}^{hook}-axis into i = 10 – 30 equally-spaced sampling points Σ_{0}^{hook} = Σ_{i} by averaging over the probe/probe set data within the respective sampling interval of width *δ* Σ, i.e., ${\Delta}_{\text{i},0}^{\text{hook}}=\u3008{\Delta}_{\text{p},0}^{\text{hook}}\u3009$ for all probes with ${\Sigma}_{\text{i}}-\delta \Sigma \le {\Sigma}_{\text{p},0}^{\text{hook}}<{\Sigma}_{\text{i}}+\delta \Sigma $ which were used in Eq. (53) to calculate the respective S/N-ratios R_{i} and theoretical ordinate points Δ_{i,0}(R_{i}) using Eq. (10). The weighting factor, ${\text{w}}_{\text{i}}\approx \frac{\rho ({\Sigma}_{\text{i}})}{{\sigma}_{\Delta}{({\Sigma}_{\text{i}})}^{2}}$, considers the data-density, *ρ*(Σ_{i}), and variance, *σ*_{Δ}(Σ_{i})^{2}, per interval. Both, *ρ*(Σ_{i}) and *σ*_{Δ}(Σ_{i})^{2}, are given by decaying functions ([30, 32]) which partly compensate each other to a good approximation. By default the weighting factor is therefore set to w_{i} = 1.

## Declarations

### Acknowledgements

HB thanks Conrad Burden from Australian National University in Canberra for useful discussions and advice. The work was supported by the Deutsche Forschungsgemeinschaft under grant no. BIZ 6/4. SP thanks the International Max Planck Research School for Molecular Cell Biology and Bioengineering (IMPRS-MCBB) Dresden for funding.

## Authors’ Affiliations

## References

- Binder H: Thermodynamics of competitive surface adsorption on DNA microarrays – theoretical aspects. J Phys Cond Mat. 2006, 18: S491-S523.View ArticleGoogle Scholar
- Affymetrix: Affymetrix Microarray Suite 5.0. User Guide. 2001, Santa Clara, CA: Affymetrix, IncGoogle Scholar
- Wu Z, Irizarry RA: Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Microarrays. RECOMB'04: 2004. 2004, SanDiego, CaliforniaGoogle Scholar
- Wu Z, Irizarry RA: A Statistical Framework for the Analysis of Microarray Probe-Level Data. John Hopkins University, Dept of Biostatistics Working Paper. 2005, 73: 1-31.Google Scholar
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc Natl Acad Sci USA. 2001, 98 (1): 31-36.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol. 2003, 21: 818-828.PubMedView ArticleGoogle Scholar
- Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys. 2003, 68: 11906-View ArticleGoogle Scholar
- Held GA, Grinstein G, Tu Y: Modeling of DNA microarray data by using physical properties of hybridization. Proc Natl Acad Sci USA. 2003, 100 (13): 7575-7580.PubMedPubMed CentralView ArticleGoogle Scholar
- Heim T, Tranchevent L-C, Carlon E, Barkema ET: Physical-Chemistry-Based Analysis of Affymetrix Microarray Data. J Phys Chem B. 2006, 110: 22786-22795.PubMedView ArticleGoogle Scholar
- Carlon E, Heim T: Thermodynamics of RNA/DNA hybridization in high-density oligonucleotide microarrays. Physica A. 2006, 362: 433-449.View ArticleGoogle Scholar
- Held GA, Grinstein G, Tu Y: Relationship between gene expression and observed intensities in DNA microarrays – a modeling study. Nucl Acids Res. 2006, 34: e70-PubMedPubMed CentralView ArticleGoogle Scholar
- Burden CJ, Pittelkow YE, Wilson SR: Statistical analysis of adsorption models for oligonucleotide microarrays. Stat Appl Genet Mol Biol. 2004, 3: Article35-PubMedGoogle Scholar
- Binder H, Kirsten T, Hofacker I, Stadler P, Loeffler M: Interactions in oligonucleotide duplexes upon hybridisation of microarrays. J Phys Chem B. 2004, 108 (46): 18015-18025.View ArticleGoogle Scholar
- Binder H, Kirsten T, Loeffler M, Stadler P: The sensitivity of microarray oligonucleotide probes – variability and the effect of base composition. J Phys Chem B. 2004, 108 (46): 18003-18014.View ArticleGoogle Scholar
- Binder H, Preibisch S: Specific and non-specific hybridization of oligonucleotide probes on microarrays. Biophys J. 2005, 89: 337-352.PubMedPubMed CentralView ArticleGoogle Scholar
- Binder H, Preibisch S, Kirsten T: Base pair interactions and hybridization isotherms of matched and mismatched oligonucleotide probes on microarrays. Langmuir. 2005, 21: 9287-9302.PubMedView ArticleGoogle Scholar
- Binder H: Probing gene expression – sequence specific hybridization on microarrays. Bioinformatics of Gene Regulation II. Edited by: Kolchanov N, Hofestaedt R. 2006, 451-466. Springer Sciences and Business MediaGoogle Scholar
- Binder H, Preibisch S: GeneChip microarrays – signal intensities, RNA concentrations and probe sequences. J Phys Cond Mat. 2006, 18: S537-S566.View ArticleGoogle Scholar
- Binder H, Krohn K, Preibisch S: "Hook" calibration of GeneChip-microarrays: chip characteristics and expression measures. Algorithms for Molecular Biology. 2008, 3: 11-PubMedPubMed CentralView ArticleGoogle Scholar
- Halperin A, Buhot A, Zhulina EB: Sensitivity, Specificity, and the Hybridization Isotherms of DNA Chips. Biophys J. 2004, 86 (2): 718-730.PubMedPubMed CentralView ArticleGoogle Scholar
- Hekstra D, Taussig AR, Magnasco M, Naef F: Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucl Acids Res. 2003, 31 (7): 1962-1968.PubMedPubMed CentralView ArticleGoogle Scholar
- Burden CJ, Pittelkow YE, Wilson SR: Adsorption models of hybridization and post-hybridization behaviour on oligonucleotide microarrays. J Phys Cond Mat. 2006, 18: 5545-5565.View ArticleGoogle Scholar
- Affymetrix : Affymetrix Microarray Suite 5.0. User Guide. 2001, Santa Clara, CA: Affymetrix, IncGoogle Scholar
- Sugimoto N, Nakano S, Katoh M, Matsumura A, Nakamuta H, Ohmichi T, Yoneyama M, Sasaki M: Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry. 1995, 34 (35): 11211-11216.PubMedView ArticleGoogle Scholar
- Affymetrix : New Statistical Algorithms for Monitoring Gene Expression on GeneChip
^{®}Probe Arrays. Technical Note. 2001Google Scholar - Affymetrix: Statistical Algorithms Description Document. Technical Note. 2002, 28-Google Scholar
- Binder H, Kirsten T, Loeffler M, Stadler P: Sequence specific sensitivity of oligonucleotide probes. Proceedings of the German Bioinformatics Conference. 2003, 2: 145-147.Google Scholar
- Hoyle DC, Rattray M, Jupp R, Brass A: Making sense of microarray data distributions. Bioinformatics. 2002, 18 (4): 576-584.PubMedView ArticleGoogle Scholar
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucl Acids Res. 2003, 31 (4): e15-PubMedPubMed CentralView ArticleGoogle Scholar
- Havilio M: Signal deconvolution based expression-detection and background adjustment for microarray data. J Comput Biol. 2006, 13 (1): 63-80.PubMedView ArticleGoogle Scholar
- Lu T, Costello C, Croucher P, Hasler R, Deuschl G, Schreiber S: Can Zipf's law be adapted to normalize microarrays?. BMC Bioinformatics. 2005, 6 (1): 37-PubMedPubMed CentralView ArticleGoogle Scholar
- Durbin BP, Hardin JS, Hawkins DM, Rocke DM: A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics. 2002, 18 Suppl 1: S105-S110.PubMedView ArticleGoogle Scholar
- Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics. 2002, 18 Suppl 1: S96-S104.PubMedView ArticleGoogle Scholar
- Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F: A Model Based Background Adjustment for Oligonucleotide Expression Arrays. John Hopkins University, Dept of Biostatistics Working Paper. 2004, 1:Google Scholar
- Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical Recipes. 1989, New York: Cambridge University PressGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.