The Δ-vs-Σ trajectories describe the behaviour of PM/MM-probe intensities as a function of the RNA-concentration on a relative scale. The analysis of probe-specific trajectories seems not applicable for probe data which are taken from a single chip because each probe pair refers exactly to only one concentration and thus to only one point along its "individual" Δ-vs-Σ trajectory. On the other hand, the large number of probes per chip (and the presence of considerable amounts of specific targets) lets us suggest that their hybridization degrees usually cover the whole potentially possible concentration range. Our idea is to characterize the performance of a particular hybridization experiment in terms of its mean Δ-vs-Σ trajectory averaged over all probes of one particular microarray in analogy to the "individual" Δ-vs-Σ trajectory of each single probe. The horizontal and vertical dimensions of this mean Δ-vs-Σ trajectory are expected to provide the hybridization metrics of the considered chip in terms of characteristic concentration measures (e.g. the mean level of non-specific background or the mean occupancy and saturation level of the probe spots) and of characteristic effective free energy differences due to the used mismatch-design of the probe pairs.
The raw hook curve
To construct the mean Δ-vs-Σ trajectory we plot the PM-MM log-intensity difference of all probe pairs of a particular chip, Δp, versus the set-averaged log intensity of the respective probe set, <Σ>set, in a first step (see Eq. (9) and part a of Figure 5). Additional set-averaging of the log-difference, <Δ>set, reduces the scatter width of the data points along the ordinate roughly by the factor of ~ √Nset ~ 3 (see the yellow dots in Figure 5). In the next step we smoothed these data by calculating the moving average over a sliding window of Nmov ≈ 100 subsequent probe sets along the abscissa to extract the mutual dependence between <Δ>set and <Σ>set. The resulting plot is called (raw-) hook-curve because of its typical shape (see part b of Figure 5). Each probe-set is characterized by its "hook"-coordinates, Σhook = <Σ>p∈set and Δhook = <<Δp>p∈set>mov.
The shape of the hook curve basically agrees with that of the Δ-vs-Σ trajectory (compare with Figure 2). We attribute the rising and decaying part and the maximum in-between to the mix-, sat- and S-regimes, respectively, which refer to the mean hybridization level of the respective probes on the chip. The hook-curve obviously does not reach the asymptotic as-regime with Δ(∞) ≈ 0. This result does not surprise because complete saturation is usually circumvented by reasonably chosen hybridization conditions. Otherwise the probes completely lose their sensitivity to detect changes of the transcript concentration [14].
At small abscissa values the hook curve starts with a virtually horizontal part (slope(N) < 0.1) which is separated from the subsequent mix-range (slope(mix) > 0.6) by a distinct break-point. In the appendix we show that the observed initial tiny slope can be explained by the relative strong correlation between the intensities of non-specifically hybridized PM- and MM-probes in the absence of specific transcripts (ρ > 0.7, see Eq. (49)) which in turn seems reasonable in view of the nearly equal sequences of both probe-types which suggest similar variations of the sequence-specific bindings strengths from probe pair to probe pair. On the other hand, for the mix-region the hybridization model predicts a considerably increased slope of slope(mix) ≈ 1.15·αp ≈ 0.9 > slope(N) ≈ 0.8 (see Eqs. (47) and (49) with·α > 0.8 and ρ > 0.7, respectively). Hence, the break in the course of the hook-curve can be explained by the onset of specific hybridization for probes (and probe-sets) located on the right from this point.
Accordingly, the break-point between the N- and mix-ranges was used to classify the probe-sets into two sub-ensembles: (i) "absent"-ones for Σhook < Σbreak, which are assumed to hybridize predominantly non-specifically owing to the absence of specific transcripts in the hybridization solution (R = 0); and (ii) "present"-ones for Σhook > Σbreak, which significantly hybridize specifically owing to the presence of specific transcripts (R > 0).
The exact position of the break point of a particular hook curve was estimated by a simple algorithm based on the joint least-squared fit of the Δhook-data to a linear function for Σhook < Σbreak, and a quadratic function for Σhook > Σbreak (see Figure 5). The algorithm systematically varies the position of the break, Σbreak finally returning the optimum value of Σbreak which best fits the data.
Sensitivity-corrected intensity-data and sensitivity profiles
The intensity of a probe and thus also its (Δ, Σ)-coordinates depend on the concentration of RNA-transcripts and on the binding constants for specific and non-specific hybridization as well (see Eq. (5)). These constants are functions of the respective probe sequences giving rise to the scattering of the individual probe intensities over one-to-three orders of magnitude [14]. This variability is not related to changes of the transcript concentration, and thus it introduces considerable uncertainty if one aims at interpreting the (Σsethook, Δsethook)-coordinates of a probe in terms of its hybridization degree. In the next step we therefore correct the intensities for probe specific effect according to
(22)
where IP0,p denotes the corrected intensity of probe p. The sequence-specific incremental contribution, δ ApP (P = PM, MM), the so-called sensitivity of the probe, additively splits into two terms due to specific and non-specific hybridization, δ ApP,N and δ ApP,S, which are weighted by the fraction of non-specifically and specifically hybridized oligomers, and xpP,S = (1 - xpP,N) with logLpP(0) ≈ logIpP(0) = <logIpP>p∈N + δ ApP,N and LpP = IpP/(1-IpP/Mc), respectively. The brackets <...>p∈N denote averaging over all probes from the N-range of the hook curve.
The intensity-correction according to Eq. (22) requires the knowledge of two sensitivity-values for each PM and each MM probe, δ ApP,S and δ ApP,N (P = PM, MM), respectively. They were estimated using either the so-called single-nucleotide (SN, m = 1), the nearest neighbour (NN, m = 2) or the next to nearest neighbour (NNN, m = 3) model. This approach additively decomposes the increments δ ApP,h into positional and sequence-dependent sensitivity terms, δεkP,h(bm) (m = 1,2,3) referring either to single nucleotides (b1 = B), to adjacent duplets (b2 = BB') or triplets (b3 = BB'B"; B,B',B"' = A, T, G, C; see also [27]),
(23)
with δk(bm, ξPk,m) = 1 for bm = ξPk,m and δk(bm, ξPk,m) = 0, otherwise. Here ξPk,m denotes a subsequence of m consecutive bases starting at position k of the respective probe sequence. Each set of sensitivity terms consequently comprises 25 × 4 = 100, 24 × 16 = 384 and 23 × 64 = 1472 parameter values for the SN, NN and NNN models, respectively.
Altogether, the intensity-correction requires four sets of such profiles, δεkP,h(bm), referring to P = PM, MM and h = N, S, respectively. We used weighted multiple linear regression of the normalized intensity data taken from selected sub-ensembles of probe sets to estimate the required sensitivity-profiles (see Appendix C and [18]). These sub-ensembles were chosen to comprise probes which are predominantly hybridized with non-specific (p ∈ N) or specific (p ∈ S) transcripts for estimating δεkP,N(bm) and δεkP,S(bm), respectively. Probes of the former sub-ensemble are taken from the initial horizontal part of the hook curve which was assigned to the N-regime meeting the condition Σphook < Σbreak (see Figure 5). The respective absent-probes are assumed to bind predominantly non-specific transcripts.
The second sub-ensemble of predominantly specifically hybridized probes (p ∈ S) was selected according to the condition Σsethook > Σ80% where Σ80% defines a threshold referring to <xS> > 0.8, i.e. to probe spots with a fraction of specifically-hybridized oligomers of at minimum 80%. According to Eq. (17) the respective probes are selected to meet the "horizontal distances"-criterion with respect to the break-point (Σ80% - Σbreak) > -log(0.2) ≈ 0.7. The somewhat arbitrary choice of <xS> > 0.8 turned out to be not very crucial with respect to the quality of the obtained correction. On one hand the higher the value of <xS> the smaller the residual contribution of non-specific hybridization in the selected sub-ensemble and the better the obtained sensitivity profiles characterize specific binding [16]. On the other hand, the number of probe sets in the sub-ensemble decreases with increasing <xS> giving rise to more noisy sensitivity profiles and thus to less precise correction terms [16]. The chosen value of <xS> >0.8 provides a good compromise (see Figure 5).
Typical SN- and NN-sensitivity profiles of the PM and MM referring to the N- and S-subsets are shown in Figure 6. Note their different shape, especially that of the S-profiles of the MM with the typical "dent" in the middle of the sequence. It is caused by the mismatched base pairing in the specific duplexes of the MM which give rise to changed interaction characteristics compared with WC-base pairings. We refer to our previous papers for the detailed discussion of the sensitivity profiles in terms of base-pair interactions in the respective probe-target duplexes [15, 16, 18]. In the present context it is important to discriminate between the four different sensitivity profiles to properly correct the intensity data for sequence specific effects.
Importantly, for perfect selection of the N-subset one expects the same sensitivity profiles for the PM and MM probes, since the nonspecific background bears no particular relationship to any of both kinds of probes. The profiles shown in Figure 6 confirm this prediction. The degree of similarity of the N-profiles of the PM and MM probes thus provides a criterion for the proper selection of the N-subsets.
The corrected hook curve
In the next step, the corrected intensity values were transformed into (Δ0,p, Σ0,p)-coordinates using the corrected intensities in Eq. (9). Subsequently the (Δ0,p, Σ0,p)-data were set-averaged and smoothed in the same fashion as described above for the non-corrected intensities to get the sensitivity-corrected version of the hook-curve with the coordinates (Σ0hook, Δ0hook).
The sensitivity-correction reduces the scattering of the probe- and probe-set-level intensity data about the smoothed hook-curve (see Figure 7). With the N-, mix-, S- and sat-hybridization regimes it essentially shows the same features as the uncorrected hook curve. The break between the N- and mix-regimes was again used to classify the probe sets into absent and present ones. The sensitivity correction affects the (Σ0hook, Δ0hook)-coordinates of each probe set with possible consequences for its classification into absent and present ones. In a second iteration we therefore re-calculated the four sensitivity profiles on the basis of corrected hook-curve by applying the same criteria for probe selection as above, i.e., Σ0hook < Σ0break and Σ0hook > Σ080% for the N- and S-profiles, respectively. Typically the re-calculated sensitivity profiles only marginally change compared with the profiles which were obtained on the basis of the uncorrected hook curve indicating the relative robustness of the chosen classification criteria.
The detailed comparison between both versions of the hook-curve reveals subtle differences especially of the initial N-regime: After correction it changes slope and, more importantly, considerably narrows in horizontal direction by roughly 50% (compare Figure 7 and Figure 5). The steeper slope of the N-region after correction reflects the reduced cross correlation between the corrected PM- and MM-probe level data (see appendix B). Typically, the coefficient of correlation decreases from values ρ ≈ 0.7–0.9 of the uncorrected data to ρ ≈ 0.4–0.7 after correction which gives rise to the change of the slope by the factor of 1.5 – 3 (see Eq. (49)).
The narrowing of the N-range results from the partial removal of sequence effects which, to a large degree, cause the variability of probe signals [14]. Ideally, the correction of the intensities for sequence specific effects is expected to shrink the N-region into one point referring to one and the same corrected background intensity for all probes of the chip (see Eqs. (11) and (3)). The residual width of the N-region reflects the deficiency of the method due to at least three, essentially undistinguishable effects: Firstly, the "fit"-error of the position-dependent sensitivity model which only incompletely corrects the intensity data for sequence effects caused, e.g. by the specific folding of probe or target and/or bulk dimerization between different targets [1]; secondly, the "N-concentration" error due the simple assumption to consider all non-specific transcripts as one effective species (Eq. (3)) and thirdly, the "classification" error of the simple break-criterion which imperfectly distinguishes between "present" and "absent" probes due to its limited specificity.
Fit of the hybridization model
The corrected hook curve manifests the mean hybridization characteristics of the probes on the particular chip in sensitivity-corrected (Σ0hook, Δ0hook)-coordinates. It shows essentially the same properties as the Σ-vs-Δ trajectory of a single probe except the N-regime. For quantitative analysis we fit the hybridization model introduced above (Eq. (10)) to the corrected hook data in the mix-, S-, sat- and as-ranges (i.e. at Σ0hook > Σ0break, see Figure 8). The least-squares gradient descent algorithm searches for optimum values of αc, βc and Σc(0) ≈ Σ0break and Δc(0) ≈ <Σ0hook>p∈N. Note that the break position Σ0break slightly deviates from the centre of the N-range <Σ0hook>p∈N because of the width of the N-regime (see Figure 7). We define the total width of the hook curve as .
Note that the obtained model parameters are now chip-averaged mean values in contrast to the single probe properties used above. We therefore substitute the probe-index "p" by the chip-index "c". Particularly, one gets the mean PM/MM-ratios of the binding constants for specific and non-specific hybridization
(24)
respectively, and the mean binding strength for non-specific binding of the particular chip,
(25)
in analogy with Eqs. (4) and (11) – (12). Here the "height" and "width" parameters, αc and βc, characterize the vertical and horizontal dimensions of the corrected hook curve of chip "c" (see Figure 7 and Figure 8) which in turn are related to the mean binding characteristics averaged over all probes of the chip, i.e., (P = PM, MM, h = N, S). The sensitivity-corrected S/N-ratio
(26)
scales the concentration ratio of specific and non-specific transcripts with the respective ratio of chip-averaged mean binding constants of the PM for specific and non-specific binding. Note that this scaling is common for all probes of the chip in contrast to the probe-specific scaling of RpPM in Eq. (6).
In summary, the hook-curve analysis of the intensity data thus provides chip characteristics such as the mean contribution of the non-specific background, the asymptotic intensity maximum and the mean PM/MM-sensitivity gain, and, in addition, the expression degree on probe-set basis in terms of the S/N-ratio.
Signal distributions
Part a of Figure 9 shows the probability-density distribution to find a probe pair a at a given position of the hook-abscissa, p(Σ0hook) = ΔN/(Ntot·ΔΣhook·) (here ΔN/Ntot is the fraction of probe-pairs found per abscissa interval ΔΣhook). About 40% of the probes fall in the N-range in this example and thus are classified "absent". We separately plot the density distribution of these absent probes as a function of their corrected log-intensities (part b of Figure 9). The obtained PM- and MM-probe-level data are well described by normal distributions (PpN(x, σc) = N(μc, σc) with ) of with virtually the same width (σPM ≈ σMM) and slightly shifted centres, .
One obtains the coordinates of the centre of the N-region of the hook-curve (see also Eq. (11)),
(27)
The total distribution decays exponentially to a good approximation at Σ0hook > Σc(0) in many cases (part a of Figure 9). Define the parameter , related to the S/N ratio R.
The exponential behaviour is approximately preserved with respect to this parameter in the mixed and S ranges since from Eq. (18) (see part c of Figure 9). Note that r directly relates to the binding strength of specific hybridization in the relevant range of intermediate and large R-values, r ~ logR ~ log(XsetPM,S). The initial value of the distribution at r = R = 0 provides the fraction of virtually absent probes, fabsent, whereas the area under the normalized distribution for r > 0 consequently gives the fraction of present probes, fpresent = (1 - fabsent).
The distribution can be empirically approximated by an exponential decay function for r-values greater than a certain threshold r' (and R' > R)
(28)
In the intermediate range the signal is approximated by a constant with f2present = 1 - (fabsent + f1present) (fpresent = f1present + f2present) whereas for r = 0 one gets the fraction of absent probes. The decay constant λr defines the r-range which decreases the probability of the specific signal by one order of magnitude. It thus defines a characteristic S/N-ratio (in logarithmic scale) of the chip which characterizes log-ratio of the mean specific and non-specific signals, or in other words, to which extend the specific signal exceeds the non-specific one. The S(pecific)/N(onspecific)-ratio thus can be also interpreted as a sort of S(ignal)/N(oise)-ratio of a given probe set. The value of the r-related decay constant λr slightly exceeds that of the hook-related distribution owing to the larger range covered by the PMonly S/N-ratio (λr = 1.0 versus λ = λΣ = 0.75; compare part c and a of Figure 9).
An estimate of the mean decay constant can be obtained by simple averaging over the respective R-range
⟨λ⟩ ≈ (⟨log(R + 1)⟩R > R'·In 10)
Note that the exponential decay in Eq. (28) is equivalent with the power law, 10-r/λ r= (R+1)-1/λ r, which has been shown to describe the probability distribution of expression values in a series of microarray experiments [28–31]. This power-law function is known as the Zipf's law, observed in many natural and social phenomena. The presence of such power-law function in principle prevents an intrinsic cut off point between "on" genes and "off" genes. The analysis of expression data in terms of their probability distribution therefore is of basic importance for judging the expression level in terms of global characteristics. The hook-transformation provides such characteristics in terms of the signal distribution along the Σhook and/or r coordinates for the mix-, S- and sat-hybridization ranges.
Let us now consider the S/N-ratio of a particular probe at two conditions: (i) with x = 0, referring to the non-specific binding strength in the centre of the normal background distribution, XpPM,N ≈ 10μ/Mc:; and (ii) with x ≠ 0, i.e. XpPM,N(x) ≈ 10μ+x/Mc and .
After transformation into logarithmic scale and combination of both situations we get log R(x) = log R
p
- x and for R > 1 to a good approximation r(x) ≈ r - x.
Now we combine the virtually independent background and signal distributions into the joint probability density function
(30)
It refers to the logarithmic signal value r with the particular non-specific background contribution μ + x. Integration over the whole possible backgound range provides the probability for detecting the signal r for probe p,
(31)
Note that the upper integration range is effectively restricted to x ≤ r because of pS(r-x,λ) = 0 for x > r (see Eq. (28)). Eqs. (30)–(31) define the convolution product of the background and specific-signal distributions. A similar approach was used in ref. [30] and partly also in the RMA-algorithm [29].
De-saturation of the signal and convolution based expression estimates
The hyperbolic intensity-response (Eq. (1)) can be linearized and transformed into the total binding strength using the asymptotic value Mc:
(32)
Figure 10 re-plots the hook curve shown in Figure 8 after linearization of the corrected intensities using Eq. (32). The sat-range now levels off into the asymptotic value αc (compare with Figure 8). This "de-saturated" signal additively decomposes into contributions due to non-specific and specific hybridization (Eq. (2)). We now calculate the expression value as the weighted "glog"-mean of the specific signal,
(33)
Particularly, Eq. (33) corrects the linearized signal (Eq. (32)) for the probe-specific background contribution using the mean background of the chip and the probe-specific sensitivity for non-specific hybridization, (Eq. (23)). The use of the generalized logarithm, , for data transformation is advantageous in two respects: Firstly, it enables processing negative arguments which may appear due to data scattering and, secondly, it "stabilizes" the variance at small values of x provided that the transformation parameter c > 0 is properly chosen [32, 33].
In Eq. (33) the glog-transformed data are averaged over the probability distribution representing the convolution product of the mutually independent probabilities of the non-specific Gaussian background and the specific expression signal (see previous section and [30]). This approach assumes that the probability distribution of each individual transcript is the same as the mean distribution averaged over all transcripts. Alternatively one can use the un-weighted kernel with pS = const. In this case Eq. (33) restricts the averaging to the background distribution.
In the next step, the obtained values are corrected for the sequence effects according to using the positional-dependent sensitivity model and the ZpP values instead of the log-intensities for parameterization (Eq. (50)). Then, the probe-level data are summarized for each probe set by calculating the Tukey-biweight median, [25], and finally by transforming them into linear scale to get the probe set-level expression degree, . Eq. (33) thus provides "PMonly" and "MMonly" estimates of the expression degree with P = PM and MM, respectively.
Considering the fact that the MM intensity comprises information about the non-specific background one can correct the linearized intensity and background of the PM by subtracting that of the respective MM:
(34)
with and
Eq. (34) uses the background distribution of the PM and the conditional expectation of the bivariate normal distribution for the argument of the MM-background which gives rise to an effectively shrunken integration variable, (here ρc is the coefficient of correlation, see appendix B). Note that this approach is to some degree similar to the GC-RMA algorithm [34] which however uses pseudo-MM representing subsets of PM-probes of the same GC-content as the considered PM. A second difference to Eq. (34) is that GC-RMA directly subtracts the (non-linear) MM-intensity without explicit consideration of the non-specific background.
In this context we stress the fact that the non-specific background intensity is rather a variable contribution which progressively decreases with increasing amount of specific binding than a constant (see appendix A and the figure shown there). The background correction of the intensity (as realized, e.g. by GC-RMA) neglects this trend and therefore it is expected to over-correct the signal in the S- and as-hybridization ranges. As a consequence, this approach underestimates the expression degree due to two reasons: Firstly, this over-correction of the background and, secondly, the neglect of saturation. On the other hand, the linearized signal used here (Eq. (32)) corrects the intensity for saturation and, in addition, it contains a constant background level corrected by simple subtraction in Eqs. (33) and (34).
The joint PM-MM- and especially the MMonly-expression measures are smaller than the PMonly estimates because of the smaller binding constants of the MM for specific binding. The former two measures can be scaled to values equivalent to the expression level of the PM according to (see also Eq. (24)),
(35)
and finally transformed into the "set-averaged" binding strength in analogy with Eq. (32)
(36)
Note that the PM-estimate exceeds the MM signal roughly by the factor sc ≈ 5 – 10 at comparable variance of the residual background distribution (see Figure 9, part b). For the coefficient of variation, CV ≈ σ/S, one gets roughly CV(PM) ≈ CV(PM-MM) << CV(MM). The MM consequently are expected to provide considerably less accurate expression estimates compared with the respective PMonly and PM-MM values.
Note that also the S/N-ratio (Eq. (26)) transforms into an alternative estimate of the specific binding strength of the PM (see Eq. (25) and (26)) with
(37)
Finally, the decay constant of the distribution of the specific signal relates to the logarithm of the S/N-ratio (see Eqs. (28) and (29)). One obtains the mean specific signal measured by the given chip as
(38)
The characteristic expression index (or exponent of specific binding) φc complements the respective exponents for non-specific hybridization βc and the characteristic S/N-index λr as the basic chip summary-characteristics of specific and non-specific binding.
Chip characteristics and expression estimates in natural units
Eqs. (36) and (37) provide probe set-estimates of the specific binding strength as a measure of the expression degree,
(39)
which represents a dimensionless concentration measure in units of the mean specific binding constant of the chip. A value of Sset = 1 consequently defines the condition of half-coverage to a good approximation (see Eq. (7)). Analogously, the horizontal dimensions of the hook-curve provide the non-specific binding strength as a measure of the non-specific background in units of the respective binding constant (see Eq. (25)),
(40)
It specifies the mean occupancy of the probes in the absence of specific transcripts, ΘPM(0) ≈ XcPM,N.
Hence, the hook method measures both, the abundance of specific and non-specific transcripts in the hybridization solution in chip-related units such as the relative occupancy of the probe spots and the respective binding strengths.
The specific and non-specific signals are combined into the S/N-ratio, R (Eq. (26)), which provides the expression degree in units of the non-specific background contribution in the particular chip experiment. The S/N-ratio is directly related to the hook-coordinates of a selected probe set and thus it can be roughly deduced by visual inspection of the particular hook curve (Eqs. (46) and (18)). Figure 8 illustrates the relation between the intrinsic expression measures and the hook coordinates for a typical microarray hybridization. The point of half coverage (ΘPM = 0.5, XPM, S ≈ 1) is found beyond the maximum in the sat-regime. Virtually no probe of the chosen chip meets this condition. Note that ΘPM scales with the distance relative to the end point referring to R → ∞ (see above). Vice versa, the mean fraction of specifically hybridized oligomers, xPM, S, scales with the distance relative to the N-point. It steeply increases in the mix-regime and reaches the conditions at which 50% of the bound transcripts belong to the specific ones, xPM, S = 0.5, at relatively small abscissa-values. Hence, specific hybridization starts to dominate over non-specific one always at the beginning of the mix-range. The S-range near the maximum of the hook-curve refers to probes with a 50 – 100 fold excess of specific hybridization, RPM ≈ 50 – 100. The width of the hook of about β = 2.7 is equivalent with the background strength of Nc ≈ 10-2.7 = 0.002 which in turn rescales the S/N-ratio into binding strengths, for example Sc = 0.1–0.2 for R = 50–100. These rough estimation shows that the maximum of the hook is equivalent with the occupancy of the PM-spots in the order of 10 – 25%.
The mean binding constants, KcPM,S and KcPM,N and thus also the used measures of specific and non-specific hybridization depend on the particular probe and chip design (e.g., the length of the probe-oligonucleotides, their density and the type of the mismatches used for the MM probes). Consequently they are specific for the used chip type, on one hand. On the other hand, also the conditions of a particular hybridization affect the KcPM,h because their values depend on all processes affecting the binding reaction between the probe-oligonucleotides and the targets. For example, the composition of a particular sample will affect KcPM,N and KcPM,S as well, because both constants depend on the extent of target-dimerization which is a function of the concentrations of the reacting species in the hybridization solution [1].