An online peak extraction algorithm for ion mobility spectrometry data

Ion mobility (IM) spectrometry (IMS), coupled with multi-capillary columns (MCCs), has been gaining importance for biotechnological and medical applications because of its ability to detect and quantify volatile organic compounds (VOC) at low concentrations in the air or in exhaled breath at ambient pressure and temperature. Ongoing miniaturization of spectrometers creates the need for reliable data analysis on-the-fly in small embedded low-power devices. We present the first fully automated online peak extraction method for MCC/IMS measurements consisting of several thousand individual spectra. Each individual spectrum is processed as it arrives, removing the need to store the measurement before starting the analysis, as is currently the state of the art. Thus the analysis device can be an inexpensive low-power system such as the Raspberry Pi. The key idea is to extract one-dimensional peak models (with four parameters) from each spectrum and then merge these into peak chains and finally two-dimensional peak models. We describe the different algorithmic steps in detail and evaluate the online method against state-of-the-art peak extraction methods.


Introduction
Ion mobility (IM) spectrometry (IMS), coupled with multi-capillary columns (MCCs), MCC/IMS for short, has been gaining importance for biotechnological and medical applications. With MCC/IMS, one can measure the presence and concentration of volatile organic compounds (VOCs) in the air or in exhaled breath with high sensitivity. In contrast to other technologies, such as mass spectrometry coupled with gas chromatography (GC/MS), MCC/IMS works at ambient pressure and temperature. Several diseases like chronic obstructive pulmonary disease (COPD) [1], sarcoidosis [2] or lung cancer [3] can potentially be diagnosed early with MCC/IMS technology. IMS is also used for the detection of drugs [4] and explosives [5]. Constant monitoring of VOC levels is of interest in biotechnology, e.g., for watching fermenters with yeast producing desired compounds [6] and in medicine, e.g., monitoring propofol levels in the exhaled breath of patients during surgery [7].
IMS technology is moving towards miniaturization and small mobile devices. This creates new challenges for data analysis: The analysis should be possible within the measuring device without requiring additional hardware like an external laptop or a compute server. Ideally, the spectra can be processed on a small embedded chip or small device like a Raspberry Pi or similar hardware with restricted resources. Algorithms in small mobile hardware face constraints, such as the need to use little energy (hence little random access memory), while maintaining prescribed time constraints.
The basis of MCC/IMS analysis is peak extraction, by which we mean a representation of all high-intensity regions (peaks) in the measurement by using a few descriptive parameters per peak instead of the full measurement data. State-of-the-art software (like IPHEx [8], Visual Now [9], PEAX [10]) only extracts peaks when the whole measurement is available, which may take up to 10 minutes because of the pre-separation of the analytes in the MCC. Our own PEAX software in fact defines modular pipelines for fully automatic peak extraction and compares favorably with a human domain expert doing the same work manually when presented with a whole MCC/IMS measurement. However, storing the whole measurement is not desirable or possible when the memory and CPU power is restricted. Here we introduce a method to extract peaks and estimate a parametric representation while the measurement is being captured. This is called online peak extraction, and this article presents the first algorithm for this purpose on MCC/IMS data. An extended abstract of this work has been published at WABI'14 [11].
Section 'Background' introduces the necessary background on the data produced by an MCC/IMS experiment, on peak modeling and on optimization methods. The basic idea of our algorithm is to process each IM spectrum as soon as it arrives (and before the next one arrives). After appropriate pre-processing including denoising and baseline correction described in Section 'Denoising and baseline correction' , the single spectra are reduced into a mixture of parametric one-dimensional peak models, described in Section 'Reducing a spectrum to peak models' . Accordingly, in Section 'Aligning consecutive spectrum peak lists' the approach of connecting models from two subsequent spectra into peak chains is explained. The main challenge is then to merge the peak chains into two-dimensional peak models, described in Section 'Estimating 2-D peak models' . In Section 'Peak clustering' we introduce a novel approach for clustering peaks among several measurements e.g. for time series. An evaluation of our approach is presented in Section 'Evaluation' including a listing of all settings of the MCC/IMS as well as an explanation of all adjustable parameters, while Section 'Discussion and conclusion' contains a concluding discussion.

Background
Ion mobility spectrometers and their functions are well documented [12], and we do not go into technical details. Instead, we characterize the data generated by an MCC/IMS experiment (Section 'Data from MCC/IMS measurements'). In Section 'Peak models' we describe a previously used parametric peak model, and in Section 'Optimization methods' we review two optimization methods that are being used as subroutines in this work.

Data from MCC/IMS measurements
In an MCC/IMS experiment, a mixture of several unknown volatile organic compounds (VOCs) is separated in two dimensions: first by retention time r in the MCC (the time required for a particular compound to pass through the MCC) and second by drift time d through the IM spectrometer. Instead of the drift time itself, a quantity normalized for pressure and temperature called the inverse reduced mobility (IRM) t is used to compare spectra taken under different or changing conditions. Thus we obtain a time series of IM spectra (one spectrum each 100 ms at each retention time point), and each spectrum is a vector of ion concentrations (measured by voltage change on a Faraday plate) at each IRM.
Let R be the set of (equidistant) retention time points and let T be the set of (equidistant) IRMs where a measurement is made. If D is the corresponding set of drift times (each 1/250000 second for 50 ms, that is 12 500 time points), there exists a constant C t|d > 0 depending on external conditions [12] such that T = C t|d · D. Then the data is an |R| × |T| matrix S = (S r,t ) of measured ion intensities, which we call an IM spectrum-chromatogram (IMSC). The matrix can be visualized as a heat map ( Figure 1). A row of S is a spectrum, while a column of S is a chromatogram.
Areas of high intensity in S are called peaks, and our goal is to discover them and to describe them by parametric models. Comparing peak coordinates with reference databases may reveal the identity of the corresponding compound. A peak caused by a VOC occurs over several IM spectra. We mention some properties of MCC/IMS data that complicate the analysis.
• An IM spectrometer uses an ionized carrier gas.
These ions are present in every spectrum in addition to the analyte ions, and they create the reactant ion peak (RIP). In the whole IMSC it is present as high-intensity chromatogram at a specific IRM ( Figure 1). When no analytes are injected into the device, the spectra contain only the RIP and are called RIP-only spectra. • Every spectrum contains a tailing of the RIP, so the RIP is right-skewed ( Figure 2). To extract peaks, the effect of the RIP and its tailing must be estimated and removed. • At higher concentrations, compounds can form dimer ions, and one may observe both the monomer and dimer peak from one compound. This means that there is not necessarily a one-to-one correspondence between peaks and compounds, and our work focuses on peak detection, not compound identification. • An IM spectrometer may operate in positive or negative mode, depending on which type of ions (positive or negative) one wants to detect. In either case, signals are reported in positive units. All experiments described here were done in positive mode.

Peak models
For our purpose of analyzing MCC/IMS measurements, a peak is characterized by the following assumptions.
Assumptions 1. An n-dimensional peak P is a product of n log-concave functions with two inflection points in each dimension. The peak width at half height ω1 /2,i can be Figure 1 Visualization of a raw measurement (IMSC) as a heat map; signal color: white (lowest) < blue < purple < red < yellow (highest). The constantly present reactant ion peak (RIP) with mode at 0.48 Vs/cm 2 and exemplarily one VOC peak are annotated.
calculated with respect to the mode for each dimension i. At its mode (m 1 , . . . , m n ), peak P exceeds the average background noise level by a certain factor times the standard deviation of the noise.
For MCC/IMS measurements, we have n = 2 dimensions, and in both retention time dimension and IRM dimension, we use the shifted Inverse Gaussian distribution g [13] as peak model function: Its parameters are the shift (or offset) o, the relative mean μ > 0 (to the right of o) and the shape parameter λ > 0. A peak is then given as the product of two shifted Inverse Gaussians, scaled by a volume factor v, i.e., by seven parameters; so the density function of a peak is p(r, t) := v·g(r, μ r , λ r , o r )·g(t, μ t , λ t , o t ) for all r ∈ R, t ∈ T.
Since the parameters μ, λ, o of a shifted Inverse Gaussian may be different even though the resulting distributions have a similar shape, it is more intuitive to describe the shifted Inverse Gaussian in terms of three different descriptors: the (absolute) mean μ = o + μ, the standard deviation σ and the mode m. There is a bijection between (μ, λ, o) and (μ , σ , m) [13] summarized in Appendix A.
We also make use of the following empirically observable properties of peaks in real IMSCs that concern the peak widths on both the IRM axis and the retention time axis. The width can be described as the length ω1 /2 of the interval around the mode where the peak height is at least half of its maximum height. For a (symmetric) Gaussian distribution, there is a linear relation between the standard deviation σ and ω1 /2 : This relation approximately holds as well for not too skewed Inverse Gaussian distributions and is a good approximation to estimate its descriptor σ approximately from an empirically observed ω1 /2 .
Given the mode d * of a peak in drift time (in ms), we can estimate its descriptors (m, σ , μ ) in IRM units as follows. Recall that the IRM mode (in V s cm −2 ) is simply m = C t|d · d * , where C t|d is the conversion constant between drift time and IRM (see Section 'Data from MCC/IMS measurements'). Spangler et al. [14] empirically derived that ω1 /2 = (11.09 D d * )/v 2 d + d 2 grid , where D is the diffusion coefficient, v d the drift velocity. Using the Einstein relation [15], D can be computed as D = kK B T /q, where k is the ion mobility, K B the Boltzmann constant, T the absolute temperature and q the electric charge. We then use (2) to estimate σ ≈ ω1 /2 /φ. Finally, the mean is empirically found to be μ ≈ C t|d · d * + (4.246 · 10 −5 ) 2 + (d * ) 2 /585048.1633 . On the retention time axis, the peak width ω1 /2 grows approximately linearly with retention time, i.e., there are constants r_width_offset > 0 and r_width_factor > 0 such that width of a peak with maximum at retention time r is approximately ξ(r) := r · r_width_factor + r_width_offset . (3)

Optimization methods
The online peak extraction algorithm makes use of nonlinear unconstrained minimization, similar to non-linear least squares, and of the EM algorithm. Both methods are summarized here.

Non-linear Least Squares
The NLLS method is an iterative method to estimate parameters θ = (θ 1 , . . . , θ q ) of a supposed parametric function f , given n observed data points (x 1 , y 1 ), . . . , (x n , y n ) with y i = f (x i ; θ). The idea is to minimize the quadratic error n i=1 r 2 i (θ) between the function and the observed data, where r i (θ) := y i − f (x i ; θ) is the residual of the i-the datapoint. The necessary optimality condition is i r i (θ) · ∂r i (θ)/∂θ j = 0 for all j. If f is linear in θ (e.g., a polynomial in x with θ being the polynomial coefficients, a setting called polynomial regression), then the optimality condition results in a linear system, which can be solved in closed form. However, often f is not linear in θ and we obtain a non-linear system, which is solved iteratively, given initial parameter values, by linearizing it in each iteration. Details and different algorithms for NLLS can be found in the literature ( [16], Chapter 10). In this paper, we use a different, non-symmetric loss function, but apply similar techniques to solve the problem (see below).

The EM algorithm for mixtures with heterogeneous components
The observed data x = (x 1 , . . . , x n ) is viewed as a sample from a mixture of probability distributions, where the mixture density is specified by where θ c denotes the parameters of f c , and θ = (θ 1 , . . . , θ C ) is the collection of all parameters. The mixture coefficients satisfy ω c ≥ 0 for all c, and c ω c = 1. Unlike in most applications, where all component distributions f c are multivariate Gaussians, here the f c are of different types (e.g., uniform and Inverse Gaussian). The goal is to determine the parameters ω and θ such that the probability of the observed sample is maximal (maximum likelihood paradigm). Since the resulting optimization problem is non-convex in (ω, θ), the EM algorithm is an iterative method that will converge to a local optimum [17] in parameter space. The EM algorithm consists of two repeated steps: The E-step (expectation) estimates the expected membership of each data point in each component and then the component weights ω, given the current model parameters θ. The M-step (maximization) estimates maximum likelihood parameters θ c for each parametric component f c individually, using the expected memberships as hidden variables that decouple the model. , Convergence. After each M-step of an EM cycle, we compare θ c,q (old parameter value) and θ + c,q (updated parameter value), where q indexes the elements of θ c , the parameters of component c. We say that the algorithm has converged when the relative change κ c,q := Having reviewed the necessary background, we now describe the methods we use for peak extraction from IMSCs.

Denoising and baseline correction
Background A major challenge during peak detection in an IM spectrum is to find peaks that only slightly exceed the background noise level in a spectrum S = (S t ). To determine whether the intensity S t at coordinate t belongs to a peak region or can be solely explained by background noise, we propose a method based on the EM algorithm. It runs in O(τ |T|) time where τ is the number of EM iterations.

Mixture model
Based on observations of IM spectra signal intensities, we assume that • the noise intensity has a Gaussian distribution over low intensity values with mean μ N and standard deviation σ N , • the true signal intensity has an Inverse Gaussian distribution with mean μ S and shape parameter λ S , i.e., • there is an unspecific background component which is not well captured by either of the two previous distributions; we model it by the uniform distribution over all intensities, and we expect the weight ω B of this component to be close to zero in standard IM spectra. High weights indicate an anomaly during the measurement.
We interpret the observed spectrum S as a sample of a mixture of these three components with unknown mixture coefficients. To illustrate this approach, consider Figure 3, which shows the empirical intensity distribution (histogram) of an arbitrary spectrum, together with the estimated components (except the uniform distribution, which has the expected coefficient of almost zero).
It follows that there are six independent parameters to estimate:

Initial parameter values
Background noise intensities are assumed to follow a Gaussian distribution at small intensity values. We can determine its approximate mean μ N and standard deviation σ N by considering the first and last 10% of data points in each spectrum.
The initial weight of the noise component is set to cover most points covered by this Gaussian distribution, i.e., We assume that almost all of the remaining weight belongs to the signal component, thus To obtain initial parameters for the signal model, let

E-step
The hidden parameters W t,c are computed using (4), where the three component distributions f c are the three component densities p N , p S , p B with their parameters and the data x is a mean-smoothed version of the original spectrum S: where the smoothing window margin is α := (1/2) · d grid · C t|d · |T|/T last . (Here d grid is the grid opening time of the spectrometer and T last is the maximum IRM in T).

Maximum likelihood estimators
In the maximization step (M-step) we estimate maximum likelihood parameters for the non-uniform components using the original intensities of S again.
for all t ∈ T.

Final step
After convergence, we correct the baseline and remove noise: We first subtract μ N from the signal value and then reduce the remaining value by the estimated noise weight. The corrected spectrum S + is Reducing a spectrum to peak models

Background
The idea of processing a single (noise-reduced) IM spectrum S is to deconvolute it into separate components described with statistical distribution functions. Several components appear in each spectrum besides the peaks, namely the previously described RIP and the tailing described in Section 'Data from MCC/IMS measurements' . and background noise. We first determine and remove the RIP tailing function and then determine the peak parameters (including the RIP).

Determining the tailing function
The tailing function appears as a baseline in every spectrum (see Figure 2 for an example). Its shape and scale changes from spectrum to spectrum; so it has to be determined in each spectrum and subtracted in order to extract peaks from the remaining signal in the next step. Empirically, we observe that the tailing function f (t) can be described by a scaled shifted Inverse Gaussian, be the residual function for a given choice θ of parameters. As we want to penalize r(t) < 0 but not (severely) r(t) > 0, we use the following non-symmetric loss function that depends on a threshold parameter γ > 0: That is, the loss at time t is the residual squared when it has a negative or small positive value less than the given threshold γ > 0, but becomes a linear function of the residual for larger positive residuals.
The goal is to find θ to minimize the total loss L(θ) := t e t (θ; γ ) for the given spectrum S and given γ > 0. We use gradient descent to solve this nonlinear optimization problem, to which we refer as non-linear loss minimization (NLLM). To estimate the tailing function, 1. we determine reasonable initial values for the parameters θ = (v, μ, λ, o) (see below), 2. we solve NLLM with γ = σ 2 N to estimate the scaling factor v, leaving the other parameters fixed, 3. we solve NLLM with γ = σ 2 N to estimate all four parameters, 4. we solve NLLM with γ = σ 2 N /100 to re-estimate the scaling factor v where σ N is the standard deviation of the noise as described in Section 'Denoising and baseline correction' .
The initial parameter values (v, μ, λ, o) are determined as follows: For the scaling factor, we initially set v = (1/2) t |T| S t . For the other parameters, we first estimate the descriptors (μ , σ , m) as described below and then use the correspondence to the parameters listed in Appendix A. The initial σ is set to the standard deviation of the whole RIP-only spectrum. We determine the initial m as the RIP mode. It is a property of the Inverse Gaussian distributions under consideration such that the mean μ can only range within the interval I = [ m, m + 0.7σ ]. To obtain an appropriate value for μ , an auxiliary offset variable o is set to the largest IRM left of the RIP mode where the signal is below σ N , and μ it is increased in small steps within I. The candidate descriptors (μ , σ , m) are converted into corresponding param- eters (μ, λ, o) until o ≥ o . The so obtained parameters constitute the initial parameter values.

Extracting peak parameters from a single spectrum
To extract all peaks from a spectrum (from left to right), we repeat three sub-steps: 1. scanning for a potential peak, starting where the previous iteration stopped; 2. determining peak parameters (Inverse Gaussian distribution); 3. subtracting the peak from the spectrum and continuing with the remainder.
Scanning. The algorithm scans for peaks, starting at the left end of S, by sliding a window of a given width across S and fitting a quadratic polynomial model to the data points within the window. The window width (in index units) is related to the grid opening time d grid of the spectrometer and given as d grid /D last · |D| data points, where D last is the maximum (last) drift time measured. The model is built in drift time (not IRM). Let f (d; θ) = θ 2 d 2 + θ 1 d + θ 0 be the fitted quadratic polynomial inside the window. We call a window a peak window if the following conditions are fulfilled: • the extreme drift time d * = θ 1 /(2θ 2 ) lies within the drift times of the window; • the extreme drift time d * indicates a maximum (i.e., θ 2 < 0); • the maximum is sufficiently high above the noise level (which is zero after preprocessing): The first condition can be more strongly restricted to achieve more reliable results, by shrinking the interval towards the center of the window. If no peak is found, the moving window is shifted one index forward. If a peak is detected, the window is shifted half the window length forward before the next scan begins, but first the peak parameters are computed.
Determining peak parameters. As described in Section 'Peak models' , we can estimate all peak descriptors (m, σ , μ ) from its mode d * in drift time. We convert them into the parameters (μ, λ, o) of the Inverse Gaussian parameterization (see Appendix A). The scaling factor v for the peak is v = f (d * ; θ)/g(m; μ, λ, o).
The model function is subtracted from the spectrum, and the next iteration is started with a window shifted by α index units (consider Section 'E-step'). For each spectrum, the output of this step is a spectrum peak list, which is a set of parameters for a mixture of weighted Inverse Gaussian models describing the peaks.

Background
Having a set of peak parameters for each spectrum, the question arises how to merge the sets P = (P i ) and P + = (P + j ) of two consecutive spectra. For each peak P i , we have stored the Inverse Gaussian parameters μ i , λ i , o i , the peak descriptors μ i , σ i , m i (mean, standard deviation, mode) and the scaling factor v i , and similarly so for the peaks P + j . The idea is to compute a global alignment similar to the Needleman-Wunsch method [18] between P and P + . We need to specify how to score aligning P i to P + j and how to score leaving a peak unaligned (i.e., a gap).

Scoring peak alignments
The score Z ij for aligning P i to P + j is chosen by evaluating P i 's density function at the new mode m + j and comparing it to "typical" value an approximate standard deviation away from the mode (at m i + δ, where δ := d grid · C t|d /φ), resulting in the log-odds score Alternatively, leaving a peak unmatched results in a gap score of zero.
Applying Needleman-Wunsch global alignment, we can compute the optimal score of aligning the first i peaks in the former spectrum with the first j peaks in the current spectrum by dynamic programming. We initialize a matrix Z, setting all Z i,0 and Z 0,j to zero and then compute, for i ≥ 1 and j ≥ 1,

Obtaining peak chains
The alignment is obtained with a traceback, recording the optimal case in each cell, as usual. There are three cases to consider.
• If P + j is not aligned with a peak in P, potentially a new peak starts at this retention time. Thus model P + j is put into a new peak chain.
• If P + j is aligned with a peak P i , the chain containing P i is extended with P + j . • All peaks P i that are not aligned to any peak in P + indicate the end of a peak chain at the current retention time.
All completed peak chains are forwarded to the next step, two-dimensional peak model estimation.

Estimating 2-D peak models
Background Let C = (P 1 , . . . , P n ) be a chain of one-dimensional Inverse Gaussian models. The goal of this step is to estimate a two-dimensional peak model (product of two one-dimensional Inverse Gaussians) from the chain, as described in Section 'Peak models' , or to reject the chain if the chain does not fit such a model well. Potential problems are that a peak chain may contain noisy 1-D peaks truncated at their borders, consist only of noise or in fact consist of several consecutive 2-D peaks at the same drift time and successive retention times.

Estimating the parameters
As discussed in Section 'Peak models' , the half-height width ω1 /2 in retention time of a peak centered at retention time r can be described by an affine function ξ(r), Eq. (3), and ω1 /2 can be converted to the corresponding number of data points (window width).
We have the parameters (v i ,μ i,t ,λ i,t ,ô i,t ) for each individual peak i = 1, . . . , n in a peak chain, and the corresponding descriptors (μ i,t ,σ i,t ,m i,t ), as well as the associated retention time r i and peak height h i =v i · g(m i,t ;μ i,t ,λ i,t ,ô i,t ).
We proceed similarly to Section 'Extracting peak parameters from a single spectrum' by fitting quadratic polynomials b(r; θ) = θ 2 r 2 + θ 1 r + θ 0 in sliding windows of the appropriate width ξ(r i ) such that h i ≈ b(r i ; θ).
Having found a window that fits a peak, we estimate initial descriptors for an Inverse Gaussian model in retention time as follows:

The descriptors are then converted into model parameters (see Appendix A).
After processing each window, we have obtained a list of size, say, k, of Inverse Gaussian distributions. We expect these distributions to be a mixture of k overlapping peaks in a single peak chain. To obtain an optimal deconvolution, we first normalize the volume factors v r of the k components to obtain v r,j such that k j=1 v r,r = 1 and then apply the EM algorithm. As a byproduct, we obtain an (n × k) matrix M = (M i,j ) that determines the membership probability for each of the n data points (r i , h i ) to each of the k models.
To obtain the Inverse Gaussian distribution parameters in the IRM dimension for each of the k models, we first compute model descriptors using a membershipweighted average over the individual model descriptors: For j ∈ {1, . . . , k}, let We then convert these descriptors back into model parameters (Appendix A). The final peak volume is computed as v * j = v j,r · i n v i,t . For every model j ∈ {1, . . . , k}, we check the following conditions: • The width at half height in the retention time dimension has approximately the expected size (cf. Eqs. (2), (3)): ξ(m j,r )/2 σ j,r /φ < 2 · ξ(m j,r ), • The peak height at its maximum is sufficiently above the noise level: v * j · g(m j,t ; μ j,t , λ j,t , o j,t )· g(m j,r ; μ j,r , λ j,r , o j,r ) ≥ noise_margin · σ N , where noise_margin > 0 is a tunable parameter, • the Inverse Gaussian peak model g in retention time correlates well (in terms of the Pearson product-moment correlation coefficient ρ) with its quadratic approximation b in a window around the mode. More precisely, consider the window W = [ m j,r − ξ(m j,r )/φ, m j,r + ξ(m j,r )/φ], the model vector G = g(x; μ j,r , λ j,r , o j,r ) for x ∈ W and the quadratic approximation vector B = b j (x; θ) for x ∈ W , and test whether the Pearson correlation satisfies ρ(G, B) ≥ ρ min .
If all conditions are satisfied, we have identified a 2-D peak model (v * j , μ j,t , λ j,t , o j,t , μ j,r , λ j,r , o j,r ). Otherwise the model is discarded.

Background
We now consider a series of IMS measurements, for each of which we have extracted peaks available in the form of parameter vectors or descriptors. The question arises how to decide which descriptors in different measurements represent the same peak (and hence potentially the same VOC).
Let X be the union of peak locations in all measurements, let |X| =: n, and let X i,R be the retention time of peak i and X i,T its IRM. We introduce a clustering approach using the EM algorithm with twodimensional Gaussian mixtures that differs from the standard approach by its ability to dynamically adjust the number of clusters in the process.

Mixture model
We assume that the measured retention times and IRMs belonging to peaks from the same compound are independently normally distributed in both dimensions around the (unknown) true retention time and IRM. Let θ j := (μ j,R , σ j,R , μ j,T , σ j,T ) be the parameters for component j, and let f j (x givenθ j be a two-dimensional Gaussian product distribution for a peak location x = (x R , x T ) with these parameters.
The mixture distribution is with a yet undetermined number C of clusters. Note that there is no "background" model component.

Initial parameter values
In the beginning, we initialize the algorithm with as many clusters as peaks, i.e., we set C := n. This assignment makes a background model obsolete, because all peaks are assigned to at least one cluster. All clusters get as start parameters for μ j,R , μ j,T the original retention time and IRM of peak location X j , respectively, for j = 1, . . . , n. We set σ j,T := t_width > 0 and σ j,R := ξ(X j,R )/φ.

Dynamic adjustment of the number of clusters
After computing weights in the E-step, but before starting the M-step, we dynamically adjust the number of clusters by merging clusters whose centers are close to each other. Every pair j < k of clusters is compared in a nested forloop. When |μ j,T − μ k,T | < t_width and |μ j,R − μ k,R | < ξ max{μ j,R , μ k,R } , then clusters j and k are merged by summing the EM weights: ω + := ω j + ω k and W i,+ := W i,j + W i,k for all i. The summed weights are assigned to the location of the cluster with larger weight. (The recomputation of the parameters happens immediately after merging in the maximization step). The comparison order may matter in rare cases for deciding which peaks are merged first, but since new means and variances are computed, possible merges that were omitted in the current iteration will be performed in the next iteration. This merging step is applied after in the second EM iteration, since the cluster means need at least one iteration to move towards each other.

Maximum likelihood estimators
The maximum likelihood estimators for mean and variance of a two-dimensional Gaussian are the standard ones, taking into account the membership weights, for all components j = 1, . . . , C.
One problem using this approach emerges from the fact that initially each cluster contains only one peak, leading to an estimated variance of zero in many cases. To prevent this, minimum values are enforced such that σ j,T ≥ t_width and σ j,R ≥ ξ(μ j,R )/φ for all j.

Final step
The EM loop terminates when no merging occurs and the convergence criteria for all parameters are fulfilled. The resulting membership weights determine the number of clusters as well as the membership coefficient of peak location X i to cluster j. If a hard clustering is desired, the merging step has to be traced.

Evaluation
We evaluate different properties of the online method: 1. the quality of reducing a single spectrum to a peak list (denoising/baseline correction (Section 'Denoising and baseline correction') and spectrum reduction (Section 'Reducing a spectrum to peak models'), 2. the execution time of both steps, 3. the quality of the new clustering approach, 4. the correlation between manual annotations on full IMSCs by a computer-assisted expert and our automated online extraction method.
Parameters. For evaluation measurements, the MCC was adjusted to a temperature of 40°C and throughput of 150 mL min −1 . The IMS had a voltage of 4380 V, a grid opening time of 300 μs and a throughput of 150 mL min −1 . We chose the following parameters [9]: • r_width_offset = 2.5 s (width offset for peaks in retention time), • r_width_factor = 0.06 (width slope for peaks in retention time), • t_width = 0.003 V s cm −2 (standard deviation for peaks in IRM), • thresh (convergence threshold; value varies within evaluation), • noise_margin = 4 (factor multiplied with standard deviation of background noise for minimal peak height), • ρ min = 0.95 (minimal Pearson product-moment correlation coefficient).

Quality of single spectrum reduction
In a first experiment, we tested the quality of the spectrum reduction method using an idea by Munteanu and Wornowizki [19] that determines the agreement between an observed set of data points, interpreted as an empirical distribution function F (the data) and a model distribution G (the mixture distribution obtained from the peak list parameters). The approach writes F =s·G +(1−s)·H withs ∈ [ 0, 1], where H is a non-parametric distribution whose inclusion ensures the fit of the model G to the data F. If the weights is close to 1.0, then F is a plausible sample from G. We compare the original spectra and reduced spectra (peaks from peak lists) from a previously used dataset [20]. This set contains 69 measurements preprocessed with a 5 × 5 average. Every measurement contains 1200 spectra. For each spectrum in all measurements, we computed the reduced spectrum model and determineds. Over 92% of all 82 000 models achieveds = 1 and over 99% reacheds ≥ 0.9. Nos dropped below 85%. In summary, spectrum reduction provides an accurate parametric representation of most spectra.

Execution time
We tested our method on two different platforms, (1) a desktop PC with Intel(R) Core(TM) i5 2.80GHz CPU, 8GB memory, Ubuntu 12.04 (64bit) OS and (2) a Raspberry Pi [21] type B with ARM1176JZF-S 700MHz CPU, 512 MB memory, Raspbian Wheezy (32bit) OS, once with the factory defaults of 700 MHz and once overclocked up to 900 MHz. The Raspberry Pi was chosen because it is a complete credit-card-sized low-cost single-board computer with low CPU and power consumption (3.5 w). This kind of device is appropriate for data analysis in future mobile measurement devices.
Recall that each spectrum contains 12 500 data points. It is current practice to analyze not the full spectra, but aggregated ones, where five consecutive values are averaged. Here we consider the full spectra, slightly aggregated ones (average over two values, 6 250 data points) and standard aggregated ones (average over five values, 2 500 data points). We measured the average execution time of denoising, baseline correction and consecutive spectrum reduction. Table 1 shows the results. It is remarkable that at the highest resolution (Average 1) the Raspberry Pi with 900 MHz keeps barely the time bound of 100 ms between consecutive spectra. At lower resolutions, the Raspberry Pi satisfies the time restrictions easily. The desktop PC copes with the analysis effortless on any setting.
We found that in the steps that use the EM algorithm, on average 25-30 EM iterations were necessary for a precision of thresh := 0.001 (i.e., 0.1%) (see Convergence in Section 'The EM Algorithm for mixtures with heterogeneous components'). Relaxing the threshold from 0.001 to 0.01 halved the number of iterations without noticeable difference in the resulting estimated parameters.

Clustering
To evaluate peak clustering methods, we simulate peak locations according to locations in real MCC/IMS datasets, together with the true partition P of peaks.
Most of the detected peaks appear in a small dense area early in the measurement. The remaining peaks are distributed widely, which is referred to as the sparse area (we let the areas overlap such that the dense are is contained in the sparse area). The areas approximately have the following boundaries (in units of (V s cm −2 , s) from lower left to upper right point, cf. Peak clusters are ellipsoidal and dense. From [9], we know the minimum required distance between two peaks in order to be identified as two separate compounds. We simulate 30 peak cluster centroids in the dense area and 20 in the sparse area, all picked randomly and uniformly distributed in the respective area. We then randomly pick the number of peaks per cluster. We also randomly pick the distribution of peaks within a cluster. Since we do not know the actual distribution model, we decided to simulate with three models: normal (n), exponential (e) and uniform (u) distribution with the following densities: Here (μ t , μ r ) is the coordinate of the centroid with RIM in Vs/cm 2 and retention time in s. For the normal distribution, we used σ t = 0.002 and σ r = μ r · 0.002 + 0.2. For the exponential distribution, we used λ t = (1.45 · 2500) −1 (reduced mobility width for in single cell within M) and λ r = 1/(μ r · 0.002 + 0.2). For the uniform distribution, we used an ellipsoid with radii ν t = 0.006 and ν r = μ r · 0.02 + 1.
We compared our adaptive EM clustering with two common clustering methods: k-means and DBSCAN. Since k-means needs a fixed number of clusters k and appropriate start values for the centroids, used k-means++ [22] for estimating good starting values and give it an advantage by assigning the true number of partitions. DBSCAN has the advantages that it does not need to know the number of clusters in advance and that it can find non-linear cluster boundaries, but it does not easily yield parametric cluster descriptors.
To measure the quality of an obtained clustering C we use the Fowlkes-Mallows index (FMI, [23]) and the normalized variation of information (NVI) score [24].
For the FMI one considers all pairs of data points. If two data points belong to the same true partition of P, they are called connected. Accordingly, a pair of data points is called clustered if they are clustered together by the clustering method are evaluating. Pairs of data points that are both connected and clustered are called true positives (TP). False positives (FP, not connected but clustered) and false negatives (FN, connected but not clustered) are computed analogously. The FMI is the geometric mean of precision and recall: FMI(P, C) := √ TP/(TP + FP) · TP/(TP + FN), where P is the true partition and C is the clustering. We have FMI(P, C) ∈ [ 0, 1], and FMI(P, C) = 1 indicates perfect agreement. The FMI is difficult to interpret when the number of clusters in C and P differs significantly.
Therefore we use a second measure that considers cluster sizes only, the normalized variation of information (NVI). To compute the NVI, an auxiliary (|P| × |C|)dimensional matrix A = (a i,j ) is computed, where a i,j is the number of data points within partition i that are assigned to cluster j. The NVI score is defined via entropies; let n be the number of data points and if H(P) = 0, H(C) otherwise.

Figure 4
Histograms of Fowlkes-Mallows index (FMI; higher is better) and normalized variation of information (NVI; lower is better) comparing 100 simulated measurements containing partitioned peak locations with their clusters produced by the different methods.
Here NVI(P, C) = 0 indicates perfect agreement between the cluster size distributions. Together, an FMI of 1 and an NVI score of 0 indicate a perfect clustering.
For the first test, we evaluated 100 sets of data points distributed as described above. The cluster model (normal, exponential or uniform) was drawn randomly. The results show that even with the advantage that k-means knows the true k, our adaptive EM clustering performs best on average in terms of FMI and NVI score (Figure 4).
For the second test, we additionally inserted 200 uniformly distributed (noise) peaks into the measurement area. All these peaks are singletons and have no matching  Time series of discovered intensities of two peaks. Left: A peak with agreement between manual and automated online annotation. Right: A peak where the online method fails to extract the peak in several measurements. If one treated zeros as missing data, the overall trend would still be visible.
peaks. The results ( Figure 5) show that the adaptive EM clustering still performs best on average, whereas k-means fails.

Comparison of automated online peak extraction with manual offline annotation
The fourth experiment compares extracted peaks from a time series of measurements of two automated methods to an expert manual annotation. The automated methods are our online analysis process described here and automated peak detection using the commercial VisualNow software.
Here 15 rats were monitored in 20 minute intervals for up to a day. Each rat resulted in 30-40 measurements (a time series) for a total of 515 measurements. To track peaks over time, we used the previously described EM clustering method.
As an example, Figure 6 shows time series of intensities of two peaks detected by computer-assisted manual annotation and using our online algorithm. The example shows that there are cases where the sensitivity of the online algorithm is not perfect; this is mainly true for peaks whose intensity only slightly exceeds the background noise.
To obtain an overview over all time series, we computed the cosine similarity γ ∈ [ −1, +1] time series of peak intensities discovered by manual annotation and each automated method. We also computed the recall automated method for each time series, that is, the relative fraction of measurements where the peak was found by the algorithm among those where it was found by manual annotation. Figure 7 shows overall good agreement between both automated methods (our online method and automated VisualNow peak extraction) with the expert manual annotation. The cosine similarity of the inferred time series is in better agreement than the more variable recall. When comparing automated methods against each other, we outperform VisualNow in terms of sensitivity and computation time: About 31% of the points extracted by the online method exceed 90% recall and 98% cosine similarity whereas only 5% of the time series extracted by VisualNow achieve these values. The peak detection of one measurement takes about 2 seconds on average (when the whole measurement is available at once) with the online method and about 20 seconds with VisualNow on the desktop computer described above. VisualNow only provides the position and signal intensity of the peak's maximum, whereas our method additionally provides shape parameters.
Problems of our online method stem from low-intensity peaks only slightly above the detection threshold, and resulting fragmentary or rejected peak chains.

Discussion and conclusion
We presented the first approach to extract peaks from MCC/IMS measurements while they are being captured, with the long-term goal to remove the need for storing full measurements before analyzing them in small embedded devices. Our method is fast and satisfies the time restrictions even on a low-power CPU platform like a Raspberry Pi and outperforms existing software.
While performing well on single spectra, there is room for improvement in merging one-dimensional peak models into two-dimensional peak models. Our method has to be further evaluated in clinical studies or biotechnological monitoring settings. It also has not been tested with the negative mode of an IMS for lack of data. In general, the robustness of the method under adversarial conditions (high concentrations with formation of dimer ions, changes in temperature or carrier gas flow in the MCC) has to be evaluated and probably improved.