 Software article
 Open Access
Bayesian localization of CNV candidates in WGS data within minutes
Algorithms for Molecular Biology volume 14, Article number: 20 (2019)
Abstract
Background
Full Bayesian inference for detecting copy number variants (CNV) from whole-genome sequencing (WGS) data is still largely infeasible due to computational demands. A recently introduced approach to perform Forward–Backward Gibbs sampling using dynamic Haar wavelet compression has alleviated issues of convergence and, to some extent, speed. Yet, the problem remains challenging in practice.
Results
In this paper, we propose an improved algorithmic framework for this approach. We provide new space-efficient data structures to query sufficient statistics in logarithmic time, based on a linear-time, in-place transform of the data, which also improves on the compression ratio. We also propose a new approach to efficiently store and update marginal state counts obtained from the Gibbs sampler.
Conclusions
Using this approach, we discover several CNV candidates in two rat populations divergently selected for tame and aggressive behavior, consistent with earlier results concerning the domestication syndrome as well as experimental observations. Computationally, we observe a 29.5-fold decrease in memory, an average 5.8-fold speedup, as well as a 191-fold decrease in minor page faults. We also observe that metrics varied greatly in the old implementation, but not the new one. We conjecture that this is due to the better compression scheme. The fully Bayesian segmentation of the entire WGS data set required 3.5 min and 1.24 GB of memory, and can hence be performed on a commodity laptop.
Background
Hidden Markov models (HMM) are arguably among the central methods for signal processing. In bioinformatics, they are commonly used for the detection of copy-number variations (CNV), which have been recognized to play an important role in cancer progression [1,2,3] and neuropsychiatric disorders [4, 5]. Depending on the application and experimental platform, the number of states is chosen between 3, for simple gains and losses, and around 10, for complex genomic alterations in certain cancers. Since CNV can disrupt or duplicate genes and regulatory elements, effects such as loss-of-function, chimeric proteins, as well as gene dosage can lead to variations in phenotype. Copy-number variants fixed in divergently selected populations can be used as candidates for genetic causes underlying phenotypic adaptations.
The challenges in HMM segmentation of WGS data are twofold. First, though the advantages of Bayesian segmentation over frequentist approaches have previously been noted [6,7,8,9,10], inference is computationally demanding on WGS-scale data; in particular, Bayesian methods which rely on Markov Chain Monte Carlo (MCMC) approximations are infeasible on standard computers, in terms of memory requirements, speed and convergence characteristics. Second, HMM assume piecewise constant data with variates conditionally independent given the true segmentation, which means that any long-range bias violates the model assumptions. Unfortunately, this is the case when using read-depth data from WGS experiments for CNV estimation. The number of reads mapped to any given position is confounded by amplification bias due to primer affinity and GC content, as well as computational bias incurred during read mapping. This can lead to multiple shifts in segment means, as well as non-linear long-range effects in the signal which would be modeled more accurately as piecewise higher-order polynomials. Removing such effects computationally, e.g. by regression methods such as loess [11], is non-trivial, as it requires the separation of three signals: additive experimental noise, a smooth long-range bias, as well as the sequence of true means. In other words, it is hard to differentiate between shifts in signal averages which are due to bias and those that represent actual CN changes.
The contributions of this paper aim to address these issues. On the matter of efficient computation, it was recently shown that Bayesian inference of the hidden state sequence using Forward–Backward Gibbs sampling (FBG) [12] can be made feasible for large data sets by using a dynamic compression scheme based on Haar wavelet regression [6]. In this approach, data is presented to the Gibbs sampler in a compressed form, and the sampler adapts the compression dynamically according to the noise level it obtains in each sampling step. This has led to drastic improvements in speed and convergence behavior of FBG. Conceptually, the approach allows the software to “zoom in” on candidate regions for CNV and concentrate its computational efforts there, while ignoring long diploid segments. While the issue of convergence has been addressed and overall speed has been improved [6], memory usage remains an obstacle when analyzing WGS data. Here, we present a novel algorithmic framework to implement the dynamic wavelet compression approach for HMM inference using FBG. We provide new data structures to efficiently store and update marginal state counts for compression block structures, and to efficiently query sufficient statistics at different wavelet resolution levels. We derive a linear-time, in-place algorithm for the data transform required for its construction, based on the lifting scheme [13].
On the matter of providing FBG with data that fits its model to a reasonable degree, we noticed that it is common practice to sequence sample and control in a multiplexed fashion, often for cost reasons. Using differential read counts from the same, multiplexed sequencing run (see [14] for instance) cancels out any additive coverage bias. This not only reduces the potential for false CNV calls due to systematic shifts in the data, but also obviously decreases the conditional dependence of the observed variates given the true segmentation labels. Using such data is therefore a more appropriate input to HMM methods. Aside from these general considerations, wavelet compression acts favorably on such data: regression relies on a property of wavelets called polynomial suppression. If the underlying signal is a polynomial of a degree up to a certain constant, wavelets are orthogonal to it and hence removed during regression. This yields a separation of signal and noise. Higher-order polynomials due to long-range experimental bias, however, would incur additional discontinuities in the regression, leading to lower compression ratios, higher memory requirements, and, consequently, longer running times of FBG.
In order to benchmark our method and demonstrate its applicability to real data, we used it to obtain CNV candidates from differential read depth data for rat populations divergently selected for tame and aggressive behavior (Fig. 1). As expected for a behavioral phenotype, the results are significantly enriched for annotations of neuronal development and function, showing that results are consistent with a hypothesis that CNV play a role in the domestication syndrome. To the best of our knowledge, this is the first time fully Bayesian inference on several hundreds of millions of latent state variables has been performed on a commodity laptop within minutes.
As was shown previously [6, 7], compressing the observed data into blocks of sufficient statistics can significantly speed up Bayesian inference, in particular Forward–Backward Gibbs sampling (FBG). While [7] used a static compression heuristic based on kd-trees, we used the discontinuities in the Haar wavelet regression as block boundaries, based on the smallest emission variance among all latent states sampled in each FBG iteration [6]. We used a data structure termed wavelet tree to solve the problem of querying sufficient statistics for each block for a given resolution/noise level, without explicitly computing the wavelet regression. We will show that this data structure induces superfluous block boundaries, leading to suboptimal compression ratios, and replace it by a new data structure called a breakpoint array. For that, as well as to elucidate the reasoning behind the use of differential read depth data to maximize compression and avoid bias in HMM inference, we briefly review the principles of function regression using wavelet shrinkage: Let \(L^2({\mathbb {R}}) :=L^2({\mathbb {R}}, {\mathcal {B}}({\mathbb {R}}), \lambda )\) be the space of square-integrable functions over the reals. This is a Hilbert space with inner product \(\left\langle f, g \right\rangle :=\int _{-\infty }^{\infty } f(x)g(x)dx\). As we are only concerned with functions over subsets of \({\mathbb {R}}\), the inner product commutes without involving the complex conjugate. The inner product induces the norm \(\left\| f \right\| :=\sqrt{\left\langle f, f \right\rangle }\). Two functions f, g are said to be orthogonal iff \(\left\langle f, g \right\rangle =0\), and a function f is called normal iff \(\left\| f \right\| = 1\). \(L^2({\mathbb {R}})\) contains all continuous and piecewise continuous functions, including all piecewise constant functions. Let
\[ \psi (t) :={\left\{ \begin{array}{ll} 1 &{} 0\le t< \frac{1}{2}\\ -1 &{} \frac{1}{2}\le t< 1\\ 0 &{} \text {otherwise} \end{array}\right. } \]
be the Haar wavelet [15], and \(\left\{ \psi _{j,k} (t) :=\frac{1}{\sqrt{2^j}} \psi \left( \frac{t - 2^j k}{2^j}\right) \right\}\), \({j, k\in {\mathbb {Z}}}\) (depicted in Fig. 2, top). Since \(\left\| \psi _{j,k} \right\| = 1\) and \(\left\langle \psi _{j,k}, \psi _{j',k'} \right\rangle = 0\) for \((j,k)\ne (j',k')\), this forms an orthonormal basis of \(L^2({\mathbb {R}})\), where a function y is represented as the linear combination \(y = \sum _{j, k\in {\mathbb {Z}}} \left\langle \psi _{j,k}, y \right\rangle \psi _{j,k}\). The set of detail coefficients \(d_{j,k}:=\left\langle \psi _{j,k}, y \right\rangle\) is called the wavelet transform of y. A wavelet is said to have m vanishing moments if \(\left\langle p^i, \psi \right\rangle = 0\) for all \(0\le i < m\), p constant; it follows that \(\psi\) is orthogonal to any polynomial of degree less than m, since \(\left\langle \sum _{i=0}^{m-1} p^i, \psi \right\rangle = \sum _{i=0}^{m-1}\left\langle p^i, \psi \right\rangle = 0\). This property is called polynomial suppression [16]. The Haar wavelet has one vanishing moment, so it is orthogonal only to constant functions.
For computational applications, a vector \(\mathbf{f }\) is obtained by sampling f at equidistant intervals. The discrete versions of the wavelets are then obtained as \({\varvec{\psi }} _{j,k}[t]:=\psi _{j,k}(t)\) for \(t\in {\mathbb {N}}\). These inherit properties such as orthogonality, finite energy and vanishing moments from their continuous counterparts. Let
\[ b^+_{j,k} :=2^j k, \qquad b^\pm _{j,k} :=2^j k + 2^{j-1}, \qquad b^-_{j,k} :=2^j (k+1) \]
be the position after the left, central and right discontinuity of \({\varvec{\psi }} _{j,k}\), respectively.
The Haar wavelet transform is an orthogonal transform, represented by a matrix \(\mathcal {W}\) with rows \({\varvec{\psi }} _{j,k}\). Let the observed signal be a sampling of a function f corrupted by centered Gaussian noise, i.e. \(\mathbf{y } = \mathbf{f } + {\varvec{\epsilon }}, {\varvec{\epsilon }}[t] \sim _{\text {i.i.d.}} N(0, \sigma ^2)\). Since the wavelet transform is linear, it acts on the signal and noise component independently, i.e. \(\mathcal {W} \mathbf{y } = \mathcal {W} (\mathbf{f } + {\varvec{\epsilon }}) = \mathcal {W} \mathbf{f } + \mathcal {W} {\varvec{\epsilon }}\). The central idea in wavelet shrinkage is that \(\left\langle \mathbf{f }, {\varvec{\psi }} _{j,k} \right\rangle =0\) if \(\mathbf{f }\) is polynomial over the entire support of \({\varvec{\psi }} _{j,k}\), due to polynomial suppression; this is the case in particular if the support does not span a discontinuity in \(\mathbf{f }\). Furthermore, due to orthogonality of \(\mathcal {W}\), \(\mathcal {W} {\varvec{\epsilon }}\) is again a random vector of i.i.d. random variables distributed as \(N(0, \sigma ^2)\), so the noise is maintained under the wavelet transform. In general, orthogonal maps preserve the \(L^2\) norm, so \(\left\| \mathcal {W} {\varvec{\epsilon }} \right\| = \left\| {\varvec{\epsilon }} \right\|\) and \(\left\| \mathcal {W} \mathbf{y } \right\| = \left\| \mathbf{y } \right\|\). It follows that for piecewise polynomial functions with only a few discontinuities, \(\left\langle \mathbf{y }, {\varvec{\psi }} _{j,k} \right\rangle = \left\langle {\varvec{\epsilon }}, {\varvec{\psi }} _{j,k} \right\rangle\) for most j, k, i.e. most wavelet coefficients are only non-zero due to noise. The idea is then to find a way to create a vector \(\mathbf{w }\) by setting a suitable set of coefficients in \(\mathcal {W} \mathbf{f }\) to zero, and then use the inverse wavelet transform as a regression \(\hat{\mathbf{f }} :=\mathcal {W} ^\intercal \mathbf{w }\).
The simplest method is to use the universal threshold \(\lambda _u :=\sqrt{2\ln T}\sigma\) [17], which can be interpreted as the expected maximum deviation of T such Gaussian random variables from their mean, as derived by the Cramér–Chernoff method [18]. Hence, removing coefficients of absolute value below \(\lambda _u\) removes all noise coefficients with high probability [17]. Using different variances, the resulting \(\hat{\mathbf{f }}\) are piecewise constant functions, whose discontinuities we interpret as block boundaries in a compression scheme. In our approach, \(\sigma ^2\) is the minimum variance of all emission parameters in the HMM as sampled at each iteration. The existence of a discontinuity obviously depends on the magnitude of the wavelet coefficients involved: if \(\left| d_{j,k} \right| >\lambda _u\), then there are block boundaries before data positions \(b_{j,k}^+\), \(b_{j,k}^\pm\) and \(b_{j,k}^-\).
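To make the shrinkage step concrete, the following self-contained sketch (our illustration, not HaMMLET's implementation, which never computes the regression explicitly) applies the orthonormal Haar transform, zeroes all detail coefficients below the universal threshold \(\lambda _u = \sqrt{2\ln T}\,\sigma\), and inverts the transform:

```python
import math

def haar(y):
    """Orthonormal Haar transform (len(y) must be a power of two).
    Returns [scaling coefficient, coarse details, ..., finest details]."""
    if len(y) == 1:
        return list(y)
    c = [(y[2*k] + y[2*k+1]) / math.sqrt(2) for k in range(len(y) // 2)]
    d = [(y[2*k] - y[2*k+1]) / math.sqrt(2) for k in range(len(y) // 2)]
    return haar(c) + d

def ihaar(w):
    """Inverse of haar()."""
    if len(w) == 1:
        return list(w)
    c, d = ihaar(w[:len(w) // 2]), w[len(w) // 2:]
    y = []
    for ck, dk in zip(c, d):
        y += [(ck + dk) / math.sqrt(2), (ck - dk) / math.sqrt(2)]
    return y

def shrink(y, sigma):
    """Universal-threshold shrinkage: zero details below sqrt(2 ln T)*sigma."""
    lam = sigma * math.sqrt(2 * math.log(len(y)))
    w = haar(y)
    return ihaar([w[0]] + [0.0 if abs(d) < lam else d for d in w[1:]])

# A piecewise constant signal: the single jump coefficient exceeds lambda_u
# and survives, while all other detail coefficients are zero already.
y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
f = shrink(y, 0.1)
```

The reconstruction \(\hat{\mathbf{f }}\) is piecewise constant, and its single discontinuity (between positions 3 and 4) is exactly the kind of block boundary the compression scheme extracts.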
Implementation
Block generators
In order to avoid recomputing the wavelet regression explicitly for a new threshold in each FBG iteration, consider the following abstract data structure:
Definition 2.1
(Block generator) Let \(\mathbf{b }\) be a vector of breakpoint weights. For a threshold \(\lambda\), let \(\mathbf{Y }_\lambda\) be a partition of \(\mathbf{y }\) into blocks such that there is a block boundary between positions \(t-1\) and t if \(\mathbf{b }[t]\ge \lambda\). We call a data structure a block generator if it can, for any threshold \(\lambda\), generate an ordered sequence of sufficient statistics that represents \(\mathbf{Y }_\lambda\). A block generator is called compressive if, for all \(\lambda\), \(\mathbf{b }[t]<\lambda\) implies that no breakpoint is created between \(t-1\) and t. It is called subcompressive if for some \(\lambda\) such a superfluous block boundary is created. A block generator is called space-efficient if it stores no more than T sufficient statistics, where T is the number of input data points.
This definition of a block generator implies that \(\mathbf{Y }_{\lambda _1}\) is a subdivision of \(\mathbf{Y }_{\lambda _2}\) if \(\lambda _1 \le \lambda _2\). For sufficiently small thresholds, we require sufficient statistics for each data point, hence any block generator implementation will have to store a minimum of T sufficient statistics. On the other hand, if all entries in \(\mathbf{b }\) are unique, each breakpoint subdivides a block defined by a higher threshold, and a simple induction argument shows that a block generator has to be able to generate \(2T-1\) different blocks and their sufficient statistics: starting with a single block of size T and a sorted sequence of threshold values in \(\mathbf{b }\), each threshold creates two new blocks by subdividing one block in the previous partition.
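The counting argument can be checked exhaustively for small T. In the following sketch (our illustration; a block boundary is placed before position t whenever \(\mathbf{b }[t]\ge \lambda\), with \(\mathbf{b }[0]=\infty\)), sweeping \(\lambda\) over all weights of a small vector with unique entries yields exactly \(2T-1\) distinct blocks:

```python
def blocks_at(b, lam):
    """Partition [0, T) into blocks; boundary before t iff b[t] >= lam."""
    T = len(b)
    starts = [t for t in range(T) if b[t] >= lam]  # b[0] = inf: always a start
    return list(zip(starts, starts[1:] + [T]))

b = [float('inf'), 3.0, 1.0, 4.0, 2.0]  # unique breakpoint weights, T = 5
distinct = set()
for lam in sorted(set(b)):
    distinct.update(blocks_at(b, lam))
# distinct now holds all blocks that can ever occur, 2T - 1 = 9 of them
```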
We previously defined the wavelet tree data structure to serve as a block generator; for details, see [6]. It is based on the observation that the nonzero support intervals of wavelet basis functions are nested along scales (cf. Fig. 2). Each node corresponds to a basis function, with its position corresponding to the position of the wavelet’s central discontinuity. The wavelet tree stores the maximum absolute coefficient \(s_{ij}\) of its subtree in the node. To obtain the sufficient statistics for a block at a given noise level, the tree is traversed in DFS order. Whenever a node is encountered for which \(s_{ij}<\lambda\), none of its descendants can have a higher value, and hence no additional discontinuities. The subtree is pruned from the DFS, creating a single block for the sufficient statistics of its leaf nodes. On the other hand, if \(s_{ij} \ge \lambda\), the search recurses on the subtrees, creating additional block boundaries between leaves.
Unfortunately, the wavelet tree is subcompressive, as demonstrated by the counterexample in Fig. 2, as well as memory-inefficient, since it stores \(2T-1\) statistics. It should be noted that, while the wavelet tree stores as many sufficient statistics as needed for T data points, the fact that it is subcompressive implies that the block structures it creates differ from those of a compressive block generator, and hence these are not the same \(2T-1\) statistics that would occur across all block structures a compressive block generator would yield.
In order to provide an efficient implementation, we separate a block generator into two substructures: a breakpoint array to derive a sequence of start and end positions for blocks, and an integral array to query the sufficient statistics for each block.
Integral array for block statistics
Let a data structure \(D(\mathbf{y })\) support the following query: given a start index s and an end index e, with \(s<e\), return the sufficient statistics in the half-open interval [s, e), i.e. \(\sum _{i=s}^{e-1}\mathbf{T }(\mathbf{y }[i])\). A trivial implementation of such a data structure would be to store the statistics of each input position, and then iterate through the array and calculate their cumulative sums between breakpoints. This is obviously costly for huge data, as it incurs \(\Theta (N)\) time complexity for a block of size N. Constant-time queries could be made by precomputing all \(T^2\) statistics, which is obviously prohibitive for large data.
The basic idea for querying sufficient statistics comes from a simple data structure in image processing called a summed-area table or integral image [19], which is used to query the sum of a rectangular region in constant time. As its one-dimensional equivalent, let \(\mathbf{v }\) be an integral array such that
\[ \mathbf{v }[t] :=\sum _{i=0}^{t-1} \mathbf{T }(\mathbf{y }[i]), \qquad \mathbf{v }[0] :=\mathbf{0}. \]
For any arbitrary start and end positions s, e, the sufficient statistics of the block [s, e) can be calculated in constant time as
\[ \sum _{i=s}^{e-1} \mathbf{T }(\mathbf{y }[i]) = \mathbf{v }[e] - \mathbf{v }[s]. \]
In contrast to image processing, where integral arrays are constructed over integer data, sufficient statistics require floating-point values for most distributions. Unfortunately, this incurs numeric problems for large data sizes. An IEEE 754 single-precision float has between 6 and 9 significant digits. Assuming that values for sufficient statistics are on the order of 1, the further back a data point is in \(\mathbf{v }\), the more of its significant digits are used to store the sum. Neighboring entries will be similar or even equal, leading to catastrophic cancellation for short segments. For instance, values above \(\sim\) 17 million are rounded to multiples of 2, so that even if each entry was 1.0, blocks of size 1 would be queried as 0.
To alleviate this, we subdivide \(\mathbf{v }\) into non-overlapping cells of size c, and compute partial cumulative sums of sufficient statistics within each cell; for convenience, we compute these sums from high to low indices, see Fig. 3. It is then easy to see that \(\sum _{t=s}^{e-1} \mathbf{T }(\mathbf{y }[t]) = \left( \sum _j \mathbf{v }[j]\right) - \mathbf{v }[e]\) for \(j\in \left\{ s\right\} \cup \left\{ i \,\big |\, s<i\le e, i\equiv 0 \pmod {c}\right\}\). In our implementation, we used \(c=2^{16}=65{,}536\).
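The cell-based integral array can be sketched as follows (our illustration, using a toy cell size of 4 instead of \(2^{16}\), with plain numbers standing in for the sufficient statistics \(\mathbf{T }(\mathbf{y }[t])\)):

```python
C = 4  # toy cell size for illustration; the paper uses 2**16

def build_integral_array(stats, c=C):
    """Partial cumulative sums within each cell of size c, accumulated from
    high to low indices; v[T] = 0 acts as a sentinel for queries ending at T."""
    T = len(stats)
    v = [0.0] * (T + 1)
    for i in range(T - 1, -1, -1):
        # carry the running sum only while we stay inside the same cell
        v[i] = stats[i] + (v[i + 1] if (i + 1) % c != 0 else 0.0)
    return v

def query(v, s, e, c=C):
    """Sum of stats[s:e]: v[s] plus every cell-boundary entry in (s, e],
    minus v[e], exactly as in the formula above."""
    total = v[s] - v[e]
    j = (s // c + 1) * c        # first cell boundary strictly after s
    while j <= e:
        total += v[j]
        j += c
    return total

stats = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
v = build_integral_array(stats)
```

Because each entry of \(\mathbf{v }\) only accumulates at most c values, the magnitude of the stored sums, and hence the cancellation error, stays bounded regardless of T.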
Breakpoint array for block boundaries
In order to create a block generator, the integral array has to be supplemented with a data structure which yields start and end positions \(s_k(\lambda )\), \(e_k(\lambda )\) for subsequent blocks k. Since \(e_k(\lambda )=s_{k+1}(\lambda )\), it suffices to implement an iterator over \(s_k\) for increasing k, where \(s_0 = 0\) and \(s_{k+1} = e_k(\lambda )\). We use a simple array of pointers to facilitate these queries:
Definition 2.2
(Breakpoint array) Let \(\mathbf{b } \in {\mathbb {R}} ^T\) be a vector of breakpoint weights, and \(\mathbf{p } \in {\mathbb {Z}} ^T_+\) be a vector of pointers. A data structure \((\mathbf{b }, \mathbf{p })\) is called a breakpoint array of input data \(\mathbf{y }\) if and only if \(\forall t<i<t+\mathbf{p } [t]: \mathbf{b } [t] >\mathbf{b } [i]\). We call each interval \([t, \dots , t+\mathbf{p } [t]-1]\) a stretch at t. A breakpoint array is called maximal if for all t there exists no \(n > \mathbf{p } [t]\) such that setting \(\mathbf{p } [t]\) to n would still result in a valid breakpoint array.
A breakpoint array can be constructed in linear time O(T) (Algorithm 1), based on a linear-time algorithm to calculate the pointers to the next element at least as large as the current one, which is well established in algorithmic folklore. It is modified here to use the distance to that element instead of a direct pointer (line 20, which would otherwise read \(\mathbf{p }[i] \leftarrow t\)). The stack is changed to a deque to accommodate the inclusion of a maximum jump size m: the front of the deque is popped and its pointer set whenever it is m positions away, which happens at most T times.
For each t, \(\mathbf{p } [t]\) points to the beginning of the next stretch. Within each stretch, the highest breakpoint weight is located at its first position; when searching for weights below a given threshold \(\lambda\), once the first weight is found to be below \(\lambda\), all others can safely be ignored, leading to a simple query: starting at \(e_{k}(\lambda )+1\), follow the pointers until a weight above the threshold is encountered (see Fig. 4). In order to derive complexity results, we require the following result:
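The construction and the stretch-skipping query can be sketched as follows (our rendering of the ideas behind Algorithm 1 and the query; the line numbers of the paper's pseudocode do not apply here, and the 2-byte integer packing is omitted):

```python
from collections import deque

def build_breakpoint_array(b, m=65536):
    """p[t] = distance to the next position with weight >= b[t], capped at m,
    so that b[t] > b[i] holds for all t < i < t + p[t]."""
    T = len(b)
    p = [0] * T
    dq = deque()                      # indices with strictly decreasing weights
    for t in range(T):
        while dq and b[dq[-1]] <= b[t]:
            i = dq.pop()
            p[i] = t - i              # next weight at least as large found
        while dq and t - dq[0] >= m:
            p[dq.popleft()] = m       # cap the jump at the maximum size m
        dq.append(t)
    while dq:                         # no weight at least as large to the right
        i = dq.pop()
        p[i] = min(T - i, m)
    return p

def block_starts(b, p, lam):
    """Start positions of all blocks at threshold lam; since the first weight
    of a stretch is its largest, whole stretches below lam are skipped."""
    T = len(b)
    starts, t = [0], 1
    while t < T:
        if b[t] >= lam:
            starts.append(t)          # boundary before t
            t += 1
        else:
            t += p[t]                 # skip the rest of the stretch
    return starts

b = [9.0, 1.0, 3.0, 2.0, 5.0, 1.0, 1.0, 4.0]
p = build_breakpoint_array(b)
starts = block_starts(b, p, 3.0)
```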
Theorem 2.1
(Left-to-right maxima [20, 21]) For a vector \(\mathbf{x }\), let \(\mathbf{x }[t]\) be called a left-to-right maximum of \(\mathbf{x }\) iff \(\forall i<t: \mathbf{x }[i]<\mathbf{x }[t]\). Let \(m_{\mathbf{x }}\) count the number of left-to-right maximal elements in \(\mathbf{x }\). For a random permutation of \(\mathbf{x }\) with \(\left| \mathbf{x } \right| =N\) elements, \({\mathbb {E}}\left[ m_{\mathbf{x }}\right] = \sum _{i=1}^N\frac{1}{i} \rightarrow \ln N\) as \(N\rightarrow \infty\). Due to symmetry, the same result holds for minima and right-to-left extrema.
Following pointers in \(\mathbf{p }\) creates a sequence of left-to-right maxima. For a block of size N, starting at \(e_k(\lambda )\), there are \(M:=N-2\) elements in \(I:=[e_k(\lambda )+1, \dots , e_k(\lambda )+N=e_{k+1}(\lambda ))\) which can appear in any order, which implies that \(e_{k+1}(\lambda )\) can be found in \(O(\log N)\) expected time. Likewise, the maximum expected stack size in the constructor (Algorithm 1) is \(\ln T\): assume \(m=\infty\). An element at t is pushed whenever there exists an index j on the stack such that \(\forall i=j, \dots , \text {top}: \mathbf{w }[i] < \mathbf{w }[t]\). Given the smallest such j, the stack is popped until \(\text {top} = j-1\), and \(\mathbf{w }[j-1] > \mathbf{w }[t]\). Therefore, the stack contains the right-to-left minima of \(\mathbf{w }[1:t]\) after pushing index t, and the claim follows from Theorem 2.1 for \(t=T\). For any \(m<\infty\), the front of the deque gets popped, thus only decreasing the stack size. For the size \(T_{hg}\) of the human genome (3.5 billion), the expected maximum stack size is \(<22\), a negligible overhead. We noticed that, for noisy data, most entries in \(\mathbf{p }\) are much smaller than T, and using pointer-sized integers such as size_t in C++ (typically 8 bytes on 64-bit systems) would be wasteful. Instead, we use a 2-byte unsigned integer type to accommodate jumps up to \(m=65{,}536\). The resulting breakpoint array is no longer maximal, but maintains its space-efficiency and compressivity. The query overhead is minimal in practice; even in the case of a single block for genome-sized data, \(\frac{T_{hg}}{65{,}536} < 54{,}000\).
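Theorem 2.1 is easy to verify exhaustively for small N. The following check (a sanity check only, not part of the method) averages the number of left-to-right maxima over all permutations and compares it to the harmonic number \(\sum _{i=1}^N \frac{1}{i}\):

```python
import math
from itertools import permutations

def ltr_maxima(x):
    """Count left-to-right maxima: positions t with x[i] < x[t] for all i < t."""
    best, count = float('-inf'), 0
    for v in x:
        if v > best:
            best, count = v, count + 1
    return count

N = 5
avg = sum(ltr_maxima(p) for p in permutations(range(N))) / math.factorial(N)
harmonic = sum(1.0 / i for i in range(1, N + 1))  # H_5 = 137/60
```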
Haar breakpoint weights
Having established a data structure to iterate over blocks for any given compression level, we now define a vector \(\mathbf{b }_H\) of breakpoint weights for the Haar wavelet transform, i.e. \(\mathbf{b }_H[t] > \lambda\) iff the Haar wavelet regression with threshold \(\lambda\) contains a discontinuity between \(t-1\) and t, and therefore a block boundary in Haar wavelet compression. This is the case if the absolute value of any coefficient of wavelets that have any of their discontinuities at t is above the threshold, so we define, for any \(t \in [0, T)\),
\[ \mathbf{b }_H[t] :=\max _{j,k} \left\{ \left| d_{j,k} \right| \,\big |\, t \in \left\{ b^+_{j,k}, b^\pm _{j,k}, b^-_{j,k}\right\} \right\} \qquad (1) \]
for \(t>0\) or \(b^-_{j,k} < T\). Additionally, there is always a block boundary before the first position, so \(\mathbf{b }_H[0]:=\infty\). Furthermore, if T is not a power of 2, some wavelets have incomplete support. As their magnitude is unknown without padding the data, we assume that their detail coefficient is potentially larger than any threshold, inducing a breakpoint at the central discontinuity, so \(\mathbf{b }_H\left[ b^\pm _{j,k}\right] :=\infty\) for \(b^-_{j,k}\ge T\). A breakpoint array initialized with these weights is called a Haar breakpoint array.
We will show that \(\mathbf{b }_H\) can be computed in-place and in linear time. For that purpose, we first define the maxlet array as a generalization of the Haar transform to arbitrary data sizes and absolute values: for \(b^\pm _{j,k} \in [0, T)\), let
\[ \mathbf{b }_M\left[ b^\pm _{j,k}\right] :={\left\{ \begin{array}{ll} \left| d_{j,k} \right| &{} b^-_{j,k} < T\\ \infty &{} \text {otherwise.} \end{array}\right. } \]
We later define the Haar boundary transform to compute \(\mathbf{b }_H\) from \(\mathbf{b }_M\). In order to compute \(\mathbf{b }_M\) in-place, we cannot use the pyramid algorithm as in [6], since it requires padding of the data to a size \(T'\in 2^{\mathbb {N}}\), \(T\le T'\le 2T\), as well as an auxiliary array of size \(T'\), thereby increasing the memory by up to a factor of 4. Instead, we use a more recent in-place calculation of the Haar wavelet transform based on the lifting scheme [13, 22]. It is based on the following recursions:
\[ c_{j,k} :=\frac{c_{j-1,2k} + c_{j-1,2k+1}}{\sqrt{2}}, \qquad d_{j,k} :=\frac{c_{j-1,2k} - c_{j-1,2k+1}}{\sqrt{2}}. \]
These relations are illustrated in Fig. 5 using dotted edges, with \(d_{j,k}=w_{j,k}\) and \(c_{0,k} = y_{k} = \mathbf{y }[k]\). By storing \(c_{j,k}\) at index \(b^+_{j,k}\) and \(d_{j,k}\) at index \(b^\pm _{j,k}\), this yields a simple in-place algorithm which never overwrites \(d_{j,k}\) once it is calculated. Notice that the detail coefficients \(d_{j,k}\) are stored at the position \(b^\pm _{j,k}\) corresponding to the central discontinuity in their corresponding wavelet, and that this corresponds to an in-order DFS layout of the wavelet tree without the leaves corresponding to the input data, with the leftmost leaf at index 1 (Fig. 5, bold lines); the tree is created from the leaves up, and from left to right. A straightforward modification of the lifting scheme to calculate \(\mathbf{b }_M\) is shown in Algorithm 2, where line 13 is changed to yield the absolute value, and lines 9, 14 and 15 are added to ensure \(\mathbf{b }_M\left[ b^\pm _{j,k}\right] :=\infty\) for \(b^-_{j,k}\ge T\).
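The storage layout can be sketched as follows (our simplification: it assumes the data size is a power of two, whereas Algorithm 2 handles arbitrary T and the \(\infty\) entries for incomplete wavelets):

```python
import math

def inplace_haar_maxlet(a):
    """In-place lifting-style cascade. Afterwards a[0] holds the overall
    scaling coefficient, and each |d_{j,k}| sits at the index of its wavelet's
    central discontinuity b±; details are never overwritten once computed.
    Sketch only: len(a) must be a power of two."""
    T = len(a)
    stride = 1
    while stride < T:
        for k in range(0, T, 2 * stride):
            c = (a[k] + a[k + stride]) / math.sqrt(2)
            d = abs(a[k] - a[k + stride]) / math.sqrt(2)
            a[k] = c             # scaling coefficient at b+ (left edge)
            a[k + stride] = d    # |detail| at the central discontinuity b±
        stride *= 2
    return a

a = [1.0, 1.0, 5.0, 5.0]
inplace_haar_maxlet(a)
# a[0] is the overall scaling coefficient; a[2] = |d| of the top-level
# wavelet, whose central discontinuity sits between the two plateaus.
```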
To derive the Haar breakpoint weights from the maxlet transform, we introduce the Haar boundary transform (Algorithm 3), which performs the necessary maximum computations for Eq. 1 in-place and in linear time O(T). In Fig. 5 (top), the set of nodes considered in Eq. 1 are the direct descendants of a node along the solid lines. Algorithm 3 is simple: it iterates over the scales j in a top-down fashion (Fig. 5), and writes the maxima of all required nodes at lower levels \(\ell \le j\) to the current array position. Since it never reads values from levels \(>j\), no extra memory is required, and the algorithm is in-place. Since any node is considered at most twice for updating a node on a higher level, the running time of the Haar boundary transform is also linear, O(T).
Compressed marginal records
In order to keep track of the states sampled for each position during Gibbs sampling, we require the following data structure:
Definition 2.3
(Marginal records) Let \(t\in [0, \ldots , T)\), let \(s_{\max }\) be the largest state sampled during FBG, and let \(s\in [0, \ldots , s_{\max }]\). A marginal record is a data structure which allows storing and querying the number of times state s was observed at data index t.
The previous solution to recording marginal state counts was inefficient. Since nodes in the wavelet tree corresponded to compression blocks, counts were stored directly in the nodes. For n latent HMM states, this required allocation of 2Tn array elements, which was wasteful since the quick convergence of HaMMLET meant that many blocks would never be sampled, or only be assigned to a small subset of CNV states. Such a preallocation approach also requires the number of states to be known in advance, and precludes further extensions to priors on the state number such as the Dirichlet Process. Though we resorted to dynamic allocation, the necessary variables for housekeeping still incurred large overhead.
For static compression blocks, marginals could simply be stored in a set of arrays with an additional array containing block sizes, essentially a run-length encoding (RLE), as illustrated by the right column of Fig. 6. This approach, however, is complicated by the use of dynamic compression: at each new iteration, a different block structure is created, which requires existing RLE segments to be split into multiple parts, each of which will have counts for a different state added. This could be solved trivially using a linked-list implementation, in which new segments are inserted with the appropriate updates of their neighbors' sizes. This approach is obviously wasteful.
To get around these issues, we developed an encoding for marginal records that stores counts sequentially in a vector of integers in a highly compressed fashion with minimum overhead. Adding records for run-length encoded state sequences is performed using a queue with iterator access to its front elements, such as implemented by the C++ STL deque, and requires a single pass over the state records and is therefore linear. The memory overhead is 2 bytes per segment, plus one bit for every 32 integers. The encoding of marginal counts for a single position is performed using a sequence \(\mathbf{c }\) of signed integers. A negative number is used to store the counts for a state. The state s(i) of a position i is recursively defined as
\[ s(i) :={\left\{ \begin{array}{ll} 0 &{} i = 0\\ \mathbf{c }[i-1] &{} \mathbf{c }[i-1] > 0\\ s(i-1)+1 &{} \text {otherwise.} \end{array}\right. } \]
Positive entries are called index values. We further require that all index values be in strictly increasing order, and that no unnecessary index is used, i.e. we require \(\forall \mathbf{c }[i]>0: s(i-1)+1 < \mathbf{c }[i]\). In other words, runs of states having observed counts are represented as runs of negative numbers, and runs of zero counts are represented as a single number indicating the state label of the next higher state with non-zero counts. For instance, the count vector (2, 0, 0, 8, 1, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0) would be encoded as \((-2, 3, -8, -1, -4, 9, -5)\), and the corresponding states are (0, 1, 3, 4, 5, 6, 9), though 1 and 6 are somewhat inconsequential as they have no counts associated with them; note that the decision to use negative signs for counts instead of index values is arbitrary in principle, but leads to using fewer negations in the implementation. In settings where quick convergence is expected, the number of zeros is expected to be high, leading to good compression under this scheme. In general, assume that the marginals contain M distinct segments after running FBG, and the HMM has S states. Then, the queue can contain no more than \((2S + 1)M\) entries: for each segment, one zero to mark the beginning of a segment, and up to one positive and one negative value per state. If the number of latent HMM states is limited to S, then there can be no more than S non-zero entries per segment. Hence, for reasonably high compression ratios, this amounts to small memory usage. For instance, at a compression ratio of 300 for a human genome at base-level resolution and 10 latent HMM states, marginal records using 2-byte signed integers require less than 234 MB. In practice, not every segment will contain 11 values, due to fast convergence, and the numbers get even smaller. Compared to the storage requirements of the block generator, this is negligible.
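A small decoder makes the scheme concrete (our sketch; the actual implementation works on a deque of 2-byte integers and also handles the per-segment zero markers, which we omit here):

```python
def decode_counts(c, n_states):
    """Expand the signed-integer encoding back into a dense count vector.
    Negative entries are counts for the current state; positive entries are
    index values that jump the current state label forward."""
    counts = [0] * n_states
    state = 0
    for entry in c:
        if entry > 0:
            state = entry           # index value: skip a run of zero counts
        else:
            counts[state] = -entry  # count for the current state
            state += 1
    return counts

# Decoding the encoded sequence from the example: counts for states
# 0, 3, 4, 5 and 9; all other states have zero counts.
counts = decode_counts([-2, 3, -8, -1, -4, 9, -5], 19)
```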
Results and discussion
In order to verify that the higher compression did not adversely affect the segmentation quality, we reran the evaluation on the 129,000 simulated datasets in [6] using our new implementation of HaMMLET. The F-measures and plots are virtually identical to Fig. 5 in that paper, and are therefore not shown here (see Web Supplement).
In the following subsections, we present a case study of CNV inference on differential WGS read depth data using HaMMLET with the Haar breakpoint array.
Experiment background
The domestication of a handful of animal species, starting in the early Holocene, has played a crucial role in the development of complex human societies [23]. While we have learned a great deal about when and where animal domestication occurred, the genetic changes that underlie the phenotypic differences between domestic animals and their wild progenitors remain largely unknown. It has been observed that domestic animal species tend to share a suite of behavioral, physiological and morphological traits that are absent or rarely observed in their wild progenitors [24, 25]. These traits include changes in pigmentation, craniofacial anatomy, hormonal levels, seasonal reproduction cycles and increased docility [26]. This suite of changes is referred to as the “domestication syndrome”. A long-standing question in evolutionary biology is whether these convergent changes are the result of genetic drift, artificial selection by humans for each individual trait, or pleiotropic effects of selection for a few or even a single trait. A proponent of the latter hypothesis was the Academician Dmitry K. Belyaev. He hypothesized that selection for tameness at the start of the domestication process had pleiotropic effects that explained many of the features of the domestication syndrome. To test his hypothesis, he began a program of experimental domestication of the silver fox (Vulpes vulpes) in Novosibirsk, Siberia in 1959. Foxes obtained from fur farms were selectively bred for their behavioral response to an approaching human. One line of foxes was bred for tame behavior towards humans, while a control line was selected for a fearful and aggressive response towards humans, to maintain the wild-type behavior despite being kept in captive conditions. After just a few generations of selective breeding, the tame line began to show many of the traits associated with the domestication syndrome, including changes in pigmentation, morphology and behavior [27,28,29].
The same experimental setup of artificially selecting two lines, one for tame and one for fearful and aggressive behavior towards humans, was also applied by the same research group to the brown Norway rat (Rattus norvegicus), with similar results [30]. These results seem to confirm Belyaev’s hypothesis that selection for tameness alone could explain many of the features of the domestication syndrome. However, the specific genetic changes that underlie these differences remain unknown. Knowledge of the genetic variants that have been selected in these lines could lead to mechanistic insights into the domestication process. Genomic structural variants are of particular interest, as they are known to have played a role in the adaptation of other domestic animals [31], and structural variants that affect multiple functional genomic loci are one possible explanation for the rapid response to selection observed in these lines. To address this issue, we analysed whole-genome data that was generated from multiple individuals from the tame and aggressive lines of rats.
Sample origins and data generation
DNA samples were obtained from two rat lines originating from a shared wild source population, subsequently maintained in isolation and divergently selected for \(\sim\)70 generations for their behavioral response to humans. Twenty samples were obtained from the tame line, which has been selected for a reduced fear response towards an approaching human hand. Twenty samples were obtained from the aggressive line, which has been selected for an increase in fearful and aggressive behavior towards an approaching human hand. DNA extraction was carried out at the Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, and at the Max Planck Institute for Evolutionary Anthropology (MPI-EVA), Germany.
For all samples, sequencing libraries were generated consisting of 125 bp double-indexed paired-end reads. Samples were pooled into a single library in order to avoid any batch effects during sequencing. Sequencing was performed on a combination of the Illumina Genome Analyzer II and HiSeq platforms. Library preparation and sequencing were carried out at the MPI-EVA. The rats have a mean coverage of \(\sim\)4× per individual. Base calling was done using freeIbis [32]. Adapters were removed and potentially chimeric sequences flagged using leeHom with default parameters [33]. Reads were demultiplexed using deML with default quality thresholds [34]. Reads were then mapped to the Rattus norvegicus reference assembly rno5 using BWA with default parameters [35]. Duplicate read removal was performed with Picard (http://broadinstitute.github.io/picard/). Local indel realignment was performed using GATK [36]. The lowest mapping position was recorded for each read, and their counts were accumulated. Start counts for the tame population were subtracted from their counterparts in the aggressive population, yielding 1,880,703,547 data points. Due to the low coverage, the data showed highly discrete noise, and hence the data was averaged over non-overlapping windows of 20 positions to approximate Gaussian noise, resulting in 94,035,178 input positions. We then ran HaMMLET with 8 CNV states and automatic priors; see [6].
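The differencing and windowing step can be sketched as follows (a simplified illustration with hypothetical names, not the published pipeline; the real input consists of accumulated per-position read-start counts for entire chromosomes):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Subtract per-position read-start counts of the tame line from those of
// the aggressive line, then average the differences over non-overlapping
// windows of a fixed size (20 in the paper) to approximate Gaussian noise.
// A trailing partial window is discarded in this sketch.
std::vector<double> windowedDifference(const std::vector<int>& aggressive,
                                       const std::vector<int>& tame,
                                       std::size_t windowSize) {
    std::vector<double> result;
    const std::size_t n = std::min(aggressive.size(), tame.size());
    for (std::size_t start = 0; start + windowSize <= n; start += windowSize) {
        double sum = 0.0;
        for (std::size_t i = start; i < start + windowSize; ++i)
            sum += aggressive[i] - tame[i];  // differential read depth
        result.push_back(sum / static_cast<double>(windowSize));
    }
    return result;
}
```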
Computational benchmarks
On a computer with an Intel Xeon E7-8890 v4 CPU (2.20 GHz) and 1 TB RAM, running Ubuntu 14.04.5 LTS, full Bayesian inference with HaMMLET for 200 iterations with a burn-in of 1800 for an 8-state model required 3 min 41 s and 1.3 GB RAM on a single core. By comparison, the previously published version of HaMMLET took 1 h 5 min 27 s, using 40 GB RAM, a 17.8-fold speedup.
For a broader evaluation, we created 100 replicates of the data by splitting it into 2500 chunks of equal size, which we then permuted randomly. We measured memory usage (maximum resident set size), running time, as well as cache behavior (minor page faults); see the boxplots in Fig. 7. The smaller savings in runtime compared to the original data can be attributed to the fact that permutation is likely to disrupt long, highly compressible sections of the data.
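On Linux, both metrics can be queried from within a process via getrusage; a minimal sketch (ours, not the benchmark harness actually used) looks like this:

```cpp
#include <sys/resource.h>

// Collect the two memory-related benchmark metrics mentioned above:
// peak resident set size and minor page faults (faults serviced without
// disk I/O, reflecting allocator and cache behavior). Linux-specific:
// ru_maxrss is reported in KiB there; other platforms differ.
struct UsageStats {
    long maxRssKiB;    // peak memory (ru_maxrss)
    long minorFaults;  // page faults not requiring I/O (ru_minflt)
};

UsageStats currentUsage() {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    return UsageStats{ru.ru_maxrss, ru.ru_minflt};
}
```

Wall-clock running time would be measured separately, e.g. with std::chrono::steady_clock around the inference call.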
While the RAM usage remains almost constant among replicates within each implementation, we noticed that runtime and cache behavior varied widely in the old, but not the new implementation. We attribute this to the fact that the old compression scheme is suboptimal, yielding smaller blocks and hence more randomized assignment to states, leading to slower mixing properties of the Gibbs sampler. Notice that the data contains outliers which are likely to result from sampling small emission variances due to short compression blocks.
Biological results
We consider all genomic segments with an absolute state mean \(\ge 1\) as containing putative structural variation segregating between the tame and aggressive rat lines. This results in 10,083,374 regions with a mean size of 407 base pairs. We identify all genes that are within or overlap these regions by \(\ge 1\) base pair using Ensembl’s Variant Effect Predictor [37]. We find 1036 genes with at least partial overlap with these regions.
To investigate the potential phenotypic consequences of these structural variants, we performed GO gene enrichment analysis using the software Webgestalt [38, 39]. We tested for enrichment of GO categories among all genes overlapping these structural variants, using all genes in the rat genome as background. We consider as significantly enriched all pathways with p-value \(<0.05\) after using the Benjamini and Hochberg procedure to correct for multiple hypothesis testing [40]. We identify many significantly enriched pathways (Additional file 1: Table S1). We now briefly discuss some of these pathways, the genes within them, and how they may inform us about the genetic changes underlying the phenotypic differences between these lines.
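The Benjamini–Hochberg step-up procedure itself is straightforward; a generic sketch (ours, not Webgestalt's implementation) sorts the m p-values and finds the largest rank k whose p-value lies below \(k \alpha / m\):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Benjamini-Hochberg step-up procedure: returns the number of hypotheses
// rejected while controlling the false discovery rate at level alpha.
// The k smallest p-values are called significant, where k is the largest
// rank with p_(k) <= (k / m) * alpha.
std::size_t benjaminiHochberg(std::vector<double> pvalues, double alpha) {
    std::sort(pvalues.begin(), pvalues.end());
    const double m = static_cast<double>(pvalues.size());
    std::size_t k = 0;
    for (std::size_t i = 0; i < pvalues.size(); ++i)
        if (pvalues[i] <= (static_cast<double>(i) + 1.0) / m * alpha)
            k = i + 1;  // step-up: remember the largest qualifying rank
    return k;
}
```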
The most significantly enriched pathway is “Synapse assembly” (p-value = 0.0028), with five genes located in putative structural variants segregating between the tame and aggressive rat lines. Some of these genes are associated with phenotypes that may be involved in the behavioral differences observed between the tame and aggressive rat lines. For example, one of the genes is the neuronal cadherin gene Cdh2. Missense mutations in this gene are associated with obsessive-compulsive behavior and Tourette disorder phenotypes in humans [41], and this gene has been associated with anxiety in mice [42]. Another gene encodes the ephrin receptor Ephb1. The ephrin receptor–ligand system is involved in the regulation of several developmental processes in the nervous system. Notably, mice with null mutations for this gene exhibit neuronal loss in the substantia nigra and display spontaneous locomotor hyperactivity [43]. This is interesting given that the tame and aggressive rats show differences in their activity in an open-field test [30].
We also observe multiple additional enriched pathways involved in neuronal development and function, e.g. “transmission of nerve impulse”, “regulation of neurological system process” and “dendrite morphogenesis”. We therefore suspect that many of these segregating structural variants may have been targeted by selection and are contributing to the phenotypic differences between these lines. Future study of the variants identified here may lead to insights into the domestication process. A more detailed evaluation of our findings will be published elsewhere. Plots of segmentation results for the entire genome can be found in the web supplement at https://schlieplab.org/Supplements/rats/.
Conclusion
We have presented a new wavelet compression scheme for HaMMLET. The compression is optimal in the sense that it does not introduce block boundaries beyond the discontinuities of the wavelet regression. This leads to much more stable benchmarks and reliable performance. Additional improvements, such as a memory-efficient data structure for marginal state records, allow for Bayesian inference of a hidden Markov model on genome-sized data, such as for CNV calling, on standard consumer hardware. Future applications include inference on multivariate data. By computing detail coefficients in post-order DFS across all dimensions simultaneously, the maxlet transform has a straightforward generalization to higher dimensions with only \(O(\log T)\) overhead, instead of the naive \(\Theta (T)\) incurred by aggregating maxima in a second array.
Availability and requirements
Project name: HaMMLET
Project home page:
Operating system: Platform-independent
Programming language: C++
Other requirements: C++11-compliant compiler. For plotting: Python 2.7, Matplotlib
License: GNU GPL.
Availability of data and materials
References
1. Fröhling S, Döhner H. Chromosomal abnormalities in cancer. N Engl J Med. 2008;359(7):722–34. https://doi.org/10.1056/NEJMra0803109.
2. Garraway LA, Lander ES. Lessons from the cancer genome. Cell. 2013;153(1):17–37. https://doi.org/10.1016/j.cell.2013.03.002.
3. Nakagawa H, Wardell CP, Furuta M, Taniguchi H, Fujimoto A. Cancer whole-genome sequencing: present and future. Oncogene. 2015;34(49):5943–50. https://doi.org/10.1038/onc.2015.90.
4. Malhotra D, Sebat J. CNVs: harbingers of a rare variant revolution in psychiatric genetics. Cell. 2012;148(6):1223–41. https://doi.org/10.1016/j.cell.2012.02.039.
5. Chung BHY, Tao VQ, Tso WWY. Copy number variation and autism: new insights and clinical implications. J Formos Med Assoc. 2014;113(7):400–8. https://doi.org/10.1016/j.jfma.2013.01.005.
6. Wiedenhoeft J, Brugel E, Schliep A. Fast Bayesian inference of copy number variants using hidden Markov models with wavelet compression. PLoS Comput Biol. 2016;12(5):1–28. https://doi.org/10.1371/journal.pcbi.1004871.
7. Mahmud MP, Schliep A. Fast MCMC sampling for hidden Markov models to determine copy number variations. BMC Bioinform. 2011;12:428. https://doi.org/10.1186/1471-2105-12-428.
8. Shah SP, Lam WL, Ng RT, Murphy KP. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics. 2007;23(13):450–8. https://doi.org/10.1093/bioinformatics/btm221.
9. Rydén T. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective. Bayesian Anal. 2008;3(4):659–88. https://doi.org/10.1214/08-BA326.
10. Scott SL. Bayesian methods for hidden Markov models: recursive computing in the 21st century. J Am Stat Assoc. 2002;97(457):337–51. https://doi.org/10.1198/016214502753479464.
11. Cleveland WS, Grosse E. Computational methods for local regression. Stat Comput. 1991;1(1):47–62. https://doi.org/10.1007/BF01890836.
12. Chib S. Calculating posterior distributions and modal estimates in Markov mixture models. J Econom. 1996;75(1):79–97.
13. Sweldens W. Lifting scheme: a new philosophy in biorthogonal wavelet constructions. In: Laine AF, Unser MA, editors. Wavelet applications in signal and image processing III. Bellingham: International Society for Optics and Photonics; 1995. p. 68–79. https://doi.org/10.1117/12.217619. http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1007578.
14. Daines B, Wang H, Li Y, Han Y, Gibbs R, Chen R. High-throughput multiplex sequencing to discover copy number variants in Drosophila. Genetics. 2009;182(4):935–41.
15. Haar A. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen. 1910;69(3):331–71. https://doi.org/10.1007/BF01456326.
16. Mallat SG. A wavelet tour of signal processing: the sparse way. Burlington: Academic Press; 2009. http://dl.acm.org/citation.cfm?id=1525499.
17. Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425–55. https://doi.org/10.1093/biomet/81.3.425.
18. Massart P. Concentration inequalities and model selection. Lect Notes Math. 2003;1896:1–324. https://doi.org/10.1007/978-3-540-48503-2.
19. Lewis JP. Fast template matching. In: Vision Interface ’95. Quebec City: Canadian Image Processing and Pattern Recognition Society; 1995. p. 120–3. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.3888.
20. Lovász L. Combinatorial problems and exercises. Providence: American Mathematical Society; 1993. p. 639.
21. Knuth DE. The art of computer programming. Upper Saddle River: Addison-Wesley Professional; 1997.
22. Sweldens W. The lifting scheme: a construction of second generation wavelets. SIAM J Math Anal. 1998;29(2):511–46. https://doi.org/10.1137/S0036141095289051.
23. Diamond JM. Guns, germs and steel: a short history of everybody for the last 13,000 years. New York: Random House; 1998.
24. Darwin C. The variation of animals and plants under domestication. London: John Murray; 1868.
25. Wilkins AS, Wrangham RW, Fitch WT. The “domestication syndrome” in mammals: a unified explanation based on neural crest cell behavior and genetics. Genetics. 2014;197(3):795–808.
26. Sánchez-Villagra MR, Geiger M, Schneider RA. The taming of the neural crest: a developmental perspective on the origins of morphological covariation in domesticated mammals. R Soc Open Sci. 2016;3(6):160107.
27. Belyaev DK. Domestication of animals. Sci J. 1969;5(1):47–52.
28. Trut LN, Plyusnina IZ, Oskina IN. An experiment on fox domestication and debatable issues of evolution of the dog. Russ J Genet. 2004;40(6):644–55. https://doi.org/10.1023/B:RUGE.0000033312.92773.c1.
29. Trut L, Oskina I, Kharlamova A. Animal evolution during domestication: the domesticated fox as a model. BioEssays. 2009;31(3):349–60. https://doi.org/10.1002/bies.200800070.
30. Albert FW, Shchepina O, Winter C, Römpler H, Teupser D, Palme R, Ceglarek U, Kratzsch J, Sohr R, Trut LN, Thiery J, Morgenstern R, Plyusnina IZ, Schöneberg T, Pääbo S. Phenotypic differences in behavior, physiology and neurochemistry between rats selected for tameness and for defensive aggression towards humans. Horm Behav. 2008;53(3):413–21. https://doi.org/10.1016/j.yhbeh.2007.11.010.
31. Axelsson E, Ratnakumar A, Arendt ML, Maqbool K, Webster MT, Perloski M, Liberg O, Arnemo JM, Hedhammar Å, Lindblad-Toh K. The genomic signature of dog domestication reveals adaptation to a starch-rich diet. Nature. 2013;495(7441):360–4. https://doi.org/10.1038/nature11837.
32. Renaud G, Kircher M, Stenzel U, Kelso J. freeIbis: an efficient basecaller with calibrated quality scores for Illumina sequencers. Bioinformatics. 2013;29(9):1208–9. https://doi.org/10.1093/bioinformatics/btt117.
33. Renaud G, Stenzel U, Kelso J. leeHom: adaptor trimming and merging for Illumina sequencing reads. Nucleic Acids Res. 2014;42(18):e141. https://doi.org/10.1093/nar/gku699.
34. Renaud G, Stenzel U, Maricic T, Wiebe V, Kelso J. deML: robust demultiplexing of Illumina sequences using a likelihood-based approach. Bioinformatics. 2015;31(5):770–2. https://doi.org/10.1093/bioinformatics/btu719.
35. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. https://doi.org/10.1093/bioinformatics/btp324.
36. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303. https://doi.org/10.1101/gr.107524.110.
37. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17(1):122. https://doi.org/10.1186/s13059-016-0974-4.
38. Zhang B, Kirov S, Snoddy J. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res. 2005;33(Web Server issue):W741–8. https://doi.org/10.1093/nar/gki475.
39. Wang J, Duncan D, Shi Z, Zhang B. WEB-based gene set analysis toolkit (WebGestalt): update 2013. Nucleic Acids Res. 2013;41(W1):W77–83. https://doi.org/10.1093/nar/gkt439.
40. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57(1):289–300.
41. Moya PR, Dodman NH, Timpano KR, Rubenstein LM, Rana Z, Fried RL, Reichardt LF, Heiman GA, Tischfield JA, King RA, Galdzicka M, Ginns EI, Wendland JR. Rare missense neuronal cadherin gene (CDH2) variants in specific obsessive-compulsive disorder and Tourette disorder phenotypes. Eur J Hum Genet. 2013;21(8):850–4. https://doi.org/10.1038/ejhg.2012.245.
42. Donner J, Pirkola S, Silander K, Kananen L, Terwilliger JD, Lönnqvist J, Peltonen L, Hovatta I. An association analysis of murine anxiety genes in humans implicates novel candidate genes for anxiety disorders. Biol Psychiatry. 2008;64(8):672–80. https://doi.org/10.1016/j.biopsych.2008.06.002.
43. Richards AB, Scheel TA, Wang K, Henkemeyer M, Kromer LF. EphB1 null mice exhibit neuronal loss in substantia nigra pars reticulata and spontaneous locomotor hyperactivity. Eur J Neurosci. 2007;25(9):2619–28. https://doi.org/10.1111/j.1460-9568.2007.05523.x.
Acknowledgements
JW would like to thank Janet Kelso, Svante Pääbo, and everyone at the Max Planck Institute for Evolutionary Anthropology in Leipzig for their kind hospitality and support.
Funding
Funding was provided through NIH grant 1 U01 CA198952-01 and the Federal Research Center Institute of Cytology and Genetics grant N 0324-2018-0016.
Author information
Affiliations
Contributions
JW and AS conceived the computational approach. JW designed and implemented the software. RG, RK and AC performed the experiments and analysis. JW, AC and AS wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wiedenhoeft, J., Cagan, A., Kozhemyakina, R. et al. Bayesian localization of CNV candidates in WGS data within minutes. Algorithms Mol Biol 14, 20 (2019). https://doi.org/10.1186/s13015-019-0154-7
Received:
Accepted:
Published:
Keywords
 HMM
 Wavelet
 CNV
 Bayesian inference