We must train the parameters W of the above model using data of the form \((D^\text {train}, H^\text {train})\). We will examine the trained parameters (and several good solutions for them) for insight into which markers are most informative for describing \(D^\text {train}\), and thus topological domains.
Problem 1
Training: Given a set of marker data \(H^{\text {train}}\), likely from several chromosomes and cell conditions, and the corresponding set of TAD decompositions \(D^{\text {train}}\), we estimate the most likely parameters W according to Eq. 2.
Problem 2
Inference: Given marker data H and model parameters W, we estimate the best domain partition D of the track.
Training
A nice feature of the objective (3) is that it is convex in its arguments, \(\{\mathbf {w^{b}_{m}}, \mathbf {w^{i}_{m}},\) \(\mathbf {w^{e}_{m}}\}_{m \in M}\), which follows from linearity, composition rules for convexity, and the convexity of the negative logarithm. However, training involves two challenges: (a) computing the partition function \(Z^{q}_{|V^{q}|}\) in (3), and (b) estimating W so that the weights are sparse. We address each of these challenges next.
Estimating the partition function
We estimate \(Z^{q}_{|V^{q}|}\) in (3) recursively in polynomial time, since each segment can belong to one of four states, domain start (sb), inside a domain (i), domain end (eb), or non-domain (e), and the state of each segment depends only on the previous segment's state. Let \(Y = \{sb, i, eb, e\}\) and \(Z_{|V^{q}|}^{q} = Z^{q}_{|V^{q}|, eb}\,+\,Z^{q}_{|V^{q}|, e}\), whose components can be estimated by:
$$\begin{aligned} Z^{q}_{v, x} = \sum _{y \in Y} Z^{q}_{v-1, y} T_{y, x} \exp \left (E^{x}_{vq}\right ) \end{aligned}$$
(5)
where \(Z^{q}_{v,sb}\), \(Z^{q}_{v,i}\), \(Z^{q}_{v,eb}\), \(Z^{q}_{v,e}\) represent the partition function up to segment v ending in states sb, i, eb and non-domain, respectively. T is a \(4 \times 4\) binary state transition matrix where \(T_{y, x} = 1\) if a segment can be assigned to state x given that the previous segment is assigned to state y, i.e., \((y, x) \in \{(sb, i), (sb, eb), (i, i), (i, eb), (eb, sb)\), \((eb, e), (e, sb), (e, e)\}\), and 0 otherwise. Initial conditions are \(Z^{q}_{0,sb} = Z^{q}_{0,i} = Z^{q}_{0,eb} = 0\), \(Z^{q}_{0,e} = 1\). To avoid numerical overflow and to speed up the computation of \(Z^{q}_{|V^{q}|}\), we estimate \(\log (Z^{q}_{|V^{q}|})\) by expressing it in terms of logarithms of sums of exponentials and forward and backward variables (\(\alpha\), \(\beta\)), as in Hidden Markov Models [21].
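To make the recursion in (5) concrete, the sketch below computes \(\log (Z^{q}_{|V^{q}|})\) entirely in log space using a log-sum-exp over the four states. It is a minimal illustration assuming a per-segment energy table `E[x][v]` and the transition set defined by T above; the names are illustrative rather than taken from an actual implementation.

```python
import numpy as np

STATES = ["sb", "i", "eb", "e"]  # domain start, inside, domain end, non-domain
# Allowed transitions (y, x) for which T_{y,x} = 1
ALLOWED = {("sb", "i"), ("sb", "eb"), ("i", "i"), ("i", "eb"),
           ("eb", "sb"), ("eb", "e"), ("e", "sb"), ("e", "e")}

def log_partition(E):
    """E[x][v]: energy of assigning segment v (0-based) to state x.
    Returns log(Z_{|V|,eb} + Z_{|V|,e}) computed in log space."""
    n = len(E["e"])
    # Initial conditions: only the non-domain state has mass before segment 1.
    log_z = {x: (0.0 if x == "e" else -np.inf) for x in STATES}
    for v in range(n):
        new_log_z = {}
        for x in STATES:
            preds = [log_z[y] for y in STATES if (y, x) in ALLOWED]
            # log-sum-exp over allowed predecessor states, then add the energy
            new_log_z[x] = np.logaddexp.reduce(preds) + E[x][v]
        log_z = new_log_z
    # A valid labelling must end with a closed domain (eb) or non-domain (e).
    return np.logaddexp(log_z["eb"], log_z["e"])
```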
Estimating a sparse set of good histone effect parameters
We would like to augment objective function (2) so that we select a sparse subset of markers from the data and avoid overfitting. If the coefficient vector \(\mathbf {w^{b}_{m}} = \mathbf {0}\), then marker m has no influence. For this purpose, we impose grouped-lasso-type regularization on the coefficients \(w_{mk}\). Grouped lasso regularization tends to select a small number of groups with non-zero coefficients while pushing the coefficients of the remaining groups to zero.
We introduce two types of regularization. First, we require that many of the weight groups be exactly zero, using a group-lasso penalty on the per-group \(L_2\) norms. Second, we want the effect functions \(\{f\}\) to be smooth. Let \(P = \{b, i, e\}\). We modify our objective to trade off between these goals:
$$\begin{aligned} \mathop {\text{argmin}}\limits _{W}\;-\sum _{q}\log \left (P(D^{q}| W, H^{q})\right )+\overbrace{\lambda _{1} \sum _{p \in P} \left ( \sum _{m \in M} \left|\left|\mathbf {w^{p}_{m}}\right|\right| \right)^{2}+\lambda _{2}\sum _{p \in P} \sum _{m \in M} R(f^{p}_{m})}^{\text {Regularization}} \end{aligned}$$
(6)
where \(\lambda _{1}\) and \(\lambda _{2}\) are the regularization parameters, and \(R(f^{p}_{m})\) is the smoothness penalty for the effect of marker m at \(p \in P\):
$$\begin{aligned} R(f^{p}_{m}) = \int _{x} \left( \frac{\partial ^{2} f^{p}_{m}(x,\mathbf {w^{p}_{m}})}{\partial x^{2}}\right) ^{2} dx \end{aligned}$$
(7)
The group lasso in (6) uses the square of the block \(l_{1}\)-norm (the sum of group \(l_{2}\)-norms) rather than the standard unsquared group-lasso norm, which does not change the regularization properties [23]. Objective function (6) is convex, which follows from the convexity of \(R(f^{p}_{m})\), proven in Theorem 1.
Theorem 1
\(R(f^{p}_{m})\) is a convex function of \(\mathbf {w^{p}_{m}}\).
Proof
The second-order derivative in (7) can be written more explicitly as in (8), following [24]:
$$\begin{aligned} \frac{\partial ^{2} f^{p}_{m}(x,\mathbf {w^{p}_{m}})}{\partial x^{2}} = A(A-1) \sum _{i=0}^{A-2} (w^{p}_{m}[i+2]-2w^{p}_{m}[i+1]+w^{p}_{m}[i]) \genfrac(){0.0pt}0{A-2}{i} x^{i}(1-x)^{A-2-i} \end{aligned}$$
(8)
which turns \(R(f^{p}_{m})\) into (9):
$$\begin{aligned}&\int _{0}^{1} \left( \frac{\partial ^{2} f^{p}_{m}(x)}{\partial x^{2}}\right) ^{2} dx = A^{2}(A-1)^{2} \sum _{i=0}^{A} \sum _{j=i}^{A} (w^{p}_{m}[i] w^{p}_{m}[j])\nonumber \\&\quad \left( \sum _{q=\overline{e}_{i}}^{\min (i,2)} \sum _{r=\overline{e}_{j}}^{\min (j,2)} (-1)^{q+r} \genfrac(){0.0pt}0{2}{q} \genfrac(){0.0pt}0{2}{r} T^{i-q}_{j-r}(x) \right) \end{aligned}$$
(9)
where \(\overline{e}_{p} = \max (0,2-A+p)\), \(T^{i-q}_{j-r}(x)\) is defined below and \(\beta (i+j-q-r+1,2A-3-i-j+q+r)\) is the beta function:
$$\begin{aligned} T^{i-q}_{j-r}(x) = \genfrac(){0.0pt}0{A-2}{i-q} \genfrac(){0.0pt}0{A-2}{j-r} \underbrace{\int _{0}^{1} x^{i-q}(1-x)^{A-2-i+q} x^{j-r}(1-x)^{A-2-j+r} dx}_{\beta (i+j-q-r+1, 2A-3-i-j+q+r)} \end{aligned}$$
(10)
\(R(f^{p}_{m})\) is a quadratic function of \(\mathbf {w^{p}_{m}}\). Since it is the integral of a squared quantity, it is nonnegative for every \(\mathbf {w^{p}_{m}}\), so the associated quadratic form is positive semidefinite and \(R(f^{p}_{m})\) is convex. \(\square\)
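As a numerical illustration of the quadratic form in (9)–(10), the sketch below assembles a matrix Q with \(R(f^{p}_{m}) = \mathbf {w}^{\top } Q \mathbf {w}\) by squaring the Bernstein expansion of the second derivative in (8) and integrating the basis products with the beta-function identity of (10); positive semidefiniteness of Q can then be checked directly. Function and variable names are illustrative.

```python
import numpy as np
from scipy.special import beta, comb

def smoothness_quadratic_form(A):
    """Q such that R(f) = w^T Q w for the A+1 Bernstein coefficients w[0..A]."""
    n = A - 1  # f'' in (8) is a degree-(A-2) Bernstein polynomial with A-1 coefficients
    # G[i, j] = integral_0^1 B_{i,A-2}(x) B_{j,A-2}(x) dx, via the beta function as in (10)
    G = np.array([[comb(A - 2, i) * comb(A - 2, j) * beta(i + j + 1, 2 * A - 3 - i - j)
                   for j in range(n)] for i in range(n)])
    # D maps w to the second differences w[i+2] - 2 w[i+1] + w[i] appearing in (8)
    D = np.zeros((n, A + 1))
    for i in range(n):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    return (A * (A - 1)) ** 2 * D.T @ G @ D

Q = smoothness_quadratic_form(A=5)
assert np.all(np.linalg.eigvalsh(Q) >= -1e-9)  # positive semidefinite, hence R is convex
```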
We note that (6) is convex, but it is a nonsmooth optimization problem because of the regularizer. We solve it efficiently using an iterative algorithm from multiple kernel learning [23]. By the Cauchy–Schwarz inequality:
$$\begin{aligned} \sum _{p \in P} \left ( \sum _{m \in M} \left|\left|\mathbf {w^{p}_{m}}\right|\right| \right )^{2} \le \sum _{p \in P} \sum _{m \in M} \frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|^{2}}{\gamma _{mp}} \end{aligned}$$
(11)
where \(\gamma _{mp} \ge 0\), \(\sum _{m \in M} \gamma _{mp} = 1, \;\, p \in P\), and the equality in (11) holds when
$$\begin{aligned} \gamma _{mp} = \frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|}{\sum _{m \in M} \left|\left|\mathbf {w^{p}_{m}}\right|\right|}, \quad p \in P \end{aligned}$$
(12)
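For completeness, the bound (11) and the optimal weights (12) follow from one application of the Cauchy–Schwarz inequality for each \(p \in P\):
$$\begin{aligned} \left ( \sum _{m \in M} \left|\left|\mathbf {w^{p}_{m}}\right|\right| \right )^{2} = \left ( \sum _{m \in M} \frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|}{\sqrt{\gamma _{mp}}} \sqrt{\gamma _{mp}} \right )^{2} \le \left ( \sum _{m \in M} \frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|^{2}}{\gamma _{mp}} \right ) \left ( \sum _{m \in M} \gamma _{mp} \right ) = \sum _{m \in M} \frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|^{2}}{\gamma _{mp}} \end{aligned}$$
with equality exactly when \(\left|\left|\mathbf {w^{p}_{m}}\right|\right| / \sqrt{\gamma _{mp}}\) is proportional to \(\sqrt{\gamma _{mp}}\), i.e., \(\gamma _{mp} \propto \left|\left|\mathbf {w^{p}_{m}}\right|\right|\), which after normalizing to \(\sum _{m} \gamma _{mp} = 1\) gives (12).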
This modification turns the objective into the following problem, which is jointly convex in both \(\mathbf {w^{p}_{m}}\) and \(\gamma _{mp}\):
$$\begin{aligned}\mathop {\text{argmin}}\limits _{W}\;-\sum _{q}\log \left (P(D^{q} | W, H^{q})\right )\,+\,\sum _{p \in P} \sum _{m \in M} \left (\lambda _{1}\frac{\left|\left|\mathbf {w^{p}_{m}}\right|\right|^{2}}{\gamma _{mp}}\,+\,\lambda _{2} R(f^{p}_{m})\right ) \end{aligned}$$
(13)
$$\begin{aligned}&\,\text {s.t.} \;\; \sum _{m \in M} \gamma _{mp} = 1.0,\quad p \in P \end{aligned}$$
(14)
$$\begin{aligned}&\;\;\;\;\quad \gamma _{mp} \ge 0,\quad m \in M, p \in P \end{aligned}$$
(15)
We solve this by alternating between the optimization of \(\mathbf {w^{p}_{m}}\) and \(\gamma _{mp}\). When we fix \(\gamma _{mp}\), we can find the optimal \(\mathbf {w^{p}_{m}}\) with any quasi-Newton solver such as L-BFGS [25], which runs faster than alternatives such as iterative scaling or conjugate gradient. When we fix \(\mathbf {w^{p}_{m}}\), we obtain the best \(\gamma _{mp}\) from the closed-form expression (12). The two steps iterate until convergence.
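The following is a minimal sketch of this alternating scheme, assuming a hypothetical callable `loss_and_grad(w)` that returns the smooth part of (13) (the negative log-likelihood plus the \(\lambda _{2}\) smoothness term) and its gradient with respect to a flattened parameter vector; the partition-function recursion above would live inside it. The data layout, names, and convergence test are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def train_group_lasso(loss_and_grad, groups, lam1, dim, n_outer=50, tol=1e-8):
    """Alternate L-BFGS updates of w (gamma fixed) with the closed-form gamma
    update (12). groups[p] is a list of index arrays, one per marker m,
    selecting the block w^p_m inside the flat parameter vector of length dim."""
    w = np.zeros(dim)
    gamma = {p: np.full(len(idxs), 1.0 / len(idxs)) for p, idxs in groups.items()}

    for _ in range(n_outer):
        def objective(w_vec):
            f, g = loss_and_grad(w_vec)
            g = g.copy()
            for p, idxs in groups.items():
                for k, idx in enumerate(idxs):
                    scale = lam1 / max(gamma[p][k], 1e-12)
                    f += scale * (w_vec[idx] @ w_vec[idx])
                    g[idx] += 2.0 * scale * w_vec[idx]
            return f, g

        # w-step: quasi-Newton solve with the current gamma fixed
        w_new = minimize(objective, w, jac=True, method="L-BFGS-B").x

        # gamma-step: closed-form update (12), proportional to the block norms
        for p, idxs in groups.items():
            norms = np.array([np.linalg.norm(w_new[idx]) for idx in idxs])
            total = norms.sum()
            gamma[p] = norms / total if total > 0 else np.full(len(idxs), 1.0 / len(idxs))

        converged = np.linalg.norm(w_new - w) < tol
        w = w_new
        if converged:
            break
    return w, gamma
```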
Training extensions
Bernstein polynomials let us model a variety of shape-restricted effect functions that cannot easily be obtained with other nonparametric approaches such as smoothing splines [24]. We add the following constraints to ensure monotonicity:
$$\begin{aligned} w^{b}_{m}[i] \le w^{b}_{m}[i+1], \quad i = 0, \ldots , A-1 \end{aligned}$$
(16)
which is a realistic assumption since increasing the marker density should not decrease its effect. We can also ensure concavity of the effect function by:
$$\begin{aligned} w^{b}_{m}[i-1] - 2w^{b}_{m}[i] + w^{b}_{m}[i+1] \le 0, \quad i = 1, \ldots , A-1 \end{aligned}$$
(17)
which has a natural diminishing-returns property: the increase in the value of the effect function generated by an increase in marker density is smaller when the output is large than when it is small. Our problem differs from smoothing splines in that our loss function is more complicated than traditional spline loss functions, owing to the partition-function estimation in (5), which makes it hard to apply smoothing-spline methods directly [26]. In addition, these nonnegativity and other shape constraints can be enforced naturally in our method.
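To illustrate how these shape restrictions become linear constraints on the coefficients, the sketch below evaluates a Bernstein effect function and builds matrices G with \(G\mathbf {w} \le 0\) encoding (16) and (17). It assumes the effect function is the degree-A Bernstein expansion whose second derivative appears in (8); the names are illustrative.

```python
import numpy as np
from scipy.special import comb

def bernstein_effect(x, w):
    """f(x, w) = sum_i w[i] * C(A, i) * x^i * (1 - x)^(A - i), with A = len(w) - 1."""
    A = len(w) - 1
    i = np.arange(A + 1)
    return np.sum(w * comb(A, i) * x ** i * (1.0 - x) ** (A - i))

def shape_constraint_matrices(A):
    """Rows g with g @ w <= 0 encoding monotonicity (16) and concavity (17)."""
    # (16): w[i] - w[i+1] <= 0,  i = 0..A-1  (non-decreasing effect)
    G_mono = np.zeros((A, A + 1))
    for i in range(A):
        G_mono[i, i], G_mono[i, i + 1] = 1.0, -1.0
    # (17): w[i-1] - 2 w[i] + w[i+1] <= 0,  i = 1..A-1  (diminishing returns)
    G_conc = np.zeros((A - 1, A + 1))
    for i in range(1, A):
        G_conc[i - 1, i - 1:i + 2] = [1.0, -2.0, 1.0]
    return G_mono, G_conc
```

These rows could then be supplied as linear inequality constraints to a constrained variant of the w-update.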
We can also extend the problem to model multiple domain subclasses instead of a single class, where domains are categorized into subclasses according to their gene-expression profiles, such as TADs with highly active genes, TADs with repressive genes, etc.
Inferring domains using the trained model
Given marker data H over a single track and trained parameters W, the inference log-likelihood is:
$$\begin{aligned} \mathop {\text{argmax}}\limits _{D}\;\log \left (P(D | W, H)\right ) = \sum _{d=[s, e] \in \overline{D}} r_{se} x_{se}\;+\;\sum _{v \in V} E^{e}_{v} y_{v} \end{aligned}$$
(18)
where \(\overline{D} = \{[s, e]\,|\,s,e \in V, \, e-s \ge 1\}\) is the set of all potential domains of length at least 2 and \(r_{se} = E^{b}_{s} + E^{b}_{e} + \sum _{v =s+1}^{e-1} E^{i}_{v}\). The intuition is that variable \(x_{se}=1\) when the solution contains interval [s, e], and variable \(y_{v}=1\) if v is not assigned to any domain. The \(\log (Z_{|V|})\) term is dropped during inference since it is the same for all D. We solve (19)–(20) to find the best partition D:
$$\begin{aligned}&\mathop {\text{argmax}}\limits _{D}\; \sum _{d=[s, e] \in \overline{D}} r_{se} x_{se}\;+\;\sum _{v \in V} E^{e}_{v} \left( 1 - \sum _{[s, e] \in M[v]} x_{se} \right) \end{aligned}$$
(19)
$$\begin{aligned}\text {s.t.} \;\; x_{se} + x_{s'e'} \le 1\qquad \forall \; \text {domains } [s,e], [s',e'] \text { that overlap} \end{aligned}$$
(20)
where M[v] is the set of intervals that span fragment v. We replace \(y_{v}\) in (18) with \(1 - \sum _{[s, e] \in M[v]} x_{se}\) since each segment can be assigned to at most one domain. Constraint (20) ensures that inferred domains do not overlap. Problem (19)–(20) is a Maximum Weight Independent Set problem on the interval graph defined over candidate domains, which can be solved optimally by dynamic programming in \(O(|V|^{2})\) time.
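Since the candidate domains form an interval graph, (19)–(20) reduces to weighted interval scheduling. The sketch below solves it by dynamic programming over intervals sorted by end point; each interval's weight is assumed to be \(r_{se}\) minus the non-domain rewards \(\sum _{v=s}^{e} E^{e}_{v}\) it displaces, so that the constant \(\sum _{v} E^{e}_{v}\) can be added back afterwards. Names are illustrative.

```python
import bisect

def best_partition(intervals):
    """intervals: list of (s, e, weight) with e - s >= 1.
    Returns (best total weight, chosen non-overlapping intervals)."""
    ivs = sorted(intervals, key=lambda t: t[1])       # sort candidate domains by end point
    ends = [e for _, e, _ in ivs]
    n = len(ivs)
    best = [0.0] * (n + 1)                            # best[k]: optimum over the first k intervals
    take = [False] * n
    for k, (s, e, w) in enumerate(ivs):
        # Last interval (by end point) finishing strictly before segment s
        j = bisect.bisect_left(ends, s, 0, k)
        if best[j] + w > best[k]:
            best[k + 1], take[k] = best[j] + w, True
        else:
            best[k + 1] = best[k]
    # Backtrack to recover the chosen non-overlapping domains
    chosen, k = [], n
    while k > 0:
        if take[k - 1]:
            chosen.append(ivs[k - 1])
            k = bisect.bisect_left(ends, ivs[k - 1][0], 0, k - 1)
        else:
            k -= 1
    return best[n], chosen[::-1]
```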