In this paper, we introduce an algorithm to identify topological domains in chromatin using interaction matrices from recent high-throughput chromosome conformation capture experiments. Our algorithm produces domains that display much higher interaction frequencies within domains than between domains (Figure 2), and whose boundaries exhibit substantial enrichment for several insulator and barrier-like elements (Figure 6). To identify these domains, we use a multiscale approach that finds domains at various size scales and generates multiple optimal and near-optimal solutions.
We define a consensus set to be a set of domains that persist across multiple resolutions and give an efficient algorithm that finds such a set optimally.
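One way to picture such a consensus computation is as weighted interval scheduling. The sketch below is an illustration under stated assumptions, not a restatement of the implementation: it assumes domains are inclusive `(start, end)` fragment-index pairs, that the "persistence" of a domain is the number of resolutions at which exactly that interval appears, and that the consensus set is the non-overlapping subset with maximum total persistence. The function name `consensus_domains` is hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def consensus_domains(domain_sets):
    """Pick non-overlapping domains with maximum total persistence.

    domain_sets: one list of (start, end) inclusive intervals per resolution.
    Returns (total_persistence, chosen_domains_sorted_by_position).
    Assumption: persistence = number of resolutions containing the interval.
    """
    persistence = Counter(d for domains in domain_sets for d in set(domains))
    # Sort candidate intervals by end coordinate (ties broken by start).
    intervals = sorted(persistence, key=lambda d: (d[1], d[0]))
    ends = [end for _, end in intervals]
    n = len(intervals)

    best = [0] * (n + 1)      # best[i]: optimum over the first i intervals
    take = [False] * (n + 1)  # whether interval i is used in best[i]
    prev = [0] * (n + 1)      # index to continue from when interval i is used

    for i, (start, end) in enumerate(intervals, start=1):
        # Number of intervals that end strictly before this one starts.
        p = bisect_left(ends, start)
        with_i = persistence[(start, end)] + best[p]
        if with_i > best[i - 1]:
            best[i], take[i], prev[i] = with_i, True, p
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen intervals.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(intervals[i - 1])
            i = prev[i]
        else:
            i -= 1
    return best[n], chosen[::-1]
```

Sorting dominates the running time, so the sketch runs in O(m log m) for m distinct candidate domains. On ties it keeps the solution that skips the later interval, which is an arbitrary choice.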
Our method uses a score function that encodes the quality of putative domains in an intuitive manner based on their local density of interactions. Variations of the scoring function in (4), for example, by median centering rather than mean centering or by optimizing the homogeneity of interaction frequencies instead of total frequencies, can be explored to test the robustness of the enrichments described here.
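To make the mean-centering versus median-centering variation concrete, the following sketch scores a putative domain by its scaled interaction density and subtracts a centering statistic computed over all blocks of the same length. The exact form of (4) is not restated in this section, so the density definition (upper-triangular block sum divided by length raised to γ) is an assumption made for illustration, and the names `block_sum` and `centered_score` are hypothetical.

```python
import numpy as np


def block_sum(A, k, l):
    """Sum of interactions among fragments k..l (inclusive), upper triangle."""
    return np.triu(A[k:l + 1, k:l + 1]).sum()


def centered_score(A, k, l, gamma, center=np.mean):
    """Scaled density of block [k, l] minus a centering statistic.

    Assumed form: block_sum / length**gamma, centered by `center`
    (np.mean or np.median) over all blocks of the same length.
    """
    n = A.shape[0]
    length = l - k + 1
    scale = length ** gamma
    raw = block_sum(A, k, l) / scale
    # Scores of every same-length block along the diagonal.
    same_length = [
        block_sum(A, i, i + length - 1) / scale
        for i in range(n - length + 1)
    ]
    return raw - center(same_length)
```

Passing `center=np.median` switches to median centering without changing anything else, which makes it straightforward to compare the enrichments obtained under the two variants.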
Our method is particularly appealing in that it requires only a single user-specified parameter, γmax. For our experiments, γmax was set based on the maximum domain sizes observed in Dixon et al.’s experiments so that we could easily compare our domains to theirs. This parameter can also be set intrinsically from properties of the Hi-C interaction matrices. For example, we observe similar enrichments in both human and mouse when we set γmax to be the smallest γ∈Γ such that the median domain size is >80 kbp (two consecutive Hi-C fragments at a resolution of 40 kbp). This is a reasonable choice, since domains consisting of just one or two fragments do not capture higher-order spatial relationships (e.g. triad closure) and interaction frequencies between adjacent fragments are likely to be large by chance.
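The intrinsic rule for γmax can be sketched as a short selection routine. This is a minimal illustration that applies the rule exactly as worded above; it assumes the domains for each γ have already been computed and are given as inclusive `(start, end)` fragment-index pairs, and the function name `choose_gamma_max` and its parameters are hypothetical.

```python
from statistics import median


def choose_gamma_max(domains_by_gamma, resolution_bp=40_000,
                     min_median_bp=80_000):
    """Return the smallest gamma whose median domain size exceeds the cutoff.

    domains_by_gamma: dict mapping gamma -> list of (start, end) inclusive
    fragment-index intervals (assumed to be precomputed).
    Returns None if no gamma satisfies the rule.
    """
    for gamma in sorted(domains_by_gamma):
        sizes_bp = [
            (end - start + 1) * resolution_bp
            for start, end in domains_by_gamma[gamma]
        ]
        if sizes_bp and median(sizes_bp) > min_median_bp:
            return gamma
    return None
```

The defaults mirror the 40 kbp resolution and 80 kbp cutoff mentioned in the text; both would need to change for data binned at a different resolution.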
We compared the fraction of the genome covered by the domains identified by Dixon et al. with that covered by the domains obtained from our method at various resolutions. Dixon et al.’s domains cover 85% of the genome, while our sets tend to cover less (≈ 65% for a resolution that results in the same number of domains as those of Dixon et al.). The fact that our domain boundaries are more enriched for CTCF sites indicates that our smaller, denser domains may be more desirable from the perspective of genome function. The dense, functionally enriched domains discovered by our algorithm provide strong evidence that alternative chromatin domains exist and that a single length scale is insufficient to capture the hierarchical and overlapping domain structure visible in heat maps of 3C interaction matrices.
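The coverage comparison reduces to counting the fragments that fall inside at least one domain. A minimal sketch follows, assuming domains are inclusive `(start, end)` fragment-index pairs over a genome of `n_fragments` equally sized bins; the function name `coverage_fraction` is hypothetical.

```python
def coverage_fraction(domains, n_fragments):
    """Fraction of fragments covered by at least one domain.

    domains: list of (start, end) inclusive fragment-index intervals.
    Overlapping domains are counted once per fragment.
    """
    covered = set()
    for start, end in domains:
        covered.update(range(start, end + 1))
    return len(covered) / n_fragments
```

Using a set keeps the count correct even if domain sets from different resolutions overlap, at the cost of memory proportional to the number of covered fragments.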
We provide the first quantitative analysis testing the hypothesis that the domain structure across scales is significantly hierarchically organized, suggesting that the domains we identify can be used as the basis for studying the hierarchical organization of genomes and how this structure impacts gene regulation. By incorporating multiple optimal and near-optimal solutions into this analysis, we provide evidence that the observed hierarchical structure persists not only across scales but also across a variety of plausible high-scoring domain sets. However, multiple optimal solutions are not necessary to quantify the hierarchical structure of the domains, since single optimal solutions across scales can already reveal a hierarchical structure. There are many more near-optimal solutions at higher values of γ, since the domain sizes tend to be smaller. For this special case, it would be desirable to develop a method that more concisely characterizes these larger solution spaces, and this is an interesting direction for future work. The quantitative evidence of the hierarchical structure of topological domains also motivates the development of novel methods for domain discovery that directly account for such hierarchy in the models they assume and the functions they optimize.
The method for discovering topological domains that we have introduced is practical for existing datasets. Our implementation computes and extracts the consensus set of domains for the human fibroblast cell line in 24 minutes when run on a personal computer with a 2.3 GHz Intel Core i5 processor and 8 GB of RAM. Computing optimal and near-optimal solutions adds only a small overhead to the overall running time: when computing the top 20 optimal and near-optimal solutions for each γ setting (γ from 0.0 to 0.9 in steps of 0.05), the computation finishes in 25 minutes 34 seconds.
A preliminary version of this manuscript appeared in the 2013 Workshop on Algorithms for Bioinformatics.