Detecting transcriptomic structural variants in heterogeneous contexts via the Multiple Compatible Arrangements Problem

Transcriptomic structural variants (TSVs)—large-scale transcriptome sequence change due to structural variation - are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a 14\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{1}{4}$$\end{document}-approximation algorithm for k=1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=1$$\end{document} and a 34\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{3}{4}$$\end{document}-approximation algorithm for the diploid case (k=2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=2$$\end{document}) assuming an oracle for k=1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=1$$\end{document}. Combining these, we obtain a 316\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\frac{3}{16}$$\end{document}-approximation algorithm for MCAP when k=2\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k=2$$\end{document} (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require k>1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$k>1$$\end{document} alleles to satisfy read concordancy and show that such structures are prevalent. We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid.

Therefore, read pairs that are far apart may end up in one genome segment in GSG and thus ignored by the downstream arrangement algorithm. Although we predict the edges that connect far apart segments in the original genome as TSVs as a post-processing step, it is possible that some deletion events are ignored.
We investigate the false negatives that may correspond to those ignored deletion events. Among 128 false negatives in HCC1395, for 67 of them, both of their breakpoints locate within one segment. For the remaining 61 false negatives, each pair of breakpoints locate in two segments that are not connected to each other, which means that the breakpoints are not supported by any read. Among 46 false negatives in HCC1954, for 21 of them, both of their breakpoints locate within one segment and 25 of them are not supported by any edge. In both samples, none of the false negatives are represented by concordant or discordant edges (Table 1).
This shows that the additional step that outputs concordant edges that connect far apart segments is effective in reducing false negatives.
Among the false negatives whose breakpoints locate within one segment, 3 of them are supported by reads in HCC1395 and 4 of them are supported by reads in HCC1954 (Table 2). In HCC1395, the distance between those breakpoints are greater than the average gene length, which may correspond to deletion events. In HCC1954, some of the distances are small, which may correspond to other TSV events.
The results show that most of the false negatives lack read support. Still, adding read-pair and segment distance as a concordancy consideration may reduce false negatives such as the TSVs in HCC1395 listed in Table 2.

False positives
False positives are the predicted TSVs in D-SQUID that are not validated by the ground truth. We investigate whether the distance between the two segments in a TSV can be used to distinguish between true positive (TP) and false positive (FP) predictions in D-SQUID.
Two types of distances are considered: the number of segments between each TSVconnected segments (node distance), and the number of basepairs between each TSV-connected segments (basepair distance) in the output arrangement. D-SQUID outputs two arrangements of segments that correspond to the diploid assumption.
When a discordant edge is made concordant in one of the arrangements, that arrangement where the edge is concordant is used in counting segments and basepairs for the distance calculation. When a discordant edge is made concordant in both arrangements, we use the arrangement where the basepair distance is the minimum to calculate the distances. We choose a smaller basepair distance because the actual distance in the true rearranged genome cannot be too large. A minimum basepair distance within the two output arrangements is an upper bound of the minimum across all co-optimal solutions.
The predictions with the largest node distances and the largest basepair distances tend to be FP predictions (Supplementary Figure S1). This indicates that directly applying a node distance and a basepair distance threshold on the rearranged genome is able to reduce the FP predictions.
Nevertheless, the number of reduced FP is limited, since the distributions of the distances of TP and FP predictions both have a large mass at small distances (Supplementary Figure S1B, C) and the tail of FP distance distribution is very small.
In addition, without investigating more validated datasets, a widely applicable distance threshold cannot be determined. This limitation is partially because D-SQUID and SQUID may not choose the arrangements with the minimum distances among all co-optimal arrangements. But if a distance penalty is included in D-SQUID or SQUID objective for selecting arrangements from co-optima, TP and FP may be better distinguished.