The gPBWT was implemented within xg, the succinct graph indexing component of the vg variation graph toolkit [18]. The primary succinct self-indexed data structure, used to compress the gPBWT's B[] arrays, was a run-length-compressed wavelet tree, backed by sparse bit vectors and a Huffman-shaped wavelet tree, all provided by the sdsl-lite library on which xg is built [17]. The B[] arrays, in this implementation, were stored as small integers referring to the edges leaving each node, rather than as full next-side IDs. The c() function was implemented using two ordinary integer vectors, one storing the number of threads starting at each side, and one storing the number of threads using each side and each oriented edge. Because sdsl-lite does not support efficient integer vector insert operations, and dynamic alternatives have poor constant-factor performance, inserts into the B[] arrays were not feasible; the batch construction algorithm (Algorithm 3), which applies only to directed acyclic graphs, was therefore implemented instead. A modified release of vg, which can be used to replicate the results shown here, is available from https://github.com/adamnovak/vg/releases/tag/gpbwt2.
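For illustration, the structure described above is consistent with an sdsl-lite instantiation along the following lines. This is a hedged sketch, not necessarily the exact type used in xg, and the alias name is ours.

```cpp
#include <sdsl/wavelet_trees.hpp>

// A plausible instantiation matching the description: a run-length-
// compressed wavelet tree (wt_rlmn) whose run-boundary bit vectors are
// sparse Elias-Fano sd_vectors, with a Huffman-shaped wavelet tree
// (wt_huff) over the run head symbols. All components are from sdsl-lite.
using rl_int_vector = sdsl::wt_rlmn<
    sdsl::sd_vector<>,                  // sparse bit vector marking runs
    sdsl::sd_vector<>::rank_1_type,     // rank support for it
    sdsl::sd_vector<>::select_1_type,   // select support for it
    sdsl::wt_huff<>>;                   // Huffman-shaped tree of run heads
```

Every sdsl wavelet tree exposes rank(i, c), the number of occurrences of symbol c in positions [0, i); together with the c() offsets above, this is the primitive behind the counting and extension queries on the stored B[] arrays.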
The modified vg was used to create a genome graph for human chromosome 22, using the 1000 Genomes Phase 3 VCF on the GRCh37 assembly, embedding information about the correspondence between VCF variants and graph elements [16]. Note that the graph constructed from the VCF was directed and acyclic; it described only substitutions and indels, with no structural variants, and was thus amenable to the batch gPBWT construction algorithm. Next, haplotype information for the 5008 haplotypes stored in the VCF was imported and stored in a gPBWT-enabled xg index for the graph, using the batch construction algorithm mentioned above. In some cases, the VCF could not be directly translated into self-consistent haplotypes. For example, a G to C SNP and a G to GAT insertion might be called at the same position, and a haplotype might claim to contain the alt alleles of both variants. A naïve interpretation might have the haplotype visit the C and then the GAT, which would be invalid, because the graph would not contain the C to G edge. In cases like this, an attempt was made to semantically reconcile the variants automatically (in this case, as a C followed by an AT), but this was possible only in some cases. In the remaining cases, invalid candidate haplotype threads were still generated; these were then split into valid pieces to be inserted into the gPBWT, as sketched below. Threads were also split to handle other exceptional cases, such as haploid calls in the input. Overall, splitting for causes other than loss of phasing occurred 203,145 times across the 5008 haplotypes, or about 41 times per haplotype.
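The splitting fallback can be pictured as a single scan over each candidate thread, cutting wherever a required edge is absent. The sketch below is illustrative only; Side and the edge test passed in are stand-ins for the corresponding vg types, not the actual vg code.

```cpp
#include <utility>
#include <vector>

// Illustrative stand-in for a thread step: (node ID, orientation).
using Side = std::pair<long, bool>;

// Split a candidate thread into maximal pieces in which every pair of
// consecutive steps is joined by an edge actually present in the graph.
// HasEdge is any callable testing edge membership, e.g. a lambda that
// consults the graph index.
template <class HasEdge>
std::vector<std::vector<Side>> split_into_valid_pieces(
        const std::vector<Side>& candidate, HasEdge has_edge) {
    std::vector<std::vector<Side>> pieces;
    std::vector<Side> current;
    for (const Side& side : candidate) {
        if (!current.empty() && !has_edge(current.back(), side)) {
            // Required edge is missing: close off the piece built so
            // far and begin a fresh one at this side.
            pieces.push_back(std::move(current));
            current.clear();
        }
        current.push_back(side);
    }
    if (!current.empty()) {
        pieces.push_back(std::move(current));
    }
    return pieces;
}
```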
The xg indexing and gPBWT construction process took 9 h and 19 min using a single indexing thread on an Intel Xeon X7560 running at 2.27 GHz, and consumed 278 GB of memory. The high memory usage was a result of the decision to retain the entire data set in memory in an uncompressed format during construction. However, the resulting xg index was 436 MB on disk, of which 321 MB was used by the gPBWT. Information on the 5008 haplotypes across the 1,103,547 variants was thus stored in about 0.93 bits per diploid genotype in the succinct self-indexed representation, or 0.010 bits per haplotype base. Extrapolating linearly from the 51 megabases of chromosome 22 to the entire 3.2 gigabase human reference genome, a similar index of the entire 1000 Genomes dataset would take 27 GB, with 20 GB devoted to the gPBWT. This is well within the storage and memory capacities of modern computer systems.
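These density figures can be checked directly, taking decimal megabytes and noting that the 5008 haplotypes correspond to 2504 diploid samples:

$$\frac{321 \times 10^{6}\ \text{bytes} \times 8}{2504 \times 1{,}103{,}547} \approx 0.93\ \text{bits per diploid genotype},$$

$$\frac{321 \times 10^{6} \times 8}{5008 \times 51 \times 10^{6}} \approx 0.010\ \text{bits per haplotype base}.$$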
Random walks
The query performance of the gPBWT implementation was evaluated using random walk query paths. One million random walks of 100 bp each were simulated from the graph. To remove walks covering ambiguous regions, walks containing two or more N bases in a row were eliminated, leaving 686,590 random walks. The number of haplotypes in the gPBWT index consistent with each walk was then determined, taking 61.29 s in total using a single query thread on the above-mentioned Xeon system. The entire operation took a maximum of 458 MB of memory, indicating that the on-disk index did not require significant expansion during loading to be usable. Overall, the gPBWT index required 89.3 μs per count operation on the 100 bp random walks. It was found that 316,078 walks, or 46%, were not consistent with any haplotype in the graph. The distribution of the number of haplotypes consistent with each random walk is shown in Fig. 3.
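Conceptually, each count operation is a PBWT-style interval-narrowing scan along the walk: start with the interval of all thread visits to the first side, then map it through each successive B[] array with two rank queries. The sketch below illustrates the idea; the Index interface (b_length, edge_index, c, rank_B) is a hypothetical wrapper over the structures described earlier, not the vg API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for a step in the walk: (node ID, orientation).
using Side = std::pair<long, bool>;

// Count stored haplotype threads consistent with a query path. Index is
// assumed to provide:
//   b_length(s)      -- number of thread visits to side s (|B[s]|)
//   edge_index(s, t) -- local number of the edge from side s to side t
//   c(s, e)          -- offset of edge e's block at s, as in the text
//   rank_B(s, e, i)  -- occurrences of e in B[s][0, i)
template <class Index>
std::size_t count_consistent(const Index& idx,
                             const std::vector<Side>& path) {
    if (path.empty()) return 0;
    std::size_t begin = 0, end = idx.b_length(path[0]);
    for (std::size_t i = 0; i + 1 < path.size() && begin < end; ++i) {
        std::size_t e = idx.edge_index(path[i], path[i + 1]);
        // Threads sort stably by where they go next, so two rank queries
        // map the current interval into the next side's visit ordering.
        begin = idx.c(path[i], e) + idx.rank_B(path[i], e, begin);
        end   = idx.c(path[i], e) + idx.rank_B(path[i], e, end);
    }
    return end - begin;  // threads consistent with the entire path
}
```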
Read alignments
To further evaluate the performance of the query implementation, read alignments were tested for consistency with the stored haplotypes. 1000 Genomes Low Coverage Phase 3 reads for NA12878 that were mapped in the official alignment to chromosome 22 were downloaded and re-mapped to the chromosome 22 graph, using the xg/GCSA2-based mapper in vg, allowing for up to a single secondary mapping per read. (The vg aligner was chosen because of its ease of integration with our xg-based gPBWT implementation, but in principle any aligner that supports aligning to a graph could be used.) The mappings with scores of at least 90 points out of a maximum of 101 points (for a perfectly-mapped 101 bp read) were selected (thus filtering out alignments highly likely to be erroneous) and broken down into primary and secondary mappings. The number of haplotypes in the gPBWT index consistent with each read's path through the graph was calculated (Fig. 3). For 1,500,271 primary mappings, the count operation took 150.49 s in total, or 100 μs per mapping, using 461 MB of memory. (Note that any approach that depended on visiting each haplotype in turn, such as aligning each read to each haplotype, would have to do its work for each read/haplotype combination in less than 20 ns, or about 45 clock cycles, in order to beat this time.) It was found that 2521 of these primary mappings, or 0.17%, and 320 of 43,791 secondary mappings, or 0.73%, were not consistent with any haplotype path in the graph. These read mappings, despite having reasonable edit-based scores, may represent rare recombinations, but the set is also likely to be enriched for spurious mappings.
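The per-haplotype bound in the parenthetical follows directly from the measured totals:

$$\frac{150.49\ \text{s}}{1{,}500{,}271\ \text{mappings}} \approx 100\ \mu\text{s per mapping}, \qquad \frac{100\ \mu\text{s}}{5008\ \text{haplotypes}} \approx 20\ \text{ns} \approx 45\ \text{cycles at } 2.27\ \text{GHz}.$$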
Scaling characteristics
To evaluate how the space usage of our gPBWT implementation scales empirically, a scaling experiment was undertaken. The 1000 Genomes Phase 3 VCFs for the GRCh38 assembly were downloaded, modified to express all variants on the forward strand of the GRCh38 assembly, and used together with the assembly data to produce a graph for chromosome 22 based on the newer assembly. This graph was then used to construct a gPBWT with progressively larger subsets of the available samples, selected in the order they appear in the VCF file. For each subset, an xg serialization report was generated using the xg tool, and the number of bytes attributed to “threads” was recorded. The number of bytes used versus the number of samples stored is displayed in Fig. 4.
After the empirical size data were obtained, a curve with a logarithmic component and a linear component was fit to the data. This curve was used to extrapolate an estimated size of 5.34 GB on disk for the storage of 100,000 samples' worth of data on chromosome 22. We chose 100,000 because it is representative of the scale of large contemporary sequencing projects, such as Genomics England's 100,000 Genomes Project (https://www.genomicsengland.co.uk/the-100000-genomes-project/) [20] and the NHLBI's TOPMed program (https://www.nhlbi.nih.gov/research/resources/nhlbi-precision-medicine-initiative/topmed). Linear extrapolation from the 51 megabase chromosome 22 to the 3.2 gigabase human genome yields a size estimate of 336 GB for the storage of 100,000 diploid genomes, in addition to the space usage of the underlying graph. Although this extrapolation does not account for the dependence of graph complexity on the number of samples sequenced, it suggests that the gPBWT is capable of scaling to the anticipated size of future sequencing data sets, while using currently available computing resources.
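For concreteness, the fitted family has the form

$$S(n) = a + b \ln n + c\,n,$$

where S(n) is the on-disk size of the thread storage for n samples and a, b, c are fitted coefficients (the names are ours; the fitted values are not reproduced here). Plausibly, the logarithmic term absorbs slowly growing per-structure overheads, while the linear term dominates at scale and drives the extrapolations above.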