Overview
CZ ID aims to provide easy-to-use tools for pathogen identification that are robust and scalable for the long term. The mNGS short-read (Illumina) pipeline prioritizes identification of pathogens (including potentially novel pathogens) and therefore aligns raw sequencing data against full databases of known organisms. As the NCBI databases grow, the tools within CZ ID must evolve as well. To achieve this scale, as of CZ ID pipeline version 7.0, the short-read alignment tools GSNAP and rapsearch2 will be replaced with minimap2 and DIAMOND, respectively. The upgraded pipeline will be available after March 10, 2022. Read on to learn about the motivation, technical innovation, biological validation, and scalability improvements of this change, or browse a few datasets yourself!
Limitations of existing short-read alignment tools
As the field of genomics grows, aligning sequences to the full NCBI nucleotide (NT) and protein (NR) databases becomes increasingly challenging. Since its inception, the NCBI GenBank database has doubled approximately every 18 months [1]. When CZ ID (formerly IDseq) was first developed, the NCBI database was notably smaller, and both GSNAP and rapsearch2 could index the full database (for NT and NR, respectively) [2, 3]. However, NT has grown from under 200 GB to over 400 GB in roughly two years and shows no sign of slowing down.
Alignment of short-read sequences to the NCBI databases requires two steps: (1) pre-building an index from the database (a data structure that enables rapid querying of sequences), and (2) aligning sequencing reads (the query sequences) to the database using the pre-built index from step one. Building the index is extremely compute-intensive, and most alignment tools (including GSNAP and rapsearch2) require that the index be built and loaded on a single machine, therefore requiring increasingly large machines to perform these tasks.
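To make the two-step shape concrete, here is a toy sketch of building an index once and then querying many reads against it. This is an illustration of the index-then-query pattern only, not the actual GSNAP or minimap2 machinery; all names and sequences are invented.

```python
# Toy illustration of the two-step process: (1) build an index from reference
# sequences once per database release, then (2) query many reads against it.
# Real aligners use far more sophisticated index structures, but the shape is the same.

def build_index(references, k=5):
    """Step 1: map every k-mer in the reference set to its source sequence(s)."""
    index = {}
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def align_read(read, index, k=5):
    """Step 2: look up the read's k-mers in the prebuilt index and vote."""
    votes = {}
    for i in range(len(read) - k + 1):
        for name in index.get(read[i:i + k], ()):
            votes[name] = votes.get(name, 0) + 1
    return max(votes, key=votes.get) if votes else None

refs = {"virusA": "ACGTACGTGGCCTTAA", "virusB": "TTGGCCAATTCCGGAA"}
idx = build_index(refs)               # expensive step, done once
print(align_read("ACGTACGTGG", idx))  # cheap step, done per read
```

The expensive index build is amortized over millions of read queries, which is why index size (and the RAM needed to build it) dominates the scaling discussion above.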
The compute-intensive process of building indices from the databases has become impractical with our existing tools on CZ ID. Our last index update required an r5d.24xlarge AWS instance, which has 768 GB of RAM, and the process still took ~96 hours. Since then, the size of NT (and therefore the RAM required to build the index) has outstripped the amount of RAM available on a single instance from most cloud providers. As the databases continue to grow, this problem will only get worse. Thus, our team sought alternative solutions to enable the CZ ID pipeline to scale alongside these increasingly large databases.
Considerations with existing methods
Considering k-mer based taxonomic classification approaches
Of note, one common approach for dealing with alignment scalability challenges is to bypass full sequence alignment in favor of k-mer based methods. These methods split sequences into short sub-sequences (referred to as k-mers), match these k-mers against a database, and then rapidly assign taxonomic identity to reads. The CZ ID team considered a k-mer based approach, but ultimately decided against it due to the following considerations:
- CZ ID’s focus on identifying pathogens (including potentially-novel pathogens) requires the ability to identify matches to potentially-divergent sequences. In this context, full sequence alignment achieves greater sensitivity.
- CZ ID provides a number of metrics for data interpretation (including alignment length, which is a valuable indicator of quality). Given the sensitivity of the pipeline, the team wanted to maintain these metrics, which are not available when using k-mer based methods.
Horizontal vs. vertical scaling for alignment methods
Instead, the CZ ID team focused on scaling existing alignment methods. In principle, the alignment workload can be split across multiple machines. By scaling "horizontally" (using many smaller machines) rather than "vertically" (using an increasingly large single machine), we avoid the scale limitations of any single machine and can ensure that we always have the capacity to handle ever-larger databases. However, horizontal scaling is uncommon in modern bioinformatics tools and thus required specialized development.
Technical Details
Many modern alignment tools already split the reference database into partitions to conserve memory, but then run each partition serially on a single machine, greatly increasing the time spent in alignment. One of the major technical challenges we faced was how to re-implement tried-and-true bioinformatics tools to run alignment in parallel across multiple machines instead of serially (Figure 1). In the next section, we describe how we achieved this through modifications to two open-source alignment tools – minimap2 (for nucleotide alignment) and DIAMOND (for protein alignment) [4, 5].
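The scatter/merge pattern described here can be sketched in a few lines. In this toy example a shared-k-mer count stands in for a real alignment score, and plain function calls stand in for per-machine tasks; it illustrates the pattern, not the pipeline's actual code.

```python
# Scatter/merge sketch: align every read against each database partition
# independently (the "scatter", which can run on separate machines), then keep
# the best hit per read (the "merge"). Toy scoring: number of shared 5-mers.

K = 5

def kmers(seq):
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}

def align_to_partition(read, partition):
    """Scatter step: best hit for one read within one database partition."""
    best = (-1, "")  # (score, reference name)
    for name, ref in partition.items():
        best = max(best, (len(kmers(read) & kmers(ref)), name))
    return best

def merge_hits(per_partition_hits):
    """Merge step: the overall best hit is simply the max across partitions."""
    return max(per_partition_hits)

partitions = [
    {"refA": "ACGTACGTACGTAAAA"},   # partition 1 (one machine)
    {"refB": "TTTTGGGGCCCCAAAA"},   # partition 2 (another machine)
]
read = "ACGTACGTAC"
hits = [align_to_partition(read, p) for p in partitions]
print(merge_hits(hits))
```

Because the merge only needs the small per-partition hit summaries (not the partitions themselves), no single machine ever has to hold the full database.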
Horizontal scaling implementation of minimap2
To scale read alignment for the full NT database, we adopted a modified version of the popular minimap2 algorithm by Heng Li. Gamaarachchi et al. (2019) previously extended minimap2 to align reads against several partitions of a reference database (serially), then quickly merge the partitioned results to produce alignments nearly identical to those obtained using the whole database at once [6]. We further refactored this approach to allow distributed processing of the database partitions across compute nodes, such as through a parallel WDL scatter.
In practice, this allows us to align reads in parallel to a chosen number of partitions of the NT database, using minimap2 database indexes previously generated for each partition. In addition to the controlled index size per partition, each alignment task individually benefits from minimap2’s high speed and low memory usage, which are much improved compared to earlier generations of aligners.
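As a hedged illustration of the merge step: assuming the per-partition outputs are PAF records (minimap2's default output format) and that, for simplicity, the per-read winner is the record with the most matching bases (PAF column 10). The actual merge, following Gamaarachchi et al., is more careful (e.g., it also reconciles mapping quality across partitions).

```python
# Merge per-partition minimap2 PAF results, keeping each read's best alignment.
# PAF columns used: 1 = query name, 6 = target name, 10 = number of matching bases.

def best_per_read(paf_lines):
    best = {}
    for line in paf_lines:
        cols = line.split("\t")
        read, matches = cols[0], int(cols[9])
        if read not in best or matches > int(best[read].split("\t")[9]):
            best[read] = line
    return best

# One hit for the same read from each of two partitions (fields abbreviated).
partition1 = ["read1\t100\t0\t100\t+\ttaxA\t5000\t10\t110\t95\t100\t60"]
partition2 = ["read1\t100\t0\t100\t+\ttaxB\t8000\t40\t140\t80\t100\t60"]
merged = best_per_read(partition1 + partition2)
print(merged["read1"].split("\t")[5])  # taxA wins with 95 matching bases
```

In a WDL scatter, each partition's PAF file is produced by an independent task; only this lightweight merge runs downstream.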
Horizontal scaling implementation of DIAMOND
To scale read alignment for the full NR database, we adapted a version of the popular DIAMOND algorithm by Benjamin Buchfink. DIAMOND already supported parallel execution across machines, but it was designed to run on an HPC cluster with a shared file system. We slightly modified the program so that individual pieces can be run independently and the resulting files transferred between nodes. This made the existing parallelism flexible enough to run on our system, which is optimized for dynamic scaling.
Figure 1. Most bioinformatics tools are set up to use vertical scaling with increasingly large machines to index the reference database (A). The CZ ID implementations to scale minimap2 and DIAMOND leverage horizontal scaling by partitioning the index across multiple smaller machines and merging the final results (B).
Biological Validation
Validation Process
Validation of these pipeline changes was done through integration testing with a series of benchmark datasets previously used for evaluating pipeline performance [7, 8]. The impact of the tool upgrades on overall pipeline performance was characterized separately for the NT (minimap2) and NR (DIAMOND) databases. For each benchmark dataset, our automated benchmarking suite surfaced key metrics evaluating the estimated taxonomic abundances with respect to ground-truth relative abundance values (AUPR, L2 distance, precision, recall). We optimized parameterizations by observing trends in these metrics across a range of parameterizations as compared to those achieved by the existing CZ ID pipeline. Below, we provide details about the parameter optimization process, the metrics associated with the final pipeline upgrade, and the resulting impact on taxonomic abundance estimation in CZ ID.
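For readers unfamiliar with these metrics, here is a minimal sketch using textbook definitions; the actual benchmarking suite is more involved, and the data shapes below are invented.

```python
# Textbook-style metric sketches, computed from estimated vs. ground-truth
# relative abundances (dicts mapping taxon -> value).
import math

def l2_distance(truth, estimate):
    """Euclidean distance between the two abundance vectors over all taxa."""
    taxa = set(truth) | set(estimate)
    return math.sqrt(sum((truth.get(t, 0.0) - estimate.get(t, 0.0)) ** 2 for t in taxa))

def precision_at_full_recall(truth, scores):
    """Precision once the score threshold is low enough to recall every true taxon."""
    cutoff = min(scores[t] for t in truth)           # lowest-scoring true positive
    called = [t for t, s in scores.items() if s >= cutoff]
    return sum(t in truth for t in called) / len(called)

truth = {"taxA": 0.7, "taxB": 0.3}
scores = {"taxA": 0.9, "taxB": 0.5, "taxC": 0.4}     # taxC is a false positive
print(l2_distance(truth, {"taxA": 0.6, "taxB": 0.3, "taxC": 0.1}))
print(precision_at_full_recall(truth, scores))
```

AUPR (area under the precision-recall curve) generalizes the second metric by integrating precision over all recall levels rather than only at recall = 100%.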
Impact on Data Interpretation
Overall, the final pipeline upgrades resulted in...
- an increase in the AUPR and precision at recall = 100% in NT
- an increase in the sensitivity of NR for identification of a simulated divergent virus
- a marginal decrease in the AUPR in NR due to the increase in sensitivity
- a notable decrease in the time required to run both the NT and NR alignment steps
The impacts of these changes on real-world samples are likely to vary, but overall this instills confidence that the upgrades will maintain high-quality results for metagenomic samples while improving the scalability of the pipeline as reference databases grow. To see some real-world examples in CZ ID, check out the "see it for yourself" section of this blog post.
Detailed validation of NT alignment with minimap2
Optimizing minimap2 parameterization
Minimap2 has two parameters that affect sensitivity and specificity – k (minimizer k-mer length) and w (minimizer window size). Thus, we performed a grid search over k (range 12–20) and w (range 5–10) to identify an appropriate parameterization. For each parameter pair, we evaluated the AUPR, L2 distance, and precision at recall = 100%. For each dataset, these metrics were compared to the values obtained using the current CZ ID pipeline with GSNAP short-read alignment (Figure 2).
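The grid search itself can be sketched as follows. The `evaluate` function is a placeholder: in the actual validation, each (k, w) pair meant a full pipeline run (minimap2's `-k`/`-w` flags) producing the metrics above per benchmark dataset; here a toy score is wired to favor the parameterization ultimately chosen.

```python
# Grid search sketch over minimap2's k (12-20) and w (5-10) parameters.
import itertools

def evaluate(k, w):
    # Placeholder for "run pipeline with minimap2 -k {k} -w {w}, compute metrics".
    # Toy score: rewards the (14, 8) pair for illustration.
    return {"aupr_gain": (k == 14) + (w == 8)}

results = {
    (k, w): evaluate(k, w)
    for k, w in itertools.product(range(12, 21), range(5, 11))
}
best = max(results, key=lambda kw: results[kw]["aupr_gain"])
print(best)
```

In practice the selection was not a single-score argmax but a staged filter (AUPR gain, then L2 reduction, then precision at recall = 100%), as described below.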
Figure 2. Performance of each dataset was evaluated across a range of parameterizations. (A) AUPR for a single dataset (soil) using the upgraded CZ ID pipeline with k = 14, w = 8. For each possible k and w the raw AUPR value was evaluated (B) and then compared to the original CZ ID pipeline to evaluate the differences (C). The star indicates the AUPR values associated with the original pipeline.
We observed that, despite generally favorable results, the impact of the tool upgrade (from GSNAP to minimap2) differed between datasets. Thus, we selected parameters that improved average AUPR while reducing L2 distance and improving precision at recall = 100% across the tested datasets. Across most parameterizations, AUPR increased. Four parameterizations increased AUPR by 0.35 or more; among those, three reduced L2 distance by > 0.001; and of those, k = 14, w = 8 maximized the precision at recall = 100% (Figure 3). Thus, we selected a final minimap2 parameterization of k = 14, w = 8.
Figure 3. Grid search results are plotted for each metric, showing the average difference across the twelve tested datasets between the new and old pipelines for each metric – AUPR (A), L2 distance (B), and precision at recall = 100% (C) – for each k and w tested. Red indicates positive improvement (increased AUPR and precision, reduced L2 distance), gray indicates neutral change, and blue indicates reduced performance.
Performance of minimap2 upgrade on NT taxonomic abundance estimation
This parameterization resulted in an average increase in NT AUPR of 0.0035, an average reduction in L2 distance of 0.00026, and an average increase in precision (at recall = 100%) of 0.016. Altogether, AUPR was improved or unaffected in 10 of 12 benchmark samples, and precision at recall = 100% was improved in all 12 benchmark samples (Figure 4). These values suggest that the upgraded pipeline achieves comparable or improved performance relative to the original CZ ID pipeline.
Figure 4. For the optimized parameterization (k = 14, w = 8) we show the AUPR (A), L2 distance (B), and precision at 100% recall (C) for the new pipeline as compared to the previous CZ ID pipeline across all automated benchmark samples.
Detailed Validation of NR Alignment with DIAMOND
DIAMOND has a single sensitivity tuning parameter with six options (--fast, --mid-sensitive, --sensitive, --more-sensitive, --very-sensitive, and --ultra-sensitive) [5]. Thus, we analyzed each benchmark dataset across all sensitivity settings and evaluated the resulting performance metrics (AUPR, L2 distance, and precision at recall = 100%) with respect to those obtained using the CZ ID pipeline with rapsearch2 short-read alignment. We observed that the sensitivity setting did not have a significant impact on the metrics for the benchmark datasets. Thus, we further tested the ability of the different settings to identify divergent viruses (a main use case for NR alignment within CZ ID) using a pre-existing series of simulated divergent rhinovirus C samples at increasing levels of divergence [9]. Again, we found no differences in the number of reads recovered for any of the divergent viruses across the sensitivity settings. We ultimately opted to use the --mid-sensitive option.
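A sweep like this can be scripted by varying only the sensitivity flag. This sketch merely constructs candidate `diamond blastx` command lines (the query, database, and output paths are placeholders) rather than executing them.

```python
# Build one DIAMOND protein-alignment command line per sensitivity setting.
# Input/output paths here are placeholders, not CZ ID's actual file layout.

SENSITIVITY_FLAGS = [
    "--fast", "--mid-sensitive", "--sensitive",
    "--more-sensitive", "--very-sensitive", "--ultra-sensitive",
]

def diamond_command(flag, query="reads.fasta", db="nr.dmnd", out=None):
    out = out or f"hits{flag.replace('--', '_')}.m8"
    return ["diamond", "blastx", flag, "-q", query, "-d", db, "-o", out]

commands = [diamond_command(f) for f in SENSITIVITY_FLAGS]
print(commands[1])  # the --mid-sensitive variant
```

Each command could then be dispatched with `subprocess.run` and the resulting tabular hits fed into the same benchmarking metrics used for the NT validation.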
Performance of DIAMOND upgrade on NR taxonomic abundance estimation
When comparing DIAMOND to rapsearch2 (the pre-existing NR alignment tool), we observed that DIAMOND had increased sensitivity for detecting divergent viruses. DIAMOND was able to detect divergent sequences at 70% sequence identity to a reference, whereas rapsearch2 was only able to detect sequences down to 75% sequence identity (Table 1).
Table 1. Results comparing sensitivity of rapsearch2 and DIAMOND (using --mid-sensitive parameter) for detection of simulated divergent Rhinovirus C. When marked as detected, greater than 80% of input reads could be identified at that divergence level.
| Sample | Sample Description | rapsearch2 | DIAMOND |
| --- | --- | --- | --- |
| HRC_085 | Divergent Rhinovirus C, 85% identity to reference | detected | detected |
| HRC_080 | Divergent Rhinovirus C, 80% identity to reference | detected | detected |
| HRC_075 | Divergent Rhinovirus C, 75% identity to reference | detected | detected |
| HRC_070 | Divergent Rhinovirus C, 70% identity to reference | not detected | detected |
| HRC_065 | Divergent Rhinovirus C, 65% identity to reference | not detected | not detected |
The increased sensitivity of DIAMOND also had implications for the estimation of relative abundance in the 12 benchmark samples. While several benchmark samples obtained equivalent AUPR values, we observed a slight decrease in AUPR across four datasets (Table 2). One notable reduction was in the atcc_staggered dataset (AUPR of 0.8 with rapsearch2 vs. 0.72 with DIAMOND), which is composed of known organisms at staggered abundances, including some true positives present at exceedingly low abundance. Of note, this dataset had the lowest AUPR of all datasets with both DIAMOND and rapsearch2. The drop in AUPR is due to increased sensitivity, and therefore an increase in low-abundance false-positive results, which have a more notable impact in cases where there are low-abundance true positives, as in the atcc_staggered dataset.
Historically, researchers using CZ ID have accounted for the presence of false positives in NR (whether from rapsearch2 or, now, DIAMOND) by applying threshold filters that require taxa to have at least one read aligning to both the NT and NR databases. This greatly reduces the amount of noise in relative abundance estimation while still enabling researchers interested in divergent viruses to capitalize on the value of the NR database. You can experiment with this approach by viewing samples run with both pipeline versions here.
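A minimal sketch of such a threshold filter, assuming a hypothetical per-taxon read-count report (the field names and numbers are invented for illustration):

```python
# Keep only taxa with at least one read aligned in both the NT and NR databases.

def nt_nr_filter(report, min_nt_reads=1, min_nr_reads=1):
    return {
        taxon: counts
        for taxon, counts in report.items()
        if counts["nt"] >= min_nt_reads and counts["nr"] >= min_nr_reads
    }

report = {
    "Rhinovirus C":   {"nt": 152, "nr": 97},
    "spurious taxon": {"nt": 0,   "nr": 3},   # NR-only noise, filtered out
}
print(sorted(nt_nr_filter(report)))
```

Raising the thresholds trades sensitivity for precision; the defaults here mirror the "at least one read in both NT and NR" rule described above.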
Table 2. AUPR values achieved on 12 benchmark datasets for rapsearch2 and DIAMOND (using --mid-sensitive parameter). Red indicates cases where AUPR dropped slightly by DIAMOND as compared to rapsearch2.
| Dataset | AUPR by rapsearch2 | AUPR by DIAMOND |
| --- | --- | --- |
| atcc_even | 1 | 1 |
| atcc_staggered | 0.8 | 0.72 |
| idseq_bench_3 | 1 | 1 |
| idseq_bench_5 | 0.99 | 0.99 |
| unambiguouslymapped_ds_7 | 0.91 | 0.88 |
| unambiguouslymapped_ds_buccal | 1 | 1 |
| unambiguouslymapped_ds_cityparks | 0.98 | 0.96 |
| unambiguouslymapped_ds_gut | 1 | 1 |
| unambiguouslymapped_ds_hous1 | 0.96 | 0.96 |
| unambiguouslymapped_ds_hous2 | 1 | 1 |
| unambiguouslymapped_ds_nycsm | 0.92 | 0.92 |
| unambiguouslymapped_ds_soil | 0.98 | 0.96 |
Speed / Scalability Improvements
In addition to enabling CZ ID to map against the entire NCBI database by leveraging horizontal scalability, the upgrades to the short-read alignment step have tangible effects on pipeline speed. First, we compared the alignment time for each of the minimap2 parameterizations. Overall, there were some differences, with lower k and w values tending to take longer than higher values, as expected. However, the time differences as a function of parameterization were minor in comparison with the notable speed improvements of pipeline version 7.0 over the previous version. For the 12 benchmark datasets evaluated, the original pipeline (v6.9) took an average of 164 minutes* and the upgraded pipeline (v7.0) took an average of 103 minutes, an average reduction in total pipeline run time of 38%. Focusing specifically on the short-read alignment step, the average reduction in alignment time was 49%, with most of the gains coming from a reduction in NR alignment time. While not the primary aim of the upgrade, the speed improvements are a welcome side effect and have the potential to significantly impact the scalability of pipeline runs overall.
*Note: these run time values were obtained by running the pipeline in our production AWS system, which represents real-world usage but does introduce some noise due to queueing time and the availability of appropriate machines.
Comparison - see it for yourself!
To show the impact of the changes within the CZ ID interface and enable researchers to perform their own comparisons, we've compiled the benchmark samples run on the old (v6.9) and new (v7.0) pipeline versions (Table 3). Feel free to log in to CZ ID and take a look for yourself!
Table 3. Table containing links to comparative datasets.
| Sample Name | Pipeline v6.9 (old) CZ ID Sample | Pipeline v7.0 (new) CZ ID Sample |
| --- | --- | --- |
| atcc_even | | |
| atcc_staggered | | |
| idseq_bench_3 | | |
| idseq_bench_5 | | |
| unambiguouslymapped_ds_7 | | |
| unambiguouslymapped_ds_buccal | | |
| unambiguouslymapped_ds_cityparks | | |
| unambiguouslymapped_ds_gut | | |
| unambiguouslymapped_ds_hous1 | | |
| unambiguouslymapped_ds_hous2 | | |
| unambiguouslymapped_ds_nycsm | | |
| unambiguouslymapped_ds_soil | | |
References
1. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2016;44(D1):D67–72.
2. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–81.
3. Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next generation sequencing data. Bioinformatics. 2012;28(1):125–6.
4. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100. doi:10.1093/bioinformatics/bty191
5. Buchfink B, Reuter K, Drost HG. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18:366–8. doi:10.1038/s41592-021-01101-x
6. Gamaarachchi H, Parameswaran S, Smith MA. Featherweight long read alignment using partitioned reference indexes. Sci Rep. 2019;9:4318. doi:10.1038/s41598-019-40739-8
7. McIntyre ABR, Ounit R, Afshinnekoo E, et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol. 2017;18:182.
8. Ye SH, Siddle KJ, Park DJ, et al. Benchmarking metagenomics tools for taxonomic classification. Cell. 2019;178:779–94.
9. Kalantar KL, et al. IDseq—an open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring. GigaScience. 2020;9(10):giaa111.