Authors
Chan Zuckerberg Initiative: Katrina Kalantar*, Phoenix Logan*, Todd Morse, Ryan Lim, Omar Valenzuela, Nina Bernick, Juan Caballero Perez, Francisco Loo, Karyna Rosario, Xochitl Butcher, Olivia Holmes (*lead authors)
Chan Zuckerberg Biohub: Joe DeRisi
Summary
The exponential growth of the NCBI GenBank database presents significant challenges for metagenomic analysis, particularly in pathogen detection. To address these challenges, we introduce a novel compression approach, implemented in the CZ ID metagenomic pipelines, that reduces database size while maintaining sequence diversity. The method uses sourmash, a tool suite that applies MinHash techniques, to remove redundant nucleotide and protein sequences from the NCBI databases through efficient k-mer hashing. The approach was validated across various benchmark datasets, demonstrating that known and novel pathogens are still accurately identified post-compression while computational efficiency improves, which is crucial for keeping metagenomic tools up to date as databases grow.
Introduction
The National Center for Biotechnology Information (NCBI) GenBank database, a comprehensive repository of nucleotide and protein sequences, is a key resource for biological research. It offers crucial insights into genetic diversity, evolutionary relationships, and functional annotations (1). It is fundamental for metagenomic analyses, particularly for classifying sequencing reads via alignment. Metagenomics, which has emerged as a vital tool for analyzing the pathogen landscape and identifying emerging pathogens, benefits from the NCBI database's extensive sequence range for enhanced detection of novel and divergent sequences (2–4).
The rapid growth of the NCBI GenBank database, doubling every 18 months, challenges the scalability of tools used for sequence alignment and classification (5). To enhance pathogen detection, database size reduction without compromising sequence diversity is crucial. Several recent efforts, such as NCBI's ClusteredNR and phylogenetic compression, reflect this need (6–8).
This article introduces a novel method for compressing the NCBI GenBank database that maintains sequence diversity while minimizing redundancy. We detail this method and its benchmarking on simulated and real datasets to showcase its efficacy for metagenomic pathogen detection.
Methods
Implementation
The CZ ID compression method utilizes sourmash, a tool suite employing MinHash techniques for efficient genomic and metagenomic sequence analysis (9). The approach, which operates separately on the NCBI NT and NR databases, is summarized in Figure 1. It begins by binning sequences by taxID (according to the accession2taxid mapping files provided by NCBI) and sorting each bin by descending sequence length. Next, a k-mer signature is initialized for each taxon bin, and data for each taxID is loaded in chunks. The longest sequences are processed first and added to the signature. Each incoming shorter sequence is then evaluated for k-mer containment within the accumulated signature of previously added sequences; sequences with containment above a specified threshold are discarded, thereby removing redundancy. The routine is implemented using Rust bindings to sourmash in the CZ ID workflows repository.
Figure 1. Database compression algorithm workflow.
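A minimal Python sketch of the greedy, per-taxon deduplication described above, using the sourmash Python API (the production routine uses Rust bindings and streams sequences in chunks; this sketch assumes the sequences for one taxID are already sorted longest-first):

```python
from sourmash import MinHash

def deduplicate_taxon(sequences, ksize=31, scaled=1000, threshold=0.9):
    """Return the subset of `sequences` kept after greedy containment filtering.

    `sequences` is assumed to be pre-sorted from longest to shortest.
    """
    # Running k-mer signature accumulating all sequences kept so far for this taxon
    accumulated = MinHash(n=0, ksize=ksize, scaled=scaled)
    kept = []
    for seq in sequences:
        sig = MinHash(n=0, ksize=ksize, scaled=scaled)
        sig.add_sequence(seq, force=True)  # force=True skips invalid k-mers (e.g., Ns)
        # Fraction of this sequence's k-mer hashes already present in the accumulated signature
        if len(sig) > 0 and sig.contained_by(accumulated) >= threshold:
            continue  # redundant: most of its k-mers are already represented
        accumulated.merge(sig)  # fold the new sequence's hashes into the running signature
        kept.append(seq)
    return kept
```

Because longer sequences are processed first, each taxon bin tends to retain its most complete representatives, with shorter fragments kept only when they contribute sufficient novel k-mer content.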
Specific parameterization is critical to the accuracy and scalability of the compression approach. The sourmash hashing parameters (k = 31, scaled = 1000), which affect the time cost and sensitivity of the compression routine, were selected based on an evaluation of containment as a function of genome similarity in simulation experiments. To select an appropriate containment threshold, we compared the impact of database compression on metagenomic next-generation sequencing (mNGS) profiles across a range of containment thresholds. A containment threshold of 0.9 was selected to balance mNGS profile consistency against the compression achieved.
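For intuition about these choices (a simplified model, not the simulation experiments used for selection): a k-mer survives mutation only if all k of its bases are unchanged, so the expected containment between a genome and a copy diverged at per-base rate p is roughly (1 - p)^k. At k = 31, a 0.9 containment threshold therefore only collapses sequences within roughly 0.3-0.4% divergence:

```python
# Approximate expected k-mer containment between a sequence and a copy
# mutated at per-base rate p. Simplified model: containment ~ (1 - p)**k.
for k in (21, 31):
    for p in (0.001, 0.005, 0.01, 0.05):
        print(f"k={k}, divergence={p:.1%}: expected containment ~ {(1 - p) ** k:.3f}")
```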
Evaluation and Validation
Evaluation of Database Compression in the Context of mNGS Data Analysis
The utility of the compressed NCBI database for metagenomic data analysis was evaluated using uncompressed and compressed database versions within the CZ ID mNGS pipelines (13). This assessment involved running a series of simulated and real datasets through the pipelines, enabling a direct comparison of performance between complete and compressed databases (Table 1).
Several datasets were used to test the impacts of database compression within the context of the mNGS analysis pipeline (Table 1). The datasets enabled us to evaluate the following:
Comparison to ground truth: The Unambiguously Mapped datasets and their associated “gold standard” truth distributions, sourced from previously published benchmarking studies (14,15), were used for initial evaluation of database performance across a parameter sweep. This assessment involved compressing the NCBI database using a range of parameterizations (k = 21, 31; scaled = 100, 1000; containment threshold = 0.6, 0.7, 0.8, 0.9), using the resulting databases for mNGS analysis, and evaluating the resulting taxon abundances. A comparable “gold standard” dataset (Zymobiomics, obtained from a previously published study evaluating the utility of Nanopore sequencing for metagenomic analysis (16)) was used for evaluating the CZ ID Nanopore mNGS pipeline.
Assessment of sensitivity for divergent viruses: The Virus Spiked datasets were designed to simulate the presence of novel viruses, divergent from existing entries in the NCBI database. Reference genomes for six virus species were downloaded from NCBI (Human herpesvirus 5, NC_006273.2; Hepatitis B virus, NC_003977.2; Lassa virus L/S, NC_004297.1/NC_004296.1; Nipah virus, NC_002728.1; Human rhinovirus C, NC_009996.1; Severe acute respiratory syndrome coronavirus 2, NC_045512.2) and subjected to artificial mutation at rates ranging from 5% to 50% using code available at https://github.com/caballero/mutator/ (a simplified sketch of this mutation step appears after this list). The resulting mutated genomes, in FASTA format, were utilized to generate simulated Illumina and Nanopore reads using the ART simulator and PBsim2, respectively, achieving a sequencing depth of 7X (17,18).
Assessment of real-world datasets: Evaluation of the compressed database's efficacy was conducted using two real-world datasets. First, a SARS-CoV-2 amplicon sequencing sample (SC2) was chosen to assess performance on a highly redundant taxon. Second, the Medical Detectives dataset, which aggregates various sample types from existing pathogen-detection research, was utilized for general pathogen detection analysis.
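The actual divergent genomes were generated with the mutator tool linked above; as a rough illustration of the substitution step only (assuming uniform, substitution-only mutation), a minimal version might look like:

```python
import random

def mutate(genome: str, rate: float, seed: int = 42) -> str:
    """Introduce random substitutions into `genome` at per-base `rate`."""
    rng = random.Random(seed)
    bases = "ACGT"
    out = []
    for base in genome:
        if base in bases and rng.random() < rate:
            # substitute with one of the three other bases
            out.append(rng.choice(bases.replace(base, "")))
        else:
            out.append(base)  # unmutated base, or ambiguous character left as-is
    return "".join(out)

# e.g., a 20%-divergent copy of a reference sequence
divergent = mutate("ACGTACGTACGTACGTACGT" * 10, rate=0.20)
```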
Table 1. Summary of the datasets used for benchmarking, alongside metrics comparing compressed vs. non-compressed databases. All datasets were short-read mNGS, except where explicitly labeled as “Nanopore”. SD = standard deviation; AUPR = Area under the precision recall curve; Avg = average.
| Dataset Name | Description | NT Metrics | NR Metrics |
|---|---|---|---|
| Unambiguously Mapped | 10 reference datasets commonly used for benchmarking, used for parameter sweep and initial compression evaluation (14) | Avg Δ AUPR: -0.002, SD = 0.005; Avg Δ L2: 0.000, SD = 0.001; Avg Spearman correlation: 0.906, SD = 0.030; Avg QC-filtered Spearman correlation: 0.985, SD = 0.017 | Avg Δ AUPR: -0.004, SD = 0.005; Avg Δ L2: 0.001, SD = 0.013; Avg Spearman correlation: 0.618, SD = 0.085; Avg QC-filtered Spearman correlation: 0.954, SD = 0.041 |
| Virus Spiked | 6 simulated divergent virus samples, used for evaluating sensitivity for novel pathogen detection (using CZ ID Illumina mNGS pipeline) | Avg Δ LOD: 0, SD = 0 | Avg Δ LOD: 0, SD = 0 |
| SC2 | 1 amplicon-based SARS-CoV-2 sample, used for evaluating performance improvements for a highly represented taxon | Reads mapped: +0.03%; Coverage breadth: 0.3% increase; Coverage depth: 0.5% increase | Reads mapped: +0.04% |
| Zymobiomics (Nanopore) | 1 sample from published mock community, used for evaluating database compression for ONT pipeline (16) | Δ AUPR: 0.0; Δ L2-dist: 0.0; Spearman correlation: 0.911; QC-filtered Spearman correlation: 0.988 | Δ AUPR: 0.005; Δ L2-dist: 0.021; Spearman correlation: 0.824; QC-filtered Spearman correlation: 1.0 |
| Virus Spiked (Nanopore) | 6 simulated divergent virus samples, used for evaluating sensitivity for novel pathogen detection (using CZ ID Nanopore mNGS pipeline) | Avg Δ LOD: 0, SD = 0 | Avg Δ LOD: 0, SD = 0 |
| Medical Detectives | 9 real-world datasets containing pathogens of interest, curated from a variety of studies, used for expansion of findings to real-world analysis | Avg Spearman correlation: 0.691, SD = 0.109; Avg QC-filtered Spearman correlation: 0.980, SD = 0.032; Avg Δ prop: 0.001, SD = 0.001 | Avg Spearman correlation: 0.802, SD = 0.037; Avg QC-filtered Spearman correlation: 0.992, SD = 0.010; Avg Δ prop: -0.012, SD = 0.025 |
Evaluation Methods and Metrics
Datasets were evaluated using multiple metrics to gauge the impact of database compression on mNGS taxon abundances. For datasets with known taxon distributions (Unambiguously Mapped and Zymobiomics), we computed the area under the precision-recall curve (AUPR) and the L2 distance with respect to expected taxon counts.
AUPR, which indicates the accuracy of microbe detection relative to abundance cutoffs, and L2 distance, which measures the similarity of relative abundances to ground truth, were then contrasted for runs using uncompressed (original) and compressed databases (13,15). Spearman correlation was applied to all samples, including those without ground truth data, to directly compare taxon counts between full and compressed database results. Outliers were investigated manually. For divergent viral samples (Virus Spiked), we determined the “divergence limit of detection” (defined as the greatest divergence percentage allowing for alignment of at least one read to the expected taxon) using original and compressed databases. Finally, in single virus samples (SC2, Novel Virus), genome coverage and percentage identity (%ID) to reference were analyzed.
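As a rough sketch of how these metrics can be computed with standard SciPy/scikit-learn calls (the exact evaluation code is linked under Code and Data Availability; here AUPR is estimated via average precision, and the input arrays are assumed to be aligned over the same taxa):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score

def compare_profiles(expected: np.ndarray, observed: np.ndarray) -> dict:
    """Compare an observed taxon-abundance vector against a reference vector.

    For ground-truth comparisons, `expected` holds the known taxon counts;
    for full-vs-compressed comparisons, it holds the uncompressed-database counts.
    Zero means the taxon was absent/not detected.
    """
    # AUPR (estimated via average precision): how well abundance ranks
    # truly present taxa above false positives across abundance cutoffs
    aupr = average_precision_score(expected > 0, observed)
    # L2 distance between relative-abundance profiles
    l2 = np.linalg.norm(observed / observed.sum() - expected / expected.sum())
    # Spearman rank correlation of taxon abundances
    rho, _ = spearmanr(expected, observed)
    return {"aupr": aupr, "l2": l2, "spearman": rho}
```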
Results
Assessment of Parametrization
The containment threshold used for database clustering had a notable impact on mNGS pipeline performance: as expected, as the threshold increased (indicating less compression), the results more closely matched the baseline (uncompressed database). In particular, the Spearman correlation increased while the number of spurious taxa identified and the L2 distance decreased. These trends were consistent across compression of NT and NR (Figure 2). Taken alongside the compression savings (see Clustering Impact on Compute Resources), a containment threshold of 0.9 was adopted for both NT and NR database clustering.
Figure 2. Parameter sweep evaluating total Spearman correlation between mNGS profiles obtained when using compressed versus uncompressed databases at a range of containment values for NT (A) and NR (B).
Clustering Impact on Compute Resources
Initially, the NCBI 2021 database sizes were 370 GB (NT) and 169 GB (NR). Post-compression, these were reduced to 250 GB (NT) and 147 GB (NR), reductions of 32% and 13%, respectively. The NCBI 2024 database grew to 1.29 TB (NT) and 312 GB (NR) and compressed to 842 GB (NT) and 245 GB (NR); thus, despite the larger starting size, compression of the NCBI 2024 database achieved comparable reductions of 34% (NT) and 21% (NR). The compression routine required approximately 11 days for joint compression of the 2024 NT and NR databases, up from approximately 4 days for the 2021 databases.
Clustering Impact on Pipeline Accuracy
Compared to gold standards, NT and NR datasets showed similar AUPR and L2 values with original and compressed databases (mean Δ AUPR = -0.002 for NT and -0.004 for NR; mean Δ L2 = 0.000 for NT and 0.001 for NR; see Unambiguously Mapped dataset in Table 1). Average Spearman correlations between original and compressed database results were 0.906 (±0.030 SD) for NT and 0.618 (±0.085 SD) for NR (Figure 3).
Figure 3. Accuracy metrics for Unambiguously Mapped benchmark datasets. Showing for NT (blue) and NR (orange), the change in AUPR (A) and change in L2 distance (B) as compared to the gold standard. Then, (C) the Spearman correlation directly comparing taxon abundances from original v. compressed databases, first using raw taxon abundances and then using QC-filtered taxon abundances.
Deeper investigation of these values suggested that low-abundance, false-positive taxa were drivers of the changes in correlation. The ATCC_even sample, comprising a wide range of known organisms at equal abundances, showed the lowest correlation for both databases, likely due to deviations in low-abundance taxa (Figure 4). Notably, deviations were more pronounced in NR than NT, consistent with the previously documented reduced specificity of NR results due to rRNA-induced false positives.
Figure 4. Correlation plot comparing compressed vs. uncompressed results for the ATCC_even sample (Unambiguously Mapped benchmark dataset) on NT (A) and NR (B). Gray dots indicate taxa that did not meet the “QC filter thresholds” applied for computation of the “QC-filtered Spearman correlation”, which highlights taxa with evidence on both NT (NT rPM >= 10) and NR (NR rPM >= 1).
These trends persisted across different sequencing technologies and pipelines (see Zymobiomics dataset in Table 1). To emulate standard practice in mNGS data analysis, whereby low-confidence hits may be filtered out, QC filtering was applied, which dramatically improved the correlation across databases (mean Spearman correlation = 0.985 (±0.017 SD) for NT and 0.954 (±0.041 SD) for NR) (Figure 3C).
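For reference, this QC filter amounts to a simple threshold on a per-taxon report; a minimal sketch, assuming a pandas DataFrame with hypothetical columns `nt_rpm` and `nr_rpm`:

```python
import pandas as pd

def qc_filter(report: pd.DataFrame) -> pd.DataFrame:
    """Keep taxa with supporting evidence on both NT (rPM >= 10) and NR (rPM >= 1).

    Column names `nt_rpm` and `nr_rpm` are illustrative, not the actual report schema.
    """
    return report[(report["nt_rpm"] >= 10) & (report["nr_rpm"] >= 1)]
```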
In Virus Spiked simulations, the compressed database’s median detection limit was 30% divergence for NT (range 25% - 35%) and 45% for NR (range 40% - 50%), which was equivalent to the original database’s limits (Figure 5).
Figure 5. Evaluation of divergence limit of detection using simulated divergent virus samples. A) For a single example (SARS-CoV-2), the rPM associated with the true positive taxon “Betacoronavirus” for NT (blue) and NR (orange) on uncompressed (light) and compressed (dark) databases. B) Limit of detection for each of six simulated viruses comparing NT (blue) vs. NR (orange) in compressed (dark) v. uncompressed (light).
The enhanced sensitivity of NR as compared to NT is consistent with existing literature, attributable to the differential impact of mutation rates on nucleotide versus protein alignments (13). These patterns were consistent across sequencing technologies and associated pipelines (Virus Spiked Nanopore dataset) (Figure 6).
Figure 6. Divergent virus detection for Virus Spiked Nanopore dataset.
The post-pandemic surge in SARS-CoV-2 genomes significantly increased their redundancy in the NCBI database, leading to especially high compression for this taxon (a 96% reduction in size for the 2024 index). In the SC2 dataset, key metrics were comparable with the compressed database (Table 1). Total reads mapping to SARS-CoV-2 remained similar (0.03% and 0.04% increases in reads mapped via NT and NR, respectively). Changes in genome coverage depth and breadth were negligible (Table 1).
In real-world samples from the Medical Detectives project, the compressed database consistently identified the same pathogen species, with small average Δ proportions (0.001 for NT and -0.012 for NR) (Table 2). Taxonomic distribution comparisons yielded an average QC-filtered Spearman correlation of 0.980 (±0.032 SD) for NT and 0.992 (±0.010 SD) for NR. These findings align with those from benchmark and simulated data, indicating the reliability and accuracy of the compressed database in practical applications.
Table 2. Proportion of pathogenic species identified in the Medical Detectives dataset using uncompressed (original) and compressed databases (DB). Proportions are based on reads per million (rPM).
Discussion
In this article, we describe a sequence database compression approach that successfully reduces computational resources required for metagenomic data analysis without compromising accuracy. The approach was validated across various benchmark datasets, particularly in pathogen-detection scenarios. This study presents a significant advancement in managing the ever-growing NCBI GenBank database, crucial for metagenomic research. By leveraging efficient k-mer hashing tools, we have demonstrated that it is possible to compress both nucleotide and protein sequences, catering to the needs of contemporary metagenomic analysis.
A few approaches have been developed for tackling the challenges of growing reference databases. One suite of methods has focused primarily on database compression for the NCBI protein database, often collapsing taxonomic groupings to a representative least common ancestor (6,7). These methods achieve substantial compression and can speed up protein searches and improve specificity, but may reduce sensitivity for species-level identification. Integrating the MMseqs2-based clustering approaches employed for those databases alongside the taxID binning step of the approach discussed here represents a viable option for achieving comparable results to those presented in this paper. However, in preliminary testing, we observed slower total clustering times and therefore halted further investigation.
Another approach, known as phylogenetic compression, integrates phylogenetic information into the clustering to further improve its efficiency. Phylogenetic compression offers a promising approach to compressing databases of well-curated genomes, but it was validated on smaller datasets containing no more than hundreds of thousands of genomes (8). Our approach leverages efficient k-mer hashing tools for species-resolved compression of the entire NCBI nucleotide and protein databases, achieving unbiased sequence redundancy reduction without the need for complete genomes. This unlocks the potential for regular and continual updates of complete databases used for nucleotide and protein sequence alignment in metagenomic pipelines, essential in pathogen detection applications given the global scale and diversity of pathogen sequencing efforts and the need to detect divergent viruses.
Despite maintaining the ability to accurately identify pathogens post-compression, our approach was challenged by an increased detection of low-abundance taxa, especially false positives identified in analysis with the NR database. This underscores the limitations of taxonomy-aware methods and suggests the merits of the aforementioned least-common-ancestor approaches or algorithmic solutions to mitigate false positives due to sequence homology (19). Moreover, apparent false positives could stem from database contamination or misannotation.
Implementation trade-offs involved parameterization choices in sourmash, balancing accuracy and compression efficiency. The sourmash scaled value of 1000, dictated by compression routine runtime constraints, limited our ability to evaluate lower values (e.g., scaled = 100), potentially impacting sensitivity. The chosen containment threshold of 0.9 offers a balance between accuracy and database size, though different thresholds might yield greater compression for specific analysis goals, such as cases where genus-level accuracy is sufficient.
The compression routine implements a greedy algorithm, which enables its scalability to the entire NCBI database but may not select an optimally compact sequence set. Consequently, the resulting compressed database is marginally larger than the theoretical optimum. Additionally, the current implementation does not address database contamination from misannotation, a persistent challenge for NCBI curators. Future enhancements could include additional safeguards, such as taxon-independent containment searches or outlier sequence detection within taxa, to further refine this approach and make it robust to database sequence contamination.
Our findings underscore the challenges and trade-offs in database compression, particularly in balancing accuracy, sensitivity, and computational efficiency. While our approach offers a practical solution for current metagenomic workflows, especially in pathogen detection, it also highlights areas for future improvement. Addressing the increased detection of low-abundance taxa and potential false positives, refining algorithm parameters, and enhancing the method to be robust against database contamination are crucial next steps.
In conclusion, the developed compression routine holds the potential to transform the use of large reference databases in metagenomics, enabling more efficient and up-to-date analyses. This work lays the groundwork for further refinements that could lead to more sophisticated and scalable bioinformatics tools, ultimately advancing our understanding and detection of diverse pathogens in a rapidly evolving genomic landscape.
Code and Data Availability
CZ ID workflow code used for NCBI database compression can be found at https://github.com/chanzuckerberg/czid-workflows/tree/main/workflows/index-generation. Sample reports and additional code for data analyses can be found at https://github.com/katrinakalantar/ncbi-compress-experiments/.
Datasets used in this study were uploaded to CZ ID public projects (NCBI Index 2021-01-22) where non-host reads are publicly available for download. Projects include: Unambiguously Mapped - NCBI DB Compression, Virus Spiked - NCBI DB Compression, and Medical Detectives - NCBI DB Compression.
Acknowledgements
We thank the whole CZI Infectious Disease development team for their contributions to the metagenomics pipelines used throughout these analyses.
References
1. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, et al. GenBank. Nucleic Acids Res. 2012 Nov 26;41(D1):D36–42.
2. Ko KKK, Chng KR, Nagarajan N. Metagenomics-enabled microbial surveillance. Nat Microbiol. 2022 Apr 1;7(4):486–96.
3. Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019 Jun;20(6):341–55.
4. Smith RH, Glendinning L, Walker AW, Watson M. Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome. Anim Microbiome. 2022 Nov 18;4(1):57.
5. GenBank and WGS Statistics [Internet]. [cited 2023 Oct 27]. Available from: https://www.ncbi.nlm.nih.gov/genbank/statistics/
6. Staff N. NCBI Insights. 2022 [cited 2023 Oct 28]. New ClusteredNR database: faster searches and more informative BLAST results. Available from: https://ncbiinsights.ncbi.nlm.nih.gov/2022/05/02/clusterednr_1/
7. Reiter T. Clustering the NCBI nr database to reduce database size and enable faster BLAST searches. 2023; https://doi.org/10.57844/arcadia-w8xt-pc81
8. Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, et al. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv. 2023 Apr 18;2023.04.15.536996.
9. Pierce NT, Irber L, Reiter T, Brooks P, Brown CT. Large-scale sequence comparisons with sourmash. F1000Research. 2019 Jul 4;8:1006.
10. Gourlé H, Karlsson-Lindsjö O, Hayer J, Bongcam-Rudloff E. Simulating Illumina metagenomic data with InSilicoSeq. Hancock J, editor. Bioinformatics. 2019 Feb 1;35(3):521–2.
11. Li H. Minimap2: pairwise alignment for nucleotide sequences. Birol I, editor. Bioinformatics. 2018 Sep 15;34(18):3094–100.
12. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59–60.
13. Kalantar KL, Carvalho T, de Bourcy CFA, Dimitrov B, Dingle G, Egger R, et al. IDseq—An open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring. GigaScience. 2020 Oct 15;9(10):giaa111.
14. McIntyre ABR, Ounit R, Afshinnekoo E, Prill RJ, Hénaff E, Alexander N, et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol. 2017 Dec;18(1):182.
15. Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell. 2019 Aug;178(4):779–94.
16. Nicholls SM, Quick JC, Tang S, Loman NJ. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. GigaScience. 2019 May 1;8(5):giz043.
17. Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012 Feb 15;28(4):593–4.
18. Ono Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Peter R, editor. Bioinformatics. 2021 May 5;37(5):589–95.
19. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Comput Sci. 2017 Jan 2;3:e104.