What's New?
As of May 2024, NCBI databases used for metagenomic analysis in CZ ID will be compressed. This means that the implemented databases will not include the complete NCBI nucleotide (NT) and protein (NR) databases. Rather, the compressed version will collapse highly similar sequences into clusters. The compression approach minimizes redundancy and computational burden because alignments against the NCBI NT and NR databases will only include one representative sequence from each cluster. Given that only highly similar sequences are collapsed into clusters, this approach maintains the sequence diversity needed for accurate pathogen detection.
Samples uploaded to projects created after May 6, 2024 will use a compressed database version (NCBI Index Date 2024-02-06 and later). Read on to learn more about the new database compression.
Background and Impact
At CZ ID, we are constantly striving to enhance our software to meet the evolving needs of the scientific community, ensuring that our tools remain at the forefront of pathogen identification and metagenomic analysis. Our team recognized that the rapid growth of the NCBI GenBank database, which doubles every 18 months, presents challenges for the scalability of alignment and classification tools integrated in CZ ID. We addressed this scalability issue by improving our database compression techniques, making pathogen detection more efficient.
The new database compression approach uses sourmash, a tool for efficient genomic and metagenomic sequence analysis based on MinHash techniques (random sampling of k-mer content). The process involves binning sequences by taxID, sorting the sequences in each bin by length, and using a k-mer signature vector for each taxID bin to filter out highly similar sequences, which yields a streamlined and non-redundant database.
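The binning, sorting, and filtering steps described above can be illustrated with a minimal Python sketch. This is not the CZ ID implementation and does not call sourmash: it swaps MinHash sketches for exact k-mer sets so that it runs with the standard library alone. The function names, the k-mer size `K`, the containment cutoff `THRESHOLD`, the longest-first sort order, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict

K = 5            # hypothetical k-mer size, kept small for the toy data
THRESHOLD = 0.8  # hypothetical cutoff: drop a sequence if >= 80% of its k-mers were already seen


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compress(records, k=K, threshold=THRESHOLD):
    """Return accessions kept after per-taxID redundancy filtering.

    records: iterable of (accession, taxid, sequence) tuples.
    """
    # Step 1: bin sequences by taxID.
    bins = defaultdict(list)
    for accession, taxid, seq in records:
        bins[taxid].append((accession, seq))

    kept = []
    for taxid, members in bins.items():
        # Step 2: sort by length (assumed longest first, so long sequences become representatives).
        members.sort(key=lambda m: len(m[1]), reverse=True)

        # Step 3: keep a running k-mer signature for the bin (an exact set here,
        # standing in for a MinHash-style sketch) and filter near-duplicates.
        seen = set()
        for accession, seq in members:
            ks = kmers(seq, k)
            if not ks:
                kept.append(accession)  # too short to compare, so keep it
                continue
            containment = len(ks & seen) / len(ks)
            if containment < threshold:
                kept.append(accession)
                seen |= ks
    return kept


records = [
    ("A1", 1, "ACGTACGTTGCAAGCT"),
    ("A2", 1, "ACGTACGTTGCA"),     # contained in A1, same taxID: filtered out
    ("A3", 1, "TTTTTGGGGGCCCCC"),  # same taxID but distinct content: kept
    ("B1", 2, "ACGTACGTTGCA"),     # same sequence as A2, different taxID: kept
]
print(compress(records))  # ['A1', 'A3', 'B1']
```

In this toy run, `A2` is dropped because all of its k-mers already appear in the longer `A1` from the same taxID bin, while `B1` survives because filtering happens only within a bin. A sketch-based signature trades this exact comparison for a much smaller memory footprint, which matters at the scale of NT and NR.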
Our evaluations demonstrate that the compression approach achieves size reductions of 34% (NT) and 21% (NR) on the most recent databases and yields associated compute time savings. The compressed databases provide results consistent with those of their uncompressed counterparts across a range of datasets, including those simulating novel virus detection and real-world pathogen analysis. CZ ID users can continue to expect the highest level of accuracy and reliability in their metagenomic analyses.
Click here to learn more about the new compressed database implementation, including compression parameters and the associated evaluation.
Key Features and Benefits
- Efficient Compression: Our new method significantly reduces the size of NCBI databases, making the storage and processing of metagenomic data more efficient.
- Maintained Sequence Diversity: Despite the compression, the integrity and diversity of sequences are preserved, ensuring that the detection capabilities for both known and novel pathogens remain robust.
- Validation and Reliability: Rigorous evaluation and validation across various benchmark datasets confirm that our compression technique retains high accuracy in pathogen identification.
Usability
The database version used for data analysis is specified by the NCBI Index Date, which indicates the date the NCBI NT and NR databases were downloaded for use by CZ ID. This index date can be used to find associated GenBank release numbers (see GenBank release notes). The downloaded databases are then compressed and indexed by CZ ID. Newer versions of the index will have the most up-to-date taxon information from NCBI.
The NCBI Index Date used for sample analysis is determined by the project, because the index is automatically assigned upon project creation. Each project is pinned to one NCBI Index Date to ensure that all sample runs within a project are comparable. For example, if your project is pinned to NCBI Index Date 2021-01-22, all new samples uploaded to that project will run on Index 2021-01-22. You can find this information during upload and on the Sample Report page.
The NCBI Index Date is specified when choosing Analysis Type during sample upload.
The NCBI Index Date used for analyzing a given sample is specified within the Sample Report page.
During upload, you will see a warning icon if there is a new NCBI Index available. To use this new NCBI Index, you must create a new project.
Looking Ahead
The introduction of this compression technique is a key step in our ongoing efforts to enhance the scalability and efficiency of metagenomic tools. As NCBI databases continue to grow, this approach will remain critical for maintaining up-to-date indexes for pathogen detection.