On March 24, the CZ ID pipeline upgraded to version 4.0, in which all relative abundance values in CZ ID will be computed without duplicate removal - this includes the metrics r, rPM, and contig r. Prior to version 4.0, these values were calculated using de-duplicated reads.
Here we provide a summary outlining the Implications for Analysis, Technical Details of the implementation, and provide some Examples comparing the current pipeline version to the new version.
Historically, mNGS analyses have removed duplicates to reduce the potential for bias introduced during PCR amplification as well as the total computational burden. However, given advances in protocols, recent literature has concluded that deduplication may be unnecessary [1,2]. Additionally, deduplication precludes analysis of samples processed using technologies that rely on amplification and quantification of all reads, such as FLASH . The pipeline changes outlined here enable analysis of such data.
Implications for Analysis
- We recommend that reads per million (rPM) values from samples run on v4.0 and later are not compared to samples run on v3.x. The relative abundance values from previous pipeline runs (r, rPM, and contig r, before v4.0) are computed based on a different formula than the new relative abundance values (see Technical Details). We provide examples of the impact below (Examples).
- We recommend generating new background models with samples processed on pipeline v4. Z-scores for samples processed on pipeline versions v4.0 and above, when compared to background models generated on pipeline v3.x data, will be inflated.
- This may affect the threshold filters that you have determined to be most appropriate for your sample types. It is always recommended that you evaluate the threshold filters against control samples. However, for illustrative purposes, if you have typically used a filter of NT r >= 10, but most samples in the project have a duplicate compression ratio (DCR) of 2.5, the equivalent filter for samples run on pipeline v4.0 would be NT r >= 25.
- To make pre-existing pipeline runs roughly comparable with new runs (post v4.0), it is possible to multiply the r, rPM, and contig r values by the DCR (see Examples, Figure 1C). In cases where the DCR is equal to 1.0, this change will have no effect.
Previously, rPM values were provided based on the following formula, in which the numerator used “reads after deduplication” and the denominator used “total reads (before deduplication)”:
- r = r_after_deduplication
- rPM = r_after_deduplication / ((total_reads_before_deduplication - total_ercc_reads) * subsample_fraction) * 1 million
- contig r = r_after_deduplication aligning to assembled contigs
On March 24, pipeline version 4.0 will begin providing rPM values calculated including all reads, based on the following formula:
- r = r_before_deduplication
- rPM = r_before_deduplication / ((total_reads_before_deduplication - total_ercc_reads) * subsample_fraction) * 1 million
- contig r = r_before_deduplication aligning to assembled contigs
In the new pipeline version, CD-HIT-DUP will still be run to identify and compress duplicate reads (for its computational cost-saving benefit) and the CD-HIT-DUP cluster files will be used to back-compute the original number of reads. This has a few implications:
- To achieve this, the similarity threshold parameter for CD-HIT-DUP will be increased from 95% to 100%*, which may result in a slight drop in DCR values (see Examples below). *Note that the similarity percentage (consistent with all previous pipeline versions) remains computed on the first 70bp of the read.
- This means that the per-sample DCR values are still available as a QC metric for assisting with interpreting the impact of PCR amplification during library prep.
- The per-taxon duplicate compression ratios (DCR) used to back-compute the original numbers of reads will be available for download in the pipeline visualization. The per-taxon DCR values do range slightly, but are distributed closely around the total sample DCR. Outlying per-taxa DCR values are associated with low-abundance taxa.
Please note that the pipeline version on which samples were run can be determined in the upper left corner of the Single Sample Report:
Here we demonstrate how the pipeline change impacts the rPM values for a randomly selected sample with previous DCR of 2.42 (Figure 1). With the new pipeline version, the DCR drops slightly to 2.37. The small change to total DCR does not change our interpretation of the QC metric. The relative abundances of species-level taxa are tightly correlated between pipeline versions (corr = 0.97 for log-scaled rPM values), with highly abundant taxons driving the correlation while we observe greater noise in the lower abundance taxa (i.e. taxa with < 10 rPM) (Figure 1A). The per-taxon DCR values are centered around the overall sample DCR (Figure 1B). Figure 1C shows a heatmap representation of the NT rPM values for the top 10 taxa, comparing the previous pipeline (v3.19, column 1) to the new pipeline version (v4.0, column 2). The rPM values from the new pipeline version are comparable to results obtained by multiplying the previous pipeline version’s rPM values by the DCR. Intuitively, this makes sense because the previous pipeline computed rPM after deduplication.
The drop on total DCR is a consistent trend across all samples tested as part of this pipeline change - the results are quantified in Figure 2A. Of note, samples with the greatest DCR show the greatest change. Similarly, the correlation between rPM values across pipeline versions is high across DCR values, but drops with increasing DCR (Figure 2B).
The full analysis of pipeline versions across these selected samples (Table 1) can be seen in Figure 3. Samples are sorted in order of increasing DCR. Of note, the greater the sample DCR, the greater the impact of this change. This is to be expected, and will open the door for analysis of samples processed using technologies that rely on quantification of all reads (including duplicates), such as amplification-based techniques like FLASH (Finding Low Abundance Sequences by Hybridization).
- Stephen Nayfach and Katherine S. Pollard. Toward Accurate and Quantitative Comparative Metagenomics, Cell, 25 August 2016, https://doi.org/10.1016/j.cell.2016.08.007
- Ebbert, M. T. W. et al. Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics17, 239 (2016).
- Jenai Quan, Charles Langelier, Alison Kuchta et al. FLASH: a next-generation CRISPR diagnostic for multiplexed detection of antimicrobial resistance sequences, Nucleic Acids Research, Volume 47, Issue 14, 22 August 2019, Page e83, https://doi.org/10.1093/nar/gkz418