Overview
CZ ID aims to provide easy-to-use tools for pathogen identification that are robust, scalable, and up to date with best practices in the field. One of the key components of the short-read mNGS pipeline is the host filtering stage, which consists of a number of steps that iteratively remove host reads and perform quality control (QC). Since the initial development of the short-read mNGS pipeline, a number of new tools have been developed that increase the speed and sensitivity of these steps. As of CZ ID pipeline version 8.0, the short-read mNGS pipeline (and the AMR pipeline) incorporates these modern pipeline components.
The upgraded pipeline will be available after April 19, 2023. Read more to learn about the motivation and technical implementation, biological validation, and considerations for this change. You can even browse datasets analyzed with the old and modern filtering steps (see the See It Yourself section).
Motivation and technical implementation – upgrading components based on new tools
As the field of genomics grows, new tools are continually released and evaluated by a broad community of researchers, and the community consensus shifts to recommend new best practices. This pipeline change adopts those best practices within the short-read mNGS host and QC filtering stage by switching to modern bioinformatics tools that are more efficient and better maintained. In particular, the upgraded pipeline replaces the memory-intensive STAR and GSNAP aligners (previously used for host and human filtering) with Bowtie2, HISAT2, and Kallisto, and consolidates a number of disparate QC steps into fastp, an all-in-one QC tool (Table 1) [1-6]. The new host filtering stage runs faster on smaller cloud instances, reducing costs and reflecting best practices from the peer-reviewed literature [7].
Table 1. Summary of tool changes associated with host filtering upgrades.
| Function | Previous tool | Upgraded tool |
| --- | --- | --- |
| Adapter trimming | Trimmomatic | fastp |
| Low quality read removal | PriceSeq | fastp |
| Low complexity read removal | LZW | fastp |
| ERCC read removal | STAR | Bowtie2 |
| Gather host gene counts | STAR | Kallisto |
| Host read removal | STAR, Bowtie2 | Bowtie2, HISAT2 |
| Human read removal | STAR, Bowtie2, GSNAP | Bowtie2, HISAT2 |
The new host filtering and quality control steps include:
1. Input validation: Verification of file type (FASTQ) and number of reads. The number of reads is capped at 75 million for single-end data or 150 million for paired-end data.
2. ERCC removal: Removal of External RNA Controls Consortium (ERCC) sequences representing spike-in controls identified through Bowtie2.
3. QC filtering: Removal of sequencing adapters, short reads, and sequence reads with low quality (e.g., high percentage of “Ns”) and low complexity regions (e.g., sequence repeats) using a customized version of fastp. More specifically, this step removes bases with quality scores below 17, reads shorter than 35 bp, reads containing low complexity regions accounting for more than 40% of the sequence, and reads containing more than 15 undetermined bases (Ns).
4. Host read removal: Removal of host-associated read sequences identified through Bowtie2 followed by HISAT2 alignments against reference host genomes (see the command sketch after this list).
5. Human read removal: Regardless of host, removal of human sequences identified through Bowtie2 followed by HISAT2 alignments against reference human genomes.
6. Duplicate read removal: Retention of a single representative read for sets of reads whose first 70 bp are 100% identical, using CZID-dedup (see the sketch after this list). Note that QC metrics (e.g., Passed Filters) and all relative abundance values reported for identified taxa in the Sample Report are computed without duplicate removal. See a discussion of how this affects results here.
7. Subsampling: Subsampling to 1 million single-end or 2 million paired-end reads to limit computational cost (see the sketch after this list). Note that subsampling is performed using unique (deduplicated) reads. However, reported subsampled values in the Web App reflect non-deduplicated reads and may therefore be greater than 1 million (or 2 million for paired-end data).
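Below is a minimal sketch of the two-pass idea behind the host and human read removal steps (steps 4 and 5): read pairs are aligned to a reference index and only the pairs that fail to align are carried forward. Index names, FASTQ paths, and output filenames are placeholders, and CZ ID's actual command lines may use additional flags.

```python
import subprocess

# Minimal sketch of steps 4 and 5 (not CZ ID's exact command lines): align read pairs
# to a reference index and carry forward only the pairs that do NOT align concordantly.
def keep_unaligned_pairs(aligner, index, r1, r2, unaligned_out):
    """Run `aligner` ("bowtie2" or "hisat2"); pairs failing to align go to `unaligned_out`."""
    subprocess.run(
        [aligner,
         "-x", index,                  # prebuilt index of the host (or human) genome
         "-1", r1, "-2", r2,
         "--un-conc", unaligned_out,   # write pairs that did not align concordantly
         "-S", "/dev/null"],           # the alignments themselves are discarded
        check=True)

# Pass 1: Bowtie2 against the host genome.
keep_unaligned_pairs("bowtie2", "host_bt2_index",
                     "sample_R1.fastq", "sample_R2.fastq", "bt2_unaligned.fastq")
# Pass 2: HISAT2 on the Bowtie2 survivors (mate files written by the previous step,
# e.g., bt2_unaligned.1.fastq / bt2_unaligned.2.fastq on most systems).
keep_unaligned_pairs("hisat2", "host_hisat2_index",
                     "bt2_unaligned.1.fastq", "bt2_unaligned.2.fastq", "hisat2_unaligned.fastq")
```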
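The deduplication rule in step 6 can also be illustrated with a short sketch. This is not the CZID-dedup implementation, only a minimal illustration of the rule described above: the first read seen for each unique 70-bp prefix is kept as the representative.

```python
# Minimal illustration of the rule in step 6 (not the CZID-dedup code itself):
# keep one representative read per unique 70-bp prefix.
def deduplicate(reads, prefix_length=70):
    """`reads` is an iterable of (read_id, sequence); yields one read per unique prefix."""
    seen = set()
    for read_id, sequence in reads:
        prefix = sequence[:prefix_length]
        if prefix not in seen:          # the first occurrence becomes the representative
            seen.add(prefix)
            yield read_id, sequence

reads = [("r1", "ACGT" * 30), ("r2", "ACGT" * 30), ("r3", "TTTT" + "ACGT" * 29)]
print([rid for rid, _ in deduplicate(reads)])   # -> ['r1', 'r3']
```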
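Similarly, here is a minimal sketch of the subsampling in step 7, assuming a placeholder list of deduplicated read pairs; details such as seed handling in the CZ ID implementation may differ.

```python
import random

# Minimal illustration of step 7: subsample deduplicated reads down to a fixed cap
# (1 million single-end reads or 2 million paired-end reads). `read_pairs` is a
# placeholder list of deduplicated read pairs.
def subsample(read_pairs, cap=2_000_000, seed=42):
    if len(read_pairs) <= cap:
        return read_pairs              # nothing to do if already under the cap
    rng = random.Random(seed)          # a fixed seed keeps the subsample reproducible
    return rng.sample(read_pairs, cap)
```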
Note that the new pipeline implements a customized version of fastp, modified to improve the low complexity filter. The complexity filter built into fastp only accounts for direct repetition of the same base (see fastp's low complexity filter), so it is not designed to detect other types of low complexity sequence, such as simple repeats (e.g., di- or tri-nucleotide repeats) and microsatellites. During QC filtering in the updated CZ ID mNGS Illumina pipeline, fastp's low complexity filter is therefore replaced with a low complexity masking algorithm known as symmetric DUST (SDUST). If less than 60% of a read is masked using SDUST default settings, the read is retained; otherwise, the read is discarded. Click here to view CZ ID's customized script for low complexity filtering.
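To make the 60% retention rule concrete, here is a minimal sketch. The `masked_intervals` argument stands in for the output of an SDUST implementation (a hypothetical placeholder, not CZ ID's customized script); the sketch only shows how the threshold is applied.

```python
def masked_fraction(sequence, masked_intervals):
    """Fraction of bases covered by non-overlapping masked intervals, given as (start, end) pairs (end exclusive)."""
    masked_bases = sum(end - start for start, end in masked_intervals)
    return masked_bases / len(sequence)

def keep_read(sequence, masked_intervals, threshold=0.60):
    """Apply the rule from the text: retain the read only if less than 60% of it is masked."""
    return masked_fraction(sequence, masked_intervals) < threshold

# Example with made-up intervals: 50 of 100 bases masked -> kept; 70 of 100 masked -> discarded.
print(keep_read("A" * 100, [(0, 50)]))   # True
print(keep_read("A" * 100, [(0, 70)]))   # False
```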
Biological validation – new pipeline steps produce comparable results
Validation process
Validation of the pipeline changes was performed through integration testing with a series of real-world samples selected at random. These samples spanned different hosts, nucleic acid types, and pipelines of interest (mNGS and AMR) to capture the effectiveness of the new host filtering and QC steps on the variety of dataset types commonly run through CZ ID. For each sample, we evaluated the proportion of reads passing all filters, the percentage of reads lost due to QC, and the impact on biological findings (microbial counts for the mNGS pipeline, AMR gene counts for the AMR pipeline, and host gene counts for the host gene count output). All validations were performed with pipeline version 8.0.
Impacts on data interpretation
Overall, the pipeline upgrades resulted in the following:
- Microbial relative abundances were reasonably concordant between pipeline versions, with minor deviations among low-abundance taxa (likely informatic noise). The more notable differences could be explained by improved host removal in the new pipeline, which reduced previously false-positive results.
- The AMR results were highly correlated between pipeline versions, with minor differences arising in cases where reads were attributed to different homologous genes within the same gene class.
- The host gene counts were correlated between pipeline versions, with the greatest variability occurring in mitochondrial genes (due to differences in the underlying reference genomes) and low-abundance genes.
The impact of these changes on real-world samples will vary, but overall these results give us confidence that the upgrades maintain high-quality results for metagenomic samples while improving the scalability of the pipeline as reference databases grow.
To see some real-world examples in CZ ID, check out the See it yourself! section of this blog post.
Detailed validation of mNGS
For detailed descriptions of the datasets for which metrics are shown below, check out the See it yourself! section. Each dataset was processed using both the old and new pipelines, and the resulting Samples Overviews (containing per-sample QC metrics) and Sample Taxon Reports (containing per-taxon microbial abundances) were downloaded.
The Percent Pass QC and the total reads passing all filters were compared between pipeline versions (Figure 1). We observed comparable Percent Pass QC values across the two pipeline versions, but there were some deviations in the "Total Passed Filters" values, particularly for the three mosquito samples. Upon investigating the outliers, we determined that the difference is attributable to deviations in the complexity filtering performed by LZW in the old version and fastp in the new version. In the new pipeline, the customized fastp implements a version of the SDUST algorithm, a more modern and widely accepted method for complexity filtering; we therefore accept the deviations in these particular samples. Moreover, these deviations did not have a significant impact on species identification (Figure 2A), given that most sequences passing the new filter, but not the old one, did not map to any taxa.
Figure 1. Key metrics from host and quality filtering stage evaluated for each dataset comparing old (ver. 7.1) v. new (ver. 8.0) pipeline versions in dark blue and light blue, respectively. A) Quality Control, B) Total Passed Filters.
At a per-sample level, we further interrogated the impact on the relative abundance of taxa identified by the short-read mNGS pipeline. We calculated the Pearson correlation coefficient (r) between NT rPM abundances from the new and old pipelines, excluding two samples that had fewer than three total taxa identified. Among the remaining samples, we saw high concordance (average r = 0.96, range 0.84 to 1.0, standard deviation 0.05). Altogether, these results suggest that the results from the new and old pipelines are highly concordant and that the host filtering updates do not have a major impact on identified taxa.
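For readers who want to reproduce this comparison on their own projects, the sketch below computes a per-sample Pearson correlation from two Sample Taxon Report downloads. The file names and column names (`tax_name`, `nt_rpm`) are assumptions for illustration and may differ from the actual download headers.

```python
import pandas as pd
from scipy.stats import pearsonr

# Sketch of the per-sample comparison: join species-level NT rPM values from the two
# pipeline runs by taxon, fill taxa missing from one run with 0, and compute Pearson's r.
# Column names ("tax_name", "nt_rpm") are assumptions; check your actual download headers.
old = pd.read_csv("sample_taxon_report_v7.1.csv")
new = pd.read_csv("sample_taxon_report_v8.0.csv")

merged = old[["tax_name", "nt_rpm"]].merge(
    new[["tax_name", "nt_rpm"]],
    on="tax_name", how="outer", suffixes=("_old", "_new")
).fillna(0)

r, p_value = pearsonr(merged["nt_rpm_old"], merged["nt_rpm_new"])
print(f"Pearson r = {r:.2f} across {len(merged)} taxa")
```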
Figure 2. Pearson correlation coefficients for species-level taxonomic relative abundance (NT rPM) between new (ver. 8.0) and old (ver. 7.1) pipeline versions (A). The gray line indicates a correlation of 0.9. Panels B and C show the old (x-axis) v. new (y-axis) taxon relative abundances for two samples: amr-2-dna, with high correlation (B), and pig-2-rna, with lower correlation (C).
Detailed validation of AMR
The AMR pipeline uses the same host filtering and QC steps as the short-read mNGS pipeline. To understand the impact of the updated host filtering steps on AMR results, we evaluated the AMR profiles detected by both the old and new pipelines in four of the datasets known to contain in silico predicted AMR genes. We observed high concordance when comparing results between the old and new pipeline versions. Most of the variability was attributed to cases where reads were preferentially attributed to one of several AMR genes from the same gene family that share sequence homology, which is expected given the subtle differences that may occur in the initial filtering.
Figure 3. Heatmap showing the read-level AMR results for four sample pairs analyzed using old (ver. 7.1) and new (ver. 8.0) pipelines (pipeline versions are specified within sample names). Heatmap colors correspond to the total number of reads mapping to each gene. Colors across the top visually represent groupings of AMR genes from the same gene family.
Detailed validation of host gene counts
Previously, CZ ID provided host gene counts for samples specifying "human" as the host organism. These counts were generated using STAR; the new pipeline instead uses Kallisto to provide transcript counts. While STAR performs full alignment of sequences to the reference genome before counting alignments within gene regions, Kallisto provides a fast alternative for transcriptome quantification using pseudo-alignment. To understand the differences between the previous gene counts and the new results, we evaluated the host gene counts from the RNA-seq samples with a human host. To compare against the pre-existing STAR gene-level counts, we rolled the Kallisto transcript counts up to the gene level by summing the counts of all transcripts belonging to a given gene. We then evaluated the correlation between gene counts across the two pipeline runs.
Rigorous evaluations of the differences between STAR and Kallisto are well documented in a number of studies [8,9]. In the subset of samples used for this evaluation, several had relatively low overall gene counts, and this sparsity led to increased noise. We therefore focused our evaluation on four samples that had > 25,000 gene counts. In these samples, there were some notable differences in genes that were identified by Kallisto but not by STAR. In particular, several mitochondrial genes were not identified by STAR (Figure 4) because the underlying reference used for STAR alignment did not contain these annotations. We also noted greater noise at lower levels of transcript abundance, with many genes detected at low levels by only one of the two methods. However, after excluding the mitochondrial genes (detected only by Kallisto), the most abundant genes correlated well between the two methods. The overall Pearson correlation between gene counts was greater than 0.7 for the four samples with > 25,000 gene counts.
We note that the output formats from STAR and Kallisto differ slightly. STAR outputs a file containing gene IDs with counts independent of strand (unstranded) followed by per-strand counts, whereas Kallisto outputs pseudo-alignment results with columns corresponding to transcript identifiers, estimated counts, transcripts per million (tpm), and effective length. More documentation on this format can be found here. The new Kallisto output available for download through CZ ID contains transcript counts, which can be rolled up to gene-level counts using existing tools such as tximport (see the sketch below). Users may need to account for these output differences in any downstream pipelines.
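As an illustration of the roll-up described above, the sketch below sums Kallisto transcript-level estimated counts to the gene level using a transcript-to-gene mapping and then correlates the result against STAR gene counts. The file names and the mapping file are assumptions for illustration; tximport provides an equivalent, more fully featured route in R.

```python
import pandas as pd

# Sketch of rolling Kallisto transcript counts up to gene level (an alternative to tximport)
# and correlating them with STAR gene counts. File names and the transcript-to-gene mapping
# are placeholders; the Kallisto abundance table is assumed to have its standard columns
# (target_id, length, eff_length, est_counts, tpm).
abundance = pd.read_csv("abundance.tsv", sep="\t")
tx2gene = pd.read_csv("tx2gene.tsv", sep="\t", names=["target_id", "gene_id"])

kallisto_genes = (
    abundance.merge(tx2gene, on="target_id")     # attach a gene ID to each transcript
             .groupby("gene_id")["est_counts"]   # group all transcripts of the same gene
             .sum()
             .rename("kallisto_counts")
)

# STAR gene counts, assumed here as a simple two-column table (gene_id, star_counts).
star_genes = (
    pd.read_csv("star_gene_counts.tsv", sep="\t", names=["gene_id", "star_counts"])
      .set_index("gene_id")
)

combined = pd.concat([kallisto_genes, star_genes], axis=1).fillna(0)
print(combined.corr(method="pearson"))           # Pearson correlation between the two methods
```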
Figure 4. Correlation in host gene counts identified by the old STAR method (x-axis) and the new Kallisto method (y-axis) for two of the four samples evaluated (A and B). The top genes with notable differences between the two methods are annotated in text, highlighting the abundance of mitochondrial genes identified only by Kallisto.
Considerations
The new host and QC filtering changes do have some subtle impacts on the interpretation of CZ ID QC metrics. In particular, the QC steps are now applied ahead of host filtering (instead of after the initial STAR host filtering step, as was previously the case). This can be seen in metrics such as those displayed on the project-level QC page within the CZ ID web application (Figure 5).
Figure 5. Project-level QC charts showing the Reads Lost charts associated with the old (ver. 7.1) (A) and new (ver. 8.0) (B) pipelines highlighting the difference in ordering of metrics.
Pipeline comparison - see it yourself!
Check out the Sample Report pages for datasets analyzed with pipelines implementing old (ver. 7.1) and new (ver. 8.0) host and QC filtering steps.
References
[1] Dobin A, Davis CA, Schlesinger F, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013; 29:15–21. https://doi.org/10.1093/bioinformatics/bts635
[2] Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010; 26:873–81. https://doi.org/10.1093/bioinformatics/btq057
[3] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9:357–9. https://doi.org/10.1038/nmeth.1923
[4] Kim D, Paggi JM, Park C, et al. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 2019; 37:907–915. https://doi.org/10.1038/s41587-019-0201-4
[5] Bray N, Pimentel H, Melsted P, et al. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 2016; 34:525–527. https://doi.org/10.1038/nbt.3519
[6] Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018; 34(17):i884–i890. https://doi.org/10.1093/bioinformatics/bty560
[7] Bush SJ, Connor TR, Peto TEA, Crook DW, Walker AS. Evaluation of methods for detecting human reads in microbial sequencing datasets. Microb Genom 2020; 6(7):mgen000393.
[8] Liu X, Zhao J, Xue L, et al. A comparison of transcriptome analysis methods with reference genome. BMC Genomics 2022; 23:232. https://doi.org/10.1186/s12864-022-08465-0
[9] Du Y, Huang Q, Arisdakessian C, Garmire LX. Evaluation of STAR and Kallisto on single cell RNA-seq data alignment. G3 Genes|Genomes|Genetics 2020; 10(5):1775–1783. https://doi.org/10.1534/g3.120.401160