Overview
The CZ ID metagenomics (mNGS) pipeline encompasses a set of computational steps to process raw sequencing data and generate reports summarizing identified taxa in your sample. In this guide we dive deeper into the results of one sample to better understand steps implemented throughout the mNGS Illumina pipeline and how we report metrics associated with sample quality.
The guide briefly explains some of the pipeline steps but you can find more information by viewing pipeline details and navigating to the pipeline visualization for a given sample run. Note that samples uploaded to projects created prior to April 19, 2023 will run on pipeline version 7 (v7), whereas projects created on and after this date will run on the most recent major pipeline version at the time of the project creation. Host filtering and quality control steps differ between pipeline v7 and later versions.
After reading this guide, you will:
- Learn about host filtering and quality control steps implemented through the mNGS pipeline
- Become familiar with host filtering and quality control metrics and how to interpret them
- Learn how to find details regarding a pipeline run through the pipeline visualization
- Become familiar with contig assembly
Background Information for Example
To follow along with the examples highlighted in this guide, please see the Sample Report for Patient 008 (CSF) from the Medical Detectives public project. This sample came from a patient who went to the hospital with an infection that could not be diagnosed through traditional methods. The scientists used the mNGS pipeline to see if they could detect an infectious agent that was not captured by their diagnostic tests. Below is an image of the Sample Report.
Viewing Pipeline Details
To view pipeline details, select Sample Details on the right-hand side of the page.
The Sample Details panel will open on the right-hand side of the page. Once the side panel opens, you will see Metadata and Pipeline tabs displaying information associated with this specific sample. The side panel defaults to the Metadata tab; select the Pipeline tab to view pipeline details. Pipeline details include host filtering and quality control information and a link to the pipeline visualization.
Host Filtering and Quality Control
The first stage of the mNGS pipeline deals with host filtering and quality control (QC). During this stage, the pipeline implements a series of filtering steps to remove low-quality reads (i.e., reads that contain low-quality bases, have low complexity, or are too short) and reads representing human and host sequences. Regardless of the selected sample host, human reads are filtered out to the best of our ability.
The percentage of reads that are removed during Host Filtering and QC tends to depend on the sample type (tissue of origin). For example, cerebrospinal fluid (CSF) samples often have > 99% of reads removed during the host filtering steps, as one expects the microbial fraction to be minimal compared to the human fraction. Stool samples, on the other hand, may have a much lower percentage of reads removed during host filtering steps, since the bulk of the genetic material is expected to be microbial rather than host-derived.
Host Filtering and Quality Control Steps
The pipeline workflow referred to as "Host Filtering and Quality Control" encompasses a series of data preprocessing steps to remove low-quality reads and host/human contamination. The preprocessing steps have been updated, so the steps used to process your samples depend on when the project was created. Samples uploaded to projects created prior to April 19, 2023 will run on pipeline version 7 (v7), whereas projects created on or after April 19, 2023 will run on version 8 (v8).
Projects created prior to April 19, 2023
If you upload samples to projects created prior to April 19, 2023, the "Host Filtering and Quality Control" steps include the following:
1. Initial host filtration and removal of ERCC sequences using STAR
2. Trim sequencing adapters using Trimmomatic
3. Quality filter using PriceSeq
4. Identify duplicate reads using czid-dedup (note: prior to pipeline version 6.0, CZ ID used CD-HIT-DUP for duplicate identification)
5. Filter out low complexity sequences using LZW
6. Filter out remaining host sequences using Bowtie2
7. Subsampling to 1 million reads (or 2 million for paired-end data) if > 1M reads remain after step (6). Note that subsampling is performed using unique (or deduplicated) reads. However, reported subsampled values in CZ ID reflect non-deduplicated reads resulting in values greater than 1 million (or 2 million for paired-end data) in many cases.
8. Filter out human sequences, regardless of host (using STAR, Bowtie2, and GSNAP)
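The note in step 7 can be confusing, so here is a minimal Python sketch of why a reported subsampled count can exceed the cap. It assumes each unique read stands in for a cluster of identical reads; the function name `subsample_unique`, the `cluster_sizes` mapping, and the toy numbers are hypothetical and are not taken from the CZ ID code base.

```python
import random


def subsample_unique(cluster_sizes, max_unique, seed=0):
    """Subsample unique reads, then count the reads they represent.

    cluster_sizes maps a unique read ID to the number of reads
    (the representative plus its duplicates) in that cluster.
    Returns the kept IDs and the non-deduplicated read count.
    """
    ids = sorted(cluster_sizes)
    if len(ids) > max_unique:
        # The cap is applied to unique reads only.
        ids = random.Random(seed).sample(ids, max_unique)
    # The reported value adds the duplicates back in.
    reported = sum(cluster_sizes[read_id] for read_id in ids)
    return ids, reported


# Toy data: four unique reads that represent seven reads in total.
clusters = {"r1": 3, "r2": 1, "r3": 2, "r4": 1}
kept, reported = subsample_unique(clusters, max_unique=2)
print(len(kept), reported)
```

Because duplicates are counted again after subsampling, the reported value is always at least as large as the number of unique reads kept, and it is larger whenever any kept read has duplicates.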
Projects created on or after April 19, 2023
If uploading samples to projects created on or after April 19, 2023, the "Host Filtering and Quality Control" steps will implement tools designed to run faster on cloud instances and incorporate best practices from the peer-reviewed literature. Click here for details.
Passed Quality Control (QC) Value
The "Passed Quality Control" value represents the percentage of reads remaining after QC filtering to remove low-quality bases, short reads, and low-complexity sequences (i.e., the number of reads remaining after step 3 compared to the reads remaining after step 1). For the sample from Patient 008, you can see that the percentage of reads that Passed QC was 60.94%, which is an acceptable value.
Passed Filters Value
The "Passed Filters" value represents the percentage of reads remaining after QC filtering and removal of host and human reads. If reads were not subsampled after QC and host filtering (see Host Filtering and Quality Control Steps), the Passed Filters value indicates the percentage of original sequencing reads that were used for downstream analysis and microbe identification.
If reads were subsampled, the Passed Filters value reflects the number of reads after subsampling for projects created on or after April 19, 2023 (mNGS pipeline v8.0 and later) and indicates the number of reads used for downstream analysis. However, the Passed Filters value does not include subsampling for samples uploaded to projects created prior to April 19, 2023 and, thus, does not indicate the final number of reads used for downstream analysis.
For the sample from Patient 008, you can see that 60.94% of reads Passed QC, but only 0.92% Passed Filters. This tells us that more than half of the reads had high sequencing quality, but only 0.92% of the total reads were not flagged as host or human reads. In other words, this sample contained a large proportion of human reads, and less than 1% of reads were used for microbe identification.
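As a rough illustration of how the two percentages relate to read counts, the following Python sketch computes both from a small dictionary. The key names and the read counts are made up for illustration and are not values from the Patient 008 report; the sketch also assumes no subsampling took place.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def quality_metrics(reads):
    """Compute Passed QC and Passed Filters from read counts.

    Expected keys (hypothetical names):
      original          - reads in the uploaded files
      before_qc         - reads entering the QC filtering step
      after_qc          - reads leaving the QC filtering step
      after_all_filters - reads left after QC, host and human removal
    """
    return {
        "passed_qc": percent(reads["after_qc"], reads["before_qc"]),
        "passed_filters": percent(reads["after_all_filters"], reads["original"]),
    }


# Made-up counts for a host-rich sample.
example = {
    "original": 1_000_000,
    "before_qc": 900_000,
    "after_qc": 540_000,
    "after_all_filters": 9_000,
}
print(quality_metrics(example))
```

In this made-up case most reads survive QC, yet under 1% survive all filters, which is the same pattern described for the CSF sample above.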
Mean Insert Size Value
The mean insert size value is computed using the Picard CollectInsertSizeMetrics Tool on the host reads, and indicates the mean length of the nucleotide sequence that is inserted between sequencing adapters. This value can be used to provide evidence regarding the quality of the nucleic acid in the final library. For example, short fragment sizes may indicate sample degradation or over-fragmentation during library preparation. The mean insert size value is only computed for paired-end sequencing libraries generated from human hosts and will appear empty ("--") for all other sample types.
Compression Ratio Value (DCR)
The compression ratio is calculated as the total number of reads after QC filtering and host/human read removal divided by the number of unique reads identified by czid-dedup. Duplicate sequences are flagged, but these reads remain present for calculation of relative abundance results for identified taxa, including reads per million (rPM), total reads (r), and contig reads (contig r) associated with a given taxon.
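The ratio described above can be sketched in a few lines of Python. This sketch treats two reads as duplicates only when their sequences match exactly, which is a simplifying assumption; the rule czid-dedup uses to identify duplicates may differ. The function name is hypothetical.

```python
def duplicate_compression_ratio(reads):
    """Total reads divided by the number of unique read sequences."""
    if not reads:
        raise ValueError("at least one read is required")
    unique_reads = len(set(reads))
    return len(reads) / unique_reads


# Four reads, two unique sequences: the ratio is 4 / 2.
reads = ["ACGT", "ACGT", "TTGA", "ACGT"]
print(duplicate_compression_ratio(reads))
```

A ratio of 1 means no duplicates were found; the ratio grows as a larger share of the reads are copies of one another.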
Interpreting Sample Quality
Low Passed QC Values (i.e., < 10%)
The "Passed QC" percentage refers to the QC Filtering step. It indicates the fraction of reads that came out of the fastp step, compared to the number of reads used as input. Thus, samples with low Passed QC values are likely samples that included a high proportion of reads with one or more of the following:
- Presence of adapter sequences (e.g., adapter dimers)
- Low quality scores during basecalling
- Low complexity sequences (e.g., homopolymers or sequence repeats)
- Undetermined bases (Ns)
- Short length (< 35 bp)
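To make the list above more concrete, here is a small Python sketch that reports why a single read might fail QC. Only the 35 bp minimum length comes from this guide; the thresholds for undetermined bases and mean quality are hypothetical placeholders, and the sketch does not reproduce how fastp itself is configured in the pipeline. Adapter and low-complexity checks are left out for brevity.

```python
MIN_LENGTH = 35        # minimum read length, from the list above
MAX_N_BASES = 0        # hypothetical: allow no undetermined bases
MIN_MEAN_QUALITY = 20  # hypothetical mean Phred score cutoff


def qc_failure_reasons(sequence, qualities):
    """Return the reasons a read would fail these simplified checks."""
    reasons = []
    if len(sequence) < MIN_LENGTH:
        reasons.append("too short")
    if sequence.upper().count("N") > MAX_N_BASES:
        reasons.append("undetermined bases")
    if qualities and sum(qualities) / len(qualities) < MIN_MEAN_QUALITY:
        reasons.append("low quality")
    return reasons


# A 20 bp read with Ns fails two checks; a clean 40 bp read fails none.
print(qc_failure_reasons("ACGTN" * 4, [30] * 20))
print(qc_failure_reasons("ACGT" * 10, [35] * 40))
```

A sample with a low Passed QC value would have many reads returning a non-empty list under checks of this kind.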
High DCR Values
High DCR values are due to a high proportion of duplicate reads. If many reads are flagged as duplicates, you will see that the number of "unique reads" indicated in the Reads Remaining table for czid-dedup is much lower than the total number of reads remaining. This will correspond to a large DCR and may indicate that there was a large amount of PCR amplification during the library preparation stages, resulting in many duplicate sequences. Note that CZ ID is not intended for use with amplicon sequencing libraries (e.g., 16S).
To find the Reads Remaining table, scroll down the Pipeline tab within the Sample Details panel and click the Reads Remaining dropdown menu. The Reads Remaining table lists a breakdown of how many reads were lost in each preprocessing step, including deduplication through czid-dedup.
Pipeline Visualization
You can go to the pipeline visualization to view detailed information about steps implemented throughout the pipeline. To do this, click View Pipeline Visualization from the Pipeline tab found within the Sample Details panel.
The View Pipeline Visualization link will direct you to an interactive visualization of the pipeline workflow. To view details about a given step, click on the step and a panel will open on the right-hand side with details about the step and links to intermediate files associated with the step, including inputs and outputs. Simply click on the intermediate file of interest to download.
Assembly
Sequence assembly refers to aligning and merging short sequencing reads obtained from a longer DNA sequence in order to reconstruct the original sequence. Through the process of assembly, longer, contiguous sequences (known as contigs) are generated. The mNGS Illumina pipeline uses SPAdes to assemble contigs. Each contig is composed of several sequencing reads. CZ ID then uses Bowtie2 to map quality-filtered reads back to the assembled contigs to determine which reads are associated with each contig.
Keep in mind that genome size can influence the number of contigs identified for a given microbe. For example, suppose that Sample A contained a virus known to have a genome of 5 kbp (kilo-base-pairs) and Sample B contained a bacterium known to have a genome size of 300 kbp. When analyzing your sample results, you would expect fewer contigs to be assembled in Sample A than in Sample B, assuming a random distribution of the short reads across the original genome.