Jump to Section:
Overview
Wondering how to approach metagenomic next-generation sequence (mNGS) data analysis through CZ ID? You will obtain results after uploading data to the platform, but what’s next? Here we outline steps to begin analyzing your data. Data analysis may take some time due to noise in the data that may prevent you from drawing conclusions quickly. This guide aims to walk you through analyses that will help you distinguish real microbial signals from background noise or false positives. If you have questions, please reach out to our team by sending an email to help@czid.org.
The diagram below summarizes the general mNGS data analysis workflow that can be performed with short-read (Illumina) data using CZ ID.
mNGS data analysis overview
The sections below expand on data analysis steps summarized in the diagram and provide links for step-by-step guides on how to perform specific tasks on CZ ID. Although CZ ID now supports both short- and long-read data from Illumina and Nanopore sequencers, this guide focuses on analysis performed through the mNGS Illumina pipeline. Note that the mNGS Nanopore pipeline does not support all the analysis included in the mNGS Illumina pipeline. For example, the mNGS Nanopore pipeline does not include data visualizations (i.e., heatmaps and phylogenetic trees).
Data analysis steps, include:
2.2 Evaluate Spike-in Controls
2.3 Create Background Models to Account for Contamination
3. Exploring Results and Analyzing Data
3.1 Identify Microbial Composition per Sample
3.1.1 Filter Results and Explore Sample Report Table
3.1.2 Evaluate Coverage for Identified Taxa
3.1.3 Use BLAST to Confirm Taxa
3.1.4 Generate Ciral Consensus Genomes
3.2.1 Create Taxon Heatmaps
3.2.2 Build Phylogenetic Trees
Step 1: Uploading Data
Click here for step-by-step instructions on how to upload data to CZ ID for mNGS analysis.
To begin your mNGS data analysis, simply upload your data to CZ ID. The platform accepts both short- and long-read data from Illumina and Nanopore sequencing platforms, respectively. However, guides and examples in the following sections focus on the mNGS Illumina pipeline.
Step 2: Performing Quality Control (QC)
Click here for step-by-step instructions on how to perform mNGS sample QC in CZ ID.
Once your data has been uploaded to your CZ ID account and runs through the mNGS pipeline, you should perform a QC check before exploring the results. First, you should evaluate data pre-processing QC metrics to check the quality of your samples and, if included in the mNGS experiment, evaluate control spike-ins. After getting a sense for sample quality, we recommend creating a background model based on negative control samples to help you identify contamination. Below we explain QC steps and metrics you should evaluate.
2.1 Checking the Quality of Your Sample
You can assess the quality of your sample by looking at the proportion of reads that passed data pre-processing steps within the mNGS pipeline. For an overview of the pipeline see CZ ID mNGS pipeline overview. In general, the mNGS pipeline performs a series of data pre-processing steps prior to species identification to remove low quality and low complexity sequence reads, host and human data, and duplicate reads.
The first sample QC check is to evaluate how many reads passed data pre-processing steps.
You can easily keep track of sample QC information within your Project page. The goal at this stage is to look for outliers that may indicate low or high-quality samples. You will be able to evaluate the following for each sample within a given project:
How many input reads did my samples have?
-
- This will help you identify samples that have low input data.
- The table below summarizes expected data yields for Illumina sequencers. Take into consideration the number of multiplexed samples in each run to have a sense for how many reads to expect per sample.
Sequencer
(Illumina)
Maximum data output (Gb) Maximum reads
(million reads per run)
Maximum
read length
iSeq 100 1.2 4 2 x 150 bp MiSeq Micro 7.5 25 2 x 150 bp MiSeq 15 25 2 x 300 bp HiSeq 4000 1,300 – 1,500 250 - 400 2 x 150 bp NextSeq 550 series 120 400 2 x 150 bp NextSeq 1000 & 2000 360 1,200 2 x 150 bp NovaSeq 6000 SP 200 - 250 1,300 – 1,600 2 x 250 bp NovaSeq 6000 S1 400 – 500 2,600 – 3,200 2 x 150 bp NovaSeq 6000 S2 1,000 – 1,250 6,600 – 8,200 2 x 150 bp NovaSeq 6000 S4 2,400 – 3,000 16,000 – 20,000 2 x 150 bp NovaSeq X series 16,000 26,000 – 52,000 2 x 150 bp
What percentage of reads passed QC filtering?
- This will help you evaluate the quality of the input data. A high number of reads containing mainly adapter sequences (i.e., adapter dimers), a high proportion of bases with low quality scores, and/or low in complexity (e.g., containing homopolymers or sequence repeats) will lead to a low percentage of reads passing QC filtering steps.
- Quality-filtering tools use Phred-scale quality score thresholds to assess the quality of each base within reads. The Phred scores are used to represent confidence in a given sequencer base call. By default, the CZ ID mNGS pipeline will remove reads containing >15 uncalled bases (N’s) and bases with Phred scores < 17. This Phred score threshold corresponds to base calls with > 98% accuracy.
Phred Quality
Score
Base Call
Error Probability
Base Call Accuracy
(1 - Error)
10 1 in 10 = 10% 90% 20 1 in 100 = 1% 99% 30 1 in 1000 = 0.1% 99.9% 40 1 in 10000 = 0.01% 99.99% 50 1 in 100000 = 0.001% 99.999% 60 1 in 1000000 = 0.0001% 99.9999%
Do samples have expected insert lengths?
-
- Insert length refers to the length of the nucleotide sequence that is inserted between sequencing adapters. This information can be used to gauge the quality of the final sequencing library. Mean values for insert lengths shorter than expected may indicate sample degradation or over-fragmentation during library preparation.
How many reads represented host and/or human sequences?
-
- You will be able to keep track of host and human data in your samples. The percentage of reads that are removed during host and human read filtering partly depends on the sample type or tissue of origin. For example, cerebrospinal fluid (CSF) samples often have > 99% of reads removed during the host filtering steps. This is expected from CSF samples given that, as a sterile fluid, the microbial fraction is minimal compared to the human fraction. Stool samples, on the other hand, may have a much lower percentage of reads removed during host filtering steps, since microbial concentrations are much higher compared to other sample types.
What are the read duplication levels in my samples?
- High duplication levels are indicative of a low diversity dataset, which may result from biases during library preparation (e.g., biased PCR amplification or intentional enrichment for specific targets). However, high duplication could truly represent a biological phenomenon (low complexity or diversity samples).
- The CZ ID mNGS pipeline calculates duplicate compression ratios (DCR) for you to assess duplication levels. The DCR is the ratio of the total number of sequences present prior to running duplicate identification versus the number of unique sequences. The higher the DCR, the higher the read duplication.
Which QC step filtered out most of the reads?
- Evaluating which QC step removed most of the reads might help you explain what happened with low quality samples. You can easily keep track of the number of reads lost throughout QC filtering and host and human data removal steps. Below is a general guide on how to interpret reads lost throughout data processing.
Pipeline Step Interpretation
(If many reads are lost during this step, then…)
Adapter trimming Possible adapter dimers Filter length Many short reads (< 35 bp), possible degradation or over-fragmentation Filter low quality Many reads containing a high proportion of low quality bases (Phred scores < 17) Filter low complexity Many reads containing homopolymers or simple sequence repeats. Duplicate removal Many duplicate reads, possible over-amplification Remove host and human reads Large proportion of reads mapping to host and/or human sequences
2.2 Evaluating spike-in controls
Some labs use External RNA Controls Consortium (ERCC) standard spike-ins as controls for NGS experiments. If you use ERCC spike-ins, you can track how many ERCC sequences were identified and their distribution (learn where to find ERCC count data). You will be able to evaluate expected ERCC sequence counts vs the observed sequence counts. This information will help you interpret samples. For example, were all ERCC sequences detected? If not, this might indicate that samples require increased sequencing depth. Do expected ERCC sequence counts correlate with observed counts? If ERCC sequences spiked at low concentrations are overrepresented, this might indicate sequencing bias.
2.3 Creating Background Models for Your Project
Click here for step-by-step instructions on how to create a background model.
It is highly recommended to include negative control samples as part of NGS projects to account for potential biological contaminants present during sampling and in reagents used during nucleic acid extractions and library preparation. Negative control samples can be used to create background models that are then used to assess if organisms detected within samples are present at abundances higher than background and, thus, can be distinguished from contamination. Specifically, background models are used to calculate standard (Z) scores that account for relative abundances of detected taxa (measured as reads per million bases sequenced; rPM) within samples and negative controls.
The Z-score for a given taxon in sample “S” is computed as follows:
Taxa present at higher abundance in the sample than in the negative controls will have a Z-score > 1. If a particular taxon is not found within negative controls, then the Z-score will be set to 100. If the taxon is not found in the sample, but is present in the controls, the Z-score will be set to -100. You can set Z-score thresholds to filter results and only focus on taxa that were not identified in controls (Z-score = 100) or those that are found in higher abundances in samples compared to negative controls (Z-score > 1) (see Filtering results section below).
The Z-score is used to calculate an aggregate score (Score). The aggregate score is an empirical heuristic for ranking microbial matches by combining relative abundance and Z-score information at the species and genus levels. The aggregate score is calculated as follows:
If a taxon has a high aggregate score it indicates that a high abundance of reads matched to that taxon in the NT and/or NR database, while being less prevalent in the selected background model.
Step 3: Exploring Results and Analyzing Data
The main goal of mNGS analysis is to detect all the organisms present in a given sample. This information may help you identify organisms of interest, such as potential etiological agents. There are two main ways to explore results and analyze mNGS data within your CZ ID account. You can explore microbial composition identified per sample and/or compare identified taxa across samples.
3.1 Identify Microbial Composition per Sample
See Sample Report Table to learn about the interactive report with identified taxa.
You can easily see identified taxa by exploring an interactive sample report table for each sample in your project. The sample report summarizes taxonomic information and metrics for how well sequences matched a given taxon in the NCBI nucleotide (NT) or non-redundant protein (NR) databases. The report also provides information regarding the relative abundance of a given taxon within a sample.
- High-scoring Taxa: CZ ID implements ranking scores based on a given taxon's abundance in samples relative to controls and matches to both nucleotide (NT) and protein (NR) NCBI databases. High-scoring taxa are automatically highlighted in blue.
- Background Model: Create background models to calculate metrics that account for contamination (Score, Z-score).
- Filter Options: Filter the report table based on categories (e.g., bacteria, viruses) or metric thresholds (e.g., rPM 100) to focus on relevant results.
- Metrics per Taxa: Reported metrics for matches found in the NT and NR databases including: Score, Z-score, reads per million (rPM), # of reads (r), # of contigs, # of reads assembled into contigs (contig r), average % identity of alignments (% id), average length of alignments (L), and E-value.
- Annotation Tag: Use tags to note if identified taxa are a "hit", "no hit", or "inconclusive".
- Details at the Species Level: By default, results are shown at the genus level. However, results at the species level can be viewed by clicking on the downward arrow located by genus names.
- Analysis Icons: Use these icons to view coverage, run BLAST, build trees, generate viral consensus genomes or download sequences associated with the taxon.
- Known Pathogen Flag: Flags highlight pathogenic taxa.
Sample Report Metrics
Database Notes
The sample report provides metric values for both the NT and NR databases. It is very likely that values between the two databases will differ. Comparisons against NT are done at the nucleotide level, which is considered a more stringent search for similar sequences. On the other hand, searches against NR are done at the protein level which may reveal more divergent matches. Here are some considerations:
- Protein-coding sequences are present in both NT and NR databases. However, non-coding regions are only present in the NT database. Therefore, if you see a taxon that has high NT counts but low NR counts, many of the reads aligning to that taxon likely come from rRNA genes or other non-coding regions.
- Divergent sequences may have matches at the amino acid level but not at the nucleotide level. If you see a taxon that has high NR counts and low NT counts, you are likely dealing with a novel sequence. This is more commonly observed for virus sequences compared to bacterial sequences.
- If you are interested in identifying known pathogens, you should focus on values for matches in the NT database. Conversely, if you suspect a novel pathogen, you should focus on values for the NR database.
3.1.1 Filtering Results and Exploring the Sample Report Table
Click here to learn more about how to set filters.
Filtering results on the sample report will help remove some noise (e.g., background contaminants) and let you focus on abundant species representing microbial groups of interest (e.g., viruses, bacteria). Here we discuss settings that will allow you to sieve through the sample report more efficiently.
Choose what you want to see on the sample report based on category, background model, and/or threshold filters.
Category: You can choose to only view results for a specific microbial group, including:
-
- Archaea
- Bacteria
- Eukaryota
- Viroids
- Viruses
- Viruses - Phage
- Uncategorized
Background: You can choose to see results after applying background models. This will allow you to see aggregate scores and Z-scores for identified taxa and account for contamination.
Threshold filters - You can choose to filter results based on metric value ranges reflecting the quality of matches to the NT or NR database. Thresholds will allow you to filter out spurious matches. For example, you can set the following thresholds:
- Z-score > 1 for matches in the NT database to remove taxa that were present at higher abundance in the negative controls compared to the samples and, thus, are likely contaminants.
- rPM > 10 for matches in the NT database to remove taxa that were present at low levels.
- L > 50 bp for matches in the NT database to remove taxa for which alignments were < 50 bp. The longer the alignment between a query sequence and its matching taxon, the greater confidence we have in the match. For libraries containing 150bp reads, we can have confidence in alignments with an L value > 75 bp.
Each of the metrics in the sample report can help you evaluate if a particular taxon is actually present in your sample. After filtering results, you can start focusing on identified taxa. Here are initial steps you can take to start exploring results in the sample report and gain confidence in identified taxa:
1. Find the most abundant taxon (highest rPM) in your sample.
2. Look at the “contig” column. Is the number of contigs > 0?
-
-
- Yes - Great! This indicates that short reads were assembled into longer contigs, which improves your confidence in the match.
- No - That’s ok! Contig assembly is dependent on the total number of reads and the genome size of the identified organism. The lack of contigs could indicate that the organism is present at such low abundance that there were not enough reads to generate a contig and more sequencing is needed.
-
3. Look at the “% id” column. Is the percent identity >90%?
-
-
- Yes - Great! This suggests the identified organism is highly similar to the reference sequence.
- No - Still OK! Novel pathogens will be less similar to known pathogens, thus having lower percent identities. You can double-check if this is a real finding by performing BLAST.
-
4. Look at the “L” column. Is the alignment length at least one-third of the expected read length?
-
-
- Yes - Great! This might be a real match. Note that it is possible to have L values greater than the original read length if contigs were generated.
- No - Not great… Probably a spurious match.
-
5. Repeat steps 1-4 for taxa with abundant taxa.
3.1.2 Evaluate Coverage for Identified Taxa
Click here to learn more about coverage visualization.
The Coverage Visualization is another tool to assess whether a particular taxon is present or not in your sample. This type of visualization shows the range and uniformity of sequencing coverage over a given taxon reference sequence.
Coverage depth vs breadth
In general, high coverage depth and breadth are indicative of the presence of a given taxon in the sample. Below we provide general guidelines for how to interpret coverage.
Keep in mind the microbial group you are working with when evaluating coverage. For example, note whether the genome is viral or bacterial. It is common to see relatively high coverage for viral genomes given that viruses have small genomes compared to bacteria and other cellular organisms. Use the two examples below to familiarize yourself with what high coverage looks like for viral and bacterial genomes.
Example of high coverage visualization for a viral genome (bottom panel).
Example of high coverage visualization for a bacterial genome (bottom panel).
3.1.3 Use BLAST to Gain Further Confidence in Taxon Matches
Click here to learn how to perform BLAST from your CZ ID account.
To maximize computational resources, CZ ID only compares contig sequences against a custom database containing sequences representing taxa identified with read data (see mNGS pipeline information here). The smaller size of the custom database, in comparison to what is available in NCBI databases, may result in the detection of matches that do not represent the best match for a given contig sequence. Therefore, you may want to confirm contig sequence matches by comparing/searching against NCBI. Additionally, you may be interested in learning more about specific genomic regions for a given taxon (e.g., is this a protein-coding region? If so, which protein?).
NCBI BLAST web page contains general BLAST information and interface to perform sequence searches against nucleotide (NT) and non-redundant protein databases (NR). However, CZ ID implements BLASTN and BLASTX so you can search sequences of interest against NT and NR directly from your account.
You can use BLAST (Basic Local Alignment Search Tool) to compare contigs matching a taxon of interest against NCBI databases. See A Guide to BLAST for general information about this sequence search tool. CZ ID currently implements BLASTN and BLASTX for searches against the nucleotide (NT) and non-redundant protein (NR) databases, respectively.
3.1.4 Generate Consensus Genomes for Abundant Viruses
Learn how to generate viral consensus genomes using the web app and the command line interfaces.
If a sample has viral taxa present at high levels, it may be possible to build consensus genomes for each virus. You can then use the consensus genomes for downstream analyses, such as phylogenetic tree building. Consensus genomes can only be created for viral species with at least one contig aligning to the NCBI nucleotide database.
3.2 Analyze Multiple Samples to Compare Microbial Composition
You can compare taxa across samples in CZ ID using heatmaps or phylogenetic trees depending on your goal. If you want to look for patterns, heatmaps provide a visual way of comparing multiple taxa detected across samples. If you are interested in comparing a specific taxon of interest across multiple samples, you can use phylogenetic trees.
3.2.1 Taxon Heatmaps
Click here to learn how to generate heatmaps in CZ ID.
Use heatmaps to easily compare taxa detected across samples. For example, compare samples from different treatments or patient health status to identify microbes of interest or simply compare samples against negative controls to rule out contaminants.
Heatmaps are interactive. You can easily:
-
- Filter the heatmap to help you reduce noise the same way you can filter the Sample Report.
- Control the number of displayed taxa per sample.
- Add metadata
- Select specific taxa you would like to see on the heatmap.
- Pin specific samples that you would like to show and/or group on the left-hand side of the heatmap.
- Filter and View Options: Use filters and view options to customize heatmap based on categories (e.g., bacteria, viruses), metric thresholds (e.g., rPM > 100), taxonomic levels, and/or scale of interest.
- Pin Samples, Add Metadata, and/or Select Specific Taxa: Use these dropdown menus to further customize heatmap.
- Color Scale: Heatmap scale reflects relative abundance (rPM) by default.
3.2.2 Phylogenetic Trees
Click here to learn how to create phylogenetic trees in CZ ID.
The CZ ID phylogenetics module allows you to construct phylogenetic trees to compare the genomic relatedness (nucleotide level) of organisms found across different samples. This is a useful tool for evaluating similarity among microbes identified in multiple samples in the context of outbreaks (e.g., closely-related sequences identified from multiple patients) or contaminants (e.g., are the sequences identified in negative controls closely-related to those identified in samples? Is there evidence of “sample bleeding” in multiplexed runs?).
When creating your phylogenetic tree:
-
- Select a microbe of interest and CZ ID will automatically identify samples within the project where the microbe was identified.
- You can choose to include samples from public projects. CZ ID will automatically identify public projects within the platform where the microbe of interest was also identified.
- CZ ID will automatically include up to 10 reference sequences from NCBI based on reference sequences with the most read matches.
- Once the tree is created, you can easily color the tree by metadata.
Example phylogenetic tree generated for Porcine astrovirus 2.
Recap: Tie It All Together
When you upload sequencing data to CZ ID, you can repeat the following steps to make sense of the data.
Steps to help you gain insights about your samples:
1. View your samples’ QC metrics on the project page. As you gain an understanding about your samples, look for any outliers.
-
- Consider Sample Type, Total Reads, ERCC Reads, Passed QC, Passed Filters, and DCR when evaluating QC.
- For any outliers, dig into the “Reads Remaining” table to understand what may have happened during library prep.
2. Create a heatmap to view identified taxa across samples and customize it to reduce noise.
-
- Add some metadata
- Apply a background model. Make sure you are using an appropriate background model. If you don’t have one, create a background model from water and/or healthy samples processed in your lab.
- Apply some filters to the heatmap (e.g., NT rPM, NR rPM, and L thresholds) and look for trends. You may need to play around with filters to see what makes sense for your dataset.
- Look at Taxon Level = Genus to get a high-level view.
3. Dive into individual hits to taxa of interest using the sample reports. Make sure to evaluate the hits based on several metrics:
-
- rPM, abundance within the sample
- Z score, significance as compared to the background
- Coverage visualization, to assess breadth and depth of reads to an accession
4. Finally, any hits you find by mNGS can be validated through PCR assays and you’ll soon be an expert in interpreting your data.
Comments
0 comments
Please sign in to leave a comment.