Overview
Once you upload data to the mNGS Nanopore pipeline and the run is completed, you can analyze data to identify microbes of interest. Here we list steps for exploring results and analyzing data. Click here to learn how to download data and reports produced throughout the pipeline.
After reading this guide, you will be able to:
- Navigate to the Projects page to view sample status
- View and interpret the Sample Report
- Learn about result interpretation
- Annotate hits to keep track of matches to taxa of interest
- Explore Taxonomic Tree View
- Use filters to simplify the Sample Report
- Visualize coverage for taxa of interest
- Use BLAST to confirm taxa of interest
- Troubleshoot BLAST feature
Navigate to Project page
To view the status of your samples:
1) Go to the Project page of interest. To find it, look for the project on the "My Data" page. Once on the Project page, click the Metagenomics - Nanopore tab to check the mNGS Nanopore pipeline status.
Find the project of interest under the “Projects” tab and click on it. Once you are on the page for the project of interest you will see the “Metagenomics - Nanopore” tab, where you can find your nanopore samples and check on their status.
2) Once the mNGS pipeline run is complete, you can view a Sample Table providing metadata and stats for each sample. The sample stats will help you evaluate how many reads passed data processing steps. Note that at most 1 million reads can pass data processing steps, because reads passing the quality and host/human filters are subsampled (see pipeline workflow overview).
Sample Table showing metadata and general stats for one sample. By default, stats shown on the table reflect the percentage of reads passing data pre-processing steps used to filter low quality reads and remove host and human data. You can add more metadata and sample stats columns using the “plus” sign dropdown menu located on the right-hand side of the table.
Below we provide a summary of the information provided within the Sample Table.
The Sample Table only provides stats at the read level for you to keep track of how many reads made it through data pre-processing steps (e.g., QC filtering and subsampling) and, thus, were used for microbe detection. If you are interested in information at the base pair level, please see the Sample Overview report available through bulk download or look at the pipeline details on the Sample Report page (see images below).
Clicking on the sample will take you to the Sample Report page, where you can find more sample information.
Clicking “Sample Details” will open a modal on the right-hand side of the page where you can find more information about a given sample. For example, you can see information regarding how many base pairs passed various data pre-processing filters.
Note the following when looking at the “Bases Remaining” breakdown to understand where most of the data filtering occurred:
- Validate input - Currently, the input validation only verifies that sequences are in the correct format (FASTQ). Therefore, you should not see data loss here.
- Quality filter - The quality filter values refer to the number of bases remaining after fastp removes reads with low quality scores, short length, and/or low complexity.
- Host filter - The host filter indicates the number of bases remaining after removing reads matching the selected host organism using minimap2. If a host organism was not specified, you should not see data loss here.
- Human filter - The human filter values indicate the number of bases remaining after removing reads matching the human genome using minimap2. Reads matching the human genome are always removed regardless of the specified host organism.
- Subsampling - Reads passing quality filtering and host/human read removal are subsampled to 1 million sequences. Subsampling values reflect the number of bases remaining after subsampling.
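To see where most of the data filtering occurred, it can help to convert the "Bases Remaining" values into a percentage lost at each step. The following is a minimal sketch, not part of CZ ID: the numbers are made-up placeholders, and the function name `summarize_losses` is hypothetical. Replace the values with those shown for your own sample.

```python
# Hypothetical "Bases Remaining" values in pipeline order (placeholders, not real data).
bases_remaining = [
    ("Validate input", 2_000_000_000),
    ("Quality filter", 1_800_000_000),
    ("Host filter", 600_000_000),
    ("Human filter", 590_000_000),
    ("Subsampling", 590_000_000),
]


def summarize_losses(steps):
    """Return (step, bases_remaining, percent_lost_vs_previous_step) tuples."""
    summary = []
    previous = steps[0][1]  # the first listed step serves as the baseline
    for name, remaining in steps:
        lost_pct = 100.0 * (previous - remaining) / previous if previous else 0.0
        summary.append((name, remaining, round(lost_pct, 1)))
        previous = remaining
    return summary


loss_summary = summarize_losses(bases_remaining)
for name, remaining, lost_pct in loss_summary:
    print(f"{name:<15} {remaining:>15,} bases remaining ({lost_pct}% lost at this step)")

# The step with the largest percentage loss is where most filtering occurred.
largest_loss_step = max(loss_summary, key=lambda row: row[2])[0]
print("Largest loss:", largest_loss_step)
```

In this made-up example, most bases are removed at the host filter, which would be expected for a sample dominated by host material.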
View and interpret the Sample Report
The Sample Report is an interactive table summarizing identified taxa and match metrics. To view it for a given sample:
1) Click on a sample of interest. This will take you to the Sample Report page.
Once the pipeline run is completed you can go to the Sample Report page to view identified taxa.
Below we provide a summary of the metrics provided within the Sample Report table.
Consider the following when interpreting the Sample Report:
- You should use bases per million (bPM) when comparing relative abundance of taxa detected across samples. This normalized metric accounts for differences in sequencing output (i.e., total number of bases) between samples.
- The Sample Report provides metric values for searches against the NT and NR databases. It is very likely that values between the two databases will differ. Here are some things to keep in mind:
  - NR values only reflect contig alignments, whereas NT values reflect alignments for both unassembled reads and contigs.
  - Alignments against NT are done at the nucleotide level, which is considered a more stringent search for similar sequences. On the other hand, searches against NR are done at the amino acid level which may reveal more divergent matches.
  - Protein-coding sequences are present in both NT and NR databases, whereas non-coding regions are only present in the NT database. Therefore, if you see a taxon that has high NT counts but low NR counts, many of the contigs and/or unassembled reads aligning to that taxon likely come from rRNA genes (e.g., 16S) or other non-coding regions.
  - Divergent sequences may have matches at the amino acid level but not at the nucleotide level. If you see a taxon that has high NR counts and low (or no) NT counts, you are likely dealing with a novel sequence. This is more commonly observed for virus sequences compared to sequences from cellular organisms (e.g., bacteria, fungi).
  - If you are interested in identifying known pathogens, you should focus on values for matches in the NT database. Conversely, if you suspect a novel pathogen, you should explore values for the NR database.
- The Sample Report includes taxa with the best alignment metrics. However, there are cases in which multiple species (or subspecies) produce significant alignments. Among equally good alignments, the reported taxon is the match with the highest total read count, including reads assembled into contigs. If that does not resolve the tie, the reported taxon is randomly selected from the equally good alignments. Therefore, it is always a good idea to evaluate coverage and confirm taxa of interest using BLAST.
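The bases per million (bPM) normalization mentioned above can be illustrated with a short calculation. This is a sketch under an assumption: bPM is taken here as the number of bases matching a taxon per million total bases in the sample. The function name and the numbers are hypothetical, and CZ ID's exact denominator may differ.

```python
def bases_per_million(taxon_bases, total_bases):
    """Normalize a taxon's base count to bases per million total bases (assumed definition)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return taxon_bases * 1_000_000 / total_bases


# Two hypothetical samples with the same raw base count for a taxon
# but very different sequencing output.
bpm_sample_a = bases_per_million(50_000, 2_000_000_000)
bpm_sample_b = bases_per_million(50_000, 500_000_000)

print(f"Sample A: {bpm_sample_a:.1f} bPM")
print(f"Sample B: {bpm_sample_b:.1f} bPM")
```

The raw base counts are identical, yet the taxon is four times more abundant in relative terms in the sample with lower sequencing output. This is why bPM, rather than the raw count, is the metric to compare across samples.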
2) Explore identified taxa. By default, the Sample Report summarizes identified taxa at the genus level and is sorted based on bPM values. If you are interested in learning about specific species, click on a genus of interest. Known pathogenic species will be flagged as “Known pathogen”. If you want to change how the table is sorted, simply click on a metric (header) of interest to sort the table based on values for the selected metric.
You can sort the Sample Report table by any metric of interest.
Annotate Taxa of Interest
Use the Annotation feature to make notes regarding your confidence in matches to taxa of interest. You can annotate taxa at the genus or species level as “confirmed”, “not a hit”, or “inconclusive” to keep track of your analysis. Note that annotating taxa will not affect downstream analyses. You can use annotations to filter the Sample Report.
Click the tag by a taxon name of interest to annotate the hit.
Explore Taxonomic Tree View
Explore an overview of detected species using the Taxonomic Tree View. The tree view depicts the taxonomic relationship of all microbes identified in a given sample. The weight or thickness of the lines connecting tree nodes is proportional to the metric selected under “Tree Metric”. By default, the tree lines will reflect values representing the number of bases matching a given taxon in the NT database (NT b). Taxa with thicker lines will have proportionally higher NT b values than those with thinner lines.
Get to the Taxonomic Tree View by clicking on the toggle next to the Table View icon in the Sample Report page. By default, the thickness of the lines corresponds to the NT b value.
You can apply filters to simplify the Taxonomic Tree View. Additionally, you can change which value controls the thickness of the tree lines from the Tree Metric dropdown filter.
Taxonomic Tree View options
Use filters to simplify the Sample Report
The number of identified taxa on the Sample Report may be overwhelming. Moreover, not all reported taxa will be relevant. Filtering results on the Sample Report will help remove some noise (e.g., spurious matches) and let you focus on abundant species representing microbial groups of interest (e.g., viruses, bacteria). In this section we discuss settings that will allow you to sieve through the Sample Report more efficiently.
You can choose what you want to see on the Sample Report based on one or more available filters.
Available filters to simplify Sample Report
Any applied filters will show above the Sample Report table. The number of rows passing the filter will be indicated above the table.
If you would like to clear any or all filters simply click on the “X” by the filter you want to remove or click “Clear Filters”, respectively.
Available filters
You can filter the report by:
Category: You can choose to only view results for a specific microbial group, including: Archaea, Bacteria, Eukaryota, Viroids, Viruses (all viruses), Viruses - Phage (only phage), and Uncategorized (not assigned to a specific group).
Category filter dropdown menu
Read specificity: You can choose to see results based on taxonomic assignments. By setting the Read Specificity filter to “Specific Only” you will only see matches to taxa that are assigned to a genus. If you would like to see all taxon matches, regardless of whether or not they have been classified within a genus, set the Read Specificity filter to “All”.
Read specificity filter dropdown menu
Threshold filters: You can choose to filter results based on metric value ranges reflecting the quality of alignments against the NT or NR database. Thresholds will allow you to filter out spurious matches.
Threshold filter menu
For example, you can set the following thresholds:
- If you are interested in known pathogens, set a bPM > 100 filter for matches in the NT database to remove taxa that were present at low levels.
- Set a bPM > 1 filter for matches in the NR database to remove taxa that only have matches in non-coding regions (i.e., taxa that only have matches in the NT database).
- Set an L > 50 bp filter for matches in the NT database to remove taxa for which alignments were < 50 bp. The longer the alignment between a query sequence and its matching taxon, the greater confidence you can have in the match.
- Set an E-value < 0.001 filter for matches against NT and NR to remove likely random matches. Since E-values are specified as powers of 10, specify an E-value <= -3 when setting this filter. The lower the E-value, the more stringent the filter will be. In general, we suggest setting the E-value filter to <= -10 for stringent searches or <= -1 to allow for less significant taxon matches.
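The example thresholds above can also be applied outside CZ ID, for instance to a downloaded report. The sketch below is illustrative only: the row layout and field names (`nt_bpm`, `nr_bpm`, `nt_l`, `nt_log10_e`) are hypothetical and do not reflect the actual column names in CZ ID downloads, and the taxa are placeholders. It also shows how an E-value converts to the power-of-10 exponent entered in the filter.

```python
import math


def evalue_to_exponent(evalue):
    """Convert an E-value (e.g., 0.001) to its power-of-10 exponent (e.g., -3)."""
    return math.log10(evalue)


# Hypothetical report rows; field names and values are placeholders.
taxa = [
    {"name": "Taxon A", "nt_bpm": 5200.0, "nr_bpm": 310.0, "nt_l": 1450.0, "nt_log10_e": -87.0},
    {"name": "Taxon B", "nt_bpm": 140.0, "nr_bpm": 0.0, "nt_l": 900.0, "nt_log10_e": -40.0},
    {"name": "Taxon C", "nt_bpm": 12.0, "nr_bpm": 3.0, "nt_l": 45.0, "nt_log10_e": -2.0},
]


def passes_thresholds(row, min_nt_bpm=100, min_nr_bpm=1, min_nt_l=50, max_log10_e=-3):
    """Apply the four example thresholds: NT bPM, NR bPM, NT alignment length, NT E-value."""
    return (
        row["nt_bpm"] > min_nt_bpm
        and row["nr_bpm"] > min_nr_bpm
        and row["nt_l"] > min_nt_l
        and row["nt_log10_e"] <= max_log10_e
    )


kept = [row["name"] for row in taxa if passes_thresholds(row)]
print("E-value 0.001 as exponent:", round(evalue_to_exponent(0.001)))
print("Taxa passing all thresholds:", kept)
```

Here only Taxon A passes. Taxon B is dropped by the NR bPM filter because it has no NR matches, and Taxon C fails the NT bPM, alignment length, and E-value thresholds.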
Annotations: You can filter results to see only annotated taxa. You will see genus or species information for annotations at the genus or species level, respectively.
Annotation filter menu
Visualize coverage
You can evaluate the uniformity and breadth of genome coverage for a taxon of interest by looking at the coverage visualization. Click here to learn more about coverage visualization and how to interpret it. This feature is only available for taxa supported by at least one contig match at the nucleotide level (NT database). Note that the tool used for contig assembly, metaFlye, only assembles reads > 1 kb.
To view genome coverage:
1) Hover your cursor over a taxon of interest in the Sample Report Table. This will enable the coverage icon. You can select taxa of interest at the genus or species level.
2) Click the coverage icon to view the coverage visualization at the bottom of the page. Note that the visualization only depicts alignments at the nucleotide level. The dark blue coverage represents assembled contigs. The light blue coverage represents individual reads.
3) If you choose to view coverage for a given genus, the reference sequence will be selected based on the species with the highest read count. However, you can select other reference sequences from the dropdown menu. Reference sequences in the dropdown menu (up to 10) are selected based on sequences with the highest read count. You can also download contig sequences aligning to the reference by clicking the Download icon on the right-hand side of the coverage visualization panel.
You can select different reference sequences and download contigs aligning to the reference from the coverage visualization panel.
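To make the ideas of coverage breadth and depth concrete, the sketch below computes both from a list of alignment intervals on a reference sequence. It is a generic illustration, not a description of how CZ ID computes the values shown in the coverage visualization. The function name, the 0-based half-open interval convention, and the example intervals are assumptions.

```python
def coverage_stats(alignments, reference_length):
    """Return (breadth, mean_depth) for alignments given as 0-based, half-open (start, end) intervals."""
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip each alignment to the reference bounds before counting.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    covered_positions = sum(1 for d in depth if d > 0)
    breadth = covered_positions / reference_length  # fraction of reference covered at least once
    mean_depth = sum(depth) / reference_length      # average number of alignments per position
    return breadth, mean_depth


# Hypothetical alignments on a 1,000 bp reference: two overlapping, one isolated.
breadth, mean_depth = coverage_stats([(0, 400), (300, 700), (900, 1000)], 1000)
print(f"Coverage breadth: {breadth:.0%}, mean depth: {mean_depth:.2f}x")
```

A taxon with high breadth and even depth is generally better supported than one where all alignments pile up on a short region of the reference.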
Use BLAST to confirm taxa of interest
You can use BLAST to confirm alignments to taxa of interest. Click here to learn more about BLAST. This feature can be launched from the coverage visualization panel. If you are having trouble viewing results when running multiple contig sequences, see the BLAST troubleshooting section below.
To run BLAST:
1) Go to the coverage visualization panel for the taxon of interest.
2) Click the BLAST icon on the right-hand side of the coverage visualization panel.
3) You will be prompted to select the type of BLAST you would like to run. Click “Continue” after making your selection.
4) A dialog box will appear where you can select up to 3 of the longest contigs aligning to the taxon. The BLAST search will only take up to 7500 bp into account. If your contig sequences are longer than 7500 bp, a 7500 bp sequence stretch will be selected from the middle of the contig for BLAST. See BLAST troubleshooting if you have multiple contigs longer than 7500 bp and/or can’t see the BLAST report for all the selected contigs.
After selecting up to three of the longest contigs listed within the BLASTN dialog box, click “Continue”. You will be redirected to NCBI’s BLAST service.
When running BLASTX, you can select contigs with matches in the NT or NR database.
BLASTX dialog box
5) Run the BLAST search from the NCBI BLAST interface by clicking “View report”.
NCBI BLASTN interface
6) Explore BLAST results. See “Interpreting BLAST Results” within A guide to BLAST for details on how to interpret the reported metrics and graphics. Sequences resulting in significant alignments can be explored through two options:
a) Graphic summary - Provides a visual overview of the quality of alignments based on scores and the extent of alignments relative to the query sequence (i.e., contig).
b) Descriptions - Table listing sequences producing significant alignments, alignment metrics, and accession IDs. By default, identified sequences are sorted based on their alignment maximum (max) score.
BLASTN results page
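The contig selection described in step 4 (up to 3 of the longest contigs, with a 7500 bp stretch taken from the middle of longer contigs) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the window is assumed to be exactly centered on the contig, which the guide does not specify.

```python
BLAST_MAX_BP = 7500   # maximum sequence length taken into account, per the guide
MAX_CONTIGS = 3       # maximum number of contigs that can be selected, per the guide


def middle_stretch(sequence, max_bp=BLAST_MAX_BP):
    """Return the sequence unchanged if short enough, else a centered max_bp window (assumed centering)."""
    if len(sequence) <= max_bp:
        return sequence
    start = (len(sequence) - max_bp) // 2
    return sequence[start:start + max_bp]


def longest_contigs(contigs, limit=MAX_CONTIGS):
    """Return up to `limit` contigs, longest first."""
    return sorted(contigs, key=len, reverse=True)[:limit]


# Hypothetical 10,000 bp contig: 1,000 A, then 8,000 C, then 1,000 G.
contig = "A" * 1000 + "C" * 8000 + "G" * 1000
excerpt = middle_stretch(contig)
print(len(excerpt), set(excerpt))

selected = longest_contigs(["ACGT" * 10, "ACGT" * 500, "ACGT" * 3, "ACGT" * 100])
print([len(c) for c in selected])
```

Keep in mind that when a contig is trimmed this way, the BLAST result reflects only the selected stretch and not the full contig.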
BLAST Troubleshooting
If you select multiple contig sequences longer than 7500 bp for BLAST, the combined sequences will exceed the character limit for NCBI’s URL. In that case, a separate BLAST tab will open for each contig sequence. You must allow pop-ups from CZ ID to see all of the NCBI BLAST tabs.
To do this:
1) Click on the BLAST icon from the coverage visualization panel, select the BLAST type and contigs for BLAST, and click “Continue”.
2) You will be directed to the NCBI BLAST interface.
3) Go back to the CZ ID tab and disable the pop-up blocker.
4) After allowing pop-ups from CZ ID, you will see multiple BLAST tabs (one per contig).
mNGS Nanopore Heatmap
CZ ID does not offer heatmap functionality to visualize mNGS results from Nanopore data. However, you can easily create a heatmap outside CZ ID by running a heatmap generator script that uses CZ ID's Sample Taxon Reports as input data. The heatmap generator script can run locally on your machine or on the web via Google Colab. To learn more, visit the GitHub page for the Nanopore Heatmap Generator Script.
Notes regarding heatmap generator script:
- The heatmap generator script is not integrated into CZ ID. Google Colab is a separate service from CZ ID. Should you choose to upload your data to Google Colab, your data will be subject to Google’s Terms of Service and Privacy Policy. Click here to see frequently asked questions about Google Colab.
- We recommend uploading only the exact files specified in the instructions that are needed to generate the heatmap. Do not upload host or raw sequencing data.
- Since the heatmap generator script is external to CZ ID, our team will not be able to provide technical support. If you have any questions or comments, please post them on the GitHub page by creating a new issue.