Overview
Once you upload data to the mNGS Nanopore pipeline and the run is completed, you can analyze data to identify microbes of interest. Here we list steps for exploring results and analyzing data. Click here to learn how to download data and reports produced throughout the pipeline.
After reading this guide, you will be able to:
- Navigate to the Projects page to view sample status
- View and interpret the Sample Report
- Learn about result interpretation
- Annotate hits to keep track of matches to taxa of interest
- Explore Taxonomic Tree View
- Use filters to simplify the Sample Report
- Visualize coverage for taxa of interest
- Use BLAST to confirm taxa of interest
Navigate to Project page
Go to the Project page of interest to view the status of your samples. To do this, look for the project in the Discovery page. Once in the Project page, click on the Metagenomics - Nanopore tab to check on the pipeline run status for each sample.
- Project Name
- Metagenomics-Nanopore tab
- Sample status
Once the pipeline run is complete, you can view a Sample Table providing metadata and stats for each sample. The sample stats will help you evaluate how many reads passed data processing steps. Note that at most 1 million reads can pass all data processing steps because reads passing the quality and host/human filters are subsampled (see pipeline workflow overview).
The Sample Table defaults to a list of columns but you can add or remove columns by clicking the Add ("plus") button in the right-hand corner of the table and selecting the data of interest from the dropdown menu.
Below we provide a summary of the information provided within the Sample Table. Columns included in the table by default are highlighted with a star (*).
Category | Provided information |
Pipeline | *Created On - Date sample was uploaded to the platform and pipeline ran. |
Metadata | *Host - Dictates which genomes are used for host subtraction. |
Metadata | *Location - Specifies where the sample was collected. |
Metadata | *Nucleotide Type - Dictates steps implemented for host and human data subtraction. |
Metadata | Collection Date - Date sample was originally collected. |
Metadata | Sample Type - Type of sample, tissue or site that most accurately describes the sample. |
Metadata | Water Control - Indicates if the sample is a water or negative control. |
Metadata | Optional Metadata - Available information will depend on the provided optional metadata during sample upload. Options include host metadata (Sex, Race/Ethnicity, Age, ID, Taxonomy), clinical data (Primary Diagnosis, Antibiotic Administered, Immunocompromised, Comorbidity, Diseases and Conditions, Detection Method, Infection Class), and sample processing information (Library Prep, Sequencer, RNA/DNA Input). |
Sample Stats | *Passed Filters - Percentage of reads remaining after QC filtering, subtracting host and human data, and subsampling to 1 million reads. |
Sample Stats | *Passed QC - Percentage of reads remaining after QC filtering to remove low quality bases (phred score < 9), short reads (< 100 bp), and low complexity reads. |
Sample Stats | Total Reads - Total number of reads uploaded to the platform. |
Pipeline | Pipeline Version - Specifies which pipeline version was used to analyze the sample. |
Pipeline | Subsampled Fraction - After QC filtering and host/human data removal, the remaining reads are subsampled to 1 million. This fraction specifies the ratio of subsampled reads to total reads passing QC filtering and host/human data removal. |
Pipeline | Total Runtime - Total time required by the mNGS Nanopore pipeline to process the uploaded files and obtain results. |
Sample Notes | Notes - Sample notes added by the user while uploading data. |
Stats not calculated (or relevant) for Nanopore data | Duplication compression ratio (DCR), ERCC Reads, and Mean Insert Size are stats that are not calculated for Nanopore data and, if selected, will show blank values. |
The Sample Table only provides stats at the read level so that you can keep track of how many reads made it through data pre-processing steps (e.g., QC filtering and subsampling) and, thus, were used for microbe detection. If you are interested in information at the base pair level, please see the Sample Overview report available through bulk download or look at the pipeline details on the Sample Report page (see Finding Pipeline Run Details below).
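As a rough illustration of how the read-level stats relate to one another, the sketch below recomputes Passed QC, Passed Filters, and Subsampled Fraction from read counts. The function name, argument names, and example counts are hypothetical, and the arithmetic simply follows the column descriptions in the table; it is not CZ ID's implementation.

```python
def read_level_stats(total_reads, reads_after_qc, reads_after_host_human,
                     subsample_cap=1_000_000):
    """Recompute read-level stats from counts (hypothetical helper).

    Follows the Sample Table column descriptions:
    - Passed QC: reads left after QC filtering, as a percentage of total reads
    - Passed Filters: reads left after QC, host/human removal, and subsampling
    - Subsampled Fraction: subsampled reads / reads passing QC and host/human removal
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Subsampling keeps at most `subsample_cap` reads.
    subsampled = min(reads_after_host_human, subsample_cap)
    fraction = subsampled / reads_after_host_human if reads_after_host_human else 0.0
    return {
        "passed_qc_pct": 100 * reads_after_qc / total_reads,
        "passed_filters_pct": 100 * subsampled / total_reads,
        "subsampled_fraction": fraction,
    }


# Made-up counts, for illustration only.
stats = read_level_stats(
    total_reads=4_000_000,
    reads_after_qc=3_600_000,
    reads_after_host_human=2_000_000,
)
print(stats)
```

With these made-up counts, 90% of reads pass QC, 25% pass all filters, and the subsampled fraction is 0.5.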
Finding Pipeline Run Details
To find pipeline details for each sample:
1) Click on a sample of interest. This will take you to the Sample Report page.
2) Click the "Sample Details" link found on the right-hand side of the Sample Report page.
3) Clicking “Sample Details” will open a panel on the right-hand side of the page where you can find more information about a given sample. Click the Pipelines tab to find more details about the pipeline run. If you go to the Bases Remaining dropdown toward the bottom of the panel, you will see information regarding how many base pairs passed each data pre-processing step.
Note the following when looking at the “Bases Remaining” breakdown to understand where most of the data filtering occurred:
- Validate input: Currently, the input validation only verifies that sequences are in the correct format (FASTQ). Therefore, you should not see data loss here.
- Quality filter: The quality filter values refer to the number of bases remaining after fastp removes reads with low quality scores, short length, and/or low complexity.
- Host filter: The host filter indicates the number of bases remaining after removing reads matching the selected host organism using minimap2. If a host organism was not specified, you should not see data loss here.
- Human filter: The human filter values indicate the number of bases remaining after removing reads matching the human genome using minimap2. Reads matching the human genome are always removed regardless of the specified host organism.
- Subsampling: Reads passing quality filtering and host/human read removal are subsampled to 1 million sequences. Subsampling values reflect the number of bases remaining after subsampling.
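To see where most of the filtering occurred, it can help to turn the "Bases Remaining" values into a percentage lost at each step. The sketch below does this for numbers copied by hand from the panel; the helper name and the example base counts are hypothetical.

```python
def loss_by_step(bases_remaining):
    """Return (step, bases_remaining, percent_lost_at_step) for each step.

    `bases_remaining` is an ordered list of (step_name, bases) pairs whose
    first entry is the input before any processing.
    """
    rows = []
    previous = bases_remaining[0][1]
    for step, bases in bases_remaining[1:]:
        pct_lost = 100 * (previous - bases) / previous if previous else 0.0
        rows.append((step, bases, round(pct_lost, 1)))
        previous = bases
    return rows


# Hypothetical values, typed in from the "Bases Remaining" dropdown.
breakdown = loss_by_step([
    ("Input", 5_000_000_000),
    ("Validate input", 5_000_000_000),
    ("Quality filter", 4_500_000_000),
    ("Host filter", 1_800_000_000),
    ("Human filter", 1_710_000_000),
    ("Subsampling", 855_000_000),
])
for step, bases, pct_lost in breakdown:
    print(f"{step:15s} {bases:>14,d} bases remaining ({pct_lost}% lost)")

# The step with the largest relative loss.
largest_loss_step = max(breakdown, key=lambda row: row[2])[0]
```

In this made-up example the host filter removes the largest share of bases (60%), which would suggest a sample dominated by host material.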
View and Interpret the Sample Report
The Sample Report lists the taxa identified in your sample after aligning sequences against NCBI's nucleotide (NT) and non-redundant protein (NR) databases. To view the Sample Report for each sample:
1) Click on a sample of interest. This will take you to the Sample Report page.
2) Explore the Sample Report. By default, the report will open in a Table View. The table is sorted based on bases per million (bPM) matching listed taxa. If you want to change how the table is sorted, simply click on a metric (header) of interest to sort the table based on values for the selected metric.
3) Explore identified taxa. By default, the Sample Report summarizes identified taxa at the genus level. If you are interested in seeing specific species under a genus, click the downward chevron next to the genus name. Known pathogenic species will be flagged as “Known pathogen”.
4) Evaluate metric values supporting detected taxa. Each taxon in the Sample Report Table includes two metric values that represent alignment to NCBI's NT and NR databases. In each taxon row, metric values at the top correspond to NT alignments whereas values at the bottom correspond to alignments against NR. Note that values against NR only reflect alignments between contig sequences and their matching taxon (i.e., unassembled reads are not aligned against NR). By default, NT values will be highlighted with bold font and will be used to sort the table. If you would like to highlight NR values, simply click "NR" on the Database toggle.
The metrics provided within the Sample Report table include:
- Bases per million (bPM) - Number of bases within all the reads aligning to a given taxon, including those assembled into contigs that mapped to the taxon, per million bases sequenced.
- Bases (b) - Number of bases within all the reads aligning to a given taxon, including those assembled into contigs that mapped to the taxon.
- Reads (r) - Number of reads aligning to a given taxon, including those assembled into contigs that mapped to the taxon.
- Contigs (contig) - Number of assembled contigs aligning to a given taxon.
- Contig bases (contig b) - Number of bases within all the reads that assembled into contigs aligning to a given taxon.
- Percent identity (%id) - Average percent identity between all the query sequences (contigs and unassembled reads) and their matching taxon.
- Length (L) - Average length of alignments between all the query sequences (contigs and unassembled reads) and their matching taxon. Note that values against NR are reported in base pairs.
- Expect value (E value) - Average expect value (E-value) of alignments against the NT and NR databases. The E-value represents the number of matches with similar quality one would “expect” to see by random chance. This parameter provides a measure of randomness. The lower the E-value, the lower the probability of getting a match or alignment by random chance (i.e., the closer the E-value is to 0, the better).
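The bPM definition above amounts to a simple normalization. The sketch below shows the arithmetic under the assumption that the denominator is the total number of bases sequenced for the sample; the function name and the example numbers are hypothetical.

```python
def bases_per_million(taxon_bases, total_bases_sequenced):
    """Bases aligning to a taxon per million bases sequenced (illustrative)."""
    if total_bases_sequenced <= 0:
        raise ValueError("total_bases_sequenced must be positive")
    return taxon_bases * 1_000_000 / total_bases_sequenced


# Two made-up samples with the same raw base count for a taxon
# but very different sequencing output.
bpm_deep_sample = bases_per_million(25_000, 2_000_000_000)
bpm_shallow_sample = bases_per_million(25_000, 200_000_000)
print(bpm_deep_sample, bpm_shallow_sample)
```

Here the same 25,000 aligned bases correspond to 12.5 bPM in the deeper sample and 125 bPM in the shallower one, which is why a normalized metric is more informative than raw base counts when sequencing output differs between samples.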
Report Interpretation Considerations:
- You should use bases per million (bPM) when comparing relative abundance of taxa detected across samples. This normalized metric accounts for differences in sequencing output (i.e., total number of bases) between samples.
- The Sample Report provides metric values for searches against the NT and NR databases. It is very likely that values between the two databases will differ. Here are some things to keep in mind:
- NR values only reflect contig alignments, whereas NT values reflect alignments for both unassembled reads and contigs.
- Alignments against NT are done at the nucleotide level, which is considered a more stringent search for similar sequences. On the other hand, searches against NR are done at the amino acid level which may reveal more divergent matches.
- Protein-coding sequences are present in both NT and NR databases, whereas non-coding regions are only present in the NT database. Therefore, if you see a taxon that has high NT counts but low NR counts, many of the contigs and/or unassembled reads aligning to that taxon likely come from rRNA genes (e.g., 16S) or other non-coding regions.
- Divergent sequences may have matches at the amino acid level but not at the nucleotide level. If you see a taxon that has high NR counts and low (or no) NT counts, you are likely dealing with a novel sequence. This is more commonly observed for virus sequences compared to sequences from cellular organisms (e.g., bacteria, fungi).
- If you are interested in identifying known pathogens, you should focus on values for matches in the NT database. Conversely, if you suspect a novel pathogen, you should explore values for the NR database.
- The Sample Report includes taxa with the best alignment metrics. However, there are cases in which multiple species (or subspecies) produce significant alignments. Among equally good alignments, the reported taxon is the one whose sequence match has the highest total read count, including reads assembled into contigs. If that does not single out one match, the reported taxon is randomly selected from the equally good alignments. Therefore, it is always a good idea to evaluate coverage and confirm taxa of interest using BLAST.
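The NT versus NR considerations above can be summarized as a small decision rule. The sketch below is only a reading aid: the function name and the `low` cutoff are hypothetical, and what counts as a "high" or "low" value depends on your samples and controls.

```python
def interpret_nt_nr(nt_bpm, nr_bpm, low=1.0):
    """Rough interpretation of NT and NR support for one taxon.

    `low` is an arbitrary, hypothetical cutoff below which a value is
    treated as "low or absent".
    """
    nt_supported = nt_bpm >= low
    nr_supported = nr_bpm >= low
    if nt_supported and nr_supported:
        return "supported in both NT and NR"
    if nt_supported:
        return "NT only: matches may come from rRNA or other non-coding regions"
    if nr_supported:
        return "NR only: possibly a divergent or novel sequence"
    return "little support in either database"


print(interpret_nt_nr(nt_bpm=850.0, nr_bpm=0.0))
print(interpret_nt_nr(nt_bpm=0.0, nr_bpm=40.0))
```

Whatever the rule suggests, evaluating coverage and running BLAST remain the way to confirm a taxon of interest.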
Annotate Taxa of Interest
Use the Annotation tag to make notes regarding your confidence in matches to taxa of interest. You can annotate taxa at the genus or species level as “hit”, “not a hit”, or “inconclusive” to keep track of your analysis. Note that annotating taxa will not affect downstream analyses. You can use annotations to filter the Sample Report.
Explore Taxonomic Tree View
Explore an overview of detected species using the Taxonomic Tree View. By default, when you first open the Sample Report page you will see the Table View. Use the View Toggle to switch to the Taxonomic Tree View by clicking the phylogram icon. The resulting visualization depicts taxonomic relationships among microbes identified in a given sample.
The weight or thickness of the lines connecting tree nodes is proportional to the metric selected under “Tree Metric”. By default, the tree lines will reflect values representing the number of bases matching a given taxon in the NT database (NT b). Taxa with thicker lines will have proportionally higher NT b values than those with lower values.
You can apply filters to simplify the Taxonomic Tree View.
Use Filters to Simplify the Sample Report
The number of identified taxa on the Sample Report may be overwhelming. Moreover, not all reported taxa will be relevant. Filtering results on the Sample Report will help remove some noise (e.g., spurious matches) and let you focus on abundant species representing microbial groups of interest (e.g., viruses, bacteria). In this section we discuss settings that will allow you to sift through the Sample Report more efficiently.
You can choose what you want to see on the Sample Report based on one or more available filters.
Any applied filters will show above the Sample Report table. The number of rows passing the filter will be indicated below applied filters (above the table).
If you would like to clear any or all filters, simply click on the “X” by the filter you want to remove or click “Clear Filters”, respectively.
Available Filters
You can filter the report by:
Category: You can choose to only view results for a specific microbial group, including: Archaea, Bacteria, Eukaryota, Viroids, Viruses (all viruses), Viruses - Phage (only phage), and Uncategorized (not assigned to a specific group).
Read specificity: You can choose to see results based on taxonomic assignments. By setting the Read Specificity filter to “Specific Only” you will only see matches to taxa that are assigned to a genus. If you would like to see all taxon matches, regardless of whether or not they have been classified within a genus, set the Read Specificity filter to “All”.
Threshold filters: You can choose to filter results based on metric value ranges reflecting the quality of alignments against the NT or NR database. Thresholds will allow you to filter out spurious matches based on one or multiple metrics.
For example, you can set the following thresholds:
- If you are interested in known pathogens, set a bPM >= 100 filter for matches in the NT database to remove taxa that were present at low levels.
- Set a bPM >= 1 filter for matches in the NR database to remove taxa that only have matches in non-coding regions (i.e., taxa that only have matches in the NT database).
- Set an L >= 50 bp filter for matches in the NT database to remove taxa for which alignments were shorter than 50 bp. The longer the alignment between a query sequence and its matching taxon, the greater confidence you can have in the match.
- Set an E-value <= 0.001 filter for matches against NT and NR to remove likely random matches. Since E-values are specified as powers of 10, enter an E-value <= -3 (i.e., 10^-3) when setting this filter. The lower the E-value threshold, the more stringent the filter will be. In general, we suggest setting the E-value filter to <= -10 for stringent searches or <= -1 to allow for less significant taxon matches.
Annotations: You can filter results to see only annotated taxa. You will see genus or species information for annotations at the genus or species level, respectively.
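If you work with a downloaded report outside CZ ID, the threshold examples in this section can be approximated with a few lines of code. The sketch below is a minimal illustration: the dictionary keys (`nt_bpm`, `nr_bpm`, `nt_length`, `nt_evalue`), the taxon names, and the values are all hypothetical, the E-value is assumed to be stored as a raw number rather than an exponent, and only the NT E-value is checked for brevity.

```python
import math


def log10_evalue(e_value):
    """Exponent form of an E-value (e.g., 1e-3 becomes -3.0); 0 maps to -inf."""
    return -math.inf if e_value == 0 else math.log10(e_value)


def passes_thresholds(row, min_nt_bpm=100, min_nr_bpm=1,
                      min_nt_length=50, max_log10_evalue=-3):
    """Apply the example thresholds from this section to one taxon row."""
    return (
        row["nt_bpm"] >= min_nt_bpm
        and row["nr_bpm"] >= min_nr_bpm
        and row["nt_length"] >= min_nt_length
        and log10_evalue(row["nt_evalue"]) <= max_log10_evalue
    )


# Made-up rows with hypothetical column names.
rows = [
    {"name": "taxon_a", "nt_bpm": 2500, "nr_bpm": 300, "nt_length": 1800, "nt_evalue": 1e-50},
    {"name": "taxon_b", "nt_bpm": 40, "nr_bpm": 12, "nt_length": 900, "nt_evalue": 1e-20},
    {"name": "taxon_c", "nt_bpm": 500, "nr_bpm": 0, "nt_length": 1200, "nt_evalue": 1e-30},
    {"name": "taxon_d", "nt_bpm": 150, "nr_bpm": 5, "nt_length": 45, "nt_evalue": 0.5},
]
kept = [row["name"] for row in rows if passes_thresholds(row)]
print(kept)
```

Only `taxon_a` passes all four thresholds here: `taxon_b` falls below the NT bPM cutoff, `taxon_c` has no NR support, and `taxon_d` has short alignments and a weak E-value.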
Visualize Coverage
You can evaluate the uniformity and breadth of genome coverage for a taxon of interest by looking at the coverage visualization. Click here to learn more about coverage visualization and how to interpret it. This feature is available for taxa supported by at least one read match at the nucleotide level (NT database). Note that the tool used for contig assembly, metaFlye, only assembles reads > 1 kb.
To view genome coverage, hover over the taxon name of interest, which will display a set of analysis icons. Click the coverage visualization icon. You can select taxa of interest at the genus or species level.
The Coverage Visualization Panel will pop up at the bottom of the screen. Note that the visualization only depicts alignments at the nucleotide level. The schematic lines under the coverage plot represent the reference accession length (grey), contigs mapping to the reference accession (dark blue) and unassembled (or loose) reads that mapped to the accession (light blue). By clicking the dark blue lines, you can download or copy contig sequences mapping to regions of interest in FASTA format.
If you choose to view coverage for a given genus, the reference sequence will be selected based on the species with the highest read count. However, you can select other reference sequences from the dropdown menu. Reference sequences in the dropdown menu (up to 10) are selected based on sequences with the highest read count.
You can download contig sequences aligning to the reference by clicking the Download icon on the right-hand side of the coverage visualization panel.
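For readers who want a concrete sense of what "breadth of coverage" means, the sketch below computes the fraction of a reference covered by at least one alignment by merging overlapping intervals. It is a generic illustration, not a description of how CZ ID computes its coverage plot; the function name, the half-open coordinate convention, and the example intervals are assumptions.

```python
def coverage_breadth(alignments, reference_length):
    """Fraction of the reference covered by at least one alignment.

    `alignments` is a list of (start, end) half-open intervals in
    reference coordinates; overlapping intervals are merged so that
    shared positions are counted once.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / reference_length


# Made-up alignments on a 10,000 bp reference accession.
breadth = coverage_breadth([(0, 3000), (2500, 5000), (8000, 9000)], 10_000)
print(f"breadth of coverage: {breadth:.0%}")
```

The first two intervals overlap and merge into 0 to 5,000, so 6,000 of 10,000 positions are covered and the breadth is 60%.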
Use BLAST to Confirm Taxa
You can use BLAST to confirm alignments to taxa of interest. Click here to learn more about BLAST. This feature can be launched from the coverage visualization panel.
To run BLAST:
1) Open the coverage visualization panel for the taxon of interest by clicking the coverage visualization icon.
2) Click the BLAST icon on the right-hand side of the coverage visualization panel.
3) A modal will appear for you to select the BLAST type you wish to run. Click "Continue" after making your selection.
BLASTN vs BLASTX
BLASTN is the standard BLAST program that searches the nucleotide database (NT) using a nucleotide query sequence, whereas BLASTX is used to search the protein database (NR). BLASTN is an integral part of the metagenomic analysis workflow that can be used to:
- Confirm taxon matches.
- Gather relevant alignment metrics for publication (e.g., E-value, % identity, and bit score).
- Find contextual data (related sequences represented in the database) for phylogenetic trees.
- Perform quality control for contigs generated through de novo assembly (metaFlye in this pipeline). By "blasting" contigs, users can identify chimeras (contigs formed by two or more reads that have been incorrectly assembled together) and determine contig orientation for downstream analyses.
BLASTX translates nucleotide query sequences and compares translated amino acid sequences to the protein database. It can be used for:
- Identifying coding regions and encoded proteins
- Confirming novel viral sequences that only have matches in the NR database
4) Once the type of BLAST is specified, another modal will appear for you to select which sequences you would like to run. To select sequences, check the boxes for contigs or reads of interest. Click Continue after selecting the sequences for BLAST.
BLASTN Modal: CZ ID lets you run BLASTN on up to 3 of the longest contigs that aligned to the NT database. If no contigs aligned to the NT database, CZ ID will send up to 5 reads that aligned to the NT database. The modal will automatically show the longest contigs or randomly selected reads available for the analysis.
BLASTX Modal: CZ ID lets you run BLASTX on up to 3 of the longest contigs that aligned to the NT or NR database. Use the NT hits (default) and NR hits tabs to select which sequences you would like to run. If no contigs aligned to the NT database, CZ ID will send up to 5 reads that aligned to the NT database. The same logic applies for NR hits if you choose to select sequences aligning to NR. The modal will automatically show the longest contigs or randomly selected reads available for the analysis.
Note: If you select multiple contig sequences longer than 7500 bp for BLAST, the character limit per NCBI’s URL will be exceeded. Therefore, an individual BLAST tab will open per contig sequence. It is important to allow pop-ups from CZ ID to see all the NCBI BLAST tabs.
5) A message will appear indicating that you will be redirected to NCBI's BLAST service. Click Continue to send your sequences to NCBI.
6) An NCBI BLAST page will appear with default parameters. To run BLAST, simply click “View report”.
7) The BLAST Results page will appear after the analysis is done. If you ran multiple sequences, use the "Results for" dropdown menu to view the BLAST results for other sequences included in the analysis. Read about BLAST result interpretation in our Guide to BLAST or see NCBI's guide describing the BLAST Results Page.
8) Once you evaluate your BLAST results, you can annotate whether the taxon was confirmed ("Hit"), "Not a hit", or "Inconclusive" using the annotation feature.
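The sequence-selection rule described in step 4 (up to 3 of the longest contigs, otherwise up to 5 randomly selected reads) can be sketched as follows. This is an illustration of the stated rule, not CZ ID's code; the function name, the dictionary inputs, and the `seed` parameter are hypothetical.

```python
import random


def select_blast_queries(contigs, reads, max_contigs=3, max_reads=5, seed=None):
    """Pick sequences to offer for BLAST, following the rule in step 4.

    `contigs` and `reads` map sequence names to sequences. If any contigs
    are available, the longest ones are returned; otherwise a random
    selection of reads is returned.
    """
    if contigs:
        longest_first = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
        return longest_first[:max_contigs]
    rng = random.Random(seed)
    chosen = rng.sample(sorted(reads), k=min(max_reads, len(reads)))
    return [(name, reads[name]) for name in chosen]


# Made-up sequences for illustration.
example_contigs = {"contig_1": "A" * 100, "contig_2": "A" * 500,
                   "contig_3": "A" * 300, "contig_4": "A" * 50}
example_reads = {f"read_{i}": "ACGT" * 10 for i in range(8)}

from_contigs = select_blast_queries(example_contigs, example_reads)
from_reads = select_blast_queries({}, example_reads, seed=0)
print([name for name, _ in from_contigs])
```

With the example contigs, the three longest (`contig_2`, `contig_3`, `contig_1`) are selected; when no contigs are available, five of the eight reads are chosen at random.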
mNGS Nanopore Heatmap
CZ ID does not offer a heatmap functionality to visualize mNGS results from Nanopore data. However, you can easily create a heatmap outside CZ ID by running a heatmap generator script that uses CZ ID's Sample Taxon Reports as input data. The heatmap generator script can run locally on your machine or on the web via Google Colab. To learn more, visit the GitHub page for the Nanopore Heatmap Generator Script.
Notes regarding heatmap generator script:
- The heatmap generator script is not integrated into CZ ID. Google Colab is a separate service from CZ ID. Should you choose to upload your data to Google Colab, your data will be subject to Google’s Terms of Service and Privacy Policy. Click here to see frequently asked questions about Google Colab.
- We recommend uploading only the exact files specified in the instructions that are needed to generate the heatmap. Do not upload host or raw sequencing data.
- Since the heatmap generator script is external to CZ ID, our team will not be able to provide technical support. If you have any questions or comments, please post them on the GitHub page by creating a new issue.