Overview
Taxa identified by the mNGS pipeline are summarized in an interactive Sample Report page. Below we describe features of the Sample Report for the short read (Illumina) mNGS pipeline. Click here if you are interested in learning about the Sample Report for long read (Nanopore) mNGS data.
You can explore the Sample Report using a table or taxonomic tree view. Here we describe the Sample Report Table view using examples from the Medical Detectives public project. You can follow along by opening the samples discussed here within that project.
Table View
By default, when you first open the Sample Report page you will see the Table View. The report table provides several metrics indicating the relative abundance of genera and species identified in the sample you have selected, along with scores characterizing the quality of the match.
Sample Report Metrics
The Sample Report Table lists the taxa identified in your sample after aligning sequences against NCBI's nucleotide (NT) and non-redundant protein (NR) databases. Below we define the metrics provided in the Sample Report Table.
Metrics and Their Meanings
Report Metric | Definition |
---|---|
Score | CZ ID's heuristic for ranking microbial hits. The score is intended to combine the following aspects of the evidence for a hit: (a) species-level information, (b) genus-level information, (c) information about relative abundance within the sample, (d) information about abundance relative to the chosen background controls. The score is calculated as follows: ((abs(genus NT Z) * species NT Z * species NT rPM) + (abs(genus NR Z) * species NR Z * species NR rPM)) |
Z | Z-score statistic, used for evaluating prevalence of microbes in your sample as compared to background controls. The Z-score is computed based on the specified background model. |
rPM | Number of reads aligning to the taxon in the NCBI NT/NR database, per million reads sequenced |
r | Number of reads aligning to the taxon in the NCBI NT/NR database |
contig | Number of assembled contigs aligning to the taxon in the NCBI NT/NR database |
contig r | Total number of reads aligning to all assembled contigs for this taxon |
%id | Average percent-identity of alignments to NCBI NT/NR |
L | Average length of the local alignment for all contigs and reads assigned to this taxon |
E value | Average Expect value (e-value) of alignments against NCBI NT/NR. The Expect value (e-value) is a parameter that describes the number of matches one can "expect" to see by chance when searching a database of a particular size. The closer to 0 the more significant the alignment. |
Interpreting Metrics
Below we discuss how to interpret metrics reported in the Sample Report Table, including:
- Database
- Score
- Z score
- Reads per million (rPM)
- Reads (r)
- Number of contigs (contig)
- Contig reads (contig r)
- Identity (%id)
- Alignment length (L)
- E value
Database (NT/NR)
Each taxon in the Sample Report Table includes two values per metric, representing alignments to NCBI's nucleotide (NT) and non-redundant protein (NR) databases. In each taxon row, the values at the top correspond to NT alignments and the values at the bottom correspond to NR alignments. By default, NT values are highlighted in bold font and are used to sort the table. If you would like to highlight NR values, click "NR" on the Database toggle.
Interpreting alignments against NT and/or NR databases:
- It is possible for a taxon to contain high NT counts and low NR counts. This generally occurs when many of the reads aligning to that taxon come from non-coding sequences (e.g., rRNA genes). Non-coding sequences are present in the NT database, but not the NR database.
- It is possible for a taxon to contain high NR counts and low NT counts. This may occur when a sample contains sequences from a divergent or novel virus that may map via the more conserved protein (NR) sequence, but not via the more mutated nucleotide (NT) sequence.
- Currently, a single read may be assigned to two different taxa based on alignments against NT vs NR databases.
Aggregate Score (Score)
The Aggregate Score (Score) is CZ ID's empirical heuristic for ranking microbial hits. The Score is intended to combine the following aspects of the evidence for a given taxon match: (a) species-level information, (b) genus-level information, (c) information about relative abundance within the sample, (d) information about abundance relative to the chosen background controls. By default, the report page sorts taxa by the Score value.
Score Formula
Score = ( abs( genus NT Z ) x ( species NT Z ) x ( species NT rPM ) ) + ( abs( genus NR Z ) x ( species NR Z ) x ( species NR rPM ) )
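The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not CZ ID's implementation; the function name, argument names, and input values are hypothetical.

```python
def aggregate_score(genus_nt_z, species_nt_z, species_nt_rpm,
                    genus_nr_z, species_nr_z, species_nr_rpm):
    """Combine NT and NR evidence following the Score formula above."""
    # Each term multiplies the absolute genus-level Z-score by the
    # species-level Z-score and the species-level rPM.
    nt_term = abs(genus_nt_z) * species_nt_z * species_nt_rpm
    nr_term = abs(genus_nr_z) * species_nr_z * species_nr_rpm
    return nt_term + nr_term


# Made-up values for illustration only.
example_score = aggregate_score(
    genus_nt_z=2.0, species_nt_z=3.0, species_nt_rpm=10.0,
    genus_nr_z=1.5, species_nr_z=2.0, species_nr_rpm=4.0,
)
print(example_score)  # (2 * 3 * 10) + (1.5 * 2 * 4) = 72.0
```

Because only the genus-level Z-score is wrapped in abs(), a negative species-level Z-score makes the corresponding term negative.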
Score Interpretation
A high Score indicates that many reads matched the taxon in both the NT and NR databases and that the taxon is less prevalent in the control samples selected for the applied background model.
Top Scoring Taxa
A blue Lightbulb icon highlights the top three scoring taxa in the report that have more than 1 read per million matching both the NT and NR databases (NT rPM > 1 and NR rPM > 1) and that are found in higher abundance in the sample than in negative controls (NT Z score > 1 and NR Z score > 1). The lightbulb draws your attention to the highest-scoring taxa that pass these thresholds, indicating that there is confidence in the alignments and that you may want to investigate one or all three taxa further. The example below shows the top three scoring species for the Patient 010 CSF sample from the Medical Detectives project. Note that the rows containing the top three scores are highlighted in blue at the genus and species level.
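The selection rule described above can be sketched as a filter followed by a sort. This is a minimal illustration, assuming each taxon is available as a record with the fields shown; the `Taxon` type, the `top_scoring` function, and all values are hypothetical and do not come from CZ ID.

```python
from typing import NamedTuple


class Taxon(NamedTuple):
    name: str
    score: float
    nt_rpm: float
    nr_rpm: float
    nt_z: float
    nr_z: float


def top_scoring(taxa, limit=3):
    """Return up to `limit` taxa that pass all four thresholds, by Score."""
    eligible = [
        t for t in taxa
        if t.nt_rpm > 1 and t.nr_rpm > 1 and t.nt_z > 1 and t.nr_z > 1
    ]
    return sorted(eligible, key=lambda t: t.score, reverse=True)[:limit]


# Made-up taxa for illustration only.
taxa = [
    Taxon("taxon_a", 900.0, 50.0, 40.0, 99.0, 99.0),
    Taxon("taxon_b", 1200.0, 0.5, 30.0, 10.0, 10.0),  # fails NT rPM > 1
    Taxon("taxon_c", 300.0, 5.0, 4.0, 2.0, 3.0),
    Taxon("taxon_d", 100.0, 2.0, 2.0, 1.5, 0.8),      # fails NR Z > 1
    Taxon("taxon_e", 50.0, 3.0, 2.0, 4.0, 2.0),
    Taxon("taxon_f", 20.0, 1.5, 1.2, 1.1, 1.3),
]

highlighted = top_scoring(taxa)
print([t.name for t in highlighted])
```

Note that `taxon_b` has the highest Score in this made-up list but is not highlighted, because it does not pass the NT rPM threshold.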
Z-score (Z)
The Z-score statistic is used to evaluate the prevalence of microbes in your sample as compared to background controls. The Z-score is computed based on the specified background model. See Background Models to learn how these models are used to calculate Z-scores and how to make a background model for your project. Selecting a new background model immediately updates the Z-score column in the report table. In turn, this updates the aggregate Score.
How to Interpret the Z-score:
- If you see a Z-score of 100, the taxon does not appear in any of the samples in your selected background model.
- If you see a Z-score of -100, the taxon does not appear in your sample or in the selected background model. You will only see a Z-score of -100 if the taxon matched in one of the two databases (NT or NR) but not the other.
- A Z-score of 1 would indicate that the amount of rPM matching the taxon in your sample is one standard deviation greater than the average rPM matching the taxon in samples selected for your background model.
- All else being equal, taxa with higher Z-scores also have higher Scores, because the Z-scores are factors in the Score formula.
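The interpretation above corresponds to a standard Z-score computed on rPM values. The sketch below is an assumption-laden illustration: the function name is hypothetical, the use of the sample standard deviation is an assumption, and how the pipeline's background model actually estimates the mean and standard deviation is not described here. The -100 case is not modeled.

```python
from statistics import mean, stdev

# Value reported when the taxon is absent from every background sample,
# as described in the list above.
ABSENT_FROM_BACKGROUND = 100.0


def z_score(sample_rpm, background_rpms):
    """Z-score of a sample's rPM relative to background-sample rPM values."""
    if len(background_rpms) < 2:
        raise ValueError("need at least two background samples")
    if all(rpm == 0 for rpm in background_rpms):
        return ABSENT_FROM_BACKGROUND
    sigma = stdev(background_rpms)  # assumption: sample standard deviation
    if sigma == 0:
        # This sketch does not define a value for a background without spread.
        raise ValueError("background rPM values have zero spread")
    return (sample_rpm - mean(background_rpms)) / sigma


# Made-up values: background mean is 4.0 and standard deviation is 2.0,
# so a sample rPM of 6.0 sits one standard deviation above the mean.
print(z_score(6.0, [2.0, 4.0, 6.0]))
```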
Reads per Million (rPM)
Reads per million (rPM) refers to the number of reads aligning to the taxon in the NCBI NT or NR database, per million reads sequenced. This rPM metric is used to normalize read counts across samples enabling a comparison of relative abundance of taxa across samples. In other words, rPM is a scaled metric of abundance. Each time we run a sequencing experiment, we may obtain different numbers of total reads. To normalize the values across experiments, we look at the rPM instead of raw read count. Even when looking at a single sample, it can be helpful to sort by the highest rPM value after the aggregate score.
rPM Formula:
rPM = ( r / total reads sequenced ) x 1,000,000
How to interpret rPM:
- The rPM associated with each taxon indicates the relative abundance of nucleic acid associated with each taxon in a given sample.
- The rPM value provides a metric that enables comparison of relative abundances across samples that were sequenced to different sequencing depths.
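As a worked example of the normalization, the sketch below computes rPM from the definition given above (reads aligning to the taxon, per million reads sequenced). The function name and the read counts are hypothetical, and the exact read total the pipeline uses as the denominator is not specified here.

```python
def reads_per_million(taxon_reads, total_reads):
    """Scale a raw read count to reads per million reads sequenced."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return taxon_reads * 1_000_000 / total_reads


# Two made-up samples sequenced to different depths.
deep_sample = reads_per_million(taxon_reads=500, total_reads=10_000_000)
shallow_sample = reads_per_million(taxon_reads=100, total_reads=2_000_000)
print(deep_sample, shallow_sample)  # 50.0 50.0
```

The raw read counts differ fivefold (500 vs. 100), but the rPM values are identical, which is why rPM is the better metric for comparing relative abundance across samples.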
Reads (r)
Reads (r) refers to the number of reads aligning to the taxon in the NCBI NT or NR database.
How to interpret reads (r):
- The number of reads associated with each taxon indicates the relative abundance of nucleic acid associated with this taxon present in the sample.
- If a taxon has a small number of reads, there was likely relatively little of this organism present in the sample.
- If a taxon has a larger number of reads, this organism was likely more abundant within the sample.
Number of Contigs (contig)
The number of contigs (contig) refers to the number of assembled contigs aligning to the taxon in the NCBI NT or NR database.
How to interpret the number of contigs (contig):
- A higher number of contigs does not necessarily mean increased confidence in the result. Assembly aims to take raw reads and generate longer sequences. Therefore, it is possible to obtain a complete genome sequence in one contig. Look at contig r (see definition below) to evaluate how many reads are associated with all contigs.
- Few contigs, with high contig r associated with them, indicate high-quality assemblies.
- Many contigs, with relatively low contig r associated with them, indicate that the assembly step may not have improved the species calls significantly beyond the raw reads.
Contig Reads (contig r)
Contig r refers to the total number of reads aligning to all assembled contigs matching a given taxon.
How to interpret contig r:
- High contig r values indicate that many reads were associated with contigs, which improves confidence in the assembled contigs.
Identity (%id)
The identity (%id) refers to the average percent-identity of the reads and contigs that aligned to a given taxon in the NCBI NT or NR database.
How to interpret %id:
- For high-confidence alignments to taxa present in the NCBI databases, we expect to see relatively high percent identity matches (i.e., > 90% identity) to the reference sequences.
- Novel pathogens will sometimes show a lower %id. You can verify that the taxon is really present by downloading the reads and contigs and investigating the quality of the alignments via BLAST.
Alignment Length (L)
The alignment length (L) refers to the average length of local alignments for all contigs and reads assigned to a given taxon.
How to interpret alignment length (L):
- A high %id and relatively long alignment would give you confidence in a taxon match.
- For libraries containing 150bp sequences, we would trust alignments with an L value in the range of 75 to 150, with 100 being a good value and longer alignments increasing confidence.
- Regardless of raw input read length, alignments shorter than 35 should not be trusted because they are too short.
- The average alignment length (L) can be longer than the read length when there are contigs present for that taxon.
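The guidance above for libraries containing 150bp sequences can be summarized as a small lookup. This is a rough sketch only: the function name and the returned wording are hypothetical, and the 35 to 75 range, for which no explicit guidance is given above, is simply labeled as falling outside the trusted range.

```python
def describe_alignment_length_150bp(avg_length):
    """Map an average alignment length (L) to the guidance for 150bp libraries."""
    if avg_length < 35:
        return "too short to trust"
    if avg_length < 75:
        return "outside the trusted 75-150 range"
    if avg_length <= 150:
        return "within the trusted 75-150 range"
    # L can exceed the read length when contigs are present for the taxon.
    return "longer than a read; contigs are likely present"


for length in (30, 60, 100, 900):
    print(length, "->", describe_alignment_length_150bp(length))
```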
E-value
The E-value refers to the average expect value of alignments against NCBI NT or NR databases. The E-value is a parameter that describes the number of matches one can "expect" to see by chance when searching a database of a particular size. For example, a raw E-value of 1 would indicate that in a database of the current size one might expect to see 1 match with a similar score simply by chance. Notably, short alignments will have relatively high E-values because the calculation of the E-value takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance.
How to interpret the E-value:
- Low E-values signify higher confidence in the alignment. For example:
  - E-values < 10^-10 point to significant taxon matches
  - E-values > 0 represent non-significant or spurious taxon matches
Viewing Species in the Sample Report Table
By default, the Sample Report Table will show identified taxa at the genus level. To view detected species under a genus, click on the chevron ( > ) button next to the genus name of interest. This will expand the table to show species listed under the genus and species-level alignment metrics.
Notes regarding species-level assignments:
- Metagenomic next-generation sequencing is a non-targeted approach and we rely on existing databases to align metagenomic reads. As a consequence, it is common to see a lot of noise in the sample reports. Luckily, you can use Threshold Filters to filter out some of that noise. See Filter Sample Report to learn how to apply filters.
- When you have a high abundance of reads and contigs matching a genus, it is common to see one or two species with many reads followed by a list of species with only one or two reads each. In most cases, the species with few reads represent spurious matches to species within the genus. Therefore, it is important to pay attention to rPM, the number of contigs, and the quality of those contigs. If you look at the rPM values for all the species under Taenia in the example below (Patient 008), you can see that there is a huge discrepancy between the top two species and the others.
Sorting Sample Report Table
By default, taxa listed in the Sample Report Table are sorted by their Score. You can change this by sorting the table by other metrics. To sort taxa by any of the columns (e.g., rPM), click the chevron icon or the header name of that column. A downward chevron indicates that the column values are sorted from highest to lowest. Click the chevron or header name again to sort the values in the reverse order. The metric (or column header) used to sort the table will be highlighted in blue font.
Taxon Details Panel
If you would like to learn more about a specific taxon, click on the taxon name. Clicking the taxon name will open the Taxon Details Panel on the right-hand side of the page. The panel contains information about the taxon and links to external information sources.
Analysis Options
There are several actions you can perform from the Sample Report Table. Hover over the taxon name of interest at the genus or species level to display a set of analysis icons. Click the different icons to perform the desired action.
Analysis icons from left to right:
1. View coverage visualization for this taxon
2. BLAST contigs or reads associated with this taxon
3. Create a phylogenetic tree for this taxon
4. Assemble consensus genome for this viral taxon
5. Download contigs or reads associated with this taxon in FASTA format
Note: If you select the download icon at the genus level, you will download contig or read FASTA files for all the species under the genus.