Identified taxa through the mNGS pipeline will be summarized within an interactive Sample Report Table. Below we describe features of the Sample Report for the short read (Illumina) mNGS pipeline. Click here if you are interested in learning about the Sample Report for long read (Nanopore) mNGS data.
We describe the Sample Report Table using an example from the Medical Detectives public project. To follow along, please go to the Sample Report page for Patient 008 (CSF). You should be able to see the Table View (default) for Patient 008 (CSF) Sample Report.
The Table View provides a report containing several fields indicating the relative abundance of genera and species identified in the sample you have selected, along with scores characterizing the quality of the match.
Sample Report Table Intro
The Sample Report Table lists the taxa that were found in your sample when matched against NCBI GenBank. We will walk you through filtering data to reduce noise, understanding the metrics and how to interpret them, and assessing the strength of the evidence for a taxon using the coverage visualization.
Close the Sample Details side panel by clicking the X in the right-hand corner. You can now explore the Report Table which shows the list of taxa CZ ID identified in your sample.
Metrics and Their Meanings
| Metric | Definition |
|---|---|
| Score | CZ ID's heuristic for ranking microbial hits. The score is intended to combine the following aspects of the evidence for a hit: (a) species-level information, (b) genus-level information, (c) information about relative abundance within the sample, (d) information about abundance relative to the chosen background controls. The score is calculated as follows: (abs(genus NT Z) * species NT Z * species NT rPM) + (abs(genus NR Z) * species NR Z * species NR rPM) |
| Z | Z-score statistic, used for evaluating prevalence of microbes in your sample as compared to background controls. The Z-score is computed based on the specified background model. |
| rPM | Number of reads aligning to the taxon in the NCBI NT/NR database, per million reads sequenced |
| r | Number of reads aligning to the taxon in the NCBI NT/NR database |
| contig | Number of assembled contigs aligning to the taxon in the NCBI NT/NR database |
| contig r | Total number of reads aligning to all assembled contigs for this taxon |
| %id | Average percent-identity of alignments to NCBI NT/NR |
| L | Average length of the local alignment for all contigs and reads assigned to this taxon |
| E value | Average Expect value (e-value) of alignments to NCBI NT/NR. The Expect value (e-value) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The closer to 0 the better. |
NT is NCBI’s database of nucleotide sequences. The NCBI NT database is a collection of sequences from several sources, including GenBank, RefSeq, TPA, and PDB.
NR is NCBI’s database of non-redundant protein sequences.
- It is possible for a taxon to contain high NT counts and low NR counts; this generally occurs when many of the reads aligning to that taxon come from rRNA genes, which are present in the NT database, but not the NR database.
- It is possible for a taxon to contain high NR counts and low NT counts; this may occur when a sample contains sequences from a divergent virus that may map via the more conserved protein (NR) sequence, but not via the more mutated nucleotide (NT) sequence.
- Currently, a single read may be assigned to two different taxa in the NT and NR databases.
Aggregate Score (Score)
Score is CZ ID's empirical heuristic for ranking microbial hits. The score is intended to combine the following aspects of the evidence for a hit: (a) species-level information, (b) genus-level information, (c) information about relative abundance within the sample, (d) information about abundance relative to the chosen background controls. By default, the report page sorts taxa by the Aggregate Score (Score) value.
Score = (abs(genus NT Z) x species NT Z x species NT rPM) + (abs(genus NR Z) x species NR Z x species NR rPM)
A high aggregate score indicates that many reads matched that taxon in the NT and NR databases and that the taxon is less prevalent in the selected background model.
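The score formula above can be sketched in Python as follows. This is a minimal illustration, not CZ ID's actual implementation: the function name, parameter names, and input values are hypothetical.

```python
def aggregate_score(genus_nt_z, species_nt_z, species_nt_rpm,
                    genus_nr_z, species_nr_z, species_nr_rpm):
    """Combine NT and NR evidence following the Score formula in this section."""
    # Each term multiplies the absolute genus-level Z-score by the
    # species-level Z-score and the species-level rPM.
    nt_term = abs(genus_nt_z) * species_nt_z * species_nt_rpm
    nr_term = abs(genus_nr_z) * species_nr_z * species_nr_rpm
    return nt_term + nr_term


# Hypothetical values, not taken from any real report.
example_score = aggregate_score(
    genus_nt_z=2.0, species_nt_z=3.0, species_nt_rpm=10.0,
    genus_nr_z=1.5, species_nr_z=2.0, species_nr_rpm=4.0,
)
print(example_score)  # 2*3*10 + 1.5*2*4 = 72.0
```

Because both the Z-score and the rPM enter each term as factors, a taxon needs to be both abundant in the sample and unusual relative to the background to rank highly.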
Z-score (Z)
The Z-score is a statistic used to evaluate the prevalence of microbes in your sample compared to background controls. The Z-score is computed based on the specified background model.
It is important to understand what a background model is (see the background model discussion later in this article) and to choose a relevant one.
How to Interpret the z-score:
- If you see a Z-score of 100, the taxon does not appear in any of the samples in your selected background model.
- If you see a Z-score of -100, the taxon was not found in your sample for that database. You will only see a Z-score of -100 if the taxon matched in either the NT or NR database but not the other.
- A Z-score of 1 would indicate that the amount of rPM of the taxon in your sample is one standard deviation greater than the average rPM in your selected background model.
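The interpretation rules above can be illustrated with a small Python sketch. It assumes the usual definition of a Z-score (sample value minus background mean, divided by background standard deviation) and uses the population standard deviation; whether CZ ID uses the population or sample form is not stated here, so treat that choice, and all names below, as assumptions.

```python
from statistics import mean, pstdev

ABSENT_FROM_BACKGROUND = 100   # taxon not seen in any background sample
ABSENT_FROM_SAMPLE = -100      # taxon has no reads in the sample for this database


def z_score(sample_rpm, background_rpms):
    """Simplified Z-score of a taxon's rPM against background-sample rPMs."""
    if sample_rpm == 0:
        return ABSENT_FROM_SAMPLE
    if not any(background_rpms):
        return ABSENT_FROM_BACKGROUND
    mu = mean(background_rpms)
    sigma = pstdev(background_rpms)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("background has zero variance; Z-score is undefined")
    return (sample_rpm - mu) / sigma


# Hypothetical background rPMs with mean 2.0 and standard deviation 1.0.
print(z_score(3.0, [1.0, 3.0]))   # one standard deviation above the mean
print(z_score(5.0, [0.0, 0.0]))   # taxon absent from the background
```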
Reads per Million (rPM)
rPM (reads per million) refers to the number of reads aligning to the taxon in the NCBI NT/NR database, per million reads sequenced. We use rPM to normalize read counts across samples.
When comparing samples, we use rPM as a scaled metric of abundance. Each time we run a sequencing experiment, we may obtain a different number of total reads. To normalize the values across experiments, we look at the rPM instead of the raw read count. rPM is a standard metric in bioinformatics analyses and is computed as follows:
rPM = (reads aligning to the taxon / total reads sequenced) x 1,000,000
Even when looking at a single sample, it can be helpful to sort by the highest rPM value after the aggregate score.
How to interpret Reads per Million:
- The rPM associated with each taxon indicates the relative abundance of nucleic acid associated with this taxon present in the sample.
- The rPM value provides a metric that enables comparison of relative abundances across samples sequenced to different total sequencing depths.
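As a minimal sketch of this normalization, the Python snippet below computes rPM from a raw read count and a sequencing depth, following the definition given in this section. The function name and the read counts are hypothetical.

```python
def reads_per_million(taxon_reads, total_reads):
    """Reads aligning to a taxon, scaled per million reads sequenced."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Multiply first so that round numbers stay exact in floating point.
    return taxon_reads * 1_000_000 / total_reads


# Two hypothetical samples sequenced to different depths.
shallow = reads_per_million(taxon_reads=500, total_reads=10_000_000)
deep = reads_per_million(taxon_reads=1_000, total_reads=20_000_000)
print(shallow, deep)  # both 50.0
```

The deeper sample has twice the raw reads for the taxon, yet both samples have the same rPM, which is why rPM rather than raw read count is used for cross-sample comparison.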
Reads (r)
Total number of reads aligning to the taxon in the NCBI NT/NR database.
How to interpret Reads:
- The number of reads associated with each taxon indicates the relative abundance of nucleic acid associated with this taxon present in the sample.
- If a taxon has a small number of reads, there was likely relatively little of this organism present in the sample.
- If a taxon has a larger number of reads, this organism was likely more abundant within the sample.
Contigs (contig)
Number of assembled contigs aligning to the taxon in the NCBI NT/NR database.
Good to know:
A higher number of contigs does not necessarily mean increased confidence in the result. Assembly aims to take raw reads and generate longer sequences; therefore, it is possible to obtain a full genome sequence in a single contig. Look at Contig r to see how many reads are associated with all contigs.
How to interpret the number of contigs:
- Few contigs with a high Contig r value (see below) indicate high-quality assemblies.
- Many contigs with a relatively low Contig r value (see below) indicate that the assembly step may not have improved the hit-calling significantly beyond the raw reads.
Contig r (Contig Reads)
Total number of reads aligning to all assembled contigs for this taxon.
How to interpret Contig r:
- High Contig r values indicate that many reads were associated with contigs, which improves confidence in the assembled contig.
Identity Match (%id)
Average percent-identity of the reads and contigs that aligned to this taxon in the NCBI NT/NR database.
How to interpret %id:
- For high-confidence alignments to taxa present in the NCBI databases, we expect to see relatively high percent identity matches (i.e., > 90% identity) to the reference sequences.
- Novel pathogens will sometimes have a lower %id. You can check whether such a taxon is really present by downloading the reads and contigs and investigating the quality of the alignments via BLAST.
Average Length (L)
L is the average length of the local alignment for all contigs and reads assigned to this taxon.
How to Interpret the Length:
- A high %id combined with a relatively long alignment increases confidence in a hit.
- For libraries containing 150 bp sequences, we would trust alignments with an L value in the range of 75 to 150, with 100 being a good value and longer alignments increasing confidence.
- Regardless of raw input read length, alignments with an L value under 35 should not be trusted because they are too short.
- The average alignment length (L) can be longer than the read length when there are contigs present for that taxon.
E value
Average Expect value (e-value) of alignments to NCBI NT/NR. The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. For example, a raw E value of 1 would indicate that in a database of the current size one might expect to see 1 match with a similar score simply by chance. Notably, short alignments will have relatively high E values because the calculation of the E value takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance.
How to interpret the e value:
- Low E values signify higher confidence in the alignment. For example:
  - <10^-10 is a good match
  - >0 is a bad match
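To illustrate why short alignments tend to have higher E values, the sketch below uses the standard BLAST relationship between bit score and E value, E = m x n x 2^(-S), where m is the query length, n is the database length, and S is the bit score. This is a simplification (BLAST uses effective lengths), and the database size, lengths, and bit scores are hypothetical values chosen only for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2^(-S)."""
    return query_length * database_length * 2.0 ** (-bit_score)


DATABASE_LENGTH = 1_000_000_000  # hypothetical database size in residues

# A short alignment can only reach a modest bit score...
short_e = expect_value(bit_score=40, query_length=50,
                       database_length=DATABASE_LENGTH)
# ...while a long, high-identity alignment reaches a much higher one.
long_e = expect_value(bit_score=200, query_length=150,
                      database_length=DATABASE_LENGTH)

print(f"short alignment E = {short_e:.3g}")
print(f"long alignment  E = {long_e:.3g}")
```

Each additional bit of score halves the E value, so the long alignment ends up many orders of magnitude below the 10^-10 guideline while the short one does not.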
The report for Patient 008 (CSF) is below.
Each computed value, except for the score, has been computed separately for the NT and NR reads. To toggle which value is bolded, select the NT/NR button in the right-hand corner of the table.
The Blue Lightbulb icon may show up next to a species in your report. In Patient 008's report, the Blue Lightbulb shows up under the Taenia genus. The lightbulb draws your eye to taxa that have passed several thresholds, which suggests they may be true hits, so that you notice these rows in the report and investigate further. The lightbulb highlights the top 3 species that have NT rPM > 1, NR rPM > 1, NT Z-score > 1, and NR Z-score > 1.
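The lightbulb thresholds can be expressed as a simple filter. The Python sketch below is illustrative only: the row layout and field names are hypothetical, and ranking the passing species by Score to pick the "top 3" is an assumption, since the ranking criterion is not specified here.

```python
def lightbulb_species(species_rows, top_n=3):
    """Return names of up to top_n species passing all four thresholds."""
    passing = [
        row for row in species_rows
        if row["nt_rpm"] > 1 and row["nr_rpm"] > 1
        and row["nt_z"] > 1 and row["nr_z"] > 1
    ]
    # Assumption: "top" means highest aggregate Score.
    passing.sort(key=lambda row: row["score"], reverse=True)
    return [row["name"] for row in passing[:top_n]]


# Hypothetical rows, not taken from any real report.
rows = [
    {"name": "species A", "score": 900.0, "nt_rpm": 50.0, "nr_rpm": 40.0, "nt_z": 100.0, "nr_z": 100.0},
    {"name": "species B", "score": 20.0, "nt_rpm": 0.5, "nr_rpm": 3.0, "nt_z": 5.0, "nr_z": 4.0},
    {"name": "species C", "score": 300.0, "nt_rpm": 5.0, "nr_rpm": 2.0, "nt_z": 3.0, "nr_z": 2.0},
    {"name": "species D", "score": 100.0, "nt_rpm": 4.0, "nr_rpm": 2.0, "nt_z": 2.0, "nr_z": 2.0},
    {"name": "species E", "score": 50.0, "nt_rpm": 3.0, "nr_rpm": 2.0, "nt_z": 2.0, "nr_z": 2.0},
]
flagged = lightbulb_species(rows)
print(flagged)  # species B fails the NT rPM threshold; species E ranks fourth
```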
By default, the report page sorts all taxa by the Score column. As you can see, Taenia and Echinococcus are the top hits for Patient 008.
To sort all taxa by any of the columns (including rPM), click on the particular column header. If you do that while viewing Patient 008’s Report Table, you will see that changing the sorting actually doesn’t change the top hits.
Species Level Report View
Next, click on the > button next to Taenia to expand the table and view the species-level metrics.
When a genus is highly abundant, it is common to see one or two species with many reads, followed by a list of species with only one or two reads each (as you see here).
Metagenomic next-generation sequencing is highly sensitive and we rely on existing databases to align metagenomic reads. Because of this, it is common to see a lot of noise in our reports. Luckily, we can filter out some of that noise using Threshold Filters. You will see this when we reduce the number of species in the next step.
When looking at species within a genus, remember to pay attention to rPM, total contigs, and the quality of those contigs. If you look at the rPM values for the species under Taenia, you can see that there is a huge discrepancy between the top two hits and the other 16. In the next section, we will discuss how filtering can help refine your report and remove spurious hits.
The Z-score indicates how abundant a taxon is in your sample relative to the selected background model. You can select which background model you want at the top of the Report Table. While there is no default background model, if you don't know which model to use or don't have your own water controls from which to construct a new background model, NID Human CSF V3 may be a good choice. NID Human CSF V3 comes from a curated list of over 400 healthy cerebrospinal fluid (CSF) samples. Healthy CSF should contain no human or pathogen cells, so it is comparable to water, and the background contaminants present in this set reflect those commonly found in water.
To create your own background model you will have to build a collection on the project page.
Selecting a new background model immediately updates the Z-score column on the report table. In turn, this updates the aggregate Score. Taxa with higher Z-scores also have higher aggregate Scores. The Z-score can help to highlight uncommon pathogens in your sample.
- If you see a Z-score of 100, the taxon does not appear in any of the samples in your selected background model.
- If you see a Z-score of -100, the taxon was not found in your sample for that database. You will only see a Z-score of -100 if the taxon matched in either the NT or NR database but not the other.
- A Z-score of 1 would indicate that the rPM of the taxon in your sample is one standard deviation greater than the average rPM in your selected background model.
To see how the total reads of one taxon in your sample differ from the background samples, click on a taxon name with a Z-score below 100 and above -100 (e.g., Drosophila in the Patient 008 report).
Taxon Details Panel
Clicking on the taxon name pulls out a side panel with information about the taxon, links to external information sources, and a Z-score graph. (Because Taenia is not present in our background, it is assigned a Z-score of 100 and no Z-score graph is shown for it.)
The dotted purple line in the Z-score graph shows the number of NT reads in your sample, and the purple bars show the reads in the background samples. In this case, you can see that this sample had many more NT reads of Drosophila than the average in the background samples.
There are other actions you can perform for each genus and species in the Report Table. Hover over the taxon name to display a set of five icons (see below).
Icons from left to right:
- View this genus or species in the NCBI Taxonomy browser
- Download the fasta file
- Download the contigs
- View the coverage visualization (more below)
- Create a phylogenetic tree from this read
If you download the contigs or FASTA file at the genus level, you will get the FASTA files and contigs for all species in that genus. In the next section, we will dive deep into the coverage visualization, which gives you another way to validate your reads and explore your data.