Contamination in mNGS Experiments
Metagenomic next-generation sequencing (mNGS) is a highly sensitive technique for identifying nucleic acid in a sample. The resulting sequencing reads may originate from numerous sources, including contamination from reagents and laboratory surfaces. The issue of background contamination is well appreciated within the field of metagenomics, and is especially challenging in the analysis of low-biomass samples [1,2]. Established best practices for mitigating these challenges focus on including appropriate laboratory controls in mNGS experiments [3].
What is an IDseq background model?
To assist with distinguishing microbial signals from reagent and environmental contamination, IDseq supports background model generation. In IDseq, a researcher can specify which samples should be included in a background model; these typically include a set of water or other environmental control samples processed alongside the sequencing experiment. The distribution of relative abundances of each taxon across the background model samples is then computed. You can then use the Z-score metric to compare your samples to the background model (see below). The Z-score lets researchers evaluate the significance of relative abundance estimates for taxa in their samples as compared to the background model.
Applied Correction Method Recommendations:
- For experiments where ERCCs are spiked into all samples and controls, we recommend mass-normalized background models (Applied Correction Method = Normalized by input mass).
- For experiments where ERCC spike-ins are not used, or where only some of the samples include them, use the standard background model (Applied Correction Method = Standard). Mass-normalized background models will not be available in this case, since that method relies on ERCC counts for normalization.
Note: background models normalized by input mass are only available for samples run on pipeline v4.0 or later.
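The recommendations above can be summarized as a small decision rule. The Python sketch below is illustrative only and is not IDseq code: the `Sample` class, its field names, and the helper functions are hypothetical, and the rule simply restates the conditions described in this article (ERCC reads present in every sample, and pipeline version 4.0 or later).

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for an IDseq sample; not an IDseq data structure."""
    name: str
    ercc_reads: int
    pipeline_version: str  # e.g. "4.0"


def _version_tuple(version):
    """Turn '4.0' into (4, 0) so versions compare numerically."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) < 2:
        parts.append(0)  # treat "4" the same as "4.0"
    return tuple(parts)


def mass_normalization_available(samples):
    """True only if every sample has ERCC reads and ran on pipeline >= 4.0."""
    return bool(samples) and all(
        s.ercc_reads > 0 and _version_tuple(s.pipeline_version) >= (4, 0)
        for s in samples
    )


def recommended_correction_method(samples):
    """Return the Applied Correction Method suggested by the rules above."""
    if mass_normalization_available(samples):
        return "Normalized by input mass"
    return "Standard"
```

For example, a selection in which one water control has no ERCC reads falls back to `"Standard"`, even if every other sample contains ERCCs.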
How do I create a background model?
To create a new background model, navigate to the My Data section of IDseq. You can either navigate into a project or select the Samples tab to see all of your IDseq samples.
Select the samples you would like to include using the blue check-mark to the left of the sample name.
Filter samples to find the appropriate subset (e.g. all samples in the project with Sample Type = Water Control). If you want to discover infecting agents, you should create a background model with water controls and/or known infection-negative samples of the same tissue type.
The total number of selected samples will be visible above the table in the right-hand corner.
Click on the background model icon (overlapping squares) when you are happy with your selection.
A modal will appear with options to name and describe your background model.
You will notice a dropdown menu titled Applied Correction Method. The options are Standard and Normalized by input mass. Select the desired background model type (see the usage recommendations above and the comparison of the two model types below).
Note that if any of the selected samples do not contain ERCC controls (ERCC Reads = 0) or were run on a pipeline version before 4.0, the option for Normalized by input mass will be unavailable.
Review your selected samples and click Create.
Click out of the modal when you are done to return to your project screen. If your background model includes a large number of samples it may take a few minutes to generate.
Once background generation is complete, you can use your background model on the heatmap and the single sample report.
How do I use a background model for data interpretation?
When looking at a single sample you can select your new background model from the Background dropdown filter.
The dropdown menu will indicate which Applied Correction Method was used in creating the background model: Standard or Normalized by input mass.
Select the new background model to update the Z-score and the aggregate score in the table.
When viewing a heatmap you also have the option to select a background model. This will affect the Z-score values used for filtering (e.g. Threshold Filter, NT Z >= 1). Additionally, if you want to view the Z-score on the heatmap, you can change the Metric selection to NT Z-score or NR Z-score. The heatmap will update automatically upon selection of a background model.
Note that if the heatmap contains samples without ERCCs, the mass-normalized background models will not be available in the dropdown menu.
How is the Z-score calculated?
The Z-score metric in the IDseq sample report is based on the prevalence of each taxon in the selected background model. When viewing the results from a particular sample, the Z-score metric can be used to provide insight into whether a particular taxon was present in the control samples. Specifically, the Z-score for a taxon T in sample S is computed as follows:
Z-score(T, S) = (abundance of T in S - mean abundance of T across the background samples) / (standard deviation of the abundance of T across the background samples)
Taxa present at higher abundance in the sample than the background mean will have a positive Z-score, and a Z-score > 1 indicates an abundance more than one standard deviation above that mean. If a particular taxon is not found in the set of control samples, the Z-score is set to 100. If the taxon is not found in the sample but is present in the controls, the Z-score is set to -100.
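To make these rules concrete, here is a minimal Python sketch. It assumes the usual Z-score definition (sample value minus the background mean, divided by the background standard deviation) together with the two special cases described above. The function name and constants are hypothetical, and the choice of the sample standard deviation (`statistics.stdev`) is an assumption, not something this article specifies.

```python
from statistics import mean, stdev

ABSENT_FROM_BACKGROUND = 100.0   # taxon never seen in the control samples
ABSENT_FROM_SAMPLE = -100.0      # taxon seen in controls but not in the sample


def z_score(sample_abundance, background_abundances):
    """Z-score of one taxon in one sample against a background model.

    sample_abundance: relative abundance of the taxon in the sample.
    background_abundances: the same metric for each background sample.
    """
    if len(background_abundances) < 2:
        raise ValueError("need at least two background samples")
    if not any(value > 0 for value in background_abundances):
        return ABSENT_FROM_BACKGROUND
    if sample_abundance == 0:
        return ABSENT_FROM_SAMPLE
    sigma = stdev(background_abundances)  # assumption: sample standard deviation
    if sigma == 0:
        # Assumption: this sketch does not define the constant-background case.
        raise ValueError("Z-score undefined when background values are identical")
    return (sample_abundance - mean(background_abundances)) / sigma
```

With background abundances of 10 and 20 and a sample abundance of 30, the background mean is 15 and the sample standard deviation is about 7.07, so the sketch returns a Z-score of roughly 2.12.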
Practically speaking, one approach to narrowing results to account for contamination is to apply a threshold filter on the Z-score (for example, NT Z >= 1).
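If you export report data and want to apply the same kind of filter outside IDseq, a threshold filter might look like the following Python sketch. The row layout and the `nt_z_score` key are hypothetical and do not reflect the IDseq export format.

```python
def apply_z_score_threshold(taxa, metric="nt_z_score", threshold=1.0):
    """Keep only taxa whose Z-score for the given metric meets the threshold."""
    return [row for row in taxa if row[metric] >= threshold]


# Hypothetical report rows for illustration only.
report = [
    {"taxon": "Taxon A", "nt_z_score": 100.0},   # absent from the background model
    {"taxon": "Taxon B", "nt_z_score": 0.4},     # close to background levels
    {"taxon": "Taxon C", "nt_z_score": 3.2},     # well above background
    {"taxon": "Taxon D", "nt_z_score": -100.0},  # only seen in the controls
]

kept = apply_z_score_threshold(report)
print([row["taxon"] for row in kept])  # ['Taxon A', 'Taxon C']
```

Raising the threshold makes the filter stricter; taxa absent from the background model (Z-score of 100) pass any threshold up to 100.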
How are standard and mass-normalized background models different?
Standard Background Models
The standard background model provides a generic way to evaluate the presence of contaminants across samples. It uses reads per million (rpm) as the metric of relative abundance. rpm values normalize for sequencing depth: the number of reads assigned to a taxon is scaled to a total of one million sequenced reads.
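As a rough illustration of the rpm calculation, the Python sketch below scales a taxon's read count to one million reads. The function name is hypothetical, and exactly which reads count toward the total (for example, whether ERCC reads are excluded) is an assumption that this sketch leaves to the caller.

```python
def reads_per_million(taxon_reads, total_reads):
    """Scale a taxon's read count to a sequencing depth of one million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return taxon_reads * 1_000_000 / total_reads
```

For instance, 250 reads assigned to a taxon out of 5,000,000 total reads corresponds to 50 rpm, which makes samples sequenced to different depths directly comparable.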
Standard Background Model Z-score formula:
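A sketch of this formula in LaTeX, assuming the usual Z-score form applied to rpm values. The symbols are chosen here for illustration: rpm_{S,T} is the rpm of taxon T in sample S, B_1 ... B_N are the N background samples, and mu_T and sigma_T are the mean and standard deviation of the rpm of T across those background samples.

```latex
\[
Z_{S,T} = \frac{\mathrm{rpm}_{S,T} - \mu_{T}}{\sigma_{T}},
\qquad
\mu_{T} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rpm}_{B_i,T}
\]
```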
We recommend using a standard background model for samples without ERCC controls, or samples run on pipeline versions before 4.0. While we recommend using the mass-normalized background model for experiments where all samples and controls used ERCCs, it is possible to use standard background models for samples with ERCCs.
Mass-normalized Background Models
ERCC spike-in controls (External RNA Controls Consortium) are a set of 92 standardized RNA transcripts present at varying concentrations, ranging from 1.4 × 10⁻² to 3.0 × 10⁻²² mol/L, which may be spiked into an RNA-sequencing library during the library preparation steps. ERCCs are added at a known concentration, and the relationship between the known input concentrations and the number of sequencing reads can be used to inform calculations of the total input mass associated with a sequencing library.
By normalizing for input mass (mass-normalization), it is possible to account for differences in true relative abundance even when samples contain differing amounts of total input nucleic acid. This provides additional granularity to analysis and further improves the sensitivity for detecting contamination.
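One back-of-the-envelope way to see how ERCC reads can inform an input mass estimate is sketched below in Python. It assumes that read counts are proportional to input mass, so the ratio of non-ERCC reads to ERCC reads scales the known ERCC mass. This is an illustrative simplification, not a description of the IDseq pipeline, and the function and parameter names are hypothetical.

```python
def estimate_input_mass_pg(ercc_mass_pg, ercc_reads, non_ercc_reads):
    """Estimate the non-ERCC input mass (pg) from the ERCC spike-in.

    Assumes reads are proportional to mass:
        non_ercc_mass / ercc_mass == non_ercc_reads / ercc_reads
    """
    if ercc_reads <= 0:
        raise ValueError("ERCC reads are required for mass estimation")
    return ercc_mass_pg * non_ercc_reads / ercc_reads
```

For example, if 25 pg of ERCC yields 100,000 reads and the rest of the library yields 900,000 reads, the estimated input mass is 225 pg. A sample with no ERCC reads cannot be handled this way, which is consistent with mass-normalized models being unavailable for such samples.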
Mass-normalized background models use mass per picogram (pg) of input ERCC as the measure of relative abundance.
Mass-normalized Background Model Z-score formula:
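A sketch of this formula in LaTeX, assuming the same Z-score form as the standard model with the mass-normalized abundance in place of rpm. The symbols are chosen here for illustration: m_{S,T} is the mass-normalized abundance of taxon T in sample S (per pg of input ERCC, as described above), and the mean and standard deviation are taken over the mass-normalized abundances of T in the background samples.

```latex
\[
Z_{S,T} = \frac{m_{S,T} - \mu^{(m)}_{T}}{\sigma^{(m)}_{T}}
\]
```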
We recommend using a mass-normalized background model for higher-sensitivity analysis of RNA-seq samples where ERCC controls were used.
[1] Salter, S.J., Cox, M.J., Turek, E.M. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 12, 87 (2014). https://doi.org/10.1186/s12915-014-0087-z
[2] Zinter, M.S., Mayday, M.Y., Ryckman, K.K. et al. Towards precision quantification of contamination in metagenomic sequencing experiments. Microbiome 7, 62 (2019). https://doi.org/10.1186/s40168-019-0678-6
[3] Davis, N.M., Proctor, D.M., Holmes, S.P. et al. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018). https://doi.org/10.1186/s40168-018-0605-2