Background Models

Jump to Section:

Contamination in mNGS Experiments
What is a CZ ID background model?
How do I create a background model?
How do I apply a background model?
How is the Z-score calculated?
How are standard and mass normalized background models different?
What makes a good background model?

Contamination in mNGS Experiments

Metagenomic next-generation sequencing is a highly sensitive technique for identification of nucleic acid found in a sample. The resulting sequencing reads may originate from numerous sources, including contamination from reagents and laboratory surfaces. The issue of background contamination is well-appreciated within the field of metagenomics, and is especially challenging in analysis of low biomass samples [1,2]. The established best practices for mitigating these challenges focus on inclusion of appropriate laboratory controls in mNGS experiments [3].

What is a CZ ID background model?

To assist with distinguishing microbial signals from reagent and environmental contamination, CZ ID supports background model generation. In CZ ID, a researcher can specify which samples should be included in a background model; these are typically a set of water or other environmental control samples processed alongside the experimental samples. The distribution of relative abundances of each taxon across the background model samples is then computed. You can then use the Z-score metric (see below) to compare your samples to the background model and to evaluate the significance of relative abundance estimates for taxa in your samples.
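As a rough illustration of what "computing the distribution" can mean, the sketch below summarizes each taxon by its mean and standard deviation across the control samples. The function name, the sample data, the choice to count a taxon missing from a control as 0, and the use of the population standard deviation are all assumptions made for this example, not a description of CZ ID's implementation.

```python
from statistics import mean, pstdev


def build_background(control_samples):
    """Summarize per-taxon abundance across control samples.

    control_samples: list of dicts, one per control, mapping taxon -> abundance.
    Returns a dict mapping taxon -> (mean, standard deviation).
    Assumption: a taxon absent from a given control counts as abundance 0.
    """
    taxa = set().union(*control_samples)
    background = {}
    for taxon in taxa:
        values = [sample.get(taxon, 0.0) for sample in control_samples]
        # Assumption: population standard deviation (pstdev), not sample stdev.
        background[taxon] = (mean(values), pstdev(values))
    return background


# Hypothetical water controls with made-up abundance values.
controls = [
    {"Ralstonia": 120.0, "Cutibacterium": 40.0},
    {"Ralstonia": 80.0},
]
background = build_background(controls)
for taxon in sorted(background):
    print(taxon, background[taxon])
```

A taxon that never appears in any control has no entry in the result, which is the case the Z-score section below treats specially.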

Applied Correction Method Recommendations:

For experiments where ERCCs are spiked into all samples and controls, we recommend mass-normalized background models (Applied Correction Method = Normalized by input mass).
For experiments where ERCC spike-ins are not used, or where only some of the samples include them, use the standard background model (Applied Correction Method = Standard). Mass-normalized background models will not be available in this case because that method relies on ERCC counts for normalization.

Note: Background models normalized by input mass are only available for samples run on mNGS Illumina pipeline v 4.0 or later.
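The recommendations above amount to a simple decision rule, sketched below. The helper function and the dictionary keys (has_erccs, pipeline_version) are hypothetical names introduced for illustration; they are not part of CZ ID.

```python
# Hypothetical helper: the function and field names are illustrative only.
MIN_MASS_NORMALIZED_VERSION = (4, 0)  # mNGS Illumina pipeline v4.0, per the note above


def recommend_correction_method(samples):
    """Suggest an Applied Correction Method for a set of samples and controls.

    samples: list of dicts with hypothetical keys
      "has_erccs" (bool) and "pipeline_version" ((major, minor) tuple).
    """
    eligible = bool(samples) and all(
        s["has_erccs"] and s["pipeline_version"] >= MIN_MASS_NORMALIZED_VERSION
        for s in samples
    )
    return "Normalized by input mass" if eligible else "Standard"


with_erccs = [
    {"has_erccs": True, "pipeline_version": (4, 0)},
    {"has_erccs": True, "pipeline_version": (6, 8)},
]
mixed = with_erccs + [{"has_erccs": False, "pipeline_version": (6, 8)}]

print(recommend_correction_method(with_erccs))
print(recommend_correction_method(mixed))
```

A single sample without ERCCs, or one processed on an older pipeline version, is enough to fall back to the standard model.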

How do I create a background model?

You can create background models by selecting control samples from a Project page of interest or from the Samples tab located in the Discovery page where you will see all the samples in your CZ ID account.

To create a background model:

Select the samples you would like to include in the model. You can filter samples to find the appropriate subset (e.g., all samples in the project with Sample Type = Water Control). If you want to discover infecting agents, you should create a background model with water controls and/or known infection-negative samples of the same tissue type. The total number of selected samples will be visible next to the background model icon.
Click the background model icon located above the samples table.
A modal will appear for you to name and describe your background model. Fill in the appropriate information. Use a meaningful Name, because you will need to search for the background model by name when applying it to sample reports or heatmaps. You will notice a dropdown menu titled Applied Correction Method. The options are Standard and Normalized by input mass. Select the desired background model type (see Applied Correction Method Recommendations above). Click the Create button when you are done filling in the background model information.
After clicking Create, scroll down to see a message highlighted in green confirming that the background model has been created. Click out of the modal when you are done to return to the Project page. If your background model includes a large number of samples, it may take a few minutes to generate.
You can now apply the created background model to any Sample Report or Heatmap of interest.

How do I apply a background model?

You can apply background models to Sample Reports and Heatmaps. Once you apply the background model you can add Z-score filters (see threshold filters) to narrow down results to taxa that are more abundant in samples than in negative controls (Z-score > 1).

Sample Report

If no background model is applied to the Sample Report, the Z-score and Score columns will be empty.

To apply a background model, click the Background dropdown menu and select the background of interest. The dropdown menu will indicate which Applied Correction Method was used to create the background model (Standard or Normalized by input mass) under the background model name.

The Score (Aggregate score) and Z-score columns will be populated after applying the background model. The rows including the top three taxa based on Aggregate scores will be highlighted in blue.

Heatmap

When viewing a heatmap you have the option to select a background model using the Background dropdown menu on the left-hand side of the page. Selecting a background model will enable Z-score filters for the heatmap. Additionally, if you want to view the heatmap color scale based on Z-score, you should select the appropriate background to see the correct Z-score values. The heatmap will update automatically upon selection of a background model.

Note: If the heatmap contains samples without ERCCs, mass-normalized background models will not be available in the Background dropdown menu.

How is the Z-score calculated?

The Z-score metric in the CZ ID sample report is based on the abundance of each taxon in the selected background model. When viewing the results for a particular sample, the Z-score indicates whether a particular taxon was also present in the control samples and, if so, how its abundance in the sample compares. Specifically, the Z-score for a taxon T in sample S is computed as follows:
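A sketch of this calculation, assuming the conventional definition of a Z-score, which agrees with the interpretation given below. The symbols are notation introduced here for illustration and are not taken from CZ ID.

```latex
Z_{S,T} = \frac{x_{S,T} - \mu_T}{\sigma_T}
```

Here $x_{S,T}$ is the relative abundance of taxon T in sample S (rPM for a standard background model), and $\mu_T$ and $\sigma_T$ are the mean and standard deviation of the abundance of T across the samples in the background model.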

Interpretation: Taxa present at higher abundance in the sample than in the controls will have a Z-score > 1. Specifically, a Z-score of 1 would indicate that the reads per million (rPM) for the taxon in your sample is one standard deviation greater than the average rPM in samples used to create your background model. If a particular taxon is not found in the set of control samples, then the Z-score will be set to 100. If the taxon is not found in the sample, but is present in the controls, the Z-score will be set to -100. You can use threshold filters in the sample report and heatmap to narrow results to taxa that are present in samples at higher abundance than in negative controls (i.e., NT Z Score > 1).
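The interpretation rules above, including the fixed values of 100 and -100, can be sketched as follows. The function, the background dictionary layout, and the abundance values are hypothetical. How a taxon with zero spread across controls is handled is not described here, so this sketch simply raises an error in that case.

```python
def z_score(sample_rpm, background_stats):
    """Z-score for one taxon, following the rules described above.

    sample_rpm: rPM of the taxon in the sample (0 if not found).
    background_stats: (mean, std) of the taxon's rPM across controls,
      or None if the taxon was not found in any control.
    """
    if background_stats is None:
        return 100.0   # taxon absent from all controls
    if sample_rpm == 0:
        return -100.0  # taxon in controls but not in the sample
    bg_mean, bg_std = background_stats
    if bg_std == 0:
        # Assumption: this case is left undefined in this sketch.
        raise ValueError("background standard deviation is zero")
    return (sample_rpm - bg_mean) / bg_std


# Hypothetical background statistics and sample abundances.
background = {"Ralstonia": (100.0, 20.0), "Cutibacterium": (20.0, 20.0)}
sample = {"Ralstonia": 110.0, "Klebsiella": 900.0}

taxa = set(sample) | set(background)
scores = {t: z_score(sample.get(t, 0.0), background.get(t)) for t in taxa}

# Equivalent of a threshold filter such as "NT Z Score > 1".
enriched = sorted(t for t, z in scores.items() if z > 1)
print(scores)
print(enriched)
```

In this made-up example, Ralstonia sits only half a standard deviation above the control mean and is filtered out, while Klebsiella, absent from the controls, passes the filter.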

Note: The Z-score is used to calculate the Aggregate Score (Score). See Aggregate Score definition within Sample Report Table Metrics.

How are standard and mass normalized background models different?

Standard Background Models

The standard background model provides a generic way to evaluate the presence of contaminants across samples. It uses reads per million (rPM) as the metric of relative abundance. rPM values normalize for sequencing depth, and are computed as shown below:
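A sketch of the rPM calculation, assuming the usual reads-per-million definition. The symbols $r_{S,T}$ and $R_S$ are notation introduced here; exactly which reads count toward the total (for example, before or after quality filtering) is not specified in this text.

```latex
\mathrm{rPM}_{S,T} = \frac{r_{S,T}}{R_S} \times 10^{6}
```

Here $r_{S,T}$ is the number of reads assigned to taxon T in sample S and $R_S$ is the total number of reads in sample S. For example, 250 reads assigned to a taxon in a sample of 5,000,000 reads corresponds to 50 rPM.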

Standard Background Model Z-score formula:
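Written out in terms of rPM, and assuming the conventional Z-score form, the calculation over N background samples might look like the following. The population form of the standard deviation is shown as an assumption; whether CZ ID uses the population or the sample standard deviation is not stated here.

```latex
\mu_T = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rPM}_{i,T}, \qquad
\sigma_T = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \mathrm{rPM}_{i,T} - \mu_T \right)^{2}}, \qquad
Z_{S,T} = \frac{\mathrm{rPM}_{S,T} - \mu_T}{\sigma_T}
```

Here $\mathrm{rPM}_{i,T}$ is the rPM of taxon T in background sample i, and $\mathrm{rPM}_{S,T}$ is the rPM of taxon T in the sample S being evaluated.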

We recommend using a standard background model for samples without ERCC controls, or samples run on mNGS Illumina pipeline versions before 4.0. While we recommend using the mass-normalized background model for experiments where all samples and controls used ERCCs, it is possible to use standard background models for samples with ERCCs.

Mass-normalized Background Models

ERCC spike-in controls (developed by the External RNA Controls Consortium) are a set of 92 standardized RNA transcripts present in varying concentrations ranging from 1.4 × 10⁻² to 3.0 × 10⁻²² mol/L, which may be spiked into an RNA-sequencing library during the library preparation steps. ERCCs are added at a known concentration, and the relationship between the known input concentrations and the number of sequencing reads can be used to inform calculations of total input mass associated with a sequencing library (click here for details on how to find ERCC data). Mass-normalized background models use the relationship between known ERCC concentrations and mass to calculate relative abundance of taxa in negative controls.

In particular:

By normalizing for input mass (mass-normalization), it is possible to account for differences in true relative abundance even when samples contain differing amounts of total input nucleic acid. This provides additional granularity to the analysis and further improves the sensitivity for detecting contamination. Therefore, we recommend using a mass-normalized background model for higher-sensitivity analysis of RNA-seq samples where ERCC controls were used.

Mass-normalized background models use mass per pg of input ERCC as the measure of relative abundance.

Mass-normalized Background Model Z-score (Z) formula:
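A sketch of this formula, assuming it has the same form as the standard Z-score with the mass-normalized abundance substituted for rPM. The symbol $m_{i,T}$ is notation introduced here for the mass-normalized abundance of taxon T in sample i (mass per pg of input ERCC, as described above); how CZ ID derives that quantity from ERCC read counts is not specified in this text.

```latex
Z_{S,T} = \frac{m_{S,T} - \mu^{(m)}_T}{\sigma^{(m)}_T}
```

Here $\mu^{(m)}_T$ and $\sigma^{(m)}_T$ are the mean and standard deviation of $m_{i,T}$ across the samples in the mass-normalized background model.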

What makes a good background model?

Since background models are helpful for identifying and removing contaminants from mNGS datasets, the best background models address the two primary sources of contamination:

External contamination - including reagents, laboratory setting, and research subjects
Internal contamination - cross-contamination during sample preparation and sequencing

Processing water controls alongside experimental samples will help identify external contaminants such as those found in reagents and laboratory equipment. The background model should include at least two water controls taken through the same wet-lab protocol as the actual samples - from DNA/RNA extraction to sequencing. However, the more controls are used in the background model, the more robust it becomes. You can also create a background model for other control samples. You can include sampling controls to account for contaminants originating from sampling techniques as well as control groups for identifying microbes of interest by comparing cases vs healthy controls.

References

[1] Salter, S.J., Cox, M.J., Turek, E.M. et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 12, 87 (2014). https://doi.org/10.1186/s12915-014-0087-z
[2] Zinter, M.S., Mayday, M.Y., Ryckman, K.K. et al. Towards precision quantification of contamination in metagenomic sequencing experiments. Microbiome 7, 62 (2019). https://doi.org/10.1186/s40168-019-0678-6
[3] Davis, N.M., Proctor, D.M., Holmes, S.P. et al. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 6, 226 (2018). https://doi.org/10.1186/s40168-018-0605-2