INTRO
So you’ve uploaded your data to CZ ID...Now what?
In this document, we’ll walk you through the steps to begin analyzing your data. Drawing conclusions from your data is part of the research process, so expect it to take time and patience. Don’t be discouraged, and if you have questions, feel free to reach out to our team.
OVERVIEW
Once your data has been uploaded, the first step is to perform a quality control check. Once the quality of the samples is deemed satisfactory, you can begin analyzing your data. While you are free to dive into your data however you like, we recommend creating a background model and familiarizing yourself with the filters to help identify background contamination and remove spurious hits, respectively.
Part One: Preparing for data analysis: QC and background model
When we refer to quality control (QC), we are talking about assessing the quality of the data generated from a sequenced sample run through the CZ ID pipeline, in which the raw reads are filtered, assembled, and aligned to known taxa (see below diagram). Understanding how a sample was processed in the pipeline can not only teach you more about the pipeline, but also gives you the context to start answering more interesting questions about what is a real hit in your sample.
GET FAMILIAR WITH THE PROJECT PAGE
- In the ‘My Data’ tab, you should see a list of projects. Project names and associated columns can be sorted one at a time by clicking the caret next to the column header.
- Navigate to your project page. In the upper left-hand corner, you should see the project title in bold. In the table, you should see all of your samples for this project.
- Review the meaning of the QC metrics available in the help center. Each of these metrics is a tool that will allow you to better understand the sample pipeline run.
- Click on + to the right of the column names and make sure the following columns are visible: Sample Type, Total Reads, Passed QC, Passed Filters, ERCC Reads, DCR. Samples can also be sorted by column data.
QC ANALYSIS
Our next step will be to identify what we’ll call “High-Quality” and “Low-Quality” samples. This exercise will help you start to view trends and analyze the performance of your samples in the pipeline. Note that even “Low-Quality” samples can be used for downstream analysis, but understanding their QC properties can help you identify ways to improve library prep and maximize the available data.
- Navigate to the sample QC page by clicking the bar chart icon.
- To view the samples associated with the histogram, click a bar and the samples will populate the pane on the right.
- For each metric, note the names of the Samples that fit each criterion. Briefly, the goal at this stage is to look for low or high-value outliers in each QC metric, as these may indicate low or high-quality samples. Make sure to read the “Sample QC” article in the Help Center to understand more.
- Based on your notes, highlight or mark as “LQ” any sample names that you think might be low-quality.
- Repeat with any samples that you think may be high-quality, marking them as “HQ”.
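If you export a QC metric from the project table, the outlier hunt described above can be sketched in a few lines of Python. This is only an illustration: the sample names and values are made up, and the interquartile-range rule used here is a generic statistical convention, not something CZ ID applies for you.

```python
from statistics import quantiles


def flag_outliers(values_by_sample, k=1.5):
    """Label samples whose metric falls outside the Tukey fences (IQR rule)."""
    values = list(values_by_sample.values())
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the metric
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    flags = {}
    for sample, value in values_by_sample.items():
        if value < low_fence:
            flags[sample] = "low"
        elif value > high_fence:
            flags[sample] = "high"
    return flags


# Hypothetical "Passed QC" percentages copied from the project table.
passed_qc = {"S1": 92.0, "S2": 90.5, "S3": 94.1, "S4": 91.3, "S5": 45.0, "S6": 93.2}
outliers = flag_outliers(passed_qc)
print(outliers)
```

With only a handful of samples the fences are rough, so treat the flags as a prompt to look at the histogram, not as a verdict.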
INVESTIGATE WHAT HAPPENED TO LOW-QUALITY SAMPLES
Note: Your “low-quality” samples may still contain useful information for analysis. However, because of the flags we noted in the previous step, it’s best to investigate what happened to these samples so you can improve the preparation of future samples and libraries.
- Choose any sample from the group of potentially Low-Quality samples you identified.
- View the Reads Lost chart. Here you will see a step-by-step breakdown of where reads were lost in the pipeline.
- Note the steps at which you lost a large proportion of reads.
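To make "a large proportion" concrete, you can compare the reads remaining after each step with the reads remaining after the step before it. The sketch below assumes you have copied the counts from the chart by hand; the step names and numbers are hypothetical and will not match your pipeline run.

```python
def fraction_lost_per_step(reads_remaining):
    """Return (step, fraction of the previous step's reads lost) pairs."""
    losses = []
    for (_, before), (step, after) in zip(reads_remaining, reads_remaining[1:]):
        lost = (before - after) / before if before else 0.0
        losses.append((step, lost))
    return losses


# Hypothetical step names and counts, in pipeline order.
reads_remaining = [
    ("Initial reads", 1_000_000),
    ("Quality filtering", 900_000),
    ("Host filtering", 90_000),
    ("Duplicate removal", 81_000),
]

losses = fraction_lost_per_step(reads_remaining)
worst_step, worst_fraction = max(losses, key=lambda item: item[1])
print(worst_step, round(worst_fraction, 2))
```

In this made-up example most reads disappear at host filtering, which would point to a sample dominated by host material.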
If samples were spiked with ERCC controls, the counts can be used to identify a problem with library preparation. If you have samples that were spiked with ERCC controls, choose one and follow the below steps.
- Click on the sample that you would like to investigate.
- Click on “Sample Details”, the blue text at the top of the page, just below the sample name. There should now be a panel on the right.
- Click “Pipeline”.
- Click on the ERCC Spike-In Counts heading and answer the following questions.
a) How many dots are on the graph? Each dot indicates one ERCC. We want to see that many of these spike-in control sequences were identified by the pipeline.
b) Estimate the slope of the line. You spiked in a mix of ERCCs containing 92 individual sequences at varying concentrations. A slope close to 1 indicates an even distribution of spike-in controls. If few ERCCs are detected or the slope is far from 1, this can indicate a problem with library prep or a large amount of input RNA that overwhelmed the sample.
- Summarize your findings.
- Repeat Part One for a few other low-quality samples in your project.
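If you would rather compute the ERCC slope than estimate it by eye, a least-squares fit does the job. The sketch below assumes the graph compares expected spike-in concentration with observed read counts on logarithmic axes; that assumption, the function name, and the example numbers are ours, so check them against your own plot.

```python
import math


def loglog_slope(expected, observed):
    """Least-squares slope of log10(observed) against log10(expected)."""
    # Only ERCCs that were actually detected can be placed on a log scale.
    pairs = [
        (math.log10(e), math.log10(o))
        for e, o in zip(expected, observed)
        if e > 0 and o > 0
    ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two detected ERCCs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


# Hypothetical values: observed counts scale evenly with concentration.
expected = [1, 10, 100, 1000]
observed = [3, 30, 300, 3000]
slope = loglog_slope(expected, observed)
print(round(slope, 3))
```

The number of pairs that survive the `e > 0 and o > 0` check is also the number of dots you would count in question (a).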
CREATE A BACKGROUND MODEL
Since mNGS is so sensitive, it is possible to find many reads mapping to background contaminants. When looking at the heatmap, it becomes clear that even water controls are not purely water - they often contain some organisms. These are what we refer to as “background contaminants”. You can specify which samples should be included in a background model - these typically include a set of water or other environmental control samples processed alongside your sequencing experiments.
In order to account for contamination in the analysis, we can create a background model and then evaluate the difference in rPM of those organisms in our samples versus in the water controls.
- Note! If you are working on a project with a group, please only have one person create the background model by following these steps. Once it is created, all members of your project will be able to access it.
- Navigate back to your project page. In the upper left-hand corner, you should see the project title in bold. In the table, you should see all of your samples for this project.
- Select all of the water controls in your project by clicking on the checkbox to the left of the sample name. If you have included which samples were water controls in your metadata, you can use the ‘sample type’ filter on the left panel to filter for ‘water control’.
- Click on the Save a Background Model button. This will show a modal in which you are prompted to enter a name for the background model and an optional description. We suggest naming your background model “[Project Name] Water Controls [Date]”.
- There will be two options:
- Standard
- Normalize by input mass - recommended if ERCC spike-in controls were used. (Note that if any of the selected samples do not contain ERCC controls (ERCC reads = 0) or were run on a pipeline version earlier than 4.0, the option for 'Normalize by input mass' will not be available.)
- Click on Create to save and start the creation of the background model.
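To build intuition for what the background model is used for, here is a minimal sketch of a Z score: how far a taxon's rPM in one sample sits from the mean rPM of that taxon in the controls, in units of the controls' standard deviation. This is a simplified illustration with made-up numbers; the exact calculation CZ ID performs (for example, how it handles taxa absent from the controls) may differ.

```python
import math
from statistics import mean, stdev


def z_score(sample_rpm, background_rpms):
    """Standard score of a sample's rPM relative to the control samples."""
    mu = mean(background_rpms)
    sigma = stdev(background_rpms)  # sample standard deviation; needs >= 2 controls
    if sigma == 0:
        # All controls agree exactly, so any difference is unbounded.
        if sample_rpm == mu:
            return 0.0
        return math.copysign(math.inf, sample_rpm - mu)
    return (sample_rpm - mu) / sigma


# Hypothetical rPM values for one taxon in four water controls.
water_rpms = [4.0, 6.0, 5.0, 5.0]

contaminant_like = z_score(5.5, water_rpms)  # similar to the controls
enriched = z_score(250.0, water_rpms)        # far above the controls
print(round(contaminant_like, 2), round(enriched, 2))
```

A taxon whose rPM looks like the controls scores below 1, while one that is far more abundant than in the controls scores well above it.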
Part Two: Overview of data: Heatmap & Z-Score
Our first phase was dedicated to understanding the quality of the sample runs. Next, you’ll actually start digging into the results for your samples. Creating a heatmap is a nice way to get an overview of the taxons present and identify patterns across samples in a project. Even if you have high confidence in a taxon being present in your sequencing library, you may discover that it is not actually an infecting agent, but rather a lab contaminant. To address this question of whether an organism is truly present in a sample, we can start identifying patterns across samples in a project as compared to controls. A heatmap is a common tool to help visualize these patterns.
CREATE A HEATMAP
- Navigate to your project page. In the upper left-hand corner, you should see the project title in bold. In the table, you should see all of your samples for this project.
- Click on the checkmark box to the left of the Sample header to select all samples in your project.
- Click the Heatmap button in the upper right-hand corner of the table and select Taxon Heatmap to have CZ ID create a heatmap of your samples. This process can take a few minutes. (Note: a heatmap cannot be created with more than 300 samples.)
- On the Heatmap page, you will see some filters and view settings at the top of the page. Adjust the following view settings:
- Metric: NT rPM. This will adjust the metric plotted on the heatmap, such that higher rPM values appear darker.
- Taxon Level: Genus. Evaluating at the genus level helps you zoom out and see high-level trends before diving in at the species level.
- Click on the Add Metadata button above the first taxon in the Heatmap to begin adding metadata (by default the only metadata that you will see is Collection Location). Metadata will help bring context to the samples you are viewing.
- Select Sample Type and any other metadata of interest.
- We recommend beginning with a highly restrictive set of filters to get a high-level view of your data. If your goal is to identify divergent viruses, you may want to use the NR e value) filter only. Note: as you set filters, you will see taxa disappear. That’s OK. We’ll be creating several iterations of the heatmap. Our first step is to figure out what is real signal and what are contaminants/noise.
- NT rPM >= 10 (only show taxa with at least 10 reads per million aligned to the nucleotide database) - removes low-abundance taxa.
- NR rPM >= 1 (only show taxa with at least 1 read per million aligning to the protein database) - removes taxa identified based on chance.
- NT L >= 50 (only show taxa with an average read alignment length greater than or equal to 50) - removes false positives resulting from spurious sequence alignments.
- Sort the heatmap by the Sample Type field by clicking on the Sample Type heading next to the row displaying Sample Type metadata in the upper left corner of the heatmap. If you click again, the sorting method will change direction, and clicking a third time will remove sorting altogether.
- Take a look at the heatmap you have created. This should give a broad picture of the landscape of pathogens across your samples. Specifically, look for any differences that you see between controls and non-controls. This will give you a good indication of which taxa are potential contaminants.
- Note your observations.
- Save the heatmap by pressing the Save button in the upper right-hand corner. You can return to this heatmap if needed later on. Tip: To view all of your saved heatmaps, first click on the My Data tab in the page header. Then click on Visualizations.
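The threshold filters used in this section all work the same way: a taxon stays on the heatmap only if it meets every threshold. If you download your results and want to reproduce that logic offline, a sketch might look like the following. The record layout and taxon names are hypothetical; only the three thresholds come from the steps above.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# The three starting thresholds suggested above: (metric, comparison, value).
FILTERS = [("NT rPM", ">=", 10), ("NR rPM", ">=", 1), ("NT L", ">=", 50)]


def passes(taxon, filters=FILTERS):
    """True when the taxon record satisfies every threshold."""
    return all(OPS[op](taxon[metric], value) for metric, op, value in filters)


# Hypothetical genus-level records for one sample.
taxa = [
    {"name": "Genus A", "NT rPM": 120.0, "NR rPM": 80.0, "NT L": 140},
    {"name": "Genus B", "NT rPM": 4.0, "NR rPM": 2.0, "NT L": 120},  # too rare
    {"name": "Genus C", "NT rPM": 35.0, "NR rPM": 5.0, "NT L": 38},  # short alignments
]

kept = [t["name"] for t in taxa if passes(t)]
print(kept)
```

Because the filters are combined with "and", loosening any single threshold can only bring taxa back, never remove them.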
INTERPRETATION OF HEATMAP
There are many ways to begin interpreting the heatmap. You’ll learn what works best for you with time and practice. Not all analysis can or should be done in the heatmap.
- Make sure that you are on the My Data tab in the header. Then click on Visualizations and finally click on the heatmap you previously created.
- By default, the Background is set to NID Human CSF HC. This default isn’t appropriate for your analysis, so use the drop-down menu to choose the background model you previously created.
- Add a filter for NT Z score >= 1. This should filter out taxa that are present in your samples at similar abundances to the controls you selected for your background model.
- Compare your water controls to all other samples. You should see that many of the taxa that were present in your water controls are now filtered out. This will help you identify certain taxa as contaminants.
- Similar to the above section, take a look at your heatmap and observe any trends or taxa of interest.
- Save your new heatmap.
- Try sorting by different metadata fields and see if new patterns arise.
- Sort by location. What do you notice?
- Do you have metadata indicating phenotype? Add it to the heatmap and sort.
- Sometimes it is useful to separate the analysis of bacteria from that of viruses. We can use the Taxon Categories filter to select only the Viruses.
- Then, since we have reduced the total number of microbes, we may want to look more specifically at the species level. Set Taxon Level = Species.
- As you work, if you notice something interesting, save the heatmap so you can reference it later.
- Repeat this process of adding and removing filters, and remember to be patient. This is one part of the research process and it is typical for it to take some time to draw conclusions.
Part Three: Analysis of a Single Sample
Based on taxa of interest that were found in the heatmap, explore whether or not the hit is “real” by using metrics on the sample report page. Note that the heatmap does not show all of the taxa present in every sample so the sample report page should be used to make sure nothing is missed.
GET FAMILIAR WITH THE SAMPLE REPORT PAGE
- Choose any sample that has a taxon of interest that was identified in the heatmap.
- Click on the sample name. You are now on the Sample Report Page.
- You will see a table with all of the taxa found in your sample. Take note of the number of rows in the table (this number can be found in small gray text directly above the table).
- Review the metrics in the Help Center. Each of these metrics is a tool that will help you determine if each taxon is a “real hit” (versus a false positive).
Tip: Hover over a column header (metric name) and a tooltip will appear with the metric’s definition.
- You may notice the NT | NR button on the table. You can toggle between 2 different databases - NT is a database of nucleotide sequences and NR is a database of non-redundant protein sequences.
- If you see a taxon that has high NT counts and low NR counts, many of the reads aligning to that taxon may come from rRNA genes, which are present in the NT database, but not the NR database.
- If you see a taxon that has high NR counts and low NT counts, this is likely a false positive. However, if this microbe is a virus, this sample may contain sequences from a divergent virus that map via the more conserved protein (NR) sequence, but not via the more mutated nucleotide (NT) sequence.
- You will want to use the NT values for most of the analysis, except for when exploring divergent viruses.
IDENTIFY THE MOST ABUNDANT SPECIES
The primary goal of mNGS is to evaluate the relative abundance of microbes in a sample. The reads and rPM (reads per million) metrics in the table provide this relative abundance information. Each time we run a sequencing experiment, we may obtain different numbers of total reads. To scale the values across experiments we use rPM instead of raw read count.
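As a quick illustration of why rPM is easier to compare than raw counts, the sketch below scales a read count by the total number of reads in the sample. The function name and numbers are ours, and we assume the denominator is simply the sample's total read count; check the Help Center for the exact denominator CZ ID uses.

```python
def reads_per_million(taxon_reads, total_reads):
    """Scale a taxon's read count to reads per million total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return taxon_reads * 1_000_000 / total_reads


# The same 500 reads mean very different things at different sequencing depths.
rpm_shallow = reads_per_million(500, 2_000_000)
rpm_deep = reads_per_million(500, 20_000_000)
print(rpm_shallow, rpm_deep)
```

Here the raw count is identical in both runs, but the taxon makes up a ten times larger share of the shallower run.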
- Sort the table by rPM by clicking on the rPM header in the table. The taxa should now be sorted so that the taxa with the highest rPM appear at the top of the table.
- Identify the taxa with the top 3 highest rPM values.
- Expand the genus of each of these taxa by clicking on its row and answer the following:
- How many species are shown within this genus?
- Is one species within the genus much more abundant than another species? Due to sequence homology, it is possible to see false positive species within a genus. If one species is much more abundant than the others, we generally believe the most abundant species is present and pay less attention to the others. This can be confirmed using BLAST.
- Note the most abundant species.
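One way to put a number on "much more abundant" is to ask what share of the genus's rPM belongs to the top species. The sketch below does that; the 80% cutoff and the species names are purely illustrative assumptions, not a CZ ID rule.

```python
def dominant_species(species_rpm, min_share=0.8):
    """Return the top species if it holds at least min_share of the genus rPM."""
    total = sum(species_rpm.values())
    if total == 0:
        return None
    name, rpm = max(species_rpm.items(), key=lambda item: item[1])
    return name if rpm / total >= min_share else None


# Hypothetical species-level rPM values within two genera.
clear_case = {"Species X": 950.0, "Species Y": 30.0, "Species Z": 20.0}
ambiguous_case = {"Species P": 40.0, "Species Q": 35.0}

print(dominant_species(clear_case))
print(dominant_species(ambiguous_case))
```

When no species dominates, the function returns `None`, which is a cue to confirm the species call with BLAST before trusting it.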
FILTERING THE REPORT PAGE
Now that we are familiar with the report page, we’ll add filters to remove some noise. There are multiple sources of noise in mNGS. First, similarity between organisms may make it challenging to distinguish them with short reads, resulting in low-abundance false positives. Second, mNGS is highly sensitive and may detect low levels of background contaminants that are not relevant for making a decision regarding the infecting microbe. Filters allow us to focus on highly abundant species within particular categories (e.g. Viruses, Bacteria).
In this section, we will set some basic filters, but there are many more that you can apply.
- Click on Threshold Filters and set: NT r>= 1 and click Apply. We suggest setting this filter to start, but note that this may exclude some highly divergent or novel viruses from appearing in your table. If this is an important aim of your research, consider removing this filter.
- Click on Threshold Filters and set: NT rPM >= 10 and click Apply. This filter removes rows associated with taxa that are present at extremely low levels.
- Click on Threshold Filters and set the following: NT L >= 50 and click Apply. This filter removes rows associated with taxa where the alignment length was shorter than 50 bp.
- You may also find it useful to focus on particular taxonomic categories, e.g. Bacteria or Viruses. You can do so by clicking on Categories and selecting your choices.
- Take note of the new value indicating the number of rows passing the filters.
GAIN CONFIDENCE IN HITS
Each of the metrics in the Sample Report can be used to assess your confidence in a particular hit (taxon) actually being present in your sample. We will focus on three values in particular: contigs, %id, and L.
- Find the taxon that you identified as being the most abundant (highest rPM) in your sample.
- Look at the contig column. Is the contig value > 0?
- Yes - this indicates that short reads were assembled into longer contigs, which improves your confidence in the hit.
- No - that’s ok! Contig assembly is dependent on the total number of reads and the size of the organism’s genome. This organism may be present at such low abundance that we didn’t sequence enough reads to generate a contig.
- Look at the column %id. Is the value >90%?
- Yes - this suggests this organism is highly similar to the reference sequence.
- No - Sometimes novel pathogens will appear to have lower %id, but you can double-check if it’s really there, by downloading the reads and contigs and investigating the quality of alignments via BLAST.
- Look at the column L.
- For libraries containing 150bp reads, we can have confidence in alignments with an L value greater than 75 bp. The longer the alignment, the greater trust we have in this result.
- Note that it is possible to have L values greater than the original read length if contigs were generated.
- Is L < 50? We advise you to not trust these results as the length value is too low.
- Repeat this sequence of steps for the taxa with the next highest rPMs that you noted earlier.
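The three checks above can be written down as a small checklist function, which is handy if you are reviewing many taxa from a downloaded report. The thresholds are the ones given in the steps above (for 150 bp libraries); the function name, argument names, and example values are hypothetical.

```python
def assess_hit(contigs, percent_id, alignment_length):
    """Return (supporting, cautions) lists for one taxon row."""
    supporting, cautions = [], []

    if contigs > 0:
        supporting.append("reads assembled into contigs")
    else:
        cautions.append("no contigs; abundance may be too low to assemble")

    if percent_id > 90:
        supporting.append("highly similar to the reference")
    else:
        cautions.append("low %id; check alignments with BLAST")

    if alignment_length < 50:
        cautions.append("alignment too short to trust")
    elif alignment_length > 75:
        supporting.append("long alignment")

    return supporting, cautions


# Hypothetical values read from one row of the sample report.
supporting, cautions = assess_hit(contigs=3, percent_id=98.7, alignment_length=412)
print(supporting, cautions)
```

None of these checks is decisive on its own; a hit with no contigs can still be real, so read the two lists together.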
The Coverage Visualization is another tool to assess whether a particular microbe is present or not. This type of visualization shows the range and uniformity of sequencing coverage for an accession identified in the sample.
- Choose a taxon that already has a couple of different indicators of a real hit (e.g. high rPM, high L value). Also, make sure the taxon has at least one NT contig.
- Hover over the taxon row and click on the Coverage Visualization icon (the bar graph).
- The top accession is the one shown by default upon opening the coverage visualization (there may be more than one matching accession; you can read more about that here.). Use the table below to help assess the depth and breadth of the match.
Depth | Breadth | How to Interpret
High | High | Provides you with confidence that this is a true hit!
Low | High | If there is coverage across multiple regions of the genome, this is likely a real hit. Evaluate the assembled contigs (NT L) and ensure the alignment is good (% identity).
High | Low | Depends on whether the small portion that has high coverage is unique to that particular taxon and the quality of the alignment and assembly. The 16S rRNA region is highly conserved, which is both good and bad. It’s good because you can use it to ID bacteria if you have high confidence in your 16S rRNA sequence. But since it is highly conserved, if you don’t have high confidence in the sequence (the error rate is high), it’s easy to mistake it for a different bacteria.
Low | Low | May be a result of genomic similarity at the 16S rRNA region
- Note whether the genome in the Coverage Visualization is Viral or Bacterial. Because viral genomes are smaller, it is common to see relatively high coverage. Use the two examples below to familiarize yourself with what high coverage looks like for these two different types of genomes.
Example of High Coverage - Viral Genome
Example of High Coverage - Bacterial Genome
- Summarize your findings.
- Repeat this process for other taxa in your sample that you want to gain confidence in.
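If the difference between depth and breadth is hard to picture, the sketch below computes both from a list of per-position read depths. It assumes depth means the average number of reads per position and breadth means the fraction of positions covered by at least one read; the tiny ten-position "accessions" are invented for illustration.

```python
def coverage_summary(per_base_depth):
    """Mean depth and breadth (fraction of positions covered at least once)."""
    length = len(per_base_depth)
    if length == 0:
        raise ValueError("accession has no positions")
    depth = sum(per_base_depth) / length
    breadth = sum(1 for d in per_base_depth if d > 0) / length
    return depth, breadth


# Reads spread evenly across the whole accession.
even = [12, 15, 14, 13, 16, 12, 11, 15, 14, 13]
# Many reads piled onto one small region, nothing elsewhere.
spiky = [0, 0, 0, 90, 95, 0, 0, 0, 0, 0]

print(coverage_summary(even))
print(coverage_summary(spiky))
```

The second accession actually has the higher average depth, yet only a fifth of it is covered, which is the "High depth, Low breadth" situation that calls for a closer look at what that small region is.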
Apply the background model
It is possible, especially if there was no infection or relatively little input nucleic acid, that the most abundant microbe in the sample was actually a contaminant. Now that you’ve evaluated whether a microbe is truly present, it is important to evaluate whether it was present at levels greater than the control samples.
Z SCORE
Choose a sample that you were able to draw some conclusions about with the heatmap and your previous analysis. Click on any of the squares in the heatmap for that sample to return to its report page.
- Because the Z Score is based off of the chosen Background Model, we need to make sure we have chosen the correct one. Click on Background in the filter section above the table and choose the Model containing your water controls.
- Now the Z Score is an informative value! Sort by Z Score by clicking on the Z Score column header.
- Are there any taxa with Z score values above 1?
- Yes - note the Z Score value. There may be many taxa with Z scores > 1. When analyzing the results, relevant microbes will be at high abundance (rPM) and significance (Z score).
AGGREGATE SCORE
- Finally, let’s take a look at the Score metric. This is an experimental metric that combines relative abundance data and Z scores. This can be a helpful way of identifying standout taxa in your sample.
- Sort the table by Score by clicking on the Score column header.
- Do you see any taxa with an outlier Score value?
- Yes! - this indicates both that a high abundance of reads matched that taxon in the NT and NR databases, and that these abundance values are significantly higher than in the selected background model.
- No - Score is an experimental value, so the absence of high values does not mean there is nothing interesting in your sample.
SAMPLE TAKEAWAYS
- Repeat these data interpretation steps for 5 other samples in your project.
- Think about: What are you finding? What patterns are you noticing across samples? Are you noticing anything interesting in the pathogen landscape?
Part Four: Recap - Tie it All Together
When you upload sequencing data to CZ ID, you can repeat these steps to make sense of the data! In practice, we suggest the following steps to help you gain intuition about your samples.
- View your samples’ QC metrics on the project page.
- As you gain an understanding about your samples, look for any outliers.
- Consider Sample Type, Total Reads, ERCC Reads, Passed QC, Passed Filters, and DCR in evaluating QC.
- For any outliers, dig into the “Reads Remaining” table to understand what may have happened during library prep.
- Create a heatmap to view the samples.
- Add some metadata.
- Make sure you are using an appropriate background model. If you don’t have one, create a background model from water and/or healthy samples processed in your lab.
- Apply some filters to the heatmap.
- Look at Taxon Level = Genus to get a high-level view.
- Use the filters and a background model to remove noise and contamination.
- Look for trends. You may play around with the filters to see what makes sense for your datasets.
- Dive into individual hits using the sample reports. Make sure to evaluate the hits based on several metrics:
- rPM, abundance within the sample
- Z score, significance as compared to the background
- Coverage visualization, to assess breadth and depth of reads to an accession
- Finally, any hits you find by mNGS can be validated by qPCR. With practice, you’ll soon be an expert in interpreting your data.