Overview
Nextclade performs four consensus genome quality control checks. This document will cover how to troubleshoot issues that are flagged in Nextclade, as well as other potential QC issues.
Troubleshooting too many N’s
Sites where a base could not be called: Areas with low or no sequencing coverage provide no information about which base belongs at a site, so these sites are labeled with N’s. When a sequence has too many N’s, it is hard both to align and to place on the tree, so such sequences are removed from analyses. By default, Nextstrain will drop sequences with fewer than 27,000 non-ambiguous bases.
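To check a consensus genome against this cut-off before uploading, you can count its unambiguous bases yourself. The sketch below is a minimal illustration, not part of Nextclade or Nextstrain: the function names are hypothetical, and it assumes that "non-ambiguous" means A, C, G, or T only.

```python
from typing import Dict, Iterable, List

MIN_CALLED_BASES = 27_000  # default cut-off quoted in the text above


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {record_name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def count_called_bases(sequence: str) -> int:
    """Count A/C/G/T calls; N's, IUPAC codes, and gaps are not counted."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def passes_length_filter(sequence: str, minimum: int = MIN_CALLED_BASES) -> bool:
    """True if the sequence has at least `minimum` unambiguous bases."""
    return count_called_bases(sequence) >= minimum
```

For example, `read_fasta(open("consensus.fa"))` (the file name is a placeholder) returns a dictionary whose sequences can be passed to `passes_length_filter`.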
Options:
- Check the sequencing metrics to determine if it was a successful run. Q score, % passing filter, and cluster density should be taken into account.
- Resequence the sample. However, before deciding to resequence take the following into consideration:
- The Ct value. A high Ct value indicates a low viral load, so resequencing may not improve coverage. If the Ct value is >30, it is probably not worth resequencing.
- If the Ct value is between 25 and 30, you can resequence, but you will need more depth. See table here.
- If the sample is resequenced, there is the option to concatenate fastq files prior to CZ ID upload to increase the sequencing depth. See how to concatenate files here.
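As an illustration of the concatenation step, the sketch below joins files byte for byte. It relies on the fact that gzip-compressed files appended to one another still decompress as a single stream, so it works for either all-compressed or all-uncompressed fastq files (do not mix the two). The function name and file names are hypothetical, and the linked instructions may use a different method.

```python
import shutil
from pathlib import Path
from typing import Sequence


def concatenate_files(inputs: Sequence[Path], output: Path) -> None:
    """Write the raw bytes of each input file, in order, to a new output file."""
    with open(output, "wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                # Stream in chunks so large fastq files are not loaded into memory.
                shutil.copyfileobj(in_handle, out_handle)


# Hypothetical usage: combine the R1 files from two runs of the same sample.
# concatenate_files(
#     [Path("run1_R1.fastq.gz"), Path("run2_R1.fastq.gz")],
#     Path("combined_R1.fastq.gz"),
# )
```

For paired-end data, concatenate the R1 files and the R2 files separately, listing the runs in the same order both times so that read pairs stay matched.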
Troubleshooting mixed sites
Mixed sites: If many sequencing reads support more than one base at a site, that site is designated with an IUPAC ambiguity code, which tells you which set of bases was found there. This can result from a co-infection event, but that is rare; it is more commonly caused by cross-contamination.
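To see where the mixed sites fall in a consensus genome, you can scan it for IUPAC ambiguity codes. The sketch below uses the standard IUPAC nucleotide codes; the function name is hypothetical, and N is deliberately excluded because it marks a site with no call rather than a mixed site.

```python
from typing import List, Tuple

# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def find_mixed_sites(sequence: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, code, possible bases) for each mixed site."""
    return [
        (position, base, IUPAC_AMBIGUITY[base])
        for position, base in enumerate(sequence.upper(), start=1)
        if base in IUPAC_AMBIGUITY
    ]
```

Positions are relative to the sequence passed in, so they match reference coordinates only if the consensus is aligned to the reference without insertions.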
Options:
- Check for cross-contamination:
- Check plate map. Keep track of the location of low Ct values. Samples with higher viral load are more likely to cause contamination.
- Check the barcodes used during the library prep. Shared barcodes may cause bleed-over during sequencing.
- View the primertrimmed.bam file.
- Our pipeline calls a base at a position only when that base reaches 75% frequency among the reads. If there are too many mixed sites, check the bam file to see if any bases can be confidently called (e.g. 74% one base and 26% another in a region with high coverage).
- Pay attention to where the ambiguous bases occur within the reads: the ends of reads tend to have lower-quality bases and are less trustworthy.
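The 75% rule above can be sketched as follows, given per-base read counts at one position (for example, read off a bam viewer). This is an illustration only and not the pipeline's code: the function name is hypothetical, and treating exactly 75% as passing is an assumption.

```python
from typing import Dict, Optional, Union

SiteSummary = Dict[str, Union[str, float, int, None]]


def summarize_site(counts: Dict[str, int], min_fraction: float = 0.75) -> SiteSummary:
    """Summarize one position from its per-base read counts.

    `min_fraction` is the 75% threshold from the text; using >= rather than >
    is an assumption of this sketch.
    """
    depth = sum(counts.values())
    if depth == 0:
        # No reads cover the site, so nothing can be called.
        return {"call": "N", "top_base": None, "fraction": 0.0, "depth": 0}
    top_base, top_count = max(counts.items(), key=lambda item: item[1])
    fraction = top_count / depth
    call = top_base if fraction >= min_fraction else "mixed"
    return {"call": call, "top_base": top_base, "fraction": fraction, "depth": depth}
```

A site with 740 reads of A and 260 reads of G is reported as mixed with a top-base fraction of 0.74 at depth 1000, which is the kind of near-threshold, high-coverage site worth reviewing by eye.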
Troubleshooting private mutations
Private mutations: If a sequence differs from the Wuhan reference genome by (currently) more than 24 mutations, it will be flagged as having a high number of “private” mutations. The threshold for flagging a sequence as problematic will be changed as the diversity of SARS-CoV-2 increases over the pandemic.
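As a rough pre-check, you can count substitutions between a consensus genome and the reference once the two are aligned. The sketch below is a simplification and not how Nextclade computes its metric: it counts substitutions only (indels are ignored), skips sites that are N or ambiguous, and uses hypothetical function names. The limit of 24 is the current value quoted above and will change.

```python
from typing import List, Tuple

PRIVATE_MUTATION_LIMIT = 24  # current threshold quoted in the text; subject to change


def list_substitutions(reference: str, consensus: str) -> List[Tuple[int, str, str]]:
    """Return (1-based alignment column, reference base, consensus base) per substitution.

    Both strings must come from the same alignment and have equal length.
    Columns where either base is not A/C/G/T (N, IUPAC code, gap) are skipped.
    """
    if len(reference) != len(consensus):
        raise ValueError("aligned sequences must have the same length")
    substitutions = []
    pairs = zip(reference.upper(), consensus.upper())
    for column, (ref_base, con_base) in enumerate(pairs, start=1):
        if ref_base in "ACGT" and con_base in "ACGT" and ref_base != con_base:
            substitutions.append((column, ref_base, con_base))
    return substitutions


def has_too_many_mutations(
    reference: str, consensus: str, limit: int = PRIVATE_MUTATION_LIMIT
) -> bool:
    """True if the number of substitutions exceeds `limit`."""
    return len(list_substitutions(reference, consensus)) > limit
```

Alignment columns equal reference coordinates only when the aligned reference contains no gaps.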
Options:
- View the primertrimmed.bam file.
- If there is high coverage in that location and all of the reads show the same base call, it is a good sign that the mutation is real.
- If there is low coverage and/or reads with different base calls, it could be a sign of contamination.
- SNPs are only a problem if there are too many. The threshold for ‘too many’ will change over time. You can use Nextclade as a resource for flagging consensus genomes with a high number of SNPs.
Troubleshooting mutation clusters
Clusters of mutations: If your sequence has one or more areas with 6 mutations within a 100nt wide window, then that will be considered a “cluster of mutations” and it will be flagged unless it occurs at an area of the genome with known high diversity. Such clusters of mutations are often artifactual, resulting from challenges aligning the sequence.
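The window rule can be sketched with a sliding window over a sorted list of mutation positions. This is an illustration under stated assumptions, not Nextclade's implementation: the function name is hypothetical, "6 mutations" is read as 6 or more, and the exemption for known high-diversity regions is not modeled.

```python
from typing import Iterable, List, Tuple


def find_mutation_clusters(
    positions: Iterable[int], window: int = 100, min_mutations: int = 6
) -> List[Tuple[int, int]]:
    """Return (first, last) position of each region with a dense run of mutations.

    A region is reported when at least `min_mutations` positions fall inside a
    window of `window` nucleotides; overlapping regions are merged.
    """
    ordered = sorted(set(positions))
    clusters: List[Tuple[int, int]] = []
    right = 0
    for left, start in enumerate(ordered):
        # Move `right` to the first position outside [start, start + window - 1].
        while right < len(ordered) and ordered[right] < start + window:
            right += 1
        if right - left >= min_mutations:
            first, last = start, ordered[right - 1]
            if clusters and first <= clusters[-1][1]:
                # Overlaps the previous cluster, so extend it instead of adding a new one.
                clusters[-1] = (clusters[-1][0], max(last, clusters[-1][1]))
            else:
                clusters.append((first, last))
    return clusters
```

Any region returned is a good place to start when viewing the primertrimmed.bam file.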
Options:
- View the primertrimmed.bam file.
- Other problems, such as long stretches of N’s, can cause artifacts like this.
- If it is not salvageable, resequencing may be necessary (dependent on the Ct value).
Troubleshooting frameshift mutations
Frameshift mutations happen when there are deletions or insertions that affect the open reading frames (ORFs). If there are frameshift mutations, the consensus genome will not be accepted into GISAID or GenBank, and all of the reads will be kicked back.
Check for frameshift mutations:
- Align the consensus genome back to the reference genome
- Check the open reading frames
- If you have Geneious, you can use this software. If not, you can do this in BLAST.
- Make sure the ORFs are in line with the reference genome
- If they are not, take a closer look at the alignment and check for insertions or deletions (the cause can be as simple as a single inserted N).
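If you export the pairwise alignment as two gapped strings, a quick first pass is to look for insertions or deletions whose length is not a multiple of 3. The sketch below is a hypothetical helper, not a feature of Geneious or BLAST, and it makes simplifying assumptions: gaps are written as "-", gaps at the very start or end of the alignment are treated as missing sequence rather than indels, and it does not check whether an indel actually falls inside an ORF.

```python
import re
from typing import List, Tuple


def find_frameshift_indels(
    aligned_reference: str, aligned_consensus: str
) -> List[Tuple[str, int, int]]:
    """Return (kind, 1-based alignment column, length) for indels not divisible by 3.

    Gaps in the reference row are insertions in the consensus; gaps in the
    consensus row are deletions.
    """
    if len(aligned_reference) != len(aligned_consensus):
        raise ValueError("aligned sequences must have the same length")
    events = []
    rows = (("insertion", aligned_reference), ("deletion", aligned_consensus))
    for kind, row in rows:
        for match in re.finditer(r"-+", row):
            if match.start() == 0 or match.end() == len(row):
                continue  # terminal gap: uncovered genome end, not an indel
            length = match.end() - match.start()
            if length % 3 != 0:
                events.append((kind, match.start() + 1, length))
    return sorted(events, key=lambda event: event[1])
```

An empty result does not prove the ORFs are intact; it only means no internal indel of a frame-breaking length was found in this alignment.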
Other QC checks:
- Always have water controls! Negative controls are useful as well.
- It's normal to see a handful of SARS-CoV-2 reads in controls, but be concerned if you recover full amplicons; this is a sign of contamination.
- Keep track of shared barcodes, because these can cause bleed-over.
- Always have plate maps and note where the Ct values are located.