Consensus Genome Quality Checks

Overview

You can check the quality of viral consensus genomes by looking at the assembly metrics reported in the genome coverage panel. This guide describes those metrics and offers troubleshooting tips for improving genome quality.

After reading this guide, you will:

Understand genome quality control (QC) metrics
Be able to perform key genome QC checks
Know troubleshooting strategies to improve genome quality

Assembly Metrics

You will be able to see various metrics on the genome coverage panel after assembling a viral consensus genome.

Figure: genome coverage panel (Assembly Metrics, Coverage Stats, Coverage Visualization).

Metrics include:

Coverage plot - Graph depicting the number of reads covering a given nucleotide of the reference sequence. A minimum read depth of 5 or 10 reads is required to call a base in the consensus genome when processing mNGS or WGS data, respectively.
% Genome Called - Refers to the percentage of the genome meeting thresholds for calling consensus bases. The closer this number is to 100%, the better.
SNPs - Indicates the number of single nucleotide polymorphisms. SNPs represent single nucleotide variations between the reference accession and consensus genome.
Informative bases - Specifies the number of base calls (C, T, G, A) in the genome.
Ambiguous bases - Specifies the number of non-C, T, G, A nucleotides in the consensus genome. The consensus genome pipeline only calls a nucleotide when it is detected at a frequency of at least 75%. If the sequencing reads at a given site support more than one nucleotide, the site is designated with an IUPAC ambiguity code.
Mapped reads - Refers to the total number of reads that mapped to the reference genome.
GC content - Percentage of G and C nucleotides in the consensus sequence. The GC content of the consensus sequence should be close to that of the reference sequence.
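
The sequence-based metrics above can be approximated directly from a consensus sequence. The following is a minimal Python sketch, not the pipeline's own code. The function name `summarize_consensus` is hypothetical, and two choices are assumptions: "N" is counted separately from other IUPAC ambiguity codes, and % Genome Called is taken as informative bases divided by the reference length.

```python
def summarize_consensus(consensus: str, reference_length: int) -> dict:
    """Approximate consensus genome metrics from a sequence string.

    Assumptions (not confirmed by the pipeline documentation):
    - "N" (missing) is reported separately from other ambiguity codes.
    - % genome called = informative bases / reference length.
    """
    seq = consensus.upper()
    # Informative bases are unambiguous A, C, G, T calls.
    informative = sum(seq.count(base) for base in "ACGT")
    missing = seq.count("N")
    # Any remaining letter is an IUPAC ambiguity code (R, Y, S, W, ...).
    # Gap characters such as "-" are not letters and are ignored.
    letters = sum(1 for base in seq if base.isalpha())
    ambiguous = letters - informative - missing
    gc = seq.count("G") + seq.count("C")
    return {
        "informative_bases": informative,
        "ambiguous_bases": ambiguous,
        "missing_bases": missing,
        "percent_genome_called": 100.0 * informative / reference_length,
        # GC content is computed over informative bases only.
        "gc_content": 100.0 * gc / informative if informative else 0.0,
    }


if __name__ == "__main__":
    print(summarize_consensus("ACGTNNRYAC", reference_length=10))
```

Comparing `gc_content` for the consensus and the reference sequence gives a quick check that the two are close.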

You should become familiar with the viral genome you are trying to assemble to be able to assess its quality. The International Committee on Taxonomy of Viruses (ICTV) and ViralZone are great resources for learning about viruses of interest. When investigating genomic features, take note of:

Genome size: How many base pairs is it?
Genome organization: How many open reading frames (ORFs)? What is their orientation?
Repeat regions: Are there known repeat regions? What are their positions? If no single read spans an entire repeat region, the assembly or consensus sequence over that region should not be trusted.
Low or high GC content areas: Are there genomic regions with low or high GC content? It is important to inspect the assembly over these regions because they are prone to sequencing bias or errors.

When assessing genome quality, keep in mind the following checklist:

Coverage plot: The coverage plot provides a quick initial quality check of the consensus genome. It depicts the number of reads that cover each position of the reference genome: the y-axis shows the number of reads (depth), and the x-axis shows the position on the genome. The greater the coverage across the genome, the better the consensus genome.
SNPs: The number of acceptable SNPs will differ depending on the virus (type and genome length). A general rule of thumb is that double-stranded (ds)DNA viruses evolve more slowly than RNA viruses. Therefore, you would expect fewer SNPs for dsDNA viruses than for RNA viruses. Single-stranded DNA viruses evolve at rates closer to those observed for RNA viruses.
Ambiguous bases: The fewer ambiguous bases, the better the genome quality. We recommend having fewer than ten ambiguous bases.
Gaps: Gaps in the genome will reduce the quality of the consensus genome and, thus, will affect the accuracy of downstream analysis (e.g., phylogenetic trees).
Frameshifts: Frameshifts happen when insertions or deletions shift the reading frame of a coding region, that is, when their length is not a multiple of three. If you notice frameshifts after aligning the consensus genome with the reference sequence, you should do your best to distinguish natural mutations from sequencing or bioinformatic artifacts (see troubleshooting tips below).
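
The frameshift check in the list above can be partly automated once the consensus genome has been aligned to the reference. The sketch below is illustrative only: the function name `frameshift_candidates` is hypothetical, and it assumes a pairwise alignment in which gaps are written as "-". It flags every insertion or deletion whose length is not a multiple of three. It does not know where coding regions are, so each hit still needs to be checked against the genome annotation.

```python
import re


def frameshift_candidates(reference: str, consensus: str) -> list:
    """Return indels whose length is not a multiple of three.

    Both inputs are rows of the same pairwise alignment (equal length,
    gaps as "-"). Each result is (kind, alignment_start, length), where
    alignment_start is a 1-based alignment column.
    """
    if len(reference) != len(consensus):
        raise ValueError("aligned sequences must have the same length")
    candidates = []
    # A gap run in the reference row means the consensus has an insertion;
    # a gap run in the consensus row means the consensus has a deletion.
    for kind, row in (("insertion", reference), ("deletion", consensus)):
        for match in re.finditer(r"-+", row):
            length = match.end() - match.start()
            if length % 3 != 0:
                candidates.append((kind, match.start() + 1, length))
    return sorted(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    ref = "ATGAAA--CCCTTTGGG"
    con = "ATGAAAGGCCC---GGG"
    # The 2-base insertion is flagged; the 3-base deletion is not.
    print(frameshift_candidates(ref, con))
```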

Troubleshooting: Checking Read Alignments

If there are unexpected features in assembled consensus genomes (e.g., early stop codons, frameshifts, too many ambiguous or undetermined bases) you should evaluate read alignment against the reference. You will need to use a genome browser where you can view read alignment using BAM files containing read mapping information (i.e., “primertrimmed.bam” and “primertrimmed.bai”) and the reference sequence. Click here to learn how to download intermediate files, including BAM files and references.

We suggest viewing read alignments with the Integrative Genomics Viewer web app (IGV-Web). IGV-Web is a free online interactive tool used to explore genomic data. Note that there is an IGV desktop app that you can download and install for free on your computer. See user guides for IGV desktop and IGV-Web for detailed user manuals. Below we walk you through features you should keep in mind when evaluating read alignments, including gaps, ambiguous bases, and SNPs.

Gaps or low coverage of consensus genome

Areas with low or no sequencing coverage do not have enough information for base calling. A minimum read depth of 5 or 10 reads is required to call a base in the consensus genome when processing mNGS or WGS data, respectively. Otherwise, the site will be labelled with "N". Too many missing sites (Ns) will affect the reliability of downstream phylogenetic analysis.
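
To see where gaps are likely to appear, per-position read depths can be compared with the minimum depth. The following Python sketch assumes you already have a list of depths, one per reference position (for example, exported from a genome browser or a depth-reporting tool). The function name `low_coverage_regions` and the `MIN_DEPTH` dictionary are hypothetical; the thresholds of 5 and 10 are the ones stated above.

```python
# Minimum read depth needed to call a base, as stated in this guide.
MIN_DEPTH = {"mNGS": 5, "WGS": 10}


def low_coverage_regions(depths, min_depth):
    """Return (start, end) intervals, 1-based inclusive, where depth < min_depth."""
    regions = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth < min_depth:
            if start is None:
                start = position  # a low-coverage run begins here
        elif start is not None:
            regions.append((start, position - 1))  # the run ended at the previous position
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # the run extends to the end of the genome
    return regions


if __name__ == "__main__":
    depths = [12, 11, 4, 3, 9, 10, 20, 0]
    print(low_coverage_regions(depths, MIN_DEPTH["WGS"]))
    print(low_coverage_regions(depths, MIN_DEPTH["mNGS"]))
```

Positions inside the returned intervals are the ones expected to be labelled "N" in the consensus genome.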

What can be done about gaps? Try the following options:

1. Choose a new reference genome.

2. Design primers to fill in gaps.

3. Resequence the sample in a run with fewer samples to obtain deeper sequencing, which should result in more coverage.

4. If the sample is resequenced, there is the option to concatenate fastq files prior to CZ ID upload to increase the sequencing depth. See how to concatenate files here.
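
As a rough illustration of option 4, FASTQ files from two runs can be joined by appending one file to the other. The Python sketch below is an assumption-laden example and does not replace the linked instructions: the function name `concatenate_files` and the file names in the comments are hypothetical. It relies on the fact that gzip-compressed files remain valid when concatenated byte for byte. All inputs to one call should be either compressed or uncompressed, not a mix, and for paired-end data the R1 files and the R2 files must be concatenated separately and in the same run order.

```python
import shutil


def concatenate_files(input_paths, output_path):
    """Append the input files, in order, into one output file (binary copy)."""
    with open(output_path, "wb") as output_file:
        for path in input_paths:
            with open(path, "rb") as input_file:
                shutil.copyfileobj(input_file, output_file)


# Example with hypothetical file names (paired-end data, two runs):
# concatenate_files(["run1_R1.fastq.gz", "run2_R1.fastq.gz"], "sample_R1.fastq.gz")
# concatenate_files(["run1_R2.fastq.gz", "run2_R2.fastq.gz"], "sample_R2.fastq.gz")
```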

Ambiguous bases

If many sequencing reads support more than one base at a site, those sites will be designated with an IUPAC ambiguity code. This code specifies which nucleotides were observed at the site.

What can be done about ambiguous bases? Try the following options:

1. Evaluate read alignment: Our pipeline only calls a base when it is supported by at least 75% of the reads at that position. Check the BAM file for sites that fall just short of this threshold (e.g., 74 reads supporting one base and 25 supporting another); resolving these sites manually can lower the number of ambiguous bases. Note that read ends tend to have lower-quality bases and, therefore, are less trustworthy.

2. Use PCR to amplify the regions with ambiguities and use Sanger sequencing to confirm the sequence.
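
To make the 75% rule concrete, here is a minimal Python sketch of one way a site can be called from the bases observed in the reads covering it. This is an assumption about how a frequency threshold can be applied, not a copy of the pipeline's code. The function name `call_site` is hypothetical. Bases are added from most to least frequent until their combined frequency reaches the threshold, and the resulting set is translated into a standard IUPAC code.

```python
from collections import Counter

# Standard IUPAC nucleotide codes for each set of observed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_site(bases: str, min_frequency: float = 0.75, min_depth: int = 10) -> str:
    """Call one consensus site from the read bases observed at it.

    The accumulate-until-threshold rule is an illustrative assumption.
    """
    counts = Counter(base for base in bases.upper() if base in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N"  # not enough reads to call the site
    chosen = set()
    cumulative = 0
    for base, count in counts.most_common():
        chosen.add(base)
        cumulative += count
        if cumulative / depth >= min_frequency:
            break
    return IUPAC[frozenset(chosen)]


if __name__ == "__main__":
    print(call_site("A" * 74 + "G" * 25))  # A is just under 75%, so the site is ambiguous
    print(call_site("A" * 80 + "G" * 20))  # A passes the threshold on its own
```

A site like the first example is a good candidate for manual inspection of the read alignment.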

SNPs

SNPs are called in genome sites where the consensus sequence has bases that differ from the reference genome.
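
As a small illustration of this definition, SNPs can be listed from a pairwise alignment of the reference and consensus sequences. The Python sketch below is hypothetical (the function name `list_snps` is not part of any pipeline) and assumes both sequences come from the same alignment, with gaps written as "-". Sites where the consensus has "N" or an ambiguity code are skipped, since they are not confident base calls.

```python
def list_snps(reference: str, consensus: str) -> list:
    """List SNPs as (reference_position, reference_base, consensus_base).

    Inputs are rows of the same pairwise alignment (equal length, gaps
    as "-"). Positions are 1-based reference coordinates.
    """
    if len(reference) != len(consensus):
        raise ValueError("aligned sequences must have the same length")
    snps = []
    reference_position = 0
    for ref_base, con_base in zip(reference.upper(), consensus.upper()):
        if ref_base != "-":
            reference_position += 1  # gap columns do not advance the reference coordinate
        # Only count sites where both rows hold an unambiguous base.
        if ref_base in "ACGT" and con_base in "ACGT" and ref_base != con_base:
            snps.append((reference_position, ref_base, con_base))
    return snps


if __name__ == "__main__":
    print(list_snps("ACGT-ACGT", "ACTTAANGA"))
```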

What can be done about SNPs?

Evaluate read alignment to try to distinguish between real mutations and artifacts. If there is high coverage for the region in question and all of the reads show the same base call, then there is evidence indicating that the mutation is real. On the other hand, if there is low coverage and/or reads with different base calls, it could be a sign of contamination.
