Overview
Building a consensus genome for SARS-CoV-2 is an essential step in monitoring genomic changes and informing public health officials on transmission and virus evolution. This document will help you understand:
- How to perform quality control checks
- How to analyze the consensus genome
- How to upload the consensus genome to public repositories
SARS-CoV-2 information
It will be easiest to identify potential errors in the consensus genome you’ve built if you are familiar with the reference genome [SARS-CoV-2 genome]. Check out the SARS-CoV-2 resource page on NCBI for additional resources: https://www.ncbi.nlm.nih.gov/sars-cov-2/.
Performing quality control checks
Overview
Once the pipeline has finished running, you can find your completed consensus genomes in your project. Navigate to the project where you uploaded your samples and click on a single consensus genome result. You can distinguish your consensus genome samples from the mNGS samples because they have the prefix “[Consensus Genome]” in front of each sample name.
Often, the first step you will want to take when reviewing your consensus genome is doing a quality control review. You can do this by reviewing the metrics provided in the consensus genome sample report.
When reviewing the consensus genome, there are four initial metrics to evaluate:
- Coverage plot - shows the number of times each nucleotide position was read during sequencing (depth). A position on the genome must be covered by >10 reads for a base to be called.
- % genome called - Recovering a complete genome is important for phylogenetic analysis. Because SARS-CoV-2 is slow to mutate, the whole genome is used in phylogenetic analyses. Nextstrain will only accept genomes with >92% coverage. Make sure there are no stretches of Ns in the consensus genome.
- The number of single nucleotide polymorphisms (SNPs) - these are variations of a single base between the reference and consensus genomes. Again, because SARS-CoV-2 is slow to mutate, having 30 or more SNPs should warrant closer investigation of the reads that aligned to produce the consensus genome.
- Informative bases - the number of C, T, G, and A bases in the genome. At least 27,510 are required for a genome to be included on Nextstrain.
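The informative-base and N-stretch checks above can be sketched in a few lines of Python. This is a minimal sketch rather than the pipeline's own calculation: the function names are hypothetical, and it assumes that "% genome called" can be approximated as informative bases divided by the 29,903 bp reference length.

```python
import re

REFERENCE_LENGTH = 29_903        # SARS-CoV-2 reference genome length (bp)
MIN_INFORMATIVE_BASES = 27_510   # Nextstrain minimum quoted above


def read_fasta_sequence(path):
    """Return the first sequence in a FASTA file as one uppercase string."""
    chunks = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if chunks:          # stop at the second record
                    break
                continue
            chunks.append(line.strip())
    return "".join(chunks).upper()


def summarize_consensus(sequence, reference_length=REFERENCE_LENGTH):
    """Count informative bases and locate runs of Ns (1-based, inclusive)."""
    seq = sequence.upper()
    informative = sum(seq.count(base) for base in "ACGT")
    n_stretches = [(m.start() + 1, m.end()) for m in re.finditer(r"N+", seq)]
    return {
        "informative_bases": informative,
        # Assumption: approximates the "% genome called" metric.
        "percent_genome_called": 100.0 * informative / reference_length,
        "n_stretches": n_stretches,
        "meets_nextstrain_minimum": informative >= MIN_INFORMATIVE_BASES,
    }
```

For a downloaded genome, `summarize_consensus(read_fasta_sequence("consensus.fa"))` would report the same quantities for a quick check before upload.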
If these metrics look good, then you can proceed with uploading the consensus genome to public repositories and performing phylogenetic analyses (detailed below). If there are potential issues with the consensus genome, troubleshooting and other QC metrics are recommended below.
We recommend viewing the coverage plot for each consensus genome first. This is an important QC check to view the coverage depth and breadth of the reference genome. There are many reasons for poor genome coverage including low viral load, sample degradation, or issues with the library preparation.
The following files are provided from the 'download all' button.
File | Description | Use
consensus.fa | The consensus genome | The consensus genome
depths.png | Coverage plots | Determine genome coverage
report.tsv | QUAST report | Quality control
Aligned reads.bam | Initial reads that aligned to the reference genome | Can be used in a genome browser
ercc_stats.txt | ERCC spike-in stats | Used for QC of the ERCC control
no_host_1.fq.gz & no_host_2.fq.gz | Non-host raw reads | Upload to SRA
Primer trimmed.bam.bai | Aligned reads with trimmed primers (companion to the .bam file) | Used for interrogating coverage results and ensuring quality mappings
Primer trimmed.bam | Aligned reads with trimmed primers | Used for interrogating coverage results and ensuring quality mappings
stats.json | QC | Secondary QC if the coverage looks weird
.VCF | Variant call format | Can be used to view variants and identify SNP locations
VADR | Viral Annotation DefineR | Used to annotate and validate that consensus genomes can successfully be uploaded to GenBank and GISAID
The metrics to look out for are highlighted below. We have provided guidelines for creating a consensus genome with 92% coverage of the reference genome - this is just our recommendation and not part of the guidelines for submitting to public repositories.
Coverage plot (depths.png)
The coverage plot should be used to do an initial QC check of the consensus genome. The coverage plot is a depiction of the number of reads that cover the SARS-CoV-2 reference genome. The y-axis shows the number of reads (depth) and the x-axis shows the position on the genome. The greater the depth across the genome, the better the consensus genome.
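If you prefer a number to a plot, the breadth of coverage can be estimated from per-position depths. The sketch below assumes you have exported depths with `samtools depth -a` (three tab-separated columns: reference name, position, depth) from the primer-trimmed BAM file; the function name and the default threshold of more than 10 reads (taken from the base-calling rule described earlier) are illustrative choices.

```python
def coverage_breadth(depth_lines, genome_length=29_903, min_depth=10):
    """Percent of reference positions covered by more than `min_depth` reads.

    `depth_lines` is any iterable of `samtools depth` output lines
    (reference name, 1-based position, depth, separated by tabs).
    """
    covered = 0
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if int(fields[2]) > min_depth:
            covered += 1
    return 100.0 * covered / genome_length
```

For example, `coverage_breadth(open("depths.tsv"))` would return the percentage of the 29,903 positions with a depth above 10, where `depths.tsv` is a hypothetical file name for the exported depths.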
QUAST (report.txt)
QUAST (Quality Assessment Tool) evaluates and compares genome assemblies. While the QUAST report (report.txt) contains many metrics for reference, we recommend focusing on the following:
- Total Length: length of the consensus genome. The reference genome is 29,903 bp and the consensus genome should be close to that number.
- Genome Fraction: % of the reference genome covered. >95% of the reference genome should be covered.
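The two QUAST metrics above can also be pulled out of the tab-separated report programmatically. This sketch assumes the report has one metric per line as `name<TAB>value` and that the row labels are "Total length" and "Genome fraction (%)"; check these labels against your own report, since they are assumptions here. The function names are hypothetical.

```python
def parse_quast_report(lines):
    """Map each metric name in a tab-separated QUAST report to its value."""
    metrics = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0]:
            metrics[parts[0]] = parts[1]
    return metrics


def check_quast(metrics, reference_length=29_903, min_genome_fraction=95.0):
    """Compare total length and genome fraction with the guidance above."""
    total_length = int(metrics["Total length"])              # assumed label
    genome_fraction = float(metrics["Genome fraction (%)"])  # assumed label
    return {
        "total_length": total_length,
        "length_difference": total_length - reference_length,
        "genome_fraction": genome_fraction,
        "genome_fraction_ok": genome_fraction > min_genome_fraction,
    }
```

The sketch reports the length difference rather than a pass/fail flag for total length, because the guidance above only says the length should be "close to" 29,903 bp.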
Output stats (stats.json)
The stats.json file provides the following stats that we believe are important to focus on. These stats are also provided in the user interface.
- Total Reads - total reads sequenced
- Mapped Reads - number of reads that mapped to the reference genome
- Ref SNPs - number of single nucleotide polymorphisms
If qPCR was performed on the samples, the Ct value can help determine how likely you are to get a full genome based on the number of reads sequenced. The lower the Ct value, the higher the viral load; samples with high Ct values contain little virus, so it becomes more difficult to sequence the genome when the Ct value is >30. For example, according to the table below, if mNGS was done, you would need >40 million reads to recover the SARS-CoV-2 genome from a sample with a Ct value between 25 and 30.
Expected read # needed to recover the genome, based on Ct value and sequencer. The Ct values are based on two qPCR assays, one that targeted the N gene and the other targeted the E gene. This chart is provided by Amy Kistler, Jack Kamm, and the CZ Biohub.
Method | Ct<20 | Ct 20-25 | Ct 25-30 | Ct 31+
mNGS | 1-4 million | >4 million | >40 million | not possible to recover full genome
MSSPE | 100k-1 million | 1 million | 10 million | not possible to recover full genome
ARTIC amplicon | 30k | 100k | 1 million | recovery is spotty and with higher rates of sequencing error, but still possible for some genomes with 100k+ reads
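For batch planning, the table can be encoded as a small lookup. This is only a convenience sketch: the function names are hypothetical, and the handling of bin edges (a Ct of exactly 20 or 25 goes to the higher bin, and anything above 30 goes to the last column) is an assumption, since the table does not say how boundary values are treated.

```python
EXPECTED_READS = {
    "mNGS": ["1-4 million", ">4 million", ">40 million",
             "not possible to recover full genome"],
    "MSSPE": ["100k-1 million", "1 million", "10 million",
              "not possible to recover full genome"],
    "ARTIC amplicon": ["30k", "100k", "1 million",
                       "spotty recovery; possible for some genomes with 100k+ reads"],
}


def ct_bin(ct):
    """Return the table column index for a Ct value (bin edges are assumed)."""
    if ct < 20:
        return 0
    if ct < 25:
        return 1
    if ct <= 30:
        return 2
    return 3


def expected_reads(method, ct):
    """Look up the expected read requirement for a method and Ct value."""
    return EXPECTED_READS[method][ct_bin(ct)]
```

For instance, `expected_reads("mNGS", 27)` returns `">40 million"`, matching the example given in the text.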
Consensus sequence (consensus.fa)
This is the consensus genome FASTA file. Once QC has been done, the consensus genome(s) can be uploaded to public repositories such as GISAID, Nextstrain, and GenBank.
Aligned reads (primertrimmed.bam & .bai)
This file contains the primer-trimmed reads that aligned to the reference genome. If you are interested, the reads can be downloaded and viewed against the reference genome using samtools or IGV, but the index file (primertrimmed.bai) must also be uploaded with the .bam file. IGV has a web application where the reference genome and the reads that have aligned can be viewed. It can be found here.
ERCC stats
If an ERCC spike-in was performed, the ERCC reads are quantified and can be found in the file ERCC stats.bam. In a clean, high-quality sequenced sample, read counts for each ERCC sequence should linearly track their spike-in concentrations.
sample variants (VCF.gz)
VCF stands for variant call format and this vcf.gz file can also be viewed using IGV to determine the number of variants within a host and to identify the SNP locations. More about viewing a .VCF file can be found here.
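If you would rather list SNP positions without opening a genome browser, the VCF file can be read as plain text. The sketch below relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...) and treats a record as a SNP when both REF and ALT are single bases; the function names are hypothetical.

```python
import gzip


def iter_vcf_lines(path):
    """Yield lines from a VCF file, transparently handling .gz compression."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def snp_positions(vcf_lines):
    """Return (position, ref, alt) for every single-base substitution."""
    snps = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed record
        pos, ref = int(fields[1]), fields[3].upper()
        for alt in fields[4].upper().split(","):
            if len(ref) == 1 and alt in {"A", "C", "G", "T"}:
                snps.append((pos, ref, alt))
    return snps
```

Calling `snp_positions(iter_vcf_lines("sample.vcf.gz"))` (with a hypothetical file name) gives a list whose length can be compared against the 30-SNP guideline mentioned earlier.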
Viral Annotation DefineR (VADR)
VADR is a suite of tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. It has been tested primarily for the analysis of Norovirus, Dengue, and SARS-CoV-2 virus sequences in preparation for submission to the GenBank database.
Consensus genomes submitted to GenBank or GISAID are automatically evaluated using VADR before being accepted. If the consensus genome is divergent in various ways (e.g. early stop codon, regions of low nucleotide similarity), then the sequence fails. Failure means that the sequence is flagged for manual review by an NCBI expert curator, called a “GenBank indexer”. If all sequences in a submission pass VADR checks, all of the sequences will automatically be deposited into GenBank. Unfortunately, if one genome in a large batch of genomes fails, the whole submission will be kicked back to the uploader. By checking the VADR output from the consensus genome pipeline before submission, you can ensure all of the genomes you plan to submit will pass the evaluation.
Once you have downloaded the zip file, if you are using a Windows device, you will need to right-click on the VADR files, choose ‘open with’, and open with your preferred application such as Notepad, Microsoft Word, or Excel.
Vadr-output.vadr.sqc - will tell you if the consensus genome will pass or fail the NCBI submission evaluation. Check out the column with the header “p/f” to see if the consensus genome passed (“p” = “pass”; “f” = “fail”). If you are interested, you can learn about the other fields included in this output file here.
If the consensus genome passed, the vadr-output.vadr.alt.list file will be empty. If it failed, this file will list the error(s) that caused the failure. See the errors and their descriptions here.
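When you have many genomes, the "empty alert list means pass" rule can be checked in bulk. This sketch assumes that header lines in the .alt.list file begin with "#" and that fields are tab-separated, so a file containing only header lines counts as empty; verify this against your own VADR output. The function names are hypothetical.

```python
def vadr_alerts(alt_list_lines):
    """Return the alert rows from a VADR .alt.list file as lists of fields."""
    alerts = []
    for line in alt_list_lines:
        if not line.strip() or line.startswith("#"):
            continue  # assumed: '#' marks header/comment lines
        alerts.append(line.rstrip("\n").split("\t"))
    return alerts


def passes_vadr(alt_list_lines):
    """A genome passes when its alert list contains no alert rows."""
    return not vadr_alerts(alt_list_lines)
```

For example, `passes_vadr(open("vadr-output.vadr.alt.list"))` returns True only when no alerts are listed; the rows returned by `vadr_alerts` can then be compared with the error descriptions linked above.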
We recommend checking the VADR output prior to uploading to public repositories to avoid uploads being rejected.
Reference: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
Upload to public repositories
Sharing SARS-CoV-2 consensus genomes and non-host sequence data is essential to the COVID-19 pandemic response. Submitting data to both the Global Initiative for Sharing All Influenza Data (GISAID) EpiCoV™ (https://www.gisaid.org) and NCBI repositories ensures that data are widely available, maximizing the impact of these data for public health surveillance and research. Consensus genomes should promptly be uploaded to GISAID and NCBI. Non-host reads can be submitted to the SRA at a later date.
Upload to GISAID
- Create a user account. It may take a few days for GISAID to approve your account, so we recommend registering early if you plan on submitting a consensus sequence in the future.
- Under the EpiCoV tab, click Upload.
- You can perform a single upload or a batch upload. If you are doing a batch upload, GISAID will prompt you to provide an e-mail and phone number so directions can be sent to you.
- For a single upload, the following is needed (write “unknown” if the information is not known):
- Virus name in the following format hCoV-19/Country/Identifier/2020
- Collection date
- Location (continent/country/region)
- Passage details/history: if not from cell culture write “original”
- Host
- Gender
- Patient age
- Patient status
- Outbreak detail
- Sequencing technology
- Originating lab
- Address
- Submitting lab
- Authors
- Fasta file of sequence (provided by CZ ID “consensus seqs.fasta”)
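For batch uploads it can help to generate virus names consistently. The sketch below builds and loosely checks names in the hCoV-19/Country/Identifier/Year format shown above; the function names and the regular expression are illustrative assumptions, and GISAID's own validation may be stricter.

```python
import re

# Loose pattern for hCoV-19/Country/Identifier/Year (assumed, not GISAID's rule).
VIRUS_NAME_PATTERN = re.compile(r"^hCoV-19/[^/]+/[^/]+/\d{4}$")


def build_virus_name(country, identifier, year):
    """Assemble a virus name from its three variable parts."""
    return f"hCoV-19/{country}/{identifier}/{year}"


def is_valid_virus_name(name):
    """Check that a name has the four slash-separated parts in order."""
    return bool(VIRUS_NAME_PATTERN.match(name))
```

For example, `build_virus_name("USA", "CA-1234", 2020)` returns `"hCoV-19/USA/CA-1234/2020"`, where the identifier is a made-up placeholder.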
Upload to GenBank
The consensus genome should also be uploaded to GenBank. For NCBI submissions, BioSample records should be created upon GenBank record submission. The same BioSample should be used when submitting the corresponding raw data to SRA.
Upload to the SRA database
Non-host reads (.fastq) and aligned reads (aligned reads.bam) can be downloaded and uploaded to the SRA database.
- All file names must be unique and cannot contain any sensitive information. File names as submitted appear publicly in the Google and AWS clouds.
- Each file must be listed in the SRA metadata table. If you are uploading a tar archive, list each file name, not the archive name.
- Use the preload option if you are uploading files over 10 GB or more than 300 files. All files for a submission must be uploaded into a single folder. Options to preload data:
- Aspera browser plugin upload
- Aspera command-line upload
- FTP upload
- Amazon S3 instructions
- SRA Submission Wizard Help. Contact sra@ncbi.nlm.nih.gov with any question or concern about your data or submission.
Additional analysis of SARS-CoV-2 consensus genomes
Use Nextclade to determine the clade and perform additional QC checks
Nextclade is a free web tool that identifies the differences between the consensus genome(s) you upload and the reference genome MN908947.3. By uploading your sequences to Nextclade, you will be able to view:
- The clade of your genome based on differences from the reference genome. Nextstrain has grouped variants of SARS-CoV-2 into clades based on specific signature mutations.
- The location and base call of SNPs
- The open reading frames
- QC checks: gaps and Ns
Running a BLAST to gather important metrics
Running a BLAST search is a helpful way to orient yourself on where differences between the reference genome and consensus genome are located. We recommend navigating to the blastn page and selecting the “align two or more sequences” option (captured below). All your consensus sequences can be run against the reference genome. Noting the % ID will be useful for determining which genomes are more divergent.
Creating an alignment
Building a multiple sequence alignment is an important step for visualizing differences and similarities between sequences and for creating a phylogenetic tree. If you are interested in creating an alignment of your sequences, the following steps should be taken to limit phylogenetic noise. The following steps are from a publication that can be found here.
- Mask the ends of the genome- they are prone to sequencing errors (~55 bp on each end)
- Mask potential sequencing artifacts or hypermutable sites that contribute to phylogenetic noise. This includes the following sites at positions: 187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, and 29700
- Mask the following positions that have a variation that is lab or geographically specific: 4050 and 13401
- Mask the positions that are hypermutable: 11053, 15324, and 21575
- Position 11083 is associated with a strange deletion pattern or a G to T SNP that is most likely due to sequencing error and should be ignored.
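The masking steps above can be applied to each sequence before tree building. This sketch assumes the sequence is already in reference coordinates (aligned to the 29,903 bp reference with insertions removed), so that a 1-based position maps directly to a string index. Masking position 11083 together with the other sites is one way to "ignore" it and is an assumption here; the names are hypothetical.

```python
ARTIFACT_SITES = [187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741,
                  11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378,
                  25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700]
LAB_OR_GEO_SPECIFIC_SITES = [4050, 13401]
HYPERMUTABLE_SITES = [11053, 15324, 21575]
IGNORED_SITES = [11083]  # assumption: ignore by masking
MASK_POSITIONS = set(ARTIFACT_SITES + LAB_OR_GEO_SPECIFIC_SITES
                     + HYPERMUTABLE_SITES + IGNORED_SITES)
END_MASK = 55  # approximate number of bases masked at each end


def mask_sequence(aligned_seq, positions=MASK_POSITIONS, end_mask=END_MASK):
    """Replace both genome ends and the listed 1-based positions with N."""
    seq = list(aligned_seq)
    length = len(seq)
    for i in range(min(end_mask, length)):          # 5' end
        seq[i] = "N"
    for i in range(max(length - end_mask, 0), length):  # 3' end
        seq[i] = "N"
    for pos in positions:
        if 1 <= pos <= length:
            seq[pos - 1] = "N"
    return "".join(seq)
```

Masking with N rather than deleting the columns keeps all sequences the same length, so the alignment coordinates stay comparable to the reference.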
Determining within-sample variation
- Download the reads that align to the reference genome.
- Variants should only be considered real if the depth of coverage at that position is ≥ 5 reads.
- Generally, most samples contain 0-5 variants, with a median of 1 and a mean of 8.2.
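The depth rule above can be applied to a VCF file as a simple filter. This sketch assumes each variant record carries its depth as a `DP=` entry in the INFO column (the eighth column), which is common but not guaranteed for every variant caller; records without `DP` are skipped rather than guessed. The function name is hypothetical.

```python
import re

DP_PATTERN = re.compile(r"(?:^|;)DP=(\d+)")


def variants_with_min_depth(vcf_lines, min_depth=5):
    """Keep variants whose INFO depth (DP) is at least `min_depth`."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # no INFO column
        match = DP_PATTERN.search(fields[7])
        if match is None:
            continue  # depth unknown: skip rather than guess
        depth = int(match.group(1))
        if depth >= min_depth:
            kept.append((int(fields[1]), fields[3], fields[4], depth))
    return kept
```

The number of rows returned per sample can then be compared with the typical range of 0-5 variants noted above.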