Jump to Section:
Overview
The consensus genome pipeline differs from the mNGS pipeline in that it focuses on a single viral taxon of choice. There are two things to note: 1) if you are doing metagenomic sequencing with spiked primer enrichment (MSSPE) for SC2 specifically, upload to SC2 pipeline and choose MSSPE as the wet lab method to ensure that the primers are trimmed, and 2) this pipeline should be on metagenomic samples prepped without any enrichment steps because there is no primer trimming step.
This article covers:
- How to create a consensus genome in CZ ID
- Pipeline overview and validation
- Performing QC check to ensure a high-quality consensus genome
- Troubleshooting QC issues
- Uploading to Genbank
Building a viral consensus genome in CZ ID
If a sample has viral taxon present at high levels, it may be possible to build a consensus genome. Consensus genomes can only be created for viral species with at least one contig aligning to the NCBI nucleotide database.
To build a consensus genome, navigate to the viral species of interest and click the icon, which will be active on species eligible for consensus genome creation.
Choosing a reference genome
A modal will appear with accession numbers that can be chosen as the reference genome for the consensus genome pipeline.
The available accession numbers are the same as those shown in the coverage visualization. These accessions are pulled during the mNGS pipeline when sequences have a minimum of 36 bp of homology and one contig aligning to the NCBI’s NT database. Not all available accessions chosen as a reference will yield the same quality consensus genome and should be selected based on the following criteria:
- Coverage visualization - can be leveraged to choose the best reference genome by selecting the accession with the highest depth and longest breadth of coverage.
- % ID - generally, the higher the % identity, the better the resulting consensus genome.
- Accession length - NCBI has full-length and partial genomes in their database. When choosing a reference genome, it is essential to pick a genome and not just a partial sequence.
- Literature review - it can also help do a literature search to determine if there is a specific reference genome that is typically used. One caveat is that only the accessions in the coverage plots will be available to choose for a reference genome.
The results will be available to view in the consensus genome tab. If more than one reference genome is used for the same viral species, the latest run metrics will be present in the sample results table. To view the other runs, click on the same and choose the reference genome in the dropdown menu.
Pipeline overview and validation
Pipeline Overview
The CZ ID viral consensus genomes pipeline was adapted from the initial SARS-CoV-2 consensus genomes pipeline, which was released in August 2020. Modifications to the pipeline include removing the SARS-CoV-2-specific Kraken filtering and primer trimming steps. After human and host reads are filtered out, the remaining reads are aligned to the reference accession of choice. It should be noted that there is no primer trimming step in this protocol and, therefore, should not be used on viruses enriched by MSSPE. Here is an overview of all of the steps in the consensus genome pipeline:
- Non-host reads are aligned to the reference genome of choice using minimap2.
- The aligned reads are then trimmed using trim galore. This step removes adapter sequences, reads of low quality (defined here as a Phred score <20), and sequences shorter than 20 bp.
- The consensus genome is called using iVar consensus. A base is called as long as it has a depth of 5 or more reads. If a base cannot be called, it is identified as N.
- Variants (SNPs) are called to evaluate sequence differences as compared to the reference genome. This is done using samtools and bcftools.
- Additional quality metrics are computed using QUAST and other python scripts.
Validation
To validate that the pipeline could generate quality consensus genomes, two approaches were taken.
- Publicly available data (raw data in SRA) with an associated public consensus genome sequence, were used to evaluate the ability to reconstruct the published sequence using our pipeline.
- For viruses where public data was not readily accessible or traceable, generate simulated data from reference genomes (using InSilicoSeq REF) and reconstruct the original sequence.
- For a subset of samples where raw data is available, but no published genome exists, genomes were manually verified by taking into account QC metrics and genome orientation.
The initial validation experiments focused on organisms supported by NextStrain builds (https://nextstrain.org/) or spanned the viral type diversity (ssRNA, dsDNA, segmented, etc.).
In total, 30 samples were tested and verified across the following range of 17 viral species:
- Chikungunya Virus
- Dengue Virus
- Enterovirus D68
- Epstein-Barr Virus
- Hepatitis B virus
- Human coxsackievirus
- Human metapneumovirus
- Human parechovirus
- Human respirovirus
- Human RSV A
- Mumps
- Norovirus GII
- Rhinovirus C
- Rotavirus A
- West nile virus
- Zaire ebolavirus
- Zika Virus
When compared against known published genomes, samples achieved > 99.9% identity to the known reference sequence. All deviations were interrogated manually and attributed to areas of ambiguity where the CZ ID pipeline would call an IUPAC ambiguity code and the published genome would contain a standard base (ACGT). The CZ ID viral consensus genome pipeline is conservative in calling consensus bases. It requires at least 10x coverage to make any base call and > 90% frequency of a single nucleotide to call a standard base. Below that, it will revert to IUPAC ambiguity codes. In all cases of ambiguity-induced deviation from the published genome, the IUPAC code identified included the published nucleotide. This highlights that the pipeline is sensitive to the true consensus sequence and reiterates the importance of manual inspection in cases where a genome has low coverage or high numbers of ambiguous bases (a metric which is reported in the output stats.json file).
A note on segmented genomes - In the v0 version of the CZ ID viral consensus genome pipeline, is run against reference accessions from the NCBI database. In cases of segmented genomes, their reference accessions may be split across segments. In those cases, the “consensus genome” will result in only the consensus sequence for the selected segment. It is possible to re-run the analysis with a different segment as the “reference genome” to obtain each segment independently. This is an area for active future development of the pipeline.
It is possible to run the pipeline outside of the web interface (as noted here), in which case you can deal with segmented genomes by concatenating each segment with a stretch of NNNs to generate a consensus sequence in one iteration.
As you use the pipeline, we would love to hear about how your work progresses and where there may be room for improvement! Don’t hesitate to send us a message.
Metrics & performing QC checks
Metrics
- Coverage plot - the number of times a nucleotide is read during a sequence. The consensus genome must have >5 reads for a specific location on the genome for a base to be called.
- % genome called - Recovering a complete genome is essential for phylogenetic analysis.
- The number of single nucleotide polymorphisms (SNPs) - these are variations of a single base between reference and consensus genomes.
- Informative bases - provided the number of C,T,G,A in the genome.
- Ambiguous bases - If many sequencing reads support more than one base at a site, those sites will be designated with an IUPAC ambiguity code.
- Mapped reads - number of reads that mapped to the reference genome
- GC content - the GC content is the percentage of G and C in the consensus sequence. The GC content of the consensus sequence should be close to that of the reference accession.
It will be easiest to identify potential errors of the consensus genome if you are familiar with the genome. International Committee on Taxonomy of Viruses (ICTV) is a great resource for learning about the virus of interest. Take note of:
- genome size- how many base pairs is it?
- genome organization- how many open reading frames (ORFs) and what is their orientation?
- repeat regions- if reads don’t span this region (long reads) the assembly over these regions cannot be trusted
- Be aware of low/high GC content areas because these areas are likely to incur sequencing bias or errors.
Checklist:
- Coverage plot: The coverage plot should be used to do an initial QC check of the consensus genome. The coverage plot is a depiction of the number of reads that cover the SARS-CoV-2 reference genome. The y-axis shows the number of reads (depth), and the x-axis shows the position on the genome. The greater the coverage across the genome, the better the consensus genome.
- SNPs: an acceptable number of SNPs will differ depending on the virus (type and genome length). A general rule of thumb is double-stranded DNA viruses evolve more slowly than RNA viruses, so they should contain fewer SNPs. Single-stranded DNA viruses evolve at similar rates to RNA viruses.
- Ambiguous bases: The fewer ambiguous bases, the better the genome quality. We recommend having less than ten ambiguous bases.
- Gaps: Gaps in the genome will reduce the quality of the consensus genome and the quality of downstream applications such as creating phylogenetic trees.
- Frameshift mutations. Frameshift mutations happen when there are deletions, insertions, or SNPs that affect the open reading frames.
- Align the consensus genome back to the reference genome
- Check the open reading frames
- If you have Geneious, you can use this software. If not, you can do this in BLAST.
- Make sure the ORFs are in line with the reference genome
- If they are not, have a closer look at the alignment and check for insertions or deletions (it can be as simple as an insertion as an N).
Available intermediate files
Intermediate files are available to download to troubleshoot any QC issues that may come up. While many of the useful metrics are reported on the consensus genome results page, these files can be accessed from the "download all' icon.
File |
Description |
Use |
consensus.fa |
The consensus genome! |
The consensus genome |
depths.png |
Coverage plots |
Determine genome coverage |
report.tsv |
QUAST report |
Quality Control |
Aligned reads.bam |
Initial reads that aligned to the reference genome |
Can be used in a genome browser to view the alignment at the read level to interrogate SNPs, ambiguous bases, etc. |
ercc_stats.txt |
ERCC spike-in stats |
Used for QC of ERCC control. Note: ERCC stats are also computed during the mNGS pipeline and are available in the sample details panel. The metrics may differ slightly due to different calculation methods. |
no_host_1.fq.gz & no_host_2.fq.gz |
Non-host raw reads |
The non-host reads are also available for bulk download from the sample table. These should be uploaded to the sequence read archive (SRA). |
Primer trimmed.bam.bai |
Aligned reads with trimmed primers (companion to .bam file |
used with primertrimmed.bam file to view in genome browser |
Primer trimmed.bam |
Aligned reads with trimmed primers Note: no primers are trimmed in this pipeline |
This should be the same file as the aligned.bam file because no primers were trimmed. Can be used to view the alignment at the read level to interrogate SNPs, ambiguous bases, etc. |
stats.json |
QC |
Secondary QC if the coverage looks weird |
.VCF |
Variant call format |
Can be used to view variants and identify SNP locations. Can be viewed using IGV to determine the number of variants within a host and to identify the SNP locations. More about viewing a VCF file can be found here. |
The bam files are used to view how the reads aligned to the reference genome and are useful for how to troubleshoot QC issues such as ambiguous bases and SNPs.
Troubleshooting poor quality consensus genomes
The aligned reads (primertrimmed.bam & .bai) file contains the reads that aligned to the reference genome (same as alignmed.bam). The reads can be downloaded and aligned back to the reference genome using a genome browser such as Geneious or IGV. If using IGV, note that the index file (primertrimmed.bai) must be uploaded with the .bam file. IGV has a web application where you can view the reference genome and the aligned reads .
Gaps or low coverage of consensus genome
Areas with low or no sequencing coverage have no information to tell you which base should be at that site. These sites are labeled with N’s. When a sequence has two many N’s it is both hard to align and place on a tree, and thus they are removed from analyses.
Options:
- Choose a new reference genome.
- Design primers to fill in gaps.
- Resequence sample with fewer samples for deeper sequencing, which should result in more coverage.
- If the sample is resequenced, there is the option to concatenate fastq files prior to CZ ID upload to increase the sequencing depth. See how to concatenate files here.
Ambiguous bases
If many sequencing reads support more than one base at a site, those sites will be designated with an IUPAC ambiguity code, that tells you which set of mutations were found at the site.
Options:
- View the primertrimmed.bam file.
- Our pipeline requires 75% frequency of a base at a specific location. By checking the bam file for bases that can be called (i.e. 74 of one base and 25 another)can lower the number of ambiguous bases.
- Make sure to pay attention to the location of the ambiguous bases- the ends of reads tend to have lower-quality bases and are therefore less trustworthy.
- PCR area of ambiguity and sequence (can sanger sequence amplicon)
SNPs
If a consensus sequence has bases that differ from the reference genome a SNP will be called.
Options:
- View the primertrimmed.bam file.
- If there is high coverage in that location and all of the reads show the same base call is a good sign that mutation is real.
- If there is low coverage and/or reads with different base calls, it could be a sign of contamination.
Initiate a bulk download
- Select the samples that you want to be included in your download by clicking on the checkbox to the left of each sample name. You can select all samples in a project by clicking the checkbox in the table header. All of the consensus genomes created for a specific species will be downloaded.
- Click on the Download button in the upper-right hand corner of the table.
- A modal will appear with options for download types. There are three types of downloads:
-
-
- Consensus Genomes (Fasta)
- Metadata (CSV)
- Consensus Genome Overview (CSV)
-
- Click Start Generating Download at the bottom of the modal.
- The modal will disappear, and you will see a notification in the upper-right hand corner of your screen. You will see a link in that notification to check on the status of your download.
- CZ ID will now start preparing your data for download. This involves calculating and compiling the data from the various samples you selected into a single file that is ready to download. This process can take several hours if hundreds of samples are selected. When the process is complete, you will be able to download your file from the Downloads page.
- The files will be compressed into a tar.gz file. If you are using Windows, you can use this document to learn how to open this file type.
Upload to public repositories
Genbank
The consensus genome should also be uploaded to Genbank. For NCBI submissions, BioSample records should be created upon GenBank record submission. The same BioSample should be used when submitting the corresponding raw data to SRA.
Sequence Read Archive (SRA)
Non-host reads (.fastq) and aligned reads (.bam) can be downloaded and uploaded to the SRA database.
- All file names must be unique and cannot contain any sensitive information. File names as submitted appear publicly in the Google and AWS clouds.
- Each file must be listed in the SRA metadata table. If you are uploading a tar archive, list each file name, not the archive name.
- Use the preload option if you are uploading files over 10 GB or more than 300 files. All files for a submission must be uploaded into a single folder. Options to preload data:
- Aspera browser plugin upload
- Aspera command-line upload
- FTP upload
- Amazon S3 instructions
- SRA Submission Wizard Help. Contact sra@ncbi.nlm.nih.gov with any question or concern about your data or submission.
Comments
0 comments
Please sign in to leave a comment.