Overview
Here we provide an overview of the pipeline used to assemble viral consensus genomes from the Sample Report. In addition, we outline approaches used to validate the pipeline.
After reading this guide, you will:
- Become familiar with the consensus genome pipeline workflow
- Understand how the pipeline was validated
Pipeline Overview
The CZ ID viral consensus genome pipeline, which assembles genomes directly from the Sample Report, was adapted from the initial SARS-CoV-2 genome pipeline released in August 2020. The modifications remove the SARS-CoV-2-specific Kraken filtering and primer trimming steps. After human reads are filtered out, the remaining reads are aligned to the reference accession of choice. Because this workflow has no primer trimming step, it should not be used with data obtained through enrichment (MSSPE) or PCR-based methods.
Here is an overview of all of the steps in the consensus genome pipeline:
- Human reads are removed by aligning sequencing reads against the HG38 reference genome using minimap2.
- Non-human reads are aligned to the reference genome of choice using minimap2.
- The aligned reads are then trimmed using Trim Galore. This step removes adapter sequences, low-quality reads (defined here as a Phred score < 20), and sequences shorter than 20 bp.
- The consensus genome is called using iVar consensus. A base is called as long as it has a depth of 10 or more reads. If a base cannot be called, it is identified as N.
- Variants (SNPs) are called to evaluate sequence differences between the newly assembled consensus genome and the reference. This is done using SAMtools and BCFtools.
- Additional assembly metrics are computed using QUAST and other Python scripts.
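The trimming thresholds in the steps above can be illustrated with a small sketch. This is not Trim Galore's actual algorithm, and it is not code from the CZ ID pipeline: it is a simplified model that assumes quality trimming works by removing trailing bases with a Phred score below 20 and then discarding any read shorter than 20 bp. The function name `trim_read` and the constants are hypothetical.

```python
# Simplified illustration of the trimming thresholds described above.
# Assumption: low-quality bases are removed from the 3' end of each read;
# Trim Galore itself uses a more elaborate trimming algorithm.
MIN_PHRED = 20   # bases below this Phred score count as low quality
MIN_LENGTH = 20  # reads shorter than this (in bp) are discarded


def trim_read(seq, quals):
    """Trim low-quality bases from the 3' end of a read.

    seq   -- read sequence as a string
    quals -- list of per-base Phred scores, same length as seq
    Returns (trimmed_seq, trimmed_quals), or None if the trimmed read
    is shorter than MIN_LENGTH.
    """
    end = len(seq)
    # Walk back from the 3' end while the last kept base is low quality.
    while end > 0 and quals[end - 1] < MIN_PHRED:
        end -= 1
    if end < MIN_LENGTH:
        return None
    return seq[:end], quals[:end]


if __name__ == "__main__":
    # A 25 bp read whose last three bases are low quality is trimmed to 22 bp.
    print(trim_read("A" * 25, [30] * 22 + [10] * 3))
    # A read that drops below 20 bp after trimming is discarded.
    print(trim_read("A" * 25, [30] * 15 + [10] * 10))
```

A read is kept only if at least 20 bases survive trimming, which is why the second example above is dropped entirely.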
Pipeline Validation
We evaluated whether the pipeline could generate high-quality consensus genomes by testing its ability to reconstruct consensus viral genomes from three types of data:
- Publicly available data (raw data in SRA) with an associated public consensus genome sequence.
- Simulated data generated from reference genomes using InSilicoSeq, for viruses with no usable next generation sequencing (NGS) data (public data was not readily accessible or traceable).
- Manually verified genomes for a subset of viruses for which NGS data is available, but no published genome exists. Genomes were verified taking into account QC metrics and genome orientation.
The initial validation experiments focused on organisms supported by Nextstrain builds (https://nextstrain.org/) and spanned a diversity of viral genome types (e.g., ssRNA, dsDNA, segmented). In total, 30 samples were tested and verified. The following 17 viral species were evaluated across samples:
- Chikungunya Virus
- Dengue Virus
- Enterovirus D68
- Epstein-Barr Virus
- Hepatitis B virus
- Human coxsackievirus
- Human metapneumovirus
- Human parechovirus
- Human respirovirus
- Human RSV A
- Mumps
- Norovirus GII
- Rhinovirus C
- Rotavirus A
- West Nile virus
- Zaire ebolavirus
- Zika Virus
When compared against known published genomes, samples achieved > 99.9% identity to the known reference sequence. All deviations were interrogated manually and attributed to areas of ambiguity where the CZ ID pipeline would call an IUPAC ambiguity code and the published genome would contain a standard base (ACGT). The CZ ID viral consensus genome pipeline is conservative in calling consensus bases. It requires at least 10x coverage to make any base call and > 75% frequency of a single nucleotide to call a standard base. Below that, it will revert to IUPAC ambiguity codes. In all cases of ambiguity-induced deviation from the published genome, the IUPAC code identified included the published nucleotide. This highlights that the pipeline is sensitive to the true consensus sequence and reiterates the importance of manual inspection in cases where a genome has low coverage or high numbers of ambiguous bases (a metric which is reported in the output stats.json file).
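The calling rule described above can be sketched in a few lines of Python. This is a simplified model, not the pipeline's code: it assumes that when no single nucleotide exceeds 75% frequency, the most frequent bases are added one at a time until their combined frequency exceeds the threshold, and the matching IUPAC code is emitted. The names `call_base` and `is_compatible` are hypothetical.

```python
from collections import Counter

# IUPAC nucleotide codes keyed by the set of bases each one represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
# Reverse lookup: code -> set of bases it can stand for.
BASES_FOR_CODE = {code: bases for bases, code in IUPAC.items()}

MIN_DEPTH = 10   # at least 10x coverage to make any call
MIN_FREQ = 0.75  # a single base needs > 75% frequency to be called


def call_base(pileup):
    """Call one consensus position from a string of observed bases."""
    counts = Counter(b for b in pileup.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < MIN_DEPTH:
        return "N"  # too little coverage to call anything
    chosen, cumulative = set(), 0
    # Assumed behavior: add bases from most to least frequent until
    # their combined frequency exceeds the threshold.
    for base, n in counts.most_common():
        chosen.add(base)
        cumulative += n
        if cumulative / depth > MIN_FREQ:
            break
    return IUPAC[frozenset(chosen)]


def is_compatible(code, published_base):
    """True if an IUPAC code includes the published nucleotide."""
    return published_base.upper() in BASES_FOR_CODE[code.upper()]


if __name__ == "__main__":
    print(call_base("A" * 9))             # depth below 10
    print(call_base("A" * 8 + "G" * 2))   # A at 80%
    print(call_base("A" * 7 + "G" * 3))   # A at 70%, so A/G ambiguity
    print(is_compatible("R", "G"))
```

The `is_compatible` helper mirrors the manual check described above: a deviation counts as ambiguity-induced when the IUPAC code at that position includes the nucleotide in the published genome.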
Note on segmented genomes
The CZ ID viral consensus genome pipeline accessed through the Sample Report assembles genomes against reference accessions from the NCBI database. For segmented genomes, the reference accessions may be split across segments. In those cases, the “consensus genome” will contain only the consensus sequence for the selected segment. It is possible to re-run the analysis with a different segment as the “reference genome” to obtain each segment independently. This is an area of active development for the pipeline. In the meantime, if you are interested in providing your own reference sequence (e.g., a reference where segments are concatenated into a single sequence), please see our guide for uploading data directly to the Viral Consensus Genome pipeline.
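As an illustration of the concatenated reference mentioned above, here is a minimal sketch that joins the records of a multi-segment FASTA into one sequence. The function name `concatenate_segments` and the default header are hypothetical, and whether a reference built this way suits a particular analysis is an assumption to check against the upload guide. Note that segment boundaries are not recorded in the output.

```python
def concatenate_segments(fasta_text, name="concatenated_reference"):
    """Join all records of a multi-record FASTA into a single record.

    fasta_text -- contents of a FASTA file with one record per segment
    name       -- header to use for the combined record (hypothetical default)
    Segments are joined in the order they appear in the input.
    """
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous segment, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    joined = "".join(sequences)
    # Wrap sequence lines at 60 characters, a common FASTA convention.
    wrapped = "\n".join(joined[i:i + 60] for i in range(0, len(joined), 60))
    return f">{name}\n{wrapped}\n"


if __name__ == "__main__":
    example = ">segment_1\nACGT\nAC\n>segment_2\nGGTT\n"
    print(concatenate_segments(example), end="")
```

Keeping a note of each segment's length alongside the combined file makes it possible to map positions in the consensus back to individual segments later.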
As you use the pipeline, we would love to hear about how your work progresses and where there may be room for improvement! Don’t hesitate to send us a message.