Overview
The CZ ID SARS-CoV-2 pipeline enables you to assemble consensus genomes and explore genome coverage. Here we describe the pipelines used to assemble SARS-CoV-2 genomes from short- (Illumina) and long-read (Nanopore) data.
Pipeline for SARS-CoV-2 Illumina data
The SARS-CoV-2 consensus genome pipeline was created by CZ Biohub. Check out their Nextflow pipeline here. The consensus genome pipeline differs from the mNGS pipeline in that it focuses solely on SARS-CoV-2. Host reads are still filtered out, but the remaining reads are aligned to the 29,903 bp reference genome (SARS-CoV-2 Wuhan-Hu-1) instead of all taxa in the NCBI index database. This reference genome is the community standard for downstream phylogenetic analysis.
The consensus genome pipeline includes the following steps:
- Host reads are filtered out by aligning the reads to the HG38 human reference genome.
- If the sample was spiked with ERCCs, those are quantified.
- Reads are aligned to the SARS-CoV-2 reference genome (accession ID MN908947.3) using minimap2.
- Reads that aligned to the reference genome are then classified using Kraken2 against a database containing SARS-CoV-2 sequences. This step removes any aligned reads that are not SARS-CoV-2.
- Now that the majority of the contaminating reads have been removed, the remaining reads are trimmed using Trim Galore. This step removes adapter sequences, low-quality reads (defined here as a Phred score <20), and sequences shorter than 20 bp.
- The trimmed reads are then aligned again to the SARS-CoV-2 reference genome (accession ID MN908947.3), using minimap2.
- Primers are trimmed using iVar. iVar uses primer positions provided in a BED file (automatically added by the pipeline) to soft clip primer sequences from an aligned and sorted BAM file.
- The consensus genome is called using iVar consensus. A base is called only at positions with a depth of 10 or more reads. If a base cannot be called due to low coverage, it is reported as a missing base ("N").
- Variants (SNPs) are called to evaluate sequence changes relative to the reference genome. This is done using SAMtools and BCFtools.
- Additional quality metrics are computed using QUAST and other Python scripts.
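The depth rule in the consensus-calling step above can be illustrated with a small Python sketch. This is not iVar's implementation: it is a simplified majority vote over hypothetical per-position base lists, and the function name `call_consensus` and the input format are assumptions made for illustration. Only the minimum depth of 10 and the use of "N" for uncalled positions come from the description above.

```python
from collections import Counter

# From the pipeline description: a base is called only at depth >= 10.
MIN_DEPTH = 10


def call_consensus(pileup_columns, min_depth=MIN_DEPTH):
    """Return a consensus string from per-position base observations.

    pileup_columns is a hypothetical input format: one string per
    reference position, holding every base observed at that position.
    """
    consensus = []
    for bases in pileup_columns:
        if len(bases) < min_depth:
            # Too few reads cover this position: report a missing base.
            consensus.append("N")
        else:
            # Simplified rule (an assumption): take the most frequent base.
            base, _count = Counter(bases).most_common(1)[0]
            consensus.append(base)
    return "".join(consensus)


# Four positions with depths 12, 9, 12, and 0.
columns = ["A" * 12, "C" * 9, "G" * 10 + "T" * 2, ""]
consensus = call_consensus(columns)
print(consensus)  # ANGN
```

The second and fourth positions fall below the depth threshold, so they appear as "N" in the output, which is how low-coverage regions show up in a consensus genome.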
Pipeline for Nanopore data
The SARS-CoV-2 consensus genome pipeline for Nanopore data uses the ARTIC Network’s nCoV-2019 novel coronavirus Nanopore bioinformatics protocol. Information about this pipeline can be found here. Below we describe default parameters that were modified from the original protocol, along with validation steps.
Adjusted parameters from ARTIC Network nCoV-2019 protocol
During internal validation of the ARTIC Network’s Nanopore pipeline, we made a few modifications. These modifications to the default parameters ensured that the quality of consensus genomes generated by the pipeline matched the quality of existing genomes submitted to public data repositories. See table below for details.