- Choose “SARS-CoV-2 generate consensus genomes”. The typical IDseq pipeline to analyze metagenomic data will not be run. Only the consensus genome pipeline for SARS-CoV-2 will be run.
- Once you have selected the consensus genome pipeline, you will be prompted to choose which wet-lab protocol you used. Our assembly pipeline supports the MSSPE and ARTIC v3 wet-lab protocols. Selecting the correct protocol is critical because it determines which primers are removed during analysis.
- Primer file used for MSSPE: https://idseq-database.s3-us-west-2.amazonaws.com/consensus-genome/msspe_primers.bed
- Primer file used for ARTIC v3: https://idseq-database.s3-us-west-2.amazonaws.com/consensus-genome/artic_v3_primers.bed
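To make the link between protocol choice and primer removal concrete, here is a minimal Python sketch of how a primer BED file could be read. It is not the IDseq implementation. The dictionary keys, the `Primer` type, and the `parse_primer_bed` function are hypothetical names, and the two example rows are illustrative rather than copied from the real primer files. It assumes only the standard BED layout: tab-separated chromosome, 0-based start, end-exclusive end, and an optional name.

```python
from typing import List, NamedTuple

# URLs are the ones listed above; the keys are hypothetical labels.
PRIMER_BED_URLS = {
    "MSSPE": "https://idseq-database.s3-us-west-2.amazonaws.com/consensus-genome/msspe_primers.bed",
    "ARTIC v3": "https://idseq-database.s3-us-west-2.amazonaws.com/consensus-genome/artic_v3_primers.bed",
}


class Primer(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)
    name: str


def parse_primer_bed(text: str) -> List[Primer]:
    """Parse BED-formatted text into a list of primer intervals."""
    primers = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"Expected at least 3 BED columns, got: {line!r}")
        name = fields[3] if len(fields) > 3 else ""
        primers.append(Primer(fields[0], int(fields[1]), int(fields[2]), name))
    return primers


# Illustrative rows only; real rows come from the files linked above.
example_bed = (
    "MN908947.3\t30\t54\texample_LEFT\n"
    "MN908947.3\t385\t410\texample_RIGHT\n"
)
primers = parse_primer_bed(example_bed)
for p in primers:
    print(p.name, p.start, p.end, p.end - p.start)
```

Because the intervals differ between protocols, selecting the wrong protocol would cause the wrong regions to be clipped from the aligned reads.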
How the consensus genome is assembled in IDseq
The CZ Biohub created the pipeline. Check out their Nextflow pipeline here.
The consensus genome pipeline is different from the mNGS pipeline in that it focuses solely on SARS-CoV-2. The host reads are still filtered out, but the reads are aligned to the 29,903 bp reference genome MN908947.3 instead of all taxa in the NCBI index database. This reference genome was chosen because it is the current community standard for downstream phylogenetics. When the consensus genome pipeline kicks off, the following steps are run:
- Host reads are filtered out by aligning the reads to the HG38 human reference genome.
- If the sample was spiked with ERCCs, those will be quantified.
- Reads are aligned to the reference genome for SARS-CoV-2, MN908947.3, using minimap2.
- Reads that aligned to the reference genome are then classified using Kraken2 against a database of SARS-CoV-2 sequences (source: https://genexa.ch/sars2-bioinformatics-resources/). This removes any homologous aligned reads that are not SARS-CoV-2.
- Now that the majority of the contaminating reads have been removed, the remaining reads are trimmed using Trim Galore. This step removes adapter sequences, low-quality reads (defined here as a Phred score <20), and sequences shorter than 20 bp.
- The trimmed reads are then aligned again to the SARS-CoV-2 reference genome, MN908947.3, using minimap2.
- Primers are trimmed using iVar. From the iVar docs: “iVar uses primer positions supplied in a BED file (chosen during upload) to soft clip primer sequences from an aligned and sorted BAM file.”
- The consensus genome is called using iVar consensus. A base is called as long as it has a depth of 10 or more reads. If a base cannot be called, it is identified as N.
- Variants (SNPs) are called to evaluate sequence differences as compared to the reference genome. This is done using samtools and bcftools.
- Additional quality metrics are computed using QUAST and other Python scripts.
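The depth rule in the consensus-calling step can be illustrated with a short Python sketch. This is a simplified model and not how iVar works internally: it assumes a plain majority vote over the bases observed at each position, whereas iVar also takes base quality and a frequency threshold into account. The function name `call_consensus` and the toy pileup are hypothetical; only the minimum depth of 10 comes from the description above.

```python
from collections import Counter
from typing import List

MIN_DEPTH = 10  # minimum read depth required to call a base, as described above


def call_consensus(pileup: List[str], min_depth: int = MIN_DEPTH) -> str:
    """Return a consensus string from per-position observed bases.

    Each element of `pileup` holds the bases that aligned reads show at one
    reference position. Positions with fewer than `min_depth` bases become 'N'.
    Simplification: the most frequent base wins (plain majority vote).
    """
    consensus = []
    for bases in pileup:
        if len(bases) < min_depth:
            consensus.append("N")
        else:
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Toy pileup: depths of 12, 9, and 11 reads at three positions.
toy_pileup = ["A" * 12, "C" * 9, "G" * 8 + "T" * 3]
result = call_consensus(toy_pileup)
print(result)  # ANG
```

The second position has only 9 reads, so it is reported as N even though every read agrees. This is why low-coverage regions appear as runs of N in the final consensus genome.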