Upload
- Click "Upload" in the upper right corner of your screen.
- Choose “SARS-CoV-2 generate consensus genomes”. The typical CZ ID pipeline to analyze metagenomic data will not be run. Only the consensus genome pipeline for SARS-CoV-2 will be run.
- Once you have selected the consensus genome pipeline, you will be prompted to choose the sequencing platform you used. Our assembly supports Illumina and Nanopore platforms.
- Choose the wet-lab protocol that was used to amplify SARS-CoV-2. Picking the correct wet-lab protocol is a critical step because the protocol chosen dictates the primers that will be removed during analysis.
- If you are using Nanopore, choose the correct Medaka model for the best results.
- Medaka models are named to indicate i) the pore type, ii) the sequencing device (MinION or PromethION), iii) the base caller variant, and iv) the base caller version.
- There is a default consensus genome medaka model, r941_min_high_g360, which can be used if you are unsure of which medaka model to select.
- Where a version of Guppy has been used without an exactly corresponding medaka model, the medaka model with the highest version equal to or less than the Guppy version should be selected.
- Use the below flow chart to choose the correct Medaka model.
If you're having trouble viewing the above diagram, click here.
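The selection rule above (pick the model with the highest version equal to or less than your Guppy version) can be sketched in Python. This is only an illustration, not part of CZ ID: the list `EXAMPLE_MODELS`, the helper names, and the way the trailing digits are split into a version (for example, `g360` read as 3.6.0) are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional, Sequence, Tuple


class MedakaModel(NamedTuple):
    name: str
    pore: str        # e.g. "r941"
    device: str      # "min" (MinION) or "prom" (PromethION)
    variant: str     # base caller variant, e.g. "high"
    version: Tuple[int, int, int]  # base caller version


# Assumed naming pattern: <pore>_<device>_<variant>_g<version digits>
MODEL_RE = re.compile(r"^(r\d+)_(min|prom)_([a-z]+)_g(\d{3,})$")

# Illustrative list only; check the models actually offered at upload.
EXAMPLE_MODELS = [
    "r941_min_high_g303",
    "r941_min_high_g330",
    "r941_min_high_g344",
    "r941_min_high_g351",
    "r941_min_high_g360",
    "r941_prom_high_g360",
]


def parse_model(name: str) -> MedakaModel:
    """Split a model name into pore, device, variant, and version."""
    match = MODEL_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized model name: {name}")
    pore, device, variant, digits = match.groups()
    # Assumption: first digit is major, second is minor, the rest is patch.
    version = (int(digits[0]), int(digits[1]), int(digits[2:]))
    return MedakaModel(name, pore, device, variant, version)


def select_model(guppy_version: str, pore: str, device: str, variant: str,
                 available: Sequence[str]) -> Optional[str]:
    """Return the matching model with the highest version <= guppy_version."""
    target = tuple(int(part) for part in guppy_version.split("."))
    candidates = [
        model for model in map(parse_model, available)
        if (model.pore, model.device, model.variant) == (pore, device, variant)
        and model.version <= target
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda model: model.version).name


if __name__ == "__main__":
    # Guppy 3.5.2 has no exact match in the example list.
    print(select_model("3.5.2", "r941", "min", "high", EXAMPLE_MODELS))
```

If no model satisfies the rule, the function returns `None`; in that case the default model, r941_min_high_g360, is the fallback described above.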
Pipeline Overview Illumina
How the consensus genome is assembled in CZ ID
The CZ Biohub created the pipeline. Check out their Nextflow pipeline here.
The consensus genome pipeline is different from the mNGS pipeline in that it focuses solely on SARS-CoV-2. The host reads are still filtered out, but the reads are aligned to the 29,903 bp reference genome MN908947.3 instead of all taxa in the NCBI index database. This reference genome was chosen because it is the current community standard for downstream phylogenetics. When the consensus genome pipeline kicks off, the following steps are run:
- Host reads are filtered out by aligning the reads to the HG38 human reference genome.
- If the sample was spiked with ERCCs those will be quantified.
- Reads are aligned to the reference genome for SARS-CoV-2, MN908947.3, using minimap2.
- Reads that aligned to the reference genome are then classified using Kraken2 against a database containing SARS-CoV-2 sequences taken from here: [https://genexa.ch/sars2-bioinformatics-resources/]. This is to remove any homologous aligned reads that are not SARS-CoV-2.
- Now that the majority of the contaminating reads have been removed, the remaining reads are trimmed using Trim Galore. This step removes adapter sequences, low-quality reads (defined here as a Phred score <20), and sequences shorter than 20 bp.
- The trimmed reads are then aligned again to the SARS-CoV-2 reference genome, MN908947.3, using minimap2.
- Primers are trimmed using iVar. From iVar docs: “iVar uses primer positions supplied in a BED file (chosen during upload) to soft clip primer sequences from an aligned and sorted BAM file.”
- The consensus genome is called using iVar consensus. A base is called as long as it has a depth of 10 or more reads. If a base cannot be called it is identified as N.
- Variants (SNPs) are called to evaluate sequence differences as compared to the reference genome. This is done using samtools and bcftools.
- Additional quality metrics are computed using QUAST and other python scripts.
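To make the depth rule in the consensus-calling step concrete, here is a minimal Python sketch. It is not iVar's implementation: the input format (one string of observed bases per reference position) and the simple majority vote are assumptions for illustration, and iVar applies its own quality and frequency thresholds. Only the depth cutoff of 10 and the use of N for uncalled bases come from the steps above.

```python
from collections import Counter
from typing import Sequence

MIN_DEPTH = 10  # from the text: a base is called at a depth of 10 or more reads


def call_consensus(pileup: Sequence[str], min_depth: int = MIN_DEPTH) -> str:
    """Call one base per position; emit N where depth is below min_depth.

    pileup: one string per reference position holding the bases observed
    in the reads covering that position (a hypothetical, simplified input).
    """
    consensus = []
    for bases in pileup:
        if len(bases) < min_depth:
            # Not enough reads to call this position.
            consensus.append("N")
        else:
            # Assumption: take the most frequent base at this position.
            base, _count = Counter(bases).most_common(1)[0]
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    example = ["A" * 12, "C" * 9 + "T" * 3, "G" * 4]
    print(call_consensus(example))
```

In the example, the third position is covered by only four reads, so it is reported as N rather than G.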
Pipeline Overview Nanopore
Overview
We are using the ARTIC Network’s nCoV-2019 novel coronavirus Nanopore bioinformatics protocol for building consensus genomes. Information about this pipeline can be found here. Default parameters that were modified can be found below, along with validation steps. Details about analysis and QC can be found in our Help Center.
Parameters Updated
During internal validation of the ARTIC Network’s Nanopore pipeline, we made a few modifications to their default parameters to ensure the quality of consensus genomes generated by the pipeline matched the quality of existing genomes submitted to public data repositories.
| Parameter | Update | Reasoning |
| --- | --- | --- |
| Normalise - used to normalise the coverage to save pipeline run time | 1,000 | Resulting genomes are higher quality with updated parameter |
| Medaka model - used to call the consensus genome and SNPs | r941_min_high_g360 | Produced the best result across a variety of samples. This value matches the default model implemented by the medaka library itself. |
| Min_length parameter - used to set the minimum length of the reads that can be used in the assembly | 350 | The default value results in most of the tested ClearLabs data being filtered out. The min_length value has been reduced to match the ClearLabs value of 350. |
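As a rough illustration of how these values might be grouped and how the min_length setting affects read filtering, consider the Python sketch below. The `NanoporeParams` class and `filter_by_min_length` helper are hypothetical names invented for this example and are not part of the ARTIC pipeline or CZ ID; only the three values come from the table. Coverage normalisation itself is not implemented here.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class NanoporeParams:
    """Hypothetical container for the updated values listed in the table."""
    normalise: int = 1000                      # coverage normalisation target
    medaka_model: str = "r941_min_high_g360"   # default Medaka model
    min_length: int = 350                      # minimum read length kept


def filter_by_min_length(reads: Sequence[str], params: NanoporeParams) -> List[str]:
    """Keep only reads at least params.min_length bases long."""
    return [read for read in reads if len(read) >= params.min_length]


if __name__ == "__main__":
    params = NanoporeParams()
    # Toy reads of 300, 350, and 420 bases.
    reads = ["A" * 300, "C" * 350, "G" * 420]
    kept = filter_by_min_length(reads, params)
    print([len(read) for read in kept])
```

With a minimum length of 350, the 350-base read is retained, which is the situation the table describes for the shorter ClearLabs reads.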