Overview
You can download viral consensus genome data, including consensus genome sequences (FASTA format) and intermediate files produced throughout the pipeline. This guide outlines the steps for downloading that data.
After reading this guide, you will:
- Learn different options for downloading consensus genome data
- Become familiar with available intermediate files
Download Data for a Single Consensus Genome
You can download data for a single consensus genome from the Sample Report page. Here you can download the consensus genome sequence and generated intermediate files in a single folder.
To download a folder with consensus genome data:
1. Navigate to the Consensus Genome tab for the sample.
2. If you assembled multiple genomes using reads from the same sample, you can change the displayed consensus genome. To change the consensus genome, select the genome of interest from the dropdown menu.
3. To download all the data associated with the displayed consensus genome, click the "Download All" button on the right-hand side of the page.
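Once the folder is on your device, you may want to confirm which files it contains. The following is a minimal sketch, not part of the platform: it assumes the folder has been extracted to a local directory and that file names end with the names listed in the intermediate files table of this guide (so a sample-name prefix is tolerated). The function name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the intermediate files table in this guide.
EXPECTED_FILES = [
    "consensus.fa",
    "depths.png",
    "report.tsv",
    "report.txt",
    "aligned_reads.bam",
    "primertrimmed.bam",
    "primertrimmed.bam.bai",
    "sample.muscle.out.fasta",
    "ercc_stats.txt",
    "no_host_1.fq.gz",
    "no_host_2.fq.gz",
    "samtools_depth.txt",
    "stats.json",
    "variants.vcf.gz",
]


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected names that no file in `folder` ends with.

    Assumption: downloaded files may carry a prefix (for example a sample
    name), so names are matched by suffix rather than exactly.
    """
    present = [p.name for p in Path(folder).iterdir() if p.is_file()]
    return [
        name
        for name in expected
        if not any(found.endswith(name) for found in present)
    ]
```

An empty list means every file named in the table was found. A non-empty list is not necessarily an error; for example, a single-end sample may have only one non-host reads file.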
Download Data for One or Multiple Consensus Genomes
You can download data for one or more consensus genomes at the same time (bulk download) from the Consensus Genomes tab of the project of interest. From this tab you can download the consensus genome sequence, assembly metrics, sample metadata, and intermediate files.
To download consensus genome files of interest:
- Navigate to the Consensus Genomes tab found on the Project page of interest.
- Select genomes of interest.
- Click the download icon.
- Select download type: A modal will appear where you can select the download type. Select the file of interest and click the "Start Generating Download" button.
- Find Downloads page: Some files will download directly to your device. However, most downloads will be available through the Downloads page. To get to the Downloads page, open the dropdown menu by your user name and select "Downloads".
- Check download status: Once you navigate to the Downloads page, check the status of the download. Note that files available through the Downloads page are deleted 7 days after the download is created.
- Download file: When the download is "complete", click the Download File link to download to your device.
Available Intermediate Files
You can download intermediate files, including the mapping information contained in BAM files, to troubleshoot genome quality issues and to submit data to public repositories. The table below describes the intermediate files available for download through the Sample or Project pages for consensus genomes.
Filename | Description | Use |
consensus.fa | Consensus genome sequence (FASTA format) | Assembled consensus genome that can be used for downstream analyses (e.g., phylogenetic tree builds) |
depths.png | Image of coverage plot | Visualize genome coverage |
report.tsv/report.txt | QUAST report in TSV and TXT format. | Evaluate assembly metrics |
aligned_reads.bam | Initial reads that aligned to the reference genome | Can be used in a genome browser to view read-level alignments to the reference sequence and evaluate SNPs, ambiguous bases, etc. |
primertrimmed.bam | Aligned reads after soft-clipping primer sequences. Note: The consensus genome pipeline available through the mNGS Sample Report does not include a primer trimming step. Therefore, the “aligned_reads.bam” and “primertrimmed.bam” files are the same for genomes generated through the mNGS Sample Report. | Can be used in a genome browser to view read-level alignments to the reference sequence and evaluate SNPs, ambiguous bases, etc. |
primertrimmed.bam.bai | Companion index file for primertrimmed.bam (same as aligned_reads.bam file) | Used with primertrimmed.bam file to view read alignments in genome browser |
sample.muscle.out.fasta | MUSCLE pairwise alignment between reference and consensus genome sequences in FASTA format. | Can be used to inspect alignment between reference and assembled consensus genomes |
ercc_stats.txt | ERCC spike-in stats | Used for evaluating ERCC spike-in controls. Note that ERCC stats are also computed through the mNGS pipeline and are available in the sample details panel. The metrics may differ slightly due to different calculation methods. |
no_host_1.fq.gz and no_host_2.fq.gz | Reads after subtracting host/human sequences (referred to as “non-host” reads) | Non-host reads can be uploaded to the Sequence Read Archive (SRA) |
samtools_depth.txt | Text file summarizing read depth at each position of the reference sequence | Can be used to plot coverage |
stats.json | Text file summarizing assembly metrics | Secondary quality control check for coverage |
variants.vcf.gz | Single nucleotide polymorphism (SNP) data in variant call format (VCF) | Can be used to view variants and identify SNP locations. The file can be opened in the Integrative Genomics Viewer (IGV) to determine the number of variants within a host. |
Note: BAM and VCF files can be viewed using the freely available Integrative Genomics Viewer (IGV), which includes a Web App for analyzing genomes online.
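As an illustration of using samtools_depth.txt for a coverage check, here is a minimal Python sketch. It assumes each line of the file ends with an integer depth value, which covers both a single depth column and three-column `samtools depth` output (reference, position, depth). Check your own file before relying on this. The function name `summarize_depth` and the default threshold of 10 reads are illustrative choices, not platform settings.

```python
def summarize_depth(lines, min_depth=10):
    """Summarize per-position read depth.

    Assumption: every non-empty line ends with an integer depth value,
    so single-column and three-column layouts are both accepted.
    """
    depths = [int(line.split()[-1]) for line in lines if line.strip()]
    if not depths:
        raise ValueError("no depth values found")
    # Count positions whose depth meets the (illustrative) threshold.
    covered = sum(1 for depth in depths if depth >= min_depth)
    return {
        "positions": len(depths),
        "mean_depth": sum(depths) / len(depths),
        "fraction_at_min_depth": covered / len(depths),
    }
```

The function accepts any iterable of lines, so an open file handle for the downloaded samtools_depth.txt can be passed directly.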
Note on Consensus Genome Data Submission to NCBI
Downloaded data can be submitted to NCBI’s public repositories for the benefit of the broader scientific community. Consensus genomes can be submitted to GenBank, whereas reads (non-host reads) and BAM files can be submitted to the Sequence Read Archive (SRA). Click here for an overview of how to submit sequence data to NCBI.
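Before submitting a consensus genome, it can be useful to check its length and how many ambiguous bases (N) it contains. The sketch below is a plain FASTA reader written for illustration. It is not an NCBI or platform tool, and the function name `fasta_summary` is hypothetical.

```python
def fasta_summary(lines):
    """Return (record_count, total_length, n_count) for FASTA text lines."""
    records = 0
    total = 0
    n_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: marks the start of a new sequence record.
            records += 1
        else:
            total += len(line)
            n_count += line.upper().count("N")
    return records, total, n_count
```

Passing an open file handle for consensus.fa returns the number of records, the total sequence length, and the number of N bases. How many N bases are acceptable is a decision this sketch does not make.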