Overview
You can assemble viral consensus genomes from one or more samples using CZ ID's Viral Consensus Genome pipeline. The pipeline accepts data generated through primer spiking for target enrichment (e.g., MSSPE), PCR, whole genome sequencing, or shotgun sequencing. After you upload data, consensus genomes are automatically assembled against a user-provided reference sequence. This contrasts with the consensus genome pipeline accessible through the mNGS Sample Report page, where you can only assemble genomes from metagenomic data using listed reference sequences. This guide describes how to upload short-read (Illumina) data for viral consensus genome assembly.
After reading this guide, you will be able to:
- Upload data to assemble viral consensus genomes
- View genome status
Upload Viral Consensus Genome Data
Follow the steps listed below to upload data and assemble viral genomes:
- Step 1: Go to Upload Page
- Step 2: Specify Project
- Step 3: Select Analysis Type
- Step 4: Select Sequencing Files
- Step 5: Add Metadata
- Step 6: Review
- Step 7: View Genome Status and Report
Step 1: Go to Upload Page
Log in to CZ ID using your email and password. Once logged in, you will see your name in the upper right-hand corner of the application. You will see a link to the Upload page next to your username.
- Upload Page Link: Click this link to open the Upload page.
- Upload Steps: The upload is divided into three stages: uploading samples ("Samples"), adding metadata ("Metadata"), and reviewing the information ("Review"). The current stage is highlighted in blue.
Step 2: Specify Project
Select or create a project within the Select Samples page. The project selection will affect the pipeline version used to run the samples given that the pipeline version for all analysis types is assigned upon project creation (see Pipeline Version for details). Therefore, all samples within a project will run on the same major pipeline version.
- Create New Project: Use this link to create a new project. A dialog box will appear to enter the new project information.
- Project Dropdown Menu: Use the dropdown menu to upload samples to an existing project.
When creating a new project, you will need to add a project name, select if the project will be public within CZ ID or private, and provide a project description. Click the Create Project button to finish creating the new project.
Step 3: Select Analysis Type
Under “Analysis Type”, select Viral Consensus Genome. You will be prompted to provide a taxon name and reference sequence file. You can opt to add a primer BED file to trim primers from reads during consensus genome assembly.
- Pipeline Version: Specifies the pipeline version that will be used to run samples.
- Taxon Name Box: Specify the taxon name for the viral consensus genome you are trying to assemble using the search box dropdown menu. If the taxon name is unknown or you can't find it in the suggested list of taxa, select "Unknown" (see Taxon Name).
- Reference Sequence Link: Use this link to upload the reference sequence you would like to use to assemble the viral consensus genome by mapping reads against it. The reference sequence should be submitted in FASTA format (see Reference Sequence).
- Primer File Link (optional): Use this link to upload a primer BED file (see Trim Primers). Click here to learn how to prepare BED files.
You can choose to run metagenomic analysis ("Metagenomics") in parallel to viral genome assembly. To do this, select Metagenomics and Viral Consensus Genome under the Analysis Type section. Note that the Viral Consensus Genome pipeline is only available for short-read data (Illumina). If you select the Metagenomics option, you will see the Pipeline Version and NCBI Index Date that will be used to run the metagenomic analysis.
- Sequencing Platform Options: When running mNGS analysis in addition to Viral Consensus Genomes, you can only select the short-read (Illumina) platform.
- Pipeline Version: Specifies the mNGS Illumina pipeline version that will be used to run samples.
- NCBI Index Date: Specifies the date nucleotide (NT) and protein (NR) databases used to analyze samples were downloaded from NCBI.
Taxon name
Please provide the taxon name for the viral consensus genome you are trying to assemble. This information will help you search for genomes of interest within your account more effectively. Click the Taxon Name box, type the taxon name within the search box, and select the name from the list of suggested taxa. If the taxon name is unknown or you can't find it in the suggested list of taxa, select "Unknown".
Reference Sequence
Upload the reference sequence you would like to use to assemble the viral consensus genome by mapping reads against it. The reference sequence should be submitted in FASTA format (see note below regarding the reference file name). You should select the reference sequence that is most similar to the genome sequence you are trying to assemble (e.g., a sequence used to design primers for the target enrichment or PCR-based assays used to obtain data). However, if you are not sure which reference sequence to use, you can do the following:
Option 1: Use the standard reference used by the research community for the virus of interest. This will help keep variant or mutation calls consistent with other groups and facilitate downstream phylogenetic analyses.
Option 2: Use the mNGS pipeline to find the reference sequence most similar to the sequences of interest in your data. This will be especially helpful if you are working with a virus species for which there are no standardized assays and the primers were designed to anneal to various species or variants and you are not sure which one is represented in your data.
Notes regarding reference sequence file:
- Reference filename: The reference FASTA filename cannot contain any spaces. It can only include letters, numbers, dashes, parentheses, and underscores.
- Reference sequence: If "U" is present in RNA reference sequences, it needs to be replaced with "T" before upload. The reference sequence can only include the IUPAC bases (A, T, G, C) and the ambiguity codes ("N", "Y", "R", "S", "W", "K", "M", "H", "B", "V", "D").
- You can only upload one reference sequence. If you are working with a segmented virus, you will have to generate consensus genomes for each segment independently. Alternatively, you can generate an artificial reference sequence by concatenating all genome segments into a single sequence (we suggest distinguishing segment ends with stretches of Ns).
- If you are assembling multiple viral consensus genomes, you should use the same reference for all the genomes you are planning to compare through alignments and downstream phylogenetic analysis.
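The notes above can be checked before upload with a short script. The following is a minimal sketch, not part of CZ ID: the function names are hypothetical, the filename check allows a dot so that an extension such as ".fasta" passes, sequences are uppercased before checking, and the 100-N spacer length is an arbitrary choice for "stretches of Ns".

```python
import re

# Characters permitted in the sequence per the notes above:
# A, T, G, C plus the listed ambiguity codes.
ALLOWED_BASES = set("ATGCNYRSWKMHBVD")

# Letters, numbers, dashes, parentheses, and underscores. The dot is an
# assumption so that a file extension is accepted.
FILENAME_PATTERN = re.compile(r"^[A-Za-z0-9_()\-.]+$")


def filename_is_valid(filename: str) -> bool:
    """Return True if the filename uses only the permitted characters."""
    return bool(FILENAME_PATTERN.match(filename))


def clean_reference(fasta_text: str) -> str:
    """Replace U with T in sequence lines and reject unexpected characters."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            cleaned.append(line)  # header lines are kept unchanged
            continue
        # Uppercasing is an assumption made for this sketch.
        seq = line.strip().upper().replace("U", "T")
        unexpected = set(seq) - ALLOWED_BASES
        if unexpected:
            raise ValueError(f"Unexpected characters: {sorted(unexpected)}")
        cleaned.append(seq)
    return "\n".join(cleaned) + "\n"


def concatenate_segments(segments: list[str], spacer_length: int = 100) -> str:
    """Join segment sequences, marking each segment end with a run of Ns."""
    return ("N" * spacer_length).join(segments)
```

For example, clean_reference(">seg1\nAUGCN\n") returns the same record with the sequence rewritten as ATGCN, and filename_is_valid("my ref.fasta") returns False because of the space.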
Trim Primers
You should submit a BED file specifying primer positions when uploading viral sequence data obtained through primer spiking for target enrichment (e.g., MSSPE) or PCR-based assays. Click here to learn about BED files and how to prepare them for upload.
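As a rough illustration of what a primer BED file contains, the sketch below writes the first four standard BED columns (sequence name, 0-based start, exclusive end, primer name) as tab-separated text. The primer names and coordinates are hypothetical, and the sketch assumes the first column should match the name of your reference sequence. The linked BED guide remains the authority on the exact columns CZ ID expects.

```python
import csv
import io

# Hypothetical primers: (name, 1-based start, 1-based inclusive end).
primers = [
    ("amplicon1_LEFT", 31, 54),
    ("amplicon1_RIGHT", 386, 410),
]


def primers_to_bed(reference_name: str, primers: list[tuple[str, int, int]]) -> str:
    """Render primers as BED text (0-based start, end-exclusive)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for name, start_1based, end_1based in primers:
        # BED start is 0-based; BED end is exclusive, so the 1-based
        # inclusive end carries over unchanged.
        writer.writerow([reference_name, start_1based - 1, end_1based, name])
    return out.getvalue()


# "my_reference" is a placeholder for the reference sequence name.
bed_text = primers_to_bed("my_reference", primers)
```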
Pipeline Version
The pipeline version that will be used to run uploaded samples can be seen once you select a project and analysis type (i.e. Viral Consensus Genome). CZ ID uses a three-level pipeline versioning system where the first number indicates the major pipeline version followed by numbers that specify minor version and patch updates. For example, pipeline v1.2.15 refers to major pipeline version 1, minor pipeline version 2, and patch version 15.
The project’s pipeline version will be automatically assigned upon project creation based on the latest version available for each analysis type. This pipeline version pinning by project helps to ensure that all sample runs within a project are comparable. For example, if your project is pinned to Viral Consensus Genome pipeline v3.4.18, all new samples uploaded to that project will run on major pipeline v3. This system enables minor pipeline updates to be associated with the same major version while still allowing your results to be comparable.
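The versioning scheme can be illustrated with a few lines of Python. This is only a sketch of the major.minor.patch convention described above, with hypothetical function names. It does not reflect how CZ ID implements version pinning internally.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a version such as 'v1.2.15' into (major, minor, patch)."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def same_major(version_a: str, version_b: str) -> bool:
    """Return True if both versions share the same major version."""
    return parse_version(version_a)[0] == parse_version(version_b)[0]
```

Under this convention, parse_version("v1.2.15") gives (1, 2, 15), and two runs on v3.4.18 and v3.5.0 share major version 3.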
You will see a Warning Icon if there is a new major pipeline version available. To use this new pipeline, you must create a new project.
*Note: Projects created before February 08, 2024 may include multiple major pipeline versions.
NCBI Index Date
If you select Metagenomics - Illumina in addition to Viral Consensus Genome analysis, you will see the NCBI Index Date that will be used to process samples uploaded to the mNGS pipeline. The NCBI Index Date indicates the date the NCBI NT and NR databases were downloaded for use by CZ ID. This index date can be used to find associated GenBank release numbers (see GenBank release notes). The downloaded databases are then compressed and indexed by CZ ID. Newer versions of the index will have the most up-to-date taxon information from NCBI.
The project’s NCBI Index Date will be automatically assigned upon project creation based on the latest version available. Each project is pinned to one NCBI Index Date to ensure that all sample runs within a project are comparable*. For example, if your project is pinned to NCBI Index Date 2021-01-22, all new samples uploaded to that project will run on Index 2021-01-22.
You will see a Warning Icon if there is a new NCBI Index available. To use this new Index, you must create a new project.
*Note: Projects created before February 08, 2024 may include multiple NCBI Index Dates.
Step 4: Select Sequencing Files
After specifying the analysis type, scroll down to "Select Files" to upload FASTQ (“.fastq” or “.fq”) or compressed FASTQ (“.fastq.gz” or “.fq.gz”) files directly from your computer or BaseSpace account. Click here if you have FASTA files.
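If you are preparing many files, a quick filename check can catch unsupported extensions before you start. The sketch below is a hypothetical helper that tests only the four extensions listed above. It is case-sensitive as written and does not inspect file contents.

```python
# File extensions listed in this step.
ACCEPTED_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def is_accepted_fastq(filename: str) -> bool:
    """Return True if the filename ends with one of the accepted extensions."""
    return filename.endswith(ACCEPTED_SUFFIXES)
```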
Upload Files from Your Computer
Select the Your Computer tab to upload files directly from your computer.
- Your Computer Tab: Use this tab (default) to select sequencing files stored on your computer.
- Upload Box: Drag and drop files into the provided box or click the link to use your file browser.
- Sample List: Sequencing files ready for upload will be listed here. Sample names will be based on the sequence filenames (see file requirements).
- Continue Button: After selecting files, use this button to continue to the Upload Metadata section.
If you have sequencing files split over multiple lanes per sample, CZ ID will automatically detect files representing the same sample based on Illumina's naming convention and concatenate these files for you. For example, if you upload one paired-end sample split over three lanes, that sample will have six files. In the screenshot below you can see that CZ ID automatically detects that each file is part of the same sample.
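To preview how files might be grouped, the sketch below matches filenames against the usual Illumina pattern (SampleName_S1_L001_R1_001.fastq.gz) and collects them by sample and read direction. The regular expression and function name are assumptions made for illustration. CZ ID's own detection logic may differ.

```python
import re
from collections import defaultdict

# Assumed Illumina-style name: <sample>_S<n>_L<lane>_<R1|R2>_001.<ext>
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001"
    r"\.(?:fastq|fq)(?:\.gz)?$"
)


def group_by_sample(filenames: list[str]) -> dict[str, dict[str, list[str]]]:
    """Group lane-split files by sample name and read direction."""
    groups: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"R1": [], "R2": []}
    )
    # Sorting keeps lanes in order (L001, L002, ...) within each read list.
    for name in sorted(filenames):
        match = ILLUMINA_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        groups[match["sample"]][match["read"]].append(name)
    return dict(groups)
```

With a paired-end sample split over three lanes, the six files end up under a single sample key with three R1 files and three R2 files.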
Upload Files from BaseSpace
If your Illumina sequencing files are hosted on BaseSpace, you can pull them directly into CZ ID. Select the BaseSpace tab under the Select Files section to access your files. Click the Connect to BaseSpace button to launch the BaseSpace login page.
Use your credentials to log in to BaseSpace and select files for upload.
Once you have selected and reviewed the files you want to process, click the Continue button at the bottom of the screen to continue to the next step (Upload Metadata).
Step 5: Add Metadata
Add the appropriate sample metadata through the Upload Metadata page. There are six required metadata fields: Host Organism, Sample Type, Water Control, Nucleotide Type, Collection Date, and Collection Location (see Adding Metadata for details). You can enter metadata manually or upload a metadata file in comma-delimited format (".csv" file extension).
Manual Metadata Entry
Use the “Manual Input” tab (default). Fill in metadata information using the provided fields directly through the web interface. After entering information for all the required fields, click the Continue button to go to the Review section.
- Manual Input Tab: Use this tab (default) when entering metadata directly through the web interface.
- Metadata Table: Enter information for each column or field. By default, the required fields will be listed on the table. You can add additional columns through the Metadata Dropdown Menu.
- Metadata Dropdown Menu: Click the plus sign to view and add optional metadata fields to the Metadata Table.
Upload Metadata File
Use the "CSV Upload" tab to upload a comma-separated value (CSV) file with metadata. If there are no errors, click the Continue button to go to the Review section.
- CSV Upload Tab: Select this tab to upload a metadata file.
- Metadata Template: Click the template link to download a CSV file pre-populated with sample names and metadata fields. Edit the file to include the appropriate metadata and save it. You are not required to use the provided template.
- Metadata File Upload Box: Use this box to upload the metadata file.
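If you prefer to build the metadata file with a script, the sketch below writes a CSV containing the six required fields. The "Sample Name" column, the exact header spellings, and the example values are assumptions made for illustration. The downloadable template shows the headers and value formats that CZ ID actually expects.

```python
import csv
import io

# The six required fields named in this guide.
REQUIRED_FIELDS = [
    "Host Organism",
    "Sample Type",
    "Water Control",
    "Nucleotide Type",
    "Collection Date",
    "Collection Location",
]


def build_metadata_csv(rows: list[dict[str, str]]) -> str:
    """Return CSV text, raising ValueError if any required value is missing."""
    # "Sample Name" as the first column is an assumption.
    header = ["Sample Name"] + REQUIRED_FIELDS
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [field for field in header if not row.get(field)]
        if missing:
            raise ValueError(f"Missing metadata fields: {missing}")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical example row; the values are placeholders.
example_rows = [
    {
        "Sample Name": "sampleA",
        "Host Organism": "Human",
        "Sample Type": "Nasopharyngeal Swab",
        "Water Control": "No",
        "Nucleotide Type": "RNA",
        "Collection Date": "2023-05",
        "Collection Location": "California, USA",
    }
]
metadata_csv = build_metadata_csv(example_rows)
```

Raising an error on missing values mirrors the requirement that all six fields be filled in before you can continue to the Review section.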
Step 6: Review
Use the Review page to review the project, sample, and analysis information. The "Edit" links by each section can be used to edit project and sample information if you need to correct anything before upload. After reviewing sample and metadata information, please accept CZ ID's Terms of Service and Data Privacy Policy. Press Start Upload to begin the upload process to our server and kick off the analysis pipeline.
After pressing Start Upload, you will see a modal showing the upload progress. DO NOT close the web page while the upload is in progress. Otherwise, the upload will be canceled and you will have to start your upload over.
Wait until you see the "Uploads completed!" message confirming that your samples have been uploaded successfully. Once you see the confirmation message, you can close your window or press "Go to Project" to navigate to the Project page, where you can view sample status and analyze results.
Step 7: View Genome Status and Report
You can see the status of your run by going to the Project Page of interest and selecting the Consensus Genome tab.
- Project Name
- Consensus Genome Tab
- Sample Status: Specifies sample progress. When the run is successfully completed, you will see a "Complete" status highlighted in green.
After the sample run has completed, click on the sample to view the genome report (see example report below). Assess the quality of the genome and/or download data.