Overview
Ready to upload metagenomic samples to CZ ID and explore results? Here we describe how to upload Illumina sequencing files and sample data (metadata) to your CZ ID account. Click here if you are working with Nanopore sequencing data.
Uploading Data
To upload new samples to CZ ID, log in to the application using your email and password. Once logged in, you will see your name in the upper right-hand corner of the application. Next to your name, you will see a link to the Upload page.
- Upload Page Link: Click this link to open the Upload page
- Upload Steps: Upload is divided into three general stages to upload samples ("Samples"), add metadata ("Metadata"), and review the information ("Review"). The current step will be highlighted in blue.
The Upload page provides settings and options for uploading data. In the "Samples" step, you select a project, an analysis type, and the sample files to upload.
Project Selection
Samples uploaded to CZ ID are organized into projects. Use the Select Project section to specify an existing project or create a new project for your samples. Your project selection determines the pipeline and database versions used to run the samples, because the pipeline version for each analysis type and the NCBI Index Date are assigned when the project is created (see Pipeline Version and NCBI Index Date for details). Therefore, all samples within a project will run on the same major pipeline version and use the same NCBI Index Date.
- Create New Project: Use this link to create a new project. A dialog box will appear to enter the new project information.
- Project Dropdown Menu: Use the dropdown menu to upload samples to an existing project.
When creating a new project, provide a name for the project and select its privacy status using the New Project dialog box. You also have the option to add a brief project description to document the background or context of your project. After entering the new project information, click "Create Project" and proceed to select the analysis type.
Project Sharing
Below we describe privacy status for samples uploaded to private and public projects.
- Private Projects - Samples uploaded to private projects will remain private to you and your collaborators until you decide to share them with other researchers by making the project public on CZ ID. CZ ID will not automatically change your samples from private to public - the decision if and when to share samples is entirely yours.
- Public Projects - Samples uploaded to public projects are discoverable to all CZ ID users. Note: raw sample data (i.e., genetic sequence files in FASTQ or FASTA format) that have been uploaded to CZ ID are only available to the original uploader, no matter if your sample is public, private, or shared via a project. Raw data is not shared with any other CZ ID user, nor is it ever accessed by anyone working on CZ ID unless specifically requested by a user, such as to debug an issue.
Note that you can change the privacy status of your project from Private to Public at any time. However, you cannot do the opposite. If you or your collaborators make a project Public by mistake, please contact our team immediately by sending an email to help@czid.org.
Selecting Analysis Type
To select the analysis type, go to the Analysis Type section listing analysis options. Select Metagenomics from the main list and Illumina from the dropdown options. The Pipeline Version and NCBI Index Date that will be used to run the samples will be specified under the selected analysis.
- Sequencing Platform Options: You can select from short- (Illumina) or long-read (Nanopore) platforms. Select Illumina.
- Pipeline Version: Specifies the mNGS Illumina pipeline version that will be used to run samples.
- NCBI Index Date: Specifies the date on which the nucleotide (NT) and protein (NR) databases used to analyze samples were downloaded from NCBI.
Pipeline Version
The pipeline version that will be used to run uploaded samples can be seen once you select a project and an analysis type (i.e. Metagenomics - Illumina). CZ ID uses a three-level pipeline versioning system where the first number indicates the major pipeline version followed by numbers that specify minor version and patch updates. For example, pipeline v1.2.15 refers to major pipeline version 1, minor pipeline version 2, and patch version 15.
The project’s pipeline version will be automatically assigned upon project creation based on the latest version available for each analysis type. This pipeline version pinning by project helps to ensure that all sample runs within a project are comparable*. For example, if your project is pinned to mNGS Illumina pipeline v8.1.0, all new samples uploaded to that project will run on major pipeline v8. This system enables minor pipeline updates to be associated with the same major version while still allowing your results to be comparable.
You will see a warning icon if a new major pipeline version is available. To use the new pipeline version, you must create a new project.
*Note: Projects created before February 8, 2024 may include multiple major pipeline versions. Click here to learn about QC and host filtering step differences between major pipeline versions.
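The three-level versioning described above can be illustrated with a small sketch. The names `PipelineVersion`, `parse_version`, and `same_major` are hypothetical helpers written for this example and are not part of CZ ID. The sketch only shows how a version string splits into major, minor, and patch levels, and how two versions can be checked for a shared major version.

```python
from typing import NamedTuple


class PipelineVersion(NamedTuple):
    """Hypothetical container for the three version levels."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> PipelineVersion:
    """Parse a string such as 'v1.2.15' or '8.1.0' into its three levels."""
    major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
    return PipelineVersion(major, minor, patch)


def same_major(a: str, b: str) -> bool:
    """True when two versions share a major version, the level a project is pinned to."""
    return parse_version(a).major == parse_version(b).major
```

For example, `same_major("v8.1.0", "v8.3.2")` is true, which matches the idea that minor updates within major version 8 remain comparable.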
NCBI Index Date
The NCBI Index Date that will be used to process uploaded samples can be seen once you select a project and analysis type (i.e. Metagenomics - Illumina). The NCBI Index Date indicates the date the NCBI NT and NR databases were downloaded for use by CZ ID. This index date can be used to find associated GenBank release numbers (see GenBank release notes). The downloaded databases are then compressed and indexed by CZ ID. Newer versions of the index will have the most up-to-date taxon information from NCBI.
The project’s NCBI Index Date will be automatically assigned upon project creation based on the latest version available. Each project is pinned to one NCBI Index Date to ensure that all sample runs within a project are comparable*. For example, if your project is pinned to NCBI Index Date 2021-01-22, all new samples uploaded to that project will run on Index 2021-01-22.
You will see a warning icon if a new NCBI Index is available. To use the new NCBI Index, you must create a new project.
*Note: Projects created before February 8, 2024 may include multiple NCBI Index Dates.
Selecting Files
There are two ways to select the files you want to upload to CZ ID. You can upload files directly from your computer (default) or retrieve samples from your BaseSpace account. In either case, sequencing files must meet the requirements below.
File Requirements for mNGS Illumina Pipeline
- The pipeline accepts FASTQ files (".fastq", ".fq", ".fastq.gz", ".fq.gz"). Click here if you have FASTA files.
- CZ ID can process single- and paired-end read files.
- Paired files must be labeled with "_R1" or "_R2" at the end of the basename. CZ ID will automatically detect paired reads based on this naming convention.
- File names must be no longer than 120 characters and can only contain letters from the English alphabet (A-Z, upper and lower case), numbers (0-9), periods (.), hyphens (-) and underscores (_). Spaces are not allowed.
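If you want to check file names before uploading, the requirements above can be sketched as a small local script. This is a minimal illustration, not CZ ID's own validation logic. The function names `check_filename` and `read_label` are hypothetical, and the paired-read check assumes a strict reading of the rule that "_R1" or "_R2" appears at the very end of the basename.

```python
import re
from typing import List, Optional

# Extensions listed in the requirements above.
ALLOWED_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
MAX_NAME_LENGTH = 120
# Letters, digits, periods, hyphens, and underscores only (no spaces).
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9._-]+")


def check_filename(name: str) -> List[str]:
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 120 characters")
    if not ALLOWED_CHARS.fullmatch(name):
        problems.append("contains characters other than A-Z, a-z, 0-9, '.', '-', '_'")
    if not name.endswith(ALLOWED_EXTENSIONS):
        problems.append("not a FASTQ extension")
    return problems


def read_label(name: str) -> Optional[str]:
    """Return 'R1' or 'R2' if the basename ends with that label, else None."""
    base = name
    # Strip the longest matching extension first so '.fastq.gz' wins over '.fq'.
    for ext in sorted(ALLOWED_EXTENSIONS, key=len, reverse=True):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    match = re.search(r"_(R[12])$", base)
    return match.group(1) if match else None
```

A name such as `sampleA_R1.fastq.gz` passes every check and is labeled `R1`, while a name containing a space is reported as having disallowed characters.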
Upload from Your Computer
Scroll down to the "Select Files" section of the Select Samples page and select the "Your Computer" tab to upload files directly from your computer.
- Your Computer Tab: Use this tab (default) to select sequencing files stored on your computer.
- Upload Box: Drag and drop files into the provided box or click the link to use your file browser.
- Sample List: Sample sequencing files ready for upload will be listed here. Sample names will be based on the sequence filenames.
- Continue Button: After selecting files, use this button to continue to the Upload Metadata section.
If your sequencing files are split over multiple lanes per sample, CZ ID will automatically detect files representing the same sample based on Illumina's naming convention and concatenate these files for you. For example, one paired-end sample split over three lanes would have six files (three lanes, each with an R1 and an R2 file). In the screenshot below you can see that CZ ID automatically detects that each file is part of the same sample.
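To preview how lane-split files might be grouped, here is a rough sketch based on the standard Illumina FASTQ naming pattern (`<sample>_S<n>_L<lane>_<R1|R2>_001.fastq.gz`). It is not CZ ID's actual detection code. `LANE_PATTERN` and `group_by_sample` are hypothetical names, and the sample key here keeps the `_S<n>` suffix for simplicity.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Illumina-style name: <sample>_S<n>_L<lane>_<R1|R2>_001.fastq.gz
LANE_PATTERN = re.compile(
    r"^(?P<sample>.+)_L(?P<lane>\d{3})_(?P<read>R[12])_001\.(?:fastq|fq)(?:\.gz)?$"
)


def group_by_sample(filenames: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group lane-split files by sample, then by read (R1/R2), in lane order."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    # Sorting the names puts L001, L002, L003 in order within each read.
    for name in sorted(filenames):
        match = LANE_PATTERN.match(name)
        if match is None:
            continue  # not lane-split; this sketch simply skips such files
        grouped[match["sample"]][match["read"]].append(name)
    return {sample: dict(reads) for sample, reads in grouped.items()}
```

With the six files from the example above, the result has a single sample entry holding three R1 files and three R2 files, each list ordered by lane.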
Upload from BaseSpace
If your files are hosted on BaseSpace, you can pull them directly into CZ ID. Select the BaseSpace tab under the Select Files section to access your files. Click the Connect to BaseSpace button to launch the BaseSpace login page.
Use your credentials to log in to BaseSpace and select files for upload.
Once you have selected and reviewed the files you want to process, click the Continue button at the bottom of the screen to continue to the next step (Upload Metadata).
Adding Metadata
After uploading sequencing files, you have two options to upload sample metadata to CZ ID. You can add metadata to your samples by manually entering data through the interface or uploading a CSV file with metadata.
There are six required metadata fields. However, we encourage users to upload as much metadata as possible. Metadata associated with public projects helps our users compare across samples and find meaningful patterns. You can edit sample metadata to add or update information at any time.
You can learn everything you need to know about the metadata fields by looking at the metadata dictionary. Below we describe required metadata fields.
Required Metadata Fields
- Host Organism - Refers to the organism from which you collected your metagenomic sample. If the sample is a cultured isolate or does not contain any host reads, specify “ERCC only”. Note the following:
  - Your choice of host organism will determine which genome gets subtracted out during the host subtraction step in the pipeline. If your host organism maps to one of the available host genomes on CZ ID, reads aligning to that genome will be removed. The available host genomes are updated often and listed at the top of the Upload Metadata page. You will see "Host will not be subtracted" in the host organism dropdown menu if we do not have a genome for your chosen host organism.
  - Regardless of your choice of host, the pipeline will always remove ERCCs (synthetic RNA spike-ins) and reads aligning to the Human genome. If you are unsure which host to select or if your desired host is not in the available options, you can select “ERCC only” as the Host Organism, in which case no host subtraction will be performed.
- Sample Type - Type of sample, tissue, or site that most accurately describes the sample. The "Suggested" list shown when entering metadata manually is based on the Host Organism selection.
- Water Control - Indicates whether or not the sample represents a negative water control
- Nucleotide Type - DNA or RNA
- Collection Date - Date sample was originally collected. For privacy reasons, only use month or year for human data.
- Collection Location - Location where the sample was originally collected. For privacy reasons, only use country, state, or county/sub-division for human data. If you enter a more specific location (e.g. city-level), the platform will only save the location up to the county level.
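When preparing metadata from a spreadsheet or script, the privacy guidance for Collection Date can be applied before upload. The sketch below is only an illustration: the function name `collection_date_for_upload` is hypothetical, and the `YYYY-MM` and `YYYY-MM-DD` output formats are assumptions that should be checked against the metadata dictionary.

```python
from datetime import date


def collection_date_for_upload(collected: date, host_is_human: bool) -> str:
    """Format a collection date, dropping the day for human samples.

    The output formats (YYYY-MM for human data, YYYY-MM-DD otherwise)
    are assumptions made for this example.
    """
    return collected.strftime("%Y-%m") if host_is_human else collected.isoformat()
```

Dropping the day at the source keeps exact collection dates for human samples out of the metadata file entirely.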
Manual Data Entry
Enter the information in the provided metadata table. By default, only the required fields will be shown. However, you can add metadata fields by clicking the "plus" sign to the right of the table. After entering information for all the required fields, click the Continue button to go to the Review section.
- Manual Input Tab: Use this tab (default) when entering metadata directly through the web interface.
- Metadata Table: Enter information for each column or field. By default, the required fields will be listed on the table. You can add additional columns through the Metadata Dropdown Menu.
- Metadata Dropdown Menu: Click the plus sign to view and add optional metadata fields to the Metadata Table.
CSV Upload Instructions
Use the "CSV Upload" tab to upload a comma-separated value (CSV) file with metadata. If there are no errors, click the Continue button to go to the Review section.
- CSV Upload Tab: Select this tab to upload a metadata file.
- Metadata Template: Click the template link to download a CSV file that is already populated with sample names and metadata fields. Edit the file to include the appropriate metadata and save it. You are not required to use the provided template.
- Metadata File Upload Box: Use this box to upload the metadata file.
Note the following:
- Review the fields in our metadata dictionary, where you will find definitions and format requirements.
  - Take special note of the required fields, which you must provide when uploading a new sample.
  - Make sure your column headers match the required naming convention.
  - Make sure your metadata values are in the correct format.
- You can use your own CSV or copy your metadata into our CSV template.
- If you entered a Host Organism that does not match a supported host genome, the pipeline will only subtract reads aligning to ERCCs and the Human genome. You can read more about how to request a new genome to be added to CZ ID here.
- If there are errors, you will see an error message after uploading your CSV file. Please make the necessary changes in your CSV and re-upload the file.
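Before uploading a metadata CSV, a quick local check can catch missing columns or empty required values. The sketch below is not CZ ID's validator. The function name `find_csv_problems` is hypothetical, and the header spellings in `REQUIRED_FIELDS` are assumptions taken from the field names in this article, so confirm them against the metadata dictionary or the downloaded template.

```python
import csv
import io
from typing import List

# Assumed header spellings; confirm against the metadata dictionary or template.
REQUIRED_FIELDS = [
    "Host Organism",
    "Sample Type",
    "Water Control",
    "Nucleotide Type",
    "Collection Date",
    "Collection Location",
]


def find_csv_problems(csv_text: str) -> List[str]:
    """Report missing required columns and empty required values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    problems = [f"missing column: {f}" for f in REQUIRED_FIELDS if f not in headers]
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in REQUIRED_FIELDS:
            if field in headers and not (row.get(field) or "").strip():
                problems.append(f"row {row_number}: empty {field}")
    return problems
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to `find_csv_problems`; an empty list means no problems of these two kinds were found.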
Reviewing Data
After adding your metadata, you will be directed to the Review page where you can view samples and metadata ready to be uploaded. The "Edit" links by each section can be used to edit project and sample information if you need to correct anything before upload. After reviewing sample and metadata information, please accept CZ ID's Terms of Service and Data Privacy Policy. Press Start Upload to begin the upload process to our server and kick off the analysis pipeline.
Note regarding host filtering:
"Host Subtraction" information is located below the table listing samples to be uploaded. This information tells you how your selection of host organism will affect the pipeline, specifically the host subtraction step. If CZ ID has the genome of the host organism, the pipeline will subtract out reads aligning to that genome. Regardless of your choice of host, the pipeline will always remove ERCCs and reads aligning to the Human genome (reference: HG38 and T2T-CHM13 assemblies). If CZ ID does not have a genome that matches your host organism, you can request it by following the instructions in our FAQs.
After pressing Start Upload, you will see a modal showing the upload progress. DO NOT close the web page while the upload is in progress. Otherwise, the upload will be canceled and you will have to start your upload over.
Wait until you see the "Uploads completed!" message confirming that your samples have been uploaded successfully. Once you see the confirmation message, you can close your window or return to your project page by pressing "Go to Project".
The CZ ID pipeline can take a few hours to complete. You can see the status of your run by returning to the Project Page. The image below highlights features of a Project page listing the status of metagenomics samples.
- Project Name
- Metagenomics Tab
- Sample Status: Specifies sample progress. When the run is successfully completed, you will see a "Complete" status highlighted in green.
Once completed, your samples will be flagged as "Complete". You can now explore the Sample Report. If you encounter issues, please get in touch with our team by selecting "Contact Us" from the Username dropdown menu in the upper right-hand corner of your screen.