Ready to upload metagenomic samples to CZ ID and explore results? Here we describe how to upload Illumina sequencing files and sample data (metadata) to your CZ ID account. Click here if you are working with Nanopore sequencing data.
To upload new samples to CZ ID, log in to the application using your email and password. Once logged in, you will see your name in the upper right-hand corner of the application. Next to your name, you will see a link to the Upload page.
The Upload page provides different settings and options for uploading data. The upload process is divided into three general steps: Samples, Metadata, and Review. In the "Samples" step you will select a project, analysis type, and sample files to upload.
Samples uploaded to CZ ID are organized into projects. You can upload samples to an existing project or create a new project. The project selection will affect the pipeline version used to run the samples (see Pipeline Version below). When creating a new project, provide a name for the project and select its privacy status. You also have the option to add a brief project description to document the background or context of your project. Select the analysis type after specifying the project where samples will be uploaded.
Below we describe privacy status for samples uploaded to private and public projects.
- Private Projects - Samples uploaded to private projects will remain private to you and your collaborators until you decide to share them with other researchers by making the project public on CZ ID. CZ ID will not automatically change your samples from private to public - the decision if and when to share samples is entirely yours.
- Public Projects - Samples uploaded to public projects are discoverable to all CZ ID users. Note: raw sample data (i.e., genetic sequence files in FASTQ or FASTA format) that have been uploaded to CZ ID are only available to the original uploader, no matter if your sample is public, private, or shared via a project. Raw data is not shared with any other CZ ID user, nor is it ever accessed by anyone working on CZ ID unless specifically requested by a user, such as to debug an issue.
Note that you can change the privacy status of your project from Private to Public at any time. However, you cannot do the opposite. If you or your collaborators make a project Public by mistake, please contact our team immediately by sending an email to firstname.lastname@example.org.
Selecting Analysis Type
For analysis type, select Illumina under the Metagenomics dropdown options.
The pipeline version that will be used to run uploaded samples is specified within your choice of analysis type. Metagenomic samples uploaded to projects created after April 19, 2023 (mNGS pipeline v8.0 and up) will run on the latest pipeline available with updated QC and host filtering steps. Samples uploaded to projects created prior to this date will run on pipeline version 7. Learn more about QC and host filtering pipeline updates here.
Keep track of the pipeline version when comparing samples. If you are planning to compare samples, you should run them using the same pipeline version.
File Requirements for mNGS Illumina Pipeline
- The pipeline accepts FASTQ files (".fastq", ".fq", ".fastq.gz", ".fq.gz"). Click here if you need to upload FASTA files.
- CZ ID can process single- and paired-end read files.
- Paired files must be labeled with "_R1" or "_R2" at the end of the basename. CZ ID will automatically detect paired reads based on the naming convention.
- File names must be no longer than 120 characters and can only contain letters from the English alphabet (A-Z, upper and lower case), numbers (0-9), periods (.), hyphens (-) and underscores (_). Spaces are not allowed.
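The file requirements above can be checked locally before starting an upload. The following is a minimal sketch in Python, not CZ ID's own validation code: the function names `check_file_name` and `read_label` are hypothetical, and the "_R1"/"_R2" check applies the rule literally as stated above, so CZ ID's actual pairing detection may be more permissive.

```python
import re
from typing import List, Optional

# Extensions listed in the file requirements above.
ALLOWED_EXTENSIONS = (".fastq", ".fq", ".fastq.gz", ".fq.gz")
# Letters, digits, periods, hyphens, and underscores only (no spaces).
ALLOWED_CHARS = re.compile(r"[A-Za-z0-9._-]+")
MAX_NAME_LENGTH = 120


def check_file_name(name: str) -> List[str]:
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 120 characters")
    if not ALLOWED_CHARS.fullmatch(name):
        problems.append("contains a character that is not allowed")
    if not name.endswith(ALLOWED_EXTENSIONS):
        problems.append("does not have a FASTQ extension")
    return problems


def read_label(name: str) -> Optional[str]:
    """Return 'R1' or 'R2' if the basename ends with that label, else None."""
    base = name
    # Strip the longest matching extension first (".fastq.gz" before ".fastq").
    for ext in sorted(ALLOWED_EXTENSIONS, key=len, reverse=True):
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    match = re.search(r"_(R[12])$", base)
    return match.group(1) if match else None
```

For example, `check_file_name("my sample.fastq")` reports one problem (the space), while `read_label("sample1_R2.fq.gz")` returns `"R2"`.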
Upload from Your Computer
To upload files directly from your computer, you can select them through our file browser or drag and drop them into CZ ID.
CZ ID will automatically name your samples based on your file name. Once you have selected and reviewed the files you want to process, click the Continue button at the bottom of the screen to continue to the next step (Add Metadata).
If you have sequencing files split over multiple lanes per sample, CZ ID will automatically detect this based on Illumina's naming convention and concatenate these files for you. For example, the NovaSeq system provides so much sequencing data that one sample may be split over 4 lanes. Such a NovaSeq paired-end sample would produce 8 files (4 lanes × 2 reads), and CZ ID automatically detects that each file is part of the same sample.
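To preview how lane-split files would be grouped, the sketch below parses the standard Illumina FASTQ naming pattern (`SampleName_S1_L001_R1_001.fastq.gz`) and groups files by sample and read. It is an illustration under that assumption only: the helper `group_by_sample` is hypothetical, and CZ ID's own detection logic is not documented here and may differ.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Standard Illumina FASTQ name: <sample>_S<n>_L<lane>_<R1|R2>_001.fastq.gz
ILLUMINA_NAME = re.compile(
    r"(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d{3})_(?P<read>R[12])_001"
    r"\.(?:fastq|fq)(?:\.gz)?"
)


def group_by_sample(file_names: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group lane-split files as {sample: {"R1": [...], "R2": [...]}}."""
    groups = defaultdict(lambda: defaultdict(list))
    for name in file_names:
        match = ILLUMINA_NAME.fullmatch(name)
        if match is None:
            continue  # not an Illumina-style name; skipped in this sketch
        groups[match["sample"]][match["read"]].append(name)
    # Sorting by name orders the lanes (L001, L002, ...) within each read.
    return {
        sample: {read: sorted(files) for read, files in reads.items()}
        for sample, reads in groups.items()
    }


# One paired-end sample split over 4 lanes gives 8 files.
files = [
    f"Lung_S1_L00{lane}_{read}_001.fastq.gz"
    for lane in range(1, 5)
    for read in ("R1", "R2")
]
grouped = group_by_sample(files)
```

Here `grouped` has a single key, `"Lung"`, with four R1 files and four R2 files listed in lane order, which is the order in which they would be concatenated.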
Upload from BaseSpace
If your files are hosted on the BaseSpace cloud, you can pull them directly into CZ ID. Select the Upload from BaseSpace tab on the upload page to access your files. Select Connect to BaseSpace to launch the BaseSpace login page. Use your credentials to log in to BaseSpace and select files for upload.
Once you have selected and reviewed the files you want to process, click the Continue button at the bottom of the screen to continue to the next step (Add Metadata).
After uploading sequencing files, you have two options to upload sample metadata to CZ ID. You can add metadata to your samples by manually entering data through the interface or uploading a CSV file with metadata.
There are six required metadata fields. However, we encourage our users to upload as much metadata as possible. Metadata associated with public projects helps our users compare across samples and find meaningful patterns. You can edit sample metadata to add or update information at any time.
You can learn everything you need to know about the metadata fields by looking at the metadata dictionary. Below we describe required metadata fields.
- Host Organism - Refers to the organism from which you collected your metagenomic sample. If the sample is a cultured isolate or does not contain any host reads, specify “ERCC only”. Note the following:
  - Your choice of host organism will determine which genome gets subtracted out during the host subtraction step in the pipeline. If your host organism maps to one of the available host genomes on CZ ID, reads aligning to that genome will be removed. The available host genomes are updated often and listed at the top of the Upload Metadata page. You will see "Host will not be subtracted" in the host organism dropdown menu if we do not have a genome for your chosen host organism.
  - Regardless of your choice of host, the pipeline will always remove ERCCs (synthetic RNA spike-ins) and reads aligning to the Human genome. If you are unsure which host to select or if your desired host is not in the available options, you can select “ERCC only” as the Host Organism, in which case no host-specific subtraction will be performed beyond the ERCC and human read removal described above.
- Sample Type - Type of sample, tissue, or site that most accurately describes the sample.
- Water Control - Indicates whether or not the sample represents a negative water control.
- Nucleotide Type - DNA or RNA
- Collection Date - Date the sample was originally collected. For privacy reasons, only use month or year for human data.
- Collection Location - Location where the sample was originally collected. For privacy reasons, only use country, state, or county/sub-division for human data. If you enter a more specific location (e.g. city-level), the platform will only save the location up to the county level.
Manual Data Entry
Enter the information in the provided metadata table. By default, only the required fields will be shown. However, you can add metadata fields by clicking the "plus" sign to the right of the table.
Host Organism: Organism from which the sample was collected. If the sample is a cultured isolate or does not contain any host reads, select “ERCC only”.
Sample Type: Tissue or site that most accurately describes the sample. The "Suggested" list is based on the Host Organism selection.
Water Control: Whether or not the sample is a water control.
Nucleotide Type: RNA or DNA.
Collection Date: The month and year the sample was originally collected.
Collection Location: Location where the sample was originally collected. For privacy reasons, location data for human samples can only be collected at the state or county level.
CSV Upload Instructions
Use the "CSV Upload" tab to upload a comma-separated value (CSV) file with metadata.
Note the following:
- Review the fields in our metadata dictionary, where you will find definitions and format requirements.
- Take special note of the required fields, which you must provide when uploading a new sample.
- Make sure your column headers match the required naming convention.
- Make sure your metadata values are in the correct format.
- You can use your own CSV or copy your metadata into our CSV template.
- If you entered a Host Organism that does not match a supported host genome, the pipeline will only subtract reads aligning to ERCCs and the Human genome. You can read more about how to request a new genome to be added to CZ ID here.
- If there are errors, you will see an error message after uploading your CSV file. Please make the necessary changes in your CSV and re-upload the file.
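Before uploading, it can help to confirm that every sample row has a value for each of the six required fields. The sketch below is one way to do that in Python. The column headers, the "Sample Name" column, and the example values (such as the "2023-05" date format and "Yes"/"No" for Water Control) are assumptions made for illustration; check the metadata dictionary and CSV template for the exact headers and formats CZ ID expects.

```python
import csv
import io
from typing import List, Tuple

# The six required fields described above. The exact header spelling is an
# assumption; confirm it against the CZ ID metadata dictionary.
REQUIRED_FIELDS = [
    "Host Organism",
    "Sample Type",
    "Water Control",
    "Nucleotide Type",
    "Collection Date",
    "Collection Location",
]


def missing_required(csv_text: str) -> List[Tuple[int, str]]:
    """Return (row_number, field) pairs for empty or absent required values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for field in REQUIRED_FIELDS:
            if not (row.get(field) or "").strip():
                problems.append((row_number, field))
    return problems


# Hypothetical metadata for two samples; the second lacks a location.
EXAMPLE_CSV = "\n".join([
    "Sample Name,Host Organism,Sample Type,Water Control,"
    "Nucleotide Type,Collection Date,Collection Location",
    'sample1,Human,Plasma,No,RNA,2023-05,"California, USA"',
    "water1,Human,Water,Yes,RNA,2023-05,",
])

print(missing_required(EXAMPLE_CSV))
```

For the example above, the check flags row 3 as missing "Collection Location". A check like this only catches empty values; format errors are still reported by CZ ID after the CSV is uploaded.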
After adding your metadata, you will be directed to the Review page where you can view samples and metadata ready to be uploaded. If you see an issue, you can edit your projects and your samples before uploading (note "Edit" links by each review section).
Below the table listing samples to be uploaded, you will see "Host Subtraction" information. This information tells you how your selection of host organism will affect the pipeline, specifically the host subtraction step. If CZ ID has the genome of the host organism, the pipeline will subtract out reads aligning to that genome. Regardless of your choice of host, the pipeline will always remove ERCCs and reads aligning to the Human genome (hg38). If CZ ID does not have a genome that matches your host organism, you can request it by following the instructions in our FAQs.
Do not close the web page while your samples are uploading to our servers. If you do, the upload will be canceled and you will have to start over. You will see an "Uploads completed" confirmation when your samples have been uploaded successfully.
Once you see the confirmation page you can close your window or return to your project page. The CZ ID pipeline can take anywhere from 30 minutes to a couple of hours to complete. You can see the status of your run by returning to the project page.
Once completed, your samples will be flagged as "Complete" on your Project Page. You can now explore the Sample Report. If you encounter issues, please get in touch with our team by selecting "Contact Us" from the Username dropdown menu in the upper right-hand corner of your screen.