Overview
CZ ID only accommodates one sequence file per sample when uploading sequencing data to its pipelines. The platform automatically concatenates Illumina FASTQ files from multiple lanes for one sample during upload. However, if you have multiple Nanopore FASTQ files for a given sample, you may have to concatenate (combine) them into a single file before upload, depending on how the files are named. CZ ID automatically concatenates Nanopore FASTQ files when their filenames follow a specific format (see Automatic Concatenation of Nanopore Sequencing Files for details).
This guide describes different options for concatenating files. This will help you concatenate multiple Nanopore FASTQ files or combine files to suit your needs (e.g., concatenating files to increase coverage for consensus genomes) before upload. Even though the instructions focus on sequencing data (FASTQ files), the same process can be applied to concatenate files with any file extension.
After reading this guide you will:
- Learn how to concatenate files using a web-based platform.
- Learn how to concatenate files using the command line in Windows and Mac operating systems.
- Understand the filename structure used for automatic concatenation of Nanopore sequencing files.
Concatenate Files Using a Web Tool
Galaxy is a free, web-based platform for genomic analyses that offers tools for concatenating files in various formats, including raw FASTQ sequencing files. View Galaxy Tutorials to become familiar with the platform. To use the concatenation tools:
1. Go to Galaxy
2. Upload data files to be concatenated
3. Search for "concatenate", select the tool of interest, and follow the prompts for selecting data files and running the job.
Note: When concatenating the reads from a paired-end sequencing run (Illumina), make sure to concatenate R1 reads separately from R2. Doing so produces "concatenated R1" and "concatenated R2" files that you can then upload to CZ ID.
Although web tools are convenient, it might take time to upload files from your computer. You can easily concatenate files locally on your computer using the command line. Below are step-by-step instructions for how to concatenate files in Windows and Mac operating systems.
Concatenate Files Using Command Line in Windows OS
To concatenate files in Windows OS:
1. Place all of the files to be concatenated into one folder. You can concatenate files in various formats as long as they have the same file extension. Here we focus on sequencing data files, which can be uncompressed (e.g., file extension ".fq") or compressed (e.g., file extension ".fq.gz"). If you are working with paired-end reads, make one folder for the R1 reads and a separate folder for the R2 reads. This will allow you to create "concatenated R1" and "concatenated R2" files that you can then upload to CZ ID.
2. Copy the path by right-clicking on the folder and selecting "Copy as path" from the dropdown menu.
3. Open the Command Prompt by typing "command prompt" in the Windows search bar (in the Start menu or taskbar).
4. Set the directory to the folder where the sequencing files are located by using the "cd" command and pasting the path to the folder within quotation marks. Alternatively, you can drag and drop the folder into the command prompt window to automatically fill in the path (remember to add the quotation marks). For example:
cd "C:\Users\RosalindFranklin\Desktop\Sequencing"
5. Use the "type" command to concatenate all files with the file extension of interest (input). Use the asterisk symbol ( * ) to match every file with that extension and the "greater than" symbol ( > ) to specify the name of the new concatenated file (output). Note that the input files and the concatenated output file should have the same file extension. For example:
type *.fq > concatenated.fq
type *.fq.gz > concatenated.fq.gz
6. The new concatenated file (e.g., concatenated.fq or concatenated.fq.gz) should be located in the same folder as the input files.
Concatenate Files Using Command Line in Mac OS
To concatenate files in Mac OS:
1. Place all of the files to be concatenated into one folder. You can concatenate files in various formats as long as they have the same file extension. Here we focus on sequencing data files, which can be uncompressed (e.g., file extension ".fq") or compressed (e.g., file extension ".fq.gz"). If you are working with paired-end reads, make one folder for the R1 reads and a separate folder for the R2 reads. This will allow you to create "concatenated R1" and "concatenated R2" files that you can then upload to CZ ID.
2. Open the Terminal application, for example by typing "Terminal" in Spotlight Search or Finder.
3. Set the directory to the folder where the sequencing files are located by using the "cd" command followed by the path to the folder (drag and drop the folder into the terminal to automatically fill in the path). For example:
cd /Users/RosalindFranklin/Desktop/Sequencing
4. Use the "cat" command to concatenate all files with the file extension of interest (input). Use the asterisk symbol ( * ) to match every file with that extension and the "greater than" symbol ( > ) to specify the name of the new concatenated file (output). Note that the input files and the concatenated output file should have the same file extension. For example:
cat *.fq > concatenated.fq
cat *.fq.gz > concatenated.fq.gz
5. The new file (e.g., concatenated.fq or concatenated.fq.gz) should be located in the same folder as the input files.
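As a quick check on the steps above, the sketch below shows one way to concatenate R1 and R2 files separately in the Mac Terminal and confirm that no reads were lost. It builds tiny made-up FASTQ files in a temporary folder so it can be run without touching real data. The file names (sampleA_L001_R1.fq.gz and so on), the folder names, and the count_reads helper are illustrative assumptions, not names that CZ ID requires.

```shell
# Work in a throwaway folder so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir R1 R2

# Create two tiny, made-up compressed FASTQ files per read direction.
# Each file holds one read, and one read is always 4 lines.
for lane in L001 L002; do
  printf '@read_%s/1\nACGT\n+\nIIII\n' "$lane" | gzip > "R1/sampleA_${lane}_R1.fq.gz"
  printf '@read_%s/2\nTGCA\n+\nIIII\n' "$lane" | gzip > "R2/sampleA_${lane}_R2.fq.gz"
done

# Concatenate each direction separately. The output is written outside the
# input folders so the wildcard can never match the output file itself.
cat R1/*.fq.gz > concatenated_R1.fq.gz
cat R2/*.fq.gz > concatenated_R2.fq.gz

# Hypothetical helper: decompress, count lines, and divide by 4 to get reads.
count_reads() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

r1_reads=$(count_reads concatenated_R1.fq.gz)
r2_reads=$(count_reads concatenated_R2.fq.gz)
echo "R1 reads: $r1_reads, R2 reads: $r2_reads"

# For paired-end data, both files should contain the same number of reads.
if [ "$r1_reads" -eq "$r2_reads" ]; then
  echo "Read counts match"
fi
```

With your own data, only the two "cat" lines and the read-count check are needed. The count for each concatenated file should equal the sum of the counts of its input files.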
Automatic Concatenation of Nanopore Sequencing Files
CZ ID will automatically concatenate Nanopore sequencing data if files for a given sample are named using the same base name, including the "_pass_" or "fastq_runid_" qualifiers, and differentiated using file numbers prior to the file extension. If your filenames do not follow this naming convention, you will need to concatenate the files prior to upload using a web tool or the command line on Windows or Mac devices.
The table below provides examples of multiple files associated with one sample and how they will be automatically concatenated and named by the platform. Note that the automatic concatenation also applies to compressed files.
You will be able to verify whether the sequence files were uploaded correctly to the platform.
Example of automatic file concatenation after selecting 50 Nanopore FASTQ files associated with a single sample for upload. When uploading multiple sequencing files per sample, make sure the total number of samples and associated files match what you are expecting.
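To preview which files share a base name before upload, a rough sketch along the following lines may help. It strips the trailing file number and extension from each name and counts how many files remain per base name. The file names are made up for illustration, and the pattern is only an assumption based on the naming convention described above. It is not CZ ID's actual matching logic.

```shell
# Hypothetical file names for two samples. To check your own data, replace
# this list with the output of "ls" run in your sequencing folder.
names='sampleA_pass_barcode01_0.fastq.gz
sampleA_pass_barcode01_1.fastq.gz
sampleA_pass_barcode01_2.fastq.gz
sampleB_pass_barcode02_0.fastq.gz
sampleB_pass_barcode02_1.fastq.gz'

# Remove "_<number>" plus the .fastq or .fq extension (optionally .gz),
# then count how many files share each remaining base name.
groups=$(printf '%s\n' "$names" \
  | sed -E 's/_[0-9]+\.(fastq|fq)(\.gz)?$//' \
  | sort | uniq -c)

echo "$groups"
```

With the made-up names above, this reports three files for sampleA_pass_barcode01 and two for sampleB_pass_barcode02. A base name with an unexpected count suggests that a file is missing or is named inconsistently.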