How can I randomly subsample FASTQ files?

Jump to Section:

Instructions for Mac/Linux
Instructions for Windows

Overview

The CZ ID Web App only accepts FASTQ files that are smaller than 35 GB. If your files are larger than 35 GB, you will need to reduce their size by compressing them (".gz" format), randomly subsampling reads, or both. Alternatively, you can upload files through the command line interface, which has no file size limit. Here we list steps to subsample FASTQ files using SeqKit, a fast toolkit for FASTA/Q file manipulation. SeqKit runs on multiple operating systems (OS), including Mac, Linux, and Windows, and can be used without dependencies or pre-configuration.

Below we provide instructions for subsampling FASTQ files on MacOS/Linux and Windows systems. If you have questions or need help, contact our team by sending an email to help@czid.org.

MacOS/Linux Systems

To subsample FASTQ files on your MacOS or Linux system:

1. Open the Terminal and install SeqKit (skip to step 2 if SeqKit is already installed).

Option 1: If you have Homebrew installed on your computer, install SeqKit with the following command:

brew install seqkit

Option 2: See the SeqKit installation documentation for other options, including installation via Conda, git clone, downloading pre-built binaries, and compiling from source.
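Whichever option you use, it can help to confirm that the tool is reachable from the Terminal before moving on. The following is a minimal sketch; the function name check_seqkit is hypothetical and only tests whether a command named "seqkit" can be found.

```shell
# Hypothetical helper: report whether a "seqkit" command can be found.
check_seqkit() {
  if command -v seqkit >/dev/null 2>&1; then
    echo "seqkit found"
  else
    echo "seqkit not found; install it before continuing" >&2
    return 1
  fi
}
```

After defining the function, type check_seqkit to run the check.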

2. Change to the folder that contains the FASTQ sequence files by typing the "cd" command followed by the folder path. For example:

cd /Users/RosalindFranklin/Documents/CZ_ID/InputFiles

Note: SeqKit can read and write compressed files, including the ".gz" format accepted by CZ ID.

3. Count the sequences in your file using the "stats" command, and use the count to decide what proportion of the reads to keep. Note that CZ ID will truncate files at 150 million reads for paired-end data (75 million for single-end data).

seqkit stats input_file.fastq
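One way to turn the read count into a proportion is to divide the number of reads you want to keep by the total number of reads. The sketch below assumes a hypothetical helper named subsample_proportion and made-up read counts; it does not call SeqKit itself, so you would copy the count from the "num_seqs" column of the stats output.

```shell
# Hypothetical helper: print a proportion for "seqkit sample -p" that keeps
# roughly TARGET reads out of TOTAL reads (capped at 1, i.e. keep everything).
subsample_proportion() {
  total="$1"    # read count reported by "seqkit stats" (num_seqs column)
  target="$2"   # number of reads you want to keep
  awk -v total="$total" -v target="$target" 'BEGIN {
    if (total <= 0) { print "total must be positive" > "/dev/stderr"; exit 1 }
    p = target / total
    if (p > 1) p = 1
    printf "%.3f\n", p
  }'
}

# Example with made-up numbers: 200,000,000 reads in the file, keep 75,000,000.
subsample_proportion 200000000 75000000
```

With these example numbers the helper prints 0.375, which is the proportion you would pass to the subsampling command. Because subsampling by proportion is approximate, consider choosing a target slightly below the limit you need to meet.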

4. To subsample, run the following commands, where "-p" sets the proportion of reads to keep (e.g., 0.5 keeps ~50% of the reads) and "-s" sets the random seed. Make sure to use the same "-p" and "-s" values for both files of a paired-end sample so that the read pairs stay matched.

seqkit sample -p 0.5 -s 100 input_file_R1.fastq -o subsampled_output_file_R1.fastq.gz

seqkit sample -p 0.5 -s 100 input_file_R2.fastq -o subsampled_output_file_R2.fastq.gz

Note: If the output filename ends in ".gz", SeqKit will compress the output file automatically.
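To reduce the chance of typing different "-p" or "-s" values for the two files, the pair of commands above can be wrapped in a small loop. This is a sketch only: the function name subsample_pair and the "subsampled_" output prefix are hypothetical, and it assumes your files follow the "_R1.fastq" / "_R2.fastq" naming used in the example.

```shell
# Hypothetical wrapper: run "seqkit sample" on both mates of a paired-end
# sample with identical -p (proportion) and -s (seed) values.
subsample_pair() {
  proportion="$1"   # e.g. 0.5
  seed="$2"         # e.g. 100
  prefix="$3"       # e.g. input_file for input_file_R1.fastq / input_file_R2.fastq
  for mate in R1 R2; do
    seqkit sample -p "$proportion" -s "$seed" "${prefix}_${mate}.fastq" \
      -o "subsampled_${prefix}_${mate}.fastq.gz" || return 1
  done
}
```

For example, subsample_pair 0.5 100 input_file would write subsampled_input_file_R1.fastq.gz and subsampled_input_file_R2.fastq.gz.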

5. Check that each subsampled file is smaller than 35 GB. If so, upload the subsampled files to CZ ID using the Web App. If not, repeat step 4 with a smaller proportion.
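The size check can also be done from the Terminal. The sketch below uses a hypothetical function named under_upload_limit and assumes that 35 GB means 35,000,000,000 bytes, which is the stricter reading of the limit; adjust the number if CZ ID defines it differently.

```shell
# Hypothetical check: succeed only if FILE exists and is smaller than 35 GB.
# Assumption: 1 GB = 1,000,000,000 bytes.
under_upload_limit() {
  [ -f "$1" ] || return 2                 # file missing or not a regular file
  size=$(wc -c < "$1" | tr -d ' ')        # size in bytes, whitespace stripped
  [ "$size" -lt 35000000000 ]
}
```

For example, running under_upload_limit subsampled_output_file_R1.fastq.gz && echo "OK to upload" prints the message only when the file is under the assumed limit.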

Windows OS

To subsample FASTQ files on your Windows device:

1. Install SeqKit (skip to step 2 if SeqKit is already installed).

a. Download the latest compressed executable file for your Windows OS from the SeqKit download page:

32-bit: seqkit_windows_386.exe.tar.gz

64-bit: seqkit_windows_amd64.exe.tar.gz

b. Use 7-Zip or a similar program to extract the executable file from the downloaded ".tar.gz" file.

c. Copy the seqkit.exe file to the "System32" folder on the C drive (path: C:\WINDOWS\System32).

2. Open the Command Prompt (search "command" using the File Explorer and click the App to open)

3. Change to the folder that contains the FASTQ sequence files by typing the "cd" command followed by the folder path. For example:

cd C:\Users\RosalindFranklin\Documents\CZ_ID\InputFiles

Note: SeqKit can read and write compressed files, including the ".gz" format accepted by CZ ID.

4. Count the sequences in your file using the "stats" command, and use the count to decide what proportion of the reads to keep. Note that CZ ID will truncate files at 150 million reads for paired-end data (75 million for single-end data).

seqkit stats input_file.fastq

5. To subsample, run the following commands, where "-p" sets the proportion of reads to keep (e.g., 0.5 keeps ~50% of the reads) and "-s" sets the random seed. Make sure to use the same "-p" and "-s" values for both files of a paired-end sample so that the read pairs stay matched.

seqkit sample -p 0.5 -s 100 input_file_R1.fastq -o subsampled_output_file_R1.fastq.gz

seqkit sample -p 0.5 -s 100 input_file_R2.fastq -o subsampled_output_file_R2.fastq.gz

Note: If the output filename ends in ".gz", SeqKit will compress the output file automatically.

6. Check that each subsampled file is smaller than 35 GB. If so, upload the subsampled files to CZ ID using the Web App. If not, repeat step 5 with a smaller proportion.
