This document explains how to concatenate (join) multiple files into a single file, which is especially useful when building consensus genomes. Concatenation lets you consolidate consensus genomes for easy bulk upload, or increase consensus genome coverage by combining the raw reads from multiple runs of the same sample. Several methods are described, including a Galaxy tool and the command line.
Note: you can now bulk download concatenated consensus genomes in a single fasta file. See how to do that here.
Galaxy has a free tool that can be used to concatenate fasta and fastq files, which can be found here. When concatenating the reads from a paired-end sequencing run, make sure to concatenate the R1 reads separately from the R2 reads. Doing so produces ‘concatenated R1’ and ‘concatenated R2’ files that you can then upload to CZ ID. This tool can also be used to combine consensus genomes into one fasta file, so that they can be uploaded in bulk to another tool, like Nextclade.
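The same R1/R2 separation applies if you work in a Mac Terminal rather than Galaxy. The following is a minimal sketch, not an official CZ ID procedure: the file names (run1_R1.fastq.gz and so on) and the tiny reads are hypothetical stand-ins, and everything happens in a temporary folder so that no real data is touched. It relies on the fact that gzip files joined end to end still decompress as one valid gzip stream.

```shell
#!/bin/sh
# Work in a throwaway folder; all file names below are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in reads from two runs of the same sample (one record per file).
printf '@read1/1\nACGT\n+\nIIII\n' | gzip > run1_R1.fastq.gz
printf '@read1/2\nTGCA\n+\nIIII\n' | gzip > run1_R2.fastq.gz
printf '@read2/1\nGGCC\n+\nIIII\n' | gzip > run2_R1.fastq.gz
printf '@read2/2\nCCGG\n+\nIIII\n' | gzip > run2_R2.fastq.gz

# Concatenate R1 with R1 and R2 with R2, listing the runs in the same
# order both times so that read pairs stay in step.
cat run1_R1.fastq.gz run2_R1.fastq.gz > concatenated_R1.fastq.gz
cat run1_R2.fastq.gz run2_R2.fastq.gz > concatenated_R2.fastq.gz

# A FASTQ record is four lines, so R1 and R2 should have equal line counts.
# $(( ... )) strips the padding that some versions of wc add.
r1_lines=$(( $(gzip -dc concatenated_R1.fastq.gz | wc -l) ))
r2_lines=$(( $(gzip -dc concatenated_R2.fastq.gz | wc -l) ))
echo "R1 lines: $r1_lines, R2 lines: $r2_lines"
```

With your own data, only the two cat lines are needed, with your file names in place of the hypothetical ones. If the two line counts differ, the R1 and R2 files are probably not paired correctly and should be checked before upload.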
If you would prefer to use the command line, you can concatenate the raw reads (fastq files - including compressed files such as fastq.gz) or the consensus genomes (.fa) by doing the following:
If using Windows:
- If concatenating consensus genomes, place all of the consensus.fa files (output from the CZ ID consensus genome pipeline) in one folder, for example a folder inside Documents as in the command below. If concatenating raw reads, place those in one folder in the same way.
- Open the command prompt by typing ‘command prompt’ in the Windows search bar.
- Navigate to the folder where the consensus genomes (or raw reads) are located by using the ‘cd’ command. Note: you will need to replace “Rosalind Franklin” with your user name and “consensus genomes” with your folder name.
cd c:\Users\Rosalind Franklin\Documents\consensus genomes
- Use the ‘type’ command to concatenate the files. The pattern *.fa (or *.fastq.gz for raw reads) selects every file in the folder with that extension, and ‘>’ sends the combined output to a new file with the name that follows it. Note: to concatenate only specific files, list their names in place of *.fa (for example, consensus.fa and consensus2.fa). You can list as many files as you want.
type *.fa > concatenated.fa
- The new file (concatenated.fa) should be located in the same folder as the other consensus genomes.
If using Mac:
- If concatenating consensus genomes, place all of the consensus.fa files (output from the CZ ID consensus genome pipeline) in one folder, for example a folder inside Documents as in the command below. If concatenating raw reads, place those in one folder in the same way.
- Open the command line by typing ‘Terminal’ in Spotlight.
- Navigate to the folder where the consensus genomes (or raw reads) are located by using the ‘cd’ command. Note: you will need to replace “Rosalind Franklin” with your user name and “consensus genomes” with your folder name. Keep the quotation marks around the path if it contains spaces.
cd "/Users/Rosalind Franklin/Documents/consensus genomes"
- Use the ‘cat’ command to concatenate the files. The pattern *.fa (or *.fastq or *.fastq.gz for raw reads) selects every file in the folder with that extension, and ‘>’ sends the combined output to a new file with the name that follows it. Note: to concatenate only specific files, list their names in place of *.fa (for example, consensus.fa and consensus2.fa). You can list as many files as you want.
cat *.fa > concatenated.fa
- The new file (concatenated.fa) should be located in the same folder as the other consensus genomes.
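After concatenating on a Mac, it can be worth confirming that every genome made it into the new file. The sketch below is one way to do that, under stated assumptions: the two input files and their sequences are hypothetical stand-ins created in a temporary folder. It also writes to a temporary name first, because a concatenated.fa left over from an earlier run would itself match *.fa and be read as an input.

```shell
#!/bin/sh
# Work in a throwaway folder so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tiny stand-in consensus genomes (hypothetical names and sequences).
printf '>sample1\nACGTACGT\n' > consensus.fa
printf '>sample2\nTTGGCCAA\n' > consensus2.fa

# Write to a name that does not end in .fa, then rename, so the output
# can never be picked up as one of its own inputs.
cat consensus.fa consensus2.fa > concatenated.tmp
mv concatenated.tmp concatenated.fa

# Each FASTA record starts with ">", so counting those lines counts genomes.
record_count=$(grep -c '^>' concatenated.fa)
echo "records: $record_count"
```

The count should equal the number of genomes you combined (2 in this sketch). A lower count can mean that an input file did not end with a line break, in which case the next header is glued onto the end of the previous sequence.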