This document introduces how to use BLAST to confirm potential hits identified by CZ ID.
Overview
BLAST (Basic Local Alignment Search Tool) is an NCBI tool that finds regions of similarity between sequences by comparing the query sequence to a database of known sequences (CZ ID uses the nucleotide and protein databases from NCBI). It allows gaps, insertions, and deletions. It is an incredibly useful tool for confirming hits and annotating sequences.
There are five types of BLAST. Each is listed in the table below, along with its general use.
| BLAST type | Query sequence | Database | Alignment | Use |
| --- | --- | --- | --- | --- |
| blastn | nucleotide | nucleotide | nucleotide | sequence identity, useful for all taxa categories |
| blastx | nucleotide (translated to protein) | protein | protein | Identify encoded proteins, detection of novel viruses |
| blastp | protein | protein | protein | sequence ID and similarity search |
| tblastx | nucleotide (translated to protein) | nucleotide (translated to protein) | protein | ID nucleotide sequences with coding regions similar to the query |
| tblastn | protein | nucleotide (translated to protein) | protein | ID database sequences encoding proteins similar to the query |
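The table can also be read as a simple lookup from query type, database type, and whether the search is translated to a program name. The sketch below is only an illustration of that reading; the dictionary and the function name `choose_blast_program` are hypothetical and are not part of BLAST or CZ ID.

```python
# Lookup of BLAST programs keyed by (query type, database type, translated search?).
# The key layout and the function name are illustrative assumptions.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "nucleotide", True): "tblastx",
    ("protein", "nucleotide", True): "tblastn",
}


def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program name for a query/database combination."""
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"No BLAST program for combination {key}")
    return BLAST_PROGRAMS[key]


# A nucleotide contig searched against a protein database uses blastx.
print(choose_blast_program("nucleotide", "protein", translated=True))
```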
Performing a BLAST
- Access BLAST here.
- Choose the type of BLAST based on your goal. Blastn is usually sufficient to confirm a hit. However, if the contig belongs to a divergent virus, blastn may not return an accurate match. Blastx can be used to detect divergent viruses because protein sequences evolve more slowly than nucleotide sequences.
- Paste your query sequence in the box or upload a FASTA file with multiple sequences.
- Scroll down to the “Program Selection” section and choose a program to optimize for. For blastn you can choose between:
  - Megablast: used to find the best matching sequence.
  - Discontiguous megablast: used to find more dissimilar sequences.
  - Blastn: used to find related sequences from other organisms.
Megablast is generally used for the first pass, but if no taxon is identified, blastn can be used to search for homologous sequences.
- Click the ‘BLAST’ button.
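The steps above use the NCBI web form. For readers who prefer scripting, the same choices can be sketched as a command line for the BLAST+ `blastn` program, which is an alternative to the web workflow described here rather than part of it. The sketch below only builds the argument list and does not run anything. The file names, the `nt` database choice, and the function name `build_blastn_command` are assumptions made for illustration.

```python
import shlex

# Task names accepted by the BLAST+ `blastn -task` option, matched to the three
# "Program Selection" choices on the web form.
BLASTN_TASKS = {
    "megablast": "megablast",
    "discontiguous megablast": "dc-megablast",
    "blastn": "blastn",
}


def build_blastn_command(query_fasta, program="megablast", database="nt",
                         out_path="hits.tsv"):
    """Build (but do not run) a remote blastn command as an argument list."""
    if program not in BLASTN_TASKS:
        raise ValueError(f"Unknown blastn program: {program!r}")
    return [
        "blastn",
        "-task", BLASTN_TASKS[program],
        "-query", query_fasta,
        "-db", database,
        "-remote",        # send the search to NCBI instead of a local database
        "-outfmt", "6",   # tab-separated hit table
        "-out", out_path,
    ]


# "contig.fasta" is a hypothetical file name.
cmd = build_blastn_command("contig.fasta")
print(shlex.join(cmd))
```

Starting with `program="megablast"` and switching to `"blastn"` only when no taxon is identified mirrors the first-pass advice above.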
Interpreting the BLAST Results
The BLAST results show all of the taxa available in a database that have sequence similarity with the query sequence. The reported metrics and graphics can help determine the quality of the hit.
- The results page shows the following summary of the BLAST search:
  - The database used for the search.
  - The length of the query sequence.
  - Results tabs (Descriptions, Graphic Summary, Alignments, Taxonomy).
- Navigate to the “Descriptions” tab, which lists metrics that can help you determine the quality of each hit.
  - Max score: the highest bit score among the aligned segments for that database sequence. The bit score is derived from the raw alignment score (matches, mismatches, and gaps). On a log2 scale, it reflects the size of the search space in which a match this good could be found by chance. The higher the score, the better the alignment.
  - Total score: the sum of the alignment scores of all aligned segments from the same database sequence. The higher the score, the better the alignment.
    - Note that CZ ID uses an algorithm equivalent to "total score" to assign taxonomic IDs. If you find taxon A in the sample report on CZ ID and then run BLAST, taxon A may not be the top hit in the BLAST table because BLAST sorts by E value.
  - Query coverage: the percentage of the contig length that aligns with the NCBI hit. A small query coverage means only a small portion of the contig aligns. If an alignment has 100% identity but only 5% query coverage, the contig probably does not belong to that taxon.
  - E value: the number of hits expected to be seen by chance. The closer to 0, the better. Hits are automatically sorted by E value (best to worst). This metric is extremely useful for identifying real hits.
    - E value 1e-50 (small E value): few hits, but of high quality. BLAST hits with an E value smaller than 1e-50 are database matches of very high quality.
    - E value 0.01: BLAST hits with an E value smaller than 0.01 can still be considered good hits for homology matches (acceptable for divergent viruses), although this is not considered a ‘good’ E value.
    - E value 10 (large E value): many hits, some of low quality. Hits with an E value smaller than 10 include matches that are less significant than those with a low E value, but for a divergent virus the E value may be high.
  - Percent identity: the percentage of aligned bases that are identical to the reference sequence. A query sequence can have a low percent identity and still be a real hit. It is essential to take the E value into account and look for homology in conserved regions, which will be evident at the protein level.
- Click on the “Alignments” tab.
  - You will see the query sequence in blue across the top.
  - The location and length of the sequence alignments are represented below. Each row represents a taxon.
  - The color of the alignment represents the quality of the alignment, based on the alignment score. Red represents the alignments with the highest score (best alignment), while black represents the lowest score and cannot necessarily be trusted.
  - The gray horizontal lines represent gaps in the alignment.
  - Click the alignment to view it at the nucleotide level (protein level for blastx).
  - You can download the alignment or the complete sequence by clicking the download button and choosing the file of interest in the dropdown.
- When looking at a bacterial or eukaryotic sequence, you can also view the ‘graphics’ tab to assess whether an alignment falls in an rRNA region. rRNA regions are highly conserved (16S rRNA for bacteria and 18S rRNA for eukaryotes), so such sequences often hit multiple taxa, which makes it impossible to determine the sequence identity.
- Record the following metrics, as they are often reported in publications:
  - Accession number of the best hit (usually the top hit).
  - E value.
  - Query coverage.
- If you run blastn on a potential virus, it is a good idea to also run blastx to double-check the hit. A divergent virus may align poorly to a bacterial sequence but have a high-quality alignment to a virus.
- On the other hand, bacterial rRNA sequences are not found by blastx because rRNA genes do not encode proteins. If you blast
- If the BLAST result does not match the mNGS result in CZ ID, that is OK. mNGS data contain a lot of noise, and BLAST can help identify that noise or confirm a hit of interest.
- Once the taxon has been confirmed or determined to be a false positive, you can record this in the notes section of CZ ID.
  - Click ‘sample details’.
  - Click ‘notes’ in the right-hand panel.
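The checks described in this section can be sketched in a few lines of Python. The example below buckets hits by the rough E value thresholds given above, flags low query coverage, and shows why the top hit can differ when ranking by E value (as BLAST does) versus total score (as CZ ID's equivalent does). The `Hit` class, the function names, the 50% coverage cutoff, and all hit values are assumptions invented for illustration; they are not real BLAST output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    taxon: str
    max_score: float
    total_score: float
    query_cover: float       # percent of the query that aligns
    e_value: float
    percent_identity: float


def e_value_tier(e_value):
    """Bucket an E value using the rough thresholds from the list above."""
    if e_value < 1e-50:
        return "very high quality"
    if e_value < 0.01:
        return "plausible homology match"
    if e_value < 10:
        return "weak, interpret with caution"
    return "not significant"


def flag_low_coverage(hit, min_cover=50.0):
    """True when only a small part of the query aligns (cutoff is an assumption)."""
    return hit.query_cover < min_cover


# Made-up values for illustration only.
hits = [
    Hit("ACC_A", "taxon A", 320.0, 950.0, 98.0, 1e-85, 91.0),
    Hit("ACC_B", "taxon B", 400.0, 400.0, 35.0, 1e-108, 100.0),
    Hit("ACC_C", "taxon C", 54.0, 54.0, 40.0, 0.005, 72.0),
]

by_e_value = sorted(hits, key=lambda h: h.e_value)
by_total_score = sorted(hits, key=lambda h: h.total_score, reverse=True)

print("Top hit by E value:", by_e_value[0].taxon)
print("Top hit by total score:", by_total_score[0].taxon)

# The metrics often reported in publications: accession, E value, query coverage.
for h in hits:
    print(h.accession, h.e_value, h.query_cover,
          e_value_tier(h.e_value), flag_low_coverage(h))
```

In this invented example, taxon B has the best single segment and therefore the lowest E value, while taxon A aligns across almost the whole contig in several segments and therefore has the higher total score.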
For a glossary of terms and additional information about how the alignments and scores are derived, click here.
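As a rough illustration of how the bit score and the E value relate, the standard relation is E = m * n * 2^(-S'), where S' is the bit score, m is the query length, and n is the total database length. The sketch below applies that formula directly. BLAST itself uses adjusted (effective) lengths, so the reported E values will differ somewhat from this simplified calculation. The function names and the example lengths are assumptions.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E value from the standard relation E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_for(e_value, query_length, database_length):
    """Invert the relation: the bit score needed to reach a given E value."""
    return math.log2(query_length * database_length / e_value)


# Hypothetical lengths: a 1,000-base contig against a 1,000,000,000-base database.
m, n = 1_000, 1_000_000_000
print(expected_hits(50, m, n))
print(bit_score_for(0.01, m, n))
```

Each additional bit halves the E value, which is why a larger database requires a higher bit score to reach the same E value.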