Introduction to BLAST
BLAST (Basic Local Alignment Search Tool) is a widely-used bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) to identify sequence similarity in nucleotide or protein sequences. It functions as a sequence similarity search tool by comparing a user-provided query sequence against large databases.
Learn more about the Comprehensive Insights into EBI and Essential Bioinformatics Tools to understand where BLAST fits in the broader bioinformatics landscape.
Understanding the Query and Database
- Query Sequence: The input sequence to be searched.
- Database: Large collections of nucleotide or protein sequences against which the query is compared.
For an expanded overview of protein sequence resources, consult the Comprehensive Guide to Protein Databases: Types and Key Examples.
The BLAST Search Analogy
Searching a query sequence is analogous to finding a specific book in a vast library. Just as a librarian narrows down a search by categories (e.g., higher studies → life sciences → bioinformatics → foreign authors), BLAST efficiently narrows down sequence matches using algorithmic steps.
BLAST Algorithm Overview
Step 1: Removing Low Complexity Regions
- Identifies and removes repetitive or low complexity areas (e.g., repetitive amino acids) in the query by replacing them with placeholder characters (X for proteins, N for nucleotides).
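The masking step can be sketched in a few lines. This is a deliberately simplified stand-in, assuming that "low complexity" just means a run of three or more identical letters; the function name `mask_low_complexity` and the `min_run` parameter are illustrative and not part of BLAST, whose real filters are more elaborate.

```python
import re

def mask_low_complexity(seq: str, placeholder: str = "X", min_run: int = 3) -> str:
    """Replace every run of at least `min_run` identical letters with placeholders.

    Assumption: a simple run-length rule, not the filter BLAST itself applies.
    """
    # (.)\1{n,} matches one character followed by n or more copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: placeholder * len(m.group(0)), seq)

# Protein example from the lecture: the runs of L and K are masked with X.
print(mask_low_complexity("LLLLKRKDKLLKKKK"))  # XXXXKRKDKLLXXXX

# Nucleotide sequences use N as the placeholder.
print(mask_low_complexity("ACGTTTTTTGCA", placeholder="N"))  # ACGNNNNNNGCA
```

The two-letter run "LL" survives because it is shorter than `min_run`, which mirrors the lecture's remark that one or two repeated residues are fine.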
Step 2: Word List Creation and Scoring
- The query sequence is parsed into fixed-length words: typically 11-mers for nucleotides and 3-mers for proteins.
- Each word is scored against candidate matching words using a substitution matrix (the lecture uses BLOSUM62).
- Matches scoring above a chosen threshold (T value) are considered significant for further analysis.
Example:
- Query word: 'PQG'
- Candidate words and their scores: PQG (18), PEG (15), PSG (13), PQA (12).
- Threshold value: T = 13
- Words scoring ≥ 13 (PQG, PEG, PSG) are accepted as hits; PQA, at 12, is discarded.
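The word-list and threshold example can be reproduced with a short sketch. It assumes the lecture's query sequence LPPQGLL and hard-codes only the handful of BLOSUM62 entries this example needs; the helper names (`query_words`, `word_score`, `accepted_neighbors`) are illustrative, not BLAST functions.

```python
# Partial BLOSUM62 table: only the residue pairs used in this example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(a, b):
    # The matrix is symmetric, so look the pair up in either order.
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def query_words(query, w=3):
    # A query of length L yields L - W + 1 overlapping words of length W.
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def word_score(query_word, candidate):
    # Sum the per-position substitution scores.
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))

def accepted_neighbors(query_word, candidates, threshold):
    # Keep only candidates scoring at or above the threshold T.
    scores = {c: word_score(query_word, c) for c in candidates}
    return {c: s for c, s in scores.items() if s >= threshold}

print(query_words("LPPQGLL"))
# ['LPP', 'PPQ', 'PQG', 'QGL', 'GLL']

print(accepted_neighbors("PQG", ["PQG", "PEG", "PSG", "PQA"], threshold=13))
# {'PQG': 18, 'PEG': 15, 'PSG': 13}
```

PQA scores 7 + 5 + 0 = 12, one point under the threshold, so it is the only candidate dropped.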
Step 3: Hit Formation
- Each word match that meets or exceeds the threshold becomes a 'hit,' recorded and stored for further extension.
Step 4: Extension of Hits
- Hits are expanded in both left and right directions to find longer matching sequences, stopping when scores begin to decline.
- The extended matching region is termed a High-Scoring Segment Pair (HSP).
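A minimal sketch of ungapped hit extension follows. It rests on several assumptions that are not in the source: a flat match/mismatch score (+5/-4) instead of a substitution matrix, and a drop-off tolerance named `x_drop` that decides when the score has declined enough to stop. All function names are hypothetical.

```python
def score_pair(a, b, match=5, mismatch=-4):
    # Illustrative flat scoring, not BLOSUM62.
    return match if a == b else mismatch

def extend_one_side(query, subject, q_pos, s_pos, step, x_drop):
    """Walk left (step=-1) or right (step=+1); return (best gain, length at best)."""
    running = best = best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        running += score_pair(query[q_pos], subject[s_pos])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score has fallen too far below the best seen so far
        q_pos += step
        s_pos += step
    return best, best_len

def extend_hit(query, subject, q_start, s_start, word_len=3, x_drop=5):
    # Score the seed word, then grow it in both directions.
    seed = sum(score_pair(query[q_start + i], subject[s_start + i])
               for i in range(word_len))
    right_gain, right_len = extend_one_side(
        query, subject, q_start + word_len, s_start + word_len, +1, x_drop)
    left_gain, left_len = extend_one_side(
        query, subject, q_start - 1, s_start - 1, -1, x_drop)
    q_from, q_to = q_start - left_len, q_start + word_len + right_len
    s_from = s_start - left_len
    return {
        "score": seed + left_gain + right_gain,
        "query": query[q_from:q_to],
        "subject": subject[s_from:s_from + (q_to - q_from)],
    }

# Seed: query word PQG (index 2) against database word PEG (index 2).
print(extend_hit("LPPQGLL", "MPPEGLL", q_start=2, s_start=2))
# {'score': 21, 'query': 'PPQGLL', 'subject': 'PPEGLL'}
```

The leftmost pair (L against M) lowers the running score, so the reported segment stops one position short of it; the trimmed region is the HSP in this toy setting.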
For deeper understanding of sequence alignment algorithms, see Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide.
Scoring Systems in BLAST
- Raw Score: Sum of individual match scores in an alignment.
- Bit Score: Raw score transformed onto a normalized logarithmic scale.
- E-value: Statistical measure indicating the likelihood an alignment occurred by chance; a lower E-value indicates a more significant match.
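How the three scores relate can be shown with the standard Karlin-Altschul formulas, which the source does not spell out: the bit score is S' = (λS − ln K) / ln 2, and the E-value is E = m · n · 2^(−S') for a query of length m and a database of length n. The values of `lam` and `k` below are illustrative placeholders; real values depend on the scoring matrix and gap settings.

```python
import math

def bit_score(raw_score, lam, k):
    # Normalize the raw score so results from different matrices are comparable.
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    # Expected number of chance alignments with at least this bit score.
    return query_len * db_len * 2.0 ** (-bits)

lam, k = 0.318, 0.134  # assumed statistical parameters, for illustration only

for raw in (30, 50, 80):
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, query_len=250, db_len=1_000_000))
```

Running the loop shows the E-value shrinking rapidly as the raw score grows, which is why a lower E-value marks a more significant match.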
Types of BLAST Services at NCBI
- Nucleotide BLAST: Nucleotide query vs. nucleotide database.
- Protein BLAST: Protein query vs. protein database.
- BLASTX: Translates nucleotide query into protein in all reading frames and searches a protein database.
- TBLASTN: Protein query searched against a nucleotide database translated in six reading frames.
- MegaBLAST: Faster searches optimized for large nucleotide queries.
- PSI-BLAST: Detects distant protein homologs using iterative profile searches.
- PHI-BLAST: Searches for protein sequences that contain particular patterns.
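The first four services differ only in the type of query and database, so the choice can be written as a lookup. The sketch below uses the lowercase program names (`blastn`, `blastp`, `blastx`, `tblastn`); the function `choose_program` and the type labels are hypothetical helpers for illustration.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}

def choose_program(query_type: str, db_type: str) -> str:
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError(f"no program for {query_type!r} vs {db_type!r}") from None

print(choose_program("nucleotide", "protein"))  # blastx
print(choose_program("protein", "nucleotide"))  # tblastn
```

MegaBLAST, PSI-BLAST, and PHI-BLAST are left out of the table because they are chosen by research goal (speed, remote homologs, patterns) rather than by sequence type alone.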
For an in-depth exploration of recombinant proteins relevant to BLAST protein searches, reference the Comprehensive Guide to Recombinant Protein Expression and Structural Biology.
BLAST vs. FASTA
Besides BLAST, FASTA is another tool for sequence similarity searches, offering alternative algorithmic approaches.
Practical Considerations
- Understanding these steps enhances proper usage and interpretation of BLAST.
- Proper threshold and scoring considerations prevent unnecessary computation and enhance result reliability.
By grasping the BLAST workflow, from query processing, word scoring, and hit identification to final alignment scoring, researchers can efficiently search large databases for meaningful sequence similarities in both nucleotides and proteins.
All right, I'm going to talk a little bit about the Basic Local Alignment Search Tool, BLAST. Whenever you talk about bioinformatics, BLAST is something we always need to understand and also perform practically. So what is the full form of BLAST? B-L-A-S-T, all caps: Basic Local Alignment Search Tool. Basically, it's a tool for searching similarity, developed by NCBI, and it is a very popular tool for finding sequence similarity. So we can say it is a sequence similarity search tool.
Now, how do we search for sequence similarity? The sequence can be either a nucleotide sequence or an amino acid sequence, that is, a protein sequence. The sequence that we use to search throughout all the databases under NCBI is known as the query sequence. The query sequence is the sequence to be tested, meaning searched, to find similarity. And then there is the whole sequence database, in which we try to find the match.
Finding that match is like trying to get a bioinformatics book for your preparation. I told you the name of the book, you have the book in your hand, and now you go to a library or a shop where you want to get that same book. So you show it to the librarian and ask them to search the whole library and get that same book out. The librarian will follow a protocol, because the library is filled with thousands of books. The librarian will take your book and first check what kind of book it is, whether it is for lower or higher studies. It is a higher-studies book, because bioinformatics is for higher studies, and there is a higher-studies section in the library. So now the search is limited: from the whole library, it goes to the higher-studies section. Inside higher studies, what subject is it? Bioinformatics, under life sciences. So the librarian will now stick to the life sciences portion, which is even smaller than the total higher-education section. In the life sciences textbook section, the person will look through all the different subjects (zoology, botany, physiology, bioinformatics, biochemistry) and find the place where only bioinformatics books are kept. Within that bioinformatics shelf, the person will check what kind of book it is, whether it is by an Indian author or a foreign author. Let's say it's a foreign author; then the person goes to the foreign-author part of the bioinformatics section and searches for the book there.
The search becomes very organized when we do it like this. Now suppose you instead give this bioinformatics book to someone who knows nothing about bioinformatics and hasn't even heard of it. That person will run around and look at every book in the library, trying to match the front cover, and then return your query. Who will take more time? Obviously the latter person, and that approach is not logical or scientific at all. So there should be a proper approach, and this approach is known as an algorithm. An algorithm is a process that is fed to a software platform and that the software runs. Software only knows binary, one or zero; we are not binary, we always think in quantitative measurements. Based on those binary values we can create an algorithm, and the software will follow the algorithm blindly, and we know that we are going to get our data. I just gave you an example of an algorithm for searching for a book. Similarly, if you have a query sequence, you need to search it throughout all the databases under NCBI, because NCBI runs multiple databases, which are very popular and get a lot of traffic. You want to find out whether your query sequence matches any other sequence in the database. That's what we run BLAST for.
So the BLAST algorithm has steps. What are the steps? Basically, the BLAST algorithm processes the query in proper stages. The input is your query sequence, and the very first step is removing low-complexity areas. Any low-complexity area will be removed. For example, let's assume we are talking about a protein sequence: L L L L K R K D K L L K K K K. Low-complexity areas are those that contain repetitive sequence, so the run of L's is repetitive, and the run of K's is also repetitive. Repetitive sequences are not usually allowed, and the first thing the BLAST algorithm does is remove them and place X or N instead: X for protein and N for nucleotide. So it will be four X's, then K R K D K L L, and then four more X's. X means you are not going to consider that stretch because of low complexity. We only take sequence that is unique; one or two repeated residues are fine, but no long repetitive stretch.
After this is done, the list of words is made, and each word in the query is scored. I'm not going to go through the hard-and-fast rules of the BLAST algorithm; I'm going to share it in a way that you understand. The background of the BLAST algorithm is not the most important thing; what is more important is how to run BLAST in practice. Practical knowledge of BLAST and of FASTA is more important, and we'll do that later on, but just try to understand the situation now.
just try to understand the situation now so we have value right for individual for individual position we have a score
or scoring system we have the scoring system okay and uh the word so so whatever query
we're searching we call it word okay how many word for a nucleotide sequence we use 11 11 words for
proteins we generally take three words okay three for protein sequence 11 for nucleotide that is the fixed length for
which we check we run the sequence similarity search okay so now what we do is that based on
each of these word matching so whenever there is a match of word there's a score given based on that match if there's a
mismatch then there's a different score there's a match there's a score so for 11 different uh 11 stretch of a word uh
there will be a particular score given to that sequence okay now by this fashion maximum score what will be the
maximum score ask yourself what will be the maximum score if all 11 words are matched all
matched then only if all of them are matched then only what we will get then only we'll get maximum score and and
generally in this case the maximum score value is known as threshold value okay so maximum score not not
threshold sorry this is a maximum score when you get uh full match but full match is not a possibility so sometimes
you know out of 11 there may be two matches out of 11 eight matches out of 11 10 matches out of out of 11 five
matches there are different uh possibilities possible we have a maximum score if all of them matched and we have
a minimum score minimum means no match basically that is minimum so between
minimum score and maximum score there should be something in the middle that we need to keep and decide
so that the search can run because you can see the database is huge remember we found this algorithm without the
algorithm finding out the bioinformatics book from the library will be very difficult similar thing here so the
library if our query sequence is only of 11 stretch and it's searching for the whole databases under NCB take ages to
revert us with the data so what they do is that they always kept a threshold threshold value which is known as T
value T value capital T or threshold value this threshold value is decided in such a way that all the word pair with
the score greater than the t so only we will choose the score only we will choose those sequences where the value
is either equal to T threshold value or greater than t so if and only if our uh query sequence
is matching with the database sequences and the score getting equal to threshold value or more than threshold value then
only we are going to keep that comparison otherwise we're going to discard
it okay so what we can clearly say here I can tell you that this is let's say uh the query sequence and let's say the
length is capital L okay and what what else we have we have our we can only find out so let's say
from here from here we are finding out overlapping sections like this what are this from the query sequence we are
finding individual word each word from the query sequence each word only taken from the
query sequence if it responds to and give us the T value greater than equal to sorry the value here the value must
be greater than equal to T value so the score actually the score value of individual word
here must be greater than or equal to T value then only we'll take that otherwise we'll don't take that so now
let's look at an example. Let's say the query sequence Q is L P P Q G L L, and the database sequence is M P P E G L L. What kind of comparison do we do? As I mentioned earlier, we take individual words from the query sequence and compare them. How many words can there be? L minus W plus 1. What is W? W is the word length, which is 11 for nucleotides and three for proteins. Here we are looking at a protein sequence, so W is three: three letters at a time. What are the possibilities in the query sequence? The first three, LPP; then, one letter along, PPQ; then PQG, QGL, and GLL. These are all words from the query, each of length W.
Let's take one such word, PQG, and say the threshold value for PQG comes out to be 13. With a threshold of 13, out of all the possibilities we are only going to take the ones whose score is equal to or greater than 13. Now, PQG is our query word, and we are matching it with database words. If we compare PQG with PQG, there is a 100% match, so the maximum score is obtained, and that is 18 in this case, which is obviously greater than 13, so we take it. Then PQG is checked against PEG: P matches P, G matches G, and Q and E are a mismatch, and we get a score of 15. Similarly, PQG against PSG: S and Q are different, P and G both match, and we get a score of 13. Then PQG against PQA: P and Q match, G and A mismatch, and the score is 12. So we have the maximum score of 18 for PQG with PQG, 15 for PQG with PEG, 13 for PQG with PSG, and 12 for PQG with PQA. Remember, the threshold is 13 or greater, so we will not take the PQA value; the other three we take.
This is how it is done. So that is the very first thing, the processing of the query. After the processing of the query, the second step is the creation of hits. The query word is now represented by a group of words that scored at or above the threshold, and we call them neighbors. The neighbor words are compared with the word list of the database, and if one of the neighbor words is identical to a word of the database sequence, a hit is recorded. So a hit means that the query word, or one of its neighbors, matches a word in the database sequence; then it is considered a hit, and the hit is recorded. We know the threshold value T is 13 in this case, so the list of possible matches to PQG, including PEG and PSG, has been recorded.
Once the hit is recorded, we need the third step of the BLAST algorithm, and that is the extension of hits. Once a hit is generated, it needs to be extended to the left and right until the score begins to decrease. A decreasing score means we are moving into sequence that no longer matches, so we stop the extension once no further match is found. The stretch of sequence over which the query and the database sequence are matched is known as a high-scoring segment pair, HSP. After all the words are tested, the best set of high-scoring segment pairs is chosen for that database sequence.
For example, the database sequence is M P P E G L L and our query sequence is L P P Q G L L. There is a particular score that we can generate at the matching word, and from that matching pair we move to the left-hand side as well as to the right-hand side to find the score of the hit. This is how we get the HSP score, the score of the high-scoring segment pair, the longer stretch. Remember, we are matching PQG with PEG, and similarly we have PSG; these are all in the list of words. The list of words is composed of three letters here in the case of protein, but in the case of nucleotides it would be 11 letters. It is difficult to explain with nucleotides; that's why I chose protein for this explanation.
Based on this we are going to get the value of the HSP. Now, who keeps the score? There is a scoring system; here we use BLOSUM62 as the scoring system. What BLOSUM is, what the PAM matrix is, we have discussed already; you can watch that lecture on BLOSUM and PAM matrices. Basically, this is a matrix way of assigning scores when we are comparing two sequences. Based on this scoring system, we get a value of 776 for this three-letter word, 27 for the first stretch to the left, and 44 for the first stretch to the right. Similarly, we count the HSP score, and we keep moving one step at a time to the left-hand side as well as to the right-hand side until the value gets lower and lower. So 776 is the value in the center; moving one portion to the left gives 27 and one portion to the right gives 44, and we continue to move until the value gets very low. And remember, after all words are tested, the best set of HSPs is chosen for the database sequence.
So what is the output of BLAST? In the output there will be a list of database sequences that match the query sequence, and some statistical parameters are also reported there. A score is also given for every match, and the score can be of three different kinds: the raw score, the bit score, and the E-value. What is the raw score? It is the sum of the scores of the segment pairs that make up the alignment. What is the bit score? The raw score is converted to a log scale, using the log base of the scoring matrix; that is the bit score. And what is the E-value? The E-value indicates how significant a given sequence alignment is, that is, how likely it is that the alignment with the database sequence arose by chance. For example, you may get a good-looking match, but if the E-value is high, we cannot rely on the data much. A lower E-value is desired; remember that. And finally we'll conclude
with the BLAST services that are available. There are various kinds of BLAST services on the NCBI website; I'll write some of them with their query and database types. Nucleotide BLAST: the query sequence is nucleotide and the database is also nucleotide. Then there is protein BLAST: an amino acid query against a protein sequence database. Then we have BLASTX: the query sequence is nucleotide, and it is translated in all reading frames and compared against a protein sequence database. Then we have TBLASTN: the query sequence is protein, against a nucleotide database translated in six reading frames. TBLASTX again uses six-reading-frame translation: a nucleotide query translated in six reading frames against a six-frame translated nucleotide database. MegaBLAST: large sequences against a nucleotide sequence database. Then there is PHI-BLAST: it finds sequences that contain a given pattern and are homologous to a query protein sequence, and the database is also of the protein type. And there is PSI-BLAST, position-specific iterated BLAST, which is used for finding distantly related members of a protein family. So these are the different kinds of BLAST available on the NCBI site.
Now, when we are talking about sequence search and alignment tools, there is one more thing we must discuss, and that is FASTA. What is FASTA? It is another sequence similarity search tool, like BLAST.
BLAST (Basic Local Alignment Search Tool) is a bioinformatics program that compares a user-provided nucleotide or protein sequence against large databases to identify regions of similarity. It helps researchers find homologous sequences quickly by efficiently narrowing down potential matches using word-based algorithmic steps.
BLAST first removes low complexity or repetitive regions in the query sequence by replacing them with placeholder characters (X for proteins, N for nucleotides) to avoid misleading matches. Then, it breaks the query into fixed-length words (11-mers for nucleotides, 3-mers for proteins), scores these words against database sequences, and selects those surpassing a threshold for further analysis.
BLAST uses several scoring metrics including the raw score (sum of individual match scores), bit score (a normalized logarithmic transformation of the raw score), and E-value, which estimates the probability that the alignment occurred by chance. A lower E-value indicates a more statistically significant match, helping users judge the reliability of hits.
NCBI offers multiple BLAST services: Nucleotide BLAST for nucleotide queries against nucleotide databases, Protein BLAST for protein queries, BLASTX for nucleotide sequences translated into protein searched against protein databases, TBLASTN for protein queries searched against translated nucleotide databases, MegaBLAST optimized for large nucleotide queries, PSI-BLAST for detecting remote protein homologs via iterative searches, and PHI-BLAST for pattern-specific protein sequence searches. Choose based on your query type and research goal.
After identifying initial word hits scoring above a threshold, BLAST extends each hit in both directions to capture longer matching regions called High-Scoring Segment Pairs (HSPs). This extension continues until the alignment score starts to decline, ensuring that only significant local alignments are considered for the final results.
While both BLAST and FASTA are tools for sequence similarity searches, BLAST uses a word-based heuristic approach optimized for speed and sensitivity by focusing on high-scoring word matches and extending them. FASTA uses a different algorithm emphasizing local alignments through initial exhaustive word searches and subsequent refinements. Choice depends on the specific research needs and desired sensitivity.
Understanding BLAST's workflow—including query preprocessing, word scoring, hit identification, and alignment scoring—enables users to set appropriate thresholds and interpret results accurately. This knowledge helps avoid unnecessary computations, reduces false positives from low-complexity regions, and ensures reliable identification of biologically relevant sequence similarities.
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.
Related Summaries
Comprehensive Guide to FASTA: Algorithm, Types, and Comparison with BLAST
Explore how the FASTA algorithm performs sequence similarity searches using k-tuples, dot plots, and local alignment with dynamic programming. Understand different FASTA types like TFAST and FASTX/Y and how they compare protein and nucleotide sequences, highlighting differences from BLAST.
Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide
Discover how global sequence alignment works using the Needleman-Wunsch algorithm, including step-by-step procedures for initialization, matrix filling, and traceback. Learn the scoring system, gap handling, and how heuristic methods optimize sequence searches without sacrificing sensitivity or specificity.
Comprehensive Guide to Sequence File Formats in Bioinformatics
This article provides an in-depth overview of primary and secondary sequence data used in bioinformatics, explaining various sequence and molecular file formats. It covers formats like FASTA, GenBank, GCG, EMBL, ClustalW, and UniProt, detailing their structure, usage, and significance in sequence analysis and molecular studies.
Comprehensive Insights into EBI and Essential Bioinformatics Tools
Explore the pivotal role of the European Bioinformatics Institute (EBI) in managing diverse biological databases and discover key bioinformatics tools for sequence analysis, pattern recognition, and structural comparison. Understand the synergy between wet labs and dry labs in modern bioinformatics and how EBI supports genomic and proteomic research.
Comprehensive Guide to Molecular File Formats for Protein 3D Modeling
Explore the essential molecular file formats like PDB, mmCIF, CHARMM, MDL, and Mopac used in protein 3D structure modeling. Understand their specific sections, applications in crystallography and molecular dynamics, and learn about key file conversion tools to integrate diverse data sources effectively.

