Global Sequence Alignment Explained: Needleman-Wunsch Algorithm Guide

Understanding Global Sequence Alignment

Global sequence alignment is a method used to compare two nucleotide or protein sequences along their entire length. It differs from local alignment, which focuses on matching subsequences. The Needleman-Wunsch algorithm is the standard approach for global alignment, while the Smith-Waterman algorithm is used for local alignment.

Key Stages of the Needleman-Wunsch Algorithm

Initialization:
- Construct a scoring matrix with sequences laid out along the x-axis and y-axis.
- Initialize the first row and column with zeros (as in this guide, where gaps score 0) or with cumulative gap penalties.
Matrix Filling:
- Evaluate each cell by comparing corresponding sequence elements.
- Assign scores based on matches (+1), mismatches (0 or penalty), and gaps.
- Calculate cell scores from three possible directions: diagonal (match/mismatch), left (gap), and up (gap).
- Use the maximum of these scores to fill the matrix cell.
Traceback:
- Start from the bottom-right cell (maximum score) and trace back through the path of optimal scores to the top-left cell.
- Follow diagonal moves representing matches/mismatches and horizontal or vertical moves representing gaps.
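
The three stages above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name and parameter names are hypothetical, and the defaults (match +1, mismatch 0, gap 0) follow the scoring model used in this guide. The example sequences are the ones used in the video walkthrough.

```python
def needleman_wunsch(seq_x, seq_y, match=1, mismatch=0, gap=0):
    """Globally align seq_x (columns) with seq_y (rows).

    Returns (score, aligned_x, aligned_y). Names and defaults are
    illustrative assumptions, not taken from a specific library.
    """
    rows, cols = len(seq_y) + 1, len(seq_x) + 1

    # Initialization: first row and column hold cumulative gap penalties
    # (all zeros when gap == 0).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Matrix filling: each cell is the best of diagonal, up and left.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                score[i - 1][j] + gap,       # up: gap in seq_x
                score[i][j - 1] + gap,       # left: gap in seq_y
            )

    # Traceback: walk from the bottom-right cell back to the top-left.
    aligned_x, aligned_y = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                aligned_x.append(seq_x[j - 1])
                aligned_y.append(seq_y[i - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_x.append("-")
            aligned_y.append(seq_y[i - 1])
            i -= 1
        else:
            aligned_x.append(seq_x[j - 1])
            aligned_y.append("-")
            j -= 1

    return score[-1][-1], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)  # 6, the bottom-right value of the filled matrix
print(top)
print(bottom)
```

When several paths tie, this sketch prefers the diagonal move, so the exact gap placement may differ from a hand-drawn traceback while the score stays the same.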

Scoring Example and Matrix Filling

Matches score +1, mismatches or gaps score 0 in this model.
For instance, when two 'G's are compared, the new cell score is the diagonal cell score + 1.
Values propagate rightward, downward, or diagonally, and each cell keeps the highest candidate.
Add the match score only to the diagonal value, never to the value carried in from the left or from above.
This process continues until the entire matrix is filled.
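
The single-cell rule can be written as a small helper. This is a sketch under this guide's scoring model; `cell_score` and its parameters are hypothetical names chosen for illustration. The two calls reproduce the cases the video flags as common mistakes.

```python
def cell_score(diagonal, up, left, is_match, match=1, mismatch=0, gap=0):
    """Score of one cell from its three neighbours (illustrative helper)."""
    # The match/mismatch value is added to the diagonal neighbour only.
    from_diagonal = diagonal + (match if is_match else mismatch)
    # Values from above or from the left carry over with the gap score.
    return max(from_diagonal, up + gap, left + gap)


# Second G against the first G: diagonal is 0, a 1 carries down from above.
print(cell_score(diagonal=0, up=1, left=0, is_match=True))  # 1, not 2

# A against A with a 2 above and a diagonal of 1.
print(cell_score(diagonal=1, up=2, left=1, is_match=True))  # 2, not 3
```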

Traceback and Alignment Representation

Traceback identifies the aligned sequences and gap positions.
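Diagonal moves pair one letter from each sequence; in the optimal alignment these are the matches (or mismatches, where the two letters differ).
Horizontal or vertical moves consume a letter from only one sequence and are written as gaps.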
The resulting alignment is displayed with vertical bars for matches and dashes for gaps.
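
A display of this kind can be produced with a few lines of Python. This is a sketch: `render_alignment` is a hypothetical helper, and the two gapped strings in the example are made up for illustration rather than taken from the video.

```python
def render_alignment(aligned_top, aligned_bottom):
    """Return a three-line display: top sequence, match bars, bottom sequence."""
    if len(aligned_top) != len(aligned_bottom):
        raise ValueError("aligned sequences must have the same length")
    # A vertical bar marks a column where both rows hold the same letter;
    # gap columns (containing '-') and mismatches get a space.
    bars = "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(aligned_top, aligned_bottom)
    )
    return "\n".join([aligned_top, bars, aligned_bottom])


print(render_alignment("GATTA", "GA-TA"))
# GATTA
# || ||
# GA-TA
```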

Heuristic Methods in Sequence Alignment

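Heuristic methods speed up database searches by focusing on short subsequences (words) of the query.
Instead of aligning the full query against every database sequence, they first locate candidate matches from small query fragments and then examine only those candidates in more detail.
This improves computational efficiency, at some risk of missing true matches.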
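
The word-based idea can be illustrated with a toy index. This is a simplified sketch and not how BLAST or FASTA are actually implemented: the function names, the tiny database, and the word length `k` are all assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every length-k word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k):
    """Count shared words between the query and each indexed sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return dict(hits)


# Hypothetical two-entry database.
database = {"seq1": "GAATTCAGTTA", "seq2": "CCCCCCCC"}
index = build_word_index(database, k=3)
print(candidate_hits("TTCAG", index, k=3))  # {'seq1': 3}
```

Only the sequences that share words with the query would then be passed on to a full, slower alignment step.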

Balancing Sensitivity and Specificity

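Sensitivity: Probability of correctly identifying a true match, calculated as true positives / (true positives + false negatives).
Specificity: Probability of correctly identifying a non-match, calculated as true negatives / (true negatives + false positives).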
Four possible outcomes in assessments:
- True Positive (correct match)
- False Positive (incorrect match)
- True Negative (correct non-match)
- False Negative (missed match)
Both sensitivity and specificity are crucial to ensure reliable and accurate alignment results.
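
Both measures follow directly from the four outcome counts. The sketch below uses hypothetical function names and made-up counts purely for illustration.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real matches that the test reports as matches."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real non-matches that the test reports as non-matches."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts from a search evaluated against known homologs.
print(sensitivity(true_pos=90, false_neg=10))  # 0.9
print(specificity(true_neg=80, false_pos=20))  # 0.8
```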

Practical Tips for Global Alignment

Always start with a clear understanding of scoring criteria.
Use matrix filling and traceback logically to generate final alignments.
Represent alignments visually using vertical lines for matches and dashes for gaps (insertions/deletions).
Understand differences between global and local alignment to choose the right method.

By mastering the Needleman-Wunsch algorithm and heuristic approaches, bioinformatics practitioners can effectively perform global sequence alignments, essential for genomic and proteomic analyses. For further exploration of key resources in bioinformatics, consider reviewing Comprehensive Insights into EBI and Essential Bioinformatics Tools and the Comprehensive Guide to Protein Databases: Types and Key Examples. These resources complement the understanding of sequence alignment within the broader context of bioinformatics.

Now we are going to solve what is known as a global alignment. Remember, I talked about global alignment and local alignment, and there are two different algorithms that we use: the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment. These alignments have three stages: first initialization, second matrix filling, and third traceback. Whichever algorithm we are going to do, Smith-Waterman or Needleman-Wunsch, those are the three stages.

Okay, IMT: initialization, matrix filling, traceback. That is how we are going to go. I am not going to cover the theory behind it again. There are two sequences: sequence one is laid out along the x-axis and sequence two along the y-axis, and we are aligning these two sequences. So what is the procedure? Try to follow the process carefully; pause the video whenever you find it difficult and rewind that portion until it is clear.

The very first step is to draw the table. On the x-axis we have sequence one and on the y-axis sequence two. Right next to them I keep a small extra column and row, and you put zeros there. You can either write the zeros or leave that column undrawn; it is up to you. Zero is the lowest value we can have, and it is our starting value.

When we align two sequences, each pair of letters has two possibilities: a match or no match. The marking scheme is +1 for a match and 0 for no match, which we also call the penalty. There are different approaches: in the videos I made almost ten years ago, I set the no-match value to -1 or something like that, but here a match is +1 and the no-match penalty is 0. So we check every cross, for example G with G: if there is a match we add one, and if there is a mismatch we add zero. We do that for every pair of letters across the two sequences. That is the first point.

The second point is how the value is added: the match or no-match value is always added to the value on the diagonal. For example, at the very beginning we have G and G, which is a match, so +1. The start point was zero and we move diagonally, so 0 + 1 gives 1. Now, the value we get is free-floating: the same value can carry over to the right, it can carry down to the bottom, or it can flow along the diagonal. So one value can move in three directions: to the next column, to the next row, or to the diagonal cell.

Keep that rule in mind: the 1 we just got can move in three directions, and at each new cell we also check the two letters for that cell. Take the second G of sequence two against the first G of sequence one. The 1 from above carries down into this cell, and G with G is another match. So should we add 1 + 1 and write 2? No. The match value +1 is added to the diagonal value, and the diagonal here is 0, so 0 + 1 = 1. The value does not change.

The same 1 carries along the first row: A and G, no match, so 1; T and G, no match, 1; C and G, no match, 1. The later G with G is another match, but its diagonal is 0, so 0 + 1 is again 1. All the values in the first row are 1.

In the next column, the first two cells are 1. Then A with A is a match, and the diagonal value there is 1, so 1 + 1 gives 2. If that is not clear, replay this part: the +1 is added to the diagonal value, not to the value beside or above the cell. That 2 then carries on down the column: T with A, no match, 2; C with A, no match, 2; G with A, no match, 2. At the bottom, A with A is another match. Many people make a mistake here: they see the 2 above and add +1 to it to get 3. No. The match value is added to the diagonal value, which is 1, so 1 + 1 = 2.

In the third column, A with G gives 1 and 1. For A with A the 2 has already carried over from the left, and you can also see why it is 2: the diagonal is 1, and 1 + 1 = 2. Then A with T is 2, A with C is 2, A with G is 2, and at the end A with A is another match, added to the diagonal 2, so 2 + 1 = 3.

In the next column, G with T is 1 and 1 by normal carrying, A with T carries the 2, and T with T is a match added to the diagonal 2, giving 3. That 3 then carries down: C and T, 3; G and T, 3; A and T, 3. The following column is the same: 1, 1, then the 2 carries across for A and T, and T with T is a match on a diagonal of 2, so it is 3, not 2. The 3 carries down for C, G and A.

In the C column we start with 1 and 1, then the 2 and the 3 carry across. C with C is a match, added to the diagonal 3, so 3 + 1 = 4. C with G, no match, stays 4, and A with C, no match, 4 + 0 is 4.

In the next A column, A and G gives 1 and 1; A with A is a match, 1 + 1 = 2, which is also what carries across. A and T is no match, but 3 carries across, so it is 3. A with C is no match, but 4 has already been reached, so 4 carries across. G with A is 4, and at the bottom A with A is a match added to the diagonal 4, so 4 + 1 = 5.

In the G column, the first G with G is 1, and the second G with G adds one to the diagonal 1, so it is 2. G with A is 2, then 3 and 4 carry across, and G with G is a match added to the diagonal 4, so 5. That 5 carries down to the last row.

In the next T column we have 1, 2 and 2, then T with T: 2 + 1 = 3. T and C would be 3 + 0, but 4 carries across; then 5 carries across for G and T and stays 5 for A and T. The second T column is the same: T with T gives 2 + 1 = 3, then 4, 5 and 5.

In the last column we had 2 earlier; 3 carries across for A with T, then 4, then 5, and last, A with A in the bottom-right cell is a match added to the diagonal 5, so we get 6. That is how the matrix is filled.

So initialization is done: we put zeros along both the x-axis and the y-axis next to the sequences. Then we filled in the matrix values. The rule is that whatever value we have reached carries over to the right and downward, one square at a time, and a match adds +1 to the diagonal value. We start our journey at the zero in the top-left corner and end at the bottom-right corner. The start point is 0, and the maximum value we reached at the end is 6.

The third step is traceback: we trace back from the highest value, which we got at the end, to the lowest value, which we started with at the beginning. The values that simply carried across are not important. The important cells are the ones where we added +1 and the value came along the diagonal, that is, where the value changes. So what we need to pick out are the diagonal arrows, starting from the highest value and going back to zero. In this table I count seven diagonal arrows, and together they form a roughly diagonal path from 0 to the maximum value of 6. The cells where the value only carried across are not counted, because there we simply followed the carry rule. Between the diagonal arrows the path passes through gaps: after the first G with G, around the A's, at a T, between C and A, and in the G T T stretch before the final value of 6.

So now we know how to fill the matrix and how to trace it back. Next is how to represent the result: from this table, how do you write out the alignment? That is very important, and it is what we will do now. I just took a screenshot of the table, and now I am going to draw the actual alignment.

What we have are the similarities and the dissimilarities, so try to understand how to draw them. We write out the two sequences, with sequence number one on top. It starts with G, and the very first G pairs with G, so for that pairing we draw a vertical line between them. The bottom sequence then has another G, but that second G has no partner on top, so it gets no vertical line. The top sequence continues A A T T C A G T T A. The bottom sequence has only one A after its G's, and it is paired with an A on top and joined with a vertical line. After that there is a match at T with a gap next to it, a match at C, then G and A, again with gaps in between.

How do you know whether there is a gap or not? That is very important, so focus on the boxes. After tracing back, highlight the cells on the path so that you can see where there is a pairing. The rule is that the start point of each shaded stretch is taken as a similarity, and whatever lies in between is a gap; you do not need to do anything else.

Let me erase this and do it again so that the process is clear, because it can be a little complicated. First write down the longer sequence, because the gaps will obviously be in the shorter sequence. I will use a different colour for the shorter sequence. Start from the end: A with A is a match. As per our rule, the next match is at the start point of the shaded cells, which is G, so the upper G pairs with the lower G. Then again the start point of the shaded cells: C with C is a match. Then T with T, then A with A, and last the first G with G. The upper sequence is already written, and the lower sequence is G G A T C G A, so there is one more G that has to be placed near the start, and the remaining positions are gaps.

This is our alignment. Whenever they ask an alignment question, they want an answer like this: the nucleotide sequences written one above the other, where a vertical line means a similarity and a horizontal line means a gap at that position. If they ask any problem like this, you can solve it this way.

So that was global alignment. Local alignment instead matches the most similar regions of the two sequences rather than their full lengths, so the aligned stretch is usually shorter. The three stages are the same, but the details differ. When we fill the matrix with a scoring scheme that has negative penalties, a cell value can become negative, and there is an extra rule for that: any negative value is set to zero, and scoring carries on from there using the usual matrix-filling rule. The traceback also differs: it starts from the highest-scoring cell in the matrix and stops when it reaches a zero, instead of running from the bottom-right corner to the top-left. That is the idea, and this is how we solve global and local alignment problems.
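
The set-to-zero rule for local alignment can be sketched as a small change to the fill step. This is an illustrative sketch only: the function name is hypothetical, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen so that negative values can occur; they are not taken from the video.

```python
def smith_waterman_fill(seq_x, seq_y, match=1, mismatch=-1, gap=-1):
    """Fill a local-alignment matrix; return (matrix, best score, best cell)."""
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            score[i][j] = max(
                0,                           # negative values are reset to zero
                score[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                score[i - 1][j] + gap,       # up: gap
                score[i][j - 1] + gap,       # left: gap
            )
            # Traceback for local alignment would start at the best cell.
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos


matrix, best, best_pos = smith_waterman_fill("ACGT", "TTACGTTT")
print(best)      # 4: the shared region ACGT
print(best_pos)  # (row, column) of the cell where traceback would start
```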

So that is how to perform global alignment. There is one more term, the heuristic method. A heuristic method relies on some common words in the query sequence that you feed in, and that query is run against each database to perform the search. When we search with a query sequence, the whole sequence is not compared for a match at once. Instead, fragments of the query, the words, are used to find initial matches, and wherever a word matches, the method digs deeper with a larger, more thorough comparison. This way the process is faster. We use heuristic methods to speed up the search, but there is always the question of whether the quality of the work is compromised by going fast. So in heuristic methods we check two important parameters: the sensitivity of the test and the specificity of the test. Both are equally important; without knowing them we cannot comment on any technique.

Think of a library. Suppose I give you a stretch of a paragraph and ask which book it probably came from. You would check some words in the paragraph to work out which genre it belongs to, whether it is suspense, comedy, romance or devotional. Then you search only that section, and you get the result faster. That is the idea of the heuristic model. But as I told you, we cannot compromise on quality, and that is why two things are very necessary: sensitivity and specificity. Based on those, we have four different outcomes.

The four outcomes are these. First, true positive: the test is positive for a condition and the condition is really there, which is what we want. True homologs that test positive for homology are an example. Second, false positive: the test is positive, but the result is not actually true. Non-homologous sequences reported as homologs are of this type; the data suggest a positive, but the sequences do not actually satisfy the criteria. Third, true negative: the test is negative and the condition is really absent, so the negative result is correct. Fourth, false negative: the test is negative, but the result was supposed to be positive. These terms are not specific to bioinformatics; for any kind of experiment or any machine that we design, we always check sensitivity and specificity. So what are these terms?

Based on sensitivity and specificity we can state a probability. Specificity is the probability of correctly predicting a negative example, and sensitivity is the probability of correctly predicting a positive example. If anybody asks you what sensitivity and specificity mean in bioinformatics, that is the answer. As formulas, sensitivity equals true positives divided by true positives plus false negatives, and specificity equals true negatives divided by true negatives plus false positives. You can memorize these, or simply work out why they follow from the definitions we wrote; you can do that as homework.

They do not usually ask questions directly on sensitivity and specificity, but BLAST, the Basic Local Alignment Search Tool, cannot be discussed without mentioning them. Tools in bioinformatics are algorithm-based systems, and they give you a confidence value for the results they show. Our search results in BLAST, and in FASTA as well, are read on that basis. We will discuss BLAST and FASTA in detail in later bioinformatics lectures, but I believe you now have a clear idea of the sequence alignment process.

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.
