So, for today's session, what are the things we are going to discuss? First of all, we are going to discuss the different types of machine learning algorithms. Understand that the purpose of taking this session is to clear interviews. Once you go for data science interviews, the main purpose is to clear them, and I've seen that people who knew the machine learning algorithms in a proper way were definitely able to clear it, because they explained the algorithms in a better way to the recruiter and got hired.

First of all is the introduction to machine learning; here I'm specifically going to talk about AI versus ML versus DL versus data science. The second thing we are going to talk about is the difference between supervised ML and unsupervised ML. The third thing we are going to discuss is something called linear regression, where we will clearly understand the maths and the geometric intuition. The next thing we are going to discuss is R-squared and adjusted R-squared. The fifth topic is Ridge and Lasso regression.

So the first topic we are going to discuss is AI versus ML versus DL versus data science. If you really want to understand the difference between them, we will go in this specific format.
Just imagine the entire universe; this entire universe I will call AI. When I say AI, this means artificial intelligence. Whatever role you are in, whether you are working as a machine learning developer, a deep learning developer, a computer vision developer, a data scientist, or an AI engineer, at the end of the day you are actually creating AI applications. If I really want to define what artificial intelligence is, you can just say that it is a process wherein we create some kind of application which will be able to do its task without any human intervention. That basically means a person need not monitor this AI application; automatically it will be able to make decisions, perform its task, and do many other things. That is what an AI application is.

Some examples I would definitely like to consider. The first example is an AI module: Netflix has an AI module. Suppose you watch action movies for some time; then the kind of AI work implemented over here is something called recommendation. Through this application, when you are continuously watching action movies, the AI module present inside Netflix will make sure it gives us recommendations for action movies. Second, if I take the example of comedy movies: if I continuously watch comedy movies, then it will give us recommendations for comedy movies too. Through this, it understands your behavior and is able to do its task without asking you anything.

The second example I would like to take up is amazon.in. On amazon.in, if you buy an iPhone, it may recommend headphones to you. This kind of recommendation is also part of an AI module integrated with the amazon.in website. The ads that you see when you open my channel, through which I get paid a little bit for the hard work that I do on YouTube, are also driven by an AI engine included in YouTube itself. Understand, it is a business-driven thing that we basically do with the help of AI.

One more example I would like to give is self-driving cars. If you take the example of Tesla: in a self-driving car, based on the road, the car is able to drive automatically. Who is doing that? There is an AI application integrated with the car itself. So if I consider all these things, these are all AI applications. At the end of the day, whatever role you are in, you are going to create an AI application; this is the common point people miss. For example, our CEO Sudhanshu Kumar has written in his profile that he is an AI engineer; that basically means his goal is to create AI applications. Probably in product-based companies you'll be seeing this kind of role, called AI engineer.
Now let's go to the next term, which is called machine learning. Where does machine learning come into the picture? Machine learning is a subset of AI. And what is the role of machine learning? It provides statistical tools to analyze the data, visualize the data, and, apart from that, to do predictions and forecasting. You will be seeing a lot of machine learning algorithms, and internally the equations those algorithms use are a kind of statistical tool or technique, because whenever we work with data, statistics is definitely very important. This exactly is what is called machine learning. It is a subset of AI; this is very important to understand: ML is a subset of AI, so here you can see that it is a part of it.
Now let's go to the next one, which is called deep learning. Deep learning is again a subset of ML. Why did deep learning come into existence? Because in the 1950s and 60s, scientists thought: can we make machines learn the way we human beings learn? For that particular purpose, deep learning came into existence. Here the plan is to mimic the human brain; when I say mimicking the human brain, that means we are trying to mimic how the brain implements and learns things. For this you use something called multi-layered neural networks. So this is what deep learning is: it is a subset of machine learning, its main aim is to mimic the human brain, and for that we create multi-layered neural networks, which help you train the machines or applications we are trying to create. And deep learning has really done amazing work; with its help we are able to solve very complex use cases, which we will be discussing as we go ahead.

Now, if I come to data science: see, this is the thing, guys. If you want to call yourself a data scientist, tomorrow you may be given a business use case, and a situation may come where you have to solve that use case with the help of machine learning algorithms or deep learning algorithms; again, the final goal is to create an AI application. You cannot say, "I am a data scientist and I'll just work in machine learning," or "I'll only work in deep learning," or "I don't know how to analyze the data." No, you cannot do that. When I was working at Panasonic, I got various kinds of tasks: sometimes I was told to use Power BI to visualize and analyze the data, sometimes I was given a machine learning project, sometimes a deep learning project. So if I consider where the data scientist falls in this picture, it will be a part of everything.

So, if I talk about machine learning and deep learning with respect to any kind of problem statement that we solve, the majority of business use cases will fall into two sections: one is supervised machine learning and one is unsupervised machine learning. Most of the problems you are solving belong to these two types of machine learning, that is, supervised machine learning and unsupervised machine learning. If I talk about supervised machine learning, there are two major problem statements you are solving: one is the regression problem, and the other is something called the classification problem. And in the case of unsupervised machine learning, you are solving two different types of problems: one is clustering and one is dimensionality reduction. There is also one more type, called reinforcement learning; I will definitely talk about reinforcement learning, but not right now. Right now we are just focusing on all these things.
Now, understand what happens in supervised machine learning. Let's consider a data set. Here I have a data set with two features: age and weight. Let's say I have values like 24 with 62, 25 with 63, 21 with 72, and many more data points. Let's say my task is to take this particular data and create a model: first we train the model with this data, and then, whenever it takes a new age, it should be able to give us the output weight. This particular model is also called a hypothesis; I'll discuss that today when we discuss linear regression.

Now, what are the important components whenever we have this kind of problem statement? First of all, you need to understand there are two important things: one is independent features, and the other is something called dependent features. Let's discuss what an independent feature is. Independent features are, in this particular case, the inputs on which I am training; all those features become independent features, so here age is my independent feature. And whatever I am actually predicting (I know this is my output, this is what I have to make my model give as output), that is my dependent feature, which in this case is weight. Why do we specifically call it a dependent feature? Because it is completely dependent on the age value: whenever age increases or decreases, this value changes accordingly. That is why we speak of independent and dependent features whenever we are solving a problem. In the case of supervised machine learning, remember: there will be one dependent feature, and there can be any number of independent features.
number of independent features now let's
go ahead and let's discuss about
regression and classification what is
the difference between them now let
let's go ahead and let's discuss about
two things one
is let's say I want a regression problem
statement suppose I take the same
example as age and weight so I have
values like as discussed 24 72 23
71 uh 24 or 25
71.5 okay so this kind of data I have
see this is my output variable which is
my dependent feature now in this
particular dependent feature now
whenever I'm trying to find out the
output and in this particular output you
have a continuous variable when you have
a continuous variable then this becomes
a regression problem statement now one
example I would like to give suppose
this is my data set right this is my age
this is my weight suppose I am
populating this particular data set with
the help of scatter plot then in order
to basically solve this problem what
we'll do suppose if I take an example of
linear regression I will try to draw a
straight line and this particular line
is my equation which is called as yal mx
+ C and with the help of this particular
equation I will try to find out the
predicted points so this will be my
predicted point this will be my
predicted point this this any new points
that I see over here will basically be
my predicted point with respect to Y so
in this way we basically solve a
regression problem statement so this is
very much important to understand let's
go to the always understand in a
regression problem statement your output
will be a continuous variable the second
one is basically a classification
problem now in classification problem
suppose I have a data set let's say that
number of hours study number of study
hours number of play
hours so this is my independent feature
let's say a number of sleeping hours and
finally I have my output which will will
be pass or fail so in this I have all
this as my independent features and this
is my dependent feature so I will be
having some values like this and here
either you'll be pass or fail or pass or
fail now whenever you have in your
output fixed number of categories then
that becomes a classification problem
suppose it just has two outputs then it
becomes a binary classification if you
have more than two different categories
at that time it becomes a multiclass
classification so this is the difference
between regression problem statement and
the classification problem statement now
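The rule of thumb just described can be sketched as a tiny helper function (purely illustrative, not any library's API): a fixed set of categories means classification, and the count of distinct categories decides binary versus multiclass.

```python
def classification_kind(outputs):
    """Decide binary vs multiclass from the number of distinct categories."""
    categories = set(outputs)
    return "binary" if len(categories) == 2 else "multiclass"

print(classification_kind(["pass", "fail", "pass", "pass"]))  # binary
print(classification_kind(["cat", "dog", "bird"]))            # multiclass
```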
Now let's go ahead and discuss something called unsupervised machine learning, which is my second main topic over here. What exactly is unsupervised machine learning? Here there are two main problem statements that we solve: one is clustering and one is dimensionality reduction. Let's take the example of a specific data set with salary and age. In this scenario we don't have any output variable: no output variable, no dependent variable. So what kind of conclusions can we draw from this data set? With salary and age as my values, in this particular case I would like to do something called clustering.

Now, why is clustering used? Let's say I am going to do something called customer segmentation. What does customer segmentation do? Clustering basically means that, based on this data, I will try to find similar groups of people. Suppose this is one group, this is another group, and this is a third group; these groups are clusters, call them cluster 1, 2, 3. Each and every cluster will convey some information: one cluster may indicate people who are very young but earning an amazing salary; another may indicate people who are older and getting a good salary; another may indicate people from a middle-class background, where the salary is not increasing that much with age. So what are we doing here? Clustering: we are grouping them together. The main thing is grouping; this word is very important.

Now, why do we use this? Suppose my company launches a product and I want to target it only at rich people; say product one is for rich people and product two is for middle-class people. If I make these kinds of clusters, I will be able to target my ads only at those kinds of people: to the rich-people cluster or the middle-class cluster, I can direct that particular ad, product, or message. That is basically called ad targeting, and it uses something called customer segmentation, a very important example. And based on this customer segmentation, we can later apply a regression or classification kind of problem statement.
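As a sketch of the customer-segmentation idea, here is K-means (an algorithm the session lists later) applied to a few made-up age/salary rows using scikit-learn; the data and the choice of two clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [age, salary]
X = [[22, 25000], [25, 27000], [24, 26000],
     [48, 95000], [52, 99000], [50, 97000]]

# Group into 2 clusters (e.g. "middle class" vs "rich") purely by similarity;
# note there is no output column here, only the input features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # same label = same segment; which label is which is arbitrary
```

Once the segments exist, a campaign could be targeted at only the rows carrying one of the two labels.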
Now, coming to the second one after clustering, which is called dimensionality reduction. What are we focusing on in dimensionality reduction? Suppose we have 1000 features: can we reduce those features to lower dimensions? Let's say I want to convert those 1000 features to 100 features, a lower dimension. Can we do that? Yes, it is possible with the help of dimensionality reduction algorithms; there are algorithms like PCA, which I'll also try to cover as we go ahead. Understand: clustering is not a classification problem; clustering is a grouping algorithm. There is no output feature, no dependent variable in clustering, or rather in unsupervised ML generally. And yes, I will also try to cover LDA; we'll cover PCA and all as we go ahead.
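The 1000-features-to-100 idea can be sketched at a smaller scale with scikit-learn's PCA; here 5 hypothetical features are reduced to 2 (the data is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))        # 10 samples, 5 original features

X_reduced = PCA(n_components=2).fit_transform(X)  # project down to 2 dimensions
print(X_reduced.shape)              # (10, 2)
```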
So, with respect to supervised and unsupervised: the first thing we are going to cover is linear regression. The second algorithm, after linear regression, is Ridge and Lasso regression. Third is logistic regression. Fourth is decision tree; decision tree includes both classification and regression. Fifth is AdaBoost. Sixth is random forest. Seventh is gradient boosting. Eighth is XGBoost. Ninth is naive Bayes. Then, when we go to the unsupervised machine learning algorithms, the first algorithm we will do is K-means; then we also have DBSCAN; then we are also going to do hierarchical clustering; there is also something called K-nearest neighbors; fifth we'll look at PCA, then LDA. So we will try to cover many different things. Yes, SVM I have missed here; I'm going to include SVM, and KNN will also get covered, so I have that in my list. I may miss one or two, but we are going to cover everything.
So let's start our first algorithm, linear regression. The linear regression problem statement is very simple, guys. Suppose I have two features, X and Y; let's say X is nothing but age and Y is nothing but weight. Based on these two features, I have some data points present over here. In linear regression, what we try to do is create a model with the help of this training data set. What I am going to do is train a model, and this model is nothing but a kind of hypothesis, which takes a new age and gives the output weight; then, with the help of performance metrics, we try to verify whether this model is performing well or not. In short, what we are going to do in linear regression is find a best-fit line which will actually help us do the prediction. That basically means: if I get a new age over here, what should be my output with respect to Y? Whenever we draw a diagram like this, I can say that Y is a linear function of X. Now, understand how we are going to create this best-fit line; this is very important. Whenever we say linear regression, it basically means that we are going to create a linear line. You may be thinking, "Sir, why create a linear line, why not a non-linear line?" I'll discuss that as we go ahead and see other algorithms.
So, to begin with, consider the line that you see over here. This line can be written with multiple equations: some people write y = mx + c; some people write y = β0 + β1·x; some people write hθ(x) = θ0 + θ1·x. Many equations exist for this straight line, with many different kinds of notation. The first treatment of linear regression I learned was from Andrew Ng; I would definitely like to give him the entire credit, and based on his notation, whatever he has explained, I'll try to explain it over here. So the credit for this algorithm specifically goes to Andrew Ng. In order to create this straight line, I will use the equation hθ(x). This is the equation of a straight line, and I can write it many ways: y = mx + c, y = β0 + β1·x, or hθ(xᵢ) = θ0 + θ1·xᵢ, where xᵢ denotes a data point. Let's take this last equation for now, the one through which I have also studied, though I will definitely be adding some points that Andrew Ng may not have mentioned in his video; I'll try my level best, but obviously he is the best and I cannot even compare myself to him. So: hθ(x) = θ0 + θ1·x.
Now, let's understand what θ0 and θ1 are. As I said, suppose I have a problem statement over here: this is my X, this is my Y, and these are my data points. Now I am trying to create a best-fit line through them, and this best-fit line is given by the equation hθ(x) = θ0 + θ1·x. What does θ0 indicate? θ0 over here is something called the intercept. What exactly is the intercept? It means that when your x is zero, hθ(x) = θ0. So in this particular case, the intercept indicates at what point the line meets the y-axis: when x = 0, you'll see that the line intersects the y-axis, and whatever value that is, that is your intercept.

The second thing is θ1. What is θ1? It is nothing but the slope, or coefficient. What does it indicate? Say I move one unit along the x-axis; the corresponding movement along the y-axis is the slope. In other words, for one unit of movement along the x-axis, the slope tells you the amount of movement along the y-axis. So those are the two things, θ0 and θ1, and xᵢ is definitely your data points.

Now, our main aim is to create the best-fit line in such a way (I'll just show it to you; let's understand the aim of linear regression) that the distance between the data points that I have and the predicted points is very, very small. Suppose I am creating a best-fit line: with respect to a data point, the actual point was here, but my predicted point is this point on the line. If I sum up all those distances, the total should be minimal; only then will I be able to say that this is the best-fit line. I cannot just declare that a given line is exactly the best-fit line or not. How will I say it? When I calculate the difference between each actual point and the predicted point (the points on the line are my predicted points), my aim is that, if I sum up all those distances, the total should be minimal.
So what can I do for that? See, you may also be thinking, "Krish, why not just do one thing: if these are my data points, why not just play around and create multiple lines and compare?" What we could do is create multiple lines like this, and then whichever gives the minimal total, I go and select that one. But how many iterations will you do? How will you come to know that a particular line is the best line? For that specific purpose, we should start at one point and proceed towards finding the best-fit line: start at one point, and then move towards the best fit. For this particular purpose, we create something called a cost function.

I have already shown you my hypothesis function: my best-fit line equation is given as hθ(x) = θ0 + θ1·x. That is my hypothesis. Now, coming to the cost function, which is super, super important. Why is it so important? Remember the distances I mentioned: when I sum them up, the total should be minimal. If I really want to measure that distance, I need one more equation. How can I capture the gap between the predicted and the real point? I can write hθ(x) − y. What does hθ(x) − y mean? y is my real point, and hθ(x) gives my predicted point. And then I am going to square it, because I may get a negative value, and squaring takes care of that. Now, understand one more thing: I also need to do the summation from i = 1 to m, where m is the number of data points, because I need to calculate the distance for all the points, predicted versus real. After this, I also need to multiply by 1/(2m). Why? First of all, let me show why we divide by m: 1/m gives us the average of all the values that we have. The specific reason we additionally divide by 2 is for the derivation; it helps make our equation much simpler later on, when we are updating the weights. When I say weights, I mean updating θ0 and θ1; at that point of time, you'll see that when we take the derivative, this factor helps. I'm going to repeat it and write it down for you.
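The squared-error cost being described can be written directly in Python. This is a sketch following the session's notation; the data in the usage line is made up for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2),
    where h(x) = theta0 + theta1 * x is the hypothesis."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A line that passes through every point has zero cost;
# a line with the wrong slope has a positive cost
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
print(cost(0.0, 2.0, [1, 2, 3], [1, 2, 3]))  # non-zero: y = 2x misses the points
```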
Now, in order to find the best-fit line, I need to keep changing θ0 and θ1 until I get the best fit; unless and until I get the best-fit line, I need to keep updating θ0 and θ1. And if I need to keep updating θ0 and θ1, I require a cost function. What will this cost function do? I'll just tell you. The cost function over here I will write as J(θ0, θ1). What is the cost function here? The distance I told you about, between hθ(x) and y: if I do the summation of all of these, it needs to be minimal, it needs to be small, because with respect to each x point there is a corresponding y point. So I am going to use a cost function, and in this cost function my main aim is to write (hθ(xᵢ) − yᵢ)². Why do I say i? Because this runs from i = 1 to m, where m is the total number of points over here. Apart from this, I divide by 1/(2m). I'll tell you why. First of all, by dividing by m I get an average output, an average cost function, because I am iterating over m points. The reason I divide by 2 is that it helps us in the derivation. Why? Say I have x²; if I take the derivative of x² with respect to x, what do I get? I get 2x. That is the formula: the derivative of xⁿ is n·xⁿ⁻¹. That is why I put in the 1/2, so that when the 2 comes down, the two 2s cancel. I hope everybody is able to understand. So this is my cost function:

J(θ0, θ1) = (1/2m) · Σᵢ₌₁ᵐ (hθ(xᵢ) − yᵢ)²

What is this called? This entire equation is called the squared error function. Mathematical simplicity basically means that when we update θ0 and θ1, we take derivatives of the cost function, and that is why we set it up this way; the squaring is done so that we don't get any negative values. Squared error function. Now, towards what we need to solve: this is my cost function, and I need to minimize this value, that is, minimize (1/2m) Σᵢ₌₁ᵐ (hθ(xᵢ) − yᵢ)², by adjusting the parameters θ0 and θ1. This entire thing is nothing but J(θ0, θ1), and we really need to minimize it. That is our task. Now let's go ahead and compare two different things: one is the hypothesis, and one is the cost function. Okay, let's take an example.
Right now, my equation of the hypothesis is hθ(x) = θ0 + θ1·x. If θ0 is 0, what does this indicate? Can I say that the best-fit line passes through the origin, and the hypothesis is nothing but hθ(x) = θ1·x? Obviously, I can definitely say that. So my equation will be like this; for right now, let's consider that θ0 = 0. This is where we are: we have written the equation, the line passes through the origin, and hθ(x) = θ1·x is the equation I am actually getting.

Now let's take one example and try to solve it. I have hθ(x) = θ1·x as my new hypothesis, considering that the line passes through the origin. With respect to this, let's say I will create one line over here, and these are my data points as (x, y) pairs. Let's say I have three data points: (1, 1), (2, 2), and (3, 3). So (1, 1) is one data point, (2, 2) is another, and (3, 3) is the third; these are my data points from the data set that I have. Now, if I consider θ1 = 1, where do you think the straight line will pass? It will definitely pass through all the points, and each of these same points becomes a prediction point as well. When θ1 = 1 (θ1 is nothing but the slope), in this scenario the line passes through all the points. Now go ahead and calculate your J(θ).
points. Now go ahead and calculate your J of theta. What will J of theta 1 look like? Because theta 0 is 0, we can write it as 1/2m times the summation from i = 1 to 3 (there are three points, right) of (h theta of x i minus y i) squared. Now let's go ahead and compute. In this scenario, h theta of x 1 is 1 and y 1 is also 1, so the first term is (1 - 1) squared; plus, because we are doing a summation, the next point falls on (2, 2), so (2 - 2) squared; plus (3 - 3) squared. In total this becomes zero. So when theta 1 is 1, J of theta 1 is zero. So what is this J of theta 1? It
is the cost function so let me draw the
cost function graph over here. Let's say this axis is my theta 1, marked 0.5, 1, 1.5, 2, 2.5, and this axis is my J of theta 1, marked 0.5, 1, 1.5, 2, 2.5 as well. Right now my theta 1 is 1, and at that particular point J of theta 1 is zero, so this will be my first point. Guys, I have already discussed why the factor is 1/2m: the 2 is there to make the derivative calculation simpler, and the 1/m is there to average the summation that we are doing over here. Now let's
go ahead and let's take the second
scenario. In the second scenario, let's consider that my theta 1 is now 0.5. If my theta 1 is 0.5, then what are the points that I will get? For x = 1, 0.5 times 1 comes out as 0.5 over here; similarly, for x = 2, 0.5 times 2 is nothing but 1 over here; and for x = 3, 0.5 times 3 is 1.5, so the next point will come over here. Now when I create this line, here is my next line, which I will draw in green color. For this green line the slope has definitely decreased. So if I go ahead and calculate my J of theta, let's see what I'll get. J of theta 1 is again 1/2m times the summation from i = 1 to 3 of (h theta of x i minus y i) squared. Now let's do the summation: this is the predicted point and this is the real point, right? So the first term is (0.5 - 1) squared; how am I getting that? The real point is 1 and the predicted point is 0.5. The second term will be (1 - 2) squared, and finally (1.5 - 3) squared. So if I do this calculation: 1 divided by 2 times 3, which is 6, times (0.25 + 1 + 2.25), which comes out to approximately 0.58. So with theta 1 as 0.5 we are able to get about 0.58. Theta 1 is 0.5 over here and 0.58 will be coming somewhere here, so this is my next
point which will be again in green color
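The two cost values worked out above can be checked with a tiny sketch (assuming, as in the example, theta 0 = 0 and the toy data set (1, 1), (2, 2), (3, 3)):

```python
# Minimal sketch of the cost J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2),
# assuming theta0 = 0 and the toy data set {(1,1), (2,2), (3,3)}.
def cost(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(cost(1.0))  # slope 1 fits every point exactly, so the cost is 0.0
print(cost(0.5))  # (0.25 + 1 + 2.25) / 6, roughly 0.58
```

Running it reproduces J = 0 for slope 1 and approximately 0.58 for slope 0.5, the two points plotted on the cost graph.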
now let's go ahead and calculate the
third condition. In the third condition I'm going to set theta 1 to 0. At that point, just assume: what is 0 multiplied by x? It will obviously be zero, so I will be getting three predictions of zero, and my next line will be the x-axis itself, and these are all my points. Now if I go ahead and calculate J of theta 1 in this particular case, when my theta 1 is equal to 0: 1/2m times (0 - 1) squared plus (0 - 2) squared plus (0 - 3) squared. So this will become 1/6 times (1 + 4 + 9), which is 14/6, approximately 2.33. So with theta 1 as 0 we are getting about 2.33, and if I draw this, with respect to zero I'm plotting roughly 2.33. This is my point. Similarly, when I start constructing points with other values like theta 1 = 2, I may get some point over here.
So here, when I join these points together, you will see that I get this kind of curve. This curve is the cost function curve, J of theta 1 plotted against theta 1, and gradient descent is the algorithm that will play a very, very important role in making sure that you get the right theta 1 value, or the right slope value. Now which is the most suitable point? The most suitable point is to come over here, because this point is called the global minima. Because see, out of all these
three lines, which is the best fit line? This one is the best fit line, right. When I had this best fit line, the cost point that came over here was this one, and I want to come to this region because this is my global minima. When I am over here, the distance between the predicted and the real points is very, very small. So this specific point is called the global minima. But still I did not
discuss one thing. Krish, you have assumed theta 1 is 1, theta 1 is 0.5, theta 1 is 0; you're assuming many values, calculating each one, and tracing out this cost curve. But really, you should start at one point over here and then move towards the minimum. So how do you do that? How do I first come to a point and then move towards this global minima? For that we will be using a convergence algorithm, because once I start from one specific point, I just need to keep updating theta 1 instead of trying different theta 1 values. So for this we use something called a convergence
algorithm. So here the convergence algorithm basically says: repeat until convergence, which basically means I'm in a while loop, let's say, and here I'm going to update my theta value with this notation, which is continuous updation: theta j := theta j minus alpha (I'll talk about this alpha, don't worry) times the derivative with respect to theta j of J of theta 0, theta 1. So this should happen; that basically means after performing this particular operation repeatedly, we should be able to come to the global minima. And this specific symbol that you see is called a derivative. A derivative basically means I'm trying to find out the slope, so I can also call it the slope. This equation will definitely work,
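Written out cleanly, the update rule just described is:

```latex
% Gradient descent update rule, applied simultaneously for j = 0 and j = 1
\text{repeat until convergence:}\quad
\theta_j := \theta_j \;-\; \alpha \,\frac{\partial}{\partial \theta_j}\, J(\theta_0, \theta_1)
```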
guys trust me this will definitely work
why it will work I'll just draw it show
it to you let's say that this is my cost
function; let's say that I've got this cost curve, and let's say that my first
point is somewhere here but I have to
reach somewhere here right now when I
reach this this is my Theta 1 and this
is my J of theta 1 suppose I reach at
this specific point and I will also have
another cost curve which looks like this; let's say that initially I reach the point over here. How will we come to this global minima? By using this equation. I'll talk
about Alpha also don't worry now this is
also my Theta 1 this is also my J of
theta 1 now let's say suppose I came to
this particular point right after coming
to this particular point I will
basically apply this derivative on this
J of theta 1 okay now when I find out a
derivative that basically means we are
trying to find out the slope and in
order to find the slope we just create a
straight line like this, a tangent, which will look like this. So I'll try to create a slope like this. Now if you look at this, this is a positive slope. How do we indicate that? Understand: the right-hand side of this line is pointing in the upward direction, and that is the easiest way to find out whether it is a positive slope or a
negative slope now in this particular
case this is a positive slope now when I
get a positive slope that basically
means I will update my weights or Theta
1 as Theta 1 let's say I'm writing it
over here so I will just apply this
convergence algorithm see Theta
1 := theta 1 minus this learning rate, which is called alpha (this is my learning rate, I'll talk about it, don't worry), times this derivative
value in this particular case since I'm
having a positive slope I will be
getting a positive value let's say that
for this Theta value I got this slope
initially now I need to come to this
location so for that I have to reduce
Theta 1 so that I come to this main
point now here you can see that I am I
subtracting Theta 1 with something which
is a positive number
right this is a positive number so
definitely I know that after some n
number of iteration I will be able to
come to the global Minima similarly if I
take the right hand side and if I try to
draw the slope in this particular case
my slope will be
negative so similarly I can write the
equation as Theta
1 = to Theta 1 minus learning rate
multiplied by a negative number so minus
into minus will be positive right
suppose initially my theta 1 was here; now I'll keep
on updating the weight to come to this
Global Minima so minus into minus is
positive, so I will basically get theta 1 plus alpha times a positive number, because minus into minus is plus. So this will definitely work: we will be able to come to the global minima whether the slope is positive or negative. Now what is this learning rate? Based on the learning rate, by what speed should I come from this point to the global minima? Usually we select a learning rate like 0.01. If I select a small
number then it'll start taking small
small steps to move towards the optimal
minima. But if I take a huge alpha value, then what will happen? The updation of theta 1 will keep jumping here and there, and the situation will be that it never reaches the global minima. So it is a good decision to take a small alpha value. But it should also not be an extremely small value: if it becomes extremely small, it will take tiny steps and take forever to reach the global minima, which basically means my model will keep on training forever. So definitely this algorithm is going to work.
scenario one scenario will be that what
if my cost function has a local minima? Because here, if I come over here, this is a local minima. Suppose one of my points comes over here and I end up reaching this region; what will happen in this particular case? In this case you'll see that my equation will simply be theta 1 := theta 1 minus alpha times the slope, and at this local minima the slope is zero, so my theta 1 will stay equal to theta 1. Now you may be thinking: if this is the scenario, then we will be stuck in the local minima. But usually, with the cost function and the equation that we are using here, we do not get stuck in a local minima, because our cost curve in this particular scenario will always look like this, a convex curve. But yes, in deep learning, when we are learning about gradient descent and ANNs, at that point we have a lot of local minima, and because of that we have different gradient descent algorithms like RMSprop and the Adam optimizer, which will
solve that specific problem. I wanted to mention this point because tomorrow, if someone asks you an interview question like "do you see any local minima in linear regression?", you can just say that the cost function we use will definitely not give us a local minima, but in deep learning techniques like ANNs we have different kinds of optimizers which solve that particular problem. That is the answer you basically have to give.
now let me go ahead and write with
respect to the gradient descent
algorithm so here again I'm going to
write the gradient descent algorithm so
this will be my gradient descent
algorithm and remember guys gradient
descent is an amazing algorithm and you
you will definitely be using it so
please make sure that you know this
perfectly. Now, one question is: when will convergence stop? Convergence will stop when we come near this area where my J of theta is very, very small. Now I will repeat the gradient descent algorithm. What did I say? Repeat until convergence; we have written this algorithm here. Now let's apply it for theta 0 and theta 1, so I will write: theta j := theta j minus the learning rate times the derivative with respect to theta j of J of theta 0, theta 1, for j = 0 and j = 1. So this is my repeat-until-convergence step. Now we really need to work out what this derivative term is.
Now, if I really want to find out the derivative with respect to theta j of J of theta 0, theta 1, how do I write this? I can write it in an easy way. This will be the derivative with respect to theta j, and remember j will be 0 or 1, because we need to find it for both theta 0 and theta 1. What is J of theta 0, theta 1? Obviously my cost function, so I will write: 1/2m times the summation from i = 1 to m of (h theta of x i minus y i) squared. So if j is equal to 0, what happens? Here I specifically want the derivative with respect to theta 0 of J of theta 0, theta 1.
Now it's simple. Here what I will do is simply apply the derivative. See, guys, what this derivative does: consider something like 1/2m times x squared; if I differentiate, the 2 comes down, the 2 and 2 cancel, and I'm left with x/m. Similarly here, the square comes down via the chain rule and cancels the 1/2, so I'll have 1/m times the summation from i = 1 to m of (h theta of x i minus y i); note that the square is gone after differentiating. So this is my derivative with respect to theta 0. This
is what I got. Now the second case: when j is equal to 1, the derivative with respect to theta 1 of J of theta 0, theta 1. In this case, let me replace h theta of x with what it actually is: theta 0 plus theta 1 times x. When we differentiated with respect to theta 0, the chain rule just gave a factor of 1, because the derivative of theta 0 plus theta 1 times x with respect to theta 0 is 1. But with respect to theta 1, the derivative of theta 0 plus theta 1 times x is x, so the chain rule gives an extra factor of x i. So I will get 1/m times the summation from i = 1 to m of (h theta of x i minus y i) multiplied by x i. Again the square is gone: the square brought down a 2 that cancelled the 1/2. So this will now be my convergence algorithm. Let me write it down again: repeat until convergence, and finally your two updates
will be happening. One is theta 0: it will be theta 0 minus alpha (that is my learning rate) times 1/m times the summation from i = 1 to m of (h theta of x i minus y i). And similarly, if I want to update theta 1, it will be theta 1 minus alpha times 1/m times the summation from i = 1 to m of (h theta of x i minus y i) multiplied by x i. Alpha is your learning rate, guys; we have to initialize it with some small value like 0.01. And see where the extra x comes from: h theta of x is theta 0 plus theta 1 times x, and the derivative of theta 1 times x with respect to theta 1 is nothing but x, so that x comes over here.
let's discuss two important things. But first, similarly, note that you will have convex cost functions with more features too: if you have multiple features like x1, x2, x3, x4, then you will have a higher-dimensional bowl-shaped surface, and gradient descent on it is just like coming down a mountain. Now let's discuss two
performance metrics which is important
in this particular case one is R
square and adjusted R square
we usually use these performance metrics to verify how good our model is with respect to linear regression. So R square is a performance metric to check how good the specific model is, and it is given by the formula: R square = 1 minus (sum of residuals divided by total sum of squares). What is this sum of residuals? I can write it as the summation of (y i minus y i hat) squared, where y i hat is nothing but h theta of x i, the prediction. And the denominator is the summation of (y i minus y mean) squared, where y mean is the mean of y. That is the formula. I'll try to explain what this formula actually says. So
first things first, let's consider that this is my problem statement
that I'm trying to solve suppose these
are my data points and if I try to
create the best fit
line, this y i hat basically means this specific predicted point, and we are trying to find the difference between these things. Let's say these are my points; the points in green color are my predicted points, which I have denoted as y i hat. Always understand: the sum of residuals is nothing but the squared difference between this point and this point, this point and this point, and so on, and we are doing the summation of all of those. Now the next
point which is very much important here
is this y i minus y bar term. Y bar is nothing but the mean of y. If I calculate the mean of y, then I get a horizontal line that looks like this, and then I calculate the distance between each point and this mean line. The denominator will definitely be high, right? Obviously this value will be higher than the numerator, because the distances from the mean line will be larger than the residuals of a good fit. So we have 1 minus (a low value divided by a high value); low by high is a small number, and 1 minus a small number is a big number. So this basically shows that our model has fitted properly; we have got a very good R square. Now,
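The R square formula can be sketched in a few lines (a minimal version of what library functions such as scikit-learn's r2_score essentially compute):

```python
# R^2 = 1 - SS_res / SS_tot
#     = 1 - sum((y_i - y_hat_i)^2) / sum((y_i - y_mean)^2)
def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))      # perfect fit -> 1.0
print(r_squared([1, 2, 3], [0.5, 1, 1.5]))  # worse than the mean line -> negative
```

The second call shows the point discussed next: when the predictions are worse than simply predicting the mean, the ratio exceeds 1 and R square goes negative.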
tell me can I get this entire R square a
negative number let's say that in this
particular case I got 90% can I get this
R square as a negative number? There will be situations, guys: what if I create a fit line which looks like this? Then the numerator will be quite high; R square goes negative only when the sum of residuals is higher than the total sum of squares, that is, when the model is worse than just predicting the mean. But in the usual scenario it will not happen, because obviously we'll try to fit a line which is at least reasonable; we don't want to create a best fit line that is worse than the mean line. Now here you'll be able to see one amazing feature about R square, which
is this. Let's say one scenario: suppose I have a data set where my feature is the number of bedrooms and my target is the price of the house. Now if I solve this problem I'll definitely get an R square value; let's say my R square is 85%. Now what if I
add one more feature the one more
feature basically says that okay if I
add
location location of the house will be
definitely correlated with price so
there is a definite chance that the R
square value will increase let's say
that R square will become 90% if I
probably have this two specific feature
and obviously it is basically increasing
the R square because this is also
correlated to price
So see: in the first case I got my R square as 85%, and as soon as I added location I got 90%. Now let's say I add one more feature: which gender is going to stay in the house, male or female. You know that gender is in no way correlated to price, but even though I add this feature, there is a scenario where my R square will still increase, and it may become 91%, even though the feature is not at all important. The R square formula works in
such a way that if I keep on adding
features and that are not nowhere
correlated this is obviously nowhere
correlated this is not correlated with
price then also what it does is that it
is basically increasing my r² so this
specific thing should not happen whether
a male will stay or female will stay
that does not matter at all still when
you do the calculation the R square will
still increase. And this creates a problem for model selection: right now I have a model with 90%, and as soon as I see an R square of 91% because it is considering this gender feature, that model will be picked, because it appears to perform better. But this should not happen, because gender is not at all correlated; the 90% model should have been picked. So in order to prevent this situation, we basically use something called
adjusted R square now what is this
adjusted R square and how it will work
I'll also show it to you very very nice
concept of adjusted R square. So adjusted R square is given by the formula: adjusted R square = 1 minus (1 minus R square) times (n minus 1) divided by (n minus p minus 1), where n is the total number of samples and p is the number of features, also called
predictors. Suppose in the first scenario my number of predictors was two, and in the second scenario my number of predictors was three. With two predictors I got the R square as 90%; in this scenario, after all the calculation, my adjusted R square will be a little bit less, let's say 86%. Now when I use three predictors, where one feature like gender is nowhere related, the R square increases to 91%, but the adjusted R square will not increase; it will in turn decrease, say to 82%. How does that happen? I'll show you; I've just considered some illustrative values, 86 and 82.
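The formula can be sketched directly (the sample size n = 50 and the R square values below are illustrative assumptions, not numbers from the lecture):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding a useless predictor nudges R^2 up only slightly (0.90 -> 0.901, say),
# but the larger p in the denominator pulls adjusted R^2 down.
print(adjusted_r2(0.90, n=50, p=2))   # about 0.896
print(adjusted_r2(0.901, n=50, p=3))  # about 0.895, lower despite the higher R^2
```

This is exactly the behavior described: a tiny R square gain from an uncorrelated feature is not enough to offset the penalty from increasing p.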
here you can see that for R square there is an increase, while for adjusted R square there is a decrease. Now how is this happening? See this p value: if I put p = 3, then n minus p minus 1 becomes a smaller number. Let me talk through the equation. I hope everybody understood what R square is; n is the number of data points and p is the number of predictors. As p keeps increasing, n minus p minus 1 keeps decreasing. If the denominator keeps decreasing, then (n minus 1) divided by (n minus p minus 1) becomes a bigger number, and (1 minus R square) times a bigger number grows, so 1 minus that product decreases, unless the added feature raises R square enough to compensate. If my p value is two, the denominator n minus p minus 1 will be greater than when p is equal to
3 so with the help of P obviously R
square is there to support you. Whether the feature is correlated or not, always remember: when the added features are highly correlated with the target, your R square value will increase tremendously; if they are less correlated, there will only be a small increase. Now, when p goes from 2 to 3, n minus p minus 1 gets smaller, so (n minus 1) divided by (n minus p minus 1) gets bigger, and since the useless feature barely moved R square, the whole expression comes out lower. That basically means even though my adjusted R square was 86 with two predictors, after adding the uncorrelated feature I'm getting 82% because of this equation. I hope you are understanding this; it is a very,
very important property. A simple way to define it: as my p value (the number of predictors) keeps increasing, my R square gets adjusted downward, and the adjusted R square will always be less than the plain R square. There was one interview question asked to one of my students: between R square and adjusted R square, which will always be bigger? The student said R square, and then the interviewer asked him to explain why that happens. Now, the agenda: point one is Ridge and lasso
regression second is assumptions of
linear regression the third point that
we are probably going to discuss about
is logistic regression then the fourth
thing that we are going to discuss about
is something called as confusion
matrix, and the fifth thing that we are going to cover is practicals for linear, Ridge, lasso and logistic regression. So the first topic that we are going to discuss is Ridge and lasso
regression so let's understand about
Ridge and lasso regression if you
remember in our previous session what
all things we discussed linear
regression, and then we discussed the cost function, R square and adjusted R square, and gradient descent. The cost function was 1/2m times the summation from i = 1 to m of (h theta of x i minus y i) squared; this is the cost function we discussed yesterday, and it gives us the convex curve for J of theta 0, theta 1. Now let me give
you a scenario. Let's say that I just have two training points, which look like this. If I have these two specific points, I will try to create a best fit line, and the best fit line will pass exactly through both points. If I calculate the cost function, what will be the value of J of theta 0, theta 1? Let's say that in this particular case, since the line is passing through the origin, my theta 0 is zero. Since there is no difference between predictions and actuals, the cost will obviously become zero. Now understand: this data that you see is called training data; the two points I have plotted are specifically called training
data. Now what is the problem in this data? See, the line that is getting created through the hypothesis passes through every training point, and that is why the cost is zero; our main aim is to minimize the cost function, so that looks absolutely fine. The data this model is trained on is the training data. Now
just imagine that tomorrow new data points come in. If my new data point is here, and I want to predict for that point, let's say my predicted point lands over here: is the difference between the predicted and the real point quite huge? Yes or no? So this is basically
creating a condition which is called as
overfitting. That basically means that even though my model trained well on the training data, since every training point passes exactly through the best fit line, it causes something called overfitting. You really need to understand what overfitting is: overfitting basically means my model performs well with training data but fails to perform well with test data. What is the test data over here? The test data is these new points: the real answer was this point, but because my line is like this, I'm actually getting the predicted point over here, and this distance is quite huge. So whenever my model performs well with training data and fails to perform well with test data, we call that scenario overfitting. So
this scenario when the model performs
well with training data I have a
condition which is called as low bias
and when it fails to perform well with the test data, then it is called high variance. Very important; I will make everyone understand it one by one. If the model performs well with the training data, that is low bias, and whenever it fails to perform well with the
variance now similarly I may have
another scenario which is called as
underfitting so let's say that I have
something called as
underfitting. Now in underfitting, what is the scenario? The model fails to perform; it gives bad accuracy on both training and test data. Always
remember whenever I talk about bias then
you can understand that it is something
related to the training data whenever I
talk about test data at that point of
time you talk about variance and that
specifically whenever you talk about
variance that basically means we are
talking about the test data so for an
overfitting you will basically have low
bias and high variance low bias with
respect to the training data and high
variance with respect to the test data
now if the model accuracy is bad with
training data and the model accuracy is
also bad with test data in this scenario
we basically say it as underfitting so
these are the two conditions that are
with respect to underfitting that
basically means that both for the
training data also the model is giving
bad accuracy and again for the test data
also it is basically having a bad
accuracy so in this particular scenario
we can definitely say two things out of
underfitting one is high bias and high
variance so this is the condition with
respect to underfitting very super
important let me just explain you once
again suppose let's consider I have model one, model two and model three okay guys so suppose
let's say that I have my model my
training accuracy is let's say
90% And my let's say that my test
accuracy is 80% now in this particular
case let's say that my training accuracy
is
92% and my test accuracy is 91% and
let's say my model three is basically
having training accuracy as
70% and my test accuracy is 65% so if I
take this particular case it is
basically overfitting if I take this
particular thing this basically becomes
my generalized model and when I talk
about this this is my I'll just say that
okay I'll also put nice color so that uh
you'll be able to understand this this
becomes our generalized model and this
finally becomes our underfitting right, so here with my red color I will just mark it as underfitting
what are the main properties of this
overfitting as I said in this scenario
since it is performing well with the
training data so it will be low bias
High variance in this particular case it
will be low bias low variance and this
particular case it will be high bias and
high variance understand in this
terminology in this particular way
you'll be able to understand so why do
we require always a generalized model
because whenever our new data will
definitely come generalized model will
be able to give us very good output
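The three-model comparison above can be sketched as a small check; the thresholds here (85% as a "good" training accuracy, a 5-point train/test gap) are my own illustrative assumptions, not fixed rules:

```python
# Sketch: labelling the three models from the discussion by their
# train/test accuracy gap. The thresholds are illustrative assumptions.

def diagnose(train_acc, test_acc, good=0.85, max_gap=0.05):
    """Return a rough bias/variance diagnosis for one model."""
    if train_acc < good:                  # bad on training data -> high bias
        return "underfitting (high bias, high variance)"
    if train_acc - test_acc > max_gap:    # good on train, bad on test -> high variance
        return "overfitting (low bias, high variance)"
    return "generalized (low bias, low variance)"

# Model 1: 90% train / 80% test, Model 2: 92% / 91%, Model 3: 70% / 65%
print(diagnose(0.90, 0.80))  # overfitting (low bias, high variance)
print(diagnose(0.92, 0.91))  # generalized (low bias, low variance)
print(diagnose(0.70, 0.65))  # underfitting (high bias, high variance)
```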
let's go back to this particular example
here you'll be able to see this straight
line the red line that I have actually
created is basically overfitting so that
whenever I probably get the new points
which is having this real value and the
predicted points here you'll be able to
see the difference is quite huge so
because of this it will definitely be a
scenario of overfitting where it has low bias and high variance
so again let me go ahead and take this
example so this was my line which I have
actually drawn I had two points and when
I draw this line which was a best fit
line to which is passing through both
the points this scenario is basically
causing an overfitting problem and I've
also shown you my J of theta 1 will be
zero in this scenario since it is
passing exactly and the predicted point
is also over there now understand one
thing is that what can can we take out
from this what assumptions we can take
out from this definitely if I talk about
our cost function, our cost function here is nothing but 1 by 2m summation of i = 1 to m of h Theta of x of i minus y of i whole square
now let's consider that I am going to
use this H Theta X and I'm going to
basically write it as y hat okay let's
focus on this specific point so when I
take this I'm just going to focus on
this particular point so here I will
definitely write it as y hat of i minus y of i whole square, so this is
nothing but the difference between the
predicted value and the real value okay
this is what I'm actually trying to get
now in this scenario if I am adding this
values obviously I'm going to get the
value as zero now I have to make sure
that this value does not come to zero
because this is still over fitting so
that is where your Ridge regression will
come into picture Ridge and lasso will
come into picture now when I use Ridge
and lasso suppose if I use Ridge now in
Ridge what we say this this is also
called as L2
regularization now L2 regularization
what it does is that it basically adds a unique parameter, one more simple value, which is like Lambda multiplied by slope
Square now what is this slope whatever
slope of this particular line it is we
are just going to square it off now
suppose if I take my equation which
looks like this H Theta of X is equal to
Theta 0 + Theta 1 x now in this
particular case my Theta 0 was zero so
my H Theta of X is nothing but Theta 1
what is Theta 1 this is specifically
called as slope and I am basically
taking this Theta 1 I'm actually making
it as a square so always
understand I don't want to make this as
zero because if it becomes zero it may
lead to overfitting condition now what
will happen if I add this particular
equation if I add this particular
equation this will obviously come as
zero let's consider my Lambda value over
here my Lambda value is one I'll talk
about how do you set up Lambda value
okay let's consider that I'm
initializing it to one let's say my
Lambda value is 1 now what I will do is
that this l Lambda value is 1 Let's
consider our slope value initially is
two and because of this two I got this
best fit line I'm just going to consider
it so if I do the total sum over here, 0 plus 1 multiplied by 2 square, this value is 4 now the cost function will not stop over here because still it has to minimize, it has to reduce this value of 4, so what it will do, it will again change the Theta 1 value and let's say that my Theta 1 value has changed now
it got another best fit line which looks
something like this, this is my next
best fit line I'll talk about Lambda
Lambda is a hyper parameter guys what
exactly is Lambda I'll just talk about
it now when I basically change this line
now see why I'm getting this line let's
consider I have changed my Theta 1 value
since we need to minimize now when we
need to minimize what it will do we'll
again calculate the slope of this
particular line and then we will try to
create a new line. The earlier total was 0 plus 1 multiplied by 2 square, which is nothing but 4, so now my cost function will not stop
over here so we are going to still
reduce this now in order to reduce this
again Theta 1 value will get changed and
then we will get a next best fit line
for this point now what will happen in
this scenario once we have this best fit
line we will definitely get a kind of
small difference so now if I go ahead
and consider the new equation, my y hat of i minus y of i whole square plus Lambda multiplied by slope square, this value
will be a small value now because I have
some difference and then plus again 1
multiplied by now understand whether the
slope will increase in this particular
case or whether it will decrease in this
particular case there will be some slope
value let's say that I have got some
slope of this particular line in this
particular scenario again your slope
will definitely decrease so let's say, initially the slope was 2 and now my slope is a smaller value, let's say 1.5. So the penalty, 1 multiplied by 1.5 square, is 2.25, and 2.25 plus the small error value will still be less than 4. But understand what is happening: the total value is getting reduced from 4 to around 2.25, so this is the
importance of Ridge now what will happen
is that you will try to get a
generalized model which has low bias and
low variance instead of this overfitting
condition you know why specifically we
are adding Ridge L2 regularization it is
basically to prevent
overfitting because here you are not
stopping here you are trying to reduce
it unless and until you get a line which will be able to act as a generalized model now here you can see
now if I have my new points like how I
drew over here now the distance will be
less so now you'll be able to see that
it will be able to create a generalized
model guys this will be a small value
only see initially when we have this
line obviously we have zero if we try to
slightly move here and there so here
you'll be able to see that it will just
a slight movement but what this movement
is basically specifying it is specifying
that the slope should not be steep if we
probably have a steep slope it obviously
leads to most of the time overfitting
condition; it should not be steep, it should be less steep, but it should actually help you
to create a generalized model so you
will be seeing that after playing for
some amount of time this value will not
reduce after some point of time it'll
get almost it'll be a minimal value
it'll be a smaller value and for this
also you have to specify iterations how
many times you probably have to train
them now this iterations is also a
hyperparameter based on number of
iterations you will probably see your R
square or adjusted R square over here so
this iterations based on the number of
iterations it will never become zero
guys understand because zero it is not
possible if it becomes zero trust me it
is an overfitting model you cannot get
that is something zero now what is
Lambda coming to this Lambda this Lambda
is a
hyperparameter this is basically to
check how fast you want to lessen the
steepness or how fast you want to make a
steepness grow higher right and this
Lambda will also be selected by using
hyper parameter and this also I'll show
you today in Practical what do you mean
by iterations iteration basically means
how many time I want to change the Theta
1 value how many times you want to
change the Theta value that is the
convergence algorithm right
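The Lambda-times-slope-square walkthrough above can be sketched numerically; the two training points (lying exactly on y = 2x through the origin) and Lambda = 1 are assumptions matching the spoken example:

```python
# Sketch of the Ridge cost from the discussion: two training points
# perfectly fit by slope 2 through the origin, lambda = 1.
# The specific data points are an illustrative assumption.

def ridge_cost(theta1, xs, ys, lam=1.0):
    m = len(xs)
    mse = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return mse + lam * theta1 ** 2   # error term plus lambda * slope^2

xs, ys = [1.0, 2.0], [2.0, 4.0]      # perfectly fit by slope 2

print(ridge_cost(2.0, xs, ys))       # 0 error + 1 * 2^2 = 4.0 (the "4" above)
print(ridge_cost(1.5, xs, ys))       # small error + 1 * 1.5^2 = 2.5625, less than 4
```

So the penalty makes the perfectly-fitting steep slope more expensive than a slightly less steep one, which is exactly why the overfit line gets pulled back.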
convergence algorithm over here L2
regularization or Ridge is basically
used in such a way that you should never
overfit why we assume Theta 0 is equal
to 0 because I'm considering that it
passes through a origin right origin
over here Lambda is a hyper
parameter steep basically means how
steep the line is if I have this line
this line is quite steep if I have this
line This is less steep now if I go to
the next regularization which is called
as lasso regression this is
also called as L1
regularization now here the formula will
be changing a little bit, here you will be having y hat of i minus y of i whole square
here you'll be adding a parameter Lambda
but understand here you'll not be adding
slope square, no, here you'll be adding mod of slope, and what this mod of slope will do is that it will actually help you
to do feature selection now you may be thinking how feature selection happens; let's consider an equation over here
let's say that I have many many features okay so
my H Theta of X which I'm indicating
here as y hat let's say that I'm I'm
writing this equation apart from
preventing for overfitting it will also
help you to do feature selection here
let me just show you over here with an
example this H Theta of X which I'm
probably writing as y hat will basically
be indicated by something over here
you'll be able to see that it is nothing
but let's say that I have multiple
features like this now in this
particular features obviously there are
so many coefficients over here so many
slopes over here now mod of slope will be what, it will be nothing but mod of Theta 0 + Theta 1 + Theta 2 + Theta 3 + Theta 4 + Theta 5 like this up
to Theta n so here you'll be able to see
that this is how I will basically uh
I'll basically be calculating the slope
now as we go ahead guys whichever
features are probably not playing an
amazing role the Theta value the
coefficient value the slope value will
be very very small it is just like that
entire feature is neglected that entire
feature is neglected now in this
particular case we were doing squaring
because of the squaring that value was
also increasing but here because of the
mod that value will not increase
instead it will be a condition wherein
we are basically neglecting those
features that are not at all important
in this specific problem statement so
with the help of L1 regularization that
is lasso you are able to do two
important things one is preventing
overfitting and the second case is that
if you have many features and many of
the features are not that important okay
in basically finding out your slope or
your line or the best fit line in that
particular case it will also help you to
perform feature selection so this is the
importance of the entire thing; this is the importance of the Ridge and the lasso regression that we are doing here I'm
just going to write L1
regularization and obviously we have
discussed about L2 regularization also
now you have probably understood Lambda
is one hyperparameter okay which we will
specifically using okay and based on
this Lambda this will be found out
through cross
validation cross validation is a
technique wherein we will try to
probably train our model and try to find
out the specific things okay what should
be the exact value and there also we
play with multiple values in short what
we are doing we just trying to reduce
the cost function in such a way that uh
it will definitely never become zero but
it will basically reduce based on the
Lambda and the slope value in most of
the scenario if you ask me we should
definitely try both the regularization
and see that wherever the performance metric is good we should use that. What is cross validation? It basically means I will try to use different different Lambda values and basically use them
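As a sketch of picking Lambda through cross-validation, assuming scikit-learn is available (sklearn names the Lambda penalty `alpha`), with made-up data where the second feature is pure noise:

```python
# Sketch: lambda chosen by cross-validation, plus lasso's feature selection.
# Data is synthetic: feature 0 drives y, feature 1 is pure noise.
import numpy as np
from sklearn.linear_model import RidgeCV, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(100)

# Ridge: try several lambda (alpha) values, cross-validation picks one
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("chosen lambda:", ridge.alpha_)

# Lasso: the mod-of-slope penalty can push useless coefficients to (near) zero
lasso = Lasso(alpha=0.5).fit(X, y)
print("lasso coefficients:", lasso.coef_)  # second coefficient shrinks toward 0
```

The shrunken second coefficient is the "feature selection" effect discussed for lasso: the useless feature effectively drops out of the model.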
so in a short let me write it down again
for Ridge regression which is an L2 Norm
here I'm simply writing my cost function
in this particular case will be little
bit different here I can definitely
write my cost function as h Theta of x of i minus y of i whole square plus Lambda multiplied by slope square what is the purpose of this the
purpose is very simple here we are
preventing overfitting this was with
respect to the Ridge regression that is L2 norm now if I go ahead and discuss
about the next one which is called as
lasso regression which is also called as
L1 regularization in the case of lasso
regression your cost function will be h Theta of x of i minus y of i whole square plus Lambda multiplied by mod of slope so
here you have this specific thing and
what is the purpose the purpose are two
one is prevent overfitting and the
second one is something called as
feature selection so these two are the
outcomes of the entire thing see with
respect to this lasso right you have
slopes slopes here you'll be having
Theta 0 plus Theta 1 plus Theta 2 plus
theta 3 like this up to Theta n now when
you'll have this many number of thetas
when you have many number of features
and when you have many number of
features that basically means you'll
have multiple slopes right those
features that are not performing well or
that has no contribution in finding out
your output that coefficient value will
be almost nil right it will be very much
near to zero in short you are neglecting
that value by using modulus you're not
squaring them up you're not increasing
those values now I will continue and uh
probably I will also discuss about the
assumptions of linear regressions so
what are the assumptions of linear
regression in this particular scenario
so assumption is that number one point
linear regression if our features are in
normal or Gaussian distribution, if our features follow
this particular distribution it is
obviously good our model will get
trained well so there is one concept
which is called as feature
transformation now in feature transformation always understand what will happen: if a feature does not follow a Gaussian distribution, then we apply some kind of mathematical equation onto the data and try to convert it into a normal or Gaussian distribution the second
assumption that I would definitely like
to make is that we apply standard scaler or standardization; standardization is nothing but a kind of scaling of your data by using Z score I hope everybody
remembers Z score this is what we
basically apply there your mean is equal
to zero and standard deviation equal to
1 see guys wherever you have gradient
descent involved it is good to basically
do
standardization because if our initial
point is a small Point somewhere here
then reaching the global minima, our training, will happen quickly; otherwise if your values are quite huge then your graph may be very big and the starting point can come anywhere over
there and the third point is that this
linear regression works with respect to
linearity it works if your data is
linearly separable
I'll not say linearly separable but this
linearity will come into picture if your
data is too much linear it will
obviously be able to give a very good
answer like logistic regression also
which we are going to discuss today this
also has the same property now you may
be asking is it compulsory to do
standardization guys if you want to
speed up the training of your model or if you want to optimize your model I
would suggest go ahead and do
standardization now coming to the fourth
point here you really need to check about multicollinearity, this is also one kind of check we basically do. What is multicollinearity? Let's say I have X1, I have
X2 and this is my output feature I have
let's say X3 also now let's say that if
I try to see the collinearity of these two features, how correlated these two features are, let's say that these two features are 95% correlated with each other
but it is highly correlated with Y is it
necessary that we should use both the
feature in this particular scenario the
answer should be no we can drop this
particular feature okay, any one of the two features we can definitely drop, and
based on that I can just use one single
feature and basically we do the
prediction there is also a concept which
is called as variance inflation factor, I will try to make a dedicated video about this; multicollinearity is also solved with the help of the variance inflation factor
one more term is there, homoscedasticity, so that kind of terminology we also use as one more condition in this, but if you have almost satisfied these assumptions you will
definitely be able to outperform in
linear regression so you have got an
idea of the assumptions you have also
got an idea of multiple things okay now
let's go towards something called as
logistic regression now logistic
regression what logistic regression is
the first type of algorithm that we are
going to learn in classification let's
say that in classification I have one
example you know so suppose I have say
number of hours study hours and number
of play hours based on this I want to
predict whether a child is passing or
failing suppose these two are my
features I want to predict whether it is
pass or fail so here you'll be able to
see that I have some fixed number of
categories specifically in this
particular scenario I have two
categories binary logistic regression
works very well with binary
classification now the uh question comes
that can we solve multiclass
classification using logistic the answer
is simply yes you can definitely do it
so let's go ahead and let's try to
discuss about uh logistic regression now
what is the main purpose of the logistic
regression first of all let's
understand one scenario okay suppose I
have a feature which basically says um
number of study hours and this is like 1
2 3 4 5 6 7 and let's say that I have
pass this point is basically pass and
this point is basically
fail so I have this two conditions these
are my outcomes now what I'll do I will
just try to make some data points let's
say that if I study Less Than 3 hours I
will probably be fail if I study more
than 3 hours then probably I will pass
this I'll make it as fail and this I
will make it as pass so I will be having
points over here this 1 2 3 let's say
that this is my training data set now
the first question says that okay Chris
fine you have some data over here
whenever it is less than three, the person is failing; if it is greater than three, it is basically showing data points
with respect to pass now can't we solve
this problem first with linear
regression now with the help of linear
regression here the first point will be
that yes I can definitely draw a best
fit line my best fit line in this
particular scenario may be something
like this it may it may look something
like this so here fail is nothing but
zero pass is one the middle point is
basically 0.5 so obviously with the help
of linear
regression I'm able to create this best
fit line and I'll put a scenario that
whenever the output value is less than 0.5
let's say that new data point is this
and based on this I'll try to do the
prediction I'm actually able to get the
output over here now when I'm getting
the output over here this basically is
0.25 now in this particular scenario
obviously I'm able to say that yes the
person I'll write a condition over here
saying that if my H Theta of x value is
less than 0.5 then my output should be
zero let's say less than 0.5 I'll say
not less than or equal to less than5
then my output will be zero right so in
this particular case Zero basically
means fail similarly I'll have a
scenario where I'll say that if my H Theta of X is greater than or equal to 0.5 then this will basically be one
which is nothing but pass so this two
condition I can definitely write over
here this is my center point so that any
point that will probably come over here
let's say that this point is coming over
here right let's say new data point is
somewhere coming over here with this red
point
now what I'll do I'll basically draw a
straight line it will come over here I
will just extend this line
long I will extend this line over here
and I will extend this line over here
and here you can see that based on this
I'm actually getting this particular
prediction which is greater than 0.5 so
I will say that okay the person has
passed obviously this is fine, this is obviously working better, so what is
the problem why we are not using linear
regression okay in order to solve this
particular problem why you are
specifically having logistic regression
the answer is very much simple guys the
answer is that whenever let's say that
if I have an outlier which looks
something like this suppose I have an
outlier which comes like this over here
what is this value let's say that this
value is nothing but 7 8 9 10 let's say
that the number of study hours and I'm
studying for nine it is obviously pass
now in this particular scenario when I
have an outlier this entire line will
change now I will probably get my line
which looks something like this okay my
line will basically move like this, and now when it gets moved completely like this, now for even five
or even at any point that I am actually
predicting let's say that at this
particular point if I try to find out
it'll be showing less than 0.5 so
here this particular value or answer
will be wrong right because if we are
studying more than 5 hours obviously based on the previous line the person
had to pass but in this scenario it is
failing it is coming less than 0.5 but
the real value for this is basically
passed so I hope you are understanding
because of the outlier the entire line
is getting changed so how do we fix this
particular problem now in this two
scenarios are there first of all
obviously because of just an outlier
your entire line is getting shifted here
and there the second point is that over
here sometimes you're also getting values greater than one and you're also getting values less than zero suppose if I try to
calculate for this particular point if I
project it in behind I'll be getting
some negative value so we have to squash
this function if I squash this function
then it'll become a plain line right how
do we squash it and for this we use
something called as sigmoid activation
function or sigmoid function if somebody
ask you why don't you use linear regression in order to solve this
classification problem then your answer
should be very much simple you should
say this to specific points so we will
try to go ahead and solve some linear
regression now with the help of cost
function everything as such and we'll
try to understand how the cost function
will look for logistic regression second
reason I told you right, the line is going greater than one and below zero over here; I have only zero and one, and the line is going beyond them, but I have already told that our maximum and minimum values are 1 and zero so I hope
you have understood why linear regression
cannot be used okay I showed you all the
scenarios why linear regression should
not be used now we'll continue and
probably discuss about the other things
over here and uh we will now try to
understand fine what exactly logistic
regression is all about and how the
decision boundary is basically created
now we'll go ahead and discuss about
that specific thing so let's go ahead
our values should be always between 0 to
one over here in this particular case
because it is a binary classification
problem only this should be the answer
so let's go ahead and let's define our
decision boundary so my decision
boundary decision boundary in the case
of logistic regression first of all as
usual in logistic regression we defined
our hypothesis okay guys first of all
let's see if I'm writing my H Theta of X as Theta 0 + Theta 1 into X1 + Theta 2 into X2 like this up to Theta n into Xn
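The expanded hypothesis above can be sketched as a single dot product; the numbers here are made up purely for illustration:

```python
# Sketch: theta_0 + theta_1*x_1 + theta_2*x_2 as a dot product,
# with x_0 = 1 prepended so the intercept folds into the vector.
import numpy as np

theta = np.array([0.5, 2.0, -1.0])   # theta_0, theta_1, theta_2 (made-up values)
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1 for the intercept, then x_1, x_2

z = theta @ x                         # theta transpose x
print(z)                              # 0.5 + 2*3 - 1*4 = 2.5
```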
now in this scenario can I write this
entire equation as Theta transpose X
obviously I can definitely write this
way right and this is what is the
notation that you will probably seeing
in many places so with respect to the
decision boundary of logistic regression
we can write it like this, okay, but since we have to consider the squashing of the line, how will that squashing basically happen? See, if I
have this line
we saw in the above right if I have this
line suppose I have some data points
over here and I have some data points
over here if I want to create the best
fit line how will I create I will
basically create like this but I have to
also do two things one is squash over
here and squash over here right squash
over here and squash over here now in
order to squash I'm saying squash squash
means
okay now in order to do this I use a
function which is called as sigmoid
activation function
that basically means what happens
obviously you know this line is
basically denoted by H Theta of x equal
to how do you denote this straight line
let me write it down nicely for you so
how do you denote this straight line the
straight line is obviously denoted by
Theta 0 + Theta 1 * X1 let's say now on
top of this value I have to apply something so that I can squash this line instead of just expanding it in this way so my hypothesis
will basically be now G of G is
basically a function on Theta 0 and
Theta 1 * X1 so here I'm trying to
basically what I'm trying to do I will
apply a mathematical formula on top of
this linear regression to squash this
line now let's go ahead and let's try to
find out what is this G okay what is
this G I will say let Z equal to Theta 0
+ Theta 1 * X I'm just initializing this
now my H Theta of X is nothing but G of
Z now we need to understand what is this
z g of Z and how do we basically specify
what is the G function so my G function
is nothing but H Theta of x equal to 1
by 1 + e ^ of minus Z which in short if
I try to substitute Z, it is 1 by 1 + e ^ of minus (Theta 0 + Theta 1 * X) so
this is what is my H Theta of X which is
my hypothesis and this obviously works
well because it is being able to squash
the function so this is basically my
hypothesis which I am definitely trying
to use it and this function that you are
actually able to see is called as
sigmoid or logistic function now you
need to understand what does this
sigmoid function look like in graph in
graph it looks something like this: this is my Z value and this is my G of Z, this is my 0.5 line; your sigmoid function will have this S-shaped curve from zero up to one
now from this we can make a lot of
assumptions what are the assumptions
that we can basically make: your G of Z is greater than or equal to 0.5 when your Z value is greater than or equal to zero this is the major
assumptions that we can basically make
that is, your G of Z is greater than or equal to 0.5 whenever your Z is greater than or equal to zero, and if your Z value is less than zero, what will it become? It will basically be less than
0.5 so you can write that specific
condition also you want so this is the
most important condition
over here why it is called as logistic
regression see guys with the help of regression you are creating this straight line, and with the help of the concept of the sigmoid or logistic function you are able to squash it, so they have probably combined those names and basically written it this way. Will squashing of the best fit line help to
overcome the outlier issues yes
obviously it'll be able to help you so
let's go ahead and let's try to solve
the problem statement now usually let's
consider my training set let's consider
my training set suppose I have some
training points like this x of 1 comma y
of 1
let's say x of 2 comma y of 2, x of 3 comma y of 3, like this I have a lot of training points and finally x of n comma y of n
let's say that this is my training data
so here uh my y y will belong to what
zero or 1 because I will only have two
outputs since we are solving a binary
classification problem here is my
training set with two outputs and I hope
everybody knows about J Theta of Z
it is nothing but 1 + e ^ of minus Z
here your Z is nothing but Theta 0 +
Theta 1 * X1 so this is your Theta 0 now
what we have to do we have to select
this Theta now in this particular case
let's consider that my Theta 0 is 0
because it is passing through the origin
just for time pass sake suppose my Z is
Theta 1 into X so now I need to change
what is my parameter my parameter is
Theta 1
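A minimal sketch of this hypothesis with Theta 0 = 0, where theta1 = 1 is just an assumed value: the sigmoid squashes theta1 * x into (0, 1) and the 0.5 rule from above gives the class.

```python
# Sketch: h(x) = g(theta_1 * x) with g the sigmoid, theta_0 = 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta1=1.0):
    h = sigmoid(theta1 * x)        # squashed output, always between 0 and 1
    return 1 if h >= 0.5 else 0    # the 0.5 decision rule

print(sigmoid(0))     # 0.5 exactly when z = 0
print(predict(4))     # z > 0 -> g(z) >= 0.5 -> class 1 (pass)
print(predict(-4))    # z < 0 -> g(z) < 0.5 -> class 0 (fail)
```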
I have to change parameter Theta 1 in
such a way that I get the best fit line
and along that I apply this sigmoid
activation function now let's go ahead
and let's first of all Define our cost
function because for this we definitely
require our cost
function now everything will be same
obviously you know the cost function of
linear regression because the first best
fit line that you are probably creating
is with the help of linear
regression now in this particular case
in the case of linear regression so here
you can basically write J of Theta 1 is nothing but 1 by 2m summation of i = 1 to m, and here you have h Theta of x of i minus y of i whole square, so this is your entire thing, if you remember
linear regression whatever things we
have discussed yesterday okay so this is
the cost function let's consider that
for linear regression, this is for the linear regression now for the
logistic regression what will happen for
your logistic regression I will take the
same cost function H Theta of X now you
know what h Theta of X is, it is nothing but 1 by 1 + e ^ of minus (Theta 0 + Theta 1 multiplied by X) right this
is my with respect to logistic
regression this is my entire equation
now similarly I will try to only put
this H Theta of X let's consider that
this is my cost function only only my H
Theta of X is changing in this
particular case so if I go ahead and
write my cost function I can basically
say 1x2 h Theta of X of i - y of
i² and in this particular scenario what
is h Theta of X it is nothing but 1 + 1
+ e ^ minus Theta 1 x so this is what
this is getting replaced and this is my
logistic regression cost function I'm
just considering this cost function part
see if I replace this H Theta of X with
the sigmoid it becomes a logistic
regression cost function the intercept
I'm considering as zero guys but there is
one problem we cannot use this cost function there is a
reason for this because this equation
that you're seeing 1 / (1 + e^-(Theta 1 * X))
inside the squared error this is a non-convex
function now you may be considering what
is a non-convex function so let me write
it down so here this this term this
terminology right it is a non-convex
function now what is this non-convex
function let me show you and let me
differentiate it with convex function
okay we'll try to understand what is the
difference between non-convex function
and convex function this is related to
gradient descent very important this is
related to gradient descent if you
remember with the help of linear
regression whatever gradient descent curve we are
actually getting it is a convex function
like this this is the convex function
which looks like a parabola curve
Parabola curve because of this Parabola
curve whenever we use this linear
regression cost function specifically
because here my H Theta of X is what it
is nothing but Theta 0 + Theta 1 into X
because of this this equ
will always give you a parabola curve
this kind of cost function or convex
function you can say but here your s
Theta of X is changing so in the case of
if I use that cost function you will be
getting some curves which looks like
this now what is the problem with this
curve here you have lot of local Minima
if local Minima is there you will never
reach This Global Minima so that is the
reason we cannot use that cost function now
mathematically you can also go and
probably search in the Google what is
the
what is the graph or what is a convex or
non-convex function but always remember
whenever we update Theta 1 with this
particular equation by finding the slope
the update can get stuck because here you have a
lot of local minima and because of these
local minima you will never be able to
reach the global minima this is your
Global Minima right in case
of linear regression you'll
reach this global minima but in this
case you will never reach it you may get
stuck over here or over
here okay so this has a local minima
problem so how do we solve this
understand in local Minima these are my
points right I have to come over here
this is my deepest point in this
particular case I don't have any local
minima now at a local minima also you'll
get slope is equal to zero so that is the
reason your Theta 1 will never get
updated so in order to solve this
problem you can see this diagram we have
something called as logistic regression
cost function so I can now write my
logistic regression cost function in a
different way so researchers thought of it and basically
came up with this proposal that the
logistic cost function should look
something like this so the entire cost
function of logistic regression that is
specifically H Theta of X of I comma y
this should be written something like
this and it should be written like this
see here I'm just going to write cost
function of J of theta 1 let's say that
I'm writing J of theta 1 okay so J of
theta 1 what are the different different
output that I'll be getting I'll be get
I'll be getting yal 1 or y equal to 0 So
based on this two scenarios our cost
function will look something like this
minus log of H of theta of X and I know
I hope you all know what is h Theta of x
h Theta of X is nothing but
1 / (1 + e^-(Theta 1 * X)) so this is what is my H Theta
of X and whenever Y is zero then you
basically have minus log of (1 - H Theta
of X of i) okay so this is how you
basically write your cost function in
this particular scenario now with the
help of this cost function since log is
basically getting used in this scenario
you'll always get a global minima that
is the reason why they have completely
neglected the earlier cost function and utilized
this cost function now what does this
cost function basically mean two
scenarios if Y is equal to 1 Let's
consider this is my cost function
graph I have H Theta of X and you know
that H Theta of x value will be ranging
between 0 to 1 since it is a
classification problem so it will be
ranging between 0 to 1 and this is
basically of J of theta 1 which is my
cost function so if Y is equal to 1 this
specific equation will be used and
whenever this equation is basically
used you get a curve see minus
log H Theta of X of i you get a curve which
looks something like this okay which
you'll get a curve which looks like this
now what does this curve basically
specify the curve come up with two
assumptions the cost will be zero if Y
is = 1 and H Theta of x equal to 1 that
basically means when your H Theta of X is 1
and the output y is one that
basically means you're going to assign
over here one right so in this
particular case you will be seeing that
your cost function will be zero cost is
zero so here is my zero it is meeting
over here if H Theta of x equal to 1 and Y
is equal to 1 so this is again a
convex function only then the next point
that you can probably discuss over here
is with respect to Y is equal to 0 if
your Y is zero then what kind of curve you
will be getting you'll get a different
kind of curve which will look like this
H Theta of x here your value will be 0
to one and here you'll be having a curve
which looks like this so when you
combine these two you'll be able to see
that you are able to get a kind of
convex shape for gradient descent so this will definitely
help us to create a cost function so I
hope everybody is able to understand
till here with respect to this and this
will definitely work so finally I can
also write my cost function in a
different way the cost function that I
will probably write over here so this
will be my J of theta 1
so I can come up with a cost function
which looks like this
cost of (H Theta of X of i comma y) = minus
log of H Theta of x if Y is equal to
1 and then minus
log of (1 - H Theta of x) if Y is equal to
0 now I can combine this both and
probably write something like this
I can combine this both and I can
basically write cost of (H Theta of X of
i comma y) is equal to - y log H Theta of X of
i minus (1 - y) log of (1 - H Theta of X) so
this will be my final cost
function and here also you can see that
if I replace y with one then
what will remain only this particular
value will remain right when
Y is equal to 1 this term only will
come you see over here replace y with
one and then
you'll be able to see so here I can now
write if Y is equal to 1 my cost
function will look something like this
which is nothing
but see Y is 1 then what will happen my
log of H Theta of X of I will come and
this 1 - 1 is 0 so 0 multiplied by anything
will be 0 if Y is equal to 0 then what
will happen my cost function will be so
when it is zero this - y will
become 0 and 0 multiplied by anything is zero so
here you'll be able to see that
I'll be having minus log of (1 - H Theta of
X of i) so both the conditions have been
proved by this cost function
so this is my cost function yes cost
function and loss function with respect
to the number of parameters will be
almost same so finally if I try to write
J of theta because I have that 1 by 2m
also right so 1 by 2m also I have so what
I'm actually going to do here you will
be able to see that I can write J of
theta 1 is equal to 1 by 2 m summation
of i = 1 to m and then write down the
entire equation that you have probably
over here so here you have minus y or I
I'll just remove this minus and put it
over here and this will become plus
sorry y of I
* log H Theta of X of I 1 - y of i y
log 1 - H Theta of X of I so this
becomes my entire first function and
obviously you know what is h thet of x H
Theta of X of I is nothing but 1 + 1 e^
minus Theta 1 * X and finally my
convergence algorithm I have to repeat
this to update Theta 1 repeat until
convergence this updation that is Theta
J is equal to Theta J minus learning
rate * derivative of J of theta 1 with respect to Theta J
this is my repeat until convergence so this is my
cost function this is my repeat
algorithm and here I will be updating my
entire Theta
1 and this solves your problem with
respect to logistic regression
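The whole derivation above can be sketched in a few lines of NumPy. This is a minimal illustration (not the instructor's code), assuming the single-feature, zero-intercept setup used above, the conventional 1/m averaging in place of 1/(2m), and hypothetical toy data:

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)), squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta1, X, y):
    # log loss: -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) -- convex, one global minima
    h = sigmoid(theta1 * X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, lr=0.1, iters=500):
    theta1 = 0.0
    for _ in range(iters):
        h = sigmoid(theta1 * X)
        grad = np.mean((h - y) * X)   # derivative of the log loss w.r.t. theta1
        theta1 -= lr * grad           # repeat-until-convergence update
    return theta1

# toy data: negative x -> class 0, positive x -> class 1
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
theta1 = gradient_descent(X, y)
```

Because this log loss is convex in theta, the update never gets trapped in a local minima, which is exactly why the squared-error cost was discarded.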
simple questions may come like how it is
different from linear regression how it
is not different from linear regression
can we say log likelihood a topic from
probability yes this is log
likelihood now I will discuss about
performance metrics and this is specific
to classification problem and binary
classification I'm talking about let's
consider I have a data
set which has X1 X2 and this is y and
obviously in logistic uh classification
you have outputs like 0 1 0 1 1 0 1 and
your y hat y hat is basically the output
of the predicted model now in this
particular scenario my y hat will
probably be 1 1 0 1 1 1 0 so in this
particular scenario this is my predicted
output and this is my actual output so
can we come to some kind of conclusions
wherein probably we will be able to
identify what may be the accuracy of
this specific model with respect to these
many data points because the confusion
matrix is all about dealing with this
we will first of all have to create a
confusion matrix now for a binary
classification problem the confusion
Matrix will look like this so here you
have 1 0 1 0 Let's say that this is
prediction let's say that these are my
actual value and these are my prediction
value okay these both are prediction
value these are my output value when my
actual value is zero my predicted value
is one what does this mean
wrong prediction right so when my actual
value is zero my predicted value is 1 so
here my count will increase to one let's
go to the second scenario when the
actual value is one and my predicted
value is one that basically means one
and one so here I'm going to increase my
count similarly when my actual value is
zero my predicted value is zero so that
basically means when my actual value is zero
my predicted value is zero I'm going to
increase the count by one if I go over
here 1 one again it is so instead of
writing one now this will become two I'm
going to increase the count similarly
I'll go over here one more one is there
so I'm going to increase the count three
then I have 01 01 basically means when
my actual value is zero I'm actually
getting it as one so I'm also going to
increase this particular value as two
and then finally I have 1 and zero where
I'm going to increase like this now what
does this basically mean now what does
this basically mean see with respect to
this kind of predictions whenever we are
discussing this basically says
so these are my actual values and I have
1 and zero and this is my predicted
values I also have 1 and zero this value
when one and one are there this is
called as true positive this value when
0 and 0 are there this is called as
true negative whenever your actual
value is zero and you have predicted one
this becomes false positive and whenever
your actual value is one you have
predicted zero this becomes false
negative now coming to this I really
need to find out the accuracy of this
model now if I really want to find out
and this is what is called as confusion
Matrix now in this confusion Matrix if I
really want to find out the accuracy the
accuracy of this model it is very much
simple the diagonal elements that you are
able to see will basically give us the
right output so this and this if I add
it up it will give us the right output
so here I'm going to get TP + TN divided
by TP + FP + FN + TN so once I calculate
this so I have 3 + 1
/ (3 + 2 + 1 + 1) so this is nothing but 4
by 7 and what is 4 by 7 it is
0.57 so am I getting 57 percent
accuracy yes I'm actually getting 57%
accuracy over here with respect to the
accuracy so this is how we basically
calculate with respect to basic accuracy
with the help of uh the confusion Matrix
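The counting just walked through can be reproduced with a quick NumPy sketch over the same seven labels (a minimal illustration, not the instructor's notebook):

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0, 1])   # actual outputs from the example
y_pred = np.array([1, 1, 0, 1, 1, 1, 0])   # y hat, the model's predictions

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actual 1, predicted 1 -> 3
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actual 0, predicted 0 -> 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actual 0, predicted 1 -> 2
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actual 1, predicted 0 -> 1

accuracy = (tp + tn) / (tp + tn + fp + fn)       # (3 + 1) / 7, about 0.57
```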
okay so this is specifically called as
confusion Matrix now there are some more
things that you really need to specify
always remember our model aim should be
that we should try to reduce false
positive and false negative now let's
say that I want to discuss about two
topics what one is suppose in our data
set I have zeros and one category let's
say in my output if I say zeros are 900
and ones are 100 this becomes an
imbalanced data very clear right so this
become an imbalanced data set it is a
biased data suppose if I say zeros are
probably 600 and ones are probably 400 in this
particular scenario I will say that this
is balanced data because yes you have
fewer ones but it's okay it may not
impact many of the algorithms now see
guys for most of the algorithms that we will
be probably discussing if we
have an imbalanced data set it will
obviously affect the algorithms let me
talk about this let's say that I have
number of zeros as 900 and number of
ones is 100 now let's say that my model
I have created will directly predict
zero for all the
inputs that it is probably getting with
respect to this training data it'll just
output zero now in this particular
scenario what will be my accuracy my
accuracy will be 900 divided by 1,000
right so this is nothing but 90% so is
this a good
accuracy obviously it is a good accuracy
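That always-predict-zero model is easy to verify numerically (a hypothetical sketch matching the 900/100 split above):

```python
import numpy as np

# imbalanced labels: 900 zeros and 100 ones
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000, dtype=int)     # a "model" that outputs 0 for every input

accuracy = np.mean(y_true == y_pred)   # 900 / 1000 = 0.9, i.e. 90%
```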
but this is a biased data if my model is
basically just outputting 0 0 0 0
obviously most of the answers will be
zeros this will be a scenario where
it is just outputting one thing and then
also it is able to get 90% accuracy so
you should only not be dependent on
accuracy so there are lot of
terminologies that we will basically use
one terminology that we specifically use
is something called as Precision then
we'll also use recall what is precision
what is recall I'll write the formula
over here in Precision what do we need
to focus and then finally we will
discuss about F score so we have to use
different kinds of performance metrics sorry
different kinds of formulas whenever you
have an imbalanced data set you can also
do oversampling but again understand in
some of the
scenarios oversampling may work but we
have to focus on the type of performance
metrics that we are focusing on right
now I'll not say F1 score I'll say F
score the reason why I'm saying I'll
just let you know so let's talk about
recall recall formula is basically given
by true positive divided by true
positive plus false negative
Precision is given by true positive
divided by true positive plus false
positive and then I will probably
discuss about F score also or we
basically say F beta also now I'll just
draw this confusion Matrix again okay
which is having true positive true
negative so let me draw it over here so
this is my ones and zeros these are my
actual values and these are my predicted
values I have true positive I have true
negative false positive and false
negative now in this particular scenario
when I'm actually discussing understand
what is recall and what focus it is
basically given on so here whenever I
talk about recall recall basically says
TP divided by TP plus FN so I'm
actually focusing on this so what does
this basically say recall means out of
all the actual positive values how many
have been predicted correctly as positive
that is basically measured by
TP so this is what it
is basically saying and this scenario is
called as recall in this the false
negative is basically given more
priority and our focus should be that we
should try to reduce false positive
false negative sorry we should try to
reduce this now let's go ahead and let's
discuss about Precision in Precision
what we are doing we are basically
taking out of all the predicted values
out of all the predicted positive values
how many of them are actual true or
positive okay this is what Precision
basically means now suppose if I
consider spam classification suppose
this is my task tell me in this
particular case should we use Precision
or recall and one more use case I'm
saying that whether the person has
cancer or not in which case we have to
support recall and in which case we have
to go ahead with Precision has cancer or
not in spam what is important okay guys
the recall is also called as true
positive rate I can also say recall as
sensitivity so if I go with Spam
classification it should definitely go
with Precision why it should go with
Precision if I probably get a spam mail
the main aim should be that whenever I
get a spam mail it should be identified
as spam okay in that specific scenario
we should try to reduce the false positive
and in this scenario my false
positive talks about the spam
classification in a better way in
the case of cancer I should definitely
use recall let's focus on the
recall formula TP divided by TP plus FN if a
person has cancer actually he
has cancer it should be predicted as
one otherwise if we have FN the model is
basically predicting he does not have
cancer that is really a bad situation in
this case if a person does not have a
cancer and if the model
predicts okay fine he has cancer he
may go and further do the test and then
he'll come to know whether he has a
cancer or not but this scenario is very
dangerous if a person has a cancer but
he is being indicated that he does not
have that cancer
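Plugging the counts from the earlier seven-point confusion matrix into these formulas makes the trade-off concrete; the `f_beta` helper below follows the general F-beta formula the session defines shortly (a sketch, not the instructor's code):

```python
# counts from the earlier example: TP=3, TN=1, FP=2, FN=1
tp, tn, fp, fn = 3, 1, 2, 1

recall = tp / (tp + fn)     # 3/4 = 0.75: of all actual positives, how many were caught
precision = tp / (tp + fp)  # 3/5 = 0.60: of all predicted positives, how many were right

def f_beta(precision, recall, beta=1.0):
    # beta > 1 weights false negatives (recall) more, beta < 1 weights false positives (precision)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall)  # harmonic mean, 2PR / (P + R)
```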
so here false negative is given more
priority over here in the case of spam
classification false positive is given
more priority so this is something
important over here and you really need
to understand with respect to different
different problem statement let me give
you one more example tomorrow the stock
market is going to crash in this what we
need to focus on should we focus on
Precision or should we focus on recall
now here two things are there who is
solving what kind of problem see many
people will say recall or precision but
it depends on whose point
of view you are creating this model are
you creating this model for the industry
or are you creating this model for the
people for the people he should
definitely get identified that okay in
this particular scenario you need to
sell your stock because tomorrow stock
market is going to crash but for
companies this is very bad okay I hope
everybody is able to understand for
companies it is very very bad so in this
particular case sometime we need to
focus both on false positive and false
negative and again I'm telling you for
which problem statement you are solving
that indicates if you are solving for
people then they should be able to get
the notification saying that it is going
to crash if you're probably uh doing it
for companies at that time your
Precision recall may change but if I
consider for both the scenarios at that
point of time I will definitely use
something called as F score F score or
I'll also say it as F beta now how is
the F beta formula given I will talk about
it and here in the F score you have
three different formulas the first
Formula I will say basically as when
your beta value is 1 okay first of all
I'll just give a generic definition of f
score or F beta here you are basically going
to consider (1 + beta square) * precision
multiplied by recall divided by (beta square
* precision plus recall) whenever your
both false positive and false negative
are important we select beta as one so
if I select beta as 1 it becomes 1 + 1
which is 2 so this becomes 2 multiplied by
precision into recall divided by
precision plus recall so here you have
this is basically called as harmonic
mean probably you have
seen this kind of equation where you
have written 2xy / (x + y) the same type you
are able to see here this is called as
harmonic mean here the focus is on both
false positive and false negative let's
say that your false positive is more
important than false negative at that
point of time you will try to decrease
your beta
value let's say that I'm decreasing my
beta value to 0.5 then what will happen
(1 + 0.5 square)
and then you have P * R precision into
recall and here also you have 0.25 P + R
now in this particular scenario I'm
decreasing my Beta decreasing the beta
basically means that you are providing
more importance to false positive than
false negative and finally you'll be
able to see that if I consider beta
value as let me just say my notes if I
consider beta value as two that
basically means you are giving more
importance to false negative than false
positive so with this specific case you
can come up to a conclusion what value
you basically want to use now whenever I
use beta is equal to 1 it becomes F1
score if I use beta as 0.5 then this
basically becomes F0.5 score and with beta
as 2 this becomes your F2 score so based on which
is important okay which is important
whether your Precision or false positive
or false negative is important you can
consider those things F score will have
different values if you're using beta is
equal to 1 that basically means you are
giving importance to both precision and
recall if your false positive is more
important then at that point of time you
reduce beta value if false negative is
greater than false positive then your
beta value is increased
beta is a deciding parameter
to decide your F1 score or F2 score or
F0.5 score now first thing first what
is the agenda of today's session first
of all we will complete practicals for
all the algorithms that we have
discussed these all algorithms that we
have discussed we will cover the
practicals probably we will be doing
hyper parameter tuning everything the
second thing and again here we are going
to take just simple examples so yes uh
so today's session I said practicals
with simple examples where I'll probably
discuss about all the hyper parameter
tuning then the second one the second
algorithm that I'm going to discuss
about is something called as naive Bayes this
is a classification algorithm so we are
going to understand the intuition and
the third one that we are going to
probably discuss is the KNN algorithm so the KNN
algorithm is definitely there
so this is our today's plan I know I've
written very less but there is a lot of maths
involved in naive Bayes right we'll
understand probability theory again
over there there is something called as
Bayes theorem we'll try to understand and
then we'll try to solve a problem on
that so let's proceed and let's enjoy
today's session how do we enjoy first of
all we enjoy by creating a practical
problem so I am actually opening a
notebook file in front of you so here uh
we will try to solve it with the
help of linear regression ridge lasso
and try to solve some problems let's see
how much we will be able to solve it but
again the aim is that we learn in a
better way okay uh so that everybody
understands some basic basic things okay
so first of all as usual uh everybody
open your jupyter notebook file the
first algorithm that I'm going to
discuss about is something called as SK
learn linear regression so everybody I
hope everybody knows about this SK learn
let's see what all things are basically
there in this we will be using fit
intercept everything as such but here
the main aim is to find out the
coefficients which is basically
indicated by Theta 0 Theta 1 and all the
first thing we'll start with linear
regression and then we will go ahead and
discuss ridge and lasso I'm just going
to make this as
markdown there are many different libraries
for linear regression you can do it with
statsmodels you can do it with scipy you can do
it with many things okay so first thing
first let's first of all we require a
data set so for the data set what we are
going to do is that we are going to
basically take up some smaller smaller
data just let me do this so for this uh
we are going to take the house pricing
data set so we are going to solve house
pricing data set problem a simple data
set which is already present in SK learn
only now in order to import the data set
I will write a line of code which is
like from SK learn dot data sets data
sets
import load uncore Boston so we have
some Boston house pricing data set so
I'm just going to execute this I'm also
going to make a lot of cells so that I
don't have to again go ahead and create
all the cells again some basic libraries
that I probably want import numpy
as np
import pandas as pd okay
import seaborn as sns
and then I will also import
matplotlib.pyplot as plt and then
%matplotlib inline
and I will try to execute this
see this my typing speed has become a
little bit faster by writing by
executing this queries again and again
and uh let's go ahead uh so I have
imported all the necessary libraries
that is required which which will be
more than sufficient for you all to
start with now in order to load this
particular data set I will just use this
function called load_boston and
I'm going to just initialize this so if
you press shift tab you will be able to
see that return load and return the
Boston house prices data set it is a
regression problem it is saying and then
probably I'm just going to execute it
now once I execute it I will go and
probably see the type of DF so it is
basically saying sklearn.utils.Bunch now if
I go and probably execute DF you'll be
able to see that this will be in the
form of key value pairs okay like Target
is here data is here okay so data is
here Target is here and probably you'll
be able to find out feature names is
here so we definitely require feature
names we require our Target value and
our data value so we really need to
combine this specific thing in a proper
way in the form of a data frame so that
you will be able to see so what I'm
actually going to do over here I'm just
going to say PD do data frame I'll
convert this entirely into a data frame
and I will say DF do data see this is a
key value pair right so DF do data is
basically giving me all the features
value so if I write DF do data and just
execute it you'll be able to see that I
you will be able to get my entire data
set in this way my entire data set in
this way this is my feature one feature
two feature three feature four up to feature
13 I have 13 features over here and
based on that I have that specific value
now the next thing thing that I'm going
to do probably I should also be able to
add the target feature name over here so
what I will do I will just convert this
into DF and then I will also say DF do
columns and I'll set it to DF do Target
okay and let me change this to data set
so I'm going to change this to data set
and I'm going to say data set. columns
is equal to DF do Target so if I execute
this and now if I probably
print my data set do head you will be
able to see this specific thing okay it
is an error let's see expected axis has
13 element new values has
506 so Target okay I should not use
Target over here instead I had a column
which is called as features feature
names like if I go and probably see
DF DF over here you'll be able to see
there is one thing which is called as
feature names so I'm going to use DF do
feature names over here so here it is DF
do feature names I'm just going to paste
it over here and now if I go and write
here you can see print DF data set. head
if I go and execute without print you'll
be able to see my entire data set so
these are my features with respect to
different different things and this is
basically a house pricing data set so
initially I have these features CRIM ZN
INDUS CHAS NOX RM AGE DIS RAD TAX
PTRATIO B LSTAT so I have my
entire data set over here the same data
set I have basically put it over here
now here also you'll be able to see what
all this feature basically means this is
showing DIS the weighted distances to
five Boston employment centres RAD
basically means index of accessibility
to radial Highway tax basically means
full value property tax rate this much
PTRATIO basically means pupil teacher
ratio I don't know what the hell it
means but it's fine we have some kind of
data over here properly in front of you
so these are my independent features
what are these these all are my
independent features if you want the
features detail here you can see it
right everything what is CRIM this
basically means per capita crime rate by
town which is important ZN it is the
proportion of residential land zoned
for lots over 25,000 square feet so this
is my DF I did not do much I'm just
using data frame DF do data column
features name I'm getting this value
very much simple now let's go a little
bit slowly so that many people will be
able to also understand now this is my
data set. head now the thing is that I
obviously have taken all these
particular values but this is my
independent feature I still have my
dependent feature so what I'm actually
going to do I will create a new feature
which is like data set of price I'll
create my feature name price price of
the house and what I will assign this
particular value this value will be
assigned with this target this target
value this target value is basically the
sale the price of the houses right it is
again in the form of array so I'm going
to take this and put it as a dependent
feature so here you'll be able to see
that my price will be my dependent
feature so here I'll basically write DF
do Target so once I execute it and now
if I probably go and see my data set do
head you'll be able to see features over
here and one more feature is getting
added that is price now this price may
be the units may be in
millions somewhere in the description it
should have definitely said the units
probably in millions I cannot see it
right now but that is not a problem
mostly it'll be in millions
probably if I put more time I'll be able to
understand it okay so over here what is
the thing main thing this all are my
independent features and this is my
dependent feature right so if I'm trying
to solve linear regression I have to
divide my independent and dependent
features properly now let's go to the
next step that is dividing the data set
first of all I'll try to
divide it into independent and dependent
features so I want my entire features
data set divided into independent and
dependent features X I will be using as
my independent features so I will write
data set dot I will use iloc which is
present in data frames and understand
from which feature to which feature I
will be taking as my independent feature
to this feature till LSTAT so the best way
that basically means that I just need to
skip the last feature in order to skip
the last feature what I'm actually going
to do from all the columns I will just
skip the last column so this is how you
basically do an indexing with respect to
just skipping the last feature and this
will basically be my independent
features and here I will basically say Y
is equal to data set do iloc and here I
just want the last feature so I will
write colon all the records I want and
see the first term that we are probably
writing over here this basically
specifies with respect to records here
this specifies with respect to columns
from all the columns I'm taking the last
column here I will just take the last
column and this will basically be my
dependent features dependent features so
here I have basically executed now if
you can go and probably see x. head here
you'll be able to find all my
independent features in y do head you'll
be able to find the dependent feature
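the iloc based split just described can be sketched like this; the DataFrame below is a tiny made-up stand-in for the actual dataset in the notebook, but the indexing pattern is exactly the same:

```python
import pandas as pd

# tiny stand-in dataset: every column except the last is an independent
# feature, and the last column is the dependent feature
dataset = pd.DataFrame({
    "temp": [29, 31, 26, 25],
    "rh":   [57, 61, 82, 89],
    "ws":   [18, 13, 22, 13],
    "fwi":  [0.5, 0.4, 0.1, 0.0],   # dependent feature (last column)
})

X = dataset.iloc[:, :-1]   # all rows, skip the last column
y = dataset.iloc[:, -1]    # all rows, only the last column

print(X.head())
print(y.head())
```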
now let's go to the first algorithm that
is called as linear regression
always remember whenever I definitely
start with linear regression I'll
definitely not go directly with linear
regression instead what I will do is
that I'll try to go with Ridge
regression and uh lasso regression
because there you have a lot of options
with respect to hyperparameter tuning but I'll
just show you how linear regression is
done so basically you really need
to use a couple of libraries okay over here
and these
libraries you may need to install okay and
what are these libraries these are
basically the linear regression Library
so here I'm basically going to use two
specific thing one is linear regression
Library so I will just use from sklearn
dot linear_model import LinearRegression
do you need to remember this
the answer is no because I also
Google it and try to find out where in
sklearn it is present okay so here is
my linear regression so I will try to
initialize linear reg is equal to
initialize with linear regression and
then here what I'm actually going to do
I'm going to basically apply something
called as cross validation cross
validation is very much important
because in Cross validation we divide
our train and test data in such a way
that every combination of the train and
test data is basically taken
care of by the model and whichever accuracy
is better that entire thing is
basically combined so here what I'm
going to do I'm going to say mean square
error is equal to here I will import one
more library let's say from sklearn
dot model_selection I'm going to import
cross_val_score
so cross_val_score cross
validation score basically means it is
going to do a lot of train and test
split it's something like this one
example I will show it to you here only
so what does cross validation basically
do okay so in Cross validation what
happens what you do suppose this is your
entire data
set suppose this is 100 records if you
do five cross validation then in the
first this will be your test data and
remaining all will be your training data
if in the second cross validation this
will be your test data and remaining all
will be your training data like
this five times you'll be doing cross
validation by taking the different
combination of train and test but I'm
not going to discuss much about it in
the future if you want a separate
session I will include that in one of
the session itself so this was uh
basically the plan with respect to cross
validation or cross_val_score so here
I'm going to basically take
cross_val_score
and here the first parameter that
I give is my model so linear regression
is my model and here I will take X and Y
I'm not doing a train test split
specifically over here I'm giving the
entire X and Y and probably based on
that I'm going to do a cross validation
over here you can also do train test
split initially and then just give the X
train and Y train over here to do the
cross validation it is up to you but the
best practices will be that first you do
the train test split and then only give
the train data over here to do the cross
validation I'm just going to use scoring
is equal to you can use mean squared
error negative mean squared error let's
say that I'm going to use negative mean
squared error again where do you find all
these things you will be able to see in
the sklearn page of cross_val_score
and then finally in cross_val_score
you give the cross validation value as
5 10 whatever you want so after this
what I'm actually going to do I'm just
going to basically from this how many
scores I will get the mean squar error
will be five since I'm doing five cross
validation if you don't believe me just
see over here print msse so here you'll
be able to see five different values 1 2
3 4 5 right five different mean values
because we are doing cross five five
cross validation so here what I'm going
to write I'm just going to say np. mean
I want to take the average of all the
five so here will basically be my
mean_mse okay and then probably I
will print my mean_mse so this will
be my average score with respect to this
the negative value is there because we
have used negative mean squared error but if
you just consider mean squared error then
it is only 37.13 okay so this I have
actually shown you how to do cross
validation see with respect to linear
regression you can't modify much with
the parameter so that is the reason why
specifically in order to overcome
overfitting and do the feature selection
we use Ridge and lasso regression so here
I will show you how to do
ridge regression
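before moving on to Ridge, the LinearRegression plus cross_val_score flow described above can be sketched as follows; load_diabetes is only a stand-in here, since the notebook's own dataset isn't reproduced:

```python
import numpy as np
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

regression = LinearRegression()

# 5-fold cross validation; sklearn exposes MSE as "neg_mean_squared_error"
mse = cross_val_score(regression, X, y,
                      scoring="neg_mean_squared_error", cv=5)
print(mse)                 # five scores, one per fold
mean_mse = np.mean(mse)
print(mean_mse)            # average fold score (negative by convention)
```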
now in order to do the prediction
all you have to do is that just go over
here take the model okay what is the
model linear R and just say do
predict so here you can see uh you'll be
getting a function called as do predict
and give the test value whatever you
want to predict automatically the
prediction will be done so I'm just
going to remove this and focus on Ridge
regression right now because I I want to
show how hyperparameter tuning is done
in ridge regression so for ridge regression the
simple thing is that I'll be using two
different libraries from sklearn
dot linear_model I'm going to
import Ridge so for the ridge it is also
present in linear_model for
doing the hyperparameter tuning I will
be using from sklearn dot model_selection
and then I'm going to import
GridSearchCV so these are the two
libraries that I'm actually going to use
GridSearchCV will be able to help you out
with hyperparameter tuning and
then probably you'll be able to do
that uh the difference between MSE and
negative MSE is not a big thing guys if you
use MSE here mean squared error you'll be
getting 37 I've just used the negation of
MSE it's okay anything is fine you can
go with MSE also mean squared error
there is also another scoring
metric which focuses on
root mean squared error okay so there are
different different things which you can
basically focus on okay now in order to
give you this specific good value I'm
actually going to do hyperparameter tuning
now let's go ahead with GridSearchCV so
here what I'm going to do again I'm
going to basically Define my model which
will be
Ridge okay so this is what I have
actually imported now let me open the
sklearn Ridge documentation so sklearn
Ridge we need to understand what
all parameters are basically
used do you remember this Alpha value
guys do you remember this Alpha value
why do we use Alpha I I told you now
Alpha multiplied by slope square if you
remember in Ridge we specifically use
this right Ridge and lasso regression
Alpha so this is the alpha the this is
probably the best parameter we can
perform hyper parameter tuning the next
parameter that we can probably perform
is basically uh this Max iteration okay
Max iteration basically means how many
number of iteration how many number of
times we may probably change the Theta 1
value to get the right value so we can
do this so what I'm actually going to do
I'm going to select some Alpha values
I'm going to play with this apart from
that if I want I can also play with the
other parameters which are uh like kind
of uh you know probably you can you can
also play with the iteration parameter
it is up to you try whichever parameter
you want to change you can go ahead and
change it now let me show you how do we
write this and how do we make sure that
this specific thing is done now uh
before doing grid s CV uh let me do one
thing I will Define my parameters okay
so here is my Ridge now what I'm going
to do I'm going to say parameters and in
this parameter two important value that
I'm probably going to take is this one
that is my C value and I will try to
Define this in the form of dictionaries
so here the C value that I sorry not C
just a second
guys my mistake it is not C it is
Alpha let's see so how do I Define my
Alpha value we'll try to see so here the
parameters will be Alpha C is basically
for uh logistic regression I'll show you
so the alpha value I will just mention
some values like
1e-5 that basically means
0.00001 similarly I can write
1e-10 that again means a 1 after
ten zeros I'm just making fun
okay so that you will also get
entertained then 1e-8 okay
similarly I can write
1e-3 from this particular value now
I'm increasing this value see
1e-2 and then probably I can
have 1 5 10 20 something like this so
I'm going to play with all this
particular parameters for right now
because in GridSearchCV what they do is
that they take all the combinations of
this Alpha value and wherever your
model performs well it is
going to take that specific parameter
and it is going to give you that okay
this is the best fit parameter that
got selected so here I have got all
these things now what I'm going to do
I'm going to basically apply the
GridSearchCV so here I'm
saying ridge_regressor so I'm going to
use GridSearchCV
and here I'm basically going
to take the parameters okay Ridge
is my first model and then I will take
up all this params that I have actually
defined see in GridSearchCV if I press shift
tab I have to first of all execute this
then only I will be able to press shift
tab so here if I press shift tab here
you'll be able to see estimator and
parameter grid is my second parameter
then scoring and then all the other
parameters so here the first thing that
goes is your model then your parameters
which what you are actually playing then
the third parameter is basically your
scoring
scoring and again here I'm going to use
neg_mean_squared_error some people are
saying that mean squared error is not
present so that is the reason why
neg_mean_squared_error is used why it
may not be present because
they try to always create a generic
library probably this kind of scoring
parameter may also get used in other
algorithms so that is the reason they
may not have created it but if you want to
deep dive into it Google it
then I write ridge_regressor dot fit on
X comma y again I'm telling you you can
first of all do train test split on X
and Y and then probably only do this on
X train and Y train parameter is not oh
sorry
okay I get this okay parameter is not
and why it is not and oh yeah it has
become a
list I'm going to make this as
dictionary right now I'm fully focused
on implementing things if I get an error
I'll definitely make sure that it'll get
fixed anyhow if I get that error I will
not say oh Krish why did this error come
you know
why this error came I'll not get
worried I'll fix the error you
cannot give the params as a list okay so try
to understand okay so this is your
GridSearchCV I've also done the fit and let's go
and select the best parameter so what I
can do I will write print
ridge_regressor dot
params sorry there will be an attribute
called best_params_ I'm going to print
this and I'm going to print
ridge_regressor dot
best_score_
so these are all the values that
that got selected one is Alpha is equal
to 20 and the best score is -32 so
initially I got -37 but because of
Ridge regression you can see that our
negative mean squared error has
definitely become better there is a
minus sign don't worry but from 37 it
has come to 32 cross validation guys
over here inside grids s CV also when it
is probably taking the entire
combination over there the CV Value
Cross validation also we can use
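putting the GridSearchCV steps described above together, a minimal sketch with Ridge might look like this; the dataset is a stand-in and the exact alpha grid is illustrative:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

ridge = Ridge()
# candidate alpha values from very small to large, passed as a dict
# (a plain list would raise the error seen in the video)
params = {"alpha": [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20]}

ridge_regressor = GridSearchCV(ridge, param_grid=params,
                               scoring="neg_mean_squared_error", cv=5)
ridge_regressor.fit(X, y)

print(ridge_regressor.best_params_)   # alpha that performed best
print(ridge_regressor.best_score_)    # best (negative) mean squared error
```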
so probably if I am probably considering
all these
things many people have a question Krish
if this minus value increased does that
basically mean you cannot use Ridge
regression you are right in that
particular case Ridge regression is not
helping you out so guys let me again
write it down everybody don't worry yeah
sorry previously I got what -37
-37 now I got -32 so here you can see
this I got it from linear regression
this I got it from what Ridge which one
should I select I should select this
model only because it is performing well
than this but again understand Ridge
also tries to reduce the overfitting so
probably in this particular scenario we
cannot use Ridge because the performance
is becoming more bad so what I will do I
will go and try with lasso regression
now I'll copy and paste the same thing
so linear model import lasso then this
will basically be my
lasso let's see with lasso whether it
will increase or not let's
see this is my parameter that got
selected now let me write lasso
regressor
dot best_params_ so this is Alpha is
equal to 1 that got selected over here
I'm just going to print it okay and then
I'm going to print the lasso
regressor dot best_score_ so
here I'm actually getting -35 here
I'm actually getting -32 so minus 35
still I will focus on linear regression
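the Lasso run just compared is the same pattern with only the estimator swapped; a sketch under the same stand-in dataset assumption:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

lasso = Lasso()
# illustrative alpha grid (alpha = 0 is discouraged for Lasso)
params = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20]}

lasso_regressor = GridSearchCV(lasso, param_grid=params,
                               scoring="neg_mean_squared_error", cv=5)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```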
now see what will happen if I add more
parameters if I add more parameters see
what will happen so now I'm going to
take Alpha different different values
see this I'm just going to remove this
and probably add Alpha value in this
way see here I have added more values 5
10 20 30 35 40 45 100 okay let's see
whether our performance will increase
or not so here
uh first of all let me remove from here
in Ridge just take it down guys I'm
adding more parameters like this just
take it down yeah CV is equal to 5
okay you're not able to see it
CV is equal to 5 now here it is what
you can basically focus on so here you
can see I have added some values like
this you can also
add and just try to execute and now if I
go and probably see this is my see first
I have tried for Ridge I'm getting minus
29 do you see after adding more
parameters what happened in
Ridge you can see minus 29 and the
alpha value that is got selected is 100
if you want try with cross validation
10 and just try to execute now
now so these are some hyper
parameters that we will definitely play
with here you can see - 29 so here you
can see minus 29 you can also increase
the cross validation
value over here also and probably
execute it but with lasso I don't know
whether it is improving or not it is
coming to minus 34 you just have to play
with these parameters now for a bigger
problem statement the thing is not
limited to here right we try many
parameters and try to do these
things it is up to you we play with
multiple parameters and whichever gives us
the best result we are basically taking
it it's okay the error increased I know
that yes the error is increasing even
after trying with different
parameters but in most
scenarios see here I got -37 probably
what I can actually do is try to
get a better one with respect to this
now the best way what I can also do is
that I can basically take up train and
test split also and probably do these
things let's see let's see one example
so how do we do train and test from
sklearn dot model_selection
import train_test_split okay it's okay
guys you may get a different value okay
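a minimal sketch of the train test split flow being set up here, with the fit, predict and R2 evaluation at the end; the dataset is again a stand-in:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# 33% of the records go to the test set, 67% stay for training;
# random_state just makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

regression = LinearRegression()
regression.fit(X_train, y_train)

y_pred = regression.predict(X_test)
score = r2_score(y_test, y_pred)   # closer to 1 is better
print(score)
```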
let's do one thing okay let's make your
problem statement little bit simpler now
what I'm going to do just tell me in
train test split what we need to do so
I'm going to take the same code I'm
going to paste it over here or let me do
one thing let me insert a cell below and
let me do it for train test split so in
train test split what we can do so I'm
just going to take the syntax paste it
over here let's say that I'm taking X
train y train and then I'm using train
test split with 33% now if I execute
with respect to X train and Y train so
here is my you can see this I have
written this code from sklearn dot model
selection train_test_split random
State can be anything whatever you write
it is fine then you basically give X and
Y with test sizes 33 uh this is
basically saying that the test will have
33% and the train data will be 67% so
this is what I'm actually getting with
respect to X train and Y train here what
I'm going to do I'm going to basically
take X train comma y train and now if I
go and probably see this here you can
see minus 25 understand this value
should go towards zero if it is going
towards zero that basically means the
performance is better now similarly I do
it for Ridge in Ridge what I'm actually
going to do here I'm going to write X
train and Y train and if I go and
probably select the best score than this
here you'll be able to see I'm getting
how much I'm getting minus
2.47 okay here I'm getting
25.8 here 25.47 that basically means
now still the improvement is a little bit
bad because here we are not going
towards zero so the next part again here
also you can basically do it for X train
and Y train X train and Y train so here
you have this one and let's go and
execute this so here you can see minus
2.47 now what you can also do is that
you can use this
lasso regressor dot predict and you can
basically predict with respect to X test
so this is your y test value suppose
let's say that this is my y_pred
then what I can do from
sklearn I will be using R square and
adjusted R square if you remember
sklearn R square r² so this is my R2 score
so where is it present in sklearn
dot metrics so I'm going to write
let's say I'm saying from
sklearn dot metrics import r2_score now
what I'm going to do over here I'm
basically going to say my R2 score which
is my variable I'll say this is nothing
but R2 score here I'm just going to give
my y_pred comma y_test so if I go and
probably see the output here I will be
able to see print R2 score this is all I
have discussed guys there is also
an adjusted R square score where is R2
R2 score and adjusted r² okay R2 score
is there but adjusted R square should be
here somewhere in some manner so this is
how your output looks like with respect
to by using this lasso regressor okay
which is very good okay it should be I
told it should be near 100% right now
I'm getting 67% if I want to try with
the ridge you can also try that so you
can say ridge regressor dot predict and
here you can see 68% then you can also
try linear regressor and
predict what is the error saying the
regression is not fitted yet why is it
not fitted why is it not
fitted let's say that I fit it here
linear
regression dot fit on X train and Y
train X train comma y train so I'm
just going to fit it now if I go and
probably try to do the
calculation so if I go and see my R2
score it is also coming somewhere around
68% 67% now since this is just a linear
regression you won't be able to get 100%
because you're drawing a straight line
right so for that you basically have to
use other algorithms like XGBoost
naive Bayes and so on so many algorithms are
there it's okay see you give y test over
here y pred over here both are the same right
they're
comparing see up to a limit you can
increase the performance after
that you cannot see again I'm telling
you in linear regression what we do
these are my points right I will be only
able to create one best line I cannot
create a curve line right over here so
obviously my accuracy will be only
limited let's go and do the logistic
practical
quickly and here uh in logistic also we
can do GridSearchCV now what I'm actually
going to do first of all let's go ahead
with the data set so I will quickly
implement logistic so from sklearn dot
linear
model I'm going to import logistic
regression so I'm going to use logistic
regression and apart from that we know
that let's take a new data set because
for logistic we need to solve using
classification problem so this is
basically my logistic regression I'll
take one data set so from sklearn dot
datasets import we'll take a data set which
is the breast cancer data set so
that is also present in sklearn with
respect to the breast cancer data set
I'm just going to use this see load_breast_cancer
I'm loading it and all
the independent features are in data and
my columns are feature names the same
thing like how we did previously okay so
this will basically be my
complete uh complete independent feature
so if I go and probably see this x. head
here you'll be able to see that based on
this input features the independent
feature we need to determine whether the
person is having cancer or not these are
some of the features over here and this
is like many many features are actually
present so next thing that was my
independent feature now I'll take my
dependent feature the dependent feature will
already be present in DF target okay this
particular data set that we have taken
in DF in DF do Target we will basically
have all our dependent feature these are
my independent features so what I'm
actually going to do I'm going to create
Y and I'm going to say PD do data frame
and here I'm going to say DF do Target
Target and this column name should be
Target right so this will be my column
name and now if I go and see my y y is
basically having zeros and one in the
target feature now the next thing that
we are going to do is first of all we need
to check whether this data set
this particular y column is balanced or
imbalanced okay in order to do that I
will just write df
target if the data set is imbalanced
definitely we need to work on that and
try to perform upsampling so if I write
y target dot value_counts if I execute
this so here you'll be able to see that
value_counts will basically give
how many ones there are and how many
zeros there are so now the total number
of ones is 357 and the total number of
zeros is 212 so is this an imbalanced
data set probably this is a balanced
data set so here I'm actually going to
now do train test split train test split I
will try to do again train test split how
do we do we can quickly do copy the same
thing entirely I'll copy this entirely
over here and then I will get my X and Y
so here is my X train X test y train y
test so train test split obviously I'll
be doing it now in logistic regression
if I go and search for
logistic regression sklearn I will be
able to see this what all parameters are
there this is basically the L1 Norm or
L2 Norm or L1 regularization or L2
regularization with respect to whatever
things we have discussed in logistic and
then the C value these two parameter
values are very much important if I
probably show you over here the penalty
what kind of penalty whether you want to
add L2 penalty L1 penalty you can use L2
or L1 the next thing is C this is
nothing but inverse of regularization
strength this basically says 1 by Lambda
something like that this parameter is
also very much important guys class
weight suppose if your data set is not
balanced at that point of time you can
apply weights to your classes if
your data set is imbalanced you
can directly use class weight is equal
to balanced other than that you can use
whatever other weights you basically
want so these are specifically some of
the parameters right no this is not Ridge or lasso
okay this is logistic in logistic also
you have L1 norm and L2
Norms understand probably I missed that
particular part in the theory but here
also you have an L2 penalty norm and L1
penalty Norm I probably did not teach
you in theory because if you look see
logistic regression can be learned by
two different ways one is through
probabilistic method and one is through
geometric method if you go and probably
see my video that is present with
respect to logistic regression right now
in my YouTube channel there I have
explained you about this L1 and L2 Norms
also over there so in this also it is
basically present it is a kind of
penalty again just for uh using for this
kind of classification problem so what
I'm actually going to do let's go and
play with the parameters that I am
looking at so I will play with two
parameters one is params C value here
I'm defining 1 10 20 anything that you
can Define one set of values you can
Define and there was one more parameter
which is called as Max iteration this is
specifically for GridSearchCV okay that
I'm specifically going to apply so I
will just try to execute this this will
be my params now I'm going to quickly
Define my model one which will be my
logistic regression model so my logistic
regression here by default one value
I'll give for C and Max itra let's say
I'm giving this value later on what I
will do for this model I'll apply it to
grid sear CV so I'm just going to say
grid s CV and I'm going to apply it for
model one param grid is equal to params
this parameter that I'm specifically
trying to apply since this is a
classification problem and I am not
pretty sure that whether true positive
is important or true negative is
important I'm going to use F1 scoring
okay F1 scoring is basically again the
parametric term which we discussed
yesterday which is nothing but
performance metrics and then I'm going
to use CV is equal to 5 so this will be
entirely my model with respect to
GridSearchCV and I'll be executing this then I
will do model. fit on my X train and Y
train data so once I execute it here you
can see all the output along with
warnings a lot of warnings will be
coming I don't know because this many
parameters are there and finally you can
see that this has got selected now if
you really want to find out what is your
best param score model
dot best params so here you can see Max
iteration as
150 and what you can actually do with
respect to your best score model do best
score is 95 percentage but still we want
to test it with test data so can we do
it yes we can definitely do it I'll say
model do core or I'll say model dot
predict on my X test data and this will
basically be my y red so this will be my
y red all the Y prediction that I'm
actually getting so if you go and see y
red so these are my ones and zeros with
respect to the Y
prediction and finally after getting the
prediction values I can apply a confusion
matrix I hope I have taught you about
the confusion matrix so from sklearn dot
confusion matrix sorry sklearn dot metrics
I'm going to import confusion_matrix
classification report and the next thing
that I would like to do is these two I
will try to import confusion_matrix and
classification_report now if you want to
see the confusion matrix with respect to
your predictions I can just write
y_pred or y_test whatever you want
go ahead with it and this is basically
my confusion Matrix if I put this
forward no difference will be there only
this thing will be moving that also I
showed you 63 118 3 and 4 now finally if
I want the accuracy score I can also
import accuracy score over here so here
you can see accuracy score is imported I
can also find out my accuracy score
which is my the total accuracy with
respect to this we can give y test and
y pred which we have discussed
yesterday this is giving
96% if you want detailed Precision
recall all the score then at that point
of time I can use this classification
report and here I can give y test
and y pred here is what I'm actually
getting so here you can see with respect
to F1 F1 score Precision recall since
this is a balanced data set obviously
the performance will be best yes you can
also use ROC see I'll also show you how
to use ROC and probably you'll be able
to see this you have to probably
calculate the false positive rate and true
positive rate but don't worry about ROC
I will first of all explain you the
theoretical part now let's go ahead and
discuss about naive Bayes naive Bayes is an
important algorithm so here I'm just
going to go ahead so now let's go ahead
and discuss about naive Bayes and here we
are going to discuss the intuition
so naive Bayes is another amazing
algorithm which is specifically used for
classification and this specifically
works on something called Bayes
theorem now what exactly is Bayes theorem
first of all we need to understand
Bayes theorem let's say guys I have
Bayes theorem let's say that I have an
experiment which is called rolling a
dice now in rolling a dice how many number
of elements do I have so if I say what
is the probability of 1 then obviously
you'll be saying 1/6 if I say
probability of 2 then also here you'll
say 1/6 if I say probability of 3
then I will definitely say it is 1/6 so
here you know that this kind of events
are basically called as independent
events now rolling a dice why it is
called as an independent event because
getting one or two in every experiment
one is not dependent on two two is not
dependent on three so they are all
independent that is the reason why we
specifically say is an independent event
but if I take an example of dependent
events let's consider that I have a bag
of marbles okay in this marble I
basically have three red marbles and I
have two green marbles now tell me what
is the probability of suppose I have a
event in the first event I take out a
red marble so what is the probability of
taking out a red marble so here you can
definitely say that it is
3/5 okay so this is my first event now
in the second event let's say that in
this you have taken out the red marble
now what about the second time again
you are taking out a second red marble
or forget about the second red marble now
you want to take out the green marble
now what is the probability with respect
to taking out a green marble so here
you'll be definitely saying that okay
one red marble has been removed then the
total number of marbles that are left
are four so here you can definitely
write that probability of getting a
green marble is nothing but 2/4 which is
nothing but 1/2 so here what is
happening in the first event you took
out the first marble
from the first event you took
out a red marble from the second event you
took out a green marble
these two are dependent events because
the number of marbles are getting
reduced as you take out from them so if
I tell you what is the probability of
taking out a red marble and then a green
marble the formula
will be very simple right which we
have already discussed in stats it is
nothing but probability
of red multiplied by probability of
green given red so this specific thing
is called as conditional probability
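written out, the marble computation just described is:

```latex
P(R) = \tfrac{3}{5}, \qquad
P(G \mid R) = \tfrac{2}{4} = \tfrac{1}{2}, \qquad
P(R \cap G) = P(R)\,P(G \mid R) = \tfrac{3}{5}\cdot\tfrac{1}{2} = \tfrac{3}{10}
```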
here understand what is happening
probability of the green marble given the
red marble event has occurred here both
the events are dependent now let me
write it down very nicely so I can write
probability of A and B is equal to
probability of A multiplied by probability
of B given A let's
go and derive something can I write
probability of A and B is equal to
probability of b and a so answer is yes
we can definitely say we can definitely
say if you go and do the calculation
you'll be able to get the answer you
should not say no now what is the
formula for probability of A and B so
here you can basically write probability
of a multiplied by probability of B
given a if I take out probability of
green what is probability of green in
this particular case 2/5 what is
probability of red given green 3/4 right now
let's consider this now this part I can
definitely write as probability of B
multiplied by probability of
A given B so I can definitely write this
much with respect to all this
information now can I derive probability
of A given B is equal to probability of A
multiplied by
probability of B given A divided by
probability of B and this is
specifically called Bayes theorem and
this is the crux behind naive Bayes
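written out cleanly, the derivation on the board is:

```latex
P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```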
understand this is the Crux behind the
Bayes theorem now let's go ahead and
let's discuss about how we are using
this to solve let's take some examples
and probably make you understand let's
say that I have some features like X1 X2
X3 X4 X5 like this till xn and I have my
output y so these are my independent
features these all are my independent
features these all are my independent
features so here I'm going to write
independent features and this is my
output feature which is also my
dependent feature now what is happening
if I say probability of b or a what does
this basically mean I need to really
find what is the probability of Y and
you know that guys I will have some
values over here and basically I'll have
some output value over here so based on
this input values I need to predict what
is the output initially on a training
data set I will have your input and then
your output initially my model will get
trained on this now let's consider what
this entire terminology is I will try to
write in terms of this equation so I
will say probability of Y given x1a x2a
X3 up till xn then this equation will
become probability of Y see probability
of Y given X X1 X2 X3 xn this a is
nothing but X1 X2 X3 xn and I'm trying
to find out what is the probability of Y
and then I will write probability of b b
is nothing but y but before that what
I'll write probability of a / B right a
given b or probability of B probability
of B is nothing but y multiplied by
probability of a given B probability of
a given B basically means probability of
x1a X2 comma xn and given b b is given
right so I'm able to find this entire
value now just a second I made some
mistakes I guess now it is correct sorry
I I just missed one term that is this
given y this is how it will become and
this will be equal to probability of a
that is X1 comma X2 like this up to XL
so probability of Y multiplied by
probability of a given y now if I try to
expand this then this will basically
become something like this see
probability of Y multiplied by
probability of X1 given yes a given y
sorry given y multiplied by probability
of X2 given y probability of x3 given Y
and like this it will be probability of
xn given y so this will also be y1 Y2 Y3
YN this I can expand it like this and
then this will basically become
probability of X Y 1 multiplied by
probability of X2 multiplied by
probability of x3 like this up to
probability of xn so this is with
respect to all the probability y will be
different see here for this particular
record y will be different for this y
will be different for this y will be
different but why output it may be yes
or no right it may be yes or no okay I
I'll solve a problem it will make
everything understand and this will
probably be probability of Y it can be
binary multiclass whatever things you
want I'll solve a problem in front of
you now let's say that I have my y as
Let's say that in one of my datasets I have features X1, X2, X3, X4, and my y takes the values yes or no. How will I write this? We really need to understand this. I will basically ask: what is the probability of y = yes given x(i), where x(i) is my i-th record — x(i) basically means the values of X1, X2, X3, X4 for that record. So what kind of equation will you write? Using the expansion above:

P(yes | x(i)) = P(yes) × P(X1 | yes) × P(X2 | yes) × P(X3 | yes) × P(X4 | yes) / [P(X1) × P(X2) × P(X3) × P(X4)]

y itself is fixed — it may be yes or it may be no — but with respect to different records this value may change.
Similarly, if I write the probability of y = no given x(i), what will it be? It will be

P(no | x(i)) = P(no) × P(X1 | no) × P(X2 | no) × P(X3 | no) × P(X4 | no) / [P(X1) × P(X2) × P(X3) × P(X4)]

because for any input x(i) I may get either yes or no, so I need to find both probabilities. Both formulas are written here: the probability with respect to yes and the probability with respect to no. Now, one common thing you see is that the denominator is fixed — it is definitely the same for both of them and is not going to change — so I can consider it a constant, and I can definitely ignore it. From here on I'll just use the numerator of this formula to compare the probabilities.
Now let's say that for a specific record x(i) my first score, for yes, comes out as 0.13, and similarly the score for no given x(i) comes out as 0.05. You know that in a binary classification any value greater than or equal to 0.5 we consider as 1, and less than 0.5 as 0 — but here I am getting values like 0.13 and 0.05. So we do something called normalization. If I really want the probability of yes given x(i), normalization gives 0.13 / (0.13 + 0.05) ≈ 0.72, which is nothing but 72%. Similarly, for the probability of no given x(i), it will be 1 − 0.72 = 0.28, the remaining answer, which is nothing but 28%. That is your final answer — these formulas you have to remember. Now we'll solve a problem.
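That normalization step is nothing more than dividing each unnormalized score by their sum. A quick sketch with the 0.13 and 0.05 scores from above:

```python
# Unnormalised Naive Bayes scores from the example above
score_yes = 0.13
score_no = 0.05

# Normalise so the two posteriors sum to 1
p_yes = score_yes / (score_yes + score_no)   # about 0.72
p_no = 1 - p_yes                             # about 0.28
```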
Let's solve a problem — this will be a very, very interesting one. Let's say I have a dataset with the features day, outlook, temperature, humidity and wind; these are my input (independent) features, and play tennis is my output feature, which is specifically a binary classification (yes or no).

Now in this dataset I want to take out some information. What I'm actually going to do is take my Outlook feature and, based on it, create a smaller table that will give some information. First of all, find out how many categories there are in Outlook: one is sunny, one is overcast and one is rain — three categories. So I'm going to write them down: sunny, overcast, rain. For each of these I will record how many yes there are, how many no, and then the probability of yes and the probability of no.
now the next thing that we need to find
out is that with respect to Sunny how
many of them are yes see yes we have so
when we have sunny over here the answer
is no so I will increase the count over
here one then again I have sunny again
answer is no so I'm going to increase
the count to two with this sunny this is
basically no okay so again I'm going to
increase the count to three now with
sunny how many of them are yes one and
two so I have this one and this one so I
have two so I'm going to say with
respect to Sunny I have two
yes understand Outlook is my X1 X1
feature let's consider now the next
thing is that let's see with respect to
overcost with overcast how many of them
are yes so this overcast is there yes 1
2 3 and four so total four yes are there
with respect to overcast then with
respect to overcast how many are on no
you can go ah and find out it is
basically zero NOS then with respect to
rain how many of them are yes so here
you can see with respect to one rain yes
yes no no so this is nothing but 3 2
let's try to find out there are three is
two or
not one here also one yes is there right
so 3 yes two NOS so the total number of
yes and NOS if you count it there are
nine yes and five NOS this is my total
count so if you totally count this 9 + 5
is 14 you'll be able to compare that
there will be 9 yes and five NOS what is
the probability of yes when Sunny is
given so here you have 2X 9 here you
have 4X 9 here you have 3x 9 now if if I
say what is the probability of no given
Sunny now see probability of yes given
Sunny probability of yes given forecast
probability of yes given rain so it is
basically that I will just try to write
it in a simpler manner so that you'll
not get confused okay so this is my
probability of yes and this is my
probability of no but understand what
does this basically mean this
terminology basically means probability
of yes given Sunny probability of yes
given overcast probability of yes given
rain similarly what is probability of no
probability of no obviously you know
that 3x 5 is my first probability then
you have 0x 5 and then you have 2X 5 now
Now let's consider one more feature: temperature. In temperature, how many categories do I have? You can see hot, mild and cool. With respect to hot, mild and cool I will again have yes, no, probability of yes and probability of no. Now try to find out: with respect to hot, two no and two yes. Similarly with respect to mild: counting gives 4 yes and 2 no. With respect to cool: 3 yes and 1 no. Again the totals are 9 yes and 5 no, equal to the same thing we got before. Now go ahead and fill the probabilities: P(hot | yes) = 2/9, P(mild | yes) = 4/9, P(cool | yes) = 3/9; and for no, P(hot | no) = 2/5, P(mild | no) = 2/5, P(cool | no) = 1/5.

So these two tables have already been created. Finally, with respect to play itself, the total number of yes is 9 and no is 5 out of 14 in total. So if I ask what is the probability of yes alone, it is nothing but 9/14, and the probability of no is nothing but 5/14. These two values you also require.
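Both tables come from plain frequency counting. A sketch using the classic play-tennis rows (outlook, temperature, play) — these match the counts above; humidity and wind are omitted for brevity, and the helper name `likelihood` is my own:

```python
from collections import Counter

# Classic play-tennis dataset: (outlook, temperature, play) per day
data = [
    ("Sunny", "Hot", "No"), ("Sunny", "Hot", "No"), ("Overcast", "Hot", "Yes"),
    ("Rain", "Mild", "Yes"), ("Rain", "Cool", "Yes"), ("Rain", "Cool", "No"),
    ("Overcast", "Cool", "Yes"), ("Sunny", "Mild", "No"), ("Sunny", "Cool", "Yes"),
    ("Rain", "Mild", "Yes"), ("Sunny", "Mild", "Yes"), ("Overcast", "Mild", "Yes"),
    ("Overcast", "Hot", "Yes"), ("Rain", "Mild", "No"),
]

# Class counts: P(yes) = 9/14, P(no) = 5/14
label_counts = Counter(label for *_, label in data)

# Per-value counts for each feature, e.g. (Sunny, Yes) -> 2
outlook_counts = Counter((outlook, label) for outlook, _, label in data)
temp_counts = Counter((temp, label) for _, temp, label in data)

def likelihood(counts, value, label):
    """P(feature = value | class = label) by frequency counting."""
    return counts[(value, label)] / label_counts[label]
```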
Now let's say you get a new test data point where the outlook is sunny and the temperature is hot — tell me what the output is. This is my problem statement, so let me write it down:

P(yes | sunny, hot) ∝ P(yes) × P(sunny | yes) × P(hot | yes)

ignoring the denominator P(sunny) × P(hot) in the equation, because it is a constant — I'll get the same denominator when computing the probability of no. P(yes) I'm going to replace with 9/14, P(sunny | yes) is 2/9 and P(hot | yes) is 2/9, so the 9s cancel:

9/14 × 2/9 × 2/9 = 2/63 ≈ 0.031
Now go ahead and calculate: what is P(no | sunny, hot)? Here you have P(no) × P(sunny | no) × P(hot | no), divided by P(sunny) × P(hot) — which gets cancelled, because the denominator is a constant, guys. So what is P(no)? It is nothing but 5/14, so I'll write 5/14 here, multiplied by P(sunny | no), which is nothing but 3/5, multiplied by P(hot | no), which is nothing but 2/5. The fives cancel, and I'm getting 3/35, which on the calculator is nothing but ≈ 0.0857.
Let me write it down again: P(yes | sunny, hot), with sunny and hot as my independent features, is ≈ 0.031, and P(no | sunny, hot) is ≈ 0.0857. Now we'll try to normalize: 0.0857 / (0.031 + 0.0857) ≈ 0.73, which is nothing but 73%, and here I can basically say 1 − 0.73 = 0.27, which is 27%. So if the input comes as sunny and hot — if the weather is sunny and hot — what will the person do, will he play or not? The answer is no.
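The whole sunny-and-hot calculation, done with exact fractions (all values read off the tables derived above):

```python
from fractions import Fraction as F

# Priors and likelihoods from the play-tennis tables
p_yes, p_no = F(9, 14), F(5, 14)
p_sunny_yes, p_hot_yes = F(2, 9), F(2, 9)
p_sunny_no, p_hot_no = F(3, 5), F(2, 5)

# Unnormalised scores; the shared denominator P(sunny)*P(hot) is dropped
score_yes = p_yes * p_sunny_yes * p_hot_yes    # 2/63, about 0.031
score_no = p_no * p_sunny_no * p_hot_no        # 3/35, about 0.086

# Normalise: P(no | sunny, hot) comes out around 73%, so the prediction is "no"
p_no_posterior = score_no / (score_yes + score_no)
```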
Okay, now my next question: if your new data is overcast and mild, tell me what the probability will be using Naive Bayes. You can add any number of features — let's say we also consider humidity and wind; you would basically create the same kind of tables to find it out. But this will be an assignment, just do it: overcast and mild, with respect to Naive Bayes, try to solve it.

So the second algorithm that we are going to discuss is something called the KNN algorithm.
KNN is a very simple algorithm which can be used to solve both classification and regression. KNN basically means K nearest neighbours. Let's first discuss a classification problem. Say I have a binary classification problem with two clusters of data points, and suppose a new data point comes in between them. How do I say whether it belongs to this category or that category? If I built a logistic regression I might draw a dividing line, but in this particular scenario how do we come to a conclusion? Here we use K nearest neighbours.

Let's say my K value is five. What it is going to do is take the five closest points — say from one cluster you have two nearest points and from the other you have three. We see from the distances which points are nearest, and whichever class the maximum number of those points come from, we categorize the new point into that class. In this particular case the maximum number of points are from the red category — three points from red and two from white — so the point goes to red, just with the help of distance.

Which distances do we specifically use? We use two: one is the Euclidean distance and the other is something called the Manhattan distance. Suppose the two points are denoted (x1, y1) and (x2, y2). To calculate the Euclidean distance we apply the formula sqrt((x2 − x1)² + (y2 − y1)²). In the case of Manhattan distance we don't calculate the hypotenuse distance; we calculate the distance along the axes — from here to here, then here to here — which is |x2 − x1| + |y2 − y1|. So this is the basic difference between the Euclidean and Manhattan distances.
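The two distance measures can be written directly from these formulas — a minimal sketch that also generalizes to more than two coordinates:

```python
import math

def euclidean(p, q):
    """Hypotenuse distance: sqrt((x2 - x1)^2 + (y2 - y1)^2), generalised."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Axis-aligned distance: |x2 - x1| + |y2 - y1|, generalised."""
    return sum(abs(a - b) for a, b in zip(p, q))
```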
Now you may be thinking: Krish, fine, that is for classification — what do we do for regression? For regression also it is very simple. Suppose I have data points and a new point for which I want a prediction. Again we take the nearest five points — let's say my K is 5; K is a hyperparameter that we play with. The algorithm finds the five nearest points, and to produce the output for this particular point with K = 5, it calculates the average of all those points; once it calculates the average, that becomes your output. That averaging is the only difference between regression and classification. Because K is a hyperparameter, we try K from 1 to 50, check the error rate each time, and only select the model where the error rate is lowest.

Now, two more things with respect to K nearest neighbours: it works very badly with two things, one is outliers and one is an imbalanced dataset. Say I have an outlier: this is one of my categories, this is another, and consider an outlier from the first group sitting near the second. Now if I try to classify a new point there, you can see that the truly nearest points are basically blue, and it belongs to the blue category — but because of that outlier the algorithm considers the outlier to be the nearest neighbour, and the point gets treated as belonging to that group instead.
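Both KNN variants — majority vote for classification, average of the k nearest targets for regression — fit in a few lines. A sketch under those assumptions (the function name `knn_predict` is mine, not from the session):

```python
import math
from collections import Counter

def knn_predict(train, query, k=5, task="classification"):
    """train is a list of (point, target) pairs; point is a coordinate tuple."""
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    # Take the k closest training points by Euclidean distance
    neighbours = sorted(train, key=lambda pair: dist(pair[0]))[:k]
    targets = [t for _, t in neighbours]
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]  # majority vote
    return sum(targets) / k                           # average for regression
```

With three red points near the query and two white points far away, a K = 5 vote returns red, matching the walkthrough above.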
(To repeat, the formula for Manhattan distance uses the modulus: |x2 − x1| + |y2 − y1|.) That was it from my side, guys — and yes, I've also made detailed videos about whatever topics we have discussed today; you can directly go and search for that particular topic.

So this is the agenda of this session; we will try to complete all these things, and again we are going to understand the mathematical equations. In today's session we are basically going to discuss decision trees, and we are going to understand the exact purpose of a decision tree. With the help of a decision tree you are actually solving two different problems: one is regression and the other is classification. We'll try to understand both parts well; we will take a specific dataset and try to solve those problems.
age is less than 8 let's say I'm writing
this condition if age is less than or
equal to 18 I'm going to say print go to
college here I'm printing print college
and then I'll write else if age is
greater than 18 and pag is less than or
equal to 35 I'll say print work then
again I'll write else if age is let me
let me put this condition little bit
better then I'll write here L if if age
is greater than 18 and age is less than
or equal to 35 I'm going to say print
work basically people needs to work in
this age else I'm just going to consider
print retire so here is my ifls
condition over here now whenever we have
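The nested if-else above as runnable Python (the function name `life_stage` is my own, for illustration):

```python
def life_stage(age):
    # Same nested if/else rules a decision tree would encode
    if age <= 18:
        return "college"
    elif age <= 35:          # reached only when age > 18
        return "work"
    else:
        return "retire"
```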
Whenever we have this kind of nested if-else condition, we can also represent it in the form of a decision tree. In the decision tree we first have a root node. In this root node the first condition is age ≤ 18, so obviously I will have two branches, one saying yes and one saying no. If the condition is true we go down the yes side, where we basically have "college" — this is a leaf node. When the answer is no, we go to the next condition: I again create a node that checks age greater than 18 and less than or equal to 35, and again there are two branches, yes or no. If it is yes I print "work", so this is again a leaf node, and for no I do the further split, which is "retire".

So here you can see that this entire code that I have written has been converted into this kind of tree, where you specifically take yes/no decisions at each node. So can we solve a regression and a classification problem using decision trees by creating these kinds of nodes? In short, whenever we talk about decision trees, they are nothing but nested if-else conditions: with nested if-else we can definitely solve specific problem statements, but here we create the decision tree in a visualized way, in the form of nodes.

Now you need to understand what type of maths we will use. So let's do one thing: let's take a specific dataset, which I will definitely work through here in front of you, and we'll try to solve it; this will basically give you an idea of how we can solve these problems. Let me just open my snipping tool — so this is the dataset that I have.
set now this data set are pretty much
important because this probably in
research papers also probably people who
have come up with this algorithm they
usually take this they take this thing
but but right now this particular
problem statement if I talk about this
is a classification problem statement
okay but don't worry I will also help
you to explain I'll also explain you
about regression also how decision tree
regression will definitely work so let's
go ahead and let's try to understand
suppose if I have this specific problem
statement how do we solve this this is
my output feature play tennis yes or no
okay whether the person is going to pay
tennis or not yesterday or there after
yesterday or whenever you want so if I
have this input features like Outlook
temperature humidity and wind is the
person going to play tennis or not this
is what my model should predict with the
help of decision tree so how decision
tree will work in this particular case
first of all let's consider any any any
specific uh feature let's say that
Outlook is my feature so this will be my
first
feature which is specifically Outlook
now just tell me how many are basically
having no and how many are basically
having yes in the case of Outlook there
you'll be able to find out there are
nine yes see 1 2 3 4 5 6 7 8 9 and how
many NOS are there 1 2 3 4 5 I think 1 2
3 4 5 so nine yes and five NOS what we
are going to do in this specific thing
Now we have 9 yes and 5 no, and the first node I have taken is the Outlook feature. Focusing on this specific feature, how many categories do I have? One is sunny — you can see it over here — another is overcast, and another is rain, so three unique categories. Based on these three categories I will create three child nodes: one node for sunny, one for overcast and one for rain — so I am splitting on them.

Now just go ahead and see, within sunny, how many yes and how many no are there. Counting the rows where outlook is sunny: the answer is no once, twice, three times, and yes twice — so with respect to sunny I have 2 yes and 3 no. Understand, Outlook here is acting as my X1 feature, and I have selected it — you might ask why; it is up to the decision tree to select any of the features, and later on I'll explain how it selects, don't worry. Next, let's see overcast: in overcast I have 1, 2, 3, 4 yes and no no at all, so over here it will be 4 yes and 0 no. Then finally, when we go to the rain part: counting the yes and no rows for rain gives 3 yes and 2 no.
Let's go over it again: sunny definitely has 2 yes and 3 no, overcast has 4 yes and 0 no, and rain has 3 yes and 2 no. Now, looking at overcast, you need to understand two things: one is a pure split and one is an impure split. What does a pure split mean? In this particular scenario, in overcast I have either yes or no only — here you can see I have 4 yes and 0 no — so this is a pure split. Tomorrow, if in my dataset a new day — say day 15 — has outlook overcast, then I directly know the person is going to play. This node is called a pure node — why? Because it contains either all yes or all no; in this particular case I have all yes. So if I take this specific path, with respect to overcast my final decision is always going to be yes. That means I don't have to split further — from here I will definitely not split more, because I don't require it; it is a pure leaf node.

Now let's talk about sunny. In the case of sunny you have 2 yes and 3 no, so this is obviously impure. So what do we do? We take the next feature — and how do we calculate which feature to take next? I'll discuss that. Let's say after this I take temperature and start splitting again, since this node is impure, and this splitting will happen until we finally get a pure split. Similarly, with respect to rain we will go ahead and take another feature and keep splitting, unless and until we get a leaf node which is completely pure. I hope you understood how this exactly works.
exactly work now two questions two
questions is that Kish the first thing
is that how do we calculate this
Purity and how do we come to know that
this is a pure split just by seeing
definitely I can say I can definitely
say by just seeing that how many number
of yes or NOS are there based on that I
can def itely say it is a pure split or
not so for this we use two different
things one is
entropy and the other one is something
called as guine coefficient so we will
try to understand how does entropy work
and how does Guinea coefficient work in
decision tree which will help us to
determine whether the split is pure
split or not or whether this node is
leaf node or not then coming to the
second thing okay coming to the second
thing one is with respect to Purity
second thing your first most important
question which you had asked why did I
probably select Outlook how the features
are selected and here you have a topic
which is called as Information Gain and
if you know this both your problem is
solved so now let's go ahead and let's
understand about entropy or guinea
coefficient or Information Gain entropy
or guine coefficient oh sorry Guinea
coefficient I'm saying guine impurity
also you can say over here
I'll write it as guine impurity not
coefficient also I'll just say it as
Guinea impurity but I hope everybody is
understood till here let's go ahead and
let's discuss about the first thing that
is
How does entropy work, and how are we going to use the formula? The entropy formula is given by

H(S) = -p+ log2(p+) - p- log2(p-)

(I'll talk about what p+ and p- are in a moment), and the Gini impurity formula is

Gini impurity = 1 - sum over i = 1..n of (p_i)^2

I'll also talk about when you should use Gini impurity and when you should use entropy; note that by default decision tree classification uses Gini impurity.

Now let's take one specific example. I have feature 1 as my root node, and in this root node I have 6 yes and 3 no — very simple. Let's say this feature has two categories, and based on these two categories a split happens: in category C1 I have 3 yes and 3 no, and in the second category I have 3 yes and 0 no. Always understand that if you do the summation the child counts add back up to the root: 3 + 3 is obviously 6 yes, and 3 + 0 is obviously 3 no.
Now let's understand how we calculate: take this example and compute its entropy. I have already shown you the entropy formula; now let's understand the components. The minus signs are there; p+ basically means the probability of yes within the node, and p- means the probability of no (plus and minus are specifically for binary classes — you can call them positive and negative). For the pure child (3 yes, 0 no), the probability of yes is 3/3, so I can write

H(S) = -(3/3) log2(3/3) - (0/3) log2(0/3)

The second term obviously becomes zero — 0 divided by anything is zero, and 0 × log2(0) is taken as zero — and the first term is -1 × log2(1), and log2(1) is nothing but zero. So tell me, is this a pure split or an impure split? This is a pure split, and whenever we have a pure split the answer of the entropy is going to come to zero.
I'm going to define one graph: H(S) on the y-axis and p+ (or p-) on the x-axis. If my probability of plus is 0.5, what will the probability of minus be? It will also be 0.5, because it's just like p = 1 - q: if p is 0.5 then q is 1 - p, which is also 0.5. And when it is 0.5, my H(S) will be 1; that is the curve that gets formed. Now let's go ahead and try to calculate the entropy of this node, guys, the one with 3 yes and 3 nos: H(S) = -(3/6) log2(3/6) - (3/6) log2(3/6). If you do the calculation, each log2(1/2) is -1, so I'm actually going to get one. Why am I getting one? When you have three yes and three nos, the probability is 50/50, so when your p+ is 0.5 your H(S) comes out as one, and from the graph you can see the same thing. And if your p+ is zero or your p+ is one, that basically means it becomes a pure split, so for H(S) you are going to get zero. So always understand, your entropy will be between 0 and 1. This node is a completely impure split, because here you have a 50% probability of getting yes and a 50% probability of getting no. H(S) is the entropy, the entropy for the sample; that is the notation I'm using. So whenever the split is happening, the first thing done is the purity test, and the purity test is done with the help of entropy (I'll also show Gini impurity, don't worry). With entropy you'll be able to find out: if I'm getting one, that basically means it is an impure split, and if I'm getting zero, it is a pure split. So this is the graph, okay.
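The entropy calculation just described can be sketched as a small Python helper (the function name is my own):

```python
import math

def entropy(n_yes, n_no):
    """H(S) = -p+ * log2(p+) - p- * log2(p-); a 0 * log2(0) term counts as 0."""
    total = n_yes + n_no
    h = 0.0
    for count in (n_yes, n_no):
        p = count / total
        if p > 0:  # skip the 0 * log2(0) term, which is taken as zero
            h -= p * math.log2(p)
    return h

print(entropy(3, 0))  # pure split (3 yes, 0 no)   -> 0.0
print(entropy(3, 3))  # 50/50 split (3 yes, 3 no)  -> 1.0
```

Any mix of counts lands between 0 and 1, which is exactly the curve being drawn here.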
This is the graph, and this graph is basically the entropy graph. Again, understand: if your probability of getting yes or no is 0.5, that basically means a 50/50 split like 3 yes and 3 nos, and your entropy H(S) is going to be 1. If your probability is completely one, that basically means you're getting completely yes or completely no, so your entropy will be zero, which means it is a pure split. So at probability 0.5 you get the peak of one, and on either side it keeps on reducing. Now, so far you have understood the purity test: you use entropy to find out whether a split is pure or impure, and if it is impure you go ahead with a further division of the categories, taking another feature and dividing again; from these two children you would split the impure one further. And for any node you can plot its value on the graph: if its probability is, say, around 0.3, you go to the curve and read off the corresponding entropy, some value between 0 and 1. Let's go
ahead and discuss the second issue. We have discussed checking whether a split is pure or not, and we have understood that much. The next thing is, okay fine Krish, this is very good, you have explained it well, I know many people will say that, but there are some people I can't help. Now coming to the second problem: which feature do we take to split on? Because here I may have more than one possible split. So let's see, the second problem, which feature to take to split, is the problem that we are trying to solve.
Let's say that I have feature one over here, and it has two categories, C1 and C2. In the root let's say I have 9 yes and 5 nos, then in C1 I have 6 yes and 2 nos, and in C2 I have basically 3 yes and 3 nos. And in my data set I have features like F1, F2, F3. Now, another split I could start with is feature two, and in feature two I may have probably three categories, like C1, C2, C3. So with respect to the root node and all the other features, because after this I may also have to split again, taking another feature and splitting based on pure or impure splits, how do I decide: should I take F1 first, or F2 first, or F3 first, or any other feature first? How should I decide which feature to take and do the split with?
That is the major question, and for this we specifically use something called Information Gain. What is this Information Gain? I'll talk about it; first of all I will write the formula, computed first with feature one:

Gain(S, F1) = H(S) - sum over v in Values(F1) of (|Sv| / |S|) * H(Sv)

Don't worry, guys, if you have not understood the formula; I will explain each and every parameter, the entropy H(S), the sample sizes |Sv| and |S|, and the category entropies H(Sv).
Let's say that I'm taking this feature one split; you have already seen feature one. So this is my feature one, I have two categories C1 and C2, the root has 9 yes and 5 nos, C1 has 6 yes and 2 nos, and C2 has 3 yes and 3 nos. Now I will try to calculate the information gain of this specific split. To compute Gain(S, F1), the first thing I need to find out is H(S), and this H(S) is specifically of the root node; H(S) is nothing but the entropy of the root node. So tell me, how should I compute it? H(S) = -p+ log2(p+) - p- log2(p-); calculate along with me, I hope everybody knows this. What is the probability of plus in this root node? It is nothing but 9/14. So H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14), and this calculation comes out as approximately 0.94. Just check whether you're getting this or not; you can use a calculator if you want. Now I have definitely found this out,
now now I have definitely found out this
this is specifically for the root node
now let's see the next thing the next
important thing which is this part what
is s of v and what is s and what is h of
SV now very important just have a look
everybody see this graph okay see this
graph I will talk about h of SV first of
all I'll talk about h of SV okay this
one this is the entropy of category one
you need to find and entropy of category
2 you need to find so if I write h of SV
of category 1 so what is category 1 for
this I'll write SC1 let's say I'm going
to write like this quickly calculate the
H of SV of this and this separately you
need to calculate so h of SV of C1 okay
so here again you'll write - 6X 8 log
base 2 6X
8us 2x 8 log base to 2x 8 I hope
everybody knows this how we got it so h
of SV basically means I'm going to
compute the entropy of this category and
this category so for that I will
basically write h of so here I will
write - 6 by8 log base 2 6X 8 - 2x 8 log
base 2 2x 8 so if I get it I'm actually
going to get 81 and similarly if I if I
calculate h of C2 quickly calculate how
much you are going to get guys 6X 8 6X 8
with respect to this we need to find out
So now we have all these values; we'll start plugging them into the equation. So here we finally have Gain(S, F1): I'm going to basically write 0.94 minus the weighted summation. Understand what |Sv| means: how many samples are in each category. For category one, if you really want to calculate it, it is nothing but eight; and the total number of samples, if I go and see the root with its 9 yes and 5 nos, is 14. So the first term becomes 8/14 multiplied by H(Sv), and H(Sv), the entropy of category one, is nothing but 0.81. Then you go back to the graph and see how many samples category two has: 3 + 3 is 6, so the second term becomes 6/14 multiplied by 1. So this is the entire thing, and after all the calculation:

Gain(S, F1) = 0.94 - (8/14 * 0.81 + 6/14 * 1) ≈ 0.048

So this is my gain for S with F1; here I have got this value. Amazing.
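The whole Gain(S, F1) computation above can be checked with a short sketch (the variable names are my own; the counts are the ones from the example):

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

root = [9, 5]                 # 9 yes / 5 no at the root
children = [[6, 2], [3, 3]]   # C1: 6 yes / 2 no, C2: 3 yes / 3 no

h_root = entropy(root)        # ~0.94
weighted = sum(sum(ch) / sum(root) * entropy(ch) for ch in children)
gain = h_root - weighted      # ~0.048

print(round(h_root, 2), round(entropy(children[0]), 2), round(gain, 3))
```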
I did this with feature one only; what about feature two? Let's say that this was my split for feature two, and suppose I get the gain for S with feature 2 as 0.51. If I get this, now tell me, with which feature should I start splitting first, F1 or F2? Based on these values you know that the information gain of (S, F2) is greater than the gain of (S, F1), so the answer is very simple: we will definitely use feature 2 to start the split. The thing to understand here is that if I really want to select which feature to start my splitting with, I have to calculate the information gain across all the candidate splits, and whichever has the highest Information Gain, we select that specific one. Now the question arises: Krish,
obviously this is good, but you had written about Gini impurity; what is the purpose of that, please explain, and why is Gini impurity used? So let me go ahead with Gini impurity. I told you that yes, you can obviously use entropy, but why Gini impurity? The Gini impurity formula I have specifically written as

Gini = 1 - sum from i = 1 to n of (p_i)^2

Now what is this p_i squared? Here n is the number of outputs, and how many outputs do I have? Two, yes or no. So I can expand the summation as 1 - (p+^2 + p-^2). So this is the formula for Gini impurity. Now you may be thinking,
okay fine, the calculation will obviously be very easy. Suppose I have a node which has 2 yes and 2 nos; in this particular case how do I calculate it? I will write 1 - ((1/2)^2 + (1/2)^2) = 1 - (1/4 + 1/4) = 1 - 2/4 = 1 - 1/2, so I will be getting 0.5. Now here, understand: this is a completely impure split, right? If you have an impure split, in entropy the output you get is one, whereas in the case of Gini impurity it is 0.5. So if I go ahead with the graph that I had created earlier, my Gini impurity curve will look something like this: at probability zero I'll obviously be getting zero, but whenever my probability of plus is 0.5 I'm going to get 0.5 at the peak, and that is the difference between Gini impurity and entropy.
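The Gini impurity examples above can be verified with a tiny sketch (the function name is my own):

```python
def gini(n_yes, n_no):
    """Gini impurity 1 - (p+^2 + p-^2) for a binary node."""
    total = n_yes + n_no
    p_yes, p_no = n_yes / total, n_no / total
    return 1 - (p_yes ** 2 + p_no ** 2)

print(gini(2, 2))  # completely impure split -> 0.5
print(gini(3, 0))  # pure split              -> 0.0
```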
But you may be asking, Krish, when to use what? Now let's understand when to use Gini and when to use entropy. Tell me guys, if I consider this formula of Gini impurity and if I consider the entropy formula, which one do you think will take more time to execute? See, understand that a decision tree already has a bad time complexity, because if you have 100 features you'll keep on comparing, dividing many features and computing the Information Gain for each, like this, even with just 100 features. So which is faster, entropy or Gini impurity? Understand: in entropy you have a log function, while here in Gini you have simple maths. Out of entropy and Gini impurity, the more time is taken by entropy. So if you have a huge number of features, like 100 or 200 features, and you are planning to apply a decision tree, I would suggest using Gini impurity rather than entropy; if you have a small set of features, then you can go ahead with entropy. So definitely, with respect to speed, Gini is faster than entropy. Now let's go ahead and
understand the next case. You may be thinking, Krish, okay fine, you have basically explained categorical variables over here, but what if I have a numerical feature? Let's say I have an F1 feature which is a numerical feature, together with the output column, and let's say initially I have values like 2.3, 1.3, 4, 5, 7, 3. This is a continuous feature. So for a continuous feature, how will the decision tree calculate the entropy and the Information Gain? Here you'll see that the decision tree will first of all sort these values, so in F1 I have 1.3, then 2.3, then 3, then 4, then 5, and then 7. Now whenever you have
a continuous feature, how will it basically work? First of all, your decision tree will take just the first sorted value and form the condition: is it less than or equal to 1.3? Here you'll be getting two branches, yes or no, and your outputs will be placed under each. In the yes branch you'll be having one record, and in the no branch you'll be having the remaining five records, and for each side you'll be able to see how many yes and no outputs are there; the yes side will definitely be a leaf node. So in the first instance the tree will go ahead and calculate the information gain of this split. Then, once that Information Gain is obtained, what
it will do is take the next candidate and create a new split, say less than or equal to 2.3; now the yes side will have two records, with their counts of yes and no, and all the remaining records will come to the other side, and again the Information Gain will be computed. Then it goes to the next value, forming the condition less than or equal to 3, creates those nodes, counts how many yes and no are there, and again computes the Information Gain. Like this it will do it for each and every candidate value, and finally, whichever split gives the highest Information Gain, it will select that specific value in that feature and split the node there. So whenever you have a continuous feature, this is how it basically works: the candidate with the best Information Gain gets selected, and from there the splitting happens.
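The threshold scan for a continuous feature, as just described, might be sketched like this (the helper names and the tiny toy labels are my own assumptions):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / len(labels)
        h -= p * math.log2(p)
    return h

def best_threshold(values, labels):
    """Try an 'x <= v' split at every sorted value; keep the highest information gain."""
    parent = entropy(labels)
    best_value, best_gain = None, -1.0
    for v in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        if not left or not right:  # a split must send records both ways
            continue
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - weighted > best_gain:
            best_value, best_gain = v, parent - weighted
    return best_value, best_gain

values = [1.3, 2.3, 3, 4, 5, 7]  # the sorted feature values from the example
labels = ["no", "no", "no", "yes", "yes", "yes"]  # toy outputs, my own assumption
print(best_threshold(values, labels))  # -> (3, 1.0)
```

With these toy labels the candidate `x <= 3` separates the classes perfectly, so it wins with the maximum gain of 1.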
Now let's go ahead and understand the next topic: how this entire thing works in a decision tree regressor, because in a decision tree regressor my output is a continuous
variable so suppose if I have one
feature one feature two and this output
is a continuous feature it will be
continuous any value can be there so in
this particular case how do I split it
So let's say that the F1 feature is getting selected. Now, when this F1 feature gets selected, what value will the node carry? First of all, the mean of the entire output will get calculated, so here I will have the mean. And here the cost function that is used is not the Gini impurity or entropy; here we use mean squared error, or you can also use mean absolute error. Now, what is mean squared error? If you remember from our linear regression:

MSE = (1/n) * sum from i = 1 to n of (y_i - y_hat_i)^2

this is what mean squared error is.
So what it will do first, based on the F1 feature, is assign the mean value as the node's prediction, then compute the MSE value, and then go ahead and do the splitting. Now, when it is splitting based on the categories of the continuous variable, I will have different categories, and after the split some records will go to each child; each child then gets the mean value of its own records as its output, and again the MSE gets calculated there. As the MSE gets reduced, that basically means we are reaching near the leaf node, and the same thing happens on the other side. Finally, when you follow a path down, whatever mean value is present at the leaf, that will be your output. This is the difference between the decision tree regressor and the classifier: instead of using entropy and the rest, you use mean squared error or mean absolute error, and that is the formula of mean squared error.
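The MSE criterion and its reduction after a split can be sketched as follows (the names are my own; the outputs 20, 24, 26, 28, 30 are the illustrative values used next):

```python
def mse(ys):
    """Mean squared error around the node mean -- the regression-tree impurity."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def mse_reduction(parent, left, right):
    """How much a split reduces the weighted MSE (regression analogue of information gain)."""
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
    return mse(parent) - weighted

ys = [20, 24, 26, 28, 30]                      # illustrative output values
print(round(mse(ys), 2))                       # impurity at the root -> 11.84
print(round(mse_reduction(ys, [20, 24], [26, 28, 30]), 2))  # -> 8.64
```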
Now let's go to one more topic, which is called the hyperparameters. Tell me, for a decision tree, if I keep on growing it to any depth, what kind of problem will it face? You want me to explain the regressor part first? Okay, let's do the decision tree regressor. Let's say I have feature F1 and this is my output, with values like 20, 24, 26, 28, 30, and this feature one has some categories, category one and so on. Let's say I have done the division by F1, that is, this feature. Initially, tell me, what is the mean of the output? That mean value will get assigned at the root; then using MSE, that is mean squared error, you calculate the impurity here, and suppose I get an MSE of some value like 37 or 47, something like that. Then I will try to split this, and I will get two or three more nodes, it depends; those specific nodes will be part of this tree, and in each the mean will change again. Suppose these two records go here; then again the MSE will get calculated. I'm just taking an example over here, just try to picture this. Now if I
talk about hyperparameters: always understand, a decision tree leads to overfitting, because we will just keep dividing the nodes to whatever level we want, and this obviously leads to overfitting. Now, in order to prevent overfitting, we perform two important steps: one is post-pruning and one is pre-pruning. Let's say
splits I have done some splits let's say
over here I have seven yes and two
no and again probably I do the further
split like this now in this particular
scenario you know that if 7 yes and two
NOS are there there is a maximum there
is more than 80% chances that this node
is saying that the output is yes so
should we further do more
pruning the answer is no we can close it
and we can cut the branch from here this
technique is basically called as post
pruning that basically means first of
all you create your decision tree then
probably see the decision tree and see
that whether there is an extra Branch or
not and just try to cut it there is one
more thing which is called as
pre-pruning now pre-pruning is decided
by hyperparameters what kind of hyper
parameters you can basically say that
how many number of decision tree needs
to be used not number of decision tree
sorry over here you may say that what is
the max
depth what is the max depth how many Max
Leaf you can
have so this all parameters you can set
it with grid SE
CV and you can try it and you can
basically come up with a pre- pruning
technique so this is the idea about
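A pre-pruning search of this kind might look as follows with scikit-learn's GridSearchCV (the particular grid values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Pre-pruning: constrain the tree before it grows, searching the constraints with GridSearchCV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_, round(search.best_score_, 3))
```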
On the question about the decision tree regressor: yes, that is possible. Someone asks whether the Gini value can be one; no, are you talking about this Gini graph? It will not be one, it will always be between 0 and 0.5. So, first things first, as usual,
what we should do is import the libraries. So here I will go ahead and import them; I'll say import pandas as pd, import matplotlib.pyplot as plt. These basic things I have with me. Then I will go and take any data set I want: from sklearn.datasets import load_iris; let's say I'm going to take the iris data set, and then I'm going to load it, so I call load_iris() and there is my iris data set. Then, as the next step, once you get your iris data set, this is my iris.data.
Okay, these are all my features; there will be four features: petal length, petal width, sepal length and sepal width. These are my independent features. Then, if I really want to apply a classifier, a decision tree classifier, I can first of all import it: from sklearn.tree import DecisionTreeClassifier. (Let me see where the decision tree is present in sklearn... the name was absolutely fine, but I was not getting it over here; this gave a 'no module named sklearn' error because of a typo in sklearn.) So here you have the classifier. Right now I'm just going to
overfit the data, then I'll probably show you how you can go ahead with pruning. By default, what are the parameters over here? If you go and see the classifier, you have criterion; see, the first parameter is criterion, and by default it is gini. Then you have splitter; splitter basically means how you're going to split, and there you have two types, best and random (random selects the features randomly); you should always go with best. max_depth is a hyperparameter, min_samples_leaf is a hyperparameter, and max_features, how many features we are going to consider, is also a hyperparameter. So all these things are hyperparameters. I will just execute with whatever the decision tree gives me by default, and the next thing that
I'm actually going to do is draw the decision tree. For this I will be using plt.figure with a figsize inside it, and I will use a bigger figure size so that everybody will be able to see it; let me take an area of (15, 10), so plt.figure(figsize=(15, 10)). Then I'm going to say tree.plot_tree with the classifier, and it should be filled, the coloring should be filled. Okay, I also have to import tree, so from sklearn import tree. Again I'm getting an error, 'has no attribute plot'; why? Let me just see the documentation, guys. So this plot function is plot_tree, with an underscore: tree.plot_tree. Now what is the error we are getting? Okay, 'not fitted yet', sorry; so I'm going to say classifier.fit on the data, what data, iris.data, and I'm going to fit it with iris.target. So once this is done, I think now it will get executed. So this is how the graph will
look, guys. So here you can see how your graph looks. Now if I show you the graph over here, you can see some amazing things: three output classes are actually there. When you look at the left hand side, this becomes a leaf node, and this first one is probably the versicolor flower. If you go to the right hand side, here you can see a 50/50 node; so based on one feature you'll see that you are getting a leaf node, and on another branch you are getting 50/50. Then you have two more features getting split over here, so here you have 49/5 and here you have 47/1. Do we require this split? Anybody tell me, from here, do we require any more splits? Just try to think; this is the post-pruning view, I want to find out whether more splits are required or not. In this particular case, after this, do you require any split? You do not, right? Here you are basically getting 47 and 1, and after this also you require no split; understand this. So this is basically post-pruning: you can then decide your level and prune accordingly. Someone says a Gini value is coming as more than 0.5; here it is coming as 0.667, and I said the maximum should be 0.5, only 0 to 0.5 should come. That bound, though, is for two classes; with three classes, as in iris, the Gini value can go up to 1 - 1/3 ≈ 0.667, which is why that node shows it. Everywhere else, with purer nodes, you're getting less than 0.5. Plotting the graph is very easy: you use from sklearn import tree, then call tree.plot_tree with the classifier and filled=True, and you can just do that.
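Putting the whole live-coded demo together, a cleaned-up version might look like this (the Agg backend and the output filename are my own choices for a headless run):

```python
import matplotlib
matplotlib.use("Agg")                      # headless backend, my own choice
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
classifier = DecisionTreeClassifier()      # defaults: criterion="gini", splitter="best"
classifier.fit(iris.data, iris.target)     # fit first, or plot_tree raises NotFittedError

plt.figure(figsize=(15, 10))
plot_tree(classifier, filled=True,
          feature_names=iris.feature_names, class_names=iris.target_names)
plt.savefig("iris_tree.png")               # output filename is my own choice
```

The fully grown tree fits the iris training data essentially perfectly, which is exactly the overfitting that the pruning discussion addresses.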
Now let me define the
agenda, what all the things are. First we'll understand ensemble techniques; within ensemble techniques we are basically going to discuss the difference between bagging and boosting. So the agenda of this session is ensemble techniques, bagging and boosting; then we are probably going to cover random forest, then AdaBoost, and if I have more energy I will also try to cover XGBoost. All these algorithms we'll discuss. So let's go ahead and start the
topics. The first topic we are going to discuss is ensemble techniques. Now, what exactly are ensemble techniques? Till now we have solved two different kinds of problem statements, classification and regression, and you have learned different algorithms: linear regression, logistic regression, we have discussed KNN, and yesterday we discussed naive Bayes; different algorithms we have already finished. With respect to those classification and regression problems, whatever algorithm we were discussing, there was only one algorithm at a time being used to solve the problem. Now the next thing is: can we use multiple algorithms to solve a problem? If I ask this specific question, I will definitely say yes we can, because we are going to use something called ensemble techniques. In ensemble techniques we specifically use two different ways: one is something called the bagging technique, and the other is something called the boosting technique. What exactly do we do in bagging, what do we do in boosting, and how do we combine multiple models to solve a problem? Let's first of all discuss bagging.
Now, how does bagging work? Let's say that I have a specific data set, with features, rows, columns, everything; just imagine I have many features over here, like F1, F2, F3, and I have my output. So this is my data set D, let's consider it. What we do in bagging is create models, and each model can be anything; for a classification problem it can be logistic regression, it can be a decision tree. Let's say model M1 is logistic, M2 is probably another model like a decision tree, M3 is a KNN classifier, and M4 can again be a decision tree; it's fine, let's use another decision tree. So here you can see that we have used many models. Now, with
respect to these models, what I will do first, from this particular data set, is just take up some rows; I'll basically do row sampling. I'll take a row sample D' of the data set D, and D' is always smaller than D; some of the rows I'll push to M1, and this model one will be trained on them. Let's say that out of 10,000 records I'm doing a row sampling of 1,000 rows and giving them to M1 to train. Then for model M2, again I'm going to do row sampling and give another sample of rows to model two, and remember, some of the rows may get repeated between this D' and the next D''. Similarly I will do row sampling for the others, so I may have D''', D'''' and so on, different rows for each. When I say row sampling, I'm basically talking about data points: different data points I will give to separate models, and each model will train on its own sample. If 10,000 is my total number of data points, then D' may be 1,000 points, D'' may be another 1,000 points, and some of the rows may get repeated across them. So here, specifically, row sampling
will be used. Now, with this many models, each and every model will be trained with a different chunk of data. So how will the inferencing happen for the test data? First things first, let's say that I'm going to get a new test data point over here; the new test data will be
passed to M1, and suppose this M1 gives zero as my output; let's say I'm doing a binary classification and it gives a zero. Next, M2 for the new test data gives one, M3 gives one, and M4 also gives one as the output. Now what will happen in this particular case? You can see over here it's simple; what do you think the output will be? M1 has predicted zero for this test data, M2 has predicted 1, M3 has predicted 1, and M4 has predicted 1. So finally, all these outputs are going to get aggregated, and the simple rule that gets applied is majority voting. So tell
respect to this the output will
obviously be one because the majority
voting that you can see three people are
basically saying it as one so my output
over here will be one okay this is the
concept of bagging wherein you are
providing different different rows with
probably all the features in this case
and giving it to different different
model again which is a classification
model and then finally you are combining
them based on majority voting and you're
getting the answer as one so this step
is called as bootstrap aggregator that
basically means you're aggregating all
the output that is basically coming from
all the specific models now many
people will say Krish
what about a tie guys like this kind of
situation you know we will be having
more than 100 to 200 models so it is
very very difficult that it will be a
tie so what if
you're saying that 50% of the models
say yes and 50% of the models say no
always understand guys we will be having
more than 100 to 200 plus models so in
this particular case there will be high
probability that always there will be a
majority voting available it will always
not be in that specific scenario so this
was the concept about bagging now some
people will be saying that Krish why are
you using different different models
guys I'm not discussing about random
Forest over here random Forest uses only
one type of model that is decision tree
but if we think of the concept of bagging
you can have different different models
over here and you can basically combine
them so this is one of the Ensemble
techniques and this is basically called
as bagging okay now tell me one point I
missed out fine this is with respect to
the classification problem with respect
to the regression problem what will
happen in case of a regression problem
let's say that I got here 120 here 140
here 122 here 148 as my output so in
regression what will happen is that the
entire mean will be taken mean will be
taken the output mean will be basically
taken and that will be your output of
the model average or mean very simple
right so average or mean will be
basically taken up and here based on the
average you'll be able to solve the
regression problem great now let's go
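Both aggregation rules just described, majority voting for classification and the mean for regression, can be checked with the lecture's own numbers:

```python
from collections import Counter
from statistics import mean

# Classification: predictions of M1..M4 for one test point (lecture values).
votes = [0, 1, 1, 1]
majority = Counter(votes).most_common(1)[0][0]
print(majority)        # majority voting gives 1

# Regression: the four models output 120, 140, 122, 148 (lecture values).
preds = [120, 140, 122, 148]
print(mean(preds))     # the mean, 132.5, is the bagged prediction
```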
ahead and try to understand with respect
to bagging and boosting how many
different types of algorithm are but
before that I need to make you
understand what exactly is boosting now
here in bagging you have seen that you
have parallel models right independent
parallel models
you're giving some row samples in
different different models and basically
are able to find out the output now in
case of boosting boosting is a
sequential combination of models like
this you have lot of sequential models
like this one after the other like
first I'll give my training data to this
particular model then it will go to this
model then this model then this model so
this will be my M1 M2 M3 M4 and finally
I will be getting my output so here you
can basically say that boosting is all
about and this M1 M2 M3 we basically
mention it as weak Learners so this will
be weak learner weak learner weak
learner weak learner and finally when we
go till here if I combine all
these weak learners okay once I combine
all these weak learners it becomes a
strong learner finally if I
try to combine this it will basically
become a strong learner so here you have
all the models sequentially one after
the other and then you will probably try
to provide your uh input from one model
to the next model to the next model and
these all models will be a very simpler
weak learner model which will not be
able to predict properly but when you
combine all this particular models
together sequentially it becomes a
strong learner how this specifically
works I'll take an example of AdaBoost
and XGBoost I will show you that okay
weak learner basically means the
prediction is very bad but as you go
sequentially you combine them they
become a strong learner okay one example
I want to give you let's say that you
are a data scientist right let's say
that this model one may be a teacher
with respect to physics then this model
two may be a teacher with respect to
chemistry let's say model 3 is basically
a teacher of maths and model four is a
teacher of geography now suppose if you
are trying to solve one problem
obviously if the physics teacher is not
able to solve that particular problem
then probably chemistry can help or
maths can help or geography can help or
someone can help so when we combine this
many expertise together they will be
able to give you the output in an
efficient way Sumit I'll talk about it
where whether all the features are
basically passed to all the models or
not I'll just talk about it just give me
some time okay but I just want to give
you an idea about in short if someone
asks you in an interview what exactly is
boosting okay boosting is you can just
say that it is a sequential set of all
the models combined together and these
all models that I initialized are
usually weak Learners and when they are
combined together they become a strong
learner and based on the strong learner
they give an amazing output and right
now if I say in most of the kaggle
competition they use different types of
boosting or bagging technique so we have
basically as I said
bagging and boosting in bagging what
kind of algorithm we specifically use we
use something called as random forest
classifier and the second model that we
specifically use is something called as
random
Forest regress so we specifically use
these two kind of models which I'm
actually going to discuss right now
after this and then in boosting we
basically use techniques like AdaBoost
gradient boost and number three extreme
gradient boost which we also say it as
XGBoost so let's
go ahead and let's discuss about the
first algorithm which is called as
random forest classifier and regressor
now first thing first let's understand
some things from the yesterday's class I
hope uh what is the main problem with
respect to decision tree whenever we
create a decision tree without any
hyperparameter does it not lead to
overfitting uh
whenever you probably have a decision
tree right it leads to something like
overfitting why overfitting because it
completely splits all the feature till
it's complete depth overfitting
basically means for training data the
accuracy is high for test data the
accuracy is low so training data when
the accuracy is high I may basically say
it as low bias and when the test data
accuracy is low high variance so low bias and high
variance yes obviously we can do pruning
and all guys but again understand
pruning is an extensive task probably if
you have 100 features and if you
have data points which is like 1 million
to do pruning also it is very much
difficult yes pre pruning can be done
but again we cannot confirm that it may
work well or not so right now with
respect to decision tree you have this
specific problem that is low bias and
high variance now with low bias and high
variance you know that my model is
basically the generalized model that I
should get it should have low bias and
low variance so if somebody asks you why
do you use random Forest you can
basically explain about decision trees
like this now my main aim is to convert
this High variance to low variance now I
will be able to convert this High
variance to low variance using random
forest classifier or random Forest
regressor now what does random Forest do
random Forest is a bagging technique
similarly I have a data set over here
let's say that I have this data set
and then here I will be having multiple
models like
M1
M2
M3 M4 let's say I have this four models
like this we have many many models now
with respect to these models all the
models are actually decision trees
in random forest all are decision
trees you don't have a different model
over there so over here you can see that
all the models are decision trees that
are going to get used in random
Forest so decision trees always gets
used in random Forest the first thing
that you should know now whenever we are
using decision trees you know that
decision tree if I by default if we try
to create it it may lead to overfitting
and because of that every decision tree
will basically give low bias and
high variance but if we combine in the
form of bootstrap aggregator this High
variance will be getting converted to
low variance because why because
majority of voting we will be taking
from this particular decision trees like
there will be many many decision tree so
they lot of outputs will be coming and
with the help of majority voting
classifier this High variance will get
converted to low variance now in random
Forest how it works in the first case if
I talk about random Forest over here two
things basically happen with respect to
the D data set let's say in the first model
we do some kind of row sampling plus
feature sampling that basically means we have to
select some set of rows and some set of
features and give it to M1 similarly you
do row sampling and feature sampling and
give it to M2 then you do row sampling
and feature sampling you give it to M3
and then you do row sampling and feature
sampling you give it to M4 now when you
do this so what will happen
independently you're giving some
features along with some rows now there
may be a situation that your features
may also get repeated and
your records or data points may
also get repeated so when you are
probably training your model with this
specific data sets and specific features
this model become expert in predicting
something right as I said one example
over here I'm giving a physics model
some data I'm giving chemistry data
chemistry model with some data similarly
here I'm giving some information to some
model so the model will be an expert
with respect to that specific data So
based on all this particular data
whenever I get a new test data so what
will happen suppose let's say that this
this is a classification problem the M1
model will be predicting zero this will
be predicting one this will be
predicting zero and this will be
predicting zero now in this particular
case again the majority voting
classifier or majority voting will
happen in the case of classification
problem and then here you will be
specifically able to get the output as
zero so I hope everybody is able to
understand all the models over here are
decision trees and based on that you
will be doing see in an interview
the things that I'm telling you over here
all the points are very much important
and similarly if you tell the
interviewer definitely your interview is
cracked in this kind of algorithm I've
seen some of my students saying that
okay uh Krish um when the interviewer
asked me that which is my favorite
algorithm I said random Forest I asked
why did you say like that and he
said because that way the interviewer will let
him ask any questions on random Forest and
I'm very much confident about it and I'm
also going to prove to him you know
why it is very very good so with this
specific case here you can basically see
that because of the overfitting
condition of the decision tree you're
combining multiple decision tree so that
you get a generalized model which has
low bias and low variance so I hope
everybody is able to understand now
feature sampling basically means suppose
if I have 1 2 3 4 features for the
first model I may give two features for
the second model I may give three
features for the fourth model I may give
four features or uh any one feature also
I can give to a specific model so
internally random Forest takes
care of these things
and this is how random Forest
Works only the difference between random
Forest classify and regression is that
in regression again whatever output you
are basically getting you basically do
the mean that's it average you just do
the average you'll be able to get the
output based on all the models output
that you are actually getting now let's
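Row sampling plus feature sampling per tree can be sketched like this (a simplified illustration of the idea, not sklearn's implementation; the dataset shape and seed are made up):

```python
import random

random.seed(1)

n_rows, features = 8, ["F1", "F2", "F3", "F4"]

# Each tree gets rows drawn WITH replacement (rows may repeat) and a
# random subset of the features (subsets may overlap between trees).
trees = []
for _ in range(4):
    row_sample = random.choices(range(n_rows), k=n_rows)
    feat_sample = random.sample(features, k=random.randint(2, 4))
    trees.append((row_sample, feat_sample))

for i, (rows, feats) in enumerate(trees, start=1):
    print(f"tree {i}: rows={rows} features={feats}")
```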
talk about some of the important points
in random Forest the first thing first
question is that is normalization
required in random Forest then the next
question is that in KNN is normalization
or standardization required
when I say normalization or
standardization I'll just talk about
standardization is it
required so this will be my another
question so is normalization required in
random forest or decision tree you here
you can also say it as decision tree is
it required so for this the answer will
be no because understand decision tree
will basically do the splits even if you
scale down the data the splits won't
change much but if I talk
about KNN whether standardization
normalization required over here the
answer is yes because here we use two
things one is Euclidean distance and
Manhattan distance because of this you
definitely have to apply standardization
so that the computation or distance
becomes easy so this is one of the most
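Why distance-based KNN needs scaling while tree splits do not can be seen with two made-up points where one feature (say a salary) has a much larger scale:

```python
from math import dist

# Two points: feature 1 is small-scale, feature 2 is salary-scale.
a, b = (1.0, 50_000.0), (2.0, 52_000.0)
print(dist(a, b))   # Euclidean distance is dominated by the salary feature

# After standardizing each feature (toy means and stds for illustration),
# both features contribute comparably to the distance.
def standardize(p, means=(1.5, 51_000.0), stds=(0.5, 1_000.0)):
    return tuple((x - m) / s for x, m, s in zip(p, means, stds))

print(dist(standardize(a), standardize(b)))
```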
common interview questions that is
basically asked in random Forest coming
to the third question is random Forest
impacted by outlier
over here the answer will be no you can
also verify this yourself just
Google it and check it out
okay perfect so I hope I've
covered most of the things in random
Forest is random Forest impacted by
outliers this is the third question is
KNN impacted by
outliers is this KNN algorithm impacted
by outliers is KNN impacted Byers the
answer is yes big yes perfect so so
these all are the interview questions
that needs to be covered now let's go
ahead and discuss about adab boost now
in bagging most of the time we
specifically use random forest or you
can also create custom bagging
techniques custom bagging techniques
means whatever algorithm you want use
the combination of them and try to give
the output this also you can do it
manually with the help of hands okay
guys so second thing uh we are going to
discuss about is boosting technique in
this
the first thing that uh first algorithm
that we are going to discuss about is
adab Boost so adab boost we going to
discuss about how does adab Boost uh
work now let's solve uh the first
boosting technique which is called as
adab boost okay and uh this is a
boosting technique um in the boosting
technique you have heard that we have to
basically solve in a sequential way this
at least you know I know there is a lot
of confusion within you all but we'll
try to solve a problem let's say so
suppose I have a data set which looks
like this F1 F2 F3 F4 so these are my
features and probably these are my
output okay so let's say that I'm having
this features like this and this is my
output like yes or no like this so let's
say that how many records I have over
here 1 2 3
4 5 6 and one more is there 7 so these
seven records are there now in adab
boost the first thing is that
specifically with adab Boost uh you
really need to understand that what all
things we can basically do how do we
solve this classification problem that
we are going to understand the first
thing first is that we Define a weight
and the weight is very much simple
initially to all the records to all this
input records we provide an equal weight
now how do we provide an equal weight we
just go and count how many number of
records are there now in this particular
case the total number of records are one
2 3 4 5 6 7 now every record I have to
provide an equal weight that is between
0 to 1 so the overall sum should be one
so in this particular case what I can do
if I make 1/7 1/7 1/7 to everyone
this will definitely become
equal weights to all right and if I do
the total sum it will obviously be one
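The equal initial sample weights can be verified in a couple of lines:

```python
from fractions import Fraction

n_records = 7
# AdaBoost starts by giving every record the same weight 1/N,
# so the seven weights sum to exactly 1.
weights = [Fraction(1, n_records)] * n_records
print(weights[0], sum(weights))
```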
let's go to the next one now after this
what do we do okay after this in AdaBoost
the first thing that we do is that we
take any of this feature how do you
decide which feature to take whether we
should go with F1 or whether we should
go with FS2 or whether we should go with
F3 this we can do it with the help of
Information Gain and Information Gain
and entropy or guinea right based on
this we can definitely understand
whether we should start making decision
here also you specifically make decision
trees so here what you do is that you
probably have to determine by using
which feature I have to start my
decision tree so suppose out of all this
feature one feature two feature three
you have selected that okay the
information gain and entropy of feature
one is higher so I'm going to use
feature one and probably divide this
into decision trees now when I divide
this into decision tree let's say that
I'm dividing like this into decision
tree this decision tree depth will be
only one one depth and this depth since
it has only one depth we basically call
it as stumps so what we do over here
specifically we will create a decision
Tre by taking only one feature and we
will only divide it to one level okay
one level or one depth that's it
and this is specifically called as a stump
what we are going to do next is that
from this particular stump okay the
stump is basically getting created only
one so that is AdaBoost right we say
it as weak Learners there is a
reason we say this as a weak learner
so that is the first
thing with respect to uh this particular
adab boost so the first step is that
this is a weak learner so for the weak
learner we basically create a stump
stump basically means one level decision
tree that's it based on the information
gain and entropy I have selected the
feature and then I just made a decision
tree with only one level that is why
it is called a weak learner
okay so that is the reason we use only
a stump that is just a one level decision
tree now the next step happens is that
we provide all the specific records to
this F1 and we train this specific model
only with one level decision tree we
train them
now after we train them let's say that
we are going to pass all these
particular records to find out how many
are correct and how many are wrong this
decision tree is basically
giving so let's say that out of this
entire records one
record was just given as
wrong let's say that this is
the record which was given as wrong okay
so let's say that this record output was
predicted wrong from this particular
model only one wrong was there after
training the model now what we need to
do in this specific case understand a
very important thing so let's say that
we have done this and probably after
this what we are actually going to do we
are going to calculate the total error
so how many error this particular model
made let's say that in this particular
case only one was wrong right so if I want
to calculate the total error how will I
calculate it how many of them are
wrong only
one is wrong and what is the weight of this
so I will go and write 1/7 so this is
specifically my total error out of this
specific model which is my stump over
here okay which is my F1 stump now this
is my first
step the second step is that I need to
see the performance of stump which stump
this specific stump and the performance
is basically checked by a formula which
is 1/2 log e of 1 minus total error
divided by total error why we are doing this
everything will make sense okay in just
a small time
everything will make sense the first
step that we do in adaab boost is that
we try to find out the total error the
second step we try to find out the
performance of stump now in this
particular case it will be 1/2 log e of
(1 - 1/7) / (1/7) so once I calculate it
it will be coming as
0.896 see again understand out
of all these features F1 F2 F3 I found out from
Information Gain and entropy that this
is the best feature let's say that I
have calculated this
as 0.896 so this is my second step the
first step is find out the total error
the second step is performance of stump
what is te te basically means total
error te basically means total error now
see see the steps okay see the steps
whenever I'm discussing about boosting
I'm going to combine weak Learners
together to get a strong learner now
what is the next step out of this now
what what will be my third step
understand over here my third step will
be to update all these weights and that
is the reason why I'm calculating this
total error and performance of stump so
my third step will basically be new
sample weight from the decision tree one
which is my stump so I'll say new sample
weight is equal to I need to update all
these weights why I need to update all
these weights again understand I'll
talk about it in just a second so if I want
to update the sample weights first
update I will do it for correct records
see for correct records whichever are
correct like all these records are
correct
now when I update the weights of
these particular
records they should reduce and for
the wrong records that I have this
update should increase why because
because if I increase this weights then
the wrong records that are there that
record should go to the next week
learner that is the reason why I'm doing
it now how to update this particular
weights for correct records for correct
records the formula looks something like
this weight
multiplied by e to the power of
minus this specific performance okay
so e to the
power of minus PS where PS is performance of
stump and then I will basically be able
to write 1/7 * e to the power of minus
0.896 if I do the calculation everybody
try to do it the answer will be about
0.05 now this is for correct records what
about incorrect records for the
incorrect
records the formula that we are going
to apply is weight
multiplied by e to the power of plus PS not
minus PS so here I'll write 1/7
multiplied by e to the power of
0.896 so if I go and probably calculate this
I'm going to get it
as 0.349 so these two are the weights that
I have got that basically means all
these records which are correct the new
updated weights will be 0.05 not for the
wrong record though so let me just see
what 1/7 is so
here you can see initially it was 0.142
now it has got reduced to 0.05 because all
these records are correct but the wrong
record value is 0.349 so my weight will
now become over here 0.349 now I will
just go ahead and write over here
my new weights
0.05 0.05 0.05 then the fourth
record 1 2 3 4 which is the wrong one
and then 0.05 0.05 0.05 so out of
1 2 3 4 5 6 7 records my fourth record
will basically get the new value of
0.349 now tell me guys if I do the
summation of all these weights is
it one
no I don't think so because if
I try to add it up it is not one but if
I go and see the original weights 1/7 each
if I combine all the things 1 2 3 4 5 6
7 those all sum to one so here I need
to find out my normalized weight now in
order to find out the normalized
weight because the entire summation
should be one we have to
normalize now in order to normalize all
you have to do is go and find out
the sum of all these things the
summation of all these things will be
0.649 all you have to do is divide
all the numbers
by 0.649
divide all the numbers by
0.649 and tell me what will be the answer
that you'll be getting so here your
normalized weights will now look like
0.077 0.077 0.077 and the wrong record's
value will be somewhere around
0.537 then again 0.077
0.077 0.077 here we are dividing everything
by 0.649 now this is my normalized
weight now after you get a normalized
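The whole weight-update pipeline, total error then performance of stump then shrinking the correct weights and growing the wrong one then normalizing, can be reproduced in a few lines. Note the lecture rounds intermediate values (0.05, 0.349, 0.649), which is why it lands on 0.077/0.537; with full precision the normalized wrong-record weight comes out to exactly 0.5 and each correct record to 1/12, about 0.083:

```python
from math import exp, log

n = 7
w = [1 / n] * n                  # step 0: equal initial weights 1/7
wrong = {3}                      # index of the one misclassified record

te = sum(w[i] for i in wrong)            # step 1: total error = 1/7
perf = 0.5 * log((1 - te) / te)          # step 2: performance ~= 0.896

# step 3: multiply correct weights by e^-perf, the wrong one by e^+perf
w = [wi * exp(perf if i in wrong else -perf) for i, wi in enumerate(w)]

# step 4: normalize so the new weights sum to 1 again
total = sum(w)
w = [wi / total for wi in w]

print(round(perf, 3), [round(wi, 3) for wi in w])
# -> 0.896 [0.083, 0.083, 0.083, 0.5, 0.083, 0.083, 0.083]
```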
weight we will try to create something
called as buckets because see one
decision tree we have already created
which is a stump and you know from this
particular stump what you're going to get
okay as an output then in the sequential
model we will go and combine another
model over here now it's the time that I
have to create this specific model now
in order to create this specific model I
need to provide some specific rows only
to this model to train because this
model is giving one wrong now what I
have to do is that whatever is wrong
along with other data points I need to
provide this specific model with those
records so that this model will be able
to train on this and probably be able to
get the output now let's create buckets
now based on buckets how the buckets
will be created over here I will take 0.077
sorry whatever is the
normalized weight value okay so I will start
creating my buckets buckets basically
from 0 to
0.077 what did I say now for this decision
tree or stump I need to provide some
records so the maximum number of record
that should be going should be the wrong
records that should go over here now how
do we decide that okay there should be a
way that we should be able to say that
that specific wrong number of Records
should go to that decision tree so for
that purpose what we do is that this
decision tree will randomly create some
numbers between 0 to 1 randomly create
those numbers between 0 to 1 and
whichever bucket it will come in like 0
to 0.077 then 0.077 to 0.154 then 0.154
to 0.231 see how the bucket is
getting created each normalized weight is
getting added to the previous boundary so
for the wrong record the bucket becomes
0.231 + 0.537 which is nothing but
0.231 to 0.768 then 0.768
to
0.845 like this you create all the buckets
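The buckets are just cumulative sums of the normalized weights, and picking a record means drawing a random number in [0, 1) and finding its bucket, which is weighted sampling. A small sketch (the seed is an arbitrary choice):

```python
import random
from bisect import bisect
from itertools import accumulate

random.seed(42)

# Normalized weights from above; record index 3 is the misclassified one.
weights = [0.077, 0.077, 0.077, 0.537, 0.077, 0.077, 0.077]
edges = list(accumulate(weights))   # bucket upper edges: 0.077, 0.154, ...

# Draw 7 random numbers in [0, 1) and map each to its bucket index.
picks = [bisect(edges, random.random()) for _ in range(7)]
print(picks)   # index 3, owning the widest bucket, tends to dominate
```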
okay you can create all the buckets now
tell me which record is basically having
the biggest bucket size obviously this
record so if I randomly create a number
between 0 to one what is the highest
probability that the values will be
going in so in this particular case most
of the wrong records will be passed
along with the other records obviously
other records there are chances that
other records will go to the next
decision tree but understand maximum
number will go with the wrong records
because the bucket is high over here so
the bucket is high over here so most of
the time this specific record will get
selected and then it will go
to the second tree now suppose I have
this all records
so this is my first stump this is my
second stump this is my third stump
similarly the third stump from the
second stump whichever wrong records
will be going maximum number of Records
will go over here then again it will be
trained like this we'll be having lot of
stumps minimum 100 decision trees can be
added you know that every decision tree
will give one output for a new test data
new test data this weak learner will
give one output this weak learner will
give one output and
this weak learner will be giving
one output obviously the time complexity
will be more now from this particular
output suppose it is a binary
classification I will be getting 0 1 1 1
so again over here majority voting will
happen and the output will be one in
case of regression problem I will be
having a continuous value over here and
for this the average average will be
computed and that will give me an output
over here so for regression the average
will be done for classification what
will happen majority voting will be
happening so everywhere that same part
will be going on buckets is very much
simple guys buckets basically means
based on this weights normalized weight
we are going to create bucket so that
whichever records has the highest bucket
based on this randomly created number you
know it will select those specific
buckets and pass those records to the next stump
understand why this bucket size is Big
the other wrong records which are
present right suppose there are more
than four to five wrong records their
bucket size will also be bigger and
because based on this randomly creating
num between 0 to 1 most of the wrong
records will be selected and given to
the second stum similarly this
particular decision tree will be doing
some mistakes then that wrong records
will get updated all the weights will
get updated and it will be passed to the
next decision tree guys when I say wrong
record the output will be same only no
zero and one so interesting everyone I
hope you understood so much of maths in
adab boost and how adab boost actually
work three main things one is total
error one is performance of stump and
one is the new sample weight these
things are getting calculated extensively
the normalized weight was basically used
because the sum of all these weights should be
approximately equal to one when boosting
why not take the last output no no no we
have to give the importance of every
decision tree output every decision tree
output are important okay let me talk
about one model which is called as
blackbox model versus white box what is
the difference between blackbox model
and white box if I take an example of
linear regression tell me what kind of
model it is is is it a white box model
or black box if I take an example of
random
Forest is this a white box or black box
if I take an example of decision tree it
is a white box or blackbox model if I
take an example of an ANN is it a white
box or blackbox model linear regression
is basically called as a white box model
because here you can basically visualize
how the Theta value is basically
changing and how it is coming to a
global Minima and all those things in
random Forest I will say this as
blackbox model because it is impossible
to see all the decision tree how it is
working so that is the reason the maths
is so complex inside this if I talk
about decision tree this is basically a
white box model because in decision tree
we know how the splits are basically
happening with the help of paper and pen
you'll be able to do it in the case of
an Ann this is a blackbox model because
here you don't know like how many
neurons are there how they are
performing and how the weights are
getting updated so this is the basic
difference between the blackbox and
white box model this entire thing is
the agenda of today's session so let's
start uh the first algorithm that we are
probably going to discuss today is
something called as K
means
clustering K means clustering and this
is a kind of unsupervised machine
learning now always remember
unsupervised machine learning basically
means that uh the one and the most
important thing is that in unsupervised
machine learning
in unsupervised ml you don't have any
specific output so you don't have any
specific output so suppose you have
feature one and feature two and suppose
you have different different data
you know and based on this data what we
do we basically try to create clusters
this clusters basically says what are
the similar kind of data so this is what
we basically do from uh clustering and
there are various techniques like K
means uh hierarchical clustering and all
so first of all we'll try to understand
about K means and how does it
specifically work it's simple uh suppose
you have a data points like this okay
let's say that this is your F1 feature
F2 feature and based on this in two
dimensional probably I will be plotting
this points and suppose this is my
another points so our main purpose is
basically to Cluster together in
different different groups okay so this
will be my one group and probably the
other group will be this group right so
two groups because obviously you can see
from this clusters here you have two
similar kind of data which is basically
grouped together right this is my
cluster one and this is my cluster 2 let
me talk about this and why specifically
it'll be very much useful then we'll try
to understand about math intuition also
now always understand guys uh where does
clustering gets used okay in most of the
Ensemble techniques I told you about
custom Ensemble techniques right so in
custom Ensemble
techniques you know whenever we are
probably creating a model first of all
on our data set what we do is that we
create clusters so suppose this is my
data set during my model creation the
first algorithm we will probably apply
will be clustering algorithm and after
that it is obviously good that we can
apply regression or classification
problem suppose in this clustering I
have two or three groups let's say that
I have two or three groups over here for
each group we can apply a separate
supervised machine learning algorithm if
we know the specific output that we
really want to take ahead I'll talk
about this and uh give you some of the
examples as I go ahead now let's go on
go ahead and focus more on understanding
how does K means clustering algorithm work
so let's go over here the word K means
has this K value this K are nothing but
this K basically means centroids K
basically means centroids so suppose if
I have a data set which looks like this
let's say that this is my data set now
over here just by seeing the data set
what are the possible groups you think
definitely you'll be saying K is equal
to 2 So when you say k is equal to two
that basically means you will be able to
get two groups like this and each and
every group will be having a centroid a
centroid Point here also there will be a
centroid point so this centroid will
determine basically this is a separate
group over here this is a separate group
over here so over here here you can
definitely say that fine this is two
groups but but how do we come to a
conclusion that there is only two groups
okay we cannot just directly say that
okay we'll try to just by seeing the
data because your data will be having a
high dimension data right right now I'm
just showing your two Dimension data but
for a high dimension data definitely
you'll not be able to see the data
points how it is plotted so how do you
come to a conclusion that only two
groups are there so for this there is
some steps that we basically perform in
K means the first step is that we try
with different K values we try with
different K values and which is the
suitable K value K is nothing but
centroids okay it is nothing but
centroids we try with different
different centroids in this particular
case let's say that I have this
particular data point and I actually
start with k is equal 1 or 2 or 3 any
one you want let's say that I'm going to
start with k is equal 2 how to come up
with this K is equal to 2 as a perfect
value that I'll talk about it we need to
know there is a concept which is called
as within cluster sum of square so when
we try different K values let's say that
for K is equal to 2 what will happen the
first step we select a we try K values
so let's say that we are considering K
is equal to 2 the second step is that we
initialize K number of centroids now in
this particular case I know my K value
is 2 so we will be initializing randomly
let's say that K is equal to 2 so what
we can actually do let's say that this
is this is my one centroid I will I'll
put it in another color so this will be
my one centroid and let's say that this
is my another centroid so I have
initialized two centroids randomly in
this space now after this particular
centroid what we have to do is that
after initializing this centroid what we
have to do is that we have to basically
find out which points are near to the
centroid and which points are near to
this centroid now in order to find out
it is a very easy step we can basically
use Euclidean distance to find out the
distance between the points in an easy
way if I really want to show you that
you know like how many points I want to
in an easy way what I can do I can
basically draw a straight line over here
let's say that I'm drawing a straight
line over here in another color I can
draw a straight line and I can also draw
one parallel line like this so This
basically indicates that whichever
points you see over here suppose if I
draw a straight line in between all
these points you will be able to see
that let's say that I'm drawing one more
parallel line
which is intersecting together so from
this you can definitely find out let's
say that these are all my points that
are nearer to this green line Green
Point so what I'm actually going to do
in this particular case all these points
that you are seeing near the green it
will become green color so that
basically means this is basically nearer
to this centroid and whichever points
are nearer to this particular point that
will become red point so that basically
means this belongs to this group okay
this belongs to this group so I hope
everybody's clear till here then what
will happen so first we try a K
value then we initialize the K
number of centroids that is done then we
try to calculate the distance we try to
find out which all points is nearer to
the centroid let's say that this is my
one centroid this is my another centroid
and we have seen that okay these all
points belong to this centroid it near
to this particular centroid so this is
becoming red so that is based on the
shortest distance and here it is
becoming green now the next step let's
see what is the next step after this so
I am going to remove this thing now the
next step will be that the entire points
that is in red color all the average
will be taken so here again the average
will be taken now third step here I'm
going to write here we are going to
compute the average the reason we
compute the average is that because we
need to update the centroid so compute
the average to update centroid to update
centroids so here you'll be able to see
that what I'm actually doing as soon as
we compute the average this centroid is
going to move to some other location so
what location it will move it will
obviously become somewhere in Center so
here now I'm going to rub this and now
my new centroid will be this point where
I am actually going to draw like this
let's say this is my new centroid now
similarly this thing will happen with
respect to the green color so with
respect to the green color also it will
happen and this green will also Al get
updated so I'm going to rub this and
this will be my new Green Point which
will get updated over here then again
what will happen again the distance will
be calculated and again a perpendicular
line will be calculated here you can see
that now all the points are towards
there okay again the centroid based on
this particular distance again it will
be calculated and here you can see that
all the points are in its own location
so here now no update will actually
happen let's say that there was one
point which was red color over here
then this would have become green color
but since the updation has happened
perfectly we are not going to update it
and we are not going to update the
centroid right so now you can understand
that yes now we have actually got the
perfect centroid and now this will be
considered as one group and this will be
basically considered as the another
group it will not intersect but right by
default here intersection is happening
so I hope everybody's understood the
steps that you have actually followed in
initializing the centroids in updating
the centroids and in updating the points
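The steps just described (pick K, initialize K centroids, assign each point to its nearest centroid by Euclidean distance, recompute each centroid as the average, repeat until nothing changes) can be sketched in a few lines of NumPy; the points and the helper name below are made up for illustration, not from the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: initialize k centroids, assign, average, repeat."""
    rng = np.random.default_rng(seed)
    # step 2: initialize k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: update each centroid to the average of its assigned points
        # (a production version would also guard against empty clusters)
        updated = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(updated, centroids):   # no centroid moved -> converged
            break
        centroids = updated
    return labels, centroids

# two well-separated groups in two dimensions (features F1 and F2)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, k=2)
```

Running this on the six points above puts the first three points in one cluster and the last three in the other, with the two centroids near the center of each group.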
is it clear everybody with respect to K
means now let's discuss about one
point how do we decide this K value okay
how do we decide this K value so for
deciding the K value there is a concept
which is called as elbow method so here
I'm going to basically Define my elbow
method now elbow method says something
very much important because this will
actually help us to find out what is the
optimized K value whether the K value
should be two whether uh the K value is
going to be three whether the K value is
going to become four and always
understand suppose this is my data set
suppose this is my data set initially
let's say that I have my data points
like this we cannot go ahead and
directly say that okay K is equal to
2 is going to work so obviously we are
going to go with iteration for I is
equal to probably 1 to 10 I'm going to
move towards iteration from 1 to 10
let's say so for every iteration we will
construct a graph with respect to K
value and with respect to something
called as W CSS now what is this W CSS W
CSS basically means within cluster sum
of
square okay this is the meaning of wcss
within cluster sum of square now let's
say that initially we start with one
centroid so one centroid let's say it is
initialized here one centroid is
basically initialized here if we go and
compute the distance
between each and every points to the
centroid and if we try to find out the
distance will the distance value be
greater or it will be smaller will it be
smaller or greater tell me if you try to
calculate this distance from this
centroid to every point this is what is
within cluster sum of square it will
always be very very much greater so
let's say that my first point has come
somewhere here it is going to be
obviously greater let's say that my
first point is coming over here fine
so with K is equal to 1 initially we
took and we found out the distance of w
CSS and it is a very huge value okay
because we're going to compute the
distance between each and every point to
the centroid now the next thing that I'm
actually going to do is that now we'll
go with next value that is K is equal to
2 now in K is equal to 2 I will
initialize two points okay I will
initialize two points and then probably
I will do the entire process which I
have written on the top now tell me
whichever points is nearer to this green
point if we compute the distance and
whichever points is nearer to the red
point if you compute the distance like
this now this summation of the distance
will be lesser than the previous W CSS
or not obviously it is going to be
lesser than the previous W CSS so what
I'm actually going to do probably with K
is equal to 2 your value may come
somewhere here then with K is equal to 3
your value May come somewhere here then
K is equal to 4 will come here to 5 6
like this it will go so here if I
probably join this line you'll be able
to see that there will be an Abrupt
changes in the W CSS value in the wcss
value there will be an Abrupt changes
and this is basically called as the
elbow curve now why we say it as elbow
curve because it is in the shape of
elbow and here at one specific point
there will be an Abrupt change and then
it will be straight so that is the
reason why we basically say this as
elbow okay so this is a very important
thing see in finding the K value we use
elbow method but for validating purpose
how do we validate that this model is
performing well we use the silhouette score that
I'll show you just in some time but
understand that in K means clustering we
need to update the centroids and based
on that we calculate the distance and as
the K value keep on increasing you'll be
able to see that the distance will
become normal or the wcss value will
become normal and then we really need to
find out which is the right K value where
the abrupt change see over here suppose
abrupt change is there and then it is
normal then I will probably take this as
my K value so obviously the model
complexity will be high because we are
going to check with respect to different
different K values and wcss values and
this basically means that the value that
we'll probably get first of all we need
to construct this elbow curve then see
the changes where it is basically
happening we'll need to find out the
abrupt change and once we get the abrupt
change we basically say that this may be
the K value so K is equal to 4 as an
example I'm telling you so unless and
until if you really want to find the
cluster it is very much simple we take a
k value we initialize K number of
centroids we compute the average to
update the centroids then again we try
to find out the distance try to see that
whether any points has changed and
continue that process unless and until
we get separate groups okay so this is
the entire funda of K means clustering
so finally you'll be able to see that
with respect to the K value we will be
able to get that many number of groups
if my K value is four that basically
means I will be probably getting four
different groups like this 1 two right
three like this and four I will be
getting four groups like this with K is
equal to 4 that basically means K is
equal to four clusters and every group
will be having its own centroids okay
centroids are very much important yes
I'll try to show you in the coding also
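As a minimal sketch of the elbow method (assuming scikit-learn is available; its `inertia_` attribute is exactly the WCSS described above, and the four-center synthetic data is my own illustration, not the lecture's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with four real groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 11):                        # try K = 1..10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                  # within-cluster sum of squares

# plotting range(1, 11) against wcss gives the elbow curve: WCSS keeps
# dropping as K grows, and the abrupt change flattens out near the true K
```

The WCSS list is strictly shrinking at the start, and the "elbow" is the K after which the decrease becomes marginal.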
guys let's go towards the second
algorithm the second algorithm that we
will be probably discussing is called as
hierarchical clustering now hierarchical
clustering is very much simple guys all
you have to do is that let's say this is
your data points this is your data
points and this is my P1 let's say P2
now hierarchical clustering says that we will
go step by step the first thing is that
we will try to find out the most nearest
Value let's say this is my X and Y let's
say these are my points like this is my
P1 point this is my P2 point this is my
P3 point this is my P4 Point P5 Point P6
point p7 point okay so these are my
points that I have actually named over
here let's say that this may be the
nearest point to each other so what it
will do it will combine this together
into one cluster this we have computed
the distance so it will C create one
cluster now what will happen on the
right hand side there will be another
notation which you may be using in
connecting all the points one so suppose
this is my P1 this is my P2 this is my
P3 P4 let's say that I have this many
points and probably I will also try to
make
p7 so these are my points p7 now you
know that the nearest point that we are
having okay this will probably be
distance 1 2 3 this is distance okay 4 5
6 like this we have lot of distance so
hierarchical clustering will first of all find
out the nearest point and try to compute
the distance between them and just try
to combine them together into one what
do we do we basically combine them into
one group okay so P1 and P2 has been
combined let's say then it'll go and
find out the other nearest point so
let's say P6 and p7 are near so they are
also going to combine into one group so
once they combine into one group then we
have P6 and p7 which will be obviously
greater than the previous distance and
we may get this kind of computation and
another combination or cluster will
get formed over here then you have seen
that okay P3 and P5 are nearer to each
other so we are going to combine this so
I'm going to basically combine P3 and
P5 okay and let's say that this distance
is greater than the previous one because
we are basically going to start with
the shortest distance and then we are
going to capture the longest distance
now this is done now you can see that
the next point that is near right to
this particular group is P4 so we are
going to combine this together into one
group so once we combine this into one
group this P4 will get connected like
this let's say it is getting connected
like this P4 has got connected then what
is the nearest Point whether it is P6 p7
group or P1 P2 obviously here you can
see that P1 P2 is there so I am probably
going to combine this group together
that basically means P1 P2 let's say I'm
just going to combine this group group
together again circle is coming so I
will make a dot let's say I'm going to
combine this group together because
these are my nearest groups so what will
happen P1 and P2 will get combined to P5
sorry P4 P5 this one so I will be
getting another line like this and then
finally you'll be seeing that P6 p7 is
the nearest group to this so this will
totally get combined and it may look
something like this so this will become
a total group like
this so all the groups are combined so
finally you'll be able to see that there
will be one more line which will get
combined like
this this is basically called as a
dendrogram okay which is built from the
bottom to the top now the question
arises is that how do you find that how
many groups should be here how do you
find out that how many groups should be
here the funda is very much clear guys in
this is that you need
to find the longest
vertical line you need to find out the
longest vertical line that has no
horizontal line pass through it no
horizontal
line passed through it this is very much
important that has no horizontal line
pass through it now what this is
basically meaning is that I will try to
find out the longest line longest
vertical line in such a way that none of
the horizontal line passes through it
what is horizontal line suppose if I
consider this vertical line This
vertical line over here if you see that
if I extend this green line it is
passing through this if I extend this
line it is passing through this right if
I'm extending this line it is passing
through this right so out of this the
longest line that may be passing in such
a way that no horizontal line probably
is this line that I can actually see so
what you do over here is that you
basically just create a straight line
over this and then you try to find out
that how many clusters it will be there
by understanding that how many lines it
is passing through if it is passing
through this one line two line three
line four line that basically means your
clusters will be four
clusters this is how we basically do the
calculation in heral clustering again
here it may not be the perfect line I've
just drawn with some assumptions but if
you are trying to do this probably you
have to do in this specific way okay
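The bottom-up merging and the dendrogram cut described above can be sketched with SciPy (assuming it is installed; the P1..P7 coordinates are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# seven points: P1/P2 close together, P3/P4/P5 close, P6/P7 close
points = np.array([[1.0, 1.0], [1.1, 1.0],              # P1, P2
                   [5.0, 5.0], [5.2, 5.1], [5.1, 4.9],  # P3, P4, P5
                   [9.0, 1.0], [9.1, 1.1]])             # P6, P7

Z = linkage(points, method="single")  # repeatedly merge the nearest pair/group
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 groups
# scipy.cluster.hierarchy.dendrogram(Z) would draw the bottom-to-top tree
```

With these points the three-group cut recovers exactly the P1/P2, P3/P4/P5, and P6/P7 groups.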
I've already uploaded a lot of practical
videos with respect to hierarchical
clustering and all now tell me
maximum effort or maximum time is taken
by is taken
by K
means or hierarchical clustering this is a
question for you yes guys number of
clusters may be three but here I'm just
showing you that how many lines it may
be passed by how do you basically
determine whether maximum time will be
taken by K means or hierarchical clustering this is
an interview question the maximum time
that will be taken is by hierarchical
clustering why because let's say that I
have many many data points at that
point of time hierarchical clustering will
keep on constructing this kind of
dendrograms and it will be taking a
lot of time right so hierarchical
clustering will take more time maximum
time that it is going to basically take
so it is very much important that
you understand which one is basically
taking more time so if your data set is
small you may go ahead with hierarchical
clustering if your data set is large go
with K means clustering in short both will
take time but K means will perform better than
hierarchical clustering see guys you will be
forming this kind of dendrograms right
and just imagine if you have 10 features
and many data points how you're going to
do it it will be a cumbersome process
you'll not be even able to see this
dendrogram properly and manually
obviously you cannot do it so this was
with respect to K means clustering and
hierarchical clustering I hope everybody's
understood now the next topic that we'll
focus on is that how do we
validate see how do we validate a
classification problem we use
performance metric like confusion Matrix
accuracy um different different true
positive rate Precision recall but how
do we validate a clustering model so
we are going to basically use
something called as the silhouette
score I'll show you what the silhouette score
is I'm going to just open the Wikipedia
so this is how the silhouette score looks like a
very very amazing topic okay how do we
validate whether my model basically has
the perfect three or four clusters
suppose if I find out my K value
is three how do we find out now see one
more issue with K means which I forgot to
tell you let's say that I have a data
point which looks like this and suppose
I have some data points like this I have
some data points which looks like this
let's say I have like this now in this
one issue will be that suppose I try to
make a cluster over here obviously
you'll be saying my K value will be two
okay in this particular case suppose
this is one cluster this is my another
cluster
right because of my wrong initialization
of the points okay understand because
suppose if I initialize just randomly
some centroids like this then what may
happen is that there is a possibility
that we may also have three clusters
like this kind of clusters one
cluster will be here one cluster will be
here one cluster will be here so this
initialization of the centroids one
condition is that it should be very very
far if we initialize our centroids very
very far at that point of time we will
be able to find the centroid exactly in
the center because it will keep on
updating it'll keep on going ahead right
but if we don't initialize that very far
then there will be a situation that
probably where the
real aim was to get only two centroids
I was probably getting three centroids
right so this is a problem so for this
there is an algorithm which is called as
K means++ and what this K means++
will do which I will probably show
you in Practical this will make sure
that all the centroids that are
initialized it is very very
far okay all the centroids that are
basically there are initialized very
very far we'll see that in the practical
application where specifically those
centroids are basically used okay
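As a quick sketch (assuming scikit-learn), k-means++ is simply an initialization option of `KMeans`; in fact it is the default, and it is what spreads the starting centroids far apart:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# two real groups in synthetic data (illustrative assumption)
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.7, random_state=0)

# init="k-means++" picks starting centroids that are far from each other,
# avoiding the bad random initialization discussed above
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
```

After fitting, `km.cluster_centers_` holds the two final centroids and `km.labels_` the group of every point.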
now let me go ahead and let me show
you with respect to the silhouette score now
what is the silhouette score I'm going
to explain you in an amazing way this is
important
if someone says you how do we validate
how do we validate a clustering
model then at that point of time we
basically use this silhouette score it will
be used with respect to K
means it can be used in hierarchical clustering
right if you want to validate how do we
validate okay that is what we are
basically going to see over here now in
silhouette scoring
what are the most important things the
first and the most important thing is
that we will try to find out we will try
to find out a of i we will try to find
out a of i now what is this a of i see
this a of i that you basically see a of i
is nothing but see three major steps
happen in order to validate a cluster
model with the help of silhouette first thing
is that I will probably take one cluster
okay there will be one point
which will be my point i let's say and
then what I'm going to do I'm just going
to take whatever other points are there inside this
cluster and I'm going to compute the
distance between them so I'm going to do
the summation and I'm also going to do
the average of all this distance so here
you can see that when I said distance of
i comma j i basically means this point and j
basically means all the other points in the
same cluster note that i is a data point
and not the centroid
so I'm going to compute all the distances
over here which is mentioned by this and
this value that you see that I'm
actually dividing by C of i minus one in
short I am actually trying to calculate
the average
distance so this is the first point
where I'm actually computing the a of i
now similarly what I will do is
that the next
thing will be that suppose I have
computed a of i the next thing that we
need to compute is b of i now what is b
of i b of i is nothing but there will be
multiple clusters in a k means problem
statement we will try to find out the
nearest cluster okay suppose let's say
that this is the nearest cluster and in
this I have all the variety of points
then B ofi basically says that I will
try to compute the distance between each
point and the other point in this
centroid sorry in this cluster so this
is my cluster one this is my cluster two
so what I'm actually going to do is that
here I'm going to compute the distance
between this point to this point then
this point to this point then this point
to this point this point to this point
this point to this point this point to
this point every point I'm actually
going to compute the distance once this
point is done we will go ahead with the
next point and we'll try to compute the
distance and once we get all this
particular distance what we are going to
do we are going to do the average of
them average
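To make the two steps concrete, here is a tiny hand computation of a(i) and b(i) for one point i (the coordinates are made up for illustration):

```python
import numpy as np

own_cluster  = np.array([[1.0, 1.0], [1.2, 1.0], [1.0, 1.3]])  # cluster containing i
near_cluster = np.array([[4.0, 4.0], [4.1, 4.2]])              # nearest other cluster
i = own_cluster[0]                                             # the point i itself

# a(i): average distance from i to the other points of its own cluster
a_i = np.mean([np.linalg.norm(i - p) for p in own_cluster[1:]])
# b(i): average distance from i to every point of the nearest other cluster
b_i = np.mean([np.linalg.norm(i - p) for p in near_cluster])

s_i = (b_i - a_i) / max(a_i, b_i)   # silhouette value of i, between -1 and +1
```

Here a(i) is small (i sits tightly inside its own group) and b(i) is large, so s(i) comes out close to +1, which is exactly the "good clustering" case discussed next.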
now tell me if I try to find out the
relationship between a of I and B of I
if my cluster model is good will a of
i be greater than b of i or
will b of i be greater than a of i
if I have a good clustering model
out of this if we
have a really good model obviously the
distance b of i will be greater
than a of i in a good model that
basically means if I talk about the silhouette
score the values will be between -1
to +1 the more the value is towards +1
that basically means the good the model
is the good the clustering model is the
more the values towards negative one
that basically means this condition is
getting applied now what does this
condition basically say that basically
means that a of i is farther than b of i so the
point sits closer to the other cluster this is what this
information is getting portrayed and
this is the importance of the silhouette
score finally when we apply the
formula of the silhouette score you'll be able
to see that the silhouette score is nothing
but let me rub this everything guys for
you let me just show you the silhouette
score formula it will
be something like this this b of i so
here you have the silhouette score this is
the formula b of i minus a of i Max of a
of I comma B of I if C of I is greater
than one right so by this you will be
getting the value between -1 to + 1 and
more the value is towards + one the more
good your model is more the values
towards minus1 more bad your model is
because if it is towards minus1 that
basically means your a of I is obviously
greater than b of I so this is the
outcome with respect to cot crust string
if s is equal to zero that basically
means still your model needs to be uh
per basically the clustering needs to be
improved what is I over here I is
nothing but one data point you you can
just read this guys data point in I in
the cluster C of I so I hope everybody's
understood this now let's go ahead and
let's discuss about the next topic we
have obviously finished up solart
clustering over here let's discuss about
something called as DB
scan so for DB scan clustering this is
an amazing clustering algorithm we'll
try to understand how to actually do DB
clustering and probably you'll be able
to understand a lot of things from this
now in DB scan clustering what are the
important things so let's start with
respect to DB scan clustering and let's
understand some of the important points
over here the first point that you
really need to remember is something
called as core points I'll also
talk about when do you say core points
or when do you say other points as such
so the first point that I will probably
discuss about is something called as Min
points the second point that I will
probably discuss about is something
called as core points the third thing
that I will probably discuss about is
something called as border points and
the fourth point that I will definitely
talk about is something called as noise
Point okay guys now tell me in K means
clustering
if I have this kind of groups don't you
think with the help of two different
clusters I may combine this two like
this with the help of two different
clusters I may combine something like
this right but understand over here what
what problem is basically happening with
the second clustering this is actually
an outliers let's say that let's say one
thing very nicely I will put okay let's
say I have one point over here I have
one point over here here so if I do
clustering probably I will get one
cluster
here and I may get another cluster which
is somewhere here now understand one
thing this point is definitely an
outlier even though this is an outlier
with the help of K means what I'm
actually doing I'm actually grouping
this into another group so can we have a
scenario wherein a kind of clustering
algorithm is there where we can leave
the outlier separately and this outlier
in this particular algorithm and this is
basically uh we will be using DB scan
to leave the outlier out and this point
will be called as a noisy Point noisy
point or I can also say it as an outlier
so this will be a noise point for this
kind of algorithm where you want to skip
the outliers we can definitely use DB
scan that is density based spatial
clustering of applications with noise a
very amazing algorithm and definitely I
have tried using this a lot nowadays I
don't use K means or hierarchical clustering instead
use this kind of algorithm now see this
what are the important things over here
first of all you need to go ahead with
Min points Min points so first thing is
that you need to have Min points this
Min points is a kind of
hyperparameter this basically says what
does hyper parameter says and there is
also a value which is called as
Epsilon which I forgot I will write it
down over here this is called as Epsilon
now what does epsilon mean Epsilon
basically means if I have a point like
this
and if I take Epsilon this is nothing
but the radius of that specific Circle
radius of that specific Circle okay so
Epsilon is nothing but radius over here
in this specific case what does minimum
points is equal to 4 mean let's say that
I have I have taken a point over here
let's say that this is my
point and I have drawn a circle which
looks like this and let's say that this
is my Epsilon
value okay this is my Epsilon value if I
say my Min points value is equal to 4
which is again a hyper
parameter that basically means if
I have at least four points over
here near to this particular Circle
based on this Epsilon value then what
will happen is that this point this red
point will actually become a core
point a core point which is basically
given over here if it has at least that
many number of Min points inside or near
to this particular within this
Epsilon okay within this particular
cluster suppose this is my cluster with
the help of Epsilon I have actually
created it is there a particular unit of
Epsilon or we simply take the unit of
distance no Epsilon value will also get
selected through some way I I'll show
you I'll show you in the practical
application don't worry now the next
thing is that let's say let's say I have
another another point over here let's
say that I have another point over here
and this is my circle with respect to
Epsilon I have created it let's say that
here I have only one
point I have only one point inside this
particular cluster at that point this
point becomes something called as border
Point border Point border point also we
have discussed over here right so border
point is also there so here I'm saying
that at least one at least one if it is
only one it is present then it will
become a border point if it has Force
definitely this will become a core Point
core Point like how we have this red
color so and there will be one more
scenario suppose I have this one cluster
let's say this is my Epsilon and suppose
if I don't have any points near this
then this will definitely become my
noise point and this noise point will
nothing be but this will be a
cluster okay so here I have actually
discussed about the noise point also so
I hope everybody is able to understand
the key terms now what is basically
happening is that whenever we have a
noise Point like in this particular
scenario we have a noise point and we
don't find any points inside this any
core point or border point if you don't
find inside this then it is going to
just get neglected that basically means
this is basically treated as an outlier
I hope everybody is able to understand
here this point will be treated as an
outlier or it can also be treated as a
noise point and this will never be taken
inside a group okay it will never never
be taken inside a group suppose I have
this set of points which you see
basically over here red core and all and
there is also a border Point by making
multiple circles over here here you can
definitely say that how we are defining
core points and the Border points and
this can be combined into a single group
okay this can be combined into a single
group because how the connection is now
see this this yellow line is basically
created by one sorry this yellow point
is basically created by one Epsilon and
we have one One Core point over here
remember over here it should be at least
one core Point okay not one point but
one core point at least if it is having
one core point then it will become a
border point this will become a border
point that basically means yes this can
be the part of this specific group so
what we are doing whenever there is a
noise we are going to neglect it
wherever there is a border and core
point we are going to combine it so
I'll show you one more diagram which is
an amazing diagram which will help you
understand this better than k means
clustering and hierarchical clustering now
see this everybody now the right hand
side of diagram that you see is based on
DB scan clustering and the left hand
side is basically your traditional
clustering method let's say that this is
K means which one do you think is better
over here you see this these all
outliers are not combined inside a group
but whichever are nearer as a core point
and the border point separate
groups are actually
created right so this is how amazing a
DB scan clustering is a DB scan
clustering is pretty much amazing that
is basically the outcome of this here in
k means clustering you can see all
these points have also been taken as
blue color as one group because it is
considering this as one group but here
we are able to determine these
amazing groups so I'm saying you guys
can directly use DBSCAN without
worrying about anything
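the core, border and noise behavior described above can be sketched with scikit-learn's DBSCAN; the tiny hand-made data set, eps and min_samples values here are illustrative assumptions, not tuned ones:

```python
# minimal DBSCAN sketch, assuming scikit-learn is installed;
# eps (the epsilon radius) and min_samples (the minimum points)
# are the two key terms discussed above
import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups of points plus one far-away point
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # group 1
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # group 2
    [10.0, 0.0],                                       # isolated point
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_                      # cluster id per point, -1 means noise/outlier
n_core = len(db.core_sample_indices_)    # core points: >= min_samples within eps
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, labels.tolist())       # → 2 [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

the isolated point gets label -1, exactly the noise/outlier case discussed above, and is never pulled into a group.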
focus on the Practical part uh I'm just
going to give you a GitHub link
everybody download the code guys I've
given you the GitHub link quickly
download and keep your file ready I'm
going to open my anaconda prompt
probably open my jupyter notebook we'll
do one practical problem I've given you
the link guys please open it so this is
what we are going to do today this will
be amazing here you'll be able to see
amazing things how do you come to know
that over fitting or underfitting is
happening you don't know the real value
right so in in clustering there will not
be any underfitting or overfitting so uh
what all things we'll be importing first
is that we'll try k means clustering we'll
do silhouette scoring and then probably
we'll see the output and um and we'll do
DBSCAN also let's say DBSCAN is also
there so uh what are the things we have
basically imported one is the KMeans
clustering one is silhouette_samples and
silhouette_score these all are present in
sklearn and they are present in
sklearn.metrics that basically means we
use this specific metric to validate
clustering models okay now we'll try to
execute this and apart from that
matplotlib we are just trying to import
numpy we are trying to import and all
here we are executing it perfectly the
next thing is that here the next step is
that generating the sample data from
make underscore blobs first of all we
are just trying to generate some samples
with some two features and we are saying
that okay it should have four centroids
or four centers itself with some features
I'm trying to generate some X and Y data
randomly and this particular data set
will basically be used in performing
clustering algorithms okay for now forget
about range_n_clusters because we
need to try with different different
clusters and try to find out the silhouette
score so right now I just initialized it
with 2 3 4 5 6 values it is very simple
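the data generation just described can be sketched like this, assuming scikit-learn; 500 samples is an illustrative choice, the two features and four centers match the lecture's description:

```python
# generate a toy data set with make_blobs, assuming scikit-learn;
# two features and four centroids as described in the lecture
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
range_n_clusters = [2, 3, 4, 5, 6]    # candidate K values for the silhouette check
print(X.shape, y.shape)               # → (500, 2) (500,)
```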
so if I go and probably see my X data so
my X data will look something like this
so this is my X data with two features
and this is my Y data with one feature
which is my output which belongs to a
specific class okay so that you can
actually do with the help of make
underscore blobs now let's see how to
apply the k means clustering algorithm
so as I said I will be using wcss
wcss basically means within cluster sum
of squares so I'm going to import KMeans
over here for i in range 1 to 11 that
basically means I'm going to use
different different K values or centroid
values and try to see which is having
the minimal wcss value and I'll try to
draw that graph which I had actually
shown you with respect to the elbow
method so here I will basically be
using KMeans the number of clusters will
be i and as the initialization technique
I will be using k-means++ so that the
centroids that are initialized are
very very far apart and then
you have random state is equal to zero
then we do fit and finally we do
wcss.append(kmeans.inertia_) okay this dot
inertia will give you the distance
between the centroids and all the other
points and this is what I'm going to
append in this wcss value and finally
I'll just plot it now here you can see
that I'm just plotting it obviously by
seeing this graph this graph looks like
an elbow okay this graph looks like an
elbow so the point that I'm actually
going to consider over here see which is
the last abrupt change so if I talk
about the last abrupt change here I have
the specific value with respect to this
okay I have one specific value with
respect to this this is my abrupt change
from here the changes are normal so I'm
going to basically select K is equal to
4 now what I'm actually going to do with
the help of the silhouette
score we are going to compare whether K
is equal to 4 is valid or not so that is
what we are going to do
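the elbow loop just described can be sketched as follows, assuming scikit-learn; the make_blobs data set is a stand-in for the one generated above:

```python
# WCSS / elbow method sketch: fit KMeans for K = 1..10 with
# k-means++ initialization and collect inertia_ (the within
# cluster sum of squares), assuming scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)   # sum of squared distances to the closest centroid

print([round(w) for w in wcss])    # WCSS drops sharply up to the elbow, then flattens
```

plotting wcss against K with matplotlib gives the elbow graph shown in the lecture; the last abrupt drop marks the K to pick.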
go ahead and let's try to see it how we
are going to do it so here you can see n
clusters is equal to 4 then I'm actually
able to find out the prediction and this
is specifically my output okay this is
done now see this code okay this code is
a huge code I have actually taken this
code directly from the sklearn page of
silhouette if you go and see this this
directly given over there but I'm just
going to talk about like what are the
important things we need to see over
here with respect to different different
clusters see see this clusters 2 3 4 5 6
I'm going to basically compare whether
the K value should be four or not with
the help of silhouette scoring so let's go
here and here you can see that I'm
applying this one first I will go with
respect to the for loop for n_clusters
in range_n_clusters different
different cluster values are there first
we'll start with two so here you can see
initialize the clusterer with the
n_clusters value and a random generator
seed of 10 for reproducibility so
n_clusters first I took it as two and
then I did fit_predict on X after I did
fit_predict on X I'm using this score on X
comma cluster label now what this is
going to do understand in silhouette
what did we discuss it will try to find
out all the clusters the clusters over
here like this and it'll try to
calculate the distance between them
which is the a(i) then it'll try to
compute the b(i) then finally it'll
try to compute the score and the
value is between -1 to +1 the more the
value is towards +1 the
better it is right so these all things
we have already discussed and that is
what this specific function will do and
this will give my silhouette average
value over here
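a minimal sketch of this validation step, assuming scikit-learn; it loops over the candidate cluster counts and prints the average silhouette score for each, on a stand-in make_blobs data set:

```python
# silhouette score sketch: the score lies in [-1, +1] and the
# closer to +1 the better, assuming scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

for n_clusters in [2, 3, 4, 5, 6]:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    score = silhouette_score(X, cluster_labels)   # mean of (b(i) - a(i)) / max(a(i), b(i))
    print(n_clusters, round(score, 3))
```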
okay this we have done and then we can
continuously do it for another another
things you can actually find it over
here and this value that you see this
code that you see is nothing nothing so
complex okay this is just to display the
data properly in the form of graphs okay
in the form of graphs so again I'm
telling you I did not write this code
I've directly taken it from the uh
sklearn page of silhouette okay so just try
see this particular uh plotting diagrams
and all that you can definitely figure
out but let's see I will try to execute
it and try to find out the output now
see for n_clusters equal to 2 the
average silhouette score is 0.70 I told
you the value will be between -1 to +1
and I'm actually getting 0.704 which is
very very good and then for n_clusters
equal to 3 it is 0.588 then for n_clusters
equal to 4 I'm getting 0.65 which is
pretty much amazing and then for
n_clusters equal to 5 the average score
is 0.563 and n_clusters equal to 6 you
are seeing 0.45 here directly you can
actually say that fine for n_clusters
equal to 2 I'm getting an amazing score
of 0.704 obviously you're getting the
highest value over this so should we
select ncore cluster isal to two Okay we
should not directly conclude from it
because here we need to also see that
any feature value or any cluster value
is also coming as negative value that
also we need to check so here we will go
down over here you will see the first
one over here with respect to the first
one you see that I'm getting the
values from 0 to 1 it is not going
to -0.1 so definitely two clusters
was able to solve the problem so I'll
keep it like this with me I definitely
have a chance that this may
perform well I may have a chance that
K is equal to 2 may perform
well okay so I may have a chance let's
see to the next one to the next one over
here you can see that for one of the
cluster the value is negative if the
value is negative that basically means
the a(i) is obviously greater than b(i)
so I'm not going to prefer this because it
is having some negative values even
though my cluster looks better but again
understand what is the problem with
respect to this cluster is that if I
take this cluster and probably compute
the distance between this point to this
point and if I probably compute from
this point to this point or this point
to this point this point is obviously
nearer to this right it is obviously
nearer to this so that is the reason why
I'm getting a negative value over here
okay negative value over here this is my
output my score these dotted points that
you see this is my score 0.58
or whatever it is this is basically my
score so obviously this basically
indicates that this point is nearer to
the other cluster's points so
I'm actually getting a negative value
right so this you really need to
understand okay now similarly if I go
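the per-cluster negative-value check being described can be sketched with silhouette_samples, assuming scikit-learn and a stand-in make_blobs data set:

```python
# per-point silhouette values via silhouette_samples; a negative
# value means a(i) > b(i), i.e. the point sits nearer to another
# cluster than to its own
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=10).fit_predict(X)

values = silhouette_samples(X, labels)              # one value per point, in [-1, +1]
for k in range(4):
    print(k, round(values[labels == k].min(), 3))   # a negative minimum flags bad points
```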
with respect to n_clusters equal
to 4 this looks good because here I
don't have any negative value and here
you can see how nicely it has basically
divided the points amazingly with the
help of k equal to 4 right and similarly
if I go with five obviously you can see
some negative values are here some
dotted line negative value are there
with respect to six you also have some
negative values so definitely I'll not
go with six I may either go with four or
I may either go with two now whenever
you have this options always take a
bigger number instead of two take four
because four is greater than two because
it will be able to create a generalized
model so from this I'm actually going to
take n equal to 4 that is K equal to 4
now should we compare with this with the
elbow method here also I got four right
so both are actually matching so this
indicates that with the help of this
clustering this silhouette score we can
definitely come to a conclusion and
validate our clustering model in an
amazing way so I hope everybody is able
to understand and this way you basically
validate a model and definitely you can
try it out you can understand this code
definitely but till here you have
understood that here I'm going to get
the average value then for n_clusters
whatever cluster it is matching it is
just mapping over there and it is
basically giving so this was the session
and uh yes in today's session we
efficiently covered many topics we
covered k means hierarchical clustering
silhouette score DBSCAN clustering in tomorrow's
session the topics that are probably
pending is first I'll start with svm and
svr second I will go ahead with XG boost
and third I will cover up PCA let's
see whether I'll be able to complete
this session uh one one amazing thing
that I want to teach you guys because
many people ask me the definition of
bias and variance so guys uh many people
get confused when we talk about bias and
variance you know because let's say that
uh I have a model for the training data
set it gives us somewhere around 90%
accuracy let's say I'm getting a 90%
accuracy for the test data I may
probably getting somewhere around 70%
accuracy now tell me which scenario is
basically this most of the people will
be saying that okay fine it is
overfitting now when I say overfitting I
basically mention overfitting by low
bias and high
variance right so many people get
confused Krish tell me just the exact
definition of bias and variance low bias
obviously you are saying that because
the training is performed like the model
is performing well with the help of
training data set but with respect to
the test data set the model is not
performing well with respect to training
data set why do we always say bias and
with respect to test data set why do we
always say variance so for this you need
to understand the definition of bias so
let me write down the definition of bias
over here so here I can definitely write
that bias it is a phenomenon that skews
the result of an algorithm in favor of
or against an idea I'll make you
understand the definition uh um but
understand
what I have actually written over here
it is a phenomenon that skews the result
of an algorithm in favor of or against
an idea whenever I say this specific idea
this idea I will just talk about the
training data set initially now when we
train a specific model suppose if I have
this specific model over
here and I'm training with this specific
training data set so this is my training
data set now based on the definition
what does it basically say it is a
phenomenon that skews the result of an
algorithm in favor or against an idea or
a this specific training data set so
even though I'm training this particular
model with this training data set
with this data set it may it may be in
favor of that or it may be against of
that that basically means it may perform
well it may not perform well if it is
not performing well that basically means
the accuracy is down if the accuracy is
better at that point of time what will
say see if the accuracy is better that
time what we'll say we we'll come up
with two terms from here obviously you
understand okay there are two scenarios
of bias now here if it is in favor that
basically means it is performing well
with respect to the training data set I
will basically say that it has low bias
if it is not able to perform well with
the training data set then here I will
say it has high
bias I hope everybody is able to
understand in this specific thing
because many many many people has this
kind kind of confusion now similarly if
I talk about variance let's say about
variance because you need to understand
the definition a definition is very much
important okay if I if I just talk about
the definition of variance I'm just
going to refer like this the variance
refers to the changes in the model when
using different
portions of the
training or test
data now let's understand this
particular
definition variance refers to the
changes in the model when using
different portions of the
training data or test data we obviously
know that whenever initially if I have a
model understand from the definition
everything will make sense I am
basically training initially with the
training
data okay because we divide our data set
see our data set whenever we are working
with we divide this into two parts one
is our train data and test data okay
because this is a tra test data is a
part of that particular data set right
and suppose in this particular training
data it gets trained and performs well
here I'm actually talking about bias but
when we come with respect to the
prediction of the specific model at that
point of time I can use other training
data that basically means that training
data may not be similar or I can also
use test data now in this test data what
we do we do some kind of predictions
these are my predictions and in this
prediction again I may get two
scenario I may get two scenario which is
basically mentioned by variance it
refers to the changes in the model when
using when using different portion of
the training or test data refers to the
changes basically means whether it is
able to give a good prediction or wrong
predictions that's it so in this
particular scenario if it gives a good
prediction I may definitely say it as
low variance that basically means the
accuracy with the accuracy with respect
to the test data is also very good if I
probably get a bad if I probably get a
bad accuracy at that time I basically
say it as high variance so if I talk
about three scenarios over here let's
say this is my model one and this is my
model
two and this is my model
three now in this scenario let's
consider that my model one has the
training
accuracy of 90% and test accuracy of
75% similarly I have here as my train
accuracy of 60% and my test accuracy
of
55% now similarly if I have my train
accuracy of 90% And my test accuracy of
92% now tell me what what things you
will be getting here obviously you can
directly say that fine your training
accuracy is better now you're talking
about bias so this basically indicates
that this has low
bias and since your test accuracy is bad
because it is when compared to the train
accuracy it is less so here you are
basically going to say high
variance understand with respect to the
definition similarly over here what
you'll say high
bias High variance because obviously it
is not performing
well this is another scenario last the
last scenario is that this is the
scenario that we want because it is low
bias and low variance
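as an illustrative sketch (not from the lecture), the three scenarios can be reproduced in code by comparing train versus test R² for polynomial fits of different complexity; the data set and the degrees 0, 5 and 9 are assumptions purely for demonstration:

```python
# bias/variance illustration: compare train vs test fit quality
# for polynomial models of different complexity (numpy only)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.2, 60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def r2_scores(degree):
    # least-squares polynomial fit on the training split only
    coeffs = np.polyfit(x_tr, y_tr, degree)
    def r2(xs, ys):
        pred = np.polyval(coeffs, xs)
        return 1 - np.sum((ys - pred) ** 2) / np.sum((ys - ys.mean()) ** 2)
    return r2(x_tr, y_tr), r2(x_te, y_te)

# degree 0: poor on both              -> high bias
# degree 5: good on both              -> low bias, low variance (generalized)
# degree 9: best on train, can drop
#           on test                   -> low bias, possibly high variance
for degree in [0, 5, 9]:
    train_r2, test_r2 = r2_scores(degree)
    print(degree, round(train_r2, 2), round(test_r2, 2))
```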
okay many many people have basically
asked me the definition with respect to
bias and variance and here I've actually
discussed and this indicates this gives
me a generalized model and this is what
is our aim when we are working as a data
scientist so I hope you have understood
the basic difference between bias and
variance and I was able to give you a lot
of examples lot of understanding with
respect to this so I hope you have
actually got this particular uh
understanding of this uh two terms which
we specifically talk about high bias low
bias High variance low variance right so
this was it from my side guys uh and uh
I hope you have understood
this
okay so let's consider a data set with
salary and credit columns
and let's say this is the loan
approval so we are going to take this
sample data set and understand how does
XG boost work suppose salary is less
than or equal to 50 and the credit is
bad so approval the loan approval will
be zero that basically means he he or
she will not get if it is less than or
equal to 50 if the credit score is good
then probably approval will be one if it
is less than or equal to 50 if it is
good
again then it is going to get one if it
is greater than
50 and if it is bad then obviously
approval will be
zero if it is greater than
50 if it is good we are going to get it
as one if it is greater than
50k and probably if it is normal then
also we are going to get it as one
so this is my data set so how
does XG boost classifier work understand
the full form of XG boost is
Extreme gradient
boosting extreme gradient boosting so we
will basically understand about extreme
gradient boosting now extreme gradient
boosting uh will be actually used to
solve both classification and the
regression problem statement so first of
all let's understand how it basically
works if
you just talk about XG boost you
understand that it is a boosting
technique and internally it tries to use
decision trees so how does this decision
tree basically get constructed in
the case of XG boost and how it is
basically solved we are going to discuss
about it so whenever we start the XG boost
classifier understand that first of all
we create a specific base model suppose
if I say this is my base model and this
base model will be a weak learner okay
and this base model will always give an
output of probability of 0.5 in the case
of classification problem so suppose if
I say this is probability 0.5 then I
will try to create a field over here
this field is called as residual field
so first base model what I'm going to do
any data set that you give from here to
train it will always give you the output
as 0.5 so this is just a dummy base
model now tell me if my probability
output is is 0.5 if I want to calculate
the residual that basically means I need
to subtract approval minus this
particular value so what will be the
value over here 0 - 0.5 will be -0.5
1 - 0.5 will be 0.5 1 - 0.5 will be 0.5
and 0 - 0.5 will be -0.5 and this 1 - 0.5
will be 0.5 and this will also be 0.5
let's consider that I have one more
record uh and this specific record can
be anything uh because I want to keep
some more records over here so let's
consider that I have one more record
which is less than or equal to 50K and
if the credit score is normal you're
going to get zero so here also if I try
to find out the residual it will be
-0.5 now the first step I hope
everybody's understood we have to create
a base model okay this base model is
very much important because we have to
create all the decision Tree in a
sequential manner so the first
sequential base tree which is again
a decision tree kind of thing
you can consider but this is a base
model which takes any inputs and gives
by default the probability as 0.5 now
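the base-model step just described can be sketched with pandas; the table re-creates the seven lecture rows, and the residual column is approval minus the constant 0.5 prediction:

```python
# base model sketch: every row is predicted probability 0.5,
# residual = approval - 0.5 (assuming pandas is installed)
import pandas as pd

df = pd.DataFrame({
    "salary":   ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
    "credit":   ["bad", "good", "good", "bad", "good", "normal", "normal"],
    "approval": [0, 1, 1, 0, 1, 1, 0],
})
base_prob = 0.5                            # the dummy base model's constant output
df["residual"] = df["approval"] - base_prob
print(df["residual"].tolist())             # → [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
```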
let's go ahead and understand what are
the steps in constructing decision tree
after creating the base model the first
step is that we create a binary decision
tree so I'm going to write down all
the steps please make sure that you note
them down so step one create a
binary decision tree using the features
second step what we do is we
actually calculate the similarity weight
I'll
talk about this similarity weight what
exactly it is if I want to use a
formula it is the summation of residuals
whole square
divided
by the summation of probability into 1
minus probability plus Lambda I'll talk
about what exactly Lambda is it is a
kind of hyperparameter again so that it
does not overfit the third thing is that
we calculate the Information Gain okay
Information Gain so these are the steps
we basically use in creating an XG boost
classifier the first step is that we
create a binary decision tree using the
features then we go ahead with
calculating the similarity weight and
finally we go ahead and calculate the
information gain so how does it go ahead
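written out, the two quantities the steps above rely on look like this (note the numerator is the square of the summed residuals, as the speaker clarifies later, not the sum of squared residuals; the p_i come from the previous model and λ is the regularization hyperparameter):

```latex
\text{Similarity Weight} = \frac{\left(\sum_i r_i\right)^2}{\sum_i p_i\,(1 - p_i) + \lambda},
\qquad
\text{Gain} = SW_{\text{left}} + SW_{\text{right}} - SW_{\text{root}}
```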
let's understand over here and let's try
to find out okay now let's go ahead and
let's try to construct the decision tree
as I said that let's consider that I'm
considering salary feature So based on
using salary feature what I'm actually
going to do I am going to take this as
my node and I'm going to split this up
and remember whenever we are creating
decision Tree in this particular case it
will be a binary decision tree let's say
that in salary one split is less than or
equal to 50 and one is greater than 50 so
these two you obviously have in the
binary case in the case of credit where
there are three
categories I'll also show you how that
further split will happen and how that
will get converted into a binary tree so
here you have less than or equal to 50K
and greater than 50K now let's go ahead
and understand how many values are there
in this salary so if I see before the
split you can definitely see that I'm
going to use this residual and probably
train this entire model now if I really
wanted to find out the residuals
initially these are my residuals over
here so one residual is -0.5 then I have
0.5 over here then I have 0.5 then again
I have -0.5 then again I have 0.5 then
again I have 0.5 and finally I have
-0.5 so these are my total residuals
that are there suppose if I make this
split less than or equal to 50 First
less than or equal to 50 the residuals
what all things are there so here I'm
going to have -0.5 then less than or
equal to 50 again I'm going to have 0.5
then again less than or equal to 50 I'm
going to have 0.5 and for the last less
than or equal to 50 record which is
nothing but -0.5 so I hope you
understood this split so half of the
things came over here the remaining half
will be greater
than 50 so you have one value here one
value here one value here so it will be
-0.5 then you have 0.5 and then
finally you have 0.5 so these residuals how do we
get it guys see from the base model
which is by default giving 0.5 first my
data goes over here by default
probability I'm going to get 0.5 so
residual is basically calculated from
this probability and approval so this is
approval minus probability so if you
subtract 0 - 0.5 you're
going to get -0.5 1 - 0.5 you're going to
get 0.5 1 - 0.5 you're going to get 0.5 so
everybody I hope is very much clear with
respect to this so this is the first
step we constructed a binary tree now in
the second step it says calculate the
similarity weight now how to calculate
the similarity weight similarity weight
formula is the sum of residuals whole
square now let's say that
I'm going to calculate it for this node
okay the similarity weight now in this
particular case if I go and calculate my
similarity weight it will be the
summation of residuals whole square
these are my residual values so I'm
going to do the summation and then
square it okay so what do
you think the sum of residuals whole
square will be in this particular case
how I have to do it I will just take up
all these values like
-0.5 + 0.5 + 0.5 and
-0.5 whole square right I'm just going to
do the squaring of this divided by
understand what it is divided by it is
divided by probability of 1 minus
probability now where do we get this
probability value where do we get this
probability value value we get this
probability value from our base model
right so here I'm basically going to say
that we are going to do the summation of
probability of 1 minus probability 1
minus probability that basically means
for each and every point for each and
every Point what is the probability see
probability is basically coming from the
base model so for each Pro each point
I'm going to come compute two things one
is the probability and then 1 minus
probability and this I'm going to do the
summation
like this I will do it four times 0.5 *
1 - 0.5 then 0.5 * 1 - 0.5 and finally
you'll be able to see one more will be
there which
is 0.5 * 1 - 0.5 so this will be your total
things with respect to this so I hope
you have understood till here uh where
you are able to understand that what we
have done this is summation of uh
residual square and this is the
remaining probability multiplied by 1
minus probability now tell me what are
you able to find out from this if you
cancel this and this and this and this
the numerator is going to become zero so
this entire value is going to become zero
because 0 divided by anything is 0 so
here I hope everybody has understood what
is the similarity weight of this
specific node if I want to write it it
is nothing but zero now you may be
wondering where is the Lambda
value okay we will initially initialize
Lambda by 1 I'll talk about this hyper
parameter let's consider it as 1 so here
plus 1 or plus 0 actually let's consider
the Lambda value as 0 for right
now okay I'm just going to make Lambda
equal to 0 I'm just going to talk about
it because it is a kind of hyper
parameter so -0.5 - 0.5 + 0.5 + 0.5 if I
do the summation here you
will be able to see that I'm going to
get zero so this calculation we have
done and we have got the similarity
weight equal to 0 and let's go ahead
and calculate the similarity weight of
the next node and remember it's not
square first it is the whole square so
here also it is 0.5 + 0.5 now let's
do it for this if I want to find out the
similarity weight again see I'm going to
repeat it -0.5 + 0.5 + 0.5 whole square
and since
there are three points so I'm going to
basically use probability into 1 minus
probability for one point then plus
probability into 1 minus probability for
the second point and then probability
into 1 minus probability for the third
point and Lambda is zero so I'm not
going to write anything now let's go and
do the calculation for this node so
-0.5 + 0.5 becomes zero then 0.5 whole
square right so here I'm going to get
0.25 and in the denominator if you do
the calculation you are going to get
0.75 so this value is going to be 1 by 3
which is nothing but 0.33 so
the similarity weight for this node for
this node
is 0.33 so here you can see probability
multiplied by 1 minus
probability okay now the next step that
we do is that calculate the information
gain now you know how to calculate the
information gain but before that let's
do the computation for this also for
this root node also go ahead and
calculate the similarity weight of
this okay and
why is the base model probability 0.5
because understand that it is
a dummy model I have just put an if
condition there saying that it is going
to give 0.5 now do it for this one guys
the root node what will it be see I can
calculate from here only -0.5 and 0.5
cancel out this is also gone this is
also gone so the numerator
will be 0.25 divided by something now
tell me guys what should be for the root
node what is the similarity
weight
for this do this calculation everyone
I know it will be 0.25 divided by
this will be 1.75 are you getting this
similarity weight which will be nothing
but 1 by 7 and if I divide 1 by 7 if I
say what is 1 by 7 it
is 0.142 so it is nothing but 0.14 this
is the root node similarity
weight over here
0.14 so I know 0.14 here 0 here and 0.33 here now
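the numbers just derived can be reproduced with a short sketch; the helper function is hypothetical (the real XGBoost library does this internally) and uses the base probability 0.5 and λ = 0 exactly as in the lecture:

```python
# similarity weight = (sum of residuals)^2 / (sum of p*(1-p) + lambda)
# with p = 0.5 for every point and lambda = 0, as in the lecture
def similarity_weight(residuals, p=0.5, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) * p * (1 - p) + lam)

root  = [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]   # all seven residuals
left  = [-0.5, 0.5, 0.5, -0.5]                    # salary <= 50K branch
right = [-0.5, 0.5, 0.5]                          # salary > 50K branch

sw_root = similarity_weight(root)     # 0.25 / 1.75 = 1/7
sw_left = similarity_weight(left)     # 0    / 1.00 = 0
sw_right = similarity_weight(right)   # 0.25 / 0.75 = 1/3
gain = sw_left + sw_right - sw_root
print(round(sw_root, 2), sw_left, round(sw_right, 2), round(gain, 2))   # → 0.14 0.0 0.33 0.19
```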
see over here we calculate the
Information Gain next step the third
step what we do is that we calculate the
information gain now the Information Gain
in this particular case is nothing but the
child node similarity weights added up
minus the root node similarity weight so
I will be getting 0 + 0.33 for the split
nodes minus the root node similarity
weight 0.14 so
0.33 - 0.14 and if I do it it is nothing
but just open your calculator again and
0.33 - 0.14
is nothing but 0.19 I'm getting
0.19 as my information gain the
information gain of this specific tree I
got it as 0.19
obviously you know how the features
will get selected based on the
Information Gain but let's say that the
highest Information Gain that is given
by salary okay now we will go ahead and
do the further split let's go ahead and
do the further split so I know my
information gain now it is 0.19 and the
Information Gain is basically used to
select that specific node through which
the split will happen now I'll further
go and do the split let's say that I'm
going to do the further split with the
next feature that is which one credit so
I'm going to take credit over here I'm
going to take credit over here and again
I have to do a binary split again but
you may be wondering Krish here are
only three categories how are we going
to basically do this particular split
right because we don't know how to do
the split since we have three
categories over here so in this case
what I will do is that we what we can
definitely do is that in this particular
case the split that we are probably
going to do is that let's consider two
categories like good and normal at one
side bad at one side so here it becomes
a binary split again now let's go ahead
and let's try to see that how many data
points will fall here and how many data
points will fall here so for writing
down the data points: see, go through the path. If it is less than or equal to 50, it'll go this path, and if it is bad, then how much is the residual? We are going to get one residual over here first of all, so this is my one residual, that is -0.5. Then similarly, if I see less than or equal to 50 and good is there, right, good or normal is there, so here again 0.5 will come. I hope everybody is able to understand. See the second record: less than or equal to 50, we go in this path, but it is good, so we come over here. Again less than or equal to 50, good, again we are going to get one more 0.5. Then with respect to greater than 50, which is coming over here, we'll not worry about it right now. Again less than or equal to 50, normal, again it is -0.5. Right, so this many records
definitely coming over here only one
record is basically coming over here
then again we will start the same
process again we will start the same
process now for the same process what we
are going to do again try to calculate
the similarity weight now in order to
calculate the similarity weight what I
will do: I will basically say this is my similarity weight. This will become 0.25 divided by 0.25. Why? Because of this whole square, right, the summation of residuals whole square; but here I have only one residual, so the square will become (-0.5)², that is 0.25. And then in the denominator I'm going to basically write 0.5 × (1 - 0.5); this is for only one data point, so it is nothing but 0.5 × 0.5, which is nothing but 0.25. Right, so in this particular case I will get the similarity weight, I hope everybody is getting it, as 1. Now what about this similarity weight? If you want to compute it, it is again very, very simple: this and this will get cancelled, so the numerator will again be 0.25, divided by 0.25 + 0.25 + 0.25, which is 0.75; so this will be 1/3, that is nothing but 0.33, so the similarity weight will be 0.33. Then again I have to calculate the information gain of this node. What I will do, I will add this up, see: 1 + 0.33, I'll add like 1 + 0.33, minus 0. Why zero? Because the similarity weight of the one up, this particular credit node, is basically 0. So 1 + 0.33 - 0, this will be 1.33. So like
this further split will again happen
over here with different different node
and we will only be getting a binary
split but we will be comparing based on
Information Gain which one is coming
good now let's say that I have created
this path, I have designed and developed my entire binary decision tree, which is a speciality of XGBoost. Now
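As a side note, the similarity-weight and information-gain arithmetic used for these splits can be sketched in plain Python. This is only an illustration of the formulas from the walk-through, not actual XGBoost library code; the previous-round probability 0.5 and lambda = 0 are taken from the example:

```python
# Sketch of the XGBoost classification split math from the walk-through.
# Similarity weight = (sum of residuals)^2 / (sum of p*(1-p) + lambda),
# with every record's previous probability p = 0.5 in this example.

def similarity_weight(residuals, prev_prob=0.5, lam=0.0):
    numerator = sum(residuals) ** 2
    denominator = len(residuals) * prev_prob * (1 - prev_prob) + lam
    return numerator / denominator

def information_gain(left, right, parent, **kw):
    # gain = left-leaf similarity + right-leaf similarity - parent similarity
    return (similarity_weight(left, **kw) + similarity_weight(right, **kw)
            - similarity_weight(parent, **kw))

# the credit split: "bad" on one side, "good"/"normal" on the other
left = [-0.5]
right = [0.5, 0.5, -0.5]
print(round(similarity_weight(left), 2))                      # 1.0
print(round(similarity_weight(right), 2))                     # 0.33
print(round(information_gain(left, right, left + right), 2))  # 1.33
```

Whichever candidate split gives the highest gain is the one the tree keeps, which is exactly how salary and then credit were chosen above.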
what I'm going to do over here is that
see everybody what I'm going to do let's
consider the inferencing part let's say
this record is going to go how we are
going to calculate the output so this
first of all went to this base model now
let's go ahead and see how the
inferencing will happen suppose This
Record is going right so first of all
this record will go to this base model
the base model is giving the probability
as 0.5 so the first base model is
basically giving 0.5. Now based on this 0.5, how do we calculate the real probability? Okay, so we apply something called log odds, so we basically say log of P / (1 - P); this is the formula we apply only in the case of the base model. So if we try to see this, it is nothing but log of (0.5 / 0.5), and log of 1 is nothing but zero. So in the first case, whenever any record goes, I will be getting the zero value over here, okay, zero value over here. Then plus. Why
plus I'm doing because it will now go to
the binary decision tree now this record
will go to my binary decision tree;
whatever value I'm getting from this I'm
actually adding that up and now it will
go over here now when it goes over here
first of all let's see which branch it
is following it is following less than
or equal to 50 Branch first Branch over
here then this is bad it'll go and
follow here so here I can see that the
similarity weight is one now the
similarity weight is basically one in
this case so what we do in the case of
this we pass it to a learning rate
parameter so this specifically is my
learning rate multiplied by 1 one
because why similarity weight is one
over here. So this will basically be my first inference, and alpha over here is my learning rate; it can be a very small value based on the learning parameter that we use, like how we have defined learning rates elsewhere. On top of
this we apply an activation function
which is called as sigmoid since this is
a classification problem we apply an
activation function which is called as
sigmoid and I hope you know what is the
use of sigmoid. Based on this alpha value, the output will be between 0 and 1. Now I hope you are getting it, guys; this is how the entire
inferencing will probably happen now
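That inference walk can be sketched like this (a hedged illustration; the leaf value 1 comes from the tree above, and the learning rate 0.3 is just an assumed alpha):

```python
import math

# Sketch of XGBoost classification inference from the walk-through:
# start from the base model's log odds, add learning_rate * leaf value
# for every tree, then squash with sigmoid to land between 0 and 1.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(base_prob, leaf_values, learning_rate=0.3):
    score = math.log(base_prob / (1 - base_prob))  # 0 when base_prob = 0.5
    for leaf in leaf_values:                       # one leaf value per tree
        score += learning_rate * leaf
    return sigmoid(score)

# record lands in the leaf whose value is 1 in the single tree built so far
print(predict_proba(0.5, [1.0]))  # about 0.574
```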
similarly what I will do I will try to
construct this kind of decision tree
parall so we we can also write our
entire function will look something like
this: alpha 0 + alpha 1 times your decision tree 1 output, then alpha 2 times your decision tree 2 output, alpha 3 times your decision tree 3 output, alpha 4 times your fourth decision tree output, and like this till alpha n times your decision tree n output. And this will be your output finally when you're trying
to inference from any new
record now the reason why we say this as
boosting because see understand we are
going to add each and every decision
tree output slowly to finally get our
output with respect to the working of
the decision tree this is how XG boost
actually works. "Does credit need to be split further?" Yes, see, like this: similarly we can split credit, for example good and bad at one side, normal at one side, but whichever split gives more information gain, that will be taken into consideration, right. And this is how your entire XGBoost classifier works. It is very, very difficult to calculate all those things by hand, so that is the reason we say that XGBoost is also a blackbox model; this is basically a blackbox model. "Is it prone to overfitting?" See,
at one stage we also need to perform
hyperparameter tuning and this we
specifically say pre-pruning, we tend to do pre-pruning, and since we are
combining multiple decision trees? No, no: this decision tree is this one, this independent decision tree which I have created separately. After this, what I'll do is I'll create one more decision tree, so it'll be looking like this. See finally how it will look: so
this is my base model then my data then
my data will go to this decision tree
which I have actually done as a binary
split on different different records
then again we will make another decision
tree which will again be a binary tree
the splits will look like this then this
is my base model where I'm getting the
value as zero this will be alpha 1
multiplied by decision tree 1 which is
this then this is Alpha 2 multiplied by
decision tree 2 which is this and like
this we will keep on continuously adding
more decision trees unless and until
this entire things becomes a very strong
learner so this is how how we basically
do the combination of all these things
so I hope everybody is able to
understand about the XG boost classifier
now you may be thinking, how does the regressor work? "For a regression problem statement also, will the decision tree get constructed based on independent features?" Yes, and again the lambda value is a hyperparameter; we basically set up the lambda value with the help of cross validation. Now let's go ahead and discuss XGBoost regressor. The second algorithm that we will discuss is something called XGBoost regressor, and how does XGBoost regressor actually work? "Is the same fundamental followed in random forest?" No, in random forest it is completely different: there bagging happens, bagging happens. So over here
let's go ahead with the regressor so
here I'm going to take some example
let's say that I have this many
experience this many Gap and based on
that we need to determine the salary my
salary is my output feature let's say
the experience is 2 2.5 3 4 4.5 okay now
in this Gap let's say it is yes
yes no no yes and let's say that the
salary is somewhere around 40K it is
41k
52k and uh let's see some more data set
over here 60k and 62k now the first step
in classifier we created a base model
here also we'll try to create a base
model first of all this base model what
output it will give it will give the
average of all these values what is the
average of all these values, okay? What is the average of all these values: 40, 41, 52, 60, 62? If I just do the average, it is nothing but 51k. So by default I will create a base model which will take any input and just give the output as 51; this is the first step. Now based on this I will try to calculate my residual. How do I calculate my residual? I will just subtract 51k from 40k, so this will basically be -11k, and this will be -10k, this will be 1, this will be 9, and this will be 11. I hope everybody's able to
get this. Let's say that I make this as 42k, okay, just for making my calculation a little bit easy, so I have -9 over here. So these are my residuals. Then again, the first step is that I construct my decision tree. Now let's say that I'm going to use the experience feature over here, so this is my experience node, and based on this experience node I have my residuals over here. So here I will take up all my residuals: -11, -9, 1, 9, 11. And then how do I do the split based on experience? This is a continuous feature, so I have to basically do the split with respect to a continuous feature, which I have already shown you in decision trees. So here are my residuals: -11k, -9k, 1k, 9k and 11k. So now I will just take up my first node; here I'm going to use my experience feature. I know what values are going to come: -11 in the root node, -9, 1, 9 and 11. Now what we
are going to do over here is that so I'm
going to do again a binary split over
here now the binary split will happen
based on the continuous feature, that is experience. So two types of records I
may get one is less than or equal to two
and one is greater than 2 less than or
equal to two and one is greater than two
now less than or equal to two when I do
the split let's see how many values we
are getting. Less than or equal to two, I will get only one value, that is -11, and here I'm actually going to get all the other values: -9, 1, 9, 11. Now what we are
going to do after this is that calculate
the similarity weight. Now here the formula for the similarity weight will change a little bit with respect to
regression. So the similarity weight is nothing but the summation of residuals, whole square, divided by the number of residuals plus lambda. Again, here we are going to consider lambda as zero because this is a hyperparameter; the more the value of lambda, the more we are penalizing with respect to the residuals. So this will be the formula
that we are going to apply okay so let's
see for the first number that that we
want to apply so how this will get
applied again I'm going to write this
formula here it'll be better let's say
here similarity weight is equal to the summation of residuals, whole square, and here in the denominator you have the number of residuals plus lambda.
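A tiny sketch of this regression formula in plain Python (an illustration of the walk-through, not library code), reproducing the numbers that come up next:

```python
# Regression similarity weight from the walk-through:
# (sum of residuals)^2 / (number of residuals + lambda)

def similarity_weight(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

print(similarity_weight([-11]))                  # 121.0  (lambda = 0)
print(similarity_weight([-11], lam=1))           # 60.5   (penalized)
print(similarity_weight([-9, 1, 9, 11], lam=1))  # 28.8
```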
see previously we were using probability
and then all those things we are using
so if you want to calculate the
similarity weight of this: it will become 121 divided by (number of residuals is 1, plus lambda is 0), so this is going to be 121. So here we are going to calculate the similarity weight, which is nothing but 121. Now let's do one thing: if we probably take lambda equal to 1, then what will happen? Just think over here what may happen: we may directly penalize the similarity weight by just adding one, okay. So let's do that also. Suppose I say I'm going to take lambda equal to 1; then this will not be the value now. It will become 121 divided by (number of residuals is 1, plus 1), which is nothing but 60.5. Let's say that I now have 60.5 as my similarity weight. Now
similarly I will go ahead and compute the similarity weight for the next one. So here it will become (-9 + 1 + 9 + 11) whole square, divided by 4 + 1. So this -9 and this +9 will get cancelled; 12 squared is nothing but 144, and 144 divided by 5, if I go ahead and calculate it, is nothing but 28.8. So here I get 28.8; the similarity weight for this is 28.8. Similarly I can go ahead and calculate the similarity weight for the top one. It will be nothing but (-11 - 9 + 1 + 9 + 11) whole square, divided by 5 + 1, which is 6. The -9 and +9, -11 and +11 get cancelled, so the numerator will be 1, whole square; so anyhow it will be 1/6 only. So 1/6 will be my similarity weight over here, okay. Now finally the information gain that we need to compute will be very simple. What will be the information gain? 60.5 + 28.8 minus 1/6. So try to get it; just tell me what the output will be: 60.5 + 28.8 - 1/6 is 89.13. Understand, you don't have to worry about calculation;
automatically these things will be done, okay, so you don't have to worry. Now
see, now the decision tree can further be split any number of times. Probably the next split we can do is something like this: this will be
my experience the two splits that may
happen with respect to less than or
equal to 2.5 less than or equal to 2.5
or greater than 2.5 now if this probably
gives the Information Gain better then
the split will happen like this
otherwise whichever gives the better
information again the split will
basically happen like this. Let's say that this is the split that is required: -11 and -9 are over here, and then we have 1, 9 and 11, okay,
because less than or equal to 2.5, these two records will definitely go over here, and these three records will definitely go over here. Now if I try to calculate the similarity weight for this, it will be nothing but (-11 - 9) whole square, divided by 2 + 1. Right, in this particular case it will be (-20)² / 3; 20 into 20 is 400, divided by 3. So if I go and probably use a calculator and show it to you: 400 / 3 is nothing but 133.33. So the similarity weight for this is 133.33. Similarly I can go ahead and
compute for this: it will be (1 + 9 + 11) whole square, divided by 3 + 1. Right, so 1 + 9 is 10, and 10 + 11 is nothing but 21, whole square, divided by 4. So what is 21 whole square? If I open my calculator, 21 × 21 is nothing but 441, divided by 4, so this will be 110.25. And similarly I can go ahead and compute for the root: it will be the same thing that we have got over here, that is 1/6, so this will basically be 1/6. So finally, if I compute the information gain, what will it be? 133.33 + 110.25 - 1/6. Obviously this value will be
greater than the previous one that we have got, that is 89.13. So definitely we are going to use
this split which is better than the
previous split right let's say that this
split has been considered finally how do
we see the output okay I hope everybody
is able to understand right let's say
that this split has worked well so I'm
going to rub all these things out; 110.25 is there. Now suppose I want to do the inferencing, how will the inferencing be done? 133.33 here, 110.25 here. Now suppose any record
comes from here. First of all, any record that goes will go to the base model, so whenever it goes the value is 51: 51 plus alpha 1, this is my learning rate one. Suppose it goes in this route, then what we have is -11 and -9; whenever we go in this route, which has -11 and -9, the average of both these numbers will be considered. What is the average of both these numbers? (-11 - 9) / 2, this is nothing but -10, right, so -10 will get multiplied here. Suppose it goes in the other route, then what will happen: 1 + 9 + 11 divided by 3, the average will be taken, so 21 divided by 3, 7 will be there, so this will get replaced by 7.
that you are doing this is with respect
to decision tree 1 like this we will
again construct decision trees separately, and again it will become alpha 2 times decision tree 2, alpha 3 times decision tree 3, and like this you will be doing till alpha n times decision tree n, and once you calculate this, this will be your specific output in a regression tree. So
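The split comparison and the final regression prediction can be sketched as follows (lambda = 1 on every node as in the calculation above; the learning rate 0.1 is an assumed value):

```python
# Sketch: compare the two candidate splits on experience, then run inference.

def similarity_weight(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [-11, -9, 1, 9, 11]

# split at experience <= 2: {-11} vs {-9, 1, 9, 11}
gain_2 = (similarity_weight([-11], 1) + similarity_weight([-9, 1, 9, 11], 1)
          - similarity_weight(root, 1))
# split at experience <= 2.5: {-11, -9} vs {1, 9, 11}
gain_25 = (similarity_weight([-11, -9], 1) + similarity_weight([1, 9, 11], 1)
           - similarity_weight(root, 1))
print(round(gain_2, 2), round(gain_25, 2))  # 89.13 243.42 -> the 2.5 split wins

# inference: base prediction 51 plus learning_rate * leaf average
alpha = 0.1                      # assumed learning rate
leaf_output = (-11 + -9) / 2     # record falls in the left leaf: average -10
print(51 + alpha * leaf_output)  # 50.0
```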
in this particular case, what happens is that you're just trying to play with the parameters and use them in a different way to compute all these things. Everybody clear? But again, it is a blackbox model; you cannot visualize all these things. Now let's go to the third
algorithm, which is called SVM. See, SVM is almost like logistic regression, okay. The major aim of SVM is this: suppose I have data points like this, okay; we obviously use logistic regression to split these data points, right. Like
this we try to create a best fit line
which looks like this and probably based
on this best fit line we try to divide
the points. Now in SVM what we do is that we not only create a best fit line, but we also create planes which are called marginal planes. So like this we create some
marginal
plane so this is your hyper plane and
this is your marginal plane and
whichever plane has this maximum
distance will be able to divide the
points more efficiently but usually in
in a normal scenario you know whenever
we talk about hyper plane or whenever we
talk about marginal plane there will be
lot of overlapping of points right
suppose if I have some specific points I
have one point which looks like this I
may also have another points which may
overlap so it is very difficult to get
an exact straight marginal planes and
split the point based on this now this
specific marginal distance should be maximum, because we can create any type of best fit line and use this marginal plane. Now if we
have this overlapping right if for what
do we call for this kind of plane this
kind of plane is basically called as
hard marginal plane so this is basically
called a hard marginal plane, okay, and
similarly if any points are overlapping
suppose this yellow points can also get
overlapped over here and there may be
some kind of Errors so for this
particular case we basically say as soft
marginal plane because here we will be
able to see that errors will be there
now in SVM what we focus on doing is
that we focus on creating this marginal
plane with maximum distance even though
there are some errors we consider it in
solving it by providing some kind of
hyper parameter now how do we go ahead
and basically create this all marginal
planes and how do we go ahead with this
it's very much simple uh just imagine in
this specific way that initially let's
consider that I have this data point
suppose this is my
best fit line how do we give this best
fit line as an equation? We basically say y = mx + c, right. "No hard margin?" A hard margin is impossible in a normal data set; obviously you'll not be able to get it, but definitely we go ahead with creating a soft marginal plane. Now, y = mx + c: what does this m
indicate m is nothing but slope and C
indicates nothing but intercept
can I say that both these equations are the same: ax + by + c = 0? Can I also say that this is the equation of a straight line? I will say that both of them are equal. See, if I try to prove this to you: if I take this equation and try to find out y, it will be nothing but y = (-ax - c) / b. So here you can see that it is almost the same; in this particular case my m value will be -a/b and my intercept will basically be -c/b, so both the equations are the same.
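A quick numeric check of this equivalence (the coefficients are hypothetical, chosen only for illustration):

```python
# ax + by + c = 0 describes the same line as y = mx + k
# with slope m = -a/b and intercept k = -c/b.

a, b, c = 2.0, 4.0, -8.0
m = -a / b
k = -c / b

for x in [-3.0, 0.0, 5.0]:
    y = m * x + k              # point taken from y = mx + k
    print(a * x + b * y + c)   # 0.0 every time: the point satisfies ax+by+c=0
```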
so let's consider that this is my
equation and I am actually and whenever
I say Y is equal to mx + C can I also
write something like this: y = w1·x1 + w2·x2 + ... + c, or plus b, same thing, no? So here also we can write y = w transpose x + b, the same equation,
right we are basically using same
equation yes we can also write it in a
different way but at the end of the day
we are also treating something like this
let's say that this slope is in this
direction if this slope is in this
direction then I can basically say that
let's consider that the slope is minus
one
let's say that this slope is minus one
see it is in the negative Direction
let's say that this slope is minus one
I'm just trying to prove that this slope
is negative value let's consider this
now suppose this is one of my points, (-4, 0), and obviously this particular line is given by this equation. Now if I really
want to find out the Y value let's say
that this is my
X1 this is my X1 and this is my X2 let's
say that
I want to find out I want to find out
this W transpose x + b the Y value based
on this line if I want to compute the y-
value based on this line how will I
compute W transpose X basically means
what w value what all things will be
there one value is B right B is
intercept right now intercept is passing
from origin can I say my B will be zero
obviously I can assume that b will be
zero now in this particular case if I
talk about w, w in this case is minus one, which I have initialized over here. So if I want to do this matrix multiplication, w transpose can be written like this, and this x value can be written as (-4, 0), right. So I can basically write it like this. Now if I do this multiplication, what value will I get? I will basically get four, right, so this is a positive
understand since this is a positive
value any points that are below this
line any points that I consider below
this line and if I try to calculate the
Y can I say that it will always be
positive yes or no similarly if I could
probably consider one point over here as (4, 4). Now tell me, for this (4, 4), if I calculate the value, what will you get? Will you get a positive value or a negative value? Because here only positive values we'll be getting, right. So if I calculate the value, will it be negative or positive? Just try to calculate. How do you calculate? Again I will use the same equation; this time again my slope is minus one, my intercept is zero, and here I will have (4, 4). Now here it is -4, and then plus 0, this will be -4, right, so this will be a negative
value. Negative, guys, see: -4 + 0, negative. So for any point that I will probably have on top of this, any points above this plane, right, if I try to calculate the value, it will
always be negative so what two things
you are able to get positive and
negative so you can consider this
entirely one category this another
category at least these two things you
can basically
consider guys I hope everybody is able
to understand this so this will be my
one
category and this will be my another
category obviously so that basically
means I can definitely use a plane and
split this point I hope everybody is
able to understand now let's go ahead
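Before that, the side-of-the-plane check just described can be sketched like this (w = [-1, -1] and b = 0 are assumed values that describe the line y = -x and reproduce the signs from the walk-through):

```python
# Which side of the plane w.x + b = 0 a point falls on is just the sign
# of w.x + b; one sign is one category, the other sign is the other.

def decision_value(w, x, b=0.0):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [-1.0, -1.0]                        # the line x + y = 0, i.e. y = -x
print(decision_value(w, [-4.0, 0.0]))   # 4.0  -> positive side
print(decision_value(w, [4.0, 4.0]))    # -8.0 -> negative side
```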
and let's see how this marginal plane
will get created and what is the cost
function to basically do this or what is
the cost function in making sure that
the marginal plane will definitely work
right it becomes difficult right so
suppose let's consider an
example suppose I say that this is my
lines let's say uh I want to basically
create a kind of I have two variety of
points one is this point let's say I
have all this points like this and the
other points I have somewhere here let's
consider I am just using directly good
number of points so that I can split it
okay because I will try to talk about it
what I'm actually trying to prove so
obviously this is my best fit line that
splits and apart from that what I will
do is that I'll also create a marginal
points so in order to create the
marginal point I may use some different
color let's see which color this will be
my one marginal point remember it will
be to the nearest point over here and
basically we will construct like like
this and similarly here we will be
constructing like this I've already told
you guys this equation can be written as w transpose x + b = 0, right. I can definitely say this because ax + by + c = 0, so this I can also write as w transpose x + b = 0; both are the same, okay. This I don't have to prove;
I hope everybody's clear with this now
what I'm going to do let's represent
this line also with some equation so
this line, if I want to represent it, will be w transpose x + b equal to what value over here, positive or negative? See, from this line, anything above this plane, any distance that we try to find out will always be negative. So let's say that I'm using it as minus one, to just read it as a negative value, and this other line that I am going to mention will be w transpose x + b = +1. Minus one above, plus one below, because we have already discussed that from this point, if you're trying to calculate the value, it is always going to be plus one, and this is going
to be minus one. Here I should definitely say this as k, okay, but I'm not mentioning k; in many articles you'll see it as minus one, many research papers also use minus one, but I would like to specify minus and plus k. Here, though, let's go and write minus one and plus one. Now my aim is to increase this
distance okay this distance I really
want to increase this distance now in
order to increase this if I increase
this distance that basically means my
model is performing well so let's say I
want to find this distance first of all
so if I write w transpose x1 + b = 1, and here I will write w transpose x2 + b = -1, what I'm going to do is the computation: I'm going to subtract them like this. So here obviously this will be my x1, this will be my x2, okay, because these are my points on the two planes, x2 and x1. So I can write w transpose (x1 - x2); b and b will get cancelled, and here I will be writing 2, right. So from here we can definitely write two different
things let's see what all things we can
write so here this is nothing but the
difference between my this plane and
this plane which is given by like this
okay, now always understand: whenever we consider any vectors, right, a vector also has something called magnitude. So if I want to remove this magnitude, I can divide by the magnitude of w; then only the unit vector will remain, which is indicated like this. So I'm going to divide both sides by this magnitude of w, and I don't care about the direction over here right now; we just care about the vectors. Now when I write it like this, what is our aim? Our aim is to maximize 2 divided by the magnitude of w. Can I say this, guys, yes
or
no what is our aim our aim is to
basically maximize this right by
updating W comma B value I need to
maximize this yes everybody's clear with
this can I say that yes I want to
maximize this yes or no everybody I want
to maximize this if I maximize this that
basically means my marginal plane will
become bigger my marginal plane will be
bigger, okay. Now can I write along with this: such that y of i, my output, will be dependent on two different things. One, I can say that my y of i is +1 when w transpose x + b is greater than or equal to 1. Everybody, see in this equation what I'm actually trying to specify: such that y of i is +1 when w transpose x + b is greater than or equal to 1, and when it is -1, that basically means w transpose x + b is less than or equal to -1. Now
what does this basically mean? See, whenever I compute w transpose x + b greater than or equal to 1, I'm obviously going to get this +1; when w transpose x + b is less than or equal to -1, I'm always going to get the output as -1. That is the reason why I have actually written it like this. So these two we have already discussed: we want to increase the marginal plane, which is this, this is my marginal plane, and I'm writing one condition that my yi value will be +1 when w transpose x + b is greater than or equal to 1, otherwise, when it is less than or equal to -1, it is going to be -1. It is very much clear with this condition; we have already done it. Everybody clear
with this now on top of it we can add
one more very important Point instead of
writing such that and all you can also
say that our major
aim is that if I multiply yi by (w transpose xi + b), this product will always be greater than or equal to 1 for correct points, right, for correct points. Because understand, if it is minus one and I'm multiplying with this, and if it is a correct point, minus into minus will obviously be greater than or equal to one only, right. Similarly for the other side it will be greater than or equal to 1. So I can also definitely say that my major aim: if I multiply y of i with this, it will always be greater than or equal to +1, which is definitely saying that it will be a positive value. So this is just a
representation guys but understand what
is the cost function: this is my maximized cost function. Now I'm going to again write it down: maximize over w, b of 2 divided by the magnitude of w. I can also write something like this: minimize over w, b, and I can just inverse it, which looks like magnitude of w divided by 2. Are these both the same or not?
because always understand in machine
learning algorithm why do we write
minimize things because we are trying to
minimize something okay both are
equivalent these both are equivalent and
why we specifically write minimization
because, like in backpropagation, we are continuously updating the values of w and b, so we can definitely write
like this so here my main target is to
minimize this particular value by
changing W and B and I will start adding
some more parameters over here this is
fine till here I think everybody has got
it this is our aim and we are going to
do this but I'm going to add two more
parameters in this optimizer: one is C, and one is the summation from i = 1 to n of something called eta of i, the slack for each error point (usually written ξi). First of all I'll tell you what C is. See, if I have this specific data point, let's say some of my points are over here, then is it a right prediction or a wrong prediction? If some of my points are over here, is it a right prediction or a wrong prediction? Obviously it is a wrong prediction. If my points are somewhere here, is it a wrong prediction? Wrong, incorrect prediction, right. So this C value basically says how many errors we can have: if it says fine, we can have six errors or seven errors, that is how many errors we can have even though we are using the marginal plane. So here I'm specifically writing how many errors we can have; this is what is specified by C. Eta of i basically says, since we are doing the summation, this entire term basically mentions the summation of the distances, the distance
of the wrong points and how do we
calculate the distance from here to here
suppose this is a wrong point I will try
to calculate the distance from here to
here I will do the sumission of this
I'll do the sumission of this I will do
the sumission of this similarly for the
Green Point another sumission will
happen from here to here like this here
to here and we going to do that specific
sumission so we are telling that fine if
you are not able to fit properly try to
apply this two hyperparameters and try
to make sure that this many errors are
also there it is well and good no
problem we will go ahead with that try
to do the submission of the data points
and based on that try to construct the
best fit line along with the marginal
plane like this even though there are
some errors over here or errors over
here we are good to go with respect one
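For reference, the objective being described is: minimize ½·||w||² + C·Σᵢ ξᵢ. As a sketch of my own (not from the session; it assumes scikit-learn and a made-up toy dataset), the C hyperparameter of sklearn.svm.SVC plays exactly this role:

```python
# Not from the session: a sketch of how C in the soft-margin objective
#   minimize 0.5 * ||w||^2 + C * sum(xi_i)
# trades margin width against tolerated training errors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so a perfect linear separation is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates many slack violations (wide margin, many
    # support vectors); a large C penalizes every violation heavily.
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.2f}")
```

Lowering C widens the margin and lets more points violate it; raising C forces the fit to respect nearly every training point.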
One more concept is there, which is called SVR, support vector regression. In SVR only one thing gets changed: only this loss term will be different; everything else remains the same. I want you all to explore this and let me know; this will be one assignment for you. If you change that particular term, the formulation becomes SVR, so just try to explore it, find it out, and let me know.
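As a starting point for that assignment (my own sketch, not from the session; scikit-learn and a synthetic dataset assumed): in SVR the changed term is the ε-insensitive loss, so residuals inside an epsilon-wide tube around the fit cost nothing.

```python
# Not from the session: SVR with the epsilon-insensitive loss.
# Residuals inside the epsilon tube around the fitted line cost nothing,
# so only points on or outside the tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, 80)).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0.0, 0.2, 80)  # noisy y = 2x + 1

model = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)

# Most residuals fall inside the tube, so far fewer support vectors
# than training samples.
print(len(model.support_), "support vectors out of", len(X))
print("prediction at x=2:", model.predict([[2.0]])[0])
```

Try shrinking epsilon toward 0 and watch the support-vector count grow; that is the one term that distinguishes SVR from the classification formulation above.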
So overall, did you like the entire session, everyone? Okay, one more thing is there, which is called the SVM kernel, the kernel trick. Now, in the SVM kernel, what happens? Suppose I have data points that look like this, say one class forming a ring around the other. We obviously cannot use a straight line to divide them. So what do we do? We convert these two dimensions into three dimensions and push the points apart: one class goes up, the other class goes down, and then we can basically use a plane to split them. I uploaded a video around that, and you can definitely have a look at it; I have also shown you practically how to do it, and that is the reason I've created that specific video. So great, this was it from my side. I hope you liked this session. Thank you everyone, have a great day; keep on rocking, keep on learning, and never give up.
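To close with a concrete illustration of the kernel idea above (my own sketch, not from the session; scikit-learn assumed): concentric rings cannot be split by a straight line in 2-D, but an RBF kernel implicitly lifts the data so a separating plane exists.

```python
# Not from the session: the 2-D -> higher-dimension kernel idea in code.
# make_circles produces one class as a ring around the other, which no
# straight line can separate; the RBF kernel handles it easily.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:", rbf.score(X, y))
```

The linear kernel stays near chance on this data while the RBF kernel separates the rings, which is the same push-one-class-up, push-one-class-down picture described in the session.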
On the cost-function graph, the x-axis is theta 1 with ticks at 0.5, 1, 1.5, 2, 2.5, and the y-axis is J(theta 1) with the same ticks. Right now theta 1 = 1 and at that value J(theta 1) = 0, so that is my first point on the graph. I have already discussed why the factor is 1/(2m): dividing by m averages the summation, and the extra 1/2 makes the calculation simpler when we differentiate. Now let's take the second scenario: theta 1 = 0.5. Then for x = 1 the prediction is 0.5 * 1 = 0.5; for x = 2 it is 0.5 * 2 = 1; and for x = 3 it is 0.5 * 3 = 1.5. When I draw this best-fit line (the green one), the slope has clearly decreased. If I calculate J(theta 1) with the same equation, each term is (predicted point minus real point) squared: the first is (0.5 - 1)^2, because the real point is 1 and the predicted point is 0.5; the second is (1 - 2)^2; and the third is (1.5 - 3)^2. So I get (1/(2*3)) * (0.25 + 1 + 2.25) = 3.5/6, which is approximately 0.58. So with theta 1 = 0.5 I get J(theta 1) ≈ 0.58, which gives me the next point on the graph, again in green. Now the third condition: theta 1 = 0. Then 0 multiplied by any x is 0, so all three predictions are 0 and the "line" is simply the x-axis. J(theta 1) becomes (1/6) * ((0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2) = (1/6) * (1 + 4 + 9) = 14/6, which is approximately 2.33. So with theta 1 = 0 I get roughly 2.33,
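The three scenarios above can be reproduced with a minimal sketch of the cost function, using the lecture's toy points (1,1), (2,2), (3,3) and theta 0 fixed at 0:

```python
# Cost J(theta1) for h(x) = theta1 * x on the lecture's toy data.
# J(theta1) = (1 / (2m)) * sum((theta1 * x - y)^2)

def cost(theta1, xs, ys):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
for t in (1.0, 0.5, 0.0):
    print(t, round(cost(t, xs, ys), 2))
# theta1 = 1   -> 0.0
# theta1 = 0.5 -> 0.58
# theta1 = 0   -> 2.33
```

Plotting `cost(t, xs, ys)` for many values of `t` traces out exactly the U-shaped curve described next.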
so with theta 1 = 0 my point sits at 2.33. Similarly, if I compute with theta 1 = 2 I get another point over on the right, and when I join all these points together you will see I get this U-shaped curve. This is the cost-function curve, and gradient descent, which we are about to discuss, plays a very, very important role in making sure you get the right theta 1, the right slope value. Now which is the most suitable point? It is the bottom of this curve, which is called the global minimum, because out of the three lines we drew, the best-fit line is the one whose cost landed there: at that point the distance between the predicted and the real points is the smallest. But so far, Krish, you have just assumed theta 1 = 1, then 0.5, then 0, calculated each cost, and drawn the curve. What we really want is to start at one point on the curve and then move toward the global minimum. For that we use a convergence algorithm, because once we reach some starting point we just need to keep updating theta 1 instead of trying different theta 1 values by hand. The convergence algorithm says: repeat until convergence (think of it as a while loop), updating theta_j := theta_j - alpha * d/d(theta_j) J(theta 0, theta 1). I'll talk about this alpha shortly; the ":=" means continuous updating, and the d/d(theta_j) term is the derivative, which is nothing but the slope. This equation will definitely work, trust me, and I'll draw it to show you why. Let's say on the cost curve my first point lands over on one branch, but I have to reach the bottom; this axis is theta 1 and that one is J(theta 1). Suppose I arrive at a point on the right branch: I apply the derivative to J(theta 1) at that point, which means finding the slope, and to find the slope we draw the tangent line there. With respect to that tangent, this is a positive slope. How do we tell? Because the right-hand side of the line points in the upward direction; that is the easiest way to decide whether a slope is positive or negative.
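The idea that "the derivative is the slope, and its sign tells you which side of the minimum you are on" can be checked numerically. This is an illustrative sketch using a central finite difference on the same toy cost function (not part of the lecture itself):

```python
# Estimate the slope of J(theta1) numerically; its sign says which way to move.

def cost(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def slope(theta1, h=1e-6):
    # central difference approximation of dJ/dtheta1
    return (cost(theta1 + h) - cost(theta1 - h)) / (2 * h)

print(slope(2.0))   # positive: right of the minimum, so decrease theta1
print(slope(0.5))   # negative: left of the minimum, so increase theta1
```

For this data the minimum is at theta 1 = 1, so the slope is positive to the right of 1 and negative to the left, exactly matching the tangent-line picture.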
Now, in this particular case the slope is positive, so I update theta 1 with the convergence algorithm: theta 1 := theta 1 - alpha * (derivative), where alpha is the learning rate (I'll get to it, don't worry). Since the slope is positive, I am subtracting a positive number from theta 1, so theta 1 decreases, and after some n iterations I will come down to the global minimum. Similarly, if I take the left-hand side and draw the tangent there, the slope is negative, so the equation becomes theta 1 := theta 1 - alpha * (negative number). Minus into minus is plus, so theta 1 increases, again moving toward the global minimum. So whether the slope is positive or negative, this rule works and we reach the global minimum. Now what is this learning rate? It controls the speed at which we move from our current point toward the global minimum. Usually we select a small learning rate such as 0.01: with a small value the algorithm takes small, steady steps toward the optimum. But if alpha is a huge value, the updates to theta 1 will keep jumping here and there, and the situation will be that it never reaches the global minimum. It should also not be extremely small, because then it takes such tiny steps that it takes forever to converge, meaning the model keeps on training itself. So choosing a modest alpha is a very good decision. Now let me talk about one more scenario: what if my cost function has a local minimum, and one of my points ends up there? At a local minimum the slope is zero, so the update becomes theta 1 := theta 1 - alpha * 0, that is, theta 1 stays equal to theta 1, and you may think we will be stuck in that local minimum. But with the cost function and gradient descent equation we use for linear regression, we do not get stuck, because this cost curve always looks like a single convex bowl. In deep learning, though, when we learn about gradient descent and ANNs, there are lots of local minima, and for that we have different gradient-descent variants like RMSprop and the Adam optimizer which solve that specific problem. I mention this because tomorrow, if someone asks you the interview question "do you see any local minima in linear regression?", you can answer that the cost function we use will definitely not give us local minima.
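The learning-rate discussion above can be made concrete with a one-parameter gradient descent on the same toy problem. This is a sketch: the two alpha values are illustrative choices, and `descend` is a made-up helper name.

```python
# Gradient descent on J(theta1) for the toy data (1,1), (2,2), (3,3).
# A modest alpha converges to the global minimum at theta1 = 1;
# a huge alpha overshoots every step and diverges.

def grad(theta1):
    xs, ys = (1, 2, 3), (1, 2, 3)
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m

def descend(theta1, alpha, steps=100):
    for _ in range(steps):
        theta1 = theta1 - alpha * grad(theta1)
    return theta1

print(descend(0.0, alpha=0.1))   # settles at the global minimum, theta1 = 1
print(descend(0.0, alpha=0.5))   # jumps back and forth, moving ever farther away
```

With alpha = 0.1 each step shrinks the distance to the minimum; with alpha = 0.5 each step multiplies it by a factor larger than 1, which is exactly the "keeps jumping here and there" behaviour described above.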
But with deep learning techniques such as ANNs we have different kinds of optimizers that solve that particular problem; that is the answer you have to give. Now let me write out the gradient descent algorithm in full. Remember, guys, gradient descent is an amazing algorithm and you will definitely be using it, so please make sure you know it perfectly. One common question: when will convergence stop? Convergence stops when we come near the region where J(theta) is very, very small. So again: repeat until convergence, theta_j := theta_j - alpha * d/d(theta_j) J(theta 0, theta 1), for j = 0 and j = 1, since we need both theta 0 and theta 1. Now we really need to find out what that derivative is. J(theta 0, theta 1) is our cost function, so d/d(theta_j) J(theta 0, theta 1) = d/d(theta_j) [ (1/(2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2 ]. Guys, think of it like the derivative of (1/(2m)) * x^2: that is (2/(2m)) * x = x/m, because the 2 coming down from the square cancels the 2 in 1/(2m). So for j = 0 we get d/d(theta 0) J = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)): the square has gone and the 2s have cancelled. For j = 1, remember h_theta(x) = theta 0 + theta 1 * x, so by the chain rule we additionally multiply by the derivative of theta 1 * x with respect to theta 1, which is just x. That gives d/d(theta 1) J = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x^(i). So, repeating until convergence, the two final updates are: theta 0 := theta 0 - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)), and theta 1 := theta 1 - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x^(i).
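The two update rules can be sketched as a small batch-gradient-descent loop. The data set here (points roughly on y = 2x + 1) and the hyperparameters are made up for illustration; `fit` is a hypothetical helper name.

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x, applying both
# update rules simultaneously on every pass over the data.

def fit(xs, ys, alpha=0.05, steps=5000):
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(steps):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m                             # d/d(theta0) J
        g1 = sum(e * x for e, x in zip(errs, xs)) / m  # d/d(theta1) J
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

t0, t1 = fit([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
print(round(t0, 3), round(t1, 3))         # approaches theta0 = 1, theta1 = 2
```

Note that both parameters are updated together from the same batch of errors, which is what the "repeat until convergence" block with two simultaneous updates expresses.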
Alpha is the learning rate, guys, which we initialize to some small value like 0.01. And again, h_theta(x) = theta 0 + theta 1 * x, so the derivative of theta 1 * x with respect to theta 1 is nothing but x, which is exactly where the extra x^(i) in the theta 1 update comes from. One more note: if you have multiple features like x1, x2, x3, x4, the cost function becomes a higher-dimensional convex surface, and gradient descent on it is just like coming down a mountain. Now let's discuss two important performance metrics for linear regression: R square and adjusted R square. We use these to verify how good our model is. R square is given by the formula R^2 = 1 - (sum of residuals / sum of totals), where the sum of residuals is SS_res = sum (y_i - y_i_hat)^2 (y_i_hat is nothing but h_theta(x^(i)), the prediction) and the sum of totals is SS_tot = sum (y_i - y_mean)^2. Let me explain what this formula says. Suppose these are my data points and I create the best-fit line; y_i_hat is the predicted point on that line (the green points), and the sum of residuals is the sum of squared differences between each real point and its predicted point. Next, y_mean (y bar) is the mean of y: if I calculate it, I get a horizontal line, and SS_tot is the sum of squared distances from each point to that mean line. For a good fit, the numerator SS_res will be low and the denominator SS_tot will be high, because distances to the mean line are obviously larger than distances to a well-fitted line. Low divided by high is a small number, and 1 minus a small number is a big number, which shows our model has fitted properly: a very good R square. Now tell me, can R square be negative? Say in the good case I got 90%. There will be situations where it is negative: if I create a terrible best-fit line, SS_res can become higher than SS_tot, and then 1 minus something greater than 1 is negative. In the usual scenario this will not happen, because we fit a line that is at least better than the mean line; it is not just pulling a line from somewhere. Now here is one notable property of R square. Suppose I am predicting the price of a house and my feature is the number of bedrooms, and I get an R square of, let's say, 85%. If I add one more feature, location, which is definitely correlated with price, there is a definite chance R square increases, say to 90%. But now suppose I add another feature: the gender of who is going to stay, male or female. You know gender is in no way correlated with price, yet even then there is a scenario where R square still increases, maybe to 91%. The R square formula works in such a way that if I keep adding features, even ones that are nowhere correlated with the target, it keeps increasing my R square. This should not happen, because whether a male or a female stays does not matter at all.
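The R square formula, including the worse-than-the-mean-line case that makes it negative, can be sketched directly; the data and prediction values below are invented for illustration.

```python
# R^2 = 1 - SS_res / SS_tot, where SS_res uses the model's predictions
# and SS_tot uses the horizontal mean line.

def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y, p)[0] ** 0 * (y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [1, 2, 3, 4]
good = [1.1, 1.9, 3.2, 3.9]  # close to the data: R^2 near 1
bad = [4, 1, 5, 0]           # worse than predicting the mean: R^2 below 0
print(r_squared(ys, good))
print(r_squared(ys, bad))
```

The second call shows exactly the negative case discussed above: SS_res exceeds SS_tot, so the ratio is greater than 1 and R square drops below zero.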
Still, when you do the calculation, R square will increase. And that is the danger: right now my model with two features gives 90%, and as soon as I add gender I see R square of 91%, so that model would get picked because it appears to perform better and gives a higher R square, even though gender is not at all correlated; the two-feature model should have been picked. To prevent this situation we use something called adjusted R square, a very nice concept. Adjusted R square is given by the formula R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the total number of samples and p is the number of features, or predictors. In the first scenario my number of predictors was two (bedrooms and location) and R square was 90%; after the calculation my adjusted R square comes out a little bit less, let's say 86%. Now with three predictors, where one feature (gender) is nowhere related, R square increases to 91%, but adjusted R square does not increase; it in turn decreases, say to 82% (86 and 82 are just illustrative values I have considered). So R square went up while adjusted R square came down. How does that happen? Look at the formula: if I put p = 3, then n - p - 1 becomes a smaller number, so (n - 1) / (n - p - 1), a bigger number divided by a smaller number, becomes bigger; multiplying (1 - R^2) by that bigger factor gives a bigger number, and 1 minus a bigger number gives a decreasing value. R square is only there to offset this if the new feature genuinely helps: always remember, when the features are highly correlated with the target your R square increases tremendously, but if a feature is barely correlated there is only a small increase, not a huge one. So with p = 2 the factor (n - 1)/(n - p - 1) is smaller than with p = 3; when p = 3 the denominator shrinks further, and since the tiny bump in R square from an uncorrelated feature cannot compensate, there may well be a scenario where adjusted R square falls.
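Plugging the lecture's example numbers into the adjusted-R-square formula shows the effect; the sample size n = 10 is an assumed value for illustration (the lecture does not state one), so the exact outputs differ from the 86%/82% figures quoted above.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
# Adding a useless predictor nudges R^2 from 0.90 to 0.91,
# yet adjusted R^2 goes down because the (n - p - 1) penalty grows.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 10  # assumed number of samples
print(adjusted_r2(0.90, n, p=2))  # two predictors: bedrooms, location
print(adjusted_r2(0.91, n, p=3))  # plus the uncorrelated gender feature
```

Even though the second call is fed a higher plain R square, it returns a lower adjusted value, which is precisely the penalty mechanism described above.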
So even though my adjusted R square was 86% before, because gender is nowhere correlated I basically end up with 82% from this entire equation. I hope you are understanding this; it is a very, very important property. The simple way to state it: as p, the number of predictors, keeps increasing, whatever R square I get keeps getting adjusted downward, and the adjusted R square will always be less than the plain R square. There was an interview question asked of one of my students: between R square and adjusted R square, which will always be bigger? The student said R square, and was then asked to explain adjusted R square and why that happens. Now, today's agenda: one, Ridge and Lasso regression; two, the assumptions of linear regression; three, logistic regression; four, the confusion matrix; and five, practicals for linear, Ridge, Lasso and logistic regression. So the first topic is Ridge and Lasso regression; let's understand it. If you remember, in our previous session we discussed linear regression, its cost function, R square and adjusted R square, and gradient descent. The cost function was (1/(2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2, and this cost function gave us the convex curve with respect to J(theta 0, theta 1). Now let me give you a scenario: let's say I just have two data points, which look like this.
Okay, now if I have these two specific points, I will create a best-fit line, and that line will definitely pass through both points exactly. If I calculate the cost function, what is the value of J(theta 0, theta 1)? Let's say the line passes through the origin, so theta 0 is zero; since there is no difference between predicted and actual anywhere, the cost is obviously zero. Now understand: this data that you see is called training data; the two points I plotted are the training data. So what is the problem with this? Right now the line created by the hypothesis passes through every point, which is why the cost is zero, and since our main aim is to minimize the cost function, that seems absolutely fine. But imagine that tomorrow new data points come in, and a new point lands away from this line. If I predict for that point, my predicted value sits on the line, and the difference between the predicted and the real point is quite huge, yes or no? This creates a condition called overfitting: even though my model trained well on the training data, since every training point lies exactly on the best-fit line, it causes something called overfitting.
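The two-point scenario can be sketched in a few lines: the line fits the training points perfectly (zero training cost), yet a new point off the line gets a large error. The coordinates here, including the hypothetical test point, are invented for illustration.

```python
# Fit a line exactly through two training points, then evaluate a new point.

def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return lambda x: intercept + slope * x

h = line_through((1, 1), (2, 4))        # two training points: fit is exact
train_error = (h(1) - 1) ** 2 + (h(2) - 4) ** 2
test_error = (h(3) - 4.5) ** 2          # hypothetical new point (3, 4.5)
print(train_error, test_error)          # 0.0 on training, large on the new point
```

Zero error on the training data alongside a big error on unseen data is exactly the overfitting condition defined next.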
you really need to understand what is overfitting now what does overfitting mean overfitting basically means my model performs well with training data but it fails to perform well with test data now what is the test data over here the test data is basically this points the real test data answer was this points but because the my line is like this I'm actually getting the predicted point over here so this distance if I try to calculate it is quite huge so in this scenario whenever I say my model performs well with training data and it fails to perform well with test data then this scenario we say it as overfitting so this scenario when the model performs well with training data I have a condition which is called as low bias and when it fails to perform with the test data then it is basically called as high High variance very important okay I will make each and everyone understand one by one if it is performing well with the training data that is basically low bias and whenever it performs well with the test sorry fails to perform well with the fails to perform well with the test data then it is basically High variance now similarly I may have another scenario which is called as underfitting so let's say that I have something called as underfitting now in this underfitting what is the scenario the model fails to perform it gives bad accuracy I say that model always remember whenever I talk about bias then you can understand that it is something related to the training data whenever I talk about test data at that point of time you talk about variance and that specifically whenever you talk about variance that basically means we are talking about the test data so for an overfitting you will basically have low bias and high variance low bias with respect to the training data and high variance with respect to the test data now if the model accuracy is bad with training data and the model accuracy is also bad with test data in this scenario we basically say it as 
underfitting so these are the two conditions that are with respect to underfitting that basically means that both for the training data also the model is giving bad accuracy and again for the test data also it is basically having a bad accuracy so in this particular scenario we can definitely say two things out of underfitting one is high bias and high variance so this is the condition with respect to underfitting very super important let me just explain you once again suppose let's consider I have one model I have model two this is model one this is model one this is model two and this is model 3 okay guys so suppose let's say that I have my model my training accuracy is let's say 90% And my let's say that my test accuracy is 80% now in this particular case let's say that my training accuracy is 92% and my test accuracy is 91% and let's say my model three is basically having training accuracy as 70% and my test accuracy is 65% so if I take this particular case it is basically overfitting if I take this particular thing this basically becomes my generalized model and when I talk about this this is my I'll just say that okay I'll also put nice color so that uh you'll be able to understand this this becomes our generalized model and this finally becomes our underfitting right under under fitting so here is my red color I will just say it as underfitting what are the main properties of this overfitting as I said in this scenario since it is performing well with the training data so it will be low bias High variance in this particular case it will be low bias low variance and this particular case it will be high bias and high variance understand in this terminology in this particular way you'll be able to understand so why do we require always a generalized model because whenever our new data will definitely come generalized model will be able to give us very good output let's go back to this particular example here you'll be able to see this straight line the red line 
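The model 1 / model 2 / model 3 comparison above can be turned into a tiny rule of thumb. This is just a sketch: the function name `diagnose` and the cutoffs (a 5-point train/test gap, an 80% accuracy floor) are my own illustrative choices, not anything standard.

```python
def diagnose(train_acc, test_acc, gap=0.05, floor=0.80):
    """Rough bias/variance reading from train vs test accuracy.
    The gap and floor thresholds are arbitrary illustrative choices."""
    if train_acc < floor and test_acc < floor:
        return "underfitting (high bias, high variance)"
    if train_acc - test_acc > gap:
        return "overfitting (low bias, high variance)"
    return "generalized (low bias, low variance)"

# The three models from the lecture:
print(diagnose(0.90, 0.80))  # model 1
print(diagnose(0.92, 0.91))  # model 2
print(diagnose(0.70, 0.65))  # model 3
```

The three calls reproduce the lecture's labels: overfitting, generalized, underfitting.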
The red line that I created is overfitting: whenever I get a new point, the difference between its real value and the predicted value is quite huge, so it is a scenario of overfitting with low bias and high variance. Again, with my two training points, the best fit line passing through both of them causes the overfitting problem, and I've already shown you that J(θ1) is zero in this scenario, since the line passes exactly through the points. Now, what can we take out of this? Our cost function is J(θ) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)². Let me write hθ(x⁽ⁱ⁾) as ŷ⁽ⁱ⁾, so each term becomes (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² — the squared difference between the predicted value and the real value. In this scenario, if I add up these values I get zero, and I have to make sure this value does not come out to zero, because zero here means overfitting. That is where Ridge regression comes into the picture — Ridge and Lasso. Ridge is also called L2 regularization, and what it does is add one more term to the cost: λ multiplied by the slope squared.

What is this slope? Take my equation hθ(x) = θ0 + θ1x. In this case θ0 is zero, so hθ(x) = θ1x, and θ1 is the slope; that is what I square. Always understand: I don't want the cost to become zero, because if it becomes zero it indicates an overfitting condition. Let's say λ = 1 (I'll talk later about how λ is chosen — it is a hyperparameter) and the slope of the overfitting best fit line is 2. The squared-error part is zero, so the total cost is 0 + 1 × 2² = 4. The cost function will not stop here, because it still has to be minimized, so the convergence algorithm changes θ1 and I get another best fit line. This new line no longer passes exactly through the points, so there is now a small residual difference. Say the slope has decreased to 1.5: the new cost is (small residual) + 1 × 1.5² = 2.25 plus a small value, which is less than 4. So the cost is getting reduced from 4 toward a smaller value, and that is the importance of Ridge: instead of the overfitting condition, you end up with a generalized model that has low bias and low variance. That is specifically why we add the Ridge (L2) penalty — to prevent overfitting: you keep reducing the cost until you get a line that behaves as a generalized model. Now if new points come, like the ones I drew earlier, the distance to the line will be small.

What the penalty is really specifying is that the slope should not be too steep; a very steep slope most of the time leads to overfitting. The line should be less steep, but still able to serve as a generalized model. After training for some time the cost will stop reducing much and settle at a small, minimal value. You also have to specify the number of iterations — how many times to update θ1 via the convergence algorithm — and that too is a hyperparameter; based on it you check your R² or adjusted R². And understand: the cost will never become exactly zero — if it does, trust me, it is an overfitting model. What is λ? λ is a hyperparameter controlling how strongly you penalize steepness, and it is selected through hyperparameter tuning, which I will show in today's practical. Why did we assume θ0 = 0? Because I'm considering that the line passes through the origin. "Steep" simply means how steep the line is: this line is quite steep, that one is less steep. Now let's go to the next regularization, which is called Lasso regression.
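The whole Ridge walkthrough above (θ0 = 0, cost = Σ(θ1x − y)² + λθ1²) has a closed-form answer you can check numerically. A minimal sketch, assuming the lecture's one-feature, through-the-origin setup; `ridge_slope` is my own helper name and the data values are made up.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Minimizer of sum((theta1*x - y)^2) + lam * theta1^2.
    Setting the derivative to zero gives theta1 = x.y / (x.x + lam)."""
    return float(x @ y / (x @ x + lam))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # the unpenalized best slope is exactly 2

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_slope(x, y, lam))
# slope shrinks toward 0 as lambda grows, but never reaches exactly 0
```

The printed slopes shrink from 2.0 through about 1.94 and 1.5 down to about 0.46 — exactly the "steepness control, never exactly zero" behaviour described above.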
Lasso regression is also called L1 regularization. Here the formula changes a little: you still have (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² and you still add a parameter λ, but instead of the slope squared you add the modulus of the slope, λ·|slope|. And what this modulus of the slope does is help you perform feature selection. You may be thinking: how does it do feature selection? Apart from preventing overfitting, Lasso also helps you select features; let me show you with an example. Consider my hθ(x), which I'm writing as ŷ, with many, many features — so there are many coefficients, many slopes. The penalty is then the sum of the absolute values of those coefficients: |θ1| + |θ2| + |θ3| + … + |θn| (note: it is the θ coefficients that are penalized, not the data points x1, x2, …). As training goes ahead, whichever features are not playing an important role, their coefficient — the slope value — gets driven to zero, and it is as if that entire feature is neglected. With the squared penalty (Ridge), coefficients are only shrunk and never reach exactly zero; but with

the modulus, the unimportant coefficients land at exactly zero, so we are effectively neglecting the features that are not at all important for this problem statement. So with L1 regularization, that is Lasso, you achieve two important things: one, preventing overfitting; and two, if you have many features and many of them are not that important in finding your best fit line, it also performs feature selection for you. That is the importance of Ridge and Lasso regression — here I'm writing L1 regularization, and we have already discussed L2 regularization. You have also understood that λ is a hyperparameter, and its value is found through cross-validation: a technique where we train the model with multiple candidate λ values and find which works best. In short, we are reducing the cost function in such a way that it never becomes zero, but keeps shrinking based on λ and the slope. In most scenarios, if you ask me, we should try both regularizations and use whichever gives the better performance metrics. So, to write it down again in short: for Ridge regression, the L2 norm, the cost function is J(θ) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ·(slope)². What is its purpose? The purpose
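Why does the |slope| penalty zero coefficients out while slope² only shrinks them? Inside coordinate-descent Lasso solvers this shows up as the soft-thresholding operator. A sketch with made-up numbers — the function name `soft_threshold` and the inputs are mine, not from the lecture.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding: the per-coefficient update at the heart of
    Lasso solvers. Signals weaker than lam are cut to exactly 0;
    stronger ones are kept but shrunk by lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

print(soft_threshold(5.0, 1.0))   # strong feature: kept, shrunk to 4.0
print(soft_threshold(0.3, 1.0))   # weak feature: exactly 0 -> dropped
```

The weak signal lands at exactly 0.0 (feature neglected), which is the feature-selection behaviour described above; a squared penalty would only have scaled it down.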
is very simple: here we are preventing overfitting. That was Ridge regression, the L2 norm. Now the next one, Lasso regression, also called L1 regularization: in the case of Lasso the cost function is J(θ) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ·|slope|, and its purpose is twofold — one, prevent overfitting; and two, something called feature selection. Those two are the outcomes. With Lasso, when you have many features you have many slopes — θ1, θ2, θ3, up to θn — and those features that have no real contribution in finding your output end up with coefficient values that are essentially nil, very close to zero; in short, you are neglecting them. Because you take the modulus rather than squaring, those values can actually reach zero.

Now let me continue and discuss the assumptions of linear regression. Assumption number one: linear regression works well if the features follow a normal, or Gaussian, distribution; if our features follow this distribution, our model will train well. There is a related concept called feature transformation: if a feature does not follow a Gaussian distribution, we apply some mathematical function to the data to try to convert it toward a normal/Gaussian distribution. The second point is standardization — scaling your data using the Z-score (I hope everybody remembers the Z-score), so that the mean becomes 0 and the standard deviation becomes 1. See, guys, wherever gradient descent is involved, it is good to standardize: if our starting point is already somewhere near the global minimum, training happens quickly; otherwise, if your feature values are huge, the cost surface becomes very large and the starting point can land anywhere on it. The third point is linearity: linear regression works well when the relationship in your data is linear — I won't say "linearly separable," but if your data is very linear it will give a very good answer (logistic regression, which we are going to discuss today, has the same property). Now, you may ask: is it compulsory to do standardization? If you want to reduce the training time of your model, or optimize it, I would suggest you go ahead and do it. Coming to the fourth point: you really need to check for multicollinearity. What is multicollinearity? Say I have features X1, X2, X3 and the output Y. If I look at the collinearity of two features — how correlated they are — and find that X1 and X2 are, say, 95% correlated with each other, is it a wise decision to use both of them, even if each is highly correlated with Y? The answer should be no: we can drop one of those two features.
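The Z-score standardization mentioned above is a one-liner; here is a minimal sketch, with made-up study-hours values for illustration.

```python
import numpy as np

def standardize(x):
    """Z-score scaling from the lecture: subtract the mean and divide by
    the standard deviation, so the result has mean 0 and std 1."""
    return (x - x.mean()) / x.std()

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = standardize(hours)
print(z.mean(), z.std())   # mean ~0, std ~1
```

This is exactly what scaling utilities such as scikit-learn's StandardScaler do per feature, and it is the form gradient descent likes.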
We keep just one of the correlated features and do the prediction with it. There is also a concept called the variance inflation factor (VIF): multicollinearity can be detected with the help of VIF, and I will try to make a dedicated video about it. One more term you will hear in this context is homoscedasticity — another condition people check — but if you satisfy the assumptions above, you will definitely be able to do well with linear regression. So now you have an idea of the assumptions and of several other things. Let's move toward something called logistic regression, the first algorithm we are going to learn for classification. In classification, take one example: suppose I have number of study hours and number of play hours, and based on these I want to predict whether a student passes or fails. So here I have a fixed number of categories — in this scenario two, which makes it binary classification, and logistic regression works very well for binary classification. The question then comes: can we solve multiclass classification using logistic regression? The answer is simply yes, you definitely can. So let's go ahead and discuss logistic regression; first, let's understand one scenario. Suppose I have a single feature, number of study hours — 1, 2, 3, 4, 5, 6, 7 — and the outcome is pass or fail; these are my two conditions. Now let me make some data points: if I study less than 3 hours I will probably fail, and if I study more than 3 hours I will probably pass.
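The VIF mentioned above can be computed with nothing but least squares: regress one feature on the others and look at how well they explain it. A sketch under my own naming (`vif`) and with made-up data; statsmodels ships a ready-made `variance_inflation_factor`, but the calculation is just this.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of X: regress X[:, j] on the
    remaining columns (plus an intercept) and return 1 / (1 - R^2).
    A common rule of thumb treats VIF > 5-10 as multicollinearity."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + 0.01 * rng.normal(size=50)   # nearly a copy of x1
x3 = rng.normal(size=50)                      # independent feature
X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))   # huge for x1, near 1 for x3
```

The near-duplicate feature gets an enormous VIF (drop one of the pair), while the independent feature stays close to 1.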
I mark the points below 3 hours as fail and the points above as pass, and let's say this is my training dataset. Now the first question: can't we solve this problem with linear regression? With linear regression, yes, I can definitely draw a best fit line through these points. Here fail is 0, pass is 1, and the middle value is 0.5. So I create the best fit line and put a rule on top of it: if hθ(x) is less than 0.5, the output is 0, which means fail; if hθ(x) is greater than or equal to 0.5, the output is 1, which means pass. Say a new data point comes and its prediction on the line is 0.25 — less than 0.5, so I say the person fails. The 0.5 level is my center point, so for any new point — say this red one — I draw a straight

line up to the best fit line, extend it across, read off the value, and if it is greater than 0.5 I say the person has passed. This is obviously working fine. So what is the problem? Why do we need logistic regression at all? The answer is very simple. Suppose I have an outlier: a student who studies 9 hours — 7, 8, 9, 10, somewhere out there — and obviously passes. When I add this outlier, the entire best fit line changes: it rotates and flattens. Now even at 5 study hours the prediction comes out less than 0.

5, so here the answer will be wrong: based on the previous line a person studying 5 hours would pass, but in this scenario the prediction falls below 0.5 and says fail, while the real value is pass. I hope you are understanding: because of one outlier, the entire line shifts. So how do we fix this? There are two problems here. First, just one outlier shifts your whole line here and there. Second, the line is unbounded: if I project points far enough to the right I get predictions greater than 1, and far enough to the left I get negative values — but our outputs can only be 0 and 1. So we have to squash this function so that its ends flatten out, and for that we use something called the sigmoid function (sigmoid activation function). If somebody asks you why you don't use linear regression to solve a classification problem, your answer should be exactly these two points. I have shown you all the scenarios for why linear regression should not be used; now we will continue, understand what exactly logistic regression is all about, and see how its decision boundary is created.
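The outlier story above can be reproduced numerically: fit a least-squares line to the 0/1 labels, find where it crosses the 0.5 threshold, then add one 9-hour passer and watch the boundary move. All numbers here are made up for illustration.

```python
import numpy as np

hours = np.array([1.0, 2.0, 4.0, 5.0])
label = np.array([0.0, 0.0, 1.0, 1.0])     # 0 = fail, 1 = pass

slope, intercept = np.polyfit(hours, label, 1)
boundary = (0.5 - intercept) / slope        # hours where prediction hits 0.5

hours2 = np.append(hours, 9.0)              # one outlier: 9 hours, passes
label2 = np.append(label, 1.0)
slope2, intercept2 = np.polyfit(hours2, label2, 1)
boundary2 = (0.5 - intercept2) / slope2

print(boundary, boundary2)  # boundary shifts right, from 3.0 to about 3.48
```

A student at 3.2 hours is predicted "pass" by the first line but "fail" by the second — one outlier flipped predictions near the threshold, which is exactly the fragility the lecture is pointing at.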
Let's go ahead and discuss that. Our output values should always be between 0 and 1 here, because this is a binary classification problem. So let's define our decision boundary. As usual, in logistic regression we first define the hypothesis: if I write hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn, I can write this entire equation compactly as θᵀx — obviously I can, and that is the notation you will see in many places. But we have to handle one more thing: squashing the line. If I have data points at 0 and data points at 1 and I create the best fit line as before, I also need to squash it at the top and at the bottom so it flattens out instead of running off. To do that, I use a function called the sigmoid activation function. The straight line is denoted hθ(x) = θ0 + θ1x1; on top of this value I apply something so that the line flattens instead of just

expanding without limit. So my hypothesis becomes g(θ0 + θ1x1), where g is a function applied on top of the linear regression to squash the line. What is this g? Let z = θ0 + θ1x1 — I'm just defining this — so hθ(x) = g(z), and g(z) = 1 / (1 + e^(−z)). Substituting z back, hθ(x) = 1 / (1 + e^(−(θ0 + θ1x1))). This is my hypothesis, and it works well because it squashes the function; this g is called the sigmoid, or logistic, function. What does it look like on a graph? With z on the horizontal axis and g(z) on the vertical axis, the sigmoid is an S-shaped curve running between 0 and 1 and crossing 0.5 at z = 0. From this we can make the major assumption: g(z) ≥ 0.5 whenever z ≥ 0, and g(z) < 0.5 whenever z < 0 — you can write that condition down as well. This is the most important condition here.
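The properties just listed — output bounded in (0, 1), g(0) = 0.5, g(z) ≥ 0.5 exactly when z ≥ 0 — are easy to verify directly. A minimal sketch of the sigmoid from the lecture:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1),
    with g(0) = 0.5 and g increasing in z."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-10, -1, 0, 1, 10]:
    print(z, sigmoid(z))
# negative z -> below 0.5; z = 0 -> 0.5; positive z -> above 0.5
```

Note the symmetry g(z) + g(−z) = 1, which is why the 0.5 threshold on g(z) corresponds exactly to the sign of z.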
Why is it called logistic regression? See, guys: with regression you create the straight line, and with the sigmoid (logistic) concept you squash it — the two names have been combined. Will squashing the best fit line help overcome the outlier issue? Yes, obviously it helps. So let's go ahead and set up the problem. Consider my training set: points (x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), and so on up to (x⁽ᵐ⁾, y⁽ᵐ⁾), where y belongs to {0, 1}, because we are solving a binary classification problem with only two outputs. And we know g(z) = 1 / (1 + e^(−z)), where z = θ0 + θ1x1. Now we have to select the parameters. Let's again consider θ0 = 0 — the line passes through the origin, just for simplicity — so z = θ1x. So my parameter is θ1: I have to change θ1 in such a way that I get the best fit line, with the sigmoid applied on top of it. Now let's define our cost function, because we definitely need one. Everything starts the same way — you already know the cost function of linear regression, since the first best fit line you create is with linear regression: J(θ1) = (1/2m) Σᵢ₌₁ᵐ

(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², the whole thing we discussed yesterday. That is the cost function for linear regression. For logistic regression, suppose I take the same cost function but substitute the new hypothesis, hθ(x) = 1 / (1 + e^(−θ1x)): J(θ1) = (1/2m) Σ (1/(1 + e^(−θ1x⁽ⁱ⁾)) − y⁽ⁱ⁾)², with the intercept taken as zero, guys. When I replace the hypothesis this way it becomes a logistic regression cost function — but there is one problem: we cannot use it. The reason is that with the sigmoid 1/(1 + e^(−θ1x)) inside, this squared-error expression is a non-convex function of θ. Now you may be wondering what a non-convex function is, so let me write it down and differentiate it from a convex function. This is related to gradient descent, and it is very
important this is related to gradient desent if you remember with the help of linear regression whatever gradient Dent we are actually getting it is a convex function like this this is the convex function which looks like a parabola curve Parabola curve because of this Parabola curve whenever we use this linear regression cost function specifically because here my H Theta of X is what it is nothing but Theta 0 + Theta 1 into X because of this this equ will always give you a parabola curve this kind of cost function or convex function you can say but here your s Theta of X is changing so in the case of if I use that cost function you will be getting some curves which looks like this now what is the problem with this curve here you have lot of local Minima if local Minima is there you will never reach This Global Minima so that is the reason we cannot use that c function now mathematically you can also go and probably search in the Google what is the what is the graph or what is a convex or non-convex function but always remember whenever we updates Theta 1 with this within this particular equation by finding the slope then this way it will not be differentiable and here you have lot of local Minima and because of this local Minima you will never be able to reach the global Minima this is your Global Minima right in case of in case of linear regression you'll reach This Global Minima but in this case you will never reach never never you'll be stuck over here or you may get stuck over here you may get stuck over here okay so this has a local Minima problem so how do we solve this understand in local Minima these are my points right I have to come over here this is my deepest point in this particular case I don't have any local Minima now in local Minima also you'll get slope is equal to Z so that is the reason your Theta 1 will never get updated so in order to solve this problem you can see this diagram we have something called as logistic regression cost function so 
I can now write my logistic regression cost function in a different way. Researchers thought about this problem and came up with this proposal: the cost for a single example, Cost(h_θ(x), y), should be defined piecewise, because in binary classification y can only be 1 or 0. So, writing the cost for the two scenarios:

Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0

where h_θ(x) is the sigmoid, 1 / (1 + e^(-θ1·x)). With this cost function, since the log is being used, you will always get a single global minimum; that is the reason the squared-error cost was completely rejected and this one is used instead. Now what does this cost function actually mean? Two scenarios. If y = 1, consider the cost graph with h_θ(x) on the horizontal axis; since this is a classification problem, h_θ(x) ranges between 0 and 1, and the vertical axis is the cost J(θ1). When y = 1 the equation -log(h_θ(x)) is used, and that curve comes with two properties: the cost is 0 when h_θ(x) = 1, that is, when the predicted probability matches the label y = 1, and the cost shoots toward infinity as h_θ(x) approaches 0, a confidently wrong prediction. This piece is again a convex curve. The next point to discuss is y = 0: then you get a different kind of curve, -log(1 - h_θ(x)), where again h_θ(x) ranges from 0 to 1, the cost is 0 at h_θ(x) = 0, and it blows up as h_θ(x) approaches 1. When you combine these two, gradient descent behaves well, so this definitely gives us a usable cost function. I hope everybody is able to follow so far. Finally, I can also write the cost function in a combined way instead of the piecewise form:

Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾) = -y⁽ⁱ⁾ · log(h_θ(x⁽ⁱ⁾)) - (1 - y⁽ⁱ⁾) · log(1 - h_θ(x⁽ⁱ⁾))

Let's verify it. If I replace y with 1, the second term vanishes, because 1 - 1 = 0 and 0 multiplied by anything is 0, leaving only -log(h_θ(x⁽ⁱ⁾)). If y = 0, then -y becomes 0, the first term vanishes, and I am left with -log(1 - h_θ(x⁽ⁱ⁾)). So both conditions of the piecewise definition are proved by this single cost function. And yes, the cost function and the loss function are almost the same thing here; the loss is per example and the cost averages over the data. Averaging over the m training examples, the full cost function becomes

J(θ1) = -(1/m) · Σ_{i=1..m} [ y⁽ⁱ⁾ · log(h_θ(x⁽ⁱ⁾)) + (1 - y⁽ⁱ⁾) · log(1 - h_θ(x⁽ⁱ⁾)) ]

(the extra factor of 1/2 we carried in linear regression is not needed here), and obviously h_θ(x⁽ⁱ⁾) = 1 / (1 + e^(-θ1·x⁽ⁱ⁾)). Finally, the convergence algorithm is the same as before: repeat until convergence the update θ_j := θ_j - α · ∂J(θ1)/∂θ_j, with α the learning rate, and this is how θ1 gets updated. So this is my cost function, this is my update rule, and this solves the problem for logistic regression. Simple interview questions may come from this, like how it is different from linear regression. Can we call this log likelihood, a topic from probability? Yes, this is the log likelihood: minimizing this cost is the same as maximizing the log likelihood. Now I will discuss performance metrics, and these are specific to classification problems, binary classification in particular.
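The combined cost function above takes only a few lines of NumPy (a minimal sketch; the eps guard and the sample values are my additions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta1, x, y):
    """J(theta1) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), h = sigmoid(theta1*x)."""
    h = sigmoid(theta1 * x)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# The two piecewise properties from the lecture:
# y = 1 with h close to 1 gives a cost near 0;
# y = 1 with h close to 0 gives a huge penalty.
print(round(log_loss(5.0, np.array([2.0]), np.array([1])), 4))
print(log_loss(-5.0, np.array([2.0]), np.array([1])) > 5)
```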
I am talking about a concrete example. Let's consider I have a data set with features X1, X2 and an output y, and obviously in binary classification the outputs look like 0, 1, 0, 1, 1, 0, 1. And ŷ, y hat, is the predicted output of the model; in this particular scenario my ŷ is 1, 1, 0, 1, 1, 1, 0. So this is my predicted output and this is my actual output. Can we come to some kind of conclusion about the accuracy of this model with respect to these data points? This is exactly what the confusion matrix deals with, so first of all we have to create a confusion matrix. For a binary classification problem the confusion matrix looks like a 2x2 table: the actual values, 1 and 0, along one side, and the predicted values, 1 and 0, along the other. Now walk through the pairs one by one. Actual 0, predicted 1: what does this mean? A wrong prediction, so that cell's count becomes 1. Second pair: actual 1, predicted 1, a correct prediction, so I increase that count. Actual 0, predicted 0: correct again, so I increase that count by one. Another 1 and 1, so instead of 1 that cell becomes 2, and one more 1 and 1 makes it 3. Then 0 and 1 again, so that cell also goes to 2, and finally actual 1, predicted 0 puts a 1 in the remaining cell. Now what do these cells mean? When actual and predicted are both 1, that is a true positive (TP). When both are 0, that is a true negative (TN). Whenever the actual value is 0 and you predicted 1, that is a false positive (FP), and whenever the actual value is 1 and you predicted 0, that is a false negative (FN). This whole table is what is called the confusion matrix. Now I really want to find the accuracy of this model, and from the confusion matrix it is very simple: the diagonal elements, the correct predictions, divided by everything. So

accuracy = (TP + TN) / (TP + FP + FN + TN)

Once I calculate this, I have (3 + 1) / (3 + 2 + 1 + 1) = 4/7, and 4/7 is about 0.57, so I am getting roughly 57% accuracy. That is how we calculate basic accuracy with the help of the confusion matrix. Always remember, our model's aim should be to reduce both false positives and false negatives. Now there are some more things you really need to understand. Suppose in our data set the output has 900 zeros and 100 ones; this is an imbalanced data set, biased data, very clear. Suppose instead the zeros are around 600 and the ones around 400; in that scenario I would say this is reasonably balanced data, because yes, one class has 100-odd fewer examples, but it may not hurt most algorithms. An imbalanced data set, though, will definitely affect the algorithms. Let me show you why. Say the number of zeros is 900 and the number of ones is 100, and I build a model that simply predicts 0 for every input it gets from this training data. What will its accuracy be? 900 divided by 1,000, which is 90%. Is that a good accuracy? Numerically, obviously yes, but the data is biased and the model is just outputting 0, 0, 0, 0 for everything; it still gets 90% accuracy while being useless. So you should not depend only on accuracy. There are other terminologies we use: one metric is precision, another is recall, and finally we will discuss the F-score. I am deliberately saying F-score rather than F1 score; the reason I will explain shortly. Whenever you have an imbalanced data set you have to use these other kinds of metrics; you can also do oversampling, and in some scenarios oversampling may work, but the main focus should be on the type of performance metric you choose. So let's write the formulas. Recall is given by recall = TP / (TP + FN). Precision is given by precision = TP / (TP + FP).
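The worked example above can be checked with scikit-learn's metrics. One thing to note: sklearn's confusion_matrix orders the cells as [[TN, FP], [FN, TP]], with labels sorted ascending.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# The worked example from the lecture
y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

# sklearn layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                      # 3 1 2 1

acc = (tp + tn) / (tp + tn + fp + fn)
print(round(acc, 2))                       # 0.57, the 4/7 from the lecture
print(np.isclose(acc, accuracy_score(y_true, y_pred)))  # True

print(round(precision_score(y_true, y_pred), 2))  # TP/(TP+FP) = 3/5 = 0.6
print(round(recall_score(y_true, y_pred), 2))     # TP/(TP+FN) = 3/4 = 0.75
```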
Next, let me discuss the F-score, which we also call the F-beta score. But first I'll draw the confusion matrix again: actual values 1 and 0, predicted values 1 and 0, giving the true positive, true negative, false positive and false negative cells. Now understand what recall focuses on. Recall says TP / (TP + FN). What does this mean? Out of all the actual positive values, how many have been predicted correctly as positive? That is recall, and in recall the false negative is given more priority: our focus should be to try to reduce false negatives. By the way, recall is also called the true positive rate, or sensitivity. Now let's discuss precision. In precision we ask: out of all the values the model predicted as positive, how many are actually positive? That is what precision basically means, and here the false positive is given more priority. Now suppose I consider spam classification as the task. Tell me, in this particular case should we use precision or recall? And one more use case: predicting whether a person has cancer or not. In which case do we go with recall, and in which with precision? For spam classification we should definitely go with precision. Why precision? Because a false positive means a genuine, important mail gets classified as spam and you may never see it; that is the costly mistake here, so the false positive is what we must try to reduce, and precision is exactly the metric that focuses on false positives. In the case of cancer I should definitely use recall. Let's focus on the recall formula, TP / (TP + FN). If a person actually has cancer, label 1, it should be predicted as 1; a false negative means the model tells a person who has cancer that he does not have cancer, and that is a really dangerous situation. The other direction is less harmful: if a person does not have cancer and the model predicts that he does, he will go for further tests and come to know the truth. So here the false negative is given more priority, while in spam classification the false positive is given more priority. This is something important, and you really need to reason like this for each different problem statement. Let me give you one more example: the model predicts that tomorrow the stock market is going to crash. Should we focus on precision or on recall? Here two things matter: who is solving the problem, and from whose point of view. Many people will say recall, many will say precision, but ask yourself: are you creating this model for the people or for the industry? For the people, a missed warning is terrible; they should definitely be notified so they can sell their stock before the crash, so false negatives matter. For companies, a false alarm is very, very bad, so false positives matter. So in this particular case we sometimes need to focus on both false positives and false negatives, and again, it depends on which problem statement you are solving: if you are solving for the people, they should get the notification that the market is going to crash; if you are doing it for companies, your precision and recall priorities may change. And when I have to consider both scenarios at the same time, I will definitely use something called the F-score, or F-beta score. How is the F-beta formula given? Generically, you consider

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

There are three common settings. Whenever both false positives and false negatives are equally important, we select β = 1. If I substitute β = 1, then 1 + 1 gives 2, and it becomes F1 = 2 · (precision · recall) / (precision + recall). This is what is called the harmonic mean; you have probably seen equations of the form 2xy / (x + y), and this is the same type. Here the focus is on both false positives and false negatives. Now let's say your false positive is more important than your false negative; at that point you decrease the β value. Say I decrease β to 0.5: then it becomes F0.5 = (1 + 0.25) · P · R / (0.25 · P + R). Decreasing β basically means you are giving more importance to false positives, that is, to precision. And finally, if I consider the β value as 2, that basically means you are giving more importance to false negatives than to false positives. So with this you can come to a conclusion about which value to use: β = 1 gives the F1 score, β = 0.5 gives the F0.5 score, and β = 2 gives the F2 score. β is the deciding parameter; choose it based on whether false positives, false negatives, or both are important for your problem. Now, first things first: what is the agenda of today's session? First of all, we will complete the practicals for all the algorithms we have discussed, with simple examples, and we will probably do hyperparameter tuning and everything. The second algorithm I am going to discuss is something called naive Bayes; this is a classification algorithm, so we are going to understand its intuition, and there is a fair amount of maths involved: we will revisit probability theory, there is something called Bayes' theorem that we will try to understand, and then we will solve a problem on it. The third one we are going to discuss is the KNN algorithm. So this is today's plan; I know I have written very little, but there is a lot in it. So let's proceed and let's enjoy today's session.
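Before the practicals, the F-beta variants discussed above can be checked against scikit-learn's fbeta_score, reusing the toy labels from the confusion-matrix example earlier:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

p = precision_score(y_true, y_pred)  # 3/5
r = recall_score(y_true, y_pred)     # 3/4

def f_beta(p, r, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 1.0, 2.0):
    manual = f_beta(p, r, beta)
    sk = fbeta_score(y_true, y_pred, beta=beta)
    print(beta, round(manual, 4), round(sk, 4))

# beta = 1 is the harmonic mean 2PR/(P+R); beta < 1 leans on precision
# (false positives), beta > 1 leans on recall (false negatives).
```

Since recall (0.75) is higher than precision (0.6) here, F2 comes out above F0.5, which is exactly the weighting behaviour described above.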
First of all, we enjoy by creating a practical problem, so I am opening a notebook file in front of you. Here we will try to solve a problem with the help of linear regression, ridge and lasso; let's see how much we are able to cover, but again, the aim is to learn in a better way, so that everybody understands the basic things. So first of all, as usual, everybody open your Jupyter notebook. The first thing I am going to discuss is scikit-learn's LinearRegression. Let's see what is in it; we will use fit_intercept and so on as-is, but the main aim here is to find the coefficients, which are indicated by our θ0, θ1 and so on. We will start with linear regression and then go ahead with ridge and lasso (I am just making this cell a markdown heading). How many different libraries are there for linear regression? You can do it with statsmodels, with SciPy, with many things, but we will use scikit-learn. First things first, we require a data set, and we are going to take a small one: the Boston house pricing data set, which is already present inside scikit-learn itself. (A note if you are following along today: load_boston was deprecated and has been removed in recent scikit-learn releases, 1.2 and above, so on a new version use another built-in regression data set such as fetch_california_housing.) To import the data set I write one line of code: from sklearn.datasets import load_boston. I am also going to create a number of cells up front so I don't have to keep adding them, with the basic libraries I want: import numpy as np, import pandas as pd, import seaborn as sns, import matplotlib.pyplot as plt, and the magic command %matplotlib inline, and execute. My typing speed has become a little faster from writing these imports again and again. So I have imported all the necessary libraries, which will be more than sufficient for you to start with. Now, to load this data set I just call that function and initialize it: df = load_boston(). If you press Shift+Tab you will see the docstring saying 'load and return the Boston house-prices dataset'; it is a regression problem. Once I execute it, type(df) shows sklearn.utils.Bunch, and if I display df you will see it is in the form of key-value pairs: target is here, data is here, and feature_names is here. We definitely require the feature names, the target values and the data values, and we need to combine them properly into a data frame. So what I am going to do is pd.DataFrame(df.data); remember this is a key-value pair, so df.data gives me all the feature values. If I execute df.data alone you will see the entire raw array: feature one, feature two, feature three and so on, 13 features in all, with their values. The next thing is to add the column names. At first I wrote dataset = pd.DataFrame(df.data) and then dataset.columns = df.target, and executing that gives an error: expected axis has 13 elements, new values have 506. Of course: target holds the 506 output values, not the column names. What I should use instead is feature_names; if you look at df again you will see there is a key called feature_names. So: dataset.columns = df.feature_names. Now if I print dataset.head() you can see the whole data set properly: the features CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT. The feature descriptions are available too: CRIM means per capita crime rate by town, which is important; ZN is the proportion of residential land zoned for lots over 25,000 sq. ft.; DIS is the weighted distance to five Boston employment centres; RAD is an index of accessibility to radial highways; TAX is the full-value property-tax rate; PTRATIO is the pupil-teacher ratio by town. So this is my DataFrame; I did not do much, just pd.DataFrame(df.data) plus the feature names, very simple. Now let's go a little slowly so that everyone can follow. This is dataset.head(), but the thing is, these are all my independent features, and I still need my dependent feature. So I will create a new column, dataset['Price'], the price of the house, and assign it df.target; the target array holds the sale prices of the houses, again in the form of an array, and I am taking it as the dependent feature. Once I execute that and look at dataset.head() again, you will see the features plus one more column being added, Price. About the units: for this data set the target is the median home value in thousands of dollars, not millions; you can confirm it in the data set description, df.DESCR. So, the main thing: all of these are my independent features and Price is my dependent feature, and if I am solving linear regression I have to divide them properly. Now let's go to the next step, dividing the data set; first of all I will divide it into independent and dependent features. X I will be using as my
independent featur so I will write data set dot I will use an iock which is present in data frames and understand from which feature to which feature I will be taking as my independent feature to this feature till lat so the best way that basically means that I just need to skip the last feature in order to skip the last feature what I'm actually going to do from all the columns I will just skip the last column so this is how you basically do an indexing with respect to just skipping the last feature and this will basically be my independent features and here I will basically say Y is equal to data set do iock and here I just want the last feature so I will write colon all the records I want and see the first term that we are probably WR writing over here this basically specifies with respect to records here this specifies with respect to columns from all the columns I'm taking the last column here I will just take the last column and this will basically be my dependent features dependent features so here I have basically executed now if you can go and probably see x. 
head here you'll be able to find all my independent features in y do head you'll be able to find the dependent feature now let's go to the first algorithm that is called as linear regression always remember whenever I definitely start with linear regression I'll definitely not go directly with linear regression instead what I will do is that I'll try to go with Ridge regression and uh lasso regression because there you are lot of options with respect to hyper pment T but I'll just show you how linear regression is done so basically you really really need to use a lot of libraries okay over here and based on this libraries this libraries will try to install okay and what are these libraries these are basically the linear regression Library so here I'm basically going to use two specific thing one is linear regression Library so I will just use from SK learn do linear uncore model import linear regression do you need to remember this the answer is no because I also do the Google and I try to find out where in escal and it is present okay so here is my linear regression so I will try to initialize linear reg is equal to initialize with linear regression and then here what I'm actually going to do I'm going to basically apply something called as cross validation cross validation is very much important because in Cross validation we divide out train and test data in such a way that every combination of the train and test data is basically taken by care is taken by the model and whoever accuracy is better that all entire thing is basically combined so here what I'm going to do I'm going to say mean square error is equal to here I will import one more Library let's say from SK learn dot model selection I'm going to import cross Val score so cross Val score cross validation score basically means it is going to do a lot of train and test split it's something like this one example I will show it to you here only so what does cross validation basically do okay so in Cross 
Suppose your entire dataset has 100 records. If you do 5-fold cross-validation, then in the first fold one fifth is your test data and the rest is your training data; in the second fold a different fifth becomes the test data and the remainder the training data; and so on, five times, each time with a different combination of train and test. I'm not going to discuss it in more depth here — if you want a separate session on it, I'll include that later. So I take `cross_val_score`, and the first parameter I give is my model, the linear regression; then I give X and y. I'm not doing a train/test split here — I'm giving the entire X and y and doing the cross-validation on that. You can also do the train/test split first and pass only `X_train` and `y_train`; it is up to you, but the best practice is to split first and cross-validate only on the training data. For `scoring` I'm going to use `neg_mean_squared_error` — again, you can find all the available options on the scikit-learn page for `cross_val_score` — and finally you give the cross-validation value, `cv=5` or 10, whatever you want. Since I'm doing 5-fold cross-validation I will get five scores back. If you don't believe me, just `print(mse)` and you'll see five different values, one per fold, because we are doing 5-fold cross-validation.
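A minimal sketch of this cross-validation step, using synthetic data in place of the session's dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real dataset
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

lin_reg = LinearRegression()

# 5-fold CV: each fifth of the data takes a turn as the test set
mse = cross_val_score(lin_reg, X, y, scoring="neg_mean_squared_error", cv=5)

print(mse)           # five negative-MSE values, one per fold
print(np.mean(mse))  # the averaged score
```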
Next I take `np.mean` to average all five scores, call it `mean_mse`, and print it. That is my average score; the value is negative because we used `neg_mean_squared_error`, but as a plain mean squared error it is about 37.13. So that is how you do cross-validation. With linear regression you can't tune much, which is exactly why, to overcome overfitting and to do feature selection, we use ridge and lasso regression. Before that, note that to make predictions all you have to do is take the model — `lin_reg` — and call `.predict()` with whatever test values you want; the prediction happens automatically. Now let me focus on ridge regression, because I want to show how hyperparameter tuning is done there. For ridge I'll use two imports: from `sklearn.linear_model` import `Ridge` (ridge also lives in `linear_model`), and for the hyperparameter tuning, from `sklearn.model_selection` import `GridSearchCV`. These are the two libraries I'm going to use; `GridSearchCV` will help you with the hyperparameter tuning. By the way, the difference between MSE and negative MSE is not a big thing, guys — if you use plain mean squared error you get 37; I've just used its negation. Either is fine; you can go with plain MSE as well.
There is also another scoring option based on the square root — root mean squared error — so there are different metrics you can focus on. Now, to find a good value, I'm going to do the hyperparameter tuning with `GridSearchCV`. First I define my model, `Ridge()` — that's what I imported. Let me open the scikit-learn page for `Ridge` so we understand what parameters it uses. Do you remember the alpha value, guys? Why do we use alpha? I told you: in ridge we add alpha multiplied by the square of the slope to the cost. That alpha is probably the best parameter on which to perform hyperparameter tuning. The next parameter we can tune is `max_iter` — the maximum number of iterations, i.e., how many times we may update the theta values to reach the right value. So I'm going to select some alpha values and play with those; apart from that, you can also play with the iteration parameter or other parameters if you want — try whichever parameter you want to change. Now let me show how to write this. Before running `GridSearchCV`, let me define my parameters. Here is my `Ridge`; now I'll say `parameters` and define the important values in the form of a
dictionary. (Sorry — not `C`, my mistake; `C` is the parameter for logistic regression, which I'll show later. For ridge it is `alpha`.) For the alpha values I'll mention some like 1e-5, which means 0.00001; similarly 1e-10 and 1e-8; then 1e-3 and, increasing from there, 1e-2; and then 1, 5, 10, 20, something like that. I'm going to play with all these values, because what `GridSearchCV` does is take every combination of the alpha values, find where your model performs best, and return that as the best-fit parameter that got selected. Now I'll apply `GridSearchCV`: I'll call it `ridge_regressor = GridSearchCV(...)`, where `Ridge` is the first argument — my model — and then I pass all the `params` I defined. If I press Shift+Tab on `GridSearchCV` (you have to execute the import first, then Shift+Tab works), you can see the signature: `estimator` first, `param_grid` second, then `scoring` and the other parameters. So the first thing that goes in is your model, then the parameters you are playing with, and the third is the scoring — and again I'm going to use `neg_mean_squared_error`. Some people are saying that plain mean squared error
is not present as a scoring option, and that is exactly why `neg_mean_squared_error` is used: scikit-learn keeps its scoring interface generic so the same scorer conventions work across algorithms. If you want to dig deeper into it, Google it. Then I call `ridge_regressor.fit(X, y)` — and again, you can first do a train/test split and fit only on `X_train` and `y_train`. (I hit an error here — the parameters had become a list, so I made them a dictionary again. When I get an error I don't get worried; I just fix it.) With the grid search fitted, let's select the best parameter: I print `ridge_regressor.best_params_` — and `ridge_regressor.best_score_`. The values selected are `alpha = 20` with a best score of -32. Initially I got -37, so because of ridge regression our negative mean squared error has definitely become better — there is a minus sign, don't worry, but it has moved from -37 to -32. And note, inside `GridSearchCV`, while it tries the whole combination, you can also set the cross-validation value `cv`. Many people are asking: Krish, if this negative value had instead moved further from zero, would that mean you cannot use ridge regression? You're right — in that scenario ridge would not be helping you out.
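The ridge hyperparameter search sketched end to end — synthetic data again, and the alpha grid mirrors the style of values used in the session:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# alpha is the ridge penalty strength (the alpha * slope^2 term in the cost)
params = {"alpha": [1e-5, 1e-3, 1e-2, 1, 5, 10, 20]}

ridge_regressor = GridSearchCV(Ridge(), params,
                               scoring="neg_mean_squared_error", cv=5)
ridge_regressor.fit(X, y)

print(ridge_regressor.best_params_)  # the alpha that scored best
print(ridge_regressor.best_score_)   # its mean negative MSE across folds
```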
Guys, let me write it down again so everybody is clear: previously, with plain linear regression, I got -37; now, with ridge, I got -32. Which one should I select? The ridge model, because it is performing better — closer to zero — and ridge also tries to reduce overfitting. Now let me also try lasso regression. I'll copy and paste the same code: from `linear_model` import `Lasso`, build my lasso regressor, and see whether it improves things. The parameter selected is `alpha = 1` — I print `lasso_regressor.best_params_` and then `lasso_regressor.best_score_` — and I'm getting -35, versus -32 for ridge, so for now ridge still wins. Now see what happens if I add more parameters: I'll extend the alpha grid with values like 5, 10, 20, 30, 35, 40, 45, 100, keep `cv=5` (yes, CV is 5 — take it down), and execute again.
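Swapping in lasso only changes the estimator; the grid-search scaffolding stays identical. A sketch, again on synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# Same style of grid, now for the lasso (L1) penalty strength
params = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 100]}

lasso_regressor = GridSearchCV(Lasso(max_iter=10000), params,
                               scoring="neg_mean_squared_error", cv=5)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```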
Do you see what happened in ridge after adding more parameters? It went to -29, and the alpha value selected is 100. If you want, try cross-validation with 10 folds and execute again — these are some of the hyperparameters we will definitely play with. You can also increase the cross-validation value and re-run. With lasso, though, I don't know whether it is improving — it comes to about -34. You just have to play with these parameters; for a bigger problem statement the search is not limited to this. We try many, many parameters, and whichever combination gives the best result, we take it. And yes, sometimes the error increases even after trying different parameters — that's fine; in most scenarios you keep iterating to get something better than the baseline of -37. The other thing I can do is bring in a proper train/test split and repeat all of this. Let's see one example: how do we split? From `sklearn.model_selection` import `train_test_split`. (It's okay, guys — you may get slightly different values than mine.) Let me make the problem statement a little simpler: I'll insert a cell below, take the same code, and redo it for the train/test split. I take `X_train` and `y_train` (and their test counterparts) from `train_test_split` with a 33% test size, and then execute with `X_train` and `y_train`. So here is my split setup.
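The split itself in minimal form — any `random_state` works; 42 is just the one I use here:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# 33% of the rows become the test set, the remaining 67% stay in the train set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)  # (67, 3) (33, 3)
```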
You can see I've written this code with `train_test_split` from `sklearn.model_selection`; the `random_state` can be anything you like. Then you pass X and y with `test_size=0.33`, which means the test set gets 33% of the data and the train set the remaining 67%. That gives me `X_train` and `y_train`, and I redo the cross-validation on `X_train` and `y_train`. Now I see about -25 — and understand, this value should move towards zero; the closer it goes to zero, the better the performance. Similarly for ridge: I fit the grid search on `X_train` and `y_train`, select the best score, and I'm getting about -25.47, so the improvement is still a little disappointing because we are not getting very close to zero. The same goes for lasso on `X_train` and `y_train`. Next, you can use `lasso_regressor.predict(X_test)` to get predictions to compare against your `y_test` values; suppose I call them `y_pred`. Then from scikit-learn I'll use the R² metric — remember R² and adjusted R²?
The `r2_score` function is in `sklearn.metrics`, so I write `from sklearn.metrics import r2_score`. Then I say my `score` variable is nothing but `r2_score(y_pred, y_test)` and print it. (I was also looking for an adjusted R² — scikit-learn only ships `r2_score`, but adjusted R² can be computed from it.) The output with this lasso regressor looks decent — ideally it should be near 100%, but right now I'm getting about 67%. If you want to try it with ridge, you can: `ridge_regressor.predict(...)` gives about 68%. You can also try the plain linear regressor — and if you see the error saying the regressor is not fitted yet, it's because we never called fit. So I fit it with `lin_reg.fit(X_train, y_train)`, predict, and the R² also comes out around 67–68%. Since this is just linear regression you won't get to 100%, because you are drawing a straight line; for that you'd use other algorithms like XGBoost, naive Bayes, and so on. (And yes — you give `y_test` and `y_pred`, and the metric compares the two.) See, beyond a limit you cannot increase the performance: in linear regression, given my points, I can only draw one best-fit straight line — I cannot draw a curve — so my accuracy will be limited. Now let's quickly do the logistic regression practical.
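Before moving on, the R² evaluation above as a sketch — synthetic data, so the score will not match the session's 67%; the adjusted-R² line is my own addition, since scikit-learn has no built-in scorer for it:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

lin_reg = LinearRegression().fit(X_train, y_train)  # fit first, or predict raises
y_pred = lin_reg.predict(X_test)

score = r2_score(y_test, y_pred)  # order: true values first, then predictions
print(score)

# Adjusted R^2 is not built into sklearn, but follows directly from r2_score:
n, p = X_test.shape
adj = 1 - (1 - score) * (n - 1) / (n - p - 1)
print(adj)
```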
In logistic regression too we can use `GridSearchCV`. First, the dataset: I'll quickly implement logistic regression — from `sklearn.linear_model` I import `LogisticRegression`. For logistic we need a classification problem, so let's take a new dataset: from `sklearn.datasets` import `load_breast_cancer`, which is also present in scikit-learn. I load the breast cancer dataset; all the independent features are in `.data` and the column names are in `.feature_names` — the same thing we did previously. So that becomes my complete set of independent features, and if you look at `X.head()` you'll see that based on these input features we need to determine whether the person has cancer or not; there are many, many features here. That was the independent side; the dependent feature is already present in `df.target` — in this dataset, `df.target` holds our dependent feature. So I create y as `pd.DataFrame(df.target)` with the column name `target`, and if you look at y, it contains zeros and ones in the target column. The next thing to do, before anything else, is check whether this target column is balanced or imbalanced.
If the dataset is imbalanced we definitely need to work on that, for example by upsampling. If I write `y['target'].value_counts()` and execute, it tells me how many ones and how many zeros there are: the total number of ones is 357 and the total number of zeros is 212. So is this an imbalanced dataset? No — this is a reasonably balanced dataset. Now I'll do the train/test split again — I can quickly copy the same code entirely and get my `X_train`, `X_test`, `y_train`, `y_test`. Next, if I search for logistic regression in the scikit-learn docs, I can see what parameters it has. There is the penalty — the L1 norm or L2 norm, i.e., L1 or L2 regularization, exactly what we discussed for logistic — and then the `C` value; these two parameters are very important. The penalty decides what kind of regularization you add — you can use L2 or L1. `C` is the inverse of the regularization strength — roughly saying, 1/lambda — and this parameter is also very important, guys. There is also `class_weight`: if your dataset is not balanced, you can apply weights to your classes at that point — you can directly use `class_weight='balanced'`, or supply whatever other weights you want. And no, this is not ridge or lasso — this is logistic regression, but in logistic regression you also have L1 and L2 norms. I probably missed that particular part in the theory, but the L1 and L2 penalty norms exist here too.
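Loading the dataset and checking the class balance, as described above:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer()

# Independent features come from .data, column names from .feature_names
X = pd.DataFrame(df.data, columns=df.feature_names)
# The dependent feature comes from .target
y = pd.DataFrame(df.target, columns=["target"])

print(y["target"].value_counts())  # 357 ones vs 212 zeros -> roughly balanced
```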
I didn't cover it in the theory because logistic regression can be learned in two different ways — through the probabilistic method and through the geometric method. If you watch my video on logistic regression, present right now on my YouTube channel, I explain the L1 and L2 norms there as well; here too it is a kind of penalty, used for this classification problem. So let's play with the parameters. I will tune two of them: the `C` value — I define one set of values like 1, 10, 20; anything you like — and one more parameter called `max_iter`. This grid is specifically for `GridSearchCV`, which I'm going to apply, so I execute that as my `params`. Now I quickly define my model, `model1`, a `LogisticRegression` — by default I give it one value each for `C` and `max_iter` — and later I apply it to the grid search: `GridSearchCV(model1, param_grid=params, ...)`. Since this is a classification problem and I'm not sure whether true positives or true negatives matter more, I'm going to use `scoring='f1'` — the F1 score is the performance metric we discussed yesterday — and then `cv=5`. That is my entire model with `GridSearchCV`, and I execute it.
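A sketch of this logistic grid search on the breast cancer data. Convergence warnings are expected with the smaller `max_iter` values on unscaled features, just as in the session:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# Two parameters to tune: C (inverse regularization strength) and max_iter
params = {"C": [1, 10, 20], "max_iter": [100, 150, 200]}

model1 = LogisticRegression(C=1, max_iter=100)
model = GridSearchCV(model1, param_grid=params, scoring="f1", cv=5)
model.fit(X_train, y_train)

print(model.best_params_)
print(model.best_score_)
```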
Then I fit it on my `X_train` and `y_train` data. Once it runs you see the output along with a lot of warnings — because of how many parameter combinations there are — and finally which parameters got selected. To find the best parameters I use `model.best_params_` — here `max_iter` of 150 was chosen — and `model.best_score_` is about 95%. But we still want to test it on the test data; can we? Yes, definitely: I call `model.predict(X_test)` and store it as `y_pred` — all the one/zero predictions I'm getting. After getting the predictions I can apply the confusion matrix — I hope I've taught you about the confusion matrix — so from `sklearn.metrics` I import `confusion_matrix` and `classification_report`; those are the two things I want. To see the confusion matrix I pass `y_test` and `y_pred` in whichever order you prefer — if you swap them, only the layout flips, as I showed you — and the values are 63, 118, 3, and 4. Finally, for the accuracy I also import `accuracy_score` and compute the total accuracy on `y_test` and `y_pred`, which we discussed yesterday — this gives 96%. If you want detailed precision and recall scores, at that point use `classification_report` with `y_test` and `y_pred`.
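Evaluating on the held-out test data with the metrics just mentioned (a plain fitted model stands in for the grid-search winner here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```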
Here you can see the F1 score, precision, and recall; since this is a balanced dataset, the performance is naturally good. Yes, you can also use ROC — I'll show you how later; you have to compute the false positive rate and true positive rate, but don't worry about ROC, I will first explain the theoretical part. Now let's go ahead and discuss naive Bayes. Naive Bayes is an important algorithm — another amazing algorithm used specifically for classification — and it works on something called Bayes' theorem. So first we need to understand Bayes' theorem. Let's say, guys, that I have an experiment called rolling a die. In rolling a die, if I ask the probability of getting a 1, you'll obviously say 1/6; if I ask the probability of a 2, you'll also say 1/6; and for a 3 it is again 1/6. Events like these are called independent events. Why is rolling a die an independent event? Because across the rolls, getting a 1 is not dependent on getting a 2, and a 2 is not dependent on a 3 — they are all independent, which is why we specifically call them independent events. For dependent events, consider an example: a bag of marbles, containing three red marbles and two green marbles. Now tell me — suppose in the first event I take out a red marble. What is the probability of taking out a red marble? You can definitely say it is 3/5; that is my first event. Now, for the second event, suppose
the red marble is already out and — forget a second red marble — you want to take out a green marble. What is that probability? You'll say: one red marble has been removed, so four marbles are left in total, and the probability of getting a green marble is 2/4, which is 1/2. So look at what is happening: from the first event you took out a red marble, and from the second event a green marble, and these two are dependent events, because the number of marbles goes down as you draw from the bag. So if I ask what is the probability of taking out a red marble and then a green marble, the formula is very simple — we already discussed it in statistics: P(red and green) = P(red) × P(green | red). That second factor is called a conditional probability — the probability of the green marble given that the red-marble event has already occurred. Let me now write it down very cleanly: P(A and B) = P(A) × P(B | A). Let's go and derive something. Can I write P(A and B) = P(B and A)? The answer is yes, definitely — if you do the calculation you'll get the same answer; you should not say no. And what is the formula for P(A and B)? You can write it as P(A) × P(B | A). In our example, if I take out the green marble first, P(green) is 2/5, and
P(red | green) is 3/4. Now, the other side, P(B and A), I can definitely write as P(B) × P(A | B). Setting the two expansions equal — P(A) × P(B | A) = P(B) × P(A | B) — I can derive (sorry, let me write it correctly): P(B | A) = [P(B) × P(A | B)] / P(A). This is what is called Bayes' theorem, and this is the crux behind naive Bayes — understand, this is the crux behind it. Now let's discuss how we use this to solve problems; let me take an example to make you understand. Say I have features X1, X2, X3, X4, X5, ..., up to Xn, and I have my output Y. These are all my independent features, and Y is my output feature, which is also my dependent feature. Now, what does P(B | A) mean here? I need to find the probability of Y: based on the input values I need to predict the output — and initially, on the training dataset, the model sees both your inputs and your output and gets trained on them. Let me write the whole thing in terms of the equation: I want P(Y | X1, X2, ..., Xn), where A is the feature vector X1, X2, ..., Xn and B is Y.
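The derivation above, written out cleanly:

```latex
P(A \cap B) = P(A)\,P(B \mid A), \qquad
P(B \cap A) = P(B)\,P(A \mid B)

\text{Since } P(A \cap B) = P(B \cap A):\quad
P(A)\,P(B \mid A) = P(B)\,P(A \mid B)

\Rightarrow\; P(B \mid A) \;=\; \frac{P(B)\,P(A \mid B)}{P(A)}
\qquad \text{(Bayes' theorem)}
```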
Applying the theorem: P(Y | X1, X2, ..., Xn) = [P(Y) × P(X1, X2, ..., Xn | Y)] / P(X1, X2, ..., Xn). (Just a second — I made a small mistake on the board and missed a "given Y" term; this is the correct form.) Now, expanding under the naive assumption that the features are conditionally independent given Y, the numerator becomes P(Y) × P(X1 | Y) × P(X2 | Y) × P(X3 | Y) × ... × P(Xn | Y), and the denominator becomes P(X1) × P(X2) × P(X3) × ... × P(Xn). The Y differs per record's class — for this record Y may be one value, for that record another — but the output is, say, yes or no; it can equally be binary or multiclass, whatever you want. I'll solve a problem in front of you and it will make everything clear. So let's say I have features X1, X2, X3, X4 in my dataset, and my y takes the values yes or no. How do I write it? We really need to understand this: I ask, what is P(y = yes | xi) for a given record xi?
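The classifier in symbols — the conditional-independence assumption is exactly what makes it "naive":

```latex
P(y \mid x_1, \ldots, x_n)
  \;=\; \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}
  \;=\; \frac{P(y)\, \prod_{i=1}^{n} P(x_i \mid y)}
             {\prod_{i=1}^{n} P(x_i)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```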
I this is my second record of X of I so I may write like this what is the probability of Y being yes if x of I is given to you X of I basically means X1 X2 X3 X4 so here you'll obviously write what kind of equation you'll basically say probability of yes multiplied by probability of yes multiplied by probability of X of 1 given yes multiplied by probability of X2 given yes probability of x3 given yes and probability of X4 given yes divided by probability of X1 multiplied by probability of X2 multiplied by probability of x3 multiplied by probability of X4 Y is fixed it may be yes or it may be no but with respect to different different records this value may change similarly if I write probability of Y is equal to no given X of I what it will be then it will be probability of no multiplied by probability of X1 given no then probability of X2 given no probability of x3 given no and probability of X4 given no so here because every any input that I give any input X of I that I give I may either get yes or no so I need to find both the probability so probability of X1 multiplied by probability of X2 multiplied by probability of x3 multiplied by probability of X4 see with respect to Any X of I the output can be yes or no and I really need to find out the probabilities so both the formula is written over here what is the probability of with respect to yes and what is the probability with respect to no now in this case one common thing you see that this this denominator is fixed this is definitely fixed it is fixed it is it is not going to change for both of them and I can consider that this is a constant so what I can do I can definitely ignore so here I can definitely ignore these things ignore this also ignore this Al because see this is constant so I don't want to consider this in the next time I'll just use this specific formula to calculate the probability now let's say that if my first probability for a specific data set yes of X of I is let's say that I'm getting 
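As a minimal sketch, here is how that comparison looks in code. The helper name and the probability values are made up for illustration; in practice the conditional probabilities come from the frequency tables we build next.

```python
# Naive Bayes: the denominator P(x1)*...*P(xn) is the same for both classes,
# so we only compare the numerators: prior * product of likelihoods.

def numerator(prior, likelihoods):
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Hypothetical conditional probabilities for one record x(i):
score_yes = numerator(0.6, [0.5, 0.4])   # P(yes) * P(x1|yes) * P(x2|yes)
score_no  = numerator(0.4, [0.3, 0.2])   # P(no)  * P(x1|no)  * P(x2|no)

# Normalize so the two scores behave like probabilities summing to 1:
p_yes = score_yes / (score_yes + score_no)
p_no  = 1 - p_yes
print(round(p_yes, 3), round(p_no, 3))   # prints 0.833 0.167
```

Dropping the shared denominator never changes which class wins, because dividing both scores by the same positive constant preserves their order.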
Suppose for a specific record x(i) the numerator for yes comes out as 0.13, and similarly the numerator for no comes out as 0.05. You know that in binary classification, if a probability is greater than or equal to 0.5 we consider the output as 1, and if it is less than 0.5 we consider it as 0. But 0.13 and 0.05 are not real probabilities yet, because we dropped the denominator, so we do something called normalization. For the probability of yes given x(i), normalization is nothing but 0.13 divided by (0.13 + 0.05), which is about 0.72, that is 72%. Similarly, for the probability of no given x(i), it will be 1 - 0.72, which is the remaining 0.28, nothing but 28%. So your final answer will be yes. These formulas you have to remember.

Now we'll solve a problem, and this will be a very interesting problem. Let's say I have a data set with features like day, outlook, temperature, humidity and wind; let me just copy this data set for you all. From this data set I want to take out some information, so let's build a table from the Outlook feature. See, over here day, outlook, temperature, humidity and wind are the input (independent) features, and play tennis, the one you see at the end, is my output feature, which is specifically a binary classification: yes or no.

So what I'm going to do is take my Outlook feature and, based on it, create a smaller table which will give some information. First of all, find out how many categories there are in Outlook: one is sunny, one is overcast and one is rain, right? Three categories. So I'm going to write down sunny, overcast and rain, and with respect to each category I'll count how many yes and how many no there are.
So this is my Outlook table, and the columns are: yes count, no count, probability of yes and probability of no, for the categories sunny, overcast and rain.

Now, the next thing we need to find out: with respect to sunny, how many of them are yes and how many are no? Go through the rows: when outlook is sunny the answer is no, so I increase the count to one; again sunny and the answer is no, count two; again sunny with no, count three. And with sunny, how many are yes? There are two. So with respect to sunny I have 2 yes and 3 no. Understand, Outlook is my X1 feature here, let's consider it that way.

Next, with respect to overcast: count the yes rows, 1, 2, 3, 4, so four yes; and if you go and find out, there are zero no with respect to overcast. Then with respect to rain: if you count, there are 3 yes and 2 no. So the totals, if you count them all, are nine yes and five no, and 9 + 5 is 14 records in total.

Now, what is the probability of yes when sunny is given? Here you have 2/9. For overcast you have 4/9, and for rain you have 3/9. Let me write it in a simpler manner so you don't get confused: one column is my probability of yes and the other is my probability of no, and what these entries mean is P(yes | sunny), P(yes | overcast) and P(yes | rain); we divide by 9 because there are nine yes in total. Similarly for the probability of no column: P(no | sunny) is 3/5, then you have 0/5 for overcast, and 2/5 for rain.

Now let's do the same with one more feature; let's consider temperature. In temperature, how many categories do I have? Hot, mild and cool. With respect to hot, mild and cool, here also I will have yes, no, probability of yes and probability of no. Now find out: with respect to hot, there are 2 yes and 2 no. Similarly with respect to mild, there are 4 yes and 2 no. With respect to cool, there are 3 yes and 1 no. Again the totals come to 9 yes and 5 no, equal to the same thing we got before. Now go ahead and find the probabilities: P(yes | hot) is 2/9, P(yes | mild) is 4/9, P(yes | cool) is 3/9; and P(no | hot) is 2/5, P(no | mild) is 2/5, P(no | cool) is 1/5. So these two tables have now been created.

And finally, with respect to play tennis itself: the total number of yes is 9, no is 5, total 14. So if I ask what is the probability of yes by itself, it is nothing but 9/14, and the probability of no is nothing but 5/14. These two values you also require.
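As a sketch, these tables can be built mechanically from the data columns. Here I've written out a 14-row Outlook column and the play-tennis labels so that they match the counts worked out above; the `likelihood` helper name is my own.

```python
from collections import Counter

# Outlook column and play-tennis label, matching the counts in the table above
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

pair_counts  = Counter(zip(outlook, play))   # e.g. ("Sunny", "Yes") -> 2
label_counts = Counter(play)                 # {"Yes": 9, "No": 5}

def likelihood(category, label):
    """P(category | label) = count(category, label) / count(label)."""
    return pair_counts[(category, label)] / label_counts[label]

print(likelihood("Sunny", "Yes"), likelihood("Sunny", "No"))
```

The same two lines of counting work for temperature, humidity and wind; each feature gets its own small table.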
Suppose you get a new test data point where outlook is sunny and temperature is hot. Tell me, what is the output? This is my problem statement, so let me write it down. Here I will write the probability of yes given (sunny, hot). By our formula that is P(yes) * P(sunny | yes) * P(hot | yes), divided by P(sunny) * P(hot), and we can drop the denominator from the equation because it is a constant: for the probability of no I would divide by the same value. So: P(yes) I'm going to replace with 9/14, multiplied by P(sunny | yes), which is 2/9, multiplied by P(hot | yes), which is again 2/9. Cancel the nines: 9/14 * 2/9 * 2/9 = 4/126, which is about 0.031.

Now go ahead and calculate the probability of no given (sunny, hot). Here you have P(no) * P(sunny | no) * P(hot | no), divided by P(sunny) * P(hot), and again, guys, this denominator is a constant, so it gets cancelled. What is P(no)? It is nothing but 5/14, so I will write 5/14 multiplied by P(sunny | no), which is nothing but 3/5, multiplied by P(hot | no), which is nothing but 2/5. Five and five get cancelled: 5/14 * 3/5 * 2/5 = 3/35, and if you put 3 divided by 35 into a calculator, it is about 0.0857.

Let me write it down again: the probability of yes given (sunny, hot), my independent features, is nothing but 0.031, and the probability of no given (sunny, hot) is 0.0857. Now we'll try to normalize these two values.
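The arithmetic just done can be checked in a couple of lines, using the same numbers read off the two tables:

```python
p_yes_prior, p_no_prior = 9/14, 5/14      # priors from the play column

score_yes = p_yes_prior * (2/9) * (2/9)   # P(Yes) * P(Sunny|Yes) * P(Hot|Yes)
score_no  = p_no_prior  * (3/5) * (2/5)   # P(No)  * P(Sunny|No)  * P(Hot|No)

# Normalize the winning side so the two scores sum to 1:
p_no_given_x = score_no / (score_yes + score_no)
print(round(score_yes, 3), round(score_no, 4), round(p_no_given_x, 2))
# prints 0.032 0.0857 0.73
```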
The normalized probability of no is 0.0857 divided by (0.031 + 0.0857), which is about 0.73, nothing but 73%; and for yes I can basically say 1 - 0.73, which is 0.27, nothing but 27%. So if the input comes as sunny and hot, if the weather is sunny and hot, will the person play or not? The answer is no.

Okay, now my next question will be: if your new data is overcast and mild, tell me what the probability will be using Naive Bayes. You can add any number of features; we could also consider humidity and wind, and you basically create the same kind of tables to find it out. But this will be an assignment: overcast and mild, try to solve it with Naive Bayes.

So the second algorithm that we are going to discuss is something called the KNN algorithm. KNN is a very simple algorithm which can be used to solve both classification and regression, and KNN basically means K nearest neighbors. Let's first discuss a classification problem. Say I have a binary classification problem that looks like this: I have one cluster of data points here and another cluster there. Suppose a new data point comes over here; how do I say whether it belongs to this category or to that category? If I fit a logistic regression I may draw a dividing line, but in this scenario how do we come to a conclusion about which category the point belongs to? For this we basically use K nearest neighbors. Let's say my K value is five. What it is going to do is take the five nearest, closest points: say two nearest points from this category and three nearest points from that one. So here we basically see, from the distances, which are my nearest points. Now, in this particular case,
you see that the maximum number of points are from the red category: from red I'm getting three points and from white I'm getting two points. Whichever class contributes the maximum number of neighbors, we categorize the new point into that class, just with the help of distance. Which distances do we specifically use? We use two: one is Euclidean distance and the other is something called Manhattan distance. What does Euclidean distance say? Suppose these are your two points, denoted by (x1, y1) and (x2, y2); to calculate the Euclidean distance we apply the formula sqrt((x2 - x1)^2 + (y2 - y1)^2). Whereas in the case of Manhattan distance, for the same two points we calculate the distance along the axes, from here to here and then from here to here; we do not take the hypotenuse. That is the basic difference between Euclidean and Manhattan distance.

Now you may be thinking: Krish, fine, that was for a classification problem; for regression, what do we do? For regression also it is very simple. Suppose I have data points that look like this, and for a new data point I want to calculate a value. We again take the nearest five points; let's say my K is five, and K is a hyperparameter which we tune. Suppose it finds the nearest points here, here, here, here and here. Then, with K equal to 5, to find the output for this point it calculates the average of those points' values, and that average becomes the output. So between regression and classification that is the only difference. Because K is a hyperparameter, we try K from 1 to 50, check the error rate for each, and select the model where the error rate is least.
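A minimal sketch of KNN classification with both distances; the helper names here are my own, not a library API.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line (hypotenuse) distance between two 2D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    """Axis-aligned distance: no hypotenuse, just the two legs."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def knn_predict(train, query, k=5, dist=euclidean):
    """train is a list of ((x, y), label); vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
         ((5, 5), "white"), ((5, 6), "white")]
print(knn_predict(train, (0.5, 0.5), k=5))   # 3 red vs 2 white -> "red"
```

For KNN regression, the same `nearest` list would be averaged instead of voted on.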
Now, two more things with respect to K nearest neighbors: it works very badly with two things, one is outliers and the other is an imbalanced data set. Say I have one category here, like this, and another category there, and let's consider that I have an outlier sitting over here. Now, if I'm trying to classify a point near it, you can see that the truly nearest cluster is the blue one and the point belongs to the blue category, but because of this outlier the algorithm will consider the outlier as its nearest neighbor, so the point will be treated as part of that group instead. And one note on the formula for Manhattan distance: it uses the modulus, |x2 - x1| + |y2 - y1|.

So this was it from my side, guys, and yes, I've also made detailed videos about whatever topics we discussed today; you can directly go and search for that particular topic.

So this is the agenda of this session; we will try to complete all these things, and again we are going to understand the mathematical equations as well. In today's session we are basically going to discuss decision trees, and we are going to understand the exact purpose of a decision tree. With the help of a decision tree you can solve two different problems: one is regression and the other is classification. We'll try to understand both parts well; we will take a specific data set and solve those problems.

Now, coming to the decision tree, one thing you need to understand. Let's say I write this condition: if age is less than or equal to 18, I print "college". Then, else if age is greater than 18 and age is less than or equal to 35 (let me put the condition a little better this time), I print "work"; basically people need to work at this age. Else, I'm just going to print "retire". So here is my if-else condition.
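The nested if-else just described can be written straight down as code; the function name is made up for illustration.

```python
def life_stage(age):
    if age <= 18:
        return "college"
    elif age <= 35:      # i.e. age > 18 and age <= 35
        return "work"
    else:                # age > 35
        return "retire"

print(life_stage(15), life_stage(25), life_stage(60))
# prints: college work retire
```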
Now, whenever we have this kind of nested if-else condition, we can also represent it in the form of a decision tree. First of all we will have a root node. In this root node the first condition is age less than or equal to 18, so from it I will have two branches, one for yes and one for no. If this condition is true, we go to this side, and here we will have "college"; this is a leaf node. Similarly, when it is no, we go to the next condition: I again create a node and say age greater than 18 and less than or equal to 35 (sorry, I said less than 18 earlier; it should be greater than 18). Again I'll have two branches, yes or no. If it is yes I print "work", so that again becomes a leaf node, and for no I do the further split, which is "retire". So you can see that this entire code I had written has got converted into this kind of tree, where you are able to take decisions, yes or no. So can we solve a regression and a classification problem using decision trees by creating this kind of node structure? Yes.

In short, whenever we talk about decision trees, they are nothing but this nested if-else logic: using nested if-else conditions we can definitely solve a specific problem statement, but here, in a visualized way, we create the decision tree in the form of nodes. Now you need to understand what type of maths we will use. So let's do one thing: let's take a specific data set, which I will work through in front of you, and we will try to solve it; this will give you an idea of how we can solve these problems. Let me just open my snipping tool. So this is the data set I have. This data set is pretty important, because in research papers also, the people who came up with this algorithm usually take this one. Right now this particular problem statement is a classification problem, but don't worry, I will also explain how decision tree regression works.

So let's go ahead and understand how we solve this. The output feature is play tennis, yes or no: whether the person is going to play tennis or not. If I have the input features outlook, temperature, humidity and wind, is the person going to play tennis or not? That is what my model should predict with the help of the decision tree. How will the decision tree work in this case? First of all, let's consider one specific feature; let's say Outlook is my feature, so this will be my first node, which is Outlook. Now just tell me: in the whole data set, how many records have no and how many have yes?
You'll be able to find out there are nine yes (count them: 1, 2, 3, 4, 5, 6, 7, 8, 9) and five no. So nine yes and five no overall, and the first node that I have taken is Outlook. Now, in this feature, how many categories do I have? One category is sunny, you can see over here; then I have another category called overcast; then another category, rain. So I have three unique categories, and based on these three categories I will create three child nodes: this one is called sunny, this one is called overcast, and this one is called rain. That is how I'm splitting.

Now just go ahead and see: in sunny, how many yes and how many no are there? In sunny I have three no (see, one, two, three) and two yes (this one and this one). So with respect to sunny there are 2 yes and 3 no. By the way, I have randomly selected Outlook here; it is actually up to the decision tree to select the feature, and later on I'll explain how it selects which feature. I'll talk about it, don't worry. The next thing: let's check overcast. In overcast I have 1, 2, 3, 4 yes, and I don't have any no in overcast, so over here it will be 4 yes and 0 no. And finally the rain part: go and see how many yes and no there are in rain; if I take it as an example, there are 3 yes and 2 no. Understand the algorithm first, and then you'll be able to understand everything.

So, to recap: sunny definitely has 2 yes and 3 no, overcast has 4 yes and 0 no, and rain has 3 yes and 2 no. Now here you need to understand two things: one is a pure split and one is an impure split. What does a pure split mean? See, in this particular scenario, in overcast I have either all yes or all no; here I have 4 yes and 0 no, so that basically means this is a pure split. Tomorrow, if in my data set, say on day 15, the outlook is overcast, then I know directly that the person is going to play. This path is already decided, and this node is called a pure node. Why is it called a pure node? Because either you have all yes and zero no, or zero yes and all no. In this particular case I have all yes, so if I take this path I know that with respect to overcast my final decision is always going to be yes. So I don't have to split further from here; I will definitely not split more, because I don't require it. It is a pure leaf node; you can also say this is a pure leaf node. This overcast one is the node I'm specifically talking about.
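Counting the yes/no per category and checking purity can be sketched like this, using a 14-row Outlook column written out to match the counts in the table; the helper names are my own.

```python
from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def split_counts(feature_col, labels):
    """For each category of the feature, count the labels in that child node."""
    table = {}
    for cat, lab in zip(feature_col, labels):
        table.setdefault(cat, Counter())[lab] += 1
    return table

def is_pure(counts):
    """A node is pure when only one class is present in it."""
    return len(counts) == 1

table = split_counts(outlook, play)
print(table["Overcast"], is_pure(table["Overcast"]), is_pure(table["Sunny"]))
```

Running this shows the overcast node holding only yes labels (pure), while the sunny node mixes 2 yes and 3 no (impure).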
Now let's talk about sunny. In the case of sunny you have 2 yes and 3 no, so this is obviously impure. So what do we do? We take the next feature (again, how we calculate which feature to take next, I'll discuss shortly); let's say after this I take up temperature, and I start splitting again, since this node is impure. This splitting happens until we finally get a pure split. Similarly, with respect to rain we will go ahead, take another feature, and keep on splitting unless and until we get a leaf node which is completely pure. I hope you understood how this exactly works.

Now, two questions. The first is: Krish, how do we calculate this purity, and how do we come to know that a split is pure? Just by seeing, I can definitely say; by counting how many yes and no there are, I can say whether it is a pure split or not. But the algorithm needs a measure, and for this we use two different things: one is entropy and the other is something called Gini impurity. So we will understand how entropy works and how Gini impurity works in a decision tree, which will help us determine whether a split is pure and whether a node is a leaf node. Then, coming to the second thing, your most important question from before: why did I select Outlook? How are the features selected? Here you have a topic called Information Gain, and if you know both of these, your problem is solved.

So now let's go ahead and understand entropy, Gini impurity and Information Gain. I'll call it Gini impurity, not coefficient. I hope everybody has understood till here. Let's go ahead and discuss the first thing, that is
entropy: how does entropy work, and how do we use the formula? So, entropy, and here I'll also write Gini, since we are going to discuss both. The entropy formula is given by

H(S) = -p+ log2(p+) - p- log2(p-)

and the Gini impurity formula is

Gini = 1 - sum over i from 1 to n of (p_i)^2.

I'll also talk about when you should use Gini impurity and when you should use entropy; note that by default, decision tree classification uses Gini impurity.

Now let's take one specific example. I have a feature 1 as my root node, and let's say in this root node I have 6 yes and 3 no, very simple. Say this feature has two categories, and based on these two categories a split happens: in category C1 I have 3 yes and 3 no, and in category C2 I have 3 yes and 0 no. Always understand: if I do the summation, 3 + 3 is obviously the 6 yes, and 3 + 0 is obviously the 3 no. The child counts must add up to the root node's counts; this you need to understand.

Now let's go ahead and calculate the entropy of the C2 node, the one with 3 yes and 0 no. I've already shown you the entropy formula over here; now let's understand the components. I will write H(S) = -p+ log2(p+) - p- log2(p-). What is p+? It basically means the probability of yes out of this node, and p- means the probability of no: when I say plus, that basically means yes, and when I say minus, that basically means no. Plus and minus are specifically for a binary class, positive and negative.
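Both formulas are easy to put into code for the binary case. This is just a sketch; `p_plus` is the fraction of yes in the node, and the convention 0 * log2(0) = 0 is handled explicitly.

```python
import math

def entropy(p_plus):
    """H(S) = -p+ log2(p+) - p- log2(p-), with 0*log2(0) taken as 0."""
    h = 0.0
    for p in (p_plus, 1 - p_plus):
        if p > 0:
            h -= p * math.log2(p)
    return h

def gini_impurity(p_plus):
    """1 - sum of squared class probabilities."""
    return 1 - (p_plus ** 2 + (1 - p_plus) ** 2)

print(entropy(3/3), entropy(3/6))   # pure node -> 0.0, 50/50 node -> 1.0
```

For the C2 node above (3 yes, 0 no), entropy(3/3) gives 0, confirming a pure split, exactly as we are about to calculate by hand.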
So, what is the probability of yes for this C2 node? Out of the 3 total, all 3 are yes, so p+ is 3/3. Similarly, the next term: log2(p+) is log2(3/3). Then we have the minus term with p-: p- is 0/3, and the term (0/3) log2(0/3) obviously becomes zero, because 0 divided by anything is zero. And what is the first term? 3/3 is 1, and log2(1) is nothing but zero. So

H(S) = -(3/3) log2(3/3) - (0/3) log2(0/3) = 0.

Tell me, is this a pure split or an impure split? It is a pure split, and whenever we have a pure split, the entropy is going to come out as zero.

Here I'm going to define one graph: on the vertical axis H(S), and on the horizontal axis p+ (or p-). See, when the probability of plus is 0.5, what will the probability of minus be? It will also be 0.5, right? Because it's just like p = 1 - q: if p is 0.5 then q is 1 - p, the same thing. And when it is 0.5, my H(S) will be 1, as we'll verify; that is the curve that gets formed.

Now let's calculate the entropy of the other node, guys, the C1 node with 3 yes and 3 no. What is p+? It is nothing but 3/6, so

H(S) = -(3/6) log2(3/6) - (3/6) log2(3/6).

If you do the calculation (log2 of 1/2 is -1), here I'm actually going to get one. So when am I getting one? When you have three yes and three no, the probability is 50/50, right? So when your p+ is 0.5, your H(S) comes out as one. From the graph you can see this: at p+ = 0.5 the entropy is one, and at p+ = 0 or p+ = 1 it is zero. I hope everybody is able to understand, guys: if your p+ is zero or your p+ is one, that basically means it is a pure split, so H(S) is going to be zero. Always understand: your entropy will be between 0 and 1. This C1 node is a completely impure split, because you have 50% probability of getting yes and 50% probability of getting no. H(S) is the entropy for the sample; that is the notation I'm using.

So, whenever a split happens, the first thing done is the purity test, and the purity test is done with the help of entropy. I'll also show Gini impurity, don't worry. With entropy, if I'm getting one, that basically means it is an impure split, and if I'm getting zero, it is a pure split. So this is the graph, and this graph is basically the entropy graph. Again, understand: if your probability of getting yes or no is 0.5, that is 50/50, three yes and three no, then your entropy is going to be 1; and if your probability is completely one, that basically means
either you're getting completely yes or completely no, so your entropy will be zero; that basically means it is a pure split. So at probability 0.5 you get the peak value of one, and on either side it keeps on reducing.

So here you have understood the purity test: you use entropy to find whether a node is pure or impure, and if it is impure you go ahead with the further division of the categories; you take another feature and divide again. And you can read intermediate values off the entropy graph too: if your probability is, say, 0.3, you go to the curve and read off an entropy somewhere between 0 and 1 (at p+ = 0.3 it comes to about 0.88).

Let's go ahead and discuss the second issue. I hope everybody has followed; we have discussed checking whether a split is pure, and we have understood this much. But the next thing is: okay, fine, Krish, this is very good, you have explained it well. I know many people will say that, but there are some people I can't help. Let's say that I have some features. Coming to the second problem: how do we decide which feature to take and split on? That is the second problem we are trying to solve. Let's say I have feature 1 over here with two categories, C1 and C2. At the root I have 9 yes and 5 no, and after the split I have 6 yes and 2 no in one child and 3 yes and 3 no in the other. And in my data set I have features like F1, F2, F3; another split could instead start with feature 2, and feature 2 may have three categories, C1, C2, C3. So with respect to the root node and all the other features,
because after this I may also have to take another feature and keep on splitting based on whether each split is pure or impure, how do I decide whether to take F1 first, or F2 first, or F3 first, or any other feature? That is the major question, and for this we use something called Information Gain. First I will write the formula: Gain(S, F1) = H(S) − Σ_{v ∈ Values(F1)} (|S_v| / |S|) · H(S_v). Don't worry if you have not understood it yet; I will explain each and every parameter. Let's take the feature one split you have already seen: two categories C1 and C2, where the root has 9 yes and 5 no, C1 has 6 yes and 2 no, and C2 has 3 yes and 3 no. I will now calculate the information gain of this specific split. To compute Gain(S, F1), the first thing I need is H(S), the entropy of the root node. How do we compute that? H(S) = −p₊ log₂(p₊) − p₋ log₂(p₋). Calculate along with me: what is the probability of plus over
here in this specific root node? It is nothing but 9/14, so the first term is −(9/14) log₂(9/14), and p₋ is 5/14, giving −(5/14) log₂(5/14). This calculation comes out to approximately 0.94; check whether you're getting that, and use a calculator if you want. So I have found the entropy of the root node. Now let's look at the remaining parts of the formula: what are S_v, S, and H(S_v)? H(S_v) is the entropy of each category: you need to find the entropy of category one and of category two separately. For category C1, with 6 yes and 2 no, H(S_C1) = −(6/8) log₂(6/8) − (2/8) log₂(2/8), which comes out to about 0.81. Similarly calculate H(S_C2) for category C2, with 3 yes and 3 no; since it is an even split, you get exactly 1. Now we have all these values, so we can start substituting them into the equation: Gain(S, F1) = 0.94 minus the weighted summation. And what is S_v in that summation? It is simply how many samples fall in each category: for category one it is 8 samples, out of 14 in total.
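As a quick check on these numbers, the entropy calculation can be sketched in a few lines of Python; the counts below are the 9-yes/5-no root and the two categories from the example.

```python
import math

def entropy(p_plus, p_minus):
    # H(s) = -p+ log2(p+) - p- log2(p-); a term with p == 0 contributes 0
    h = 0.0
    for p in (p_plus, p_minus):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(round(entropy(9/14, 5/14), 3))  # root node (9 yes, 5 no)   -> 0.94
print(round(entropy(6/8, 2/8), 3))    # category C1 (6 yes, 2 no) -> 0.811
print(round(entropy(3/6, 3/6), 3))    # category C2 (3 yes, 3 no) -> 1.0
```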
I go and see over here, there are 9 yes and 5 no, which basically means 14 total samples, and category one has 8 samples, so this term becomes 8/14. Then, from the equation, you multiply by H(S_v), the entropy of category one, which is 0.81. Then go back to the diagram: category two has 3 + 3 = 6 samples, so that term is 6/14 multiplied by its entropy of 1. So the entire thing is Gain(S, F1) = 0.94 − (8/14 × 0.81 + 6/14 × 1), and after the calculation you get approximately 0.048. Amazing, but I did this with feature one only. What about feature two? Suppose for the feature two split I compute Gain(S, F2) and get 0.51. Now tell me: with which feature should I start splitting, F1 or F2? You can see that Gain(S, F2) is greater than Gain(S, F1), so the answer is simple: we will definitely use feature 2 to start the split. What you should understand here is that to select the feature to start splitting with, you calculate the information gain along all the candidate splits, and whichever has the highest information gain gets selected. Now the question arises: Krish, this is good, but you had also written about Gini impurity; what is its purpose, and why is it used? So let me go ahead with Gini impurity. Yes, you can obviously use entropy, but why Gini impurity? The Gini impurity formula is 1 − Σ_{i=1}^{n} p_i², where n is the number of output classes.
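Putting those pieces together, here is a small sketch that reproduces the whole Gain(S, F1) computation directly from yes/no counts (the exact value lands near 0.048):

```python
import math

def entropy(counts):
    # Entropy of a node given its per-class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Gain(S, F) = H(S) - sum over categories v of |S_v|/|S| * H(S_v)
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# F1 splits the root (9 yes, 5 no) into C1 (6 yes, 2 no) and C2 (3 yes, 3 no)
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))  # -> 0.048
```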
Right now, how many outputs do I have? Two: yes or no. So I can expand it as 1 − (P(+)² + P(−)²). That is the formula for Gini impurity, and you can see the calculation is obviously much easier. Suppose I have a node with 2 yes and 2 no; how do I calculate it? 1 − ((1/2)² + (1/2)²) = 1 − (1/4 + 1/4) = 1 − 1/2 = 0.5. Now understand: this is a completely impure split. For a completely impure split, entropy gives you an output of one, whereas Gini impurity gives 0.5. So if I go back to the graph I created earlier, my Gini impurity curve looks similar but lower: at probability zero it is obviously zero, but when my probability of plus is 0.5, I get 0.5 instead of 1. That is the difference between Gini impurity and entropy. But you may be asking, Krish, when do we use which? The key is execution time, because a decision tree repeats this computation for a huge number of candidate splits.
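The Gini side of the comparison is a one-liner; the same 2-yes/2-no node that gives entropy 1 gives Gini 0.5:

```python
def gini(counts):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([2, 2]))  # completely impure binary node -> 0.5
print(gini([4, 0]))  # pure node -> 0.0
```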
With something like 100 features you'll keep on comparing and dividing across many, many candidate splits and computing an information gain like this each time. So which is faster, entropy or Gini impurity? In entropy you have a log function; in Gini you have simple arithmetic. Of the two, the one that takes more time is entropy. So if you have a huge number of features, like 100 or 200, and you are planning to apply a decision tree, I would suggest using Gini impurity rather than entropy; if you have a small set of features, you can go ahead with entropy. So with respect to speed, Gini beats entropy. Now you may be thinking: Krish, fine, you have explained categorical variables, but what if I have a numerical feature? Let's say I have a numerical feature F1 and an output column, with values like 2.3, 1.3, 4, 5, 7, 3. This is a continuous feature, so how will the decision tree calculate entropy and information gain for it? First of all it will sort these values, so in F1 I get 1.3, then 2.3, then 3, then 4, then 5, and then 7. Now, whenever you have a continuous
feature, this is how it will basically work. First, the decision tree takes the first value and forms the condition: is it less than or equal to 1.3? That gives you two branches, yes and no. On the yes side you'll have one record, and on the no side the remaining five records, and you can count how many yes and no labels land on each side; the yes side here will definitely be a leaf node. In this first instance the tree calculates the information gain of that split. Once that information gain is obtained, it takes the first two records and forms the next condition, less than or equal to 2.3, so now two records fall on one side, with their yes/no counts, and all the remaining records go to the other side; again the information gain is computed. Then it moves to the next value, less than or equal to 3, creates those nodes, counts the yes and no labels, and computes the information gain once more. It does this for each and every record, and finally whichever threshold gives the highest information gain is the value selected for that feature, and the node is split there. So whenever you have a continuous feature, this is how the best information gain gets found, and from there the splitting happens.
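That threshold search can be sketched as follows. The yes/no labels here are made up for illustration, and the candidate cut points are midpoints between consecutive sorted values, which is how scikit-learn does it; the "less than or equal to each value" scheme from the walkthrough behaves the same way.

```python
import numpy as np

def best_threshold(values, labels):
    # Try "x <= t" for every candidate threshold t and keep the one
    # with the highest information gain.
    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)               # sort the feature first
    values, labels = values[order], labels[order]

    parent = entropy(labels)
    best_t, best_gain = None, -1.0
    for t in (values[:-1] + values[1:]) / 2:  # midpoints as candidate cuts
        left, right = labels[values <= t], labels[values > t]
        n = len(labels)
        gain = parent - len(left)/n*entropy(left) - len(right)/n*entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

f1 = [2.3, 1.3, 4, 5, 7, 3]   # the transcript's unsorted values
y  = [0, 0, 1, 1, 1, 0]       # hypothetical yes/no labels
t, g = best_threshold(f1, y)
print(t, g)  # best cut is 3.5, with information gain 1.0 (a perfect split)
```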
Now let's go ahead and understand how this all works in a decision tree regressor, because in a decision tree regressor my output is a continuous variable. Suppose I have feature one, feature two, and a continuous output; any value can be there. How do I split in this case? Let's say the F1 feature gets selected. When it is selected, first of all the mean of the output values in that node gets calculated, and that mean becomes the node's prediction. And here the cost function used is not Gini impurity or entropy; here we use mean squared error (or you can also use mean absolute error). What is mean squared error? MSE = (1/n) Σ_{i=1}^{n} (ŷᵢ − yᵢ)², where ŷ is the node's mean prediction. So first, based on the F1 feature, it assigns a mean value and computes the MSE, and then it goes ahead and does the splitting. After a split, some records go to each child node; each child gets its own mean value as its output, and the MSE is calculated again there. As the MSE gets reduced, that basically means we are getting near the leaf node, and the same thing happens on the other branches. Finally, when you follow a path down the tree, whatever mean value is present at the leaf you reach is your output. That is the difference between the decision tree regressor and the classifier: instead of entropy and so on, you use mean squared error or mean absolute error.
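The "each node predicts its mean, and splits are scored by MSE" idea can be sketched like this; the numbers are just an illustration using the output values from the next example.

```python
import numpy as np

def split_mse(y_left, y_right):
    # Each child node predicts its own mean; the split's cost is the
    # sample-weighted MSE of the two children.
    def mse(y):
        y = np.asarray(y, dtype=float)
        return np.mean((y - y.mean()) ** 2)

    n = len(y_left) + len(y_right)
    return len(y_left)/n * mse(y_left) + len(y_right)/n * mse(y_right)

y = [20, 24, 26, 28, 30]
print(round(split_mse(y[:2], y[2:]), 2))  # -> 3.2
```

The tree would evaluate this weighted MSE for every candidate split and pick the one with the lowest value.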
Now let's go to one more topic, which is hyperparameters. Tell me: if I keep on growing a decision tree to any depth, what kind of problem will it face? But first, the regressor part once more, since someone asked. Let's do a decision tree regressor: I have feature F1 and an output with values like 20, 24, 26, 28, 30, and feature one has some categories. Say I have done the division by F1. Initially, the mean of all the output values gets assigned to the root, and using mean squared error you compute the cost, suppose some value like 37 or 47. Then I split and get two or three more nodes, depending on the categories; in each of those nodes the mean is recalculated, say when two records go to one side, and the MSE gets calculated again. I'm just taking this as an example, so try to picture it. Now, about hyperparameters: always understand that a decision tree leads to overfitting, because by default we keep dividing the nodes to whatever depth we want. In order to prevent overfitting we perform two important techniques: post-pruning and pre-pruning. Let's say I have done some splits and at one node I have 7 yes and 2 no, and I could split further. In this scenario you know that with 7 yes and 2 no there is close to an 80% chance that this node is saying the output is yes. So should we prune further down? The answer is no: we can stop here and cut the branch. This technique is
basically called post-pruning. That basically means you first create your full decision tree, then look at it, see whether there is an unnecessary branch, and just cut it. There is one more thing, called pre-pruning. Pre-pruning is decided by hyperparameters. What kind of hyperparameters? Not the number of decision trees; here you can set things like the max depth and the maximum number of leaf nodes. You can tune all these parameters with GridSearchCV, try combinations, and come up with a pre-pruning setup. So that is the idea of the decision tree regressor. (Question from chat: is it possible for the Gini value to be one?) No; for a binary problem it will always be between 0 and 0.5. Now, first things first, as usual we should import the libraries: import pandas as pd, import matplotlib.pyplot as plt, and so on. Then I will take any data set I want: from sklearn.datasets import load_iris, and load the iris data set by calling load_iris(). Then the next step, once you have your iris data set, is to look at iris.data.
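The pre-pruning idea just described, tuning max depth and leaf count with GridSearchCV, might look like this on the iris data used below; the particular parameter grid is just an example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Pre-pruning: constrain tree growth via hyperparameters,
# chosen by cross-validated grid search
param_grid = {"max_depth": [2, 3, 4, None],
              "max_leaf_nodes": [3, 5, 10, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_, round(grid.best_score_, 3))
```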
These are all my features; the four features are petal length, petal width, sepal length, and sepal width. This is my set of independent features. Then, if I want to apply a classifier, a decision tree classifier, I first import it: from sklearn.tree import DecisionTreeClassifier. (I misspelled sklearn at first and got a "no module" error, but once the spelling is right it imports fine.) Right now I'm just going to overfit the data, and then I'll show you how you can go ahead with pruning. What are the default parameters? If you look at the classifier, the first parameter is criterion, which is gini by default. Then you have splitter, which controls how you're going to split; there are two types, best and random (random selects the features randomly), and you should generally go with best. max_depth is a hyperparameter, min_samples_leaf is a hyperparameter, and max_features, how many features to consider when fitting, is also a hyperparameter. So all of these are hyperparameters. I will just execute the decision tree with the defaults, and the next thing I'm going to do is draw the decision tree. For this I use plt.figure with a figure size, say figsize=(15, 10), so everybody can see it better, and then tree.plot_tree(classifier, filled=True) so the node colouring is filled in.
Okay, I also have to import tree, so: from sklearn import tree. Again I'm getting an error, "has no attribute plot"; checking the documentation, the function is plot_tree, with an underscore. Now what is the next error? "Not fitted yet", sorry, so I fit first: classifier.fit(iris.data, iris.target). Once that is done it executes, and this is how your graph will look. You can see some amazing things here: there are three output classes. On the left-hand side you get a leaf node straight away; this first one is probably the versicolor flower. On the right-hand side you can see a 50/50 node, so based on one feature you get a leaf node, and based on another branch you get the 50/50 split, which then splits on two more features. Here you have 49 and 5, and here you have 47 and 1. Do we require this split? Anybody tell me: from here, do we require any more splits? Just think; this is the post-pruning view, where I want to find out whether more splits are required or not. After a 49-and-5 node, do you require any more splits? You do not. Here you are getting 47 and 1; after this also you require no further split. Understand this: that is basically post-pruning, so you can decide your level and cut accordingly. (Question: the Gini value here is more than 0.5?) Yes, one node shows 0.667, and that is because with three output classes the maximum Gini is 1 − 3 × (1/3)² ≈ 0.667, so the 0-to-0.5 bound only applies to binary problems; everywhere else you can see you're getting less than 0.5. So plotting the graph itself is very easy.
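Putting the fixes from that walkthrough together (the tree import, the fit call before plotting), a cleaned-up version of the script might look like this; saving to a file instead of plt.show() is my addition so it also runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop these two lines in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier()   # defaults: criterion="gini", splitter="best"
clf.fit(iris.data, iris.target)  # must fit first, or plot_tree raises NotFittedError

plt.figure(figsize=(15, 10))
plot_tree(clf, filled=True, feature_names=iris.feature_names)
plt.savefig("iris_tree.png")
```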
You use from sklearn import tree, then call tree.plot_tree on the fitted classifier with filled=True, and you can just do this. So let me define the agenda, what all things are there. First we'll understand ensemble techniques, and within ensemble techniques we are going to discuss the difference between bagging and boosting. Then we are going to cover random forest, then AdaBoost, and if I have more energy I will also try to cover XGBoost. We'll discuss all these algorithms, so let's go ahead and start the topics. The first topic is ensemble techniques. What exactly are ensemble techniques? Till now we have solved two different kinds of problem statement, classification and regression, and you have learned different algorithms: linear regression, logistic regression, KNN, and yesterday we discussed naive Bayes. With respect to any classification or regression problem, we were discussing only one algorithm at a time and trying to solve the problem with that single algorithm. Now the next question is: can we use multiple algorithms to solve a problem? If I ask that, I will definitely say yes, we can, because we are going to use something called ensemble techniques. Now, what are these ensemble techniques?
In ensemble techniques we use two different approaches: one is something called the bagging technique, and the other is something called the boosting technique. What exactly do we do in bagging, what do we do in boosting, and how do we combine multiple models to solve a problem? Let's first discuss bagging. How does bagging work? Let's say I have a data set D with rows and columns, many features like F1, F2, F3, and an output. Now, in bagging we create models, and each model can be anything: for a classification problem, say model M1 is logistic regression, M2 is another model like a decision tree, M3 is a KNN classifier, and M4 is again a decision tree; that's fine. So you can see we have used several models. Now, with respect to these models, the first step is that from the data set I take up some rows: I do row sampling and create a sample D′, where D′ is always smaller than D, and I push those rows to M1, which trains on them. Let's say that out of 10,000 records I'm doing a row sampling of 1,000
rows and giving them to M1 to train on. Then for model M2 I again do row sampling, sample some of the rows, and give them to it; and remember, some rows may get repeated between this D′ and the next D′′. Similarly I do row sampling for the other models, so I may have D′′′ and D′′′′ as well: different data points (when I say row sampling, I'm talking about data points) go to separate models, and each model trains on its own sample. So if 10,000 is my total number of data points, D′ may be 1,000 points, D′′ may be another 1,000 points with some rows repeated, and so on; here specifically row sampling is used, and each model gets trained on a different slice of the data. Now, how does inferencing happen for test data? First things first, let's say I get a new test record. It is passed to M1, and suppose M1 gives zero as its output (say we're doing binary classification). Next, for the same test record, M2 gives one, M3 gives one, and M4 also gives one. Now what happens in this case? M1 has predicted zero, M2 has predicted one, M3 one, and M4 one, so finally all these outputs get aggregated, and the simple rule that gets applied is majority voting.
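A from-scratch sketch of this row-sampling-plus-voting scheme follows; the data set and the specific model choices are made up for illustration, and ties here fall to class 0 (with 100+ models, ties are rare, as noted below).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)

# Heterogeneous models, as in the example: logistic, decision trees, KNN
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=1)]

# Row sampling: each model trains on its own D' (rows drawn with
# replacement, so records can repeat across samples)
for m in models:
    idx = rng.integers(0, len(X), size=len(X))
    m.fit(X[idx], y[idx])

# Bootstrap aggregation: majority vote across the models' predictions
preds = np.array([m.predict(X[:5]) for m in models])
majority = (preds.sum(axis=0) > len(models) / 2).astype(int)
print(majority)
```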
So tell me, what will the output be with respect to this? The output will obviously be one, because by majority voting you can see three models are saying one, so my output here will be one. This is the concept of bagging: you are providing different rows (with, in this case, all the features) to different models, which here are classification models, and then you are combining them by majority voting and getting the answer, one. This aggregation step is what makes it a bootstrap aggregator: you're aggregating the outputs coming from all the individual models. Now many people will ask: Krish, what about a tie, where 50% of the models say yes and 50% say no? Always understand, guys, that in this kind of setup we will have 100 to 200-plus models, so there is a very high probability that a clear majority will always be available; an exact tie is an unlikely scenario. Some people will also say: Krish, why are you using different models? I am not discussing random forest here; random forest uses only one type of model, the decision tree. But as a general concept of bagging, you can have different models and combine them. So this is one family of ensemble techniques, and it is basically called bagging. Now, one point I missed: this was with respect to the classification problem. What will happen in the case of a regression problem? Let's say I got 120 here, 140 here, 122 here, and 148 here as my outputs. In regression, what will
happen is that the mean will be taken: the mean of the outputs will be your model's final output. Average or mean, very simple, right? So the average is taken, and based on that you solve the regression problem. Great. Now let's go ahead and look at which algorithms fall under bagging and boosting; but before that I need to make you understand what exactly boosting is. In bagging you have seen that you have parallel, independent models: you give row samples to the different models and combine their outputs. In boosting, by contrast, you have a sequential combination of models: a chain of models one after another, so the training data goes to the first model, then on to the next, and the next, and so on; these are my M1, M2, M3, M4, and finally I get my output. These M1, M2, M3 are what we call weak learners, and once I combine all these weak learners, the result becomes a strong learner. So you have all the models sequentially one after the other, you pass from one model to the next, and each of these models on its own is a simple weak learner that will not predict well; but when you combine them sequentially, you get a strong learner. How exactly this works I'll show with the examples of AdaBoost and XGBoost. A weak
learner basically means the prediction on its own is quite bad, but as you go sequentially and combine them, they become a strong learner. One example I want to give you: let's say model one is a physics teacher, model two is a chemistry teacher, model three is a maths teacher, and model four is a geography teacher. If you are trying to solve a problem and the physics teacher is not able to solve it, then perhaps the chemistry teacher can help, or the maths or geography teacher. When we combine that much expertise together, they will be able to give you the output in an efficient way. (Sumit, I'll come to whether all the features are passed to all the models or not; just give me some time.) In short, if someone asks in an interview what boosting is, you can say: it is a sequential set of models combined together; each individual model is a weak learner, and combined they become a strong learner that gives an excellent output. And right now, most Kaggle competition entries use some type of boosting or bagging technique. So, as I said, we have bagging and boosting. In bagging, the algorithms we specifically use are the random forest classifier and the random forest regressor, which I'm going to discuss right after this. In boosting, we use techniques like AdaBoost, Gradient Boost, and number three, Extreme Gradient Boosting, which we also call XGBoost. So let's go ahead and discuss
the first algorithm which is called as random forest classifier and regressor now first thing first let's understand some things from the yesterday's class I hope uh what is the main problem with respect to decision tree whenever we create a decision tree without any hyperparameter it does it not lead to overit does it not lead to overfitting uh whenever you probably have a decision tree right it leads to something like overfitting why overfitting because it completely splits all the feature till it's complete depth overfitting basically means for training data the accuracy is high for test data the accuracy is low so training data when the accuracy is high I may basically say it as high bias and then I may basically say it as sorry not high bias low bias and high V variance so low bias and high variance yes obviously we can do pruning and all guys but again understand pruning is an extensive task probably if your if you have 100 features if you have data points which is like 1 million to do pruning also it is very much difficult yes pre pruning can be done but again we cannot confirm that it may work well or not so right now with respect to decision tree you have this specific problem that is low bias and high variance now in low Biance and high variance you know that my model is basically the generalized model that I should get it should have low bias and low variance so if somebody asks you why do you use random Forest you can basically explain about decision trees like this now my main aim is to convert this High variance to low variance now I will be able to convert this High variance to low variance using random forest classifier or random Forest regressor now what does random Forest do random Forest is a bagging technique similarly I have a data set over here let's say that I have this data set and then here I will be having multiple models like M1 M2 M3 M4 let's say I have this four models like this we have many many models now with respect to this models 
this models all the models are actually decision Tree in random forest all are decision trees you don't have a different model over there so over here you can see that all the models are decision trees that is going to get used used in random Forest so decision trees always gets used in random Forest the first thing that you should know now whenever we are using decision trees you know that decision tree if I by default if we try to create it it may lead to overfitting and because of that every decision tree will basically create low V low bias and high variance but if we combine in the form of bootstrap aggregator this High variance will be getting converted to low variance because why because majority of voting we will be taking from this particular decision trees like there will be many many decision tree so they lot of outputs will be coming and with the help of majority voting classifier this High variance will get converted to low variance now in random Forest how it works in the first case if I talk about random Forest over here two things basically happen with respect to the D- data set let's say in first model we do some kind of row sampling plus Feature Feature sampling that basically means we have to select some set of rows and some set of features and give it to M1 similarly you do row sampling and feature sampling and give it to M2 then you do row sampling and feature sampling you give it to M3 and then you do row sampling and feature sampling you give it to M4 now when you do this so what will happen independently you're giving some features along with some rows now there may be a situation that your features may also get repeated it may also get repeated your records or data points may also get repeated so when you are probably training your model with this specific data sets and specific features this model become expert in predicting something right as I said one example over here I'm giving a physics model some data I'm giving chemistry data 
chemistry model with some data similarly here I'm giving some information to some model so the model will be an expert with respect to that specific data So based on all this particular data whenever I get a new test data so what will happen suppose let's say that this this is a classification problem the M1 model will be predicting zero this will be predicting one this will be predicting zero and this will be predicting zero now in this particular case again the majority voting classifier or majority voting will happen in the case of classification problem and then here you will be specifically able to get the output as zero so I hope everybody is able to understand all the models over here are decision trees and based on that you will be doing see when in I interview should be very very uh things the things that I'm telling you over here is all all the points are very much important and similarly if you tell the interviewer definitely your interview is cracked in this kind of algorithm I've seen some of my students saying that okay uh Kish um when the interviewer asked me that which is my favorite algorithm I said random Forest I told why did you say like that because he said that because that person let me let him ask any questions in random Forest I'm very much confident about it and I'm also going to prove him you know why they are very very good so with this specific case here you can basically see that because of the overfitting condition of the decision tree you're combining multiple decision tree so that you get a generalized model which has low bias and low variance so I hope everybody is able to understand boost feature sampling basically means suppose if I have 1 2 3 four feature for the first model I may give two features for the second model I may get three features for the fourth model I may give four features or uh any one feature ALS I can give to a specific model so internally that random Forest it take carees of over here these things are there 
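The row sampling, feature sampling, and majority-voting steps just described can be sketched by hand. This is a minimal illustration, not the real `RandomForestClassifier` (which you would normally use directly from scikit-learn); the toy data set, number of learners, and sample sizes are invented for the example.

```python
# Hand-rolled bagging sketch: bootstrap rows + a random subset of features
# per decision tree, then a majority vote over all trees.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                 # 4 features, toy data
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # only features 0 and 1 matter

models = []
for _ in range(25):                           # 25 bagged weak-ish learners
    rows = rng.choice(len(X), size=len(X), replace=True)   # row sampling (with repetition)
    feats = sorted(rng.choice(4, size=2, replace=False))   # feature sampling
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[np.ix_(rows, feats)], y[rows]) # each tree sees its own slice
    models.append((feats, tree))

def predict_one(x):
    # Every tree votes; the majority class wins (classification case).
    votes = [int(t.predict(x[f].reshape(1, -1))[0]) for f, t in models]
    return Counter(votes).most_common(1)[0][0]

acc = np.mean([predict_one(x) == label for x, label in zip(X, y)])
print(f"bagged ensemble training accuracy: {acc:.2f}")
```

For the regressor variant, the only change is the aggregation step: replace the majority vote with the mean of the tree outputs.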
And this is how random forest works. The only difference between the random forest classifier and the random forest regressor is that in regression, instead of a majority vote, you take the mean of all the model outputs: you just average them, and that average is your final prediction.

Now let's talk about some important interview points around random forest. First question: is normalization required in random forest? (When I say normalization or standardization, I'll just say standardization.) You can ask the same for a decision tree. The answer is no, because a decision tree works by splitting on features; if you scale the data down, the splits carry the same information, so it does not matter much. But if I ask whether standardization or normalization is required in KNN, the answer is yes, because KNN uses two distance measures, Euclidean distance and Manhattan distance, and you definitely have to apply standardization so that the distance computation behaves properly. This is one of the most common interview questions asked around random forest. Third question: is random forest impacted by outliers? The answer is no; do go and check this out on Google as well. Fourth question: is the KNN algorithm impacted by outliers? The answer is a big yes. So these are all interview questions that need to be covered.

Now let's go ahead and discuss AdaBoost. In bagging, most of the time we use random forest, or you can also create custom bagging techniques: take whatever combination of algorithms you want and combine their outputs; this you can even do manually, by hand. So the second thing we are going to discuss is the boosting technique, and the first boosting algorithm we will cover is AdaBoost; we are going to see how AdaBoost works. In boosting, you have heard that we solve things in a sequential way; I know there is a lot of confusion around this, so let's work through a problem. Suppose I have a data set with features f1, f2, f3, f4 and an output column with values like yes or no, and suppose I count the records: 1, 2, 3, 4, 5, 6, 7, so there are seven records.

The first thing we do in AdaBoost is define a weight, and it is very simple: initially, we give all the input records an equal weight. How? We count how many records there are, seven in this case; every record gets an equal weight between 0 and 1 such that the overall sum is 1. So if I give 1/7 to every record, that is an equal weight for everyone, and if I do the total sum it will obviously be 1.

Now, what do we do after this? The first thing in AdaBoost is that we take one of the features to split on. How do you decide whether to go with f1, f2, or f3? We do it with the help of information gain along with entropy or Gini impurity, exactly as in a decision tree, because here too you make decision trees. Suppose out of feature one, feature two, feature three, you select feature one because its information gain is highest, so you use f1 and split on it. But when I divide on it, this decision tree's depth will be only one, and since it has only one level of depth we call it a stump. So what we do here is create a decision tree by taking only one feature and dividing only to one level, one depth, and that is called a stump. And that stump is what AdaBoost uses as its weak learner; there is a reason we call it a weak learner, which is exactly that it is only a one-level tree. So the first point about AdaBoost: the weak learner is a stump, a one-level decision tree, where the feature is selected based on information gain and entropy.

The next step: we pass all the records through this stump built on f1 and train it, training with only this one-level decision tree. After training, we pass all the records through again to find out how many this decision tree gets correct and how many wrong. Let's say that out of the entire set of records, exactly one record was predicted wrong by this model. Now what do we do? We calculate the total error: how many of them are wrong, weighted by their weights. Only one is wrong, and its weight is 1/7, so the total error (TE) of this stump, my f1 stump, is 1/7.

That was the first step. The second step is to check the performance of this stump, and the performance is checked by a formula: performance of stump = (1/2) log_e((1 - TE) / TE). Why we are doing this, everything will make sense in just a little while. So the first step in AdaBoost is to find the total error, and the second step is to find the performance of the stump. In this case it is (1/2) log_e((1 - 1/7) / (1/7)), and once I calculate it, it comes out to approximately 0.895.

Now see the steps. Whenever I'm discussing boosting, I'm going to combine weak learners together to get a strong learner. So what will my third step be? The third step is to update all these weights, and that is exactly the reason why I calculated the total error and the performance of the stump. So I'll say: new sample weights from decision tree one, which is my stump. Why do I need to update all these weights? For the correct records, whichever were classified correctly, the updated weight should reduce; and for the wrong records, the update should increase the weight. Why? Because if I increase the weight of a wrong record, that record should go to the next weak learner. So how do we update the weights? For correct records, the formula looks like this: new weight = weight x e^(-performance of stump). So I write 1/7 x e^(-0.895), and if you do the calculation, everybody try it, the answer will be approximately 0.05.
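The total error, performance-of-stump, and weight-update arithmetic can be checked numerically. This sketch follows the lecture's 7-record example and also runs the normalization and bucket steps the lecture walks through; note that if you keep full precision instead of rounding 0.0583 down to 0.05, the normalized weights come out as roughly 0.083 per correct record and 0.50 for the wrong record (the lecture's 0.077 and 0.537 come from the rounding). The choice of which record is the wrong one is just illustrative.

```python
# AdaBoost bookkeeping for one stump on a 7-record data set,
# with exactly one misclassified record.
import math

n = 7
w = [1 / n] * n                 # step 0: equal initial sample weights (sum = 1)
wrong = 3                       # index of the misclassified record (illustrative)

total_error = w[wrong]          # step 1: TE = sum of weights of wrong records
perf = 0.5 * math.log((1 - total_error) / total_error)   # step 2: performance of stump

# Step 3: update weights -- e^(-perf) shrinks correct records,
# e^(+perf) grows the wrong record.
new_w = [w[i] * math.exp(perf if i == wrong else -perf) for i in range(n)]

# Step 4: the updated weights no longer sum to 1, so normalize them.
s = sum(new_w)
norm_w = [x / s for x in new_w]

# Step 5: cumulative "buckets" over [0, 1]; a uniform draw in [0, 1)
# lands in the wrong record's (widest) bucket most often, so that record
# is preferentially resampled for the next stump.
buckets, acc = [], 0.0
for x in norm_w:
    buckets.append((acc, acc + x))
    acc += x

print(f"TE = {total_error:.4f}, performance = {perf:.4f}")
print(f"correct-record weight = {new_w[0]:.4f}, wrong-record weight = {new_w[wrong]:.4f}")
print(f"normalized wrong-record weight = {norm_w[wrong]:.4f}")
```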
Now that was for correct records; what about the incorrect records? For incorrect records, the formula we apply is: new weight = weight x e^(+performance of stump), with a plus instead of a minus. So here I write 1/7 x e^(0.895), and if I calculate this, I get approximately 0.349. So these are the two updated weights I have got. All the records which are correct started at 1/7; their new updated weight is 0.05. You can see that initially the weight was about 0.142, and now it has reduced to 0.05 because all these records are correct; but the wrong record's value has increased to 0.349. So, counting through my seven records with the fourth one being the wrong record, my new weight column is 0.05, 0.05, 0.05, 0.349, 0.05, 0.05, 0.05.

Now tell me, guys: if I do the summation of all these new weights, is it 1? No, I don't think so; if you add them up, it is not 1, whereas the original weights did sum to 1. So here I need to find my normalized weights. To normalize, because the entire summation should be 1, all you have to do is find the sum of all these values, which will be about 0.649, and divide every number by 0.649. Tell me the answers you get: the normalized weights will look like 0.077 for each correct record and about 0.537 for the wrong record. This is my normalized weight column.

After you get the normalized weights, we create something called buckets. See, one decision tree, the stump, has already been created, and you know what output you get from it; now, in the sequential model, we attach another model after it. To create this second model, I need to provide some specific rows for it to train on, and because the first model got one record wrong, I want to feed the second model that wrong record along with some of the other data points, so that it can train on them and get the output right. So let's create the buckets. The buckets are cumulative ranges built from the normalized weights: the first bucket runs from 0 to 0.077, the next from 0.077 to 0.154, then 0.154 to 0.231; then comes the wrong record's bucket, from 0.231 to 0.768, because 0.231 + 0.537 = 0.768; and like this you keep adding each record's normalized weight to create all the buckets until you reach 1. To decide which records go to the next stump, we randomly generate numbers between 0 and 1, and whichever bucket each number falls into, that record gets selected. Now tell me, which record has the biggest bucket size? Obviously the wrong record. So if I randomly generate numbers between 0 and 1, where is the highest probability that the values land? In this case, most of the wrong records will be passed along, together with some of the other records; there are chances the other records also go to the next decision tree, but understand that the maximum share goes with the wrong records, because their bucket is the biggest. So most of the time this particular wrong record will get selected, and it will go to the second stump.

Suppose I now have all these records, and this is my first stump, this is my second stump, this is my third stump. In the same way, whichever records come out wrong from the second stump will go in maximum numbers to the third stump, and again it gets trained. Like this we will have a lot of stumps; a minimum of around 100 decision trees can be added. You know that for a new test data point, every decision tree, every weak learner, gives one output (obviously the time complexity will be more). From these outputs, suppose it is a binary classification and I get 0, 1, 1, 1: again a majority vote happens, and the output is 1. In the case of a regression problem, I will have continuous values, and for these the average will be computed, and that gives me the output. So for regression the average is taken, and for classification the majority vote happens; everywhere the same pattern carries on. Buckets are very simple, guys: based on the normalized weights we create the buckets, so that whichever records have the biggest buckets are the ones the random draws are most likely to land in; based on this random number generation it will select those specific records and
pass them into the next stump. Understand why this bucket size matters: if there were more wrong records, say four or five of them, their bucket sizes would all be bigger, and based on the random numbers drawn between 0 and 1, most of the wrong records would get selected and given to the second stump. Similarly, this second decision tree will make some mistakes, those wrong records will have their weights updated, and they will be passed to the next decision tree. And guys, when I say a wrong record is passed on, its output label stays the same, still just 0 or 1. Interesting, everyone? I hope you understood; that was a lot of maths in AdaBoost and how AdaBoost actually works. Three main things are being calculated: the total error, the performance of the stump, and the new sample weights; and the normalized weights are used because the raw updated weights no longer sum to 1. Someone asked: in boosting, why not just take the last model's output? No, no, no; we have to give importance to every decision tree's output; every decision tree's output is important.

Okay, let me talk about one more concept: the black box model versus the white box model. What is the difference? If I take the example of linear regression, tell me what kind of model it is: white box or black box? What about random forest? A decision tree? An ANN? Linear regression is called a white box model, because you can visualize how the theta values are changing and how they are coming to the global minimum, and all those things. Random forest I would call a black box model, because it is practically impossible to see how all the decision trees are working; that is why the maths inside it is so complex. If I talk about a decision tree, it is a white box model, because we know how the splits are happening; with paper and pen you would be able to work it out. In the case of an ANN, it is a black box model, because you don't easily see how many neurons there are, how they are performing, and how the weights are getting updated. So that is the basic difference between black box and white box models, and that entire discussion was the agenda of the earlier session.

Now let's start. The first algorithm we are going to discuss today is something called K-means clustering, and this is a kind of unsupervised machine learning. Always remember: in unsupervised ML, the one most important thing is that you don't have any specific output. So suppose you have feature one and feature two, and you have lots of different data points; based on this data, what we do is create clusters, and these clusters tell us which data points are similar to each other. That is what we get from clustering, and there are various techniques like K-means, hierarchical clustering, and so on. First we'll try to understand K-means and how it specifically works. It's simple: suppose you plot your data points in two dimensions, F1 against F2, and suppose there is one batch of points here and another batch there. Our main purpose is to cluster them into different groups: this will be one group and that will be the other group, two groups, because from the clusters you can see two similar kinds of data grouped together. This is my cluster one and this is my cluster two. Let me talk about this and why
specifically it's very useful, and then we'll also try to understand the math intuition. Now always understand, guys, where does clustering get used? In most of the ensemble techniques, the custom ensemble techniques I told you about: whenever we are creating a model, the first thing we can do on our data set is create clusters. So suppose this is my data set; during model creation, the first algorithm we apply can be a clustering algorithm, and after that it is perfectly fine to apply a regression or classification model. Suppose the clustering gives two or three groups: for each group we can apply a separate supervised machine learning algorithm, if we know the specific output we really want to predict. I'll talk about this and give you some examples as I go ahead.

Now let's go ahead and focus more on understanding how the K-means clustering algorithm works. The word K-means has this K value, and the K is nothing but the number of centroids. So suppose I have a data set that looks like this; just by seeing the data set, what are the possible groups you think there are? Definitely you'll say K = 2. When you say K = 2, that means you will get two groups, and each and every group will have a centroid point; the centroid determines that this is one separate group over here and that is another separate group over there. So here you can definitely say these are two groups. But how do we come to the conclusion that there are only two groups? We cannot just decide by looking at the data, because your data will have high dimensionality; right now I'm showing you two-dimensional data, but for high-dimensional data you will definitely not be able to see how the data points are plotted. So how do you come to the conclusion that there are only two groups? For this, there are some steps that we perform in K-means.

The first step is that we try different K values, different numbers of centroids, and find which K value is suitable; for this we need to know a concept called the within-cluster sum of squares, which I'll come to. So, step one: we try K values; let's say we are considering K = 2 (how to confirm that K = 2 is the right value, I'll talk about shortly). The second step is that we initialize K centroids. In this particular case my K value is 2, so I will initialize two centroids randomly in the space: let's say this one, which I'll put in one color, is my first centroid, and this other one is my second centroid. After initializing these centroids, what we have to do is find out which points are nearer to the first centroid and which points are nearer to the second. That is a very easy step: we can use the Euclidean distance to find the distance between each point and each centroid. And if I really want to show you this in an easy, visual way, I can basically draw a straight line
over here let's say that I'm drawing a straight line over here in another color I can draw a straight line and I can also draw one parallel line like this so This basically indicates that whichever points you see over here suppose if I draw a straight line in between all these points you will be able to see that let's say that I'm drawing one more parallel line which is intersecting together so from this you can definitely find out let's say that these are all my points that are nearer to this green line Green Point so what I'm actually going to do in this particular case all these points that you are seeing near the green it will become green color so that basically means this is basically nearer to this centroid and whichever points are nearer to this particular point that will become red point so that basically means this belongs to this group okay this belongs to this group so I hope everybody's clear till here then what will happen is that this summation of all the values then we initialize the K number of centroids that is done then we try to calculate the distance we try to find out which all points is nearer to the centroid let's say that this is my one centroid this is my another centroid and we have seen that okay these all points belong to this centroid it near to this particular centroid so this is becoming red so that is based on the shortage distance and here it is becoming green now the next step let's see what is the next step after this so I am going to remove this thing now the next step will be that the entire points that is in red color all the average will be taken so here again the average will be taken now third step here I'm going to write here we are going to compute the average the reason we compute the average is that because we need to update the centroid so compute the average to update centroid to update centroids so here you'll be able to see that what I'm actually doing as soon as we compute the average this centroid is going to move 
to some other location so what location it will move it will obviously become somewhere in Center so here now I'm going to rub this and now my new centroid will be this point where I am actually going to draw like this let's say this is my new centroid now similarly this thing will happen with respect to the green color so with respect to the green color also it will happen and this green will also Al get updated so I'm going to rub this and this will be my new Green Point which will get updated over here then again what will happen again the distance will be calculated and again a perpendicular line will be calculated here you can see that now all the points are towards there okay again the centroid based on this particular distance again it will be calculated and here you can see that all the points are in its own location so here now no update will actually happen let's say that there was one point which was red color over here then this would have become green color but since the updation has happened perfectly we are not going to update it and we are not going to update the centroid right so now you can understand that yes now we have actually got the perfect centroid and now this will be considered as one group and this will be basically considered as the another group it will not intersect but right by default here intersection is happening so I hope everybody's understood the steps that you have actually followed in initializing the centroids in updating the centroids and in updating the points is it clear everybody with respect to K means now let's discuss about one point how do we decide this K value okay how do we decide this K value so for deciding the K value there is a concept which is called as elbow method so here I'm going to basically Define my elbow method now elbow method says something very much important because this will actually help us to find out what is the optimized K value whether the K value should be two whether uh the K value is 
going to be three, or whether the K value is going to be four. And always understand: suppose this is my data set, and initially my data points look like this. We cannot go ahead and directly say that K equal to 2 is going to work, so obviously we are going to iterate, say for i equal to 1 to 10. For every iteration we will construct a graph of the K value against something called WCSS. Now what is this WCSS? WCSS basically means within-cluster sum of squares; that is the meaning of WCSS. Now let's say that initially we start with one centroid, and that one centroid is initialized here. If we compute the distance between each and every point and the centroid, will the total be greater or smaller? Tell me: if you calculate this distance from this centroid to every point, which is what within-cluster sum of squares is, it will always be very, very large. Let's say my first point comes somewhere here; it is obviously going to be large. So with K equal to 1 we computed the WCSS and found it is a very huge value, because we compute the distance between each and every point and the single centroid. The next thing I'm going to do is go to the next value, K equal to 2. With K equal to 2, I will initialize two centroids, and then I will do the entire process that I have written on the top. Now tell me: we compute the distance for whichever points are nearer to this green point, and for whichever points are nearer to the red point, like this.
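As an aside, the "entire process written on the top", initialize the centroids, assign each point to its nearest centroid, recompute the averages, can be sketched in plain NumPy. This is a minimal illustration on made-up blob data, not the instructor's code or scikit-learn's implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (shortest distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: compute the average of each group to update its centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: the clustering has converged
        centroids = new_centroids
    # WCSS: within-cluster sum of squared distances, used by the elbow method
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss

# Made-up data: two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids, wcss = kmeans(X, k=2)
```

With two well-separated blobs the loop converges in a few iterations, and the returned `wcss` is the quantity the elbow method plots against K.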
Now, will this summation of the distances be less than the previous WCSS or not? Obviously it is going to be less than the previous WCSS. So with K equal to 2 your value may come somewhere here, then with K equal to 3 your value may come somewhere here, then K equal to 4 will come here, then 5, 6, and so on. If I join these points with a line, you'll be able to see that there is an abrupt change in the WCSS value, and this is basically called the elbow curve. Why do we call it the elbow curve? Because it is in the shape of an elbow: at one specific point there is an abrupt change, and after that it becomes almost flat. That is the reason we call this the elbow. This is a very important thing: for finding the K value we use the elbow method, but for validation purposes, to check that the model is performing well, we use the silhouette score, which I'll show you in some time. But understand that in K-means clustering we need to keep updating the centroids, and based on that we calculate the distances, and as the K value keeps increasing you'll see that the WCSS value levels off. So we really need to find the K value where the abrupt change happens: see, over here, suppose the abrupt change is here and after that it is flat; then I will take this as my K value. Obviously the model complexity will be higher, because we are going to check different K values and their WCSS values. So first we construct this elbow curve, then we see where the change is happening, we find the abrupt change, and once we get it we say that this may be the K value, K equal to 4 as an example. So, to summarize, if you really want to find the
clusters, it is very simple: we take a K value, we initialize K number of centroids, we compute the averages to update the centroids, then again we find the distances, check whether any point has changed its group, and continue that process until we get separate, stable groups. So this is the entire funda of K-means clustering. Finally, you'll see that with respect to the K value we will get that many groups: if my K value is four, that basically means I will get four different groups, like this, one, two, three, four, with K equal to 4, that is, four clusters, and every group will have its own centroid. Centroids are very important; yes, I'll show you this in the coding also. Guys, let's go toward the second algorithm. The second algorithm that we will be discussing is called hierarchical clustering. Now hierarchical clustering is very simple, guys. Let's say these are your data points on axes X and Y, and I name them like this: this is my P1 point, this is my P2 point, this is my P3 point, this is my P4 point, P5 point, P6 point, P7 point. These are the points I have named over here. Let's say these two may be the nearest points to each other; so hierarchical clustering will combine them together into one cluster, since we have computed the distance, and it will create one cluster. Now, on the right-hand side there will be another notation that we use for connecting all the points: suppose this is my P1, this is my P2, this is my P3, P4, and so on, and I will also draw P7
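The step-by-step merging being walked through here, repeatedly joining the two nearest points or groups, is what SciPy's agglomerative clustering automates. A minimal sketch with made-up 2-D points; the `single` linkage choice (nearest-point distance) is my assumption for illustration, not something from the lecture:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Each row of Z records one merge: which two groups joined, and at what distance
Z = linkage(X, method="single")

# Cutting the tree at the biggest vertical gap gives the flat clusters;
# here we ask for 2 clusters directly
labels = fcluster(Z, t=2, criterion="maxclust")
# dendrogram(Z)  # with matplotlib, this draws the bottom-to-top merge tree
```

For n points, `linkage` performs n-1 merges, which is why building the full tree gets expensive on large data sets.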
So these are my points, up to P7. Now, between these points we have pairwise distances: distance 1, 2, 3, 4, 5, 6, and so on; we have a lot of distances. Hierarchical clustering will first find the nearest pair of points, compute the distance between them, and combine them together into one group. So let's say P1 and P2 have been combined. Then it will find the next nearest pair: let's say P6 and P7 are near, so they are also combined into one group, and since their distance is obviously greater than the previous one, we get this kind of structure, and another cluster gets formed over here. Then we see that P3 and P5 are nearer to each other, so we combine P3 and P5, and let's say this distance is greater than the previous one, because we basically start with the shortest distance and work toward the longest distance. Now, the next point that is nearest to this particular group is P4, so we combine them into one group, and this P4 gets connected like this. Then, which is the nearest group: the P6-P7 group or P1-P2? Here you can see it is P1-P2, so I am going to combine these groups together; since a circle would overlap here, I will make a dot. That basically means P1-P2 gets combined with the P3-P5-P4 group, so I get another line like this. And then, finally, you'll see that P6-P7 is the
nearest group to this, so everything gets combined and it may look something like this: all the groups are combined, and finally there is one more line joining everything. This structure is called a dendrogram, which goes from the bottom leaves up to the top root. Now the question arises: how do you find how many groups there should be? The funda is very clear, guys: you need to find the longest vertical line that has no horizontal line passing through it. This is very important: no horizontal line should pass through it. What this means is that I will find the longest vertical line such that none of the horizontal lines crosses it. What is a horizontal line here? Suppose I consider this vertical line over here: if I extend this green line, it passes through this one; if I extend this line, it passes through this one, and so on. So out of these, the longest vertical line with no horizontal line passing through it is probably this line that I can see. What you do is draw a horizontal cut across at that level, and then you find how many clusters there will be by counting how many vertical lines the cut passes through. If it passes through one line, two lines, three lines, four lines, that basically means you will have four clusters. This is how we do the calculation in hierarchical clustering. Again, this may not be the perfect line; I've just drawn it with some assumptions, but if you are doing this, you have to do it in this specific way. I've already uploaded a lot of practical videos with respect
to hierarchical clustering and all. Now tell me: is maximum effort, or maximum time, taken by K-means or by hierarchical clustering? This is a question for you. Yes guys, the number of clusters may be three; here I'm just showing you how many lines the cut may pass through. So how do you determine whether maximum time will be taken by K-means or hierarchical clustering? This is an interview question. The maximum time is taken by hierarchical clustering. Why? Because, let's say I have very many data points; at that point hierarchical clustering will keep on constructing these kinds of dendrograms, and it will take a lot of time. So hierarchical clustering will take more time. It is very important that you understand which one takes more time: if your data set is small, you may go ahead with hierarchical clustering; if your data set is large, go with K-means clustering. In short, both can take a long time, but K-means will perform better than hierarchical clustering. See guys, you will be forming these kinds of dendrograms, and just imagine if you have 10 features and many data points; how are you going to do it? It will be a cumbersome process: you won't even be able to see the dendrogram properly, and manually you obviously cannot do it. So this was with respect to K-means clustering and hierarchical clustering; I hope everybody has understood. Now the next topic we'll focus on is validation. See, to validate a classification problem we use performance metrics like the confusion matrix, accuracy, true positive rate, precision, and recall. But how do we validate a clustering model? We are going to use something called the silhouette score. I'll show you what the silhouette score is; I'm going to just open Wikipedia, and this is how the silhouette score looks, a very amazing topic. How do we validate
whether my model has the right three or four clusters? Suppose I find that my K value is three; how do we confirm it? Now see, one more issue with K-means which I forgot to tell you. Let's say I have data points that look like this, and suppose I have some more data points like this. One issue is that if I make clusters over here, you'll obviously say my K value should be two: this is one cluster and this is another cluster. But because of wrong initialization of the centroids, understand, if I just randomly initialize some centroids like this, there is a possibility that we may end up with three clusters: one cluster here, one here, one here. So for the initialization of the centroids, one condition is that they should be very, very far apart. If we initialize our centroids very far apart, we will be able to find the centroids ending up exactly in the centers, because they will keep on updating and moving ahead. But if we don't initialize them far apart, there may be a situation where I really wanted only two centroids but I was getting three. This is a problem, and for this there is an algorithm called K-means++, which I will show you in the practical. K-means++ makes sure that all the centroids that are initialized are very, very far apart. We'll see in the practical application where those centroids are used. Now let me go ahead and show you the silhouette score. What the silhouette score is, I'm going to explain
in an amazing way. This is important: if someone asks you how we validate a clustering model, at that point we basically use the silhouette score. It can be used with K-means, and it can be used with hierarchical clustering too, whenever you want to validate. That is what we are going to see over here. Now, in silhouette scoring, what are the most important things? The first and most important thing is that we will compute a(i). What is this a(i)? See, three major steps happen in order to validate a clustering model with the help of the silhouette score. The first is: I take one cluster, pick one point i inside it, and then, for all the other points inside this same cluster, I compute the distance from i to each of them, take the summation, and then take the average of all these distances. So when I write distance(i, j), i is the point we are scoring and j ranges over all the other points in the same cluster; note that i here is a data point, not the centroid. And the value I divide by, C(i) minus 1, is the number of other points, so in short I am calculating the average intra-cluster distance. This is the first step, where I compute a(i). Now, similarly, the next thing we need to compute is b(i). What is b(i)? There will be multiple clusters in a K-means problem statement, and we will find the nearest cluster.
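To make a(i) concrete, along with the b(i) and s(i) that the lecture defines next, here is a small hand-computed sketch checked against scikit-learn; the tiny two-cluster data set is made up for illustration:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight, well-separated clusters; point 0 belongs to cluster 0
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0
same = X[labels == labels[i]]    # points in i's own cluster (i is row 0 of this)
other = X[labels != labels[i]]   # points of the nearest other cluster
# a(i): average distance from point i to the other points in its own cluster
a_i = np.linalg.norm(same[1:] - X[i], axis=1).mean()
# b(i): average distance from point i to the points of the nearest other cluster
b_i = np.linalg.norm(other - X[i], axis=1).mean()
# s(i) = (b(i) - a(i)) / max(a(i), b(i)), always between -1 and +1
s_i = (b_i - a_i) / max(a_i, b_i)

# scikit-learn computes the same quantity for every point
assert np.isclose(s_i, silhouette_samples(X, labels)[0])
```

With two tight, far-apart clusters, s(i) comes out close to +1, matching the intuition that b(i) being much larger than a(i) signals a good clustering.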
And in this nearest cluster I have a variety of points. So b(i) basically says that I will compute the distance between my point and every point in that other cluster. This is my cluster one and this is my cluster two, so I compute the distance from this point to this point, then to this point, then to this point, and so on; I compute the distance to every point. Once we have all these distances, we take their average. Now tell me: if I look at the relationship between a(i) and b(i), and my clustering model is good, will a(i) be greater than b(i), or will b(i) be greater than a(i)? If we have a really good model, obviously b(i) will be greater than a(i), because the distance to the other cluster is larger. So if I talk about the silhouette score, the values will be between -1 and +1: the more the value is toward +1, the better the clustering model is, and the more the value is toward -1, the more the opposite condition applies, meaning the intra-cluster distance is greater than the distance to the other cluster. That is the information being portrayed, and this is the importance of the silhouette score. Finally, when we apply the silhouette formula, you'll be able to see, let me rub all this out, guys, and show you what it is. The silhouette formula will be something like
this: s(i) = (b(i) - a(i)) / max(a(i), b(i)), defined when |C(i)| is greater than one. With this you get a value between -1 and +1: the closer the value is to +1, the better your model is, and the closer it is to -1, the worse your model is, because a value toward -1 basically means a(i) is greater than b(i). That is the outcome with respect to the silhouette score. If s is around zero, that basically means the clustering still needs to be improved. What is i over here? i is nothing but one data point; you can read it as the data point i in the cluster C(i). I hope everybody has understood this. Now let's go ahead and discuss the next topic; we have obviously finished the silhouette score over here. Let's discuss something called DBSCAN. DBSCAN is an amazing clustering algorithm; we'll try to understand how DBSCAN clustering actually works, and you'll be able to understand a lot of things from it. Now, in DBSCAN clustering, what are the important things? Let's start with DBSCAN clustering and understand some of the important points over here. The first term you really need to remember is called core points; I'll also talk about when we say core points versus the other kinds of points. So the first term I will discuss is min points, the second is core points, the third is border points, and the fourth is noise points. Okay guys, now tell me: in K-means clustering, if I have this kind of grouping, don't you think that, with the help of two
different clusters, I may combine these two groups like this? But understand what problem is happening with the second cluster: it actually contains an outlier. Let me put it very clearly: let's say I have one point over here. If I do K-means clustering, I will probably get one cluster here, and I may get another cluster somewhere here. Now understand one thing: this point is definitely an outlier, yet with K-means I am still grouping it into one of the groups. So can we have a clustering algorithm where we can leave the outlier out separately? With DBSCAN we will be able to leave the outlier out, and this point will be called a noise point, or I can also call it an outlier. So for this kind of scenario, where you want to skip the outliers, we can definitely use DBSCAN, that is, density-based spatial clustering of applications with noise, a very amazing algorithm; I have used it a lot, and nowadays I often use this kind of algorithm instead of K-means or hierarchical clustering. Now see, what are the important terms over here? First of all, you have min points. Min points is a kind of hyperparameter, and there is also a value called epsilon, which I forgot; I will write it down over here. What does epsilon mean? If I have a point like this, epsilon is nothing but the radius of a specific circle drawn around it. So epsilon is the radius over here. And what does min points equal to 4 mean? Let's say that I have taken
a point over here, and I have drawn a circle around it that looks like this, and this is my epsilon value, the radius. If I say my min points equal 4, which again is a hyperparameter, that basically means that if I have at least four points within this circle, within this epsilon radius, then this red point will actually become a core point, which is what is written over here. Someone asked: is there a particular unit of epsilon, or do we simply take the unit of distance? No, the epsilon value will also get selected through a particular procedure; I'll show you in the practical application, don't worry. Now, the next thing: let's say I have another point over here, and this is its circle with respect to epsilon. Let's say that inside this circle I have fewer than min points, but at least one core point; at that point, this point becomes something called a border point, which we have also listed over here. So: if the neighbourhood contains at least min points, the point becomes a core point, like the red one; if it only contains at least one core point, it becomes a border point. And there is one more scenario: suppose I have a point, this is my epsilon circle, and I don't have any points near it; then this will definitely become my noise point. So here I have discussed the noise point as well. I hope everybody is able to understand the key terms now.
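These terms map directly onto scikit-learn's DBSCAN, where `eps` is the epsilon radius and `min_samples` is the min points hyperparameter. A minimal sketch on made-up blob data; the parameter values here are illustrative, not a recommendation:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense blobs plus one far-away outlier point appended at the end
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[50.0, 50.0]]])

# eps = radius of the epsilon circle, min_samples = "min points"
db = DBSCAN(eps=0.8, min_samples=4).fit(X)

# Label -1 marks noise points (outliers); the other labels are the clusters
print(sorted(set(db.labels_.tolist())))   # e.g. [-1, 0, 1]
n_core = len(db.core_sample_indices_)     # points that qualified as core points
```

The appended outlier has nothing within its epsilon circle, so it gets label -1 and is never pulled into a group, which is exactly the behaviour K-means cannot give you.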
What is basically happening is that whenever we have a noise point, like in this particular scenario, and we don't find any core point or border point relationship for it, it just gets neglected; that basically means it is treated as an outlier. I hope everybody is able to understand: this point will be treated as an outlier, or a noise point, and it will never be taken inside a group. Now suppose I have this set of points that you see over here, red core points and all, and there is also a border point; by drawing multiple circles over here, you can see how we define the core points and the border points, and all of this can be combined into a single group. Why can it be combined? Because of how the connections work: see, this yellow point is covered by one epsilon circle, and we have one core point inside it. Remember, it should be at least one core point, not just any one point; if its circle contains at least one core point, then it becomes a border point, and that means, yes, it can be part of this specific group. So what are we doing? Whenever there is noise, we neglect it; wherever there are border and core points, we combine them. I'll show you one more diagram, an amazing diagram, which will help you understand this better compared to K-means and hierarchical clustering. Now see this, everybody: the right-hand side of the diagram is based on DBSCAN clustering, and the left-hand side is the traditional clustering method; let's say it is K-means. Which one do you think is better over here? You see, all these outliers are not combined inside a group, but for whichever points are near, as core points and
border points, separate groups are actually created. So this is how amazing DBSCAN clustering is; that is basically its outcome. Here, in K-means clustering, you can see that all these points have also been taken into the blue group as one cluster, but with DBSCAN we are able to separate them into sensible groups. So I'm telling you guys, you can directly use DBSCAN without worrying too much. Now let's focus on the practical part. I'm just going to give you a GitHub link; everybody download the code, guys. I've given you the GitHub link; quickly download it and keep your file ready. I'm going to open my Anaconda prompt and open my Jupyter notebook, and we'll do one practical problem. I've given you the link, guys; please open it. This is what we are going to do today, and it will be amazing; here you'll be able to see amazing things. How do you come to know whether overfitting or underfitting is happening? You don't know the true labels, right? So in clustering there will not be any underfitting or overfitting in that sense. Now, what will we be importing? First we'll do K-means clustering, we'll do silhouette scoring, then we'll see the output, and we'll do DBSCAN also. So what have we imported? One is KMeans for clustering, and one is silhouette_samples and silhouette_score; these are present in scikit-learn, inside metrics, which basically means we use them to validate clustering models. Now we'll execute this, and apart from that we are importing matplotlib and NumPy, and it all executes perfectly. The next step is generating the sample data from make_blobs: we are just generating some samples with two features, and we are saying that it should have four
centroids. I'm trying to generate some X and y data randomly, and this particular data set will be used for performing the clustering algorithms. Forget about range_n_clusters for a moment; we need it because we will try different cluster counts and find the silhouette score, so right now I have just initialized it with the values 2, 3, 4, 5, 6. It is very simple. If I go and look at my X data, it looks something like this: X has two features, and y is one output feature saying which class each point belongs to; that is what you can get with make_blobs. Now, how do we apply the K-means clustering algorithm? As I said, I will be using WCSS, within-cluster sum of squares. So I import KMeans, and then, for i in range(1, 11), I use different K values, or numbers of centroids, to see which has the minimal WCSS value, and I draw the graph I showed you for the elbow method. Here I use KMeans with n_clusters equal to i and the initialization technique k-means++, so that the initialized centroids are very far apart, along with random_state equal to 0; then we fit, and finally we do wcss.append(kmeans.inertia_). This inertia_ gives you the sum of squared distances between the points and their centroids, and this is what I append to the wcss list, and finally I just plot it. Now here you can see that, obviously, this graph looks like an elbow. So the point I'm going to consider is the last abrupt change: if I look at the last abrupt change, I have a specific value with
respect to this; this is my abrupt change, and from here the changes are gradual, so I'm going to select K equal to 4. Now, with the help of the silhouette score, we are going to check whether K equal to 4 is valid or not; that is what we are going to do. Let's go ahead and see how. Here you can see n_clusters equal to 4, then I get the predictions, and this is my output; this is done. Now see this code: it is a big chunk of code, and I have actually taken it directly from the scikit-learn silhouette example page; if you go and look, this code is given over there. I'm just going to talk about the important things we need to see here with respect to the different clusters: see, for clusters 2, 3, 4, 5, 6, I'm going to check whether the K value should be four or not with the help of the silhouette score. So here you can see I first go with a for loop, for n_clusters in range_n_clusters, over the different cluster values, starting with two. You can see: initialize the clusterer with the n_clusters value and a random generator seed of 10 for reproducibility. So n_clusters is first taken as two, then I do fit_predict on X, and after that I use silhouette_score on X and cluster_labels. What is this going to do? Understand, in the silhouette method, what did we discuss: it will go through all the clusters, calculate the intra-cluster distances, which is the a(i), then compute the b(i), then finally compute the score, and the value is between -1 and +1, with values toward +1 being better. These things we have already discussed, and that is what
this specific function will do, and it will give me the average silhouette value over here. That is done, and then it continues for the other cluster counts, which you can find over here. And this other code that you see is nothing so complex; it is just to display the data properly in the form of graphs. Again, I'm telling you, I did not write this code; I've taken it directly from the scikit-learn silhouette page. So just look at the plotting part; you can definitely figure it out. Let's execute it and look at the output. Now see: for n_clusters equal to 2, the average silhouette score is 0.704. I told you the value will be between -1 and +1, and I'm getting 0.704, which is very, very good. Then for n_clusters equal to 3 it is 0.588, for n_clusters equal to 4 I'm getting 0.65, which is pretty amazing, for n_clusters equal to 5 the average score is 0.563, and for n_clusters equal to 6 it is 0.45. So directly you could say that for n_clusters equal to 2 I'm getting the highest score of 0.704; so should we select n_clusters equal to 2? We should not directly conclude from this, because we also need to check whether any cluster is getting negative silhouette values. So we go down and look at the first plot: here you can see the values go from 0 to 1 and nothing dips to -0.1 or below, so two clusters were definitely able to solve the problem, and I'll keep that option with me; K equal to 2 may perform well. Now let's go to the next one. Over here you can see that for
one of the clusters the value is negative. If the value is negative, that basically means a(i) is greater than b(i): the point is, on average, nearer to a neighbouring cluster than to its own. So I am not going to prefer this, even though the clusters otherwise look fine. Understand the problem here: if I compute the distance from this point to the points in its own cluster and compare it with the distance to the other cluster, this point is obviously nearer to the other cluster — that is why I get a negative value. These dotted points are the score (around 0.58 here), and the negative ones indicate points sitting nearer to the other cluster. This you really need to understand. Similarly, for n_clusters = 4, this looks good because there are no negative values, and you can see how cleanly it has divided the points with k = 4. With five you can see some negative values, and with six there are also negative values, so I will definitely not go with six. I may go with either four or two — and whenever you have this option, always take the bigger number: take four instead of two, because it will be able to create a more generalized model. So from this I am going to take k = 4. Now, should we compare this with the elbow method? There also I got four, so both are matching. This indicates that with the help of this silhouette score we can definitely come to a conclusion and validate our clustering model in a reliable way.
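The silhouette check walked through above can be sketched with scikit-learn. This is a minimal sketch on toy data — the blob data set and the cluster range are stand-in assumptions, not the exact notebook from the session:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data standing in for the session's data set.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=42).fit_predict(X)

    # Average silhouette over all points: higher is better, range [-1, +1].
    avg = silhouette_score(X, labels)

    # Per-point values: a negative s(i) means a(i) > b(i), i.e. the point
    # is on average closer to a neighbouring cluster than to its own.
    per_point = silhouette_samples(X, labels)
    n_negative = int((per_point < 0).sum())

    print(f"k={n_clusters}: average={avg:.3f}, negative points={n_negative}")
```

As in the lecture, the right k combines a high average score with no (or few) negative per-point values — not simply the highest average.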
So I hope everybody is able to understand — this is how you validate a clustering model, and you can definitely try it out and work through the code. Till here you have understood that we compute the average silhouette value, and then for each n_clusters the per-cluster values are mapped and plotted. So that was the session. In today's session we covered many topics: K-means, hierarchical clustering, the silhouette score, and DBSCAN clustering. In tomorrow's session the pending topics are: first SVM and SVR, second XGBoost, and third PCA — let's see whether I'll be able to complete all of that. One more thing I want to teach you, because many people ask me for the definition of bias and variance. People get confused here. Let's say I have a model that gives around 90% accuracy on the training data set, but on the test data I am getting only around 70% accuracy. Tell me, which scenario is this? Most people will say it is overfitting, and when I say overfitting I describe it as low bias and high variance. So many people ask: Krish, tell me the exact definition of bias and variance. You say low bias because the model performs well on the training data, and high variance because it does not perform well on the test data — but why do we always attach bias to the training data and variance to the test data? For this you need the definition of bias. Let me write it down: bias is a phenomenon that skews the
result of an algorithm in favor of or against an idea. I'll make you understand this definition, but first look at what I have written: a phenomenon that skews the result of an algorithm in favor of or against an idea. For now, treat this 'idea' as the training data set. When we train a model with a specific training data set, the result may be in favor of that data set or against it — that is, the model may perform well on it or may not. If it is not performing well, the training accuracy is down; if it is performing well, the accuracy is good. So there are two scenarios of bias: if the result is in favor — the model is performing well on the training data set — I say it has low bias; if it is not able to perform well on the training data set, then I say it has high bias. I hope everybody is able to understand this, because many, many people have exactly this confusion. Now similarly, let's talk about variance, because the definition is very important here too. The definition of variance I will write like this: variance refers to the changes in the model when using different portions of the training or test data. Now let's understand this
particular definition: variance refers to the changes in the model when using different portions of the training or test data. We know that whenever we work with a data set we divide it into two parts, train data and test data — the test data is a part of that same data set. Initially I train the model with the training data; if it gets trained and performs well there, I am talking about bias. But when we come to the prediction side of the model, we use different portions — other training data, which may not be similar, or the test data. On the test data we make predictions, and here again I may get two scenarios, and these are what variance describes: the changes in the model when using different portions of the training or test data — in other words, whether it gives good predictions or wrong predictions. If it gives good predictions, I say it has low variance: the accuracy with respect to the test data is also very good. If I get bad accuracy on the test data, I say it has high variance. Now let's take three scenarios: model one, model two, and model three. Model one has a training accuracy of 90% and a test accuracy of 75%. Model two has a training accuracy of 60% and a test accuracy of 55%. Model three has a training accuracy of 90% and a test accuracy of 92%. Now
tell me what you will get in each case. For model one you can directly say the training accuracy is good, so with respect to bias this indicates low bias; and since the test accuracy is bad compared to the training accuracy, you say high variance — understand it through the definition. For model two you say high bias, high variance, because it is not performing well anywhere. The last scenario is the one we want: low bias and low variance — this gives a generalized model, and that is our aim when working as data scientists. Many people have asked me for these definitions, and I hope you now have an understanding of the four terms we talk about: high bias, low bias, high variance, low variance. So that was it on this topic. Okay, now let's consider a data set with salary, credit, and approval columns — we will take this sample data set and understand how XGBoost works. If salary is less than or equal to 50K and the credit is bad, the loan approval is 0, meaning he or she will not get the loan. If salary is ≤50K and the credit score is good, approval is 1; another ≤50K record with good credit again gets 1. If salary is greater than 50K and credit is bad, approval is 0; greater than 50K with good credit gets 1; and greater than 50K with normal credit also gets approval 1. So this is my data set.
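Before moving on, the bias/variance verdicts above can be captured in a tiny helper. This is a sketch, not a standard API — the `good` threshold separating "performing well" from "performing badly" is an illustrative assumption; in practice it depends on the problem:

```python
def diagnose(train_acc, test_acc, good=0.85):
    """Label the bias/variance regime from train and test accuracy.

    Low bias     = the model performs well on the training data.
    Low variance = it also performs well on unseen (test) data.
    """
    bias = "low bias" if train_acc >= good else "high bias"
    variance = "low variance" if test_acc >= good else "high variance"
    return f"{bias}, {variance}"

print(diagnose(0.90, 0.75))  # model one: low bias, high variance (overfitting)
print(diagnose(0.60, 0.55))  # model two: high bias, high variance
print(diagnose(0.90, 0.92))  # model three: low bias, low variance (generalized)
```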
also we are going to get it so this is this is my data set so how does XG boost classifier work understand the full form of XG boost is Extreme gradient boosting extreme gradient boosting so we will basically understand about extreme gradient boosting now extreme gradient boosting uh will be actually used to solve both classification and the regression problem statement so first of all let's understand how it is basically exib basically how it actually if you if you just talk about XG boost you understand that it is a boosting technique and internally it tries to use decision tree so how does this decision Tre is basically getting constructed in the case of XV boost and how it is basically solved we are going to discuss about it so whenever we start exib boost classifier understand that first of all we create a specific base model suppose if I say this is my base model and this base model will be a weak learner okay and this base model will always give an output of probability of 0.5 in the case of classification problem so suppose if I say this is probability 0.5 then I will try to create a field over here this field is called as residual field so first base model what I'm going to do any data set that you give from here to train it will always give you the output as 0.5 so this is just a dummy base model now tell me if my probability output is is 0.5 if I want to calculate the residual that basically means I need to subtract approval minus this particular value so what will be the value over here 0 -.5 will be -.5 1 -.5 will be5 1 -.5 will be5 and 0 -.5 will be -.5 and this 1 -.5 will be uh 0.5 and this will also be 0.5 let's consider that I have one more record uh and this specific record can be anything uh because I want to keep some more records over here so let's consider that I have one more record which is less than or equal to 50K and if the credit scod is normal you're going to get zero so here also if I try to find out the residual it will be minus5 now 
So I hope everybody has understood the first step: we create a base model. This base model is very important because we build all the decision trees sequentially on top of it. This first model in the sequence is something like a tree too, but it is just a base model that takes any input and gives the probability 0.5 by default. Now let's understand the steps for constructing the decision trees after creating the base model — please make sure you note them down. Step one: create a binary decision tree using the features. Step two: calculate the similarity weight. I'll talk about what exactly it is; the formula is the square of the sum of the residuals, divided by the sum of p(1 − p) plus λ. This λ is a hyperparameter, again there so that the model does not overfit. Step three: calculate the information gain. These are the steps we use in constructing an XGBoost classifier: create a binary decision tree using the features, calculate the similarity weight, and finally calculate the information gain. So let's work through it. Let's construct the decision tree: say I am considering the salary feature, so I take salary as my node and split on it — and remember, whenever we create decision trees in this particular case, they will be binary decision trees.
Let's say salary splits into 'less than or equal to 50K' and 'greater than 50K' — those two branches you obviously have in a binary split. For credit, where there are three categories, I'll also show you how that further split happens and gets converted into a binary tree. So here you have ≤50K and >50K. Now let's see which residuals fall where. Before the split, the residuals I am training this tree on are: −0.5, 0.5, 0.5, −0.5, 0.5, 0.5, and −0.5. If I make the ≤50K split, which residuals land there? −0.5, then 0.5, then one more 0.5, and finally the last record's −0.5 — so four residuals on that side. The remaining records are >50K: −0.5, 0.5, and 0.5. And where do the residuals come from? From the base model, which by default gives probability 0.5: my data goes through it, and the residual is calculated from this probability and the approval — approval minus probability. So 0 − 0.5 gives −0.5, 1 − 0.5 gives 0.5, 1 − 0.5 gives 0.5. I hope everybody is very clear on this. So
this was the first step — we constructed a binary tree. The second step says: calculate the similarity weight. How? The numerator of the formula is the sum of the residuals, squared — the whole sum, then squared. Let's calculate it for the ≤50K node: I take all its residual values, −0.5 + 0.5 + 0.5 − 0.5, and square that sum. That is divided by — understand what the denominator is — the sum of p(1 − p). And where do we get this probability value? From the base model. For each and every point in the node I compute two things, the probability p and 1 − p, multiply them, and sum over the points. So here I do it four times: 0.5 × (1 − 0.5), plus 0.5 × (1 − 0.5), and so on, once per record in the node. So I hope you have understood till here: the numerator is the squared sum of the residuals, and the denominator is the sum of p multiplied by (1 − p). Now tell me, what are you able to find
out from this calculation? The −0.5 and +0.5 terms cancel each other, so the sum is zero, and since 0 divided by anything is 0, the similarity weight of this node is 0. You may be wondering where the λ value is. We would normally initialize it — it is a hyperparameter I'll talk about — but for now let's take λ = 0 to keep the arithmetic simple. So −0.5 + 0.5 + 0.5 − 0.5 sums to 0 and the similarity weight is 0. (And note: it is not each residual squared first — it is the whole sum, squared.) Now let's calculate the similarity weight of the >50K node: the numerator is (−0.5 + 0.5 + 0.5)², and since there are three points the denominator is p(1 − p) + p(1 − p) + p(1 − p), with λ = 0 so I write nothing extra. Doing the calculation: −0.5 + 0.5 cancels, leaving 0.5, and 0.5² = 0.25; the denominator is 3 × 0.25 = 0.75; so the value is 0.25 / 0.75 = 1/3 ≈ 0.33. So the similarity weight for this node is 0.33. The next step is to calculate the information gain — but before we do, let's do this computation for one more node.
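Step two, as computed above, can be written as a small helper — a sketch of the formula exactly as stated in the session, with λ exposed as a parameter and the base-model probability fixed at 0.5 for every point:

```python
def similarity_weight(residuals, prob=0.5, lam=0.0):
    """Classification similarity weight of a node:
    (sum of residuals)^2 / (sum of p*(1-p) over the node's points + lambda)."""
    numerator = sum(residuals) ** 2
    denominator = len(residuals) * prob * (1 - prob) + lam
    return numerator / denominator

left = [-0.5, 0.5, 0.5, -0.5]   # salary <= 50K
right = [-0.5, 0.5, 0.5]        # salary > 50K

print(similarity_weight(left))   # 0.0  (the residuals cancel)
print(similarity_weight(right))  # 0.25 / 0.75 = 1/3 ≈ 0.333
```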
For this root node also, go ahead and calculate the similarity weight. (Why is the base model probability 0.5? Understand that it is just a dummy model — effectively an if-condition that always returns 0.5.) For the root, all seven residuals are in play: the −0.5 and +0.5 pairs cancel until a single 0.5 is left, so the numerator is 0.5² = 0.25, and the denominator is 7 × 0.25 = 1.75. So the root similarity weight is 0.25 / 1.75 = 1/7, and 1/7 ≈ 0.142, call it 0.14. So I have 0.14 for the root, 0 for the ≤50K node, and 0.33 for the >50K node. Now the third step: we calculate the information gain. The information gain is the sum of the child similarity weights minus the root's: (0 + 0.33) − 0.14. Open your calculator: 0.33 − 0.14 = 0.19. So I get 0.19 as the information gain of this split. You know that features get selected based on information gain; let's say the highest information gain is given by salary. Now we will go further down and do the next split, with the next feature — that is
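Step three is then a single subtraction. A minimal sketch using the similarity weights from the salary split worked out above:

```python
def information_gain(parent_sim, child_sims):
    """Gain of a split: sum of child similarity weights minus the parent's."""
    return sum(child_sims) - parent_sim

# Root = 1/7 ≈ 0.14; children = 0 (<=50K) and 1/3 ≈ 0.33 (>50K).
gain = information_gain(parent_sim=1 / 7, child_sims=[0.0, 1 / 3])
print(round(gain, 2))  # 0.19
```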
which one? Credit. So I take credit here, and again I have to do a binary split. But you may be thinking: Krish, there are three categories here — how are we going to do this split? In this case, what we can do is put two categories, good and normal, on one side, and bad on the other side — then it becomes a binary split again. Now let's see how many data points fall on each side. Follow the path: if salary is ≤50K we come down this branch; if credit is bad, we get one residual, −0.5 — that's the first one. Then, ≤50K with good credit: a 0.5 comes over here. The next ≤50K good record: one more 0.5. The records with salary >50K go down the other branch, so we won't worry about them right now. Then ≤50K with normal credit: that is a −0.5. So those records come to the good/normal side, and only one record goes to the bad side. Then we start the same process again: calculate the similarity weight. For the bad node the numerator is (−0.5)² = 0.25 — there is only one residual, so the squared sum is just its square — and the denominator is 0.5 × (1 − 0.5) = 0.25 for that single data point, so the similarity weight is 0.25 / 0.25 = 1. What about the good/normal node? If you want to compute it, it is again very simple: the 0.5 and −0.5 cancel, leaving 0.5, so the numerator is 0.25; the denominator is 3 × 0.25 = 0.75; so the similarity weight is 1/3 ≈ 0.33. Then I calculate the information gain of this split: I add the children, 1 + 0.33, and subtract the parent — why zero? Because the similarity weight of the node we are splitting, the ≤50K node, is 0. So 1 + 0.33 − 0 = 1.33. Like this, further splits keep happening with different nodes, and we will only be
getting binary splits, but we will be comparing based on information gain which one comes out better. Now let's say I have designed and developed my entire binary decision tree — which is a speciality of XGBoost. Now consider the inferencing part: suppose this record comes in — how do we calculate the output? First of all the record goes to the base model, and the base model gives the probability 0.5. Now, based on this 0.5, how do we calculate the real output? We apply the log odds: log(p / (1 − p)). This formula is applied only in the case of the base model. So here it is log(0.5 / 0.5) = log(1) = 0 — whenever any record goes in, the first term I get is 0. Then plus — why plus? Because the record now goes to the binary decision tree, and whatever value I get from it gets added on. When it goes into the tree, let's see which branch it follows: salary ≤50K first, then credit bad — and there the similarity weight is 1. What we do with that leaf value is pass it through a learning rate parameter: the contribution is the learning rate multiplied by 1, because the similarity weight there is 1. So that is my first tree's contribution, and the α here
is my learning rate — it can be a very small value, like the learning rate parameters we have defined elsewhere. On top of this sum we apply an activation function called sigmoid, since this is a classification problem — and I hope you know what sigmoid is for: based on this sum, the output will be squashed to between 0 and 1. This is how the entire inferencing happens. Similarly, I will keep constructing this kind of decision tree sequentially, so the whole function looks like: α₀ + α₁ × (decision tree 1 output) + α₂ × (decision tree 2 output) + α₃ × (decision tree 3 output) + … + αₙ × (decision tree n output), and that is your final output when you infer on any new record. The reason we call this boosting is that we add each decision tree's output step by step to finally get our output. This is how XGBoost actually works. 'Doesn't credit need to be split further?' Yes — similarly we can split credit with other groupings, say good on one side and bad and normal on the other; whichever grouping gives more information gain will be taken into consideration. And that is how the entire XGBoost classifier works. It is very, very difficult to calculate all of these things by hand across many trees, which is one reason we say XGBoost is a black-box model. 'Is it prone to overfitting?' See, at some stage we also need to perform hyperparameter tuning, which we specifically call pre-pruning.
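The inference just described — log odds of the base probability, plus the learning-rate-scaled leaf value of each tree, squashed through a sigmoid — can be sketched as follows. The learning rate of 0.3 and the single leaf value of 1 are illustrative numbers from the worked example, not a trained model:

```python
import math

def predict_proba(leaf_values, base_prob=0.5, learning_rate=0.3):
    """Boosted prediction: log odds of the base model plus the
    learning-rate-scaled contribution of each tree, then sigmoid."""
    log_odds = math.log(base_prob / (1 - base_prob))  # log(1) = 0 for p = 0.5
    raw = log_odds + sum(learning_rate * v for v in leaf_values)
    return 1 / (1 + math.exp(-raw))  # sigmoid -> value between 0 and 1

# One tree whose leaf value for this record is 1, as in the worked example.
print(round(predict_proba([1.0]), 3))
```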
'And are we combining the same decision tree again and again?' No, no — each decision tree is an independent tree which I create in sequence. After this one I'll create one more decision tree, so finally it looks like this: this is my base model; my data goes to the first decision tree, which I built with binary splits on the records; then we make another decision tree, which will again be a binary tree with its own splits. So the base model gives the value 0, then α₁ multiplied by decision tree 1, then α₂ multiplied by decision tree 2, and we keep on adding more decision trees until this whole thing becomes a very strong learner. That is how we combine all of them, and I hope everybody now understands the XGBoost classifier. Now you may be thinking: how does the regressor work? In the regression problem statement too, the decision tree gets constructed based on the independent features, and again the λ value is a hyperparameter we set with the help of cross-validation. So let's go ahead and discuss the XGBoost regressor — the second algorithm — and how it actually works. 'Is the same fundamental followed in random forest?' No, in random forest it is completely different: there, bagging happens. So, on to the regressor. I'll take an example: I have experience and gap features, and based on those we need to determine the salary — salary is my output feature. The experience values are 2, 2.5, 3, 4, 4.5; the gap column is yes, yes, no, no, yes; and the salaries are around 40K, 41K, 52K, plus some more values
over here: 60K and 62K. Now, the first step: just as in the classifier we created a base model, here also we create a base model first. What output does it give? The average of all the target values. The average of 40, 41, 52, 60, 62 is about 51K, so by default I create a base model that takes any input and just gives the output 51K. That is the first step. Based on this, I calculate my residuals. How? I subtract: 40K − 51K = −11K. And let me change the 41K to 42K, just to make my calculation a little bit easier — so that residual is −9K, and the others are 1, 9, and 11. So my residuals are −11, −9, 1, 9, 11. Then, again, the first tree step: construct the decision tree. Say I use the experience feature, so experience is my node, and I bring all the residuals to it: −11, −9, 1, 9, 11 in the root. How do I split on experience? It is a continuous feature, so I have to do the split the way we handle continuous features, which I have already shown you in the decision tree session. So I do a binary split on experience, and two types of records I may get: one branch for experience ≤ 2 and one for experience > 2.
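The regression base model and the residual column can be sketched in a few lines, using 42K in place of 41K as adjusted in the lecture:

```python
salaries = [40, 42, 52, 60, 62]  # targets, in thousands (K)

# Base model for regression: predict the (rounded) mean of the targets.
base_prediction = round(sum(salaries) / len(salaries))  # 51, as in the session

# Residual = actual salary minus the base prediction.
residuals = [y - base_prediction for y in salaries]
print(base_prediction, residuals)  # 51 [-11, -9, 1, 9, 11]
```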
Now, when I do the split at experience less than or equal to 2, let's see how many values land on each side. The left branch gets only one value, the residual -11, and the right branch gets all the other values: -9, 1, 9, 11. What we do after this is calculate the similarity weight. For regression the formula changes a little: similarity weight = (sum of the residuals)^2 / (number of residuals + lambda). Lambda is again a hyperparameter: the larger its value, the more we penalize the residuals. So this is the formula we are going to apply; previously, with the classifier, we were using probabilities instead. Let's apply it to the first node. With lambda = 0, the left node's similarity weight would be (-11)^2 / (1 + 0) = 121. But just think what happens if we take lambda = 1: we directly penalize the similarity weight by adding one to the denominator. So let's do that. With lambda = 1, the left node gives 121 / (1 + 1) = 60.5. So 60.5 is my similarity weight for that node. Similarly, I will now go ahead and compute the similarity weight for the next
one: for the right node it becomes (-9 + 1 + 9 + 11)^2 / (4 + 1). The -9 and the 9 cancel, leaving 12, and 12 squared is 144, so 144 / 5 = 28.8. The similarity weight for the right node is 28.8. Similarly I can calculate the similarity weight for the root node at the top: (-11 - 9 + 1 + 9 + 11)^2 / (5 + 1). Almost everything cancels, leaving 1^2 / 6 = 1/6; and since the numerator gets squared anyway, the sign would not matter. So 1/6 is the similarity weight for the root. Now, finally, the information gain we need to compute is very simple: 60.5 + 28.8 - 1/6, which comes out to about 89.13. You don't have to worry about the calculation; the library does all of this automatically. Now see, the decision tree can be split further in any number of ways. The next split we can try is experience less than or equal to 2.5 versus greater than 2.5. If this gives a better information gain, the split will happen this way; whichever candidate gives the better gain is the one that is used. With the 2.5 split, -11 and -9 go to one side and 1, 9, 11 go to the other, because those two records definitely fall at or below 2.5 and the other three definitely fall above it. Now if I try to calculate the similarity weight for the left node, it is nothing but (-11 - 9)^2 / (2 + 1), right?
In this particular case it is (-20)^2 / 3: 20 into 20 is 400, so 400 / 3, and if I use a calculator, 400 / 3 is about 133.33. So the similarity weight for that node is 133.33. Similarly I can compute the other node: (1 + 9 + 11)^2 / (3 + 1). 1 + 9 + 11 is 21, and 21 squared is 441, so 441 / 4 = 110.25. And the root node is the same as before, 1/6. So, finally, the information gain is 133.33 + 110.25 - 1/6, and obviously this value is greater than the 89.13 we got for the previous split, so we definitely use this split, which is better than the previous one. Let's say this split is the one finally chosen; now how do we produce the output? I hope everybody is able to follow. Suppose I want to do the inferencing and a record comes in. First of all, any record goes to the base model, whose output is 51, and to that we add alpha-1 (the learning rate, which is 1 here) times the output of the tree. If the record goes down the left route, which has -11 and -9, the average of both those numbers is taken: (-11 - 9) / 2 = -10, so -10 gets multiplied by alpha-1 and added. If it goes down the right route, then the average (1 + 9 + 11) / 3 = 21 / 3 = 7 is taken, so 7 gets added instead.
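The whole worked example, both candidate splits and the inference step, can be sketched in a few lines (assuming lambda = 1 and learning rate alpha = 1, as above; the helper names here are mine for illustration, not XGBoost's actual API):

```python
def similarity_weight(residuals, lam=1.0):
    """XGBoost similarity score for a regression node:
    (sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right):
    """Information gain of a split = left + right - root similarity."""
    return (similarity_weight(left) + similarity_weight(right)
            - similarity_weight(left + right))

# Candidate 1: experience <= 2    ->  60.5  +  28.8  - 1/6
# Candidate 2: experience <= 2.5  -> 133.33 + 110.25 - 1/6
g1 = gain([-11], [-9, 1, 9, 11])
g2 = gain([-11, -9], [1, 9, 11])
print(round(g1, 2), round(g2, 2))     # 89.13 243.42

# Candidate 2 wins, so the tree splits at 2.5. Inference adds the base
# prediction and alpha times the leaf's average residual.
def predict(experience, base=51.0, alpha=1.0):
    leaf = (-11 - 9) / 2 if experience <= 2.5 else (1 + 9 + 11) / 3
    return base + alpha * leaf

print(predict(2), predict(4))         # 41.0 58.0
```

So a record with 2 years of experience moves from the base prediction of 51 down to 41, and one with 4 years moves up to 58, exactly the leaf averages applied on top of the base model.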
Similarly, everything we have done so far is with respect to decision tree 1. In the same way we construct decision tree 2 separately with its alpha-2, then alpha-3 with decision tree 3, and so on until alpha-n and decision tree n; once you sum all of these, that is your final output in a regression tree. So in this particular case you are just playing with the parameters and using the same machinery in a slightly different way to compute everything. Is everybody clear? But again, it is a black-box model: you cannot really visualize all of this. Now let's go to the third algorithm, which is called SVM. SVM is in some ways like logistic regression. Suppose I have data points like this; with logistic regression we try to create a best-fit line and divide the points based on that line. In SVM we not only create the best-fit line but also two additional lines called marginal planes: the best-fit line is the hyperplane, these are the marginal planes, and whichever hyperplane has the maximum distance between its marginal planes will divide the points most efficiently. But usually, in a normal scenario, whenever we talk about the hyperplane or the marginal planes there will be a lot of overlapping points: I may have one point here and another point of the other class overlapping it, so it is very difficult to get an exact straight marginal plane that splits the points perfectly. Still, this margin should be as large as possible, because we can create any number of best-fit lines, and it is the marginal planes that tell us which one to prefer. Now, when there is no overlap at all, what do we call this kind of plane? This kind of
plane is called a hard marginal plane. Similarly, if some points overlap (suppose some yellow points also get overlapped over here, so there are some kinds of errors) then for that particular case we say soft marginal plane, because there we will be able to see that errors are present. In SVM what we focus on is creating these marginal planes with the maximum distance, and even though there are some errors, we handle them by providing a hyperparameter. So how do we go ahead and create all these marginal planes? It's quite simple. Just imagine it this way: initially, consider that this is my best-fit line. How do we give this best-fit line as an equation? We say y = mx + c. (A hard margin is basically impossible; on a normal dataset you will obviously not be able to get it, so we definitely go ahead with creating a soft marginal plane.) Now, in y = mx + c, what do m and c indicate? m is nothing but the slope and c indicates the intercept. Can I say that ax + by + c = 0 is also the equation of a straight line, and that both equations are the same? I will say they are equal; let me prove it. If I take ax + by + c = 0 and try to find y, it is nothing but y = (-a/b)x - c/b, so in this particular case my m value is -a/b and my intercept is -c/b. Both equations are essentially the same. So let's consider that this is my equation, and whenever I say y = mx + c, can
I also write it as y = w1·x1 + w2·x2 + ... + b? It is the same thing: we can write y = w^T x + b, the same equation in a different notation. Now let's say the slope of this line is in the negative direction; consider the slope to be -1, so the line runs downward through the origin. I am just trying to demonstrate what happens on either side of a line with a negative slope. Suppose (-4, 0) is one of my points, with axes x1 and x2, and this particular line is given by this equation. If I really want to compute the value of w^T x + b for this point based on this line, how will I compute it? First, the intercept: the line is passing through the origin, so can I say my b will be 0? Obviously I can assume b is 0. Next, w: in this particular case w is the -1 that I initialized as the slope. So the matrix multiplication is w^T written as a row times x = (-4, 0) written as a column, and if I do this multiplication, the value I get is +4. This is a positive value. Now understand: since this is positive, any point below this line, if I try to calculate its value the same way, will always come out positive. Yes or no? Similarly, if I
consider one point above the line, say (4, 4): if I calculate the value for (4, 4), will I get a positive value or a negative value? Just try to calculate, using the same equation: the slope is -1, the intercept is 0, and the point is (4, 4), so I get -4 + 0 = -4. This is a negative value. So any point above this plane, if I try to calculate its value, will always be negative. So what two things are you able to get? Positive on one side and negative on the other. You can consider everything on one side as one category and everything on the other side as another category; at least these two things you can conclude. I hope everybody is able to understand this. So this is my one category and this is my other category, which means I can definitely use a plane to split these points. Now let's go ahead and see how the marginal planes get created, and what cost function we use to make sure that the marginal plane will definitely work, because that part is where it becomes difficult. So let's consider an example. Suppose I have two varieties of points: one set of points like this, and the other set somewhere here. Let's say I am directly using a good number of well-separated points so that I can split them cleanly; I will explain what I am actually trying to prove. Obviously this is my best-fit line that splits them, and apart from that, what I will do is I'll also
create the marginal planes. So in order to create a marginal plane, let me use a different color: this is my one marginal plane, and remember it passes through the nearest point on its side; we construct one like this on each side of the hyperplane. I have already told you, guys, this line can be written as w^T x + b = 0; I can definitely say this because ax + by + c = 0 is the same equation, so this I don't have to prove. I hope everybody is clear with this. Now let's represent the marginal lines with equations as well. For this lower line, what value will come on the right-hand side, positive or negative? From this line, any point on that side gives a negative value, so let's label it w^T x + b = -1, just to read it as the negative side. And this upper line will be w^T x + b = +1, because we have already discussed that from that side the computed value is always going to be positive. Here I should really say k: strictly it is -k and +k, but in many articles and research papers you will see it written as -1 and +1, so let's go ahead and write -1 and +1 here too. Now my aim is to increase this distance between the two marginal planes: if I can increase this distance, that basically means my model is performing well. So let's say I want to find this distance first of all. I write w^T x + b = 1 for one plane and w^T x + b = -1 for the other, and what I am going to do is do the computation and subtract one from the other. So here, obviously,
this x will be my x1 on one plane and my x2 on the other, because those are the points the two planes pass through. Subtracting, the b's cancel and I can write w^T (x1 - x2) = 2. So from here we can definitely read off two different things. This w^T (x1 - x2) is nothing but the difference between this plane and that plane. Now, always understand that whenever we consider any vector, it also has something called a magnitude; so to keep only the direction I divide both sides by the magnitude ||w||. I am dividing both sides by this magnitude of w because here we just care about the vectors, not the scale. When I write it like this, the distance between the planes comes out as 2 / ||w||. Now, what is our aim? Can I say our aim is to maximize 2 / ||w||, guys, yes or no? Yes: by updating the values of w and b, I need to maximize this. If I maximize it, that basically means my marginal plane will become bigger. Is everybody clear with this? Now, along with this, I can write a constraint: my output y_i depends on two conditions. y_i is +1 when w^T x + b is greater than or equal to 1, and y_i is -1 when w^T x + b is less than or equal to -1. Now what does this basically mean? See: whenever I compute w^T x + b and it is greater than or equal to 1, I am obviously going to get +1, and when w^T x + b is less than or
equal to -1, I am always going to get the output as -1. I hope that is clear; that is the reason I have written it like this. So these two conditions we have already discussed: we want to maximize the marginal plane, which is 2 / ||w||, subject to y_i = +1 when w^T x + b >= 1 and y_i = -1 when w^T x + b <= -1. Everybody clear with this? Now, on top of it, we can add one more very important point. Instead of writing "such that" with two separate conditions, we can also say that our major aim is: if I multiply y_i by (w^T x_i + b), this product will always be greater than or equal to 1 for correctly classified points. Why? Understand: if y_i is -1 and the point is correct, then w^T x_i + b <= -1, and minus multiplied by minus is obviously going to be greater than or equal to +1; similarly for the +1 case it is again greater than or equal to 1. So I can definitely say that multiplying y_i by this term always gives a positive value >= 1; it is just a more compact representation of the same constraint. Now, what is the cost function? I can write it as: maximize over (w, b) the quantity 2 / ||w||. I can also write it as: minimize over (w, b) the inverse, ||w|| / 2. Are these both the same or not? They are equivalent. And why do we specifically write a minimization? Because in machine learning algorithms we are always trying to minimize something: during optimization we are continuously updating the weights w and b, so we can
definitely write it like this. So here my main target is to minimize ||w|| / 2 by changing w and b, subject to y_i (w^T x_i + b) >= 1. This is fine till here; I think everybody has got it. This is our aim, and now I am going to add two more parameters to this optimizer: one is C, and the other is a summation from i = 1 to n of xi_i, the slack terms. First I'll tell you what C is. See, if I have this specific dataset and some of my points land over here, on the wrong side, is that a right prediction or a wrong prediction? Obviously a wrong prediction. If some of my points are somewhere there, wrong again, incorrect. So the C value basically says how many errors we can have: if it says fine, we can have six or seven errors, then even while using the marginal plane, that many errors are allowed. That is what C specifies. The xi_i term says: take the summation of the distances of the wrong points. And how do we calculate that distance? Suppose this is a wrong point: I will try to calculate its distance from the marginal plane it violated, from here to here, and add it to the sum; similarly for the green point another distance gets added, from here to here; and we do that summation for every violating point. So we are telling the optimizer: if you are not able to fit the data properly, apply these two hyperparameters and accept that this many errors are also there; it is well and good, no problem, we will go ahead with that. Try to do the summation
of those slack distances and, based on that, construct the best-fit line along with the marginal planes, even though there are some errors here or there; with that we are good to go. One more thing exists, which is called SVR (support vector regression). In SVR only one thing gets changed in this formulation: just this constraint value. Everything else remains the same. I want you all to explore which value changes and let me know; this will be one assignment for you. If you change that particular value, the formulation becomes SVR, so just try to explore it, find out, and let me know. So overall, did you like the entire session, everyone? There is one more thing, called the SVM kernel; we say it as SVM kernel. Now, in an SVM kernel, what happens? Suppose I have data points arranged like this, one class forming an inner cluster and the other class surrounding it; we obviously cannot use a straight line to divide them. So what we do is convert the two dimensions into three dimensions and push the points apart: one set of points will go up and the white points will go down, and then we can basically use a plane to split them. I have uploaded a video about exactly that, and I have also shown you practically how to do it; that is the reason I created that specific video, so you can definitely have a look at it. So great, this was it from my side. I hope you liked this session. Thank you everyone, have a great day, keep on rocking, keep on learning, and never give up.
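The kernel idea mentioned at the end can be sketched with the classic 2-D to 3-D lift; the specific points and the separating plane z = 4 below are made-up illustrations, not values from the video:

```python
import numpy as np

# An inner cluster (one class) surrounded by an outer ring (the other class):
# no straight line separates them in 2-D.
inner = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, -0.5], [-0.5, 0.5]])
outer = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, 0.0], [0.0, -3.0], [2.1, 2.1]])

def lift(X):
    """Map 2-D points to 3-D by adding x1^2 + x2^2 as a third feature."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

# After the lift, a horizontal plane such as z = 4 cleanly separates the
# classes: the inner cluster stays low, the outer ring is pushed up.
print(lift(inner)[:, 2].max() < 4 < lift(outer)[:, 2].min())   # True
```

This is exactly the trick an RBF or polynomial kernel performs implicitly: it makes the data linearly separable in a higher-dimensional space without ever computing the lifted coordinates explicitly.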
These subtitles were extracted using the Free YouTube Subtitle Downloader by LunaNotes.