LunaNotes

Download Subtitles and Captions for Any Video Easily


9542 segments EN


[00:06]

so today's session what all things we

[00:08]

are basically going to discuss so first

[00:10]

of all we are going to discuss about

[00:12]

different types of machine learning

[00:13]

algorithm like how many different types

[00:15]

of machine learning

[00:16]

algorithms there are. Understand, the purpose of taking

[00:20]

this session is to clear the interviews

[00:23]

okay clear the interviews once you go

[00:25]

for a data science interviews and all

[00:28]

the main purpose is to clear the

[00:29]

interviews I've seen people who knew

[00:32]

machine learning algorithms in a proper

[00:34]

way okay they were definitely able to

[00:36]

clear it because they just explain the

[00:38]

algorithms in a better way to the

[00:40]

recruiter so that they got hired first

[00:42]

of all is the introduction to machine

[00:45]

learning here I'm just specifically

[00:47]

going to talk about AI versus ml versus

[00:51]

DL versus data science then the second

[00:53]

thing that we are going to talk about

[00:55]

over here is the difference between

[00:58]

supervised ML

[01:00]

and unsupervised ml the third thing that

[01:03]

we are probably going to discuss about

[01:05]

is something called as linear regression

[01:08]

so we are going to clearly understand

[01:10]

the maths and geometric intuition the

[01:13]

next thing that we are probably going to

[01:15]

discuss about is R square and adjusted R

[01:18]

square the fifth topic that we are going

[01:20]

to discuss about is Ridge and lasso

[01:23]

regression the first topic that we are

[01:25]

going to discuss about is AI versus ml

[01:30]

versus DL versus data science so this is

[01:34]

the first topic that we are probably

[01:35]

going to discuss if you really want to

[01:38]

understand the difference between AI

[01:39]

versus ml versus DL versus data science

[01:41]

we will go in this specific format so

[01:43]

just imagine the entire universe so this

[01:46]

entire universe I will probably call it

[01:48]

as an AI now specifically when I say AI

[01:51]

this basically means AI artificial

[01:53]

intelligence whatever role you are in

[01:55]

you are as a machine learning developer

[01:57]

you working as a deep learning developer

[01:59]

Vision developer or a data scientist or

[02:02]

an AI engineer at the end of the day you

[02:05]

are actually creating AI application so

[02:09]

if I really want to Define what is this

[02:11]

artificial intelligence you can just say

[02:13]

that it is a process wherein we create

[02:16]

some kind of applications in which it

[02:19]

will be able to do its task without any

[02:22]

human intervention so that basically

[02:24]

means a person need not monitor this AI

[02:27]

application automatically it'll be able

[02:29]

to make decisions it will be able to

[02:31]

perform its task and it will be able to

[02:34]

do many things so this is what an AI

[02:36]

application is some of the examples that

[02:38]

I would definitely like to consider so

[02:41]

the first example that I would like to

[02:43]

consider is an AI application, an AI module

[02:46]

Netflix has an AI module suppose if you

[02:49]

see a kind of action movie for some time

[02:53]

then the kind of AI work that

[02:56]

is basically implemented over here is

[02:57]

something called as recommendation

[03:00]

so here through this application what

[03:04]

happens is that when you're continuously

[03:06]

seeing the action movies then

[03:08]

automatically the AI module that is

[03:10]

present inside Netflix will make sure

[03:13]

that it gives us recommendation on

[03:15]

action movies second if I take an

[03:18]

example of comedy movie If I

[03:20]

continuously see comedy movie then also

[03:22]

it'll give us the recommendation of the

[03:24]

comedy movie so through this what

[03:26]

happens is that it understands your

[03:28]

behavior and it is being able to do its

[03:30]

task without asking you anything the

[03:33]

second example that I would like to take

[03:35]

up is

[03:36]

amazon.in now amazon.in again if you buy

[03:39]

an

[03:40]

iPhone then it may recommend you a

[03:43]

headphones so this kind of

[03:45]

recommendation is also a part of AI

[03:48]

module that is integrated with the

[03:49]

amazon.in website the ads that you see

[03:52]

probably when you are opening my channel

[03:55]

through which I get paid a little bit

[03:56]

from the hard work that I

[03:59]

do in YouTube right so through that ads

[04:02]

how that is recommended to you uh that

[04:05]

is also an AI engine that is included in

[04:07]

the YouTube channel itself which really

[04:09]

plays a role. It is a business-driven goal,

[04:12]

understand, these are business-driven

[04:13]

things that we basically do with the

[04:15]

help of AI one more example that I would

[04:17]

like to give you is if I consider it

[04:20]

self-driving cars so here you'll be able

[04:23]

to see self-driving cars if you take an

[04:25]

example of Tesla so self-driving cars

[04:27]

what happens based on the road it is

[04:29]

able to drive automatically. Who

[04:31]

is doing that there is an AI application

[04:33]

integrated with the car itself right so

[04:36]

if I consider all these things these all

[04:38]

are AI application at the end of the day

[04:42]

whatever role you do you are going to

[04:44]

create an AI application this is the

[04:46]

common mistake that people make you know

[04:48]

like our CEO sudhansu Kumar he has

[04:50]

written in his profile that he's an AI

[04:52]

engineer that basically means his goal

[04:55]

is to create an AI application so

[04:57]

probably in a product based companies

[04:58]

you'll be seeing this kind of roles

[04:59]

called as AI engineer now let's go to

[05:01]

the next role which is called as machine

[05:03]

learning so where does machine learning

[05:04]

comes into existence so if I try to

[05:07]

create this machine learning is a subset

[05:10]

of AI and what is the role of machine

[05:12]

learning it provides stats

[05:15]

tools

[05:17]

to analyze the data visualize the data

[05:22]

and apart from that to do

[05:24]

predictions and

[05:27]

forecasting so you will be seeing a lot

[05:29]

of machine learning algorithms so

[05:31]

internally those machine learning

[05:33]

algorithm the equation that we are

[05:34]

basically using are having

[05:38]

a kind of stats tools and stats

[05:40]

techniques because whenever we work with

[05:42]

data statistics is definitely very much

[05:44]

important so this exactly is called as

[05:47]

machine learning so it is a subset of AI

[05:50]

this is very much important to

[05:52]

understand ml is a subset of AI so here

[05:55]

you can see that it is a part of this

[05:57]

now let's go to the next one which is

[05:59]

called as deep learning. Deep

[06:01]

learning is again a subset of ml now

[06:04]

let's consider why deep learning came

[06:05]

into existence: in the 1950s and 60s

[06:09]

scientists thought that can we make

[06:11]

machine learn like how we human being

[06:13]

learn so for that particular purpose

[06:16]

deep learning came into existence here

[06:18]

the plan is to basically mimic human

[06:21]

brain so when I say mimicking human

[06:24]

brain that basically means we are trying

[06:26]

to mimic the human brain to implement

[06:28]

something to learn something so for this

[06:31]

you use something called as

[06:32]

multi-layered neural networks so this is

[06:35]

what deep learning is it is a subset of

[06:37]

machine learning its main aim is to

[06:40]

mimic human brain so they actually

[06:42]

create multi-layer neural network and

[06:45]

this multi-layered neural network will

[06:47]

basically help you to train the machines

[06:49]

or applications whatever we are trying

[06:51]

to create and deep learning has really

[06:54]

really done an amazing work with the

[06:56]

help of deep learning we are able to

[06:58]

solve such complex use

[07:02]

cases that we will be probably

[07:04]

discussing as we go ahead now if I come

[07:06]

to data science see this is the thing

[07:08]

guys if you want to call yourself a

[07:10]

data scientist, tomorrow you are given a

[07:13]

business use case and situation comes

[07:15]

that you probably have to solve that use

[07:17]

case with the help of machine learning

[07:18]

algorithms or deep learning algorithms

[07:20]

again the final goal is to create an AI

[07:22]

application right you cannot say that I

[07:24]

am a data scientist and I'll just work

[07:26]

in machine learning or I'll work in

[07:29]

deep learning, or I don't know how

[07:31]

to analyze the data no you cannot do

[07:33]

that when I was working in Panasonic I

[07:36]

got various different kind of task

[07:39]

sometimes I was told to use Power BI to

[07:41]

visualize and analyze the data, sometimes I

[07:43]

was given a machine learning project,

[07:45]

sometimes I was given a deep learning

[07:46]

project so as a data scientist if I

[07:49]

consider where does data scientist fall

[07:51]

into this it will be a part of

[07:53]

everything so if I talk about machine

[07:56]

learning and deep learning with respect

[07:58]

to any kind of problem statement that we

[08:00]

solve the majority of the business use

[08:03]

cases will be falling in two sections

[08:05]

one is supervised machine learning one

[08:08]

is unsupervised machine learning so most

[08:10]

of the problems that you are basically

[08:12]

solving this is with respect to this two

[08:15]

problem statement two different types of

[08:16]

machine learning algorithms that is

[08:18]

supervised machine learning and unsupervised

[08:20]

machine learning. If I talk about supervised

[08:22]

machine learning two major problem

[08:24]

statements that you are basically

[08:25]

solving here also one is regression

[08:28]

problem

[08:30]

and the other one is something called as

[08:31]

classification problem and in the case

[08:34]

of unsupervised machine learning problem

[08:36]

statement you are basically solving two

[08:37]

different types of problem one is

[08:39]

clustering and one is dimensionality

[08:42]

reduction and there is also one more

[08:44]

type which is called as reinforcement

[08:46]

learning. Reinforcement learning I

[08:50]

will definitely talk about, but not

[08:52]

right now; right now we are just focusing

[08:53]

on all these things now understand what

[08:56]

happens in supervised machine learning

[08:58]

let's consider a data set so

[09:00]

here I have a data set which says this

[09:03]

is my age and this is my weight suppose

[09:07]

I have these two specific features let's

[09:09]

say that I have values like 24 62 25 63

[09:15]

21 72

[09:19]

25 62 and many more data points over here

[09:23]

let's say that my task is to basically

[09:25]

take this particular data and create a

[09:27]

model. So suppose my task is that

[09:31]

I need to create a model; whenever it

[09:34]

takes a new age, first of all we train

[09:36]

this model with this data and whenever

[09:39]

we give it a new age, it should be able

[09:41]

to give us the output of weight this

[09:44]

particular model is also called as

[09:46]

hypothesis okay I'll discuss about this

[09:49]

today when we are discussing linear

[09:51]

regression now what are the important

[09:53]

components whenever we have this kind of

[09:55]

problem statement first of all you need

[09:56]

to understand there are two important

[09:59]

things one is independent features and

[10:02]

the other one is something called as

[10:03]

dependent features now let's go ahead

[10:05]

and discuss what is independent feature

[10:07]

independent feature basically means in

[10:09]

this particular case since the input

[10:11]

that I'm basically training in all those

[10:13]

features becomes an independent feature

[10:15]

now in this particular case my age is

[10:17]

independent feature and whatever I'm

[10:20]

actually predicting so when I say

[10:21]

predicting I know this is my output okay

[10:24]

this is what I have to basically

[10:27]

make my model give as an

[10:29]

output so in this particular case my

[10:31]

dependent feature becomes weight why we

[10:34]

specifically say a dependent feature

[10:36]

because this is completely dependent on

[10:38]

this value whenever this is increasing

[10:40]

or decreasing this value is basically

[10:41]

getting changed so that is the reason

[10:44]

why we basically say this has

[10:45]

independent and dependent feature

[10:47]

whenever we are solving a problem right

[10:50]

in the case of supervised machine

[10:51]

learning remember there will be one

[10:53]

dependent feature and there can be any

[10:55]

number of independent features now let's

[10:58]

go ahead and let's discuss about

[10:59]

regression and classification what is

[11:01]

the difference between them now let

[11:03]

let's go ahead and let's discuss about

[11:05]

two things one

[11:08]

is let's say I want a regression problem

[11:11]

statement suppose I take the same

[11:14]

example as age and weight so I have

[11:17]

values like as discussed 24 72 23

[11:22]

71, 24 or 25

[11:26]

71.5 okay so this kind of data I have

[11:29]

see this is my output variable which is

[11:32]

my dependent feature now in this

[11:34]

particular dependent feature now

[11:36]

whenever I'm trying to find out the

[11:37]

output and in this particular output you

[11:39]

have a continuous variable when you have

[11:42]

a continuous variable then this becomes

[11:44]

a regression problem statement now one

[11:47]

example I would like to give suppose

[11:49]

this is my data set right this is my age

[11:52]

this is my weight suppose I am

[11:54]

populating this particular data set with

[11:56]

the help of scatter plot then in order

[11:58]

to basically solve this problem what

[12:01]

we'll do suppose if I take an example of

[12:03]

linear regression I will try to draw a

[12:05]

straight line and this particular line

[12:08]

is my equation, y = mx

[12:11]

+ c, and with the help of this particular

[12:13]

equation I will try to find out the

[12:15]

predicted points so this will be my

[12:17]

predicted point this will be my

[12:18]

predicted point, and any new points

[12:21]

that I see over here will basically be

[12:23]

my predicted point with respect to Y so

[12:26]

in this way we basically solve a

[12:28]

regression problem statement so this is
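The best-fit-line procedure just described, fitting y = mx + c to age/weight points and then reading predictions off the line, can be sketched in NumPy. The numbers below are illustrative stand-ins, not values from the lecture.

```python
import numpy as np

# Illustrative age/weight pairs (hypothetical values, not from the lecture)
age = np.array([21.0, 23.0, 24.0, 25.0, 27.0])
weight = np.array([72.0, 71.0, 72.0, 71.5, 74.0])

# np.polyfit with deg=1 returns the slope m and intercept c of the
# straight line y = m*x + c that minimizes the summed squared
# distances between the real and predicted points
m, c = np.polyfit(age, weight, deg=1)

# a new age is pushed through the fitted line to predict a weight
new_age = 26.0
predicted_weight = m * new_age + c
print(round(float(predicted_weight), 2))
```

The same idea carries over to any number of independent features; the line just becomes a plane or hyperplane.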

[12:30]

very much important to understand let's

[12:32]

go ahead. Always understand, in a

[12:34]

regression problem statement your output

[12:35]

will be a continuous variable the second

[12:37]

one is basically a classification

[12:40]

problem now in classification problem

[12:42]

suppose I have a data set let's say that

[12:45]

number of study

[12:48]

hours, number of play

[12:51]

hours so this is my independent feature

[12:54]

let's say a number of sleeping hours and

[12:57]

finally I have my output which will

[12:59]

be pass or fail so in this I have all

[13:03]

this as my independent features and this

[13:05]

is my dependent feature so I will be

[13:08]

having some values like this and here

[13:11]

it will be either pass or

[13:15]

fail. Now whenever you have in your

[13:18]

output fixed number of categories then

[13:21]

that becomes a classification problem

[13:23]

suppose it just has two outputs then it

[13:25]

becomes a binary classification if you

[13:28]

have more than two different categories

[13:30]

at that time it becomes a multiclass

[13:32]

classification so this is the difference

[13:34]

between regression problem statement and

[13:36]

the classification problem statement now

[13:39]

let's go ahead and let's discuss about

[13:40]

something called as unsupervised machine

[13:42]

learning now in unsupervised machine

[13:44]

learning which is my second main topic

[13:47]

over here I'm just going to write

[13:49]

unsupervised machine learning now what

[13:52]

exactly is unsupervised machine learning

[13:54]

here whenever I talk about there are two

[13:56]

main problem statement that we solve one

[13:58]

is clustering

[13:59]

one is dimensionality reduction let's

[14:02]

take one example of a specific data set

[14:04]

over here let's say that my data set is

[14:06]

something called as salary and age now

[14:10]

in this scenario we don't have any

[14:12]

output variable no output variable no

[14:14]

dependent variable then what kind of

[14:16]

assumptions that we can take out from

[14:19]

this particular data set suppose I have

[14:21]

salary and age as my values so in this

[14:23]

particular case I would like to do

[14:25]

something called as clustering now why

[14:28]

clustering is used just understand let's

[14:31]

say I am going to do something called as

[14:33]

customer segmentation now what does this

[14:35]

customer segmentation do clustering

[14:37]

basically means that based on this data

[14:39]

I will try to find out similar groups

[14:41]

groups of people suppose this is my one

[14:44]

group this is my another group this is

[14:46]

my third group let's say that I was able

[14:48]

to create this many groups this many

[14:50]

groups are clusters I'll say cluster 1 2

[14:53]

three each and every cluster will be

[14:56]

specifying some information this cluster

[14:58]

may specify that this person was

[15:01]

very young but he was able to get some

[15:03]

amazing salary. Another cluster may

[15:06]

specify that these people are basically

[15:07]

having more age and they are getting

[15:10]

good salary. These people are from a middle-class

[15:12]

background, where with respect to

[15:14]

the age the salary is not that much

[15:16]

increasing so here what we are doing we

[15:18]

are doing clustering we are grouping

[15:20]

them together main thing is grouping

[15:23]

this word is very much important now why

[15:25]

do we use this suppose my company

[15:28]

launches a product and I want to just

[15:31]

Target this particular product to rich

[15:33]

people let's say product one is for rich

[15:35]

people product two is for middle class

[15:37]

people so if I make this kind of

[15:40]

clusters I will be able to Target my ads

[15:43]

only to this kind of people let's say

[15:46]

that this is the rich people this is the

[15:48]

middle class people I will be able to

[15:50]

Target this particular ads or this

[15:53]

particular product or send this

[15:55]

particular things to those specific

[15:56]

group of people. That is

[15:59]

basically called as ad marketing and

[16:00]

this uses something called as customer

[16:04]

segmentation a very important example

[16:07]

and based on this customer segmentation

[16:08]

we can later apply any regression or

[16:10]

classification kind of problem statement
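The grouping idea behind customer segmentation can be sketched with a minimal k-means implementation (one of the clustering algorithms covered later in the list). Everything here is a sketch under assumptions: the salary/age rows are hypothetical, and real work would use a tested library implementation.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means (Lloyd's algorithm) sketch of the grouping idea.

    Not a production implementation: no restarts and no special
    empty-cluster handling beyond keeping the old center.
    """
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as centers
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Hypothetical (salary in thousands, age) rows: two visibly different groups
data = np.array([[120.0, 28.0], [130.0, 30.0], [125.0, 27.0],
                 [40.0, 45.0], [42.0, 50.0], [38.0, 48.0]])
labels, centers = kmeans(data, k=2)
print(labels)  # first three rows should share one cluster id, last three the other
```

Note there is no output column anywhere: the groups come purely from the points themselves, which is exactly the unsupervised setting described above.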

[16:12]

now coming to the second one after

[16:14]

clustering which is called as

[16:15]

dimensionality reduction now in

[16:17]

dimensionality reduction what we are

[16:19]

focusing on? Suppose if we have 1000

[16:22]

features can we reduce this features to

[16:25]

lower Dimensions let's say that I want

[16:27]

to convert this

[16:29]

1000 features to 100 features, a lower

[16:32]

Dimension so can we do that yes it is

[16:36]

possible with the help of dimensionality

[16:38]

reduction algorithms there are some

[16:40]

algorithms like PCA so I'll also try to

[16:42]

cover this as we go ahead understand
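The reduction just described, many features projected down to a lower dimension, is what PCA does. A minimal sketch via centering and SVD follows; the data sizes are illustrative, not from the lecture.

```python
import numpy as np

def pca_reduce(X, n_components):
    """A PCA sketch: center the data, then project it onto the
    directions of largest variance found by SVD."""
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical data: 100 samples with 5 features, reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_low = pca_reduce(X, n_components=2)
print(X_low.shape)  # (100, 2)
```

The first retained column always carries at least as much variance as the second, which is the sense in which PCA keeps the "most informative" directions.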

[16:44]

clustering is not a classification

[16:46]

problem clustering is a grouping

[16:48]

algorithm there is no output feature no

[16:50]

dependent variable in clustering sorry

[16:53]

in unsupervised ml so yes I will also

[16:55]

try to cover up LDA we'll cover up PCA

[16:58]

and all as we go ahead so with respect

[17:00]

to supervised and unsupervised so first

[17:03]

thing that we are going to cover is

[17:04]

something called as linear regression

[17:06]

the second algorithm that we will try to

[17:08]

cover after linear regression is

[17:10]

something called as Ridge and lasso

[17:12]

third that we are going to cover is

[17:14]

something called as logistic regression

[17:16]

the fourth that we are basically going

[17:17]

to cover is something called as decision

[17:19]

tree decision tree includes both

[17:21]

classification and regression. Fifth

[17:24]

that we are going to cover is something

[17:25]

called as adab boost sixth that we are

[17:27]

going to cover is something called as

[17:28]

random Forest seventh that we are going

[17:30]

to cover is something called as gradient

[17:32]

boosting eighth that we are going to

[17:34]

cover is something called as XGBoost. Ninth

[17:37]

that we are going to cover is something

[17:38]

called as naive Bayes then when we go to the

[17:41]

unsupervised machine learning algorithm

[17:43]

the first algorithm that we are going to

[17:45]

do is something called as the K-means

[17:47]

algorithm, then we also have DBSCAN,

[17:48]

then we are also going to do hierarchical

[17:50]

clustering, there is also something

[17:52]

called as K nearest neighbor clustering

[17:55]

fifth we'll try to see about PCA then

[17:57]

LDA so different things we

[18:00]

will try to cover up. Yes, SVM I have

[18:02]

missed here, I'm going to include SVM; KNN

[18:05]

will also get covered so I have that in

[18:07]

my list probably I may miss one or two

[18:08]

but we are going to cover everything so

[18:10]

let's start our first algorithm linear

[18:13]

regression so let's go ahead and discuss

[18:15]

about linear regression linear

[18:16]

regression problem statement is very

[18:18]

simple guys so let's say

[18:21]

I have two features one is my X feature

[18:23]

and one is my y feature let's say that X

[18:25]

is nothing but age and Y is nothing but

[18:29]

weight so based on these two features I

[18:31]

have some data points that has been

[18:34]

present over here so in linear

[18:35]

regression what we try to do is that we

[18:38]

try to create a model with the help of

[18:40]

this training data set so this will be

[18:43]

my training data set what I'm actually

[18:45]

going to do is that I'm going to

[18:47]

basically train a model and this model

[18:50]

is nothing but a kind of

[18:52]

hypothesis, just a kind of hypothesis

[18:54]

which takes the new age and gives the

[18:57]

output of the weights and then with the

[19:01]

help of performance metrics we try to

[19:03]

verify whether this model is performing

[19:05]

well or not now in short what we are

[19:06]

going to do in linear regression is that

[19:08]

we'll try to find out a best fit line

[19:10]

which will actually help us to do the

[19:12]

prediction that basically means if I get

[19:14]

my new age over here then what should be

[19:16]

my output with respect to Y okay so with

[19:19]

respect to this what should be my output

[19:21]

over here in this particular case

[19:23]

whenever we are drawing a diagram like

[19:24]

this I can basically say that Y is a

[19:28]

linear function of X so this is what we

[19:31]

are going to do now understand how we

[19:33]

are going to create this best fit line

[19:35]

this is very much important whenever we

[19:36]

say linear regression it basically means

[19:39]

that we are going to create a linear

[19:40]

line over there you may be thinking sir

[19:43]

why to create linear line why not

[19:44]

nonlinear line that I'll discuss about

[19:46]

it as we go ahead when we see other

[19:48]

algorithms so to begin with let's

[19:51]

consider this line that you see over

[19:53]

here right this line equation can be

[19:56]

given by multiple equations. Some

[19:58]

people write y = mx + c, some

[20:01]

people write h theta of x, some people write y =

[20:05]

beta 0 + beta 1 into x, and some people write

[20:08]

h theta of x = theta 0 + theta 1 into

[20:13]

x. Many equations are there for

[20:16]

this straight line,

[20:18]

many equations are there with

[20:20]

respect to many different kinds of

[20:22]

notations but the first algorithm that I

[20:24]

have probably learned of linear

[20:26]

regression is from Andrew Ng definitely

[20:29]

I would like to give him the entire

[20:30]

credits and based on his notation

[20:33]

whatever he has explained I'll try to

[20:34]

explain you over here so the credits for

[20:37]

this algorithm specifically goes to

[20:40]

Andrew Ng so let's consider this one

[20:43]

over here in order to create this

[20:45]

straight line I will basically use a

[20:47]

equation which is called as H Theta so

[20:50]

this is the equation of a straight line

[20:52]

if I know the equation of the straight

[20:54]

line whatever I can write I can write

[20:56]

many things: y = mx + c, y = beta 0 + beta

[21:00]

1 into x, and then I can also write one more,

[21:04]

that is h theta of x = theta 0 + theta 1

[21:08]

into x of i; here also you can basically

[21:11]

say x of i, here also you can say x of i

[21:13]

now let's go ahead and let's take this

[21:15]

equation for now,

[21:17]

so I'm going to take

[21:19]

out this equation and just write one

[21:21]

equation through which I have also

[21:23]

studied but I will definitely be adding

[21:25]

some points which probably Andrew Ng

[21:27]

could not mention in his video

[21:29]

but I'll try my level best obviously he

[21:32]

is the best I cannot even compare myself

[21:34]

to him. So, h theta of x = theta 0 + theta 1 into x; now

[21:39]

let's understand what is Theta 0 Theta 1

[21:42]

as I said that let's say I have a

[21:44]

problem statement over here let's say I

[21:47]

this is my X and this is my y this is my

[21:49]

data points now what I'm doing I'm

[21:51]

trying to create a best fit line like

[21:53]

this now what is this best fit line what

[21:55]

is, when I say this best fit line is

[21:57]

basically given by this equation what

[21:59]

does Theta 0 basically indicate Theta 0

[22:02]

over here is something called as

[22:04]

intercept now what exactly is intercept

[22:08]

intercept basically means that when your

[22:10]

X is zero then H Theta of X is equal to

[22:13]

Theta 0 so in this particular case

[22:16]

intercept basically indicates that at

[22:18]

what point you are meeting the y-axis so

[22:22]

this particular point is basically

[22:24]

your intercept when your X is equal to 0

[22:28]

at that point of time you'll be seeing

[22:30]

that this line is intersecting the

[22:32]

y-axis; whatever value this will be, that is

[22:34]

your intercept now the second thing is

[22:37]

about your Theta 1 what is Theta 1 this

[22:40]

is nothing but slope or coefficient now

[22:43]

what does this basically indicate this

[22:45]

indicates: let's say that this is

[22:47]

one unit in the x-axis and probably

[22:50]

with respect to this I can find one

[22:52]

point over here one point over here and

[22:55]

if I try to draw this over here to here

[22:57]

this is the unit movement in y so what

[23:00]

does it basically say? Slope: with

[23:02]

one unit movement

[23:05]

towards the x-axis what is the unit

[23:07]

movement in y- axis that is basically

[23:09]

slope or coefficient Theta 0 and Theta 1

[23:11]

two things and X of I is definitely your

[23:14]

data points now our main aim is to

[23:18]

create a best fit line in such a way

[23:21]

that I'll just try to show it to you.

[23:22]

What is our main aim? Let's

[23:24]

understand what is the aim of a linear

[23:26]

regression so if I take an example of

[23:29]

linear regression I need to find out the

[23:32]

best fit line in such a way that the

[23:35]

distance

[23:36]

between this data points that I have and

[23:40]

the predicted points should be very

[23:42]

small. Suppose I'm creating a best fit

[23:46]

line okay I'm creating a best fit line

[23:49]

so with respect to this data points

[23:51]

initially was this right but my

[23:52]

predicted point is this point in this

[23:55]

particular case my predicted point is

[23:56]

this point so if I do the

[23:58]

summation of all these points those

[24:01]

distance should be minimal then only

[24:04]

I'll be able to say that this is the

[24:06]

best fit line so I cannot definitely

[24:08]

say that this is exactly the best fit

[24:10]

line or not how will I say when I try to

[24:13]

calculate the difference between this

[24:15]

point and the predicted Point these are

[24:17]

my predicted point right if I try to

[24:19]

calculate the distance between them then

[24:22]

I will basically have an aim that it should

[24:24]

be minimal if I do the summation of all

[24:26]

the distance it should be minimal

[24:29]

so for that what I can do is that see

[24:31]

you may be also thinking Krish why not

[24:33]

just do one thing okay suppose if these

[24:35]

are my data points why not just play and

[24:38]

create multiple lines and try to compare

[24:40]

what we can do is that we can compare

[24:42]

multiple we can create multiple lines

[24:44]

right like this and then whoever is

[24:46]

giving the best minimal point I will go

[24:48]

and select that but how many iteration

[24:51]

you will do how you will come to know

[24:52]

that okay this line is the best line so

[24:55]

for that specific purpose we should

[24:57]

start at one point and we should lead

[25:01]

towards finding the best fit line start

[25:04]

at one point and then we should go

[25:06]

towards finding the best fit line so for

[25:10]

this particular purpose what we do is

[25:12]

that we create something called as a

[25:15]

cost function I have already shown you

[25:17]

what is my hypothesis function my best

[25:19]

fit line equation is basically given as

[25:21]

H Theta of x equal to Theta 0 + Theta 1

[25:26]

* X this is my hypothesis right now

[25:29]

coming to the cost function which is

[25:32]

super important. Why is it

[25:34]

super important? Because of what the cost function

[25:37]

basically does. What is the cost function

[25:38]

over here? I told you, right: this

[25:41]

distance, when I do the

[25:42]

summation of it, when I'm

[25:45]

doing the summation, it should be minimal

[25:48]

so if I really want to find out this

[25:49]

particular distance I will be using one

[25:51]

more equation how can I use a distance

[25:54]

formula between the predicted and the

[25:56]

real point I will just say that H Theta

[26:00]

of x - y so when I say h Theta of x - Y

[26:06]

what does this basically mean? y is my

[26:07]

real point and h theta of x is my predicted

[26:10]

point; the predicted point is basically given

[26:12]

by H Theta of X and what I'm going to do

[26:15]

I'm going to basically do the squaring

[26:17]

because I may get a negative value so

[26:18]

because of that I really want to do the

[26:20]

squaring part Now understand one thing I

[26:23]

need to also do the

[26:25]

summation i equal to 1 to the complete m

[26:29]

let's say that I'm taking the number of

[26:30]

data points over here as M because I

[26:33]

need to calculate the distance between

[26:34]

all the points right with respect to the

[26:37]

predicted and the predict with respect

[26:39]

to the real

[26:40]

points so after this I also need to

[26:44]

divide by 1 by 2m the reason why I'm

[26:47]

dividing by first of all let me show you

[26:49]

why we are dividing by 1 by m 1 by m

[26:51]

will give us the average of all the

[26:53]

values that we have the specific reason

[26:56]

why we are dividing by 2 is for

[26:59]

the derivation purpose it helps us to

[27:02]

make our equation very much simpler so

[27:05]

that later on when I am updating the

[27:08]

weights when I say weights I'm basically

[27:10]

updating Theta 0 and Theta 1 Theta 0 and

[27:13]

Theta 1 at that point of time you'll be

[27:15]

able to see that this particular value

[27:18]

when we probably do the derivative it

[27:20]

will help us to do it again I'm going to

[27:22]

repeat it I'm going to write it down for

[27:24]

you first of

[27:26]

all now in order to find find out the

[27:28]

best fit line I need to keep on changing

[27:30]

Theta 0 and Theta 1 unless and until I

[27:33]

get the best fit line unless and until I

[27:35]

don't get the best fit line I need to

[27:37]

keep on updating Theta 0 and Theta 1 now

[27:40]

if I need to keep on updating Theta 0

[27:42]

and Theta 1 I probably require a cost

[27:45]

function okay what this cost function

[27:47]

will do I'll just tell you so cost

[27:49]

function over here I will specify as J

[27:53]

of theta 0 comma Theta 1 is equal to now

[27:57]

what is cost fun function over here what

[27:59]

this distance I told right this distance

[28:01]

between the H Theta of X and Y if I do

[28:05]

the summation of all these things it

[28:07]

needs to be minimal it needs to be less

[28:10]

because with respect to an X point this

[28:12]

is my y point

[28:14]

right similarly with respect to this x

[28:16]

point this is my y point so what I'm

[28:19]

actually going to do I'm going to use a

[28:20]

cost function now in this cost function

[28:23]

my main aim is

[28:25]

to basically write H Theta of x minus y whole square

[28:29]

this will be with respect to i why I

[28:32]

am saying i because this will be moving

[28:34]

from I equal to 1 to all the points that

[28:37]

is m m is basically all the points over

[28:41]

here now apart from this what I actually

[28:44]

going to do I'm going to divide by 1 by 2

[28:46]

m I'll tell you why I'm specifically

[28:48]

dividing by 1 by 2m first of all by

[28:51]

dividing by m I will be getting an

[28:53]

average

[28:54]

output average cost function because

[28:57]

here I'm iterating M the reason why I'm

[29:00]

dividing by two because it will help us

[29:01]

in derivation why let's say that I have

[29:04]

x² if I try to find out derivative of x²

[29:08]

with respect to X then what will I get I

[29:11]

will basically get 2x right that is what

[29:14]

is the formula what is the derivative of

[29:16]

x to the power n it is nothing but n x to the power n

[29:19]

minus 1 so that is the reason why I'm

[29:21]

actually making it 1 by two so that when

[29:24]

two comes over here this two and two

[29:26]

will get cancelled so I hope everybody's

[29:29]

able to understand so this is my cost

[29:32]

function Now understand what is this

[29:34]

called as this entire equation is

[29:36]

basically called as squared error

[29:40]

function yes mathematical Simplicity

[29:42]

basically means because when we are

[29:44]

updating Theta 0 and Theta 1 we

[29:46]

basically find out derivation in the

[29:47]

cost function so that is the reason why

[29:50]

we are specifically doing it squaring

[29:52]

off is basically done because so that we

[29:54]

don't get any negative values here

[29:56]

squared error function now let's go

[29:59]

towards the what we need to solve this

[30:02]

is my cost function okay so I need to

[30:07]

minimize minimize this particular value

[30:10]

that is 1 by 2m summation of i equal to 1 to m

[30:15]

and then this will basically be H Theta

[30:17]

of X of I minus y of I whole Square we

[30:23]

need to minimize this by adjusting

[30:26]

parameter Theta 0 and Theta 1

[30:28]

this entirely is what this is nothing

[30:31]

but J of theta 0 comma Theta 1 and we

[30:36]

really need to minimize this so this is

[30:38]

our task okay this is our task now let's
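The hypothesis function and the cost function defined so far can be sketched in a few lines of Python (an illustrative sketch with my own naming, not code from the video):

```python
# Hypothesis and cost function for simple linear regression,
# following the lecture: h_theta(x) = theta0 + theta1 * x and
# J(theta0, theta1) = (1 / (2m)) * sum((h_theta(x_i) - y_i) ** 2)

def hypothesis(theta0, theta1, x):
    # Predicted value for a single input x.
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    # Squared error averaged over m points, halved so the
    # factor of 2 from differentiating the square cancels.
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2
                for x, y in zip(xs, ys))
    return total / (2 * m)
```

For example, `cost(0, 1, [1, 2, 3], [1, 2, 3])` returns 0.0, since the line with intercept 0 and slope 1 passes through every point.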

[30:41]

go ahead and let's try to compare with

[30:44]

two different thing one is the

[30:46]

hypothesis testing and one is with

[30:48]

respect to the cost

[30:49]

function okay let's take an

[30:52]

example so right now my equation of

[30:58]

the

[30:59]

hypothesis is nothing but H Theta of x

[31:02]

equal to Theta 0 + Theta 1 *

[31:06]

X if Theta 0 is 0 then what does this

[31:11]

basically indicate can I say that it

[31:14]

basically the line the line the best fit

[31:16]

line passes through the origin and this

[31:18]

is nothing but H Theta of x equal to Theta

[31:21]

1 multiplied by X can I say like this

[31:25]

obviously I can definitely say like this

[31:27]

right so my equation will be like this

[31:29]

so for right now let's consider that

[31:33]

your Theta 0 is equal to 0 so this is

[31:35]

what it is we have done till here we

[31:37]

have minimized we have written the

[31:39]

equation everything yes so it is passing

[31:42]

through the origin and this is what is

[31:44]

the equation I'm actually getting now

[31:47]

let's take one example and let's try to

[31:48]

solve this if I if I have H Theta of X

[31:51]

so this is my new hypothesis considering

[31:54]

that my intercept is passing through the

[31:57]

origin so with respect to this let's say

[32:00]

that I will create one line over here

[32:04]

let's say this is

[32:05]

my this is my data points like X1 y1 I

[32:11]

have 1 2 3 I have 1 2 3 now let's

[32:19]

consider that I have data

[32:22]

points like

[32:24]

let's say I have three data points 1

[32:26]

comma 1, 2 comma 2, 3 comma 3 so 1 comma 1 is

[32:31]

nothing but this is my data point, 2 comma 2

[32:34]

is nothing but this is my data point and

[32:36]

3 comma 3 is this is my data point so

[32:39]

these are my data points from the data

[32:41]

set that I

[32:43]

have so 2 comma 2 is this point and 3

[32:47]

comma 3 is basically this point let's

[32:49]

consider that these are my points that I

[32:51]

have these are my data points now if I

[32:54]

consider Theta 1 as 1 where do you think

[32:57]

the straight line will pass through

[32:59]

where do you think the straight line

[33:00]

will pass the straight line will

[33:02]

definitely pass like this right my

[33:05]

straight line will definitely pass

[33:06]

through all the points this same point

[33:08]

becomes a prediction point also right

[33:11]

same point let's consider that this is

[33:13]

also getting pass through this it passes

[33:15]

through all the points when Theta 1 is

[33:17]

equal to 1 Theta 1 is nothing but slope

[33:19]

when slope is equal to 1 in this

[33:21]

scenario it passes through all the

[33:22]

points now go ahead and calculate your J

[33:25]

of theta so what will the form of J of

[33:28]

theta 1 become because Theta 0 is 0 okay

[33:31]

we can basically write 1 by 2 m

[33:33]

summation of i equal to 1 to 3 how many

[33:36]

points are there three right and here I

[33:39]

have J of H of theta of X1

[33:43]

sorry H of theta of x i minus y i

[33:49]

whole square right now let's go ahead and compute

[33:52]

now in this particular scenario what

[33:54]

will happen 1 by 2m

[33:57]

then what is what is this point minus y

[34:00]

of I see h of X is also 1 y of I is also

[34:04]

one both the point are 1 so this will

[34:06]

become 1 minus 1 whole square plus because we are

[34:09]

doing summation the next point is also

[34:11]

falling on 2 comma 2 so this will become 2 minus

[34:13]

2 square plus 3 minus 3 square so in total this will

[34:18]

become zero so when your J of theta when

[34:22]

Theta 1 is 1 Theta 1 is 1 so J of theta

[34:26]

1 is how much it is

[34:29]

zero right so what is this J of theta 1 it

[34:33]

is the cost function so let me draw the

[34:35]

cost function graph over here let's say

[34:39]

that this is my Theta and this is

[34:42]

my so here I have 0.5 here I have 1 here

[34:46]

I have 1.5 so this is my Theta here I

[34:49]

have two then I have 2.5 okay then

[34:52]

similarly I have 0.5 then I have 1

[34:58]

1.5 2 2.5 this is my J of theta 1 so

[35:04]

right now what is my Theta 1 my Theta 1

[35:07]

is 1 at this particular Point what did I

[35:09]

get J of theta 1 is nothing but zero so

[35:12]

this will be my first point this will be

[35:15]

my first point guys I have discussed why

[35:18]

why the value will be 1 by 2m basically to

[35:20]

make the calculation simpler we are

[35:22]

dividing by 1 by 2m is basically used to

[35:26]

average the summation that we

[35:28]

are actually doing over here now let's

[35:30]

go ahead and let's take the second

[35:32]

scenario in the second scenario let's

[35:34]

consider my Theta 1 let's say that my

[35:37]

Theta 1 over here is now 0.5 if my Theta

[35:41]

1 is 0.5 then tell me what are the

[35:43]

points that I will get for x equal to

[35:47]

1, 0.5 into 1, so it will come as 0.5 over

[35:51]

here right then similarly when X is

[35:54]

equal to 2, 0.5 into 2 is nothing but 1 over

[35:59]

here and then similarly when uh for x

[36:03]

equal to

[36:04]

3, 0.5 multiplied by 3 see we are

[36:07]

multiplying here right, 0.5 multiplied by 3 is

[36:09]

1.5 so the next point will come over

[36:12]

here now when I create my best fit line

[36:15]

what will happen so here is my next best

[36:19]

fit line which I will probably create by

[36:20]

green

[36:23]

color okay so this is my second one

[36:25]

which is green color here definitely

[36:27]

slope is decreasing so if I go ahead and

[36:30]

calculate my J of theta let's see what

[36:32]

I'll get so J of theta

[36:35]

1 is nothing but 1 by 2

[36:39]

m again same equation summation of I = 1

[36:42]

2 3 H Theta of X of

[36:46]

i - y of

[36:49]

i² so what we have for over here we have

[36:52]

nothing but 1 by 2m now let's do the

[36:56]

summation what is this point this point

[36:58]

is nothing but the predicted point and

[37:01]

this point is the real point right so in

[37:03]

this particular scenario the first point

[37:05]

that I will get is nothing but 0.5 minus 1

[37:10]

whole square how I'm getting 0.5 minus 1 whole

[37:12]

square this is 1 this is the real point

[37:15]

1 this is the predicted point 0.5 so here

[37:18]

I'm getting 0.5 minus 1 whole square the

[37:21]

second point will be 1 - 2 whole s right

[37:25]

2 so 1 - 2 whole

[37:28]

s and then I will finally get 1.5 - 3

[37:34]

whole s so finally if I do this

[37:36]

calculation how much I'm actually

[37:38]

getting 1 by 2 into 3, that is 1 by 6, here I'm

[37:42]

getting

[37:44]

0.25, 0.5 square, here I'm getting 1 here I'm

[37:47]

getting 1.5 whole Square so my final

[37:51]

output will be which I have already

[37:53]

calculated it is nothing but point it

[37:56]

will be approximately equal to 0.58 so 0.58

[38:01]

now with Theta as this is nothing but

[38:04]

Theta Theta 1 as

[38:07]

0.5 right that is what Theta 1 as 0.5 we

[38:11]

are able to get 0.58 so Theta 1 is 0.5

[38:15]

over here and 0.58 will be coming

[38:17]

somewhere here right so this is my next

[38:20]

point which will be again in green color

[38:23]

now let's go ahead and calculate the

[38:24]

third condition now in third condition

[38:26]

what I'm actually going to write I'm

[38:28]

going to basically say Theta 1 as 0 at

[38:31]

that point of time just go and assume

[38:34]

what is 0 multiplied by X it will

[38:36]

obviously be zero so I will be getting

[38:38]

three points and my next line will be in

[38:41]

this line that is the

[38:45]

x-axis and this is basically all my

[38:47]

points now if I go ahead and calculate

[38:50]

this what is J of theta 1

[38:52]

now what is J of theta 1 now in this

[38:55]

particular case when my Theta 1 is equal

[38:57]

equal to 0, 1 by 2m, now this part you'll be

[39:02]

able to see this is 0 - 1 0 - 2 0 -

[39:08]

3 okay so it will become 0 - 1 s 0 - 2 s

[39:14]

and 0 - 3

[39:16]

square okay so this will become 1 by 6

[39:20]

into 1 + 4 + 9 which will be

[39:25]

nothing but 14 by 6 which is approximately

[39:29]

equal to

[39:30]

2.3 then what will happen with respect

[39:33]

to Theta 1 as 0 we are getting 2.3 so if

[39:36]

I draw this it is nothing but with

[39:38]

respect to zero I'm getting

[39:41]

2.3

[39:44]

this is my point so similarly when
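The three board calculations (theta 1 equal to 1, 0.5, and 0 on the points (1,1), (2,2), (3,3)) can be checked in code; a quick Python sketch of my own, for verification:

```python
# Verifying the lecture's J(theta1) values with theta0 = 0
# on the data points (1,1), (2,2), (3,3).

xs, ys = [1, 2, 3], [1, 2, 3]
m = len(xs)

def J(theta1):
    # Cost with theta0 fixed at 0: (1/2m) * sum((theta1*x - y)^2).
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(J(1))    # 0.0      -> the line through all points, the global minimum
print(J(0.5))  # ~0.5833  -> the ~0.58 worked out on the board
print(J(0))    # ~2.3333  -> the ~2.3 worked out on the board
```

Plotting J against theta 1 for many such values traces out exactly the bowl-shaped curve the lecture draws next.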

[39:47]

you start constructing with Theta 1 is

[39:49]

equal 2 I may get some point over here

[39:52]

so here when I join this points

[39:56]

together you will be seeing that I will

[39:58]

be getting this kind of

[40:01]

curve okay and this curve is something

[40:04]

called as gradient

[40:07]

descent and this gradient descent will

[40:10]

play a very very important role in

[40:14]

making sure that in making sure that you

[40:17]

get the right Theta 1 value or right

[40:20]

slope value now which is the most

[40:22]

suitable point the most suitable point

[40:24]

is to come over here because this is

[40:27]

this this point is basically called AS

[40:30]

Global

[40:31]

Minima because see out of all these

[40:34]

three lines which is the best fit line

[40:35]

this is the best fit line right this is

[40:38]

the best fit line when I had this best

[40:40]

fit line my point that came over here

[40:44]

was here itself this was my point that

[40:46]

came over here right and I want to

[40:48]

basically come to this region because

[40:50]

this is my Global

[40:52]

Minima when I basically am over here the

[40:56]

distance between the predicted and the

[40:58]

real point is very very less right so

[41:02]

this specific point is basically called

[41:04]

AS Global minimum but still I did not

[41:07]

discuss Krish you have assumed Theta 1

[41:10]

is 1 Theta 1 is 0.5 Theta 1 is 0 here

[41:13]

also you're assuming many things right

[41:15]

and then you probably calculating and

[41:17]

you're creating this gradient descent

[41:19]

but the thing should be that probably

[41:22]

you come to one point over here and then

[41:25]

you reach towards this so for that

[41:27]

specific reason how do you do that how

[41:30]

do I first of all come to a point and

[41:32]

then move towards This Global Minima so

[41:35]

for that specific case we will be using

[41:37]

one convergence algorithm because if I

[41:40]

come to one specific point after that I

[41:43]

just need to keep on updating Theta 1

[41:45]

instead of using different different

[41:47]

Theta 1 value so for this we use

[41:50]

something called as convergence

[41:52]

algorithm so here the convergence

[41:54]

algorithm basically says

[41:59]

repeat until

[42:03]

convergence that basically means I'm in

[42:05]

a while loop let's say and here I'm

[42:08]

basically going to update my Theta value

[42:11]

which will be given by this notation

[42:13]

which is continuous updation where I'll

[42:15]

say Theta J minus I'll talk about this

[42:19]

Alpha don't worry and then it will be

[42:22]

derivative of theta

[42:25]

J with respect to this J of theta

[42:29]

0 and Theta 1 so this should happen that

[42:34]

basically means after we reach to a

[42:36]

specific point of theta after performing

[42:40]

this particular operation we should be

[42:43]

able to come to the global Minima and

[42:45]

this this specific thing that you are

[42:47]

able to see is called as

[42:50]

derivative this is called as derivative

[42:52]

derivative basically means I'm trying to

[42:54]

find out the slope

[42:57]

derivative which I can also say it as

[42:59]

slope this equation will definitely work

[43:02]

guys trust me this will definitely work

[43:04]

why it will work I'll just draw it show

[43:06]

it to you let's say that this is my cost

[43:09]

function let's say that I've got this

[43:11]

gradient

[43:12]

descent and let's say that my first

[43:15]

point is somewhere here but I have to

[43:18]

reach somewhere here right now when I

[43:20]

reach this this is my Theta 1 and this

[43:23]

is my J of theta 1 suppose I reach at

[43:25]

this specific point and I will also have

[43:28]

another gradient descent which looks

[43:30]

like this let's say that in the initial

[43:33]

time I reach the point over here how we

[43:35]

will be coming to this minimal Global

[43:37]

Minima by using this equation I'll talk

[43:40]

about Alpha also don't worry now this is

[43:42]

also my Theta 1 this is also my J of

[43:44]

theta 1 now let's say suppose I came to

[43:47]

this particular point right after coming

[43:49]

to this particular point I will

[43:52]

basically apply this derivative on this

[43:55]

J of theta 1 okay now when I find out a

[43:59]

derivative that basically means we are

[44:00]

trying to find out the slope and in

[44:02]

order to find the slope we just create a

[44:04]

straight line like

[44:05]

this which will look like this I'll just

[44:08]

try to

[44:10]

create so I'll try to create a slope

[44:12]

like this this

[44:15]

slope so if you try to find out with

[44:17]

respect to this this is a positive slope

[44:20]

how do we indicate it because understand

[44:22]

the right hand side of the line of this

[44:24]

is pointing on the top wordss Direction

[44:27]

this is the best easy way to find out

[44:30]

whether it is a positive slope or

[44:31]

negative slope now in this particular

[44:33]

case this is a positive slope now when I

[44:36]

get a positive slope that basically

[44:38]

means I will update my weights or Theta

[44:40]

1 as Theta 1 let's say I'm writing it

[44:44]

over here so I will just apply this

[44:46]

convergence algorithm see Theta

[44:49]

1 colon Theta 1 minus this learning rate

[44:55]

which is called as Alpha this is my my

[44:57]

learning rate I'll talk about learning

[44:58]

rate don't worry then this derivative

[45:02]

value in this particular case since I'm

[45:04]

having a positive slope I will be

[45:06]

getting a positive value let's say that

[45:09]

for this Theta value I got this slope

[45:12]

initially now I need to come to this

[45:15]

location so for that I have to reduce

[45:17]

Theta 1 so that I come to this main

[45:20]

point now here you can see that I am I

[45:23]

subtracting Theta 1 with something which

[45:25]

is a positive number

[45:28]

right this is a positive number so

[45:29]

definitely I know that after some n

[45:31]

number of iteration I will be able to

[45:34]

come to the global Minima similarly if I

[45:36]

take the right hand side and if I try to

[45:38]

draw the slope in this particular case

[45:40]

my slope will be

[45:42]

negative so similarly I can write the

[45:44]

equation as Theta

[45:46]

1 = to Theta 1 minus learning rate

[45:51]

multiplied by a negative number so minus

[45:54]

into minus will be positive right

[45:55]

suppose initially my Theta 1 was

[45:58]

here my Theta 1 was here now I'll keep

[46:01]

on updating the weight to come to this

[46:02]

Global Minima so minus into minus is

[46:06]

positive so I will basically get Theta 1

[46:09]

+

[46:10]

alpha into a positive number because minus

[46:13]

into minus is plus so this will

[46:16]

definitely work so that we will be able

[46:19]

to come over here to the global Minima

[46:22]

whether it is a positive slope or a

[46:24]

negative slope now what is this learning

[46:26]

learning rate now learning rate based on

[46:30]

this learning rate suppose I want to

[46:32]

come from this point to the global

[46:35]

Minima by what speed I should be coming

[46:39]

what speed if my learning rate value is

[46:41]

bigger what speed I may be coming

[46:43]

suppose if I say usually we select

[46:45]

learning rate as 0.01 if I select a small

[46:48]

number then it'll start taking small

[46:50]

small steps to move towards the optimal

[46:52]

Minima but if I take a alpha value a

[46:55]

huge value if it is a huge huge value

[46:57]

then what will happen this uh this

[47:00]

updation of the Theta 1 will keep on

[47:02]

jumping here and there and the situation

[47:03]

will be that it will never meet it will

[47:07]

never reach the global Minima so it is a

[47:09]

very very good decision to take a alpha

[47:12]

small value it should also not be a very

[47:13]

very small value if it becomes a very

[47:16]

very small value then what will happen

[47:18]

very tiny steps it will take forever to

[47:20]

reach the global Minima that basically

[47:22]

means my model will keep on training

[47:24]

itself so definitely this algorithm is going to

[47:27]

work now let me talk about one
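The behaviour of the learning rate just described, small steps versus overshooting, can be demonstrated on a toy cost function. A minimal Python sketch (my own example, not from the video), minimizing J(t) = t squared, whose derivative is 2t:

```python
# Gradient descent on the toy convex function J(t) = t^2 (derivative 2t),
# showing how the learning rate alpha changes convergence behaviour.

def descend(alpha, start=5.0, steps=50):
    t = start
    for _ in range(steps):
        t = t - alpha * 2 * t  # theta := theta - alpha * dJ/dtheta
    return t

print(descend(0.01))  # small alpha: tiny steps, still far from the minimum 0
print(descend(0.4))   # moderate alpha: converges essentially to 0
print(descend(1.1))   # too large: each step overshoots and |t| blows up
```

This mirrors the lecture's point: too small an alpha takes forever, too large an alpha jumps back and forth and never reaches the global minimum.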

[47:30]

scenario one scenario will be that what

[47:33]

if my my cost function has a local

[47:36]

Minima what if I have a local Minima

[47:39]

because here if I

[47:41]

come here if I come this is a local

[47:43]

Minima suppose one of my points come

[47:46]

over here and finally I'm reaching over

[47:48]

here what will happen in this particular

[47:50]

case because in this case you'll be

[47:52]

seeing that what will be my equation my

[47:54]

equation will be simply Theta 1

[47:57]

Theta 1 minus Alpha in this point in

[48:01]

this local Minima slope will be zero so

[48:03]

in this particular case my Theta 1 will

[48:05]

be equal to Theta 1 now you may be

[48:07]

thinking what is if this is the scenario

[48:10]

then we will be stuck in local Minima

[48:13]

this is called as local

[48:15]

Minima but usually with respect to the

[48:18]

gradient descent and the equation that

[48:20]

we are using here we do not get stuck in

[48:23]

local Minima because our gradient

[48:25]

descent in this particular scenario

[48:27]

will always look like this but yes in

[48:29]

deep learning when we are learning about

[48:31]

gradient descent and an ANN at that point

[48:34]

of time we have lot of local Minima and

[48:37]

because of that we have different

[48:38]

different gradient descent algorithms like RMS

[48:40]

prop we have Adam optimizers which will

[48:43]

solve that specific problem so this one

[48:46]

point also I wanted to mention because

[48:48]

tomorrow if someone asks you as an

[48:49]

interview question that what if in your

[48:52]

uh do you see any local Minima in linear

[48:54]

regression you can just say that the cost

[48:57]

function that we use will definitely not

[49:00]

give us local Minima but if in deep

[49:02]

learning techniques with that we are

[49:03]

trying to use like Ann we have different

[49:05]

different kind of optimizers which will

[49:07]

solve that particular problem so that is

[49:10]

the answer you basically have to give

[49:12]

now let me go ahead and write with

[49:14]

respect to the gradient descent

[49:15]

algorithm so here again I'm going to

[49:17]

write the gradient descent algorithm so

[49:19]

this will be my gradient descent

[49:21]

algorithm and remember guys gradient

[49:24]

descent is an amazing algorithm and you

[49:26]

you will definitely be using it so

[49:29]

please make sure that you know this

[49:32]

perfectly now some questions are that

[49:35]

when will convergence stop convergence

[49:37]

will stop when we come to near this area

[49:40]

where my uh J of theta will be very very

[49:44]

less now in gradient descent algorithm I

[49:47]

will again repeat it so what did I say I

[49:50]

said

[49:51]

repeat until convergence I told you

[49:54]

right here we have written this

[49:55]

algorithm

[49:57]

and now let's take it for Theta 0 and

[49:59]

Theta 1 so here I will write Theta

[50:02]

J equal to Theta

[50:06]

J minus learning rate of derivative of

[50:11]

theta

[50:14]

J J of theta 0 and Theta 1 so this is my

[50:19]

repeat until convergence now we really

[50:22]

need to find out what we'll try to

[50:24]

equate we'll try to first of all find

[50:25]

out what is this

[50:28]

now if I really want to find out

[50:30]

derivative

[50:32]

of derivative of derivative of theta J

[50:37]

with respect to J of theta 0 and Theta 1

[50:41]

so how do I write this I can definitely

[50:44]

write this in a easy way okay so this

[50:46]

will be derivative of theta J and

[50:49]

remember J will be 0 and 1 right because

[50:53]

we need to find out for 0 Theta 0 and

[50:55]

Theta 1 so this will be 1 by 2 m what is

[50:59]

what is J of theta 0a Theta 1 obviously

[51:02]

my cost function so I will write

[51:04]

summation of i equal to 1 to M and here I will

[51:08]

basically write H theta of X of I

[51:11]

minus y of I whole squar so if my J is

[51:16]

equal to Z so what will happen for this

[51:19]

so here I can specifically say that

[51:21]

derivative of derivative of theta 0 J of

[51:25]

theta 0 comma theta 1

[51:27]

now it's simple here what I will be

[51:29]

doing is that I will be simply applying

[51:31]

derivative function see guys what is

[51:34]

this derivative let's consider this is

[51:36]

something like this 1 by 2m x² so if I

[51:40]

try to find out the derivative this will

[51:42]

be 2x by 2m so 2 and 2 will get cancelled so

[51:46]

similarly I'll have 1 by m and here I

[51:49]

will specifically be writing summation

[51:52]

of i equal to 1 to m, h Theta of X of I which

[51:58]

will be my

[51:59]

x - y of i² so this will be my

[52:03]

derivative with respect to Theta 0 this

[52:06]

is what I got now the second thing will

[52:08]

be that when J is equal to 1 derivative

[52:11]

of derivative of theta 1 J of theta 0

[52:15]

comma Theta

[52:16]

1 in this particular case I will be

[52:19]

having 1 by m summation of I = 1 to M

[52:23]

then again see in this particular case

[52:26]

Theta of 1 is there right Theta of 1

[52:29]

basically means what if I try to replace

[52:31]

this let's say that I'm trying to

[52:33]

replace this H Theta of X with something

[52:35]

else what is s Theta of X I know that

[52:38]

right it is Theta 0 + Theta 1 * X so

[52:42]

Theta 0 + Theta 1 * X so after this if

[52:46]

I'm trying to find out the derivative

[52:48]

with respect to Theta 0 this will

[52:50]

obviously become I will be able to get

[52:52]

this much right now with respect to the

[52:54]

second derivative what I will be writing

[52:56]

I will again be writing H Theta of X of i

[52:59]

- y of i s

[53:03]

multiplied by X of I so this square also

[53:06]

went off understand this H Theta of X is

[53:09]

what see they H Theta of X is nothing

[53:12]

but Theta 0 + Theta 1 * X so if I'm

[53:16]

trying to find out derivative with

[53:18]

respect to Theta 0 nothing will be going

[53:19]

to come okay Theta 1 of X will become a

[53:22]

constant in this particular case in this

[53:25]

case because Theta 1 of X is there so if

[53:28]

I try to find out derivative of theta 1

[53:30]

into X I'll only be getting X the square

[53:33]

will not be there it's easy right x square

[53:35]

means 2x this is the derivative of x

[53:37]

square right so that square went and the 1 by

[53:40]

2, 2 by 2 got cancelled so this will

[53:44]

be now my convergence algorithm so here

[53:47]

we have discussed about linear

[53:48]

regression oh sorry I have to remove

[53:50]

Square here also so let me write it

[53:53]

again okay repeat until convergence let

[53:57]

me write it down again repeat until

[53:59]

convergence finally your two updates

[54:03]

will be happening one is Theta 0 so here

[54:06]

it will be Theta 0

[54:09]

minus Alpha that is my learning rate 1

[54:12]

by m summation of i equal to 1 to M and this

[54:17]

will basically be H Theta of X of I

[54:21]

minus y of

[54:23]

I and similarly if I want to update

[54:26]

Theta 1 it will be - alpha 1 by m

[54:30]

summation of I = 1 to m h Theta of X of

[54:36]

I minus y of I multiplied by X of

[54:42]

I Alpha is your learning rate guys Alpha

[54:45]

is nothing but it is learning rate here

[54:48]

we have to initialize some value like

[54:51]

0.1 see what is s Theta of X Theta 0 +

[54:55]

Theta 1 into X right if I do derivative

[54:58]

of theta 1 into x what is derivative of

[55:01]

theta 1 with Theta 1 x it is nothing but

[55:03]

X so this x will come over here now
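The two update rules above can be put together into a complete gradient descent loop; here is a minimal Python sketch (my own naming, with a fixed iteration count standing in for "repeat until convergence", not code from the video):

```python
# Batch gradient descent for simple linear regression.
# Repeats the simultaneous updates
#   theta0 := theta0 - alpha * (1/m) * sum(h(x_i) - y_i)
#   theta1 := theta1 - alpha * (1/m) * sum((h(x_i) - y_i) * x_i)

def fit(xs, ys, alpha=0.1, iters=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iters):
        # Residuals h(x_i) - y_i for the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = fit([1, 2, 3], [1, 2, 3])
# On the points (1,1), (2,2), (3,3) this approaches theta0 = 0, theta1 = 1,
# the best fit line from the lecture's example.
```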

[55:07]

let's discuss about two important thing

[55:09]

one is R square and adjusted R square

[55:11]

now similarly what will happen you will

[55:14]

have lot of convex functions now see if

[55:16]

I talk about uh like if you have

[55:19]

multiple features like X1 X2 X3 x4 at

[55:23]

that point of time you will be having a

[55:25]

3D curve which looks like this

[55:28]

gradient

[55:29]

descent which will be something like this

[55:40]

it's just like coming down a

[55:44]

mountain now let's discuss about two

[55:46]

performance metrics which is important

[55:48]

in this particular case one is R

[55:52]

square and adjusted R square

[55:57]

we usually use these performance metrics

[55:59]

to verify how our model is and how good

[56:01]

our model is with respect to linear

[56:03]

regression so R square is basically

[56:05]

given R square is a performance metric

[56:07]

to check how good the specific model is

[56:10]

so here we basically have a formula

[56:12]

which is like 1 minus sum of residual

[56:16]

divided by sum of total now this is the

[56:19]

formula of R square now what is this sum of

[56:21]

residual I can basically write like this

[56:23]

summation of y i minus y i hat whole

[56:29]

square this y i hat is nothing but H

[56:31]

Theta of X just consider in this way

[56:33]

divided by summation of y of i minus y mean

[56:39]

whole square this is the

[56:42]

formula I'll try to explain you what

[56:44]

this formula definitely says okay so

[56:47]

first thing first let's consider that

[56:49]

this is my this is my problem statement

[56:51]

that I'm trying to solve suppose these

[56:53]

are my data points and if I try to

[56:55]

create the best fit

[56:57]

line This Yi hat Yi hat basically means

[57:01]

this specific point we are trying to

[57:03]

find out the difference between this

[57:05]

things difference between these things

[57:07]

let's say that these are my points I'm

[57:09]

trying to find out a difference between

[57:11]

this predicted this is my predicted the

[57:13]

point in green color are my predicted

[57:15]

points which I have denoted as y i hat

[57:18]

and always understand this is what

[57:21]

sum of residual is sum of residual is

[57:23]

nothing but difference between this

[57:24]

point to this point this point to this

[57:26]

point this point to this point this

[57:27]

point to this point and I doing the all

[57:29]

the summation of those now the next

[57:32]

point which is very much important here

[57:34]

is my X and Y what is this y i minus y bar

[57:39]

Y Bar is nothing but mean mean of Y if I

[57:43]

calculate the mean of Y then I will

[57:45]

probably get a line which looks like

[57:47]

this I'll get a line something like this

[57:49]

and then I will probably try to

[57:51]

calculate the distance between each and

[57:53]

every point and this specific point with

[57:55]

respect to the distance between this

[57:57]

point and this point the denominator

[57:59]

will definitely be high right this value

[58:02]

obviously this value will be higher than

[58:04]

this value right the reason why it will

[58:07]

be higher because the mean of this

[58:09]

particular value distance will obviously

[58:11]

be higher so this numerator this will

[58:16]

be a low value and this will be a high

[58:18]

value when I try to divide Low by

[58:23]

High Low by high then obviously this

[58:26]

entire number will become a small number

[58:28]

when this is a small number 1 minus

[58:30]

small number will be a big number so

[58:33]

this basically shows that our R square

[58:35]

has fitted properly right it has

[58:38]

basically got a very good R square now

[58:40]

tell me can I get this entire R square a

[58:43]

negative number let's say that in this

[58:44]

particular case I got 90% can I get this

[58:47]

R square as negative number there will

[58:50]

be situation guys what if I create a

[58:52]

best fit line which looks like

[58:54]

this if I create this best fit line

[58:57]

which looks like this then this value

[58:59]

will be quite High it is only possible

[59:02]

when this value will be higher

[59:05]

than higher than this

[59:08]

value okay but in the usual scenario it

[59:11]

will not happen because obviously we'll

[59:13]

try to fit a line which will be at least

[59:16]

good it's not just like pulling one line

[59:19]

somewhere we don't want to create a best

[59:21]

fit line which is worse than this right
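The R square computation being described can be sketched in Python; all numbers below are made up for illustration, including a fit worse than simply predicting the mean, which gives a negative R square:

```python
import numpy as np

def r_squared(y, y_hat):
    # R^2 = 1 - (sum of residuals) / (sum of total)
    ss_res = np.sum((y - y_hat) ** 2)        # sum((y_i - y_hat_i)^2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # sum((y_i - y_mean)^2)
    return 1 - ss_res / ss_tot

y = np.array([3.0, 5.0, 7.0, 9.0])

good_fit = np.array([2.8, 5.1, 7.2, 8.9])   # close to the actual values
bad_fit = np.array([9.0, 3.0, 9.0, 3.0])    # worse than just predicting the mean

print(r_squared(y, good_fit))   # close to 1
print(r_squared(y, bad_fit))    # negative, because ss_res > ss_tot
```

When the residual sum exceeds the total sum, the ratio is above 1 and R square drops below zero — exactly the worse-than-the-mean-line case mentioned here.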

[59:23]

worse than this so in this particular

[59:26]

scenario you'll be saying that in R

[59:28]

square now here you'll be able to see

[59:31]

one one amazing feature about R square

[59:33]

is that let's say let's say one scenario

[59:36]

suppose I have features like let's say

[59:38]

that my feature is something like uh

[59:41]

let's say I have a price of a house okay

[59:43]

so suppose this is my bedrooms how many

[59:45]

bedrooms I have and this is basically

[59:48]

the price of the house now if I if I

[59:51]

probably solve this Pro problem I'll

[59:53]

definitely get an R square value let's

[59:54]

say the R square value is 85% let's say

[59:57]

that my R square is 85% now what if if I

[60:00]

add one more feature the one more

[60:02]

feature basically says that okay if I

[60:05]

add

[60:06]

location location of the house will be

[60:09]

definitely correlated with price so

[60:12]

there is a definite chance that the R

[60:14]

square value will increase let's say

[60:16]

that R square will become 90% if I

[60:19]

probably have this two specific feature

[60:21]

and obviously it is basically increasing

[60:23]

the R square because this is also

[60:24]

correlated to price

[60:26]

and let me change the example see first

[60:29]

case I got by R square as 85% let's say

[60:32]

now as soon as I added location I got

[60:35]

90% now let's say that I added one more

[60:37]

feature which gender is going to stay

[60:40]

gender like male or female is going to

[60:42]

stay you know that gender is no way

[60:44]

correlated to price but even though I

[60:47]

add one feature there is a scenario that

[60:48]

my R square will still increase and it

[60:51]

may become

[60:52]

91% even though my feature is not that

[60:56]

important even gender is not that

[60:58]

important the R square formula Works in

[61:01]

such a way that if I keep on adding

[61:03]

features and that are not nowhere

[61:05]

correlated this is obviously nowhere

[61:07]

correlated this is not correlated with

[61:10]

price then also what it does is that it

[61:13]

is basically increasing my r² so this

[61:16]

specific thing should not happen whether

[61:19]

a male will stay or female will stay

[61:21]

that does not matter at all still when

[61:23]

you do the calculation the R square will

[61:26]

still increase so in order to not impact

[61:30]

the model because see now right now with

[61:32]

this particular model where I have got

[61:34]

90% now as soon as I see R square as 91%

[61:38]

because it is considering this

[61:40]

particular gender so this model will be

[61:43]

picked right because it is performing

[61:45]

well and is giving you a better R square

[61:46]

value but this should not happen because

[61:49]

that is not at all corelated this model

[61:51]

should have been picked so in order to

[61:53]

prevent this situation what we do we

[61:55]

basically use something called as

[61:57]

adjusted R square now what is this

[61:59]

adjusted R square and how it will work

[62:02]

I'll also show it to you very very nice

[62:04]

concept of adjusted R square so adjusted

[62:06]

R square R square

[62:08]

adjusted is given by the

[62:11]

formula is given by the formula 1 minus 1 minus

[62:16]

R squared multiplied by n minus 1 where n is the total number

[62:20]

of samples divided by n minus P minus 1 this P is

[62:24]

nothing but number of features

[62:26]

or predictors we'll also say or

[62:28]

predictors suppose initially my number

[62:31]

of predictors were in this particular

[62:33]

scenario in this scenario where I saw

[62:35]

this my number of predictors was two and

[62:37]

in this particular case my number of

[62:39]

predictor was three now if my predictor

[62:41]

is 2 I got the r squ as 90% so in this

[62:45]

particular scenario what all the

[62:46]

calculation will happen okay all the

[62:48]

calculation will happen and let's say

[62:50]

that my R square adjusted it'll be

[62:52]

little bit less it'll be little bit less

[62:55]

let's say it is 86% let's say that my R

[62:57]

square adjusted is 86% based on this

[63:00]

predictor 2 now when I use my predictor

[63:03]

3 predictor basically means number of

[63:05]

features that I'm going to use and now

[63:08]

in this one one feature is nowhere

[63:10]

related like gender but what we are

[63:12]

getting we are basically getting R

[63:14]

square increased to

[63:16]

91% now for the R square

[63:19]

adjusted this will not increase this

[63:21]

will in turn decrease right now it will

[63:24]

become 82% how it will become I'll show

[63:26]

you I've just considered some values 86 and 82

[63:29]

here you can see that there is an

[63:31]

increase here an increase is there here

[63:33]

decrease is there now how this is

[63:35]

basically happening see this P value

[63:39]

that I will be putting okay if I put a p

[63:42]

is equal to 3 obviously with n minus P minus 1

[63:46]

this will become a little bit smaller

[63:48]

number or sorry little bit uh smaller

[63:50]

number right so now in this particular

[63:53]

case if it is not correlated obviously

[63:55]

this will be high when I'm increasing

[63:56]

this so this will also be high let me

[63:58]

write the equation something like this

[64:00]

just a second so this will basically

[64:04]

be okay now why probably this value may

[64:08]

have decreased let me talk about this

[64:10]

one what is R square I hope everybody

[64:12]

understood n is the number of data

[64:17]

points p is the number of

[64:21]

predictors if p is increasing then what

[64:24]

will happen as P keeps on increasing

[64:27]

this value will keep on

[64:29]

decreasing this value will keep on

[64:31]

decreasing if this values keep on

[64:33]

decreasing this will be a bigger number

[64:35]

this will obviously be a big number a

[64:38]

big number divided by a small number

[64:40]

what it will be obviously this will be a

[64:42]

little bit bigger number 1 minus bigger

[64:45]

number we will basically get some values

[64:47]

which will be decreasing if my P value

[64:49]

is two in this particular case it will

[64:52]

be less smaller than this right at least

[64:54]

it will be greater than this this

[64:55]

particular value right when p is equal

[64:57]

to

[64:57]

3 so with the help of P obviously R

[65:01]

square is there to support you okay

[65:03]

whether it is correlated or not always

[65:05]

remember when the features are highly

[65:07]

correlated your R square value will

[65:09]

increase tremendously if it is less

[65:12]

correlated then it will be there will be

[65:14]

a small increase but there will not be a

[65:16]

very huge increase now if I consider p

[65:18]

is equal to 2 obviously when I'm trying

[65:20]

to find out this uh calculation n minus

[65:22]

P minus 1 it will obviously be greater

[65:25]

than p is equal to 3 when p is equal to

[65:28]

3 then this value will be still more

[65:30]

smaller and when we are dividing a

[65:32]

bigger number by a smaller number

[65:34]

obviously we are subtracting with one so

[65:37]

that basically means even though my R

[65:39]

square is 86 over here there may be a

[65:41]

scenario since this is nowhere

[65:43]

correlated I'm basically getting an 82%

[65:45]

because of this entire equation so I

[65:48]

hope you are understanding this this is

[65:50]

very much important to understand a very

[65:53]

very important property simple way to

[65:55]

define is that as my P value keeps on

[65:58]

increasing the number of predictors

[66:00]

keeps on increasing my R squ gets

[66:02]

adjusted whatever R square I'm getting

[66:05]

with respect to this it will always be

[66:07]

less than this particular R square there

[66:10]

was one interview question that was

[66:11]

asked to one of my students between R square

[66:14]

and adjusted R square which will always

[66:15]

be bigger definitely the student said R

[66:18]

square then he was asked to explain about

[66:20]

adjusted R square why does that specific

[66:22]

happen now coming to the agenda one is about Ridge lasso

[66:27]

regression second is assumptions of

[66:31]

linear regression the third point that

[66:34]

we are probably going to discuss about

[66:37]

is logistic regression then the fourth

[66:42]

thing that we are going to discuss about

[66:43]

is something called as confusion

[66:46]

Matrix the fifth thing that we are going

[66:49]

to consider about

[66:51]

is practicals

[66:54]

for linear Ridge lasso and logistic

[67:00]

so first topic uh that we are probably

[67:03]

going to discuss is something called as

[67:05]

Ridge and lasso

[67:10]

regression so let's understand about

[67:12]

Ridge and lasso regression if you

[67:15]

remember in our previous session what

[67:17]

all things we discussed linear

[67:21]

regression and then we had discussed

[67:23]

about the cost function we have

[67:24]

discussed about R square adjusted

[67:26]

adjusted R square sorry R square and

[67:29]

adjusted R square we have discussed

[67:30]

about it gradient descent we have

[67:32]

discussed about it it was nothing but 1

[67:34]

by 2 m summation of i = 1 to m h Theta of

[67:41]

x i minus

[67:45]

y i whole square so this is the cost function

[67:50]

that we had discussed right yesterday

[67:53]

and this cost function was able to give

[67:55]

us a

[67:57]

gradient descent with respect to the J

[67:59]

of

[68:00]

theta J of theta zero or Theta naught so I

[68:03]

can also write this as J of theta comma

[68:06]

Theta 0 comma Theta 1 now let me give

[68:09]

you a scenario let's say that I have a

[68:11]

scenario over here and I have this

[68:14]

specific scenario let's say that I just

[68:16]

have two points which looks like this

[68:20]

okay now if I have these two specific

[68:23]

points what will happen I will probably

[68:25]

try to create a best fit line the best

[68:27]

fit line will definitely pass through

[68:29]

all the points like this if I try to

[68:32]

calculate the cost function what will be

[68:34]

the value of J of theta 0 comma Theta 1

[68:38]

let's say that in this particular case

[68:39]

since it is passing through the origin

[68:41]

my Theta 0 will be zero okay so what

[68:44]

will be the value of theta 0 comma Theta

[68:47]

1 so here obviously you can see that

[68:49]

there is no difference so it will

[68:50]

obviously become zero Now understand

[68:54]

this data that you see right right this

[68:56]

data is basically called as training

[68:59]

data so this data that I have actually

[69:01]

plotted with two points these are

[69:03]

specifically called as training

[69:05]

data now what is the problem in this

[69:08]

data right now see right now exactly

[69:11]

whatever line is basically getting

[69:13]

created over here which is through the

[69:16]

uh hypothesis over here you can see that

[69:18]

it is passing through every point so

[69:19]

that is the reason your cost is zero and

[69:21]

our main aim is to basically minimize

[69:23]

the cost function that is absolutely

[69:26]

fine now in this particular case in

[69:29]

which my model this if this model is

[69:32]

getting trained initially this data is

[69:34]

basically called as training data now

[69:37]

just imagine that tomorrow new data

[69:40]

points comes so if my new data points

[69:42]

are here let's consider that I I want to

[69:45]

basically uh come up with this new data

[69:48]

point now in this particular scenario if

[69:50]

I want to predict with respect to this

[69:52]

particular Point let's say my predicted

[69:54]

point is here

[69:55]

is this the difference between the

[69:57]

predicted and the real Point quite

[70:00]

huge yes or no so this is basically

[70:04]

creating a condition which is called as

[70:07]

overfitting that basically means even

[70:11]

though my

[70:13]

model has given or trained well with the

[70:16]

training

[70:17]

data or let me write it down properly

[70:20]

over here so this condition since since

[70:23]

you can see that over here my each and

[70:25]

every point is basically passing through

[70:27]

the best fit line so because of that

[70:30]

what happens it causes something called

[70:32]

as

[70:33]

overfitting so you really need to

[70:35]

understand what is overfitting now what

[70:37]

does overfitting mean overfitting

[70:40]

basically means my model performs well

[70:44]

with training data but it fails to

[70:48]

perform well with test data now what is

[70:51]

the test data over here the test data is

[70:53]

basically this points the real test data

[70:55]

answer was this points but because the

[70:58]

my line is like this I'm actually

[71:00]

getting the predicted point over here so

[71:02]

this distance if I try to calculate it

[71:03]

is quite huge so in this scenario

[71:06]

whenever I say my model performs well

[71:08]

with training data and it fails to

[71:10]

perform well with test data then this

[71:12]

scenario we say it as overfitting so

[71:14]

this scenario when the model performs

[71:16]

well with training data I have a

[71:18]

condition which is called as low bias

[71:20]

and when it fails to perform with the

[71:22]

test data then it is basically called as

[71:25]

high High variance very important okay I

[71:28]

will make each and everyone understand

[71:31]

one by one if it is performing well with

[71:33]

the training data that is basically low

[71:35]

bias and whenever it performs well with

[71:38]

the test sorry fails to perform well

[71:40]

with the fails to perform well with the

[71:42]

test data then it is basically High

[71:44]

variance now similarly I may have

[71:46]

another scenario which is called as

[71:48]

underfitting so let's say that I have

[71:50]

something called as

[71:51]

underfitting now in this underfitting

[71:53]

what is the scenario the

[71:56]

model fails to perform it gives bad

[72:00]

accuracy I say that model always

[72:03]

remember whenever I talk about bias then

[72:05]

you can understand that it is something

[72:07]

related to the training data whenever I

[72:10]

talk about test data at that point of

[72:12]

time you talk about variance and that

[72:15]

specifically whenever you talk about

[72:17]

variance that basically means we are

[72:18]

talking about the test data so for an

[72:21]

overfitting you will basically have low

[72:23]

bias and high variance low bias with

[72:26]

respect to the training data and high

[72:29]

variance with respect to the test data

[72:31]

now if the model accuracy is bad with

[72:36]

training data and the model accuracy is

[72:39]

also bad with test data in this scenario

[72:44]

we basically say it as underfitting so

[72:47]

these are the two conditions that are

[72:50]

with respect to underfitting that

[72:51]

basically means that both for the

[72:54]

training data also the model is giving

[72:55]

bad accuracy and again for the test data

[72:59]

also it is basically having a bad

[73:01]

accuracy so in this particular scenario

[73:03]

we can definitely say two things out of

[73:05]

underfitting one is high bias and high

[73:10]

variance so this is the condition with

[73:12]

respect to underfitting very super

[73:15]

important let me just explain you once

[73:17]

again suppose let's consider I have one

[73:21]

model I have model two this is model one

[73:24]

this is model one this is model two and

[73:27]

this is model 3 okay guys so suppose

[73:30]

let's say that I have my model my

[73:33]

training accuracy is let's say

[73:36]

90% And my let's say that my test

[73:39]

accuracy is 80% now in this particular

[73:42]

case let's say that my training accuracy

[73:44]

is

[73:46]

92% and my test accuracy is 91% and

[73:51]

let's say my model three is basically

[73:53]

having training accuracy as

[73:56]

70% and my test accuracy is 65% so if I

[74:01]

take this particular case it is

[74:03]

basically overfitting if I take this

[74:06]

particular thing this basically becomes

[74:08]

my generalized model and when I talk

[74:11]

about this this is my I'll just say that

[74:15]

okay I'll also put nice color so that uh

[74:17]

you'll be able to understand this this

[74:19]

becomes our generalized model and this

[74:22]

finally becomes our underfitting right

[74:24]

under under fitting so here is my red

[74:27]

color I will just say it as underfitting

[74:29]

what are the main properties of this

[74:31]

overfitting as I said in this scenario

[74:34]

since it is performing well with the

[74:36]

training data so it will be low bias

[74:38]

High variance in this particular case it

[74:41]

will be low bias low variance and this

[74:44]

particular case it will be high bias and

[74:47]

high variance understand in this

[74:49]

terminology in this particular way

[74:51]

you'll be able to understand so why do

[74:53]

we require always a generalized model

[74:55]

because whenever our new data will

[74:57]

definitely come generalized model will

[74:59]

be able to give us very good output
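The three-model comparison above (90/80, 92/91, 70/65) can be sketched as a small heuristic; the 0.80 accuracy floor and 0.05 train-test gap are made-up thresholds used only for illustration:

```python
def diagnose(train_acc, test_acc, gap=0.05, good=0.80):
    # Heuristic thresholds (hypothetical values for illustration)
    if train_acc >= good and test_acc >= good and abs(train_acc - test_acc) <= gap:
        return "generalized (low bias, low variance)"
    if train_acc >= good and train_acc - test_acc > gap:
        return "overfitting (low bias, high variance)"
    return "underfitting (high bias, high variance)"

print(diagnose(0.90, 0.80))  # model 1: good on train, drops on test
print(diagnose(0.92, 0.91))  # model 2: good on both, small gap
print(diagnose(0.70, 0.65))  # model 3: bad on both
```

This mirrors the terminology used in the session: bias tracks training-data performance, variance tracks test-data performance.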

[75:01]

let's go back to this particular example

[75:03]

here you'll be able to see this straight

[75:05]

line the red line that I have actually

[75:07]

created is basically overfitting so that

[75:10]

whenever I probably get the new points

[75:12]

which is having this real value and the

[75:14]

predicted points here you'll be able to

[75:16]

see the difference is quite huge so

[75:18]

because of this it will definitely be a

[75:20]

scenario of overfitting where it has low

[75:24]

bias and high variance

[75:25]

so again let me go ahead and take this

[75:28]

example so this was my line which I have

[75:30]

actually drawn I had two points and when

[75:33]

I draw this line which was a best fit

[75:36]

line to which is passing through both

[75:37]

the points this scenario is basically

[75:40]

causing a overfitting problem and I've

[75:42]

also shown you my J of theta 1 will be

[75:45]

zero in this scenario since it is

[75:47]

passing exactly and the predicted point

[75:49]

is also over there now understand one

[75:52]

thing is that what can can we take out

[75:55]

from this what assumptions we can take

[75:57]

out from this definitely if I talk about

[76:00]

our cost function our cost function here

[76:02]

is nothing but 1 by 2 m summation of i = 1

[76:06]

to m h Theta of x i minus y i whole square

[76:13]

now let's consider that I am going to

[76:15]

use this H Theta X and I'm going to

[76:17]

basically write it as y hat okay let's

[76:19]

focus on this specific point so when I

[76:22]

take this I'm I'm just going to focus on

[76:24]

this particular point so here I will

[76:26]

definitely write it as y hat minus y of

[76:30]

i whole square so this is my y hat of i

[76:35]

minus y i whole square so this is

[76:38]

nothing but the difference between the

[76:40]

predicted value and the real value okay

[76:42]

this is what I'm actually trying to get

[76:44]

now in this scenario if I am adding this

[76:47]

values obviously I'm going to get the

[76:48]

value as zero now I have to make sure

[76:52]

that this value does not come to zero

[76:53]

because this is still over fitting so

[76:57]

that is where your Ridge regression will

[76:58]

come into picture Ridge and lasso will

[77:01]

come into picture now when I use Ridge

[77:03]

and lasso suppose if I use Ridge now in

[77:06]

Ridge what we say this this is also

[77:09]

called as L2

[77:11]

regularization now L2 regularization

[77:14]

what it does is that it basically adds a

[77:17]

unique

[77:18]

parameter it adds one more simple value

[77:21]

which is like Lambda multiplied by slope

[77:25]

Square now what is this slope whatever

[77:28]

slope of this particular line it is we

[77:30]

are just going to square it off now

[77:33]

suppose if I take my equation which

[77:34]

looks like this H Theta of X is equal to

[77:39]

Theta 0 + Theta 1 x now in this

[77:41]

particular case my Theta 0 was zero so

[77:44]

my H Theta of X is nothing but Theta 1

[77:47]

what is Theta 1 this is specifically

[77:49]

called as slope and I am basically

[77:52]

taking this Theta 1 I'm actually making

[77:54]

it as a square so always

[77:56]

understand I don't want to make this as

[77:57]

zero because if it becomes zero it may

[78:00]

lead to overfitting condition now what

[78:03]

will happen if I add this particular

[78:05]

equation if I add this particular

[78:06]

equation this will obviously come as

[78:08]

zero let's consider my Lambda value over

[78:12]

here my Lambda value is one I'll talk

[78:15]

about how do you set up Lambda value

[78:17]

okay let's consider that I'm

[78:18]

initializing it to one let's say my

[78:21]

Lambda value is 1 now what I will do is

[78:24]

that this Lambda value is 1 Let's

[78:26]

consider our slope value initially is

[78:28]

two and because of this two I got this

[78:30]

best fit line I'm just going to consider

[78:32]

it so if I do the total sum over here if

[78:35]

I'm just considering this this value is

[78:37]

three now the cost function will not

[78:40]

stop over here because still it has to

[78:42]

minimize it has to reduce this three

[78:45]

value so what it will do it will again

[78:47]

change the Theta 1 value and let's say

[78:49]

that my Theta van value has changed now

[78:52]

it got another best fit line which looks

[78:54]

something like like this this is my next

[78:56]

best fit line I'll talk about Lambda

[78:57]

Lambda is a hyper parameter guys what

[79:00]

exactly is Lambda I'll just talk about

[79:01]

it now when I basically change this line

[79:04]

now see why I'm getting this line let's

[79:06]

consider I have changed my Theta 1 value

[79:08]

since we need to minimize now when we

[79:11]

need to minimize what it will do we'll

[79:12]

again calculate the slope of this

[79:14]

particular line and then we will try to

[79:16]

create a new line when we sorry it is

[79:18]

two two not three just a second guys 0 +

[79:24]

1 multiplied by 2 squared which is nothing but

[79:27]

4 so now my cost function will not stop

[79:31]

over here so we are going to still

[79:33]

reduce this now in order to reduce this

[79:36]

again Theta 1 value will get changed and

[79:39]

then we will get a next best fit line

[79:40]

for this point now what will happen in

[79:43]

this scenario once we have this best fit

[79:45]

line we will definitely get a kind of

[79:47]

small difference so now if I go ahead

[79:50]

and consider the new equation my y hat I

[79:54]

minus y

[79:55]

i² + Lambda of slope squar this value

[80:00]

will be a small value now because I have

[80:03]

some difference and then plus again 1

[80:06]

multiplied by now understand whether the

[80:09]

slope will increase in this particular

[80:11]

case or whether it will decrease in this

[80:13]

particular case there will be some slope

[80:15]

value let's say that I have got some

[80:17]

slope of this particular line in this

[80:19]

particular scenario again your slope

[80:21]

will definitely decrease so let's say in

[80:23]

the case of two initially it was now it

[80:25]

is basically

[80:27]

1.36 whole square now this small value

[80:32]

plus 1 multiplied by 1.3 squared or let me consider that

[80:37]

my slope is now one simple value that is

[80:40]

1.5 so if I get this it is 2.25 2.25 plus

[80:44]

small value it will be less than three

[80:46]

only right it will obviously be less

[80:48]

than three or equal to 3 but understand

[80:50]

what is happening the value is getting

[80:52]

reduced from 4 to 3 so this is the

[80:55]

importance of Ridge now what will happen

[80:57]

is that you will try to get a

[80:59]

generalized model which has low bias and

[81:02]

low variance instead of this overfitting

[81:05]

condition you know why specifically we

[81:08]

are adding Ridge L2 regularization it is

[81:11]

basically to prevent

[81:14]

overfitting because here you are not

[81:16]

stopping here you are trying to reduce

[81:18]

it unless and until you get a line you

[81:21]

get a line which will be able to handle

[81:24]

the which will be able to handle as a uh

[81:27]

generalized model now here you can see

[81:29]

now if I have my new points like how I

[81:31]

drew over here now the distance will be

[81:33]

less so now you'll be able to see that

[81:36]

it will be able to create a generalized

[81:38]

model guys this will be a small value

[81:40]

only see initially when we have this

[81:42]

line obviously we have zero if we try to

[81:45]

slightly move here and there so here

[81:48]

you'll be able to see that it will just

[81:50]

a slight movement but what this movement

[81:52]

is basically specifying it is specifying

[81:55]

that the slope should not be steep if we

[81:59]

probably have a steep slope it obviously

[82:02]

leads to most of the time overfitting

[82:04]

condition it should not be steep it

[82:06]

should be very very it should be less

[82:08]

steeper but it should actually help you

[82:10]

to create a generalized model so you

[82:13]

will be seeing that after playing for

[82:14]

some amount of time this value will not

[82:18]

reduce after some point of time it'll

[82:19]

get almost it'll be a minimal value

[82:22]

it'll be a smaller value and for this

[82:24]

also you have to specify iterations how

[82:26]

many times you probably have to train

[82:29]

them now this iterations is also a

[82:33]

hyperparameter based on number of

[82:35]

iterations you will probably see your R

[82:38]

square or adjusted R square over here so

[82:41]

this iterations based on the number of

[82:43]

iterations it will never become zero

[82:45]

guys understand because zero it is not

[82:48]

possible if it becomes zero trust me it

[82:50]

is an overfitting model you cannot get

[82:52]

that is something zero now what is

[82:55]

Lambda coming to this Lambda this Lambda

[82:57]

is a

[82:58]

hyperparameter this is basically to

[83:01]

check how fast you want to lessen the

[83:04]

steepness or how fast you want to make a

[83:06]

steepness grow higher right and this

[83:08]

Lambda will also be selected by using

[83:11]

hyper parameter and this also I'll show

[83:13]

you today in Practical what do you mean

[83:15]

by iterations iteration basically means

[83:17]

how many time I want to change the Theta

[83:18]

1 value how many times you want to

[83:20]

change the Theta value that is the

[83:22]

convergence algorithm right
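The Ridge (L2) cost walked through above, with the two-point example, can be sketched as follows; Theta 0 is taken as 0 (the line passes through the origin, as in the video) and the numbers match the ones spoken — slope 2 gives cost 4, a flatter slope of 1.5 gives a lower penalized cost:

```python
import numpy as np

def ridge_cost(theta1, x, y, lam):
    # J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2) + lambda * theta1^2
    # (theta0 assumed 0, i.e. the line passes through the origin)
    m = len(x)
    y_hat = theta1 * x
    return np.sum((y_hat - y) ** 2) / (2 * m) + lam * theta1 ** 2

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])               # two training points lying on y = 2x

print(ridge_cost(2.0, x, y, lam=1.0))  # perfect fit: residual 0, but penalty 1 * 2^2 = 4
print(ridge_cost(1.5, x, y, lam=1.0))  # flatter slope: small residual + penalty 2.25 = 2.5625
```

Because the penalty term never lets the total cost reach zero, gradient descent settles on a slightly flatter, less steep line — the generalized model rather than the overfitted one.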

[83:25]

convergence algorithm over here L2

[83:27]

regularization or Ridge is basically

[83:30]

used in such a way that you should never

[83:33]

overfit why we assume Theta 0 is equal

[83:35]

to 0 because I'm considering that it

[83:37]

passes through a origin right origin

[83:40]

over here Lambda is a hyper

[83:43]

parameter steep basically means how

[83:46]

steep the line is if I have this line

[83:49]

this line is quite steep if I have this

[83:51]

line This is less steep now if I go to

[83:54]

the next regularization which is called

[83:56]

as lasso regression this is

[84:00]

also called as L1

[84:02]

regularization now here the formula will

[84:05]

be changing little bit here you will be

[84:07]

having y hat of i minus y of i whole Square

[84:11]

here you'll be adding a parameter Lambda

[84:14]

but understand here you'll not be adding

[84:16]

slope Square no here you'll be adding

[84:20]

mode of slope here you'll be adding mode

[84:23]

of slope and this mode of slope will

[84:26]

work is that it will actually help you

[84:29]

to do feature selection now you may be

[84:31]

thinking how feature selection happens

[84:33]

let's consider a equation over here

[84:35]

let's say that I have many many features

[84:37]

I have many many many features okay so

[84:40]

my H Theta of X which I'm indicating

[84:42]

here as y hat let's say that I'm I'm

[84:45]

writing this equation apart from

[84:47]

preventing for overfitting it will also

[84:49]

help you to do feature selection here

[84:51]

let me just show you over here with an

[84:53]

example this H Theta of X which I'm

[84:56]

probably writing as y hat will basically

[84:59]

be indicated by something over here

[85:01]

you'll be able to see that it is nothing

[85:03]

but let's say that I have multiple

[85:05]

features like this now in this

[85:07]

particular features obviously there are

[85:09]

so many coefficients over here so many

[85:11]

slopes over here now mod of slope will

[85:13]

be what it will be nothing but mod of X1

[85:16]

plus X2 plus X3 plus X4 plus X5 like

[85:20]

this up to xn now in this particular

[85:23]

case how it is basically helping you to

[85:25]

sorry not X1 sorry just a second this

[85:29]

mod of I have taken the data point this

[85:31]

is not data points this should be your

[85:34]

mod of theta 0 + Theta 1 + Theta 2 +

[85:38]

theta 3 + Theta 4 + Theta 5 like this up

[85:42]

to Theta n so here you'll be able to see

[85:45]

that this is how I will basically uh

[85:47]

I'll basically be calculating the slope

[85:50]

now as we go ahead guys whichever

[85:52]

features are probably not playing an

[85:55]

amazing role the Theta value the

[85:57]

coefficient value the slope value will

[85:59]

be very very small it is just like that

[86:01]

entire feature is neglected that entire

[86:04]

feature is neglected now in this

[86:06]

particular case we were doing squaring

[86:08]

because of the squaring that value was

[86:10]

also increasing but here because of the

[86:11]

mode that value will not increase

[86:14]

instead it will be a condition wherein

[86:16]

we are basically neglecting those

[86:18]

features that are not at all important

[86:21]

in this specific problem statement so

[86:23]

with the help of L1 regularization that

[86:26]

is lasso you are able to do two

[86:28]

important things one is preventing

[86:31]

overfitting and the second case is that

[86:33]

if you have many features and many of

[86:36]

the features are not that important okay

[86:39]

in basically finding out your slope or

[86:42]

your line or the best fit line in that

[86:44]

particular case it will also help you to

[86:46]

perform feature selection so this is the

[86:48]

importance of the entire what is the

[86:51]

importance of this this is the

[86:52]

importance of the uh Ridge and the lasso

[86:56]

regression that we are doing here I'm

[86:57]

just going to write L1

[86:59]

regularization and obviously we have

[87:01]

discussed about L2 regularization also
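As a quick sketch of this feature-selection effect — assuming scikit-learn is available; the data and `alpha` values below are made up for illustration, and `alpha` plays the role of Lambda:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# only the first two features drive the target; the other three are noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients, rarely to exactly zero
lasso = Lasso(alpha=0.5).fit(X, y)  # L1: can push unimportant coefficients to exactly zero

print("ridge:", ridge.coef_.round(3))
print("lasso:", lasso.coef_.round(3))
```

With the L1 penalty the three noise coefficients typically come out exactly zero — the feature-selection behaviour described above — while the L2 penalty only shrinks them towards zero.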

[87:04]

now you have probably understood Lambda

[87:06]

is one hyperparameter okay which we will

[87:09]

specifically be using okay and based on

[87:12]

this Lambda this will be found out

[87:13]

through cross

[87:14]

validation cross validation is a

[87:16]

technique wherein we will try to

[87:19]

probably train our model and try to find

[87:21]

out the specific things okay what should

[87:24]

be the exact value and there also we

[87:26]

play with multiple values in short what

[87:28]

we are doing we just trying to reduce

[87:29]

the cost function in such a way that uh

[87:32]

it will definitely never become zero but

[87:34]

it will basically reduce based on the

[87:37]

Lambda and the slope value in most of

[87:39]

the scenario if you ask me we should

[87:41]

definitely try both the regularization

[87:44]

and see that wherever the performance

[87:46]

metrics are good we should use that what

[87:48]

is cross validation basically means I

[87:50]

will try to use different different

[87:52]

Lambda value and basically actually use it
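A minimal sketch of picking Lambda by cross validation, assuming scikit-learn (`alpha` is scikit-learn's name for Lambda; the data and candidate values are made up):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

# try several candidate Lambda values; cross validation keeps the best-scoring one
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

print("chosen ridge alpha:", ridge.alpha_)
print("chosen lasso alpha:", lasso.alpha_)
```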

[87:55]

so in short let me write it down again

[87:58]

for Ridge regression which is an L2 Norm

[88:03]

here I'm simply writing my cost function

[88:05]

in this particular case will be little

[88:08]

bit different here I can definitely

[88:10]

write my cost function as H Theta X of i

[88:14]

minus y of i squared plus Lambda multiplied by slope

[88:21]

Square what is the purpose of this the

[88:24]

purpose is very simple here we are

[88:27]

preventing overfitting this was with

[88:29]

respect to the Ridge regression that is

[88:31]

L2 norm now if I go ahead and discuss

[88:33]

about the next one which is called as

[88:34]

lasso regression which is also called as

[88:38]

L1 regularization in the case of lasso

[88:41]

regression your cost function will be H

[88:44]

Theta of X of

[88:47]

i minus y of

[88:49]

i squared plus Lambda multiplied by mod of slope so

[88:55]

here you have this specific thing and

[88:57]

what is the purpose the purpose are two

[89:00]

one is prevent overfitting and the

[89:04]

second one is something called as

[89:05]

feature selection so these two are the

[89:08]

outcomes of the entire thing see with

[89:10]

respect to this lasso right you have

[89:12]

slopes slopes here you'll be having

[89:15]

Theta 0 plus Theta 1 plus Theta 2 plus

[89:17]

theta 3 like this up to Theta n now when

[89:20]

you'll have this many number of thetas

[89:22]

when you have many number of features

[89:24]

and when you have many number of

[89:25]

features that basically means you'll

[89:26]

have multiple slopes right those

[89:28]

features that are not performing well or

[89:30]

that has no contribution in finding out

[89:32]

your output that coefficient value will

[89:35]

be almost nil right it will be very much

[89:37]

near to zero in short you neglecting

[89:40]

that value by using modulus you're not

[89:42]

squaring them up you're not increasing

[89:44]

those values now I will continue and uh

[89:47]

probably I will also discuss about the

[89:49]

assumptions of linear regressions so

[89:52]

what are the assumptions of linear

[89:54]

regression in this particular scenario

[89:56]

so assumption is that number one point

[90:00]

linear regression if our features are in

[90:03]

normal or Gaussian

[90:06]

distribution if our features follows

[90:09]

this particular distribution it is

[90:11]

obviously good our model will get

[90:13]

trained well so there is one concept

[90:16]

which is called as feature

[90:18]

transformation now in feature

[90:20]

transformation always understand what

[90:22]

will happen if a model does not

[90:24]

follow a Gaussian distribution then we apply

[90:26]

some kind of mathematical equation onto

[90:28]

the data and try to convert them into

[90:30]

normal or Gaussian distribution the second

[90:33]

assumption that I would definitely like

[90:34]

to make is that standard scaler or

[90:37]

standardization standardization is

[90:40]

nothing but it is a kind of scaling your

[90:43]

data by using Z score I hope everybody

[90:46]

remembers Z score this is what we

[90:48]

basically apply there your mean is equal

[90:50]

to zero and standard deviation equal to

[90:52]

1 see guys wherever you have gradient

[90:54]

descent involved it is good to basically

[90:57]

do

[90:58]

standardization because if our initial

[91:01]

point is a small Point somewhere here

[91:03]

then to reach the global Minima or

[91:05]

training will happen quickly otherwise

[91:07]

what will happen if your values are

[91:09]

quite huge then your graph may be very

[91:11]

big and the point can come over any over

[91:13]

there and the third point is that this

[91:16]

linear regression works with respect to

[91:19]

linearity it works if your data is

[91:22]

linearly separable

[91:24]

I'll not say linearly separable but this

[91:26]

linearity will come into picture if your

[91:28]

data is too much linear it will

[91:30]

obviously be able to give a very good

[91:31]

answer like logistic regression also

[91:34]

which we are going to discuss today this

[91:35]

also has the same property now you may

[91:38]

be asking is it compulsory to do

[91:40]

standardization guys if you want to

[91:42]

increase the training time of your model

[91:45]

or if you want to optimize your model I

[91:47]

would suggest go ahead and do

[91:48]

standardization now coming to the fourth

[91:50]

Point here you really need to check

[91:52]

about multicollinearity
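The first two assumptions above — pull skewed features towards a Gaussian shape, then standardize with the Z score — can be sketched in plain NumPy (the skewed feature below is synthetic, made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# a right-skewed feature, far from a Gaussian shape
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# feature transformation: a log transform often pulls a skewed
# feature towards a normal / Gaussian-like distribution
x_t = np.log(x)

# standardization (Z score): mean becomes 0, standard deviation becomes 1
x_std = (x_t - x_t.mean()) / x_t.std()

print(round(x_std.mean(), 6), round(x_std.std(), 6))
```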

[91:54]

this is also one kind of check we

[91:56]

basically do what is

[91:58]

multicollinearity let's say I have X1 I have

[92:00]

X2 and this is my output feature I have

[92:03]

let's say X3 also now let's say that if

[92:05]

I try to see the colinearity of this two

[92:08]

feature how how correlated these two

[92:10]

feature are let's say that these two

[92:12]

feature are 95% correlated is it is it a

[92:16]

wise decision to use both the features

[92:18]

and let's say that let's let's say that

[92:20]

these two features are 95% correlated

[92:23]

but it is highly correlated with Y is it

[92:25]

necessary that we should use both the

[92:27]

feature in this particular scenario the

[92:29]

answer should be no we can drop this

[92:32]

particular feature okay we can drop this

[92:34]

particular feature any one of the

[92:36]

feature we can definitely drop it and

[92:38]

based on that I can just use one single

[92:40]

feature and basically we do the

[92:42]

prediction there is also a concept which

[92:44]

is called as variance inflation factor

[92:46]

I will try to make a dedicated video

[92:48]

about this multicollinearity is also solved with

[92:51]

the help of variance inflation factor
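A rough NumPy-only sketch of that check — a variance inflation factor computed by regressing each feature on the others (the `vif` helper and the data are made up for illustration; libraries like statsmodels ship a ready-made version):

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: regress it on the
    other columns and return 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    pred = A @ coef
    ss_res = np.sum((X[:, j] - pred) ** 2)
    ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # highly correlated with x1
x3 = rng.normal(size=500)                   # independent feature
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])
```

A large VIF (a common rule of thumb is above 5 or 10) flags a feature like x1 or x2 as a candidate to drop, exactly as in the 95%-correlated example above.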

[92:53]

one more term is there homoscedasticity so that

[92:56]

kind of terminologies also we use one

[92:58]

more condition in this but if you almost

[93:00]

satisfied with this assumptions you will

[93:02]

definitely be able to outperform in

[93:03]

linear regression so you have got an

[93:06]

idea of the assumptions you have also

[93:07]

got an idea of multiple things okay now

[93:10]

let's go towards something called as

[93:12]

logistic regression now logistic

[93:14]

regression what logistic regression is

[93:16]

the first type of algorithm that we are

[93:18]

going to learn in classification let's

[93:20]

say that in classification I have one

[93:22]

example you know so suppose I have say

[93:24]

number of hours study hours and number

[93:28]

of play hours based on this I want to

[93:31]

predict whether a child is passing or

[93:33]

failing suppose these two are my

[93:35]

features I want to predict whether it is

[93:36]

pass or fail so here you'll be able to

[93:39]

see that I have some fixed number of

[93:40]

categories specifically in this

[93:42]

particular scenario I have two

[93:43]

categories binary logistic regression

[93:46]

works very well with binary

[93:48]

classification now the uh question comes

[93:50]

that can we solve multiclass

[93:52]

classification using logistic the answer

[93:54]

is simply yes you can definitely do it
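As a sketch of that study-hours pass/fail setup, assuming scikit-learn (the hours and labels below are made up, with the boundary near 3 hours as in the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hours studied -> 0 = fail, 1 = pass, boundary roughly at 3 hours
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5],
                  [3.5], [4.0], [5.0], [6.0], [9.0]])
label = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, label)

# outputs are probabilities squashed between 0 and 1, so the
# 9-hour point does not drag a regression line's threshold around
print(clf.predict([[1.0], [5.0]]))
print(clf.predict_proba([[5.0]])[0, 1])
```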

[93:57]

so let's go ahead and let's try to

[93:58]

discuss about uh logistic regression now

[94:02]

what is the main purpose of the logistic

[94:03]

regression first of all let's let's uh

[94:06]

understand one scenario okay suppose I

[94:08]

have a feature which basically says um

[94:13]

number of study hours and this is like 1

[94:17]

2 3 4 5 6 7 and let's say that I have

[94:24]

pass this point is basically pass and

[94:28]

this point is basically

[94:29]

fail so I have this two conditions these

[94:32]

are my outcomes now what I'll do I will

[94:34]

just try to make some data points let's

[94:36]

say that if I study Less Than 3 hours I

[94:40]

will probably be fail if I study more

[94:43]

than 3 hours then probably I will pass

[94:47]

this I'll make it as fail and this I

[94:49]

will make it as pass so I will be having

[94:52]

points over here this 1 2 3 let's say

[94:57]

that this is my training data set now

[95:00]

the first question says that okay Chris

[95:02]

fine you have some data over here

[95:05]

whenever it is less than three you are

[95:06]

basically the person is failing if it is

[95:08]

greater than five greater than three it

[95:12]

is basically showing data points points

[95:14]

with respect to pass now can't we solve

[95:17]

this problem first with linear

[95:18]

regression now with the help of linear

[95:21]

regression here the first point will be

[95:23]

that yes I can definitely draw a best

[95:26]

fit line my best fit line in this

[95:28]

particular scenario may be something

[95:30]

like this it may it may look something

[95:32]

like this so here fail is nothing but

[95:35]

zero pass is one the middle point is

[95:38]

basically 0.5 so obviously with the help

[95:41]

of linear

[95:42]

regression I'm able to create this best

[95:44]

fit line and I'll put a scenario that

[95:47]

whenever the value is less

[95:50]

than 0.5 whenever the value is less than

[95:52]

0.5 whenever the output is less than 0.5

[95:55]

let's say that new data point is this

[95:57]

and based on this I'll try to do the

[95:58]

prediction I'm actually able to get the

[96:00]

output over here now when I'm getting

[96:02]

the output over here this basically is

[96:04]

0.25 now in this particular scenario

[96:06]

obviously I'm able to say that yes the

[96:08]

person I'll write a condition over here

[96:11]

saying that if my H Theta of x value is

[96:15]

less than 0.5 then my output should be

[96:20]

zero let's say less than 0.5 I'll say

[96:22]

not less than or equal to just less than 0.5

[96:25]

then my output will be zero right so in

[96:28]

this particular case Zero basically

[96:29]

means fail similarly I'll have a

[96:32]

scenario where I'll say that when if my

[96:35]

H Theta of X is greater than or equal

[96:36]

to 0.5 then this will basically be one

[96:39]

which is nothing but pass so this two

[96:41]

condition I can definitely write over

[96:42]

here this is my center point so that any

[96:45]

point that will probably come over here

[96:47]

let's say that this point is coming over

[96:49]

here right let's say new data point is

[96:51]

somewhere coming over here with this red

[96:53]

point

[96:54]

now what I'll do I'll basically draw a

[96:56]

straight line it will come over here I

[96:58]

will just extend this line

[97:00]

long I will extend this line over here

[97:04]

and I will extend this line over here

[97:07]

and here you can see that based on this

[97:09]

I'm actually getting this particular

[97:11]

prediction which is greater than 0.5 so

[97:13]

I will say that okay the person has

[97:15]

passed obviously this is fine this is

[97:18]

obviously working better this is

[97:20]

obviously working better so what what is

[97:22]

the problem why we are not using linear

[97:24]

regression okay in order to solve this

[97:26]

particular problem why you are

[97:27]

specifically having logistic regression

[97:29]

the answer is very much simple guys the

[97:31]

answer is that whenever let's say that

[97:34]

if I have an outlier which looks

[97:35]

something like this suppose I have an

[97:37]

outlier which comes like this over here

[97:40]

what is this value let's say that this

[97:41]

value is nothing but 7 8 9 10 let's say

[97:46]

that the number of study hours and I'm

[97:48]

studying for nine it is obviously pass

[97:51]

now in this particular scenario when I

[97:52]

have an outlier this entire line will

[97:54]

change now I will probably get my line

[97:57]

which looks something like this okay my

[97:59]

line will basically move something like

[98:01]

this it will now get moved something

[98:03]

like this now when it gets moves

[98:04]

completely like this now for even five

[98:08]

or even at any point that I am actually

[98:10]

predicting let's say that at this

[98:11]

particular point if I try to find out

[98:14]

it'll be showing less than 0.5 so

[98:16]

here this particular value or answer

[98:19]

will be wrong right because if we are

[98:21]

studying more than 5 hours obviously

[98:24]

based on the previous line the person

[98:26]

had to pass but in this scenario it is

[98:28]

failing it is coming less than 0.5 but

[98:31]

the real value for this is basically

[98:33]

passed so I hope you are understanding

[98:36]

because of the outlier the entire line

[98:37]

is getting changed so how do we fix this

[98:40]

particular problem now in this two

[98:42]

scenarios are there first of all

[98:44]

obviously because of just an outlier

[98:46]

your entire line is getting shifted here

[98:48]

and there the second point is that over

[98:50]

here sometimes you're also getting

[98:52]

greater than one and you're also getting

[98:54]

less than zero suppose if I try to

[98:55]

calculate for this particular point if I

[98:58]

project it in behind I'll be getting

[99:00]

some negative value so we have to squash

[99:02]

this function if I squash this function

[99:04]

then it'll become a plain line right how

[99:07]

do we squash it and for this we use

[99:09]

something called as sigmoid activation

[99:11]

function or sigmoid function if somebody

[99:14]

ask you why don't you use linear

[99:17]

regession in order to solve this

[99:19]

classification problem then your answer

[99:21]

should be very much simple you should

[99:23]

say this to specific points so we will

[99:26]

try to go ahead and solve some linear

[99:27]

regression now with the help of cost

[99:29]

function everything as such and we'll

[99:31]

try to understand how the cost function

[99:34]

will look for logistic regression second

[99:36]

reason I told you right it is greater

[99:38]

than one over here the line is going

[99:40]

greater than one right greater than

[99:42]

one I have only zero and one and it is

[99:45]

becoming greater than one but I have

[99:47]

already told that our maximum and

[99:49]

minimum value are 1 and zero so I hope

[99:51]

you have understood why linear regression

[99:53]

cannot be used okay I showed you all the

[99:56]

scenarios why linear regression should

[99:58]

not be used now we'll continue and

[100:00]

probably discuss about the other things

[100:02]

over here and uh we will now try to

[100:05]

understand fine what exactly logistic

[100:07]

regression is all about and how the

[100:09]

decision boundaries basically created

[100:11]

now we'll go ahead and discuss about

[100:12]

that specific thing so let's go ahead

[100:15]

our values should be always between 0 to

[100:17]

one over here in this particular case

[100:19]

because it is a binary classification

[100:21]

problem only this should be the answer

[100:23]

so let's go ahead and let's define our

[100:25]

decision boundary so my decision

[100:26]

boundary decision boundary in the case

[100:29]

of logistic regression first of all as

[100:31]

usual in logistic regression we defined

[100:34]

our hypothesis okay guys first of all

[100:36]

let's see if I'm writing my my h of

[100:40]

theta my H Theta of X as Theta 0 + Theta

[100:45]

1 into X1 + Theta 2 into X2 like this

[100:49]

up to + Theta n into Xn

[100:53]

now in this scenario can I write this

[100:55]

entire equation as Theta transpose X

[100:59]

obviously I can definitely write this

[101:01]

way right and this is what is the

[101:02]

notation that you will probably seeing

[101:04]

in many places so with respect to the

[101:06]

decision boundary of logistic regression

[101:10]

our Theta transpose X see like this we can write I'm

[101:12]

saying okay but since we have to

[101:14]

consider two things one is squashing the

[101:17]

line okay how that squashing will

[101:19]

basically happen see if I have this if I

[101:22]

have this line

[101:24]

we saw in the above right if I have this

[101:26]

line suppose I have some data points

[101:28]

over here and I have some data points

[101:30]

over here if I want to create the best

[101:32]

fit line how will I create I will

[101:33]

basically create like this but I have to

[101:35]

also do two things one is squash over

[101:37]

here and squash over here right squash

[101:40]

over here and squash over here now in

[101:42]

order to squash I'm saying squash squash

[101:46]

means flattening the line at the ends

[101:48]

okay now in order to do this I use a

[101:51]

function which is called as sigmoid

[101:52]

activation function

[101:54]

that basically means what happens

[101:56]

obviously you know this line is

[101:57]

basically denoted by H Theta of x equal

[102:01]

to how do you denote this straight line

[102:04]

let me write it down nicely for you so

[102:06]

how do you denote this straight line the

[102:08]

straight line is obviously denoted by

[102:11]

Theta 0 + Theta 1 * X1 let's say now on

[102:15]

top of this on top of this I have to

[102:18]

apply something on top of this value I

[102:21]

have to apply something so that I can

[102:23]

make this line straight instead of just

[102:26]

expanding in this way so my hypothesis

[102:29]

will basically be now G of G is

[102:32]

basically a function on Theta 0 and

[102:34]

Theta 1 * X1 so here I'm trying to

[102:38]

basically what I'm trying to do I will

[102:40]

apply a mathematical formula on top of

[102:42]

this linear regression to squash this

[102:45]

line now let's go ahead and let's try to

[102:47]

find out what is this G okay what is

[102:50]

this G I will say let Z equal to Theta 0

[102:54]

+ Theta 1 * X I'm just initializing this

[102:58]

now my H Theta of X is nothing but G of

[103:00]

Z now we need to understand what is this

[103:03]

z g of Z and how do we basically specify

[103:06]

what is the G function so my G function

[103:08]

is nothing but H Theta of x equal to 1

[103:11]

by 1 + e ^ of minus Z which in short if

[103:15]

I try to substitute Zed now it is 1 by 1 +

[103:19]

e ^ of minus Theta 0 + Theta 1 * X so

[103:24]

this is what is my H Theta of X which is

[103:26]

my hypothesis and this obviously works

[103:29]

well because it is being able to squash

[103:32]

the function so this is basically my

[103:34]

hypothesis which I am definitely trying

[103:36]

to use it and this function that you are

[103:39]

actually able to see is called as

[103:43]

sigmoid or logistic function now you

[103:47]

need to understand what does this

[103:48]

sigmoid function look like in graph in

[103:50]

graph it looks something like this so

[103:52]

this this is my Zed value and this is my

[103:56]

G of Z this is my 0.5 your sigmoid

[104:00]

function will have this curve so this is

[104:03]

your one this is zero your value when

[104:07]

now from this we can make a lot of

[104:08]

assumptions what are the assumptions

[104:10]

that we can basically make your G of Zed

[104:15]

your G of Zed is greater than or equal

[104:18]

to

[104:18]

0.5 G of Zed is obviously greater than or equal

[104:21]

to 0.5 when your Zed value is greater

[104:24]

than or equal to zero this is the major

[104:27]

assumptions that we can basically make

[104:29]

that is whenever your G of Z is greater

[104:32]

than your G of Z is greater than or

[104:35]

equal to 0.5 whenever your Zed is

[104:38]

greater than or equal to zero so obviously

[104:40]

whenever your Zed value is greater than

[104:42]

zero it is greater than 0.5 if your Zed

[104:44]

value is less than zero what it will

[104:46]

become it will basically be less than

[104:47]

0.5 so you can write that specific

[104:50]

condition also you want so this is the

[104:52]

most important condition

[104:53]

over here why it is called as logistic

[104:55]

regression see guys with the help of

[104:56]

regression you creating this straight

[104:57]

line and with the help of the concept of

[104:59]

sigmoid you are able to squash it so they

[105:01]

have probably combined that name and uh

[105:04]

basically have written in this way will

[105:05]

squashing of the best fit line help to

[105:07]

overcome the outlier issues yes

[105:09]

obviously it'll be able to help you so
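A quick numeric sketch of that squashing, in NumPy (the theta values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta0, theta1 = -3.0, 1.0            # a made-up line theta0 + theta1 * x
x = np.array([-10.0, 0.0, 3.0, 10.0])
z = theta0 + theta1 * x               # the linear part, Theta transpose X
h = sigmoid(z)                        # hypothesis h_theta(x) = g(z)

print(h)             # every value lies strictly between 0 and 1
print(sigmoid(0.0))  # exactly 0.5 at z = 0
```

This also shows the decision rule from above: g(z) is at least 0.5 exactly when z is at least zero.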

[105:10]

let's go ahead and let's try to solve

[105:12]

the problem statement now usually let's

[105:14]

consider my training set let's consider

[105:17]

my training set suppose I have some

[105:19]

training points like this x of 1 comma y

[105:22]

of 1

[105:24]

let's say x of 2 comma y of 2 okay X of 3 comma y

[105:28]

of 3 like this I have lot of training

[105:30]

points and finally X of n comma y of n

[105:33]

let's say that this is my training data

[105:35]

so here uh my y y will belong to what

[105:41]

zero or 1 because I will only have two

[105:43]

outputs since we are solving a binary

[105:45]

classification problem here is my

[105:47]

training set with two outputs and I hope

[105:50]

everybody knows about G of Z

[105:53]

it is nothing but 1 by 1 + e ^ of minus Z

[105:57]

here your Z is nothing but Theta 0 +

[105:59]

Theta 1 * X1 so this is your Theta 0 now

[106:04]

what we have to do we have to select

[106:06]

this Theta now in this particular case

[106:08]

let's consider that my Theta 0 is 0

[106:10]

because it is passing through the origin

[106:13]

just for time pass sake suppose my Z is

[106:15]

Theta 1 into X so now I need to change

[106:19]

what is my parameter my parameter is

[106:21]

Theta 1

[106:23]

I have to change parameter Theta 1 in

[106:25]

such a way that I get the best fit line

[106:28]

and along that I apply this sigmoid

[106:30]

activation function now let's go ahead

[106:33]

and let's first of all Define our cost

[106:36]

function because for this we definitely

[106:38]

require our cost

[106:39]

function now everything will be same

[106:42]

obviously you know the cost function of

[106:44]

linear regression because the first best

[106:47]

fit line that you are probably creating

[106:48]

is with the help of linear

[106:50]

regression now in this particular case

[106:52]

in the case of linear regression so here

[106:55]

you can basically write J J of theta 1

[106:57]

is nothing but 1 by m summation of I = 1

[107:02]

to m 1 by 2 and here you have H Theta of x

[107:08]

minus y of I whole square so this is

[107:13]

your entire thing of if you remember

[107:15]

linear regression whatever things we

[107:17]

have discussed yesterday okay so this is

[107:19]

the cost function let's consider that

[107:22]

for linear regression for this is for

[107:24]

the linear regression now for the

[107:25]

logistic regression what will happen for

[107:27]

your logistic regression I will take the

[107:28]

same cost function H Theta of X now you

[107:31]

know what is H Theta of X it is nothing

[107:33]

but 1 by 1 + e ^ of minus Theta 0 + Theta

[107:37]

sorry Theta 1 multiplied by X right this

[107:40]

is my with respect to logistic

[107:42]

regression this is my entire equation

[107:45]

now similarly I will try to only put

[107:48]

this H Theta of X let's consider that

[107:51]

this is my cost function only only my H

[107:53]

Theta of X is changing in this

[107:55]

particular case so if I go ahead and

[107:57]

write my cost function I can basically

[107:59]

say 1 by 2 h Theta of X of i minus y of

[108:05]

i² and in this particular scenario what

[108:07]

is h Theta of X it is nothing but 1 by 1

[108:11]

+ e ^ minus Theta 1 x so this is what

[108:16]

this is getting replaced and this is my

[108:18]

logistic regression cost function I'm

[108:20]

just considering this cost function part

[108:22]

this part later on if you replace this

[108:25]

to this see if I replace this to this

[108:28]

and if I replace this to this it becomes

[108:30]

a logistic regression cost function

[108:33]

intercept I'm considering it as zero

[108:34]

guys now when I'm replacing this to this

[108:36]

this to this then it becomes a logistic

[108:39]

uh regression cost function but there is

[108:41]

one problem we cannot we cannot use we

[108:45]

cannot use this cost function there is a

[108:48]

reason for this because this equation

[108:50]

that you're seeing 1/ 1 + e^ of minus

[108:54]

Theta 1 * X used inside the squared error cost gives a non-convex

[108:59]

function now you may be considering what

[109:01]

is a non-convex function so let me write

[109:03]

it down so here this this term this

[109:07]

terminology right it is a non-convex

[109:09]

function now what is this non-convex

[109:10]

function let me show you and let me

[109:12]

differentiate it with convex function

[109:15]

okay we'll try to understand what is the

[109:16]

difference between non-convex function

[109:18]

and convex function this is related to

[109:21]

gradient descent very important this is

[109:24]

related to gradient desent if you

[109:27]

remember with the help of linear

[109:29]

regression whatever gradient descent we are

[109:32]

actually getting it is a convex function

[109:34]

like this this is the convex function

[109:38]

which looks like a parabola curve

[109:40]

Parabola curve because of this Parabola

[109:42]

curve whenever we use this linear

[109:44]

regression cost function specifically

[109:46]

because here my H Theta of X is what it

[109:48]

is nothing but Theta 0 + Theta 1 into X

[109:51]

because of this this equation

[109:53]

will always give you a parabola curve

[109:56]

this kind of cost function or convex

[109:59]

function you can say but here your H

[110:01]

Theta of X is changing so in the case of

[110:03]

if I use that cost function you will be

[110:05]

getting some curves which looks like

[110:07]

this now what is the problem with this

[110:08]

curve here you have lot of local Minima

[110:11]

if local Minima is there you will never

[110:13]

reach This Global Minima so that is the

[110:15]

reason we cannot use that cost function now

[110:18]

mathematically you can also go and

[110:20]

probably search in the Google what is

[110:22]

the

[110:23]

what is the graph or what is a convex or

[110:25]

non-convex function but always remember

[110:27]

whenever we updates Theta 1 with this

[110:30]

within this particular equation by

[110:32]

finding the slope then this way it will

[110:35]

not be differentiable and here you have

[110:37]

lot of local Minima and because of this

[110:39]

local Minima you will never be able to

[110:41]

reach the global Minima this is your

[110:42]

Global Minima right in case

[110:45]

of in case of linear regression you'll

[110:48]

reach This Global Minima but in this

[110:50]

case you will never reach never never

[110:52]

you'll be stuck over here or you may get

[110:54]

stuck over here you may get stuck over

[110:56]

here okay so this has a local Minima

[111:00]

problem so how do we solve this

[111:02]

understand in local Minima these are my

[111:03]

points right I have to come over here

[111:05]

this is my deepest point in this

[111:07]

particular case I don't have any local

[111:09]

Minima now in local Minima also you'll

[111:11]

get slope is equal to Z so that is the

[111:13]

reason your Theta 1 will never get

[111:14]

updated so in order to solve this

[111:17]

problem you can see this diagram we have

[111:19]

something called as logistic regression

[111:20]

cost function so I can now write my

[111:23]

logistic regression cost function in a

[111:25]

different way so this researcher

[111:27]

researcher thought of it and basically

[111:30]

came up with this proposal that the

[111:31]

logistic cost function should look

[111:33]

something like this so the entire cost

[111:36]

function of logistic regression that is

[111:38]

specifically H Theta of X of I comma y

[111:43]

this should be written something like

[111:44]

this and it should be written like this

[111:47]

see here I'm just going to write cost

[111:49]

function of J of theta 1 let's say that

[111:51]

I'm writing J of theta 1 okay so J of

[111:54]

theta 1 what are the different different

[111:56]

output that I'll be getting I'll be get

[111:58]

I'll be getting yal 1 or y equal to 0 So

[112:02]

based on this two scenarios our cost

[112:04]

function will look something like this

[112:06]

minus log of H of theta of X and I know

[112:11]

I hope you all know what is h Theta of x

[112:13]

h Theta of X is nothing but 1 by 1 + e ^ of minus

[112:19]

Theta 1 x so this is what is my H Theta

[112:22]

of X and whenever Y is zero then you

[112:25]

basically have minus log of 1 minus H Theta

[112:31]

of X of I okay so this is how you

[112:35]

basically write your cost function in

[112:36]

this particular scenario now with the

[112:38]

help of this cost function it is always

[112:40]

possible since it is getting log log is

[112:42]

basically getting used in this scenario

[112:45]

you'll always get a global Minima that

[112:46]

is the reason why they have completely

[112:48]

neglected this cost function and utiliz

[112:51]

this cost function now what does this

[112:52]

cost function basically mean two

[112:55]

scenarios if Y is equal to 1 Let's

[112:58]

consider this is my cost function

[113:01]

graph I have H Theta of X and you know

[113:06]

that H Theta of x value will be ranging

[113:08]

between 0 to 1 since it is a

[113:10]

classification problem so it will be

[113:11]

ranging between 0 to 1 and this is

[113:14]

basically of J of theta 1 which is my

[113:16]

cost function so if Y is equal to 1 this

[113:19]

specific equation will be used and

[113:21]

whenever this equation is is basically

[113:22]

used you get a you get a curve see minus

[113:25]

log of H Theta of X of I you get a curve which

[113:29]

looks something like this okay which

[113:31]

you'll get a curve which looks like this

[113:33]

now what does this curve basically

[113:35]

specify the curve come up with two

[113:37]

assumptions the cost will be zero if Y

[113:42]

is = 1 and H Theta of x equal to 1 that

[113:46]

basically when your H Theta of X is 1

[113:49]

and the Y is output is one that

[113:51]

basically means you're going to assign

[113:52]

over here one right so in this

[113:54]

particular case you will be seeing that

[113:56]

your cost function will be zero cost is

[113:59]

zero so here is my zero it is meeting

[114:01]

over here if H Theta of x equal to 1 and Y

[114:04]

is equal to 1 so this is this is again a

[114:06]

convex function only then the next point

[114:08]

that you can probably discuss over here

[114:10]

is with respect to Y is equal to 0 if

[114:13]

your Y is zero then what kind of curve you

[114:16]

will be getting you'll get a different

[114:18]

kind of curve which will look like this

[114:20]

H Theta of x here your value will be 0

[114:23]

to one and here you'll be having a curve

[114:26]

which looks like this so when you

[114:29]

combine this two you'll be able to see

[114:31]

that you are able to get a kind of

[114:34]

gradient descent so this will definitely

[114:36]

help us to create a cost function so I

[114:38]

hope everybody is able to understand

[114:40]

till here with respect to this and this

[114:42]

will definitely work so finally I can

[114:45]

also write my cost function in a

[114:47]

different way the cost function that I

[114:49]

will probably write over here so this

[114:50]

will be my J of theta 1

[114:53]

so I can come up with a cost function

[114:54]

which looks like this

[114:57]

cost of H of theta of X of I comma y minus

[115:02]

log of H Theta of x if Y is equal

[115:09]

1 and then minus

[115:11]

log 1 - H Theta of x if Y is equal

[115:17]

0 now I can combine this both and

[115:21]

probably write something like like this

[115:23]

I can combine this both and I can

[115:25]

basically write cost of H Theta of X of

[115:27]

I comma y is equal to - y log H Theta of X of

[115:35]

I minus (1 -

[115:40]

y) log of (1 - H Theta of X) so

[115:47]

this will be my final cost

[115:50]

function and here also you can see that

[115:53]

if I

[115:54]

replace if I replace y with one then

[115:57]

what will remain only this particular

[115:59]

value will remain right this value when

[116:01]

Y is equal to 1 this thing only will

[116:03]

come you see over here replace y with

[116:05]

one probably replace y with one and then

[116:08]

you'll be able to see so here I can now

[116:10]

write if Y is equal to 1 my cost

[116:14]

function will look something like this

[116:18]

which is nothing

[116:19]

but see Y is 1 then what will happen my

[116:22]

log of H Theta of X of I will come and

[116:26]

this 1 - 1 is 0 so 0 multiplied by anything

[116:29]

will be 0 if Y is equal to 0 then what

[116:32]

will happen my cost function will be so

[116:36]

when it is zero this will - y will

[116:38]

become 0, 0 multiplied by anything is zero so

[116:42]

here you'll be able to see that I am

[116:43]

I'll be having minus log 1 - H Theta of

[116:48]

x i so this both the condition has been

[116:50]

proved by this cost function
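The piecewise cost just derived — minus log h when y equals 1, minus log (1 − h) when y equals 0 — collapses into the single cross-entropy expression, which can be sketched in NumPy. This is a minimal illustration, not the lecturer's notebook; the function and variable names are my own:

```python
import numpy as np

def sigmoid(z):
    # h_theta(x) = 1 / (1 + e^(-z)), the logistic function
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # Combined cost: mean over i of  -y*log(h) - (1-y)*log(1-h)
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Sanity check: a confident correct prediction costs almost nothing,
# a confident wrong one costs a lot
X = np.array([[10.0], [-10.0]])
y = np.array([1.0, 0.0])
print(logistic_cost(np.array([1.0]), X, y))   # near zero
print(logistic_cost(np.array([-1.0]), X, y))  # large
```

Because the log is applied to the sigmoid output, this cost is convex in theta, which is exactly the "always get a global minima" point made above.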

[116:52]

so this is my cost function yes cost

[116:54]

function and loss function with respect

[116:55]

to the number of parameters will be

[116:57]

almost same so finally if I try to write

[117:00]

J of theta because I have that 1 by 2m

[117:03]

also right so 1 by 2m also I have so what

[117:06]

I'm actually going to do here you will

[117:08]

be able to see that I can write J of

[117:11]

theta 1 is equal to 1 by 2 m summation

[117:16]

of IAL 1 to M and then write down the

[117:19]

entire equation that you have probably

[117:22]

over here so here you have minus y or I

[117:26]

I'll just remove this minus and put it

[117:27]

over here and this will become plus

[117:29]

sorry y of I

[117:31]

* log H Theta of X of I 1 - y of i y

[117:41]

log 1 - H Theta of X of I so this

[117:45]

becomes my entire first function and

[117:48]

obviously you know what is h thet of x H

[117:52]

Theta of X of I is nothing but 1 / (1 + e^(

[117:56]

minus Theta 1 * X)) and finally my

[117:59]

convergence algorithm I have to repeat

[118:02]

this to update Theta 1 repeat until this

[118:07]

updation that is Theta Theta

[118:11]

J is equal to Theta J minus learning

[118:15]

rate derivative with respect to Theta J

[118:18]

and this will be my J of theta 1 this is

[118:21]

my repeat until conversion so this is my

[118:24]

cost function this is my repeat

[118:27]

algorithm and here I will be updating my

[118:30]

entire Theta

[118:32]

1 and this solves your problem with

[118:35]

respect to logistic regression simple
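The repeat-until-convergence update just stated, theta_j := theta_j − learning rate × derivative of J with respect to theta_j, can be sketched as a plain NumPy loop. The one-feature data set here is made up purely to show the update working; it is not from the session:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-feature data: negatives left of zero, positives right of zero
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

theta = np.zeros(1)
alpha = 0.5            # learning rate
for _ in range(2000):  # "repeat until convergence"
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / len(y)  # gradient of the log-loss w.r.t. theta
    theta -= alpha * grad          # theta_j := theta_j - alpha * dJ/dtheta_j

preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(theta, preds)
```

After the loop, theta has moved to a positive value and every training point is classified correctly, which is all the convergence step promises.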

[118:37]

simple questions may come like how it is

[118:39]

different from linear regression how it

[118:41]

is not different from linear regression

[118:44]

can we say log likelihood a topic from

[118:46]

probabilistic yes this is uh this is log

[118:50]

likelihood if now I will discuss about

[118:54]

performance metrics and this is specific

[118:56]

to classification problem and binary

[118:59]

classification I'm talking let's

[119:02]

consider let's consider I have a data

[119:04]

set which has X1 X2 and this is y and

[119:09]

obviously in logistic uh classification

[119:11]

you have outputs like 0 1 0 1 1 0 1 and

[119:17]

your y hat y hat is basically the output

[119:20]

of the predicted model now in this

[119:22]

particular scenario my y hat will

[119:24]

probably be 1 1 0 uh 1 1 1 0 so in this

[119:31]

particular scenario this is my predicted

[119:34]

output and this is my actual output so

[119:39]

can we come to some kind of conclusions

[119:41]

wherein probably we will be able to

[119:44]

identify what may be the accuracy of

[119:48]

this specific model with respect to this

[119:49]

many data points because confusion

[119:52]

Matrix is all dealt with this is called

[119:54]

as we will first of all have to create a

[119:56]

confusion Matrix now for a binary

[119:59]

classification problem the confusion

[120:01]

Matrix will look like this so here you

[120:03]

have 1 0 1 0 Let's say that this is

[120:06]

prediction let's say that these are my

[120:08]

actual value and these are my prediction

[120:10]

value okay these both are prediction

[120:12]

value these are my output value when my

[120:15]

actual value is zero my predicted value

[120:17]

is one does this what does this mean

[120:21]

wrong prediction right so when my actual

[120:23]

value is zero my predicted value is 1 so

[120:26]

here my count will increase to one let's

[120:28]

go to the second scenario when the

[120:30]

actual value is one and my predicted

[120:33]

value is one that basically means one

[120:35]

and one so here I'm going to increase my

[120:37]

count similarly when my actual value is

[120:40]

zero my predicted value is zero so that

[120:42]

basically mean when my actual value is zero

[120:43]

my predicted value is zero I'm going to

[120:45]

increase the count by one if I go over

[120:47]

here 1 one again it is so instead of

[120:50]

writing one now this will become two I'm

[120:52]

going to increase the count similarly

[120:54]

I'll go over here one more one is there

[120:56]

so I'm going to increase the count three

[120:58]

then I have 01 01 basically means when

[121:00]

my actual value is zero I'm actually

[121:02]

getting it as one so I'm also going to

[121:04]

increase this particular value as two

[121:07]

and then finally I have 1 and zero where

[121:09]

I'm going to increase like this now what

[121:11]

does this basically mean now what does

[121:13]

this basically mean see with respect to

[121:16]

this kind of predictions whenever we are

[121:17]

discussing this basically basically says

[121:20]

so this is my actual values and I have Z

[121:22]

1 and zero and this is my predicted

[121:24]

values I also have 1 and zero this value

[121:27]

when one and one are there this is

[121:29]

called as true positive this value when

[121:31]

0 and zero are there this is called as

[121:33]

true negative whenever your actual

[121:35]

value is zero and you have predicted one

[121:37]

this becomes false positive and whenever

[121:40]

your actual value is one you have

[121:41]

predicted zero this becomes false

[121:43]

negative now coming to this I really

[121:45]

need to find out the accuracy of this

[121:47]

model now if I really want to find out

[121:51]

and this is what is called as confusion

[121:52]

Matrix now in this confusion Matrix if I

[121:55]

really want to find out the accuracy the

[121:57]

accuracy of this model it is very much

[121:59]

simple the diagonal elements that you are

[122:01]

able to see will basically give us the

[122:03]

right output so this and this if I add

[122:07]

it up it will give us the right output

[122:10]

so here I'm going to get TP + TN divided

[122:13]

by TP + FP + FN + TN so once I calculate

[122:21]

this so I have 3 + 1

[122:23]

/ 3 + 2 + 1 + 1 so this is nothing but 4

[122:29]

by 7 what is 4 by

[122:32]

7, 0.57 so am I getting 57 percentage

[122:35]

accuracy so I'm actually getting 57%

[122:38]

accuracy over here with respect to the

[122:39]

accuracy so this is how we basically

[122:42]

calculate with respect to basic accuracy

[122:45]

with the help of uh the confusion Matrix
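The counting exercise above can be reproduced without any library: the seven labels below are my encoding of the on-screen 0/1 example, and they land on the same TP=3, TN=1, FP=2, FN=1 and 4/7 accuracy:

```python
from collections import Counter

y_true = [0, 1, 0, 1, 1, 0, 1]  # actual outputs from the worked example
y_pred = [1, 1, 0, 1, 1, 1, 0]  # predicted outputs (y hat)

# Tally each (actual, predicted) pair -- the four confusion-matrix cells
counts = Counter(zip(y_true, y_pred))
tp, tn = counts[(1, 1)], counts[(0, 0)]
fp, fn = counts[(0, 1)], counts[(1, 0)]

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)  # 3 1 2 1 and 4/7 = 0.571...
```

In practice `sklearn.metrics.confusion_matrix` does the same tally; the hand-rolled version just makes the four cells explicit.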

[122:48]

okay so this is specifically called as

[122:49]

confusion Matrix now there are some more

[122:52]

things that you really need to specify

[122:54]

always remember our model aim should be

[122:56]

that we should try to reduce false

[122:57]

positive and false negative now let's

[123:00]

say that I want to discuss about two

[123:02]

topics what one is suppose in our data

[123:04]

set I have zeros and one category let's

[123:07]

say in my output if I say zeros are 900

[123:11]

and ones are 100 this becomes an

[123:13]

imbalanced data very clear right so this

[123:15]

become an imbalanced data set it is a

[123:18]

biased data suppose if I say zeros are

[123:21]

probably

[123:22]

600 and ones are probably 400 in this

[123:25]

particular scenario I will say that this

[123:27]

is the balance data because yes you have

[123:29]

100 less but it's okay the it may not

[123:32]

impact many of the algorithm now see

[123:34]

guys most of the algorithm that we will

[123:36]

be probably discussing imbalanced if we

[123:38]

have an imbalanced data set it will

[123:40]

obviously affect the algorithms let me

[123:42]

talk about this let's say that I have

[123:44]

number of zeros as 900 and number of

[123:46]

ones is 100 now let's say that my model

[123:49]

I have created which will directly

[123:51]

predict

[123:52]

zero it'll I'll just say that all my

[123:55]

inputs that it is probably getting with

[123:57]

respect to this training data it'll just

[123:59]

output zero now in this particular

[124:01]

scenario what will be my accuracy my

[124:03]

accuracy will be 900 divided by 1,000

[124:05]

right so this is nothing but 90% so is

[124:09]

this a good

[124:10]

accuracy obviously it is a good accuracy

[124:12]

but this is a biased data if my model is

[124:15]

basically just outputting 00000000 0 if

[124:19]

it is outputting 00 00 0 obviously most

[124:22]

of the answer will be zeros but this

[124:24]

will be a scenario like you know where

[124:27]

it is just outputting one thing then

[124:28]

also it is able to get 90% accuracy so

[124:31]

you should only not be dependent on

[124:33]

accuracy so there are lot of

[124:35]

terminologies that we will basically use
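The 900-versus-100 trap above is easy to verify in a couple of lines: a "model" that blindly answers zero scores 90% accuracy while catching zero positives. The data set here is just the counts from the example, nothing real:

```python
# 900 negatives, 100 positives -- the imbalanced data set from the example
y_true = [0] * 900 + [1] * 100
y_pred = [0] * 1000            # a degenerate model that always predicts zero

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)        # TP / (TP + FN)
print(accuracy, recall)        # 0.9 accuracy, 0.0 recall
```

High accuracy, zero recall: exactly why accuracy alone is not enough on imbalanced data.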

[124:37]

one terminology that we specifically use

[124:40]

is something called as Precision then

[124:42]

we'll also use recall what is precision

[124:45]

what is recall I'll write the formula

[124:46]

over here in Precision what do we need

[124:48]

to focus and then finally we will

[124:50]

discuss about f score so we have to use

[124:53]

different kind of parametrics of sorry

[124:55]

different kind of formulas whenever you

[124:58]

have an imbalanced data set you can also

[124:59]

do oversampling but again understand in

[125:02]

most of the scenarios in some of the

[125:04]

scenarios oversampling may work but we

[125:06]

have to focus on the type of performance

[125:08]

metrics that we are focusing on right

[125:10]

now I'll not say F1 score I'll say F

[125:11]

score the reason why I'm saying I'll

[125:13]

just let you know so let's talk about

[125:15]

recall recall formula is basically given

[125:17]

by true positive divided by true

[125:20]

positive plus false negative

[125:22]

Precision is given by true positive

[125:23]

divided by true positive plus false

[125:27]

positive and then I will probably

[125:29]

discuss about F score also or we

[125:31]

basically say F beta also now I'll just

[125:34]

draw this confusion Matrix again okay

[125:36]

which is having true positive true

[125:37]

negative so let me draw it over here so

[125:40]

this is my ones and zeros these are my

[125:42]

actual values and these are my predicted

[125:44]

values I have true positive I have true

[125:47]

negative false positive and false

[125:49]

negative now in this particular scenario

[125:50]

when I'm actually discussing understand

[125:53]

what is recall and what focus it is

[125:54]

basically given on so here whenever I

[125:57]

talk about recall recall basically says

[125:59]

that TP TP divided by TP plus FN so I'm

[126:04]

actually focusing on this so what does

[126:06]

this basically say true uh recall out of

[126:10]

all the actual true positives how many

[126:13]

have been predicted correctly that is

[126:15]

basically mentioned by TP out of all the

[126:18]

positive values how many of them have

[126:20]

predicted as positive so this is what it

[126:22]

is basically saying and this scenario is

[126:24]

called as recall in this the false

[126:27]

negative is basically given more

[126:28]

priority and our focus should be that we

[126:31]

should try to reduce false positive

[126:33]

false negative sorry we should try to

[126:35]

reduce this now let's go ahead and let's

[126:37]

discuss about Precision in Precision

[126:39]

what we are doing we are basically

[126:41]

taking out of all the predicted values

[126:44]

out of all the predicted positive values

[126:47]

how many of them are actual true or

[126:50]

positive okay this is what Precision

[126:52]

basically means now suppose if I

[126:54]

consider spam classification suppose

[126:56]

this is my task tell me in this

[126:57]

particular case should we use Precision

[127:00]

or recall and one more use case I'm

[127:02]

saying that whether the person has

[127:05]

cancer or not in which case we have to

[127:08]

support recall and in which case we have

[127:10]

to go ahead with Precision has cancer or

[127:13]

not in spam what is important okay guys

[127:16]

the recall is also called as true

[127:18]

positive rate I can also say recall as

[127:20]

sensitivity so if I go with Spam

[127:22]

classification it should definitely go

[127:24]

with Precision why it should go with

[127:26]

Precision if I probably get a spam mail

[127:28]

the main aim should be that whenever I

[127:30]

get a spam mail it should be identified

[127:31]

as spam okay in that specific scenario

[127:34]

my positive false positive we should try

[127:37]

to reduce and in this scenario my false

[127:39]

positive talks about the spam

[127:41]

classification a lot in a better way in

[127:43]

the case of cancer I should definitely

[127:46]

use recall let's let's focus on the

[127:48]

recall formula TP divided by TP plus FN if a

[127:52]

person has a cancer see one actually he

[127:55]

has a cancer it should be predicted as

[127:57]

one otherwise if we have FN it is

[127:59]

basically predicting it does not have a

[128:01]

cancer that is really a big situation in

[128:04]

this case if a person does not have a

[128:07]

Cancer and if he's predict if the model

[128:09]

predicts okay fine he has a cancer he

[128:11]

may go and further do the test and then

[128:13]

he'll come to know whether he has a

[128:14]

cancer or not but this scenario is very

[128:16]

dangerous if a person has a cancer but

[128:19]

he is being indicated that he does not

[128:20]

have that cancer

[128:22]

so here false negative is given more

[128:24]

priority over here in the case of spam

[128:26]

classification false positive is given

[128:28]

more priority so this is something

[128:30]

important over here and you really need

[128:31]

to understand with respect to different

[128:33]

different problem statement let me give

[128:35]

you one more example tomorrow the stock

[128:37]

market is going to crash in this what we

[128:40]

need to focus on should we focus on

[128:41]

Precision or should we focus on recall

[128:44]

now here two things are there who is

[128:46]

solving what kind of problem see many

[128:48]

people will say recall or Precision but

[128:50]

here two things are there on whose point

[128:52]

of view you are creating this model are

[128:55]

you creating this model for the industry

[128:57]

or are you creating this model for the

[128:59]

people for the people he should

[129:01]

definitely get identified that okay in

[129:04]

this particular scenario you need to

[129:06]

sell your stock because tomorrow stock

[129:07]

market is going to crash but for

[129:09]

companies this is very bad okay I hope

[129:11]

everybody is able to understand for

[129:13]

companies it is very very bad so in this

[129:15]

particular case sometime we need to

[129:17]

focus both on false positive and false

[129:19]

negative and again I'm telling you for

[129:22]

which problem statement you are solving

[129:23]

that indicates if you are solving for

[129:25]

people then they should be able to get

[129:27]

the notification saying that it is going

[129:29]

to crash if you're probably uh doing it

[129:32]

for companies at that time your

[129:34]

Precision recall may change but if I

[129:36]

consider for both the scenarios at that

[129:39]

point of time I will definitely use

[129:40]

something called as F score F score or

[129:42]

I'll also say it as F beta now how is

[129:45]

F beta formula given as I will talk about

[129:48]

it and here in the F score you have

[129:50]

three different formulas the first

[129:51]

Formula I will say basically as when

[129:53]

your beta value is 1 okay first of all

[129:57]

I'll just give a generic definition of f

[129:59]

score or F beta here you are basically going

[130:01]

to consider 1 + beta squared Precision

[130:05]

multiplied by recall divided by beta squared

[130:09]

* Precision plus recall whenever your

[130:14]

both false positive and false negative

[130:16]

are important we select beta as one so

[130:19]

if I select beta as 1 it becomes 1 + 1

[130:22]

Precision multiplied by recall then you

[130:25]

have Precision plus recall so here sorry

[130:28]

1 + 1 so this becomes 2 multiplied by

[130:31]

Precision into recall divided by

[130:34]

Precision plus recall so here you have

[130:37]

this is basically called as harmonic

[130:39]

mean harmonic mean probably you have

[130:41]

seen this kind of equation where you

[130:42]

have written 2x y / x + y same type you

[130:46]

are able to see this this is called as

[130:48]

harmonic mean here the focus is on both

[130:51]

false positive and false negative let's

[130:53]

say that your false positive is more

[130:56]

important than false negative at that

[130:58]

point of time you will try to decrease

[131:01]

or you will try to decrease your beta

[131:03]

value let's say that I'm decreasing my

[131:05]

Beta value to 0.5 then what will happen

[131:07]

1 + 0.5 whole

[131:09]

squared and then you have P * R Precision

[131:13]

recall and here also you have 0.25 P + R

[131:17]

now in this particular scenario I'm

[131:19]

decreasing my Beta decreasing the beta

[131:21]

basically means that you are providing

[131:23]

more importance to false positive than

[131:25]

false negative and finally you'll be

[131:27]

able to see that if I consider beta

[131:30]

value as let me just say my notes if I

[131:34]

consider beta value as two that

[131:37]

basically means you are giving more

[131:38]

importance to false negative than false

[131:40]

positive so with this specific case you

[131:42]

can come up to a conclusion what value

[131:44]

you basically want to use now whenever I

[131:46]

use beta is equal to 1 it becomes the F1

[131:49]

score if I use beta as .5 then this

[131:52]

basically becomes the F0.5 score and this

[131:56]

becomes your F2 score So based on which

[132:00]

is important okay which is important

[132:03]

whether your Precision or false positive

[132:05]

or false negative is important you can

[132:06]

consider those things F score will have

[132:09]

different values if you're using beta is

[132:11]

equal to 1 that basically means you are

[132:13]

giving importance to both precision and

[132:16]

recall if your false positive is more

[132:18]

important then at that point of time you

[132:20]

reduce beta value if false negative is

[132:23]

greater than false positive

[132:25]

then your beta value is

[132:26]

increasing beta is a deciding parameter

[132:29]

to decide your F1 score or F2 score or

[132:32]

F0.5 score now first things first what
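Before the session moves on to practicals, the three metrics can be put together in a few lines — recall = TP/(TP+FN), precision = TP/(TP+FP), and F beta = (1 + beta²)·P·R / (beta²·P + R). The counts reuse the earlier seven-point confusion-matrix example; this is my own sketch, not the lecturer's code:

```python
def f_beta(precision, recall, beta):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

tp, fp, fn = 3, 2, 1                   # counts from the 7-point example
recall = tp / (tp + fn)                # 0.75
precision = tp / (tp + fp)             # 0.6

print(f_beta(precision, recall, 1))    # F1: harmonic mean, 2PR/(P+R)
print(f_beta(precision, recall, 0.5))  # F0.5: false positives weigh more
print(f_beta(precision, recall, 2))    # F2: false negatives weigh more
```

With recall above precision, as here, the score rises as beta grows — F0.5 < F1 < F2 — which matches the rule above: a larger beta rewards the model for its recall, i.e. puts more weight on avoiding false negatives.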

[132:34]

is the agenda of today's session first

[132:36]

of all we will complete practicals for

[132:39]

all the algorithms that we have

[132:41]

discussed these all algorithms that we

[132:43]

have discussed we will cover the

[132:45]

practicals probably we will be doing

[132:47]

hyper parameter tuning everything the

[132:49]

second thing and again here we are going

[132:51]

to take just simple examples so yes uh

[132:54]

so today's session I said practicals

[132:56]

with simple examples where I'll probably

[132:59]

discuss about all the hyper parameter

[133:01]

tuning then the second one the second

[133:04]

algorithm that I'm going to discuss

[133:05]

about is something called as naive Bayes this

[133:09]

is a classification algorithm so we are

[133:10]

going to understand the intuition and

[133:13]

the third one that we are going to

[133:15]

probably discusses KNN algorithm so KNN

[133:19]

algorithms is definitely there

[133:21]

so this our today's plan I know I've

[133:23]

written very less but this much maths is

[133:26]

involved in naive Bayes right we'll

[133:29]

understand the probability theorem again

[133:30]

over there there is something called as

[133:32]

Bayes' theorem we'll try to understand and

[133:35]

then we'll try to solve a problem on

[133:36]

that so let's proceed and let's enjoy

[133:39]

today's session how do we enjoy first of

[133:42]

all we enjoy by creating a practical

[133:44]

problem so I am actually opening a

[133:47]

notebook file in front of you so here uh

[133:50]

we will try to solve it with the

[133:51]

help of linear regression Ridge lasso

[133:55]

and try to solve some problems let's see

[133:58]

how much we will be able to solve it but

[134:00]

again the aim is that we learn in a

[134:02]

better way okay uh so that everybody

[134:06]

understands some basic basic things okay

[134:08]

so first of all as usual uh everybody

[134:11]

open your jupyter notebook file the

[134:13]

first algorithm that I'm going to

[134:14]

discuss about is something called as SK

[134:16]

learn linear regression so everybody I

[134:19]

hope everybody knows about this SK learn

[134:21]

let's see what all things are basically

[134:23]

there in this we will be using fit

[134:25]

intercept everything as such but here

[134:28]

the main aim is to find out the

[134:29]

coefficients which is basically

[134:31]

indicated by Theta 0 Theta 1 and all the

[134:34]

first thing we'll start with linear

[134:38]

regression and then we will go ahead and

[134:40]

discuss with ridge and lasso I'm just going

[134:42]

to make this as

[134:44]

markdown how many different libraries of

[134:46]

for linear regression you can do with

[134:48]

statsmodels you can do with scipy you can do

[134:49]

with many things okay so first thing

[134:52]

first let's first of all we require a

[134:53]

data set so for the data set what we are

[134:56]

going to do is that we are going to

[134:58]

basically take up some smaller smaller

[135:01]

data just let me do this so for this uh

[135:05]

we are going to take the house pricing

[135:07]

data set so we are going to solve house

[135:10]

pricing data set problem a simple data

[135:13]

set which is already present in SK learn

[135:16]

only now in order to import the data set

[135:18]

I will write a line of code which is

[135:19]

like from SK learn dot data sets data

[135:24]

sets

[135:25]

import load underscore boston so we have

[135:29]

some Boston house pricing data set so

[135:31]

I'm just going to execute this I'm also

[135:33]

going to make a lot of cells so that I

[135:35]

don't have to again go ahead and create

[135:37]

all the cells again some basic libraries

[135:39]

that I probably want is import numpy

[135:43]

as

[135:44]

NP

[135:45]

import pandas

[135:48]

as pd okay import seaborn as

[135:52]

sns and then I will also import

[135:56]

matplotlib.pyplot as plt and then

[136:02]

the magic command percent matplotlib

[136:07]

inline and I will try to execute this

[136:09]

see this my typing speed has become a

[136:11]

little bit faster by writing by

[136:12]

executing this queries again and again

[136:15]

and uh let's go ahead uh so I have

[136:18]

imported all the necessary libraries

[136:19]

that is required which which will be

[136:21]

more than sufficient for you all to

[136:23]

start with now in order to load this

[136:25]

particular data set I will just use this

[136:27]

library called as load underscore boston and

[136:30]

I'm going to just initialize this so if

[136:32]

you press shift tab you will be able to

[136:34]

see that return load and return the

[136:37]

Boston house prices data set it is a

[136:39]

regression problem it is saying and then

[136:41]

probably I'm just going to execute it

[136:43]

now once I execute it I will go and

[136:45]

probably see the type of DF so it is

[136:48]

basically saying sklearn.utils.Bunch now if

[136:51]

I go and probably execute DF you'll be

[136:53]

able to see that this will be in the

[136:55]

form of key value pairs okay like Target

[136:57]

is here data is here okay so data is

[137:01]

here Target is here and probably you'll

[137:03]

be able to find out feature names is

[137:04]

here so we definitely require feature

[137:06]

names we require our Target value and

[137:09]

our data value so we really need to

[137:11]

combine this specific thing in a proper

[137:14]

way in the form of a data frame so that

[137:16]

you will be able to see so what I'm

[137:18]

actually going to do over here I'm just

[137:19]

going to say PD do data frame I'll

[137:22]

convert this entirely into a data frame

[137:24]

and I will say DF do data see this is a

[137:27]

key value pair right so DF do data is

[137:29]

basically giving me all the features

[137:31]

value so if I write DF do data and just

[137:35]

execute it you'll be able to see that I

[137:36]

you will be able to get my entire data

[137:39]

set in this way my entire data set in

[137:41]

this way this is my feature one feature

[137:43]

two feature three feature 4 this feature

[137:45]

12 I have 12 features over here and

[137:47]

based on that I have that specific value

[137:50]

now the next thing thing that I'm going

[137:51]

to do probably I should also be able to

[137:53]

add the target feature name over here so

[137:55]

what I will do I will just convert this

[137:57]

into DF and then I will also say DF do

[138:02]

columns and I'll set it to DF do Target

[138:05]

okay and let me change this to data set

[138:08]

so I'm going to change this to data set

[138:10]

and I'm going to say data set. columns

[138:12]

is equal to DF do Target so if I execute

[138:15]

this and now if I probably

[138:18]

print my data set do head you will be

[138:22]

able to see this specific thing okay it

[138:24]

is an error let's see expected axis has

[138:27]

13 element new values has

[138:30]

506 so Target okay I should not use

[138:33]

Target over here instead I had a column

[138:36]

which is called as features feature

[138:38]

names like if I go and probably see

[138:41]

DF DF over here you'll be able to see

[138:45]

there is one thing which is called as

[138:46]

feature names so I'm going to use DF do

[138:48]

feature names over here so here it is DF

[138:52]

do feature names I'm just going to paste

[138:55]

it over here and now if I go and write

[138:57]

here you can see print DF data set. head

[139:00]

if I go and execute without print you'll

[139:02]

be able to see my entire data set so

[139:04]

these are my features with respect to

[139:06]

different different things and this is

[139:09]

basically a house pricing data set so

[139:10]

initially I have this features CRM ZN

[139:13]

indust CH nox RM age distance radius tax

[139:18]

PTRATIO B LSTAT so I have my

[139:22]

entire data set over here the same data

[139:24]

set I have basically put it over here

[139:26]

now here also you'll be able to see what

[139:28]

all this feature basically means this is

[139:30]

showing weighted distance to

[139:31]

five Boston employment centers rad

[139:34]

basically means index of accessibility

[139:36]

to radial Highway tax basically means

[139:39]

full value property tax rate this much

[139:41]

PT rate basically means pupil teacher

[139:44]

ratio I don't know what the hell it

[139:45]

means but it's fine we have some kind of

[139:47]

data over here properly in front of you

[139:51]

so these are my independent features

[139:53]

what are these these all are my

[139:54]

independent features if you want the

[139:56]

features detail here you can see it

[139:59]

right everything what is CRIM this

[140:01]

basically means per capita crime rate by

[140:03]

town which is important ZN it is

[140:06]

proportion of residential land zoned

[140:08]

for Lots over 25,000 Square ft so this

[140:12]

is my DF I did not do much I'm just

[140:14]

using data frame DF do data column

[140:17]

features name I'm getting this value

[140:18]

very much simple now let's go a little

[140:21]

bit slowly so that many people will be

[140:23]

able to also understand now this is my

[140:25]

data set. head now the thing is that I

[140:29]

obviously have taken all these

[140:31]

particular values but this is my

[140:32]

independent feature I still have my

[140:35]

dependent feature so what I'm actually

[140:37]

going to do I will create a new feature

[140:40]

which is like data set of price I'll

[140:42]

create my feature name price price of

[140:44]

the house and what I will assign this

[140:46]

particular value this value will be

[140:48]

assigned with this target this target

[140:50]

value this target value is basically the

[140:53]

sale the price of the houses right it is

[140:56]

again in the form of array so I'm going

[140:58]

to take this and put it as a dependent

[141:00]

feature so here you'll be able to see

[141:02]

that my price will be my dependent

[141:04]

feature so here I'll basically write DF

[141:06]

do Target so once I execute it and now

[141:09]

if I probably go and see my data set do

[141:12]

head you'll be able to see features over

[141:15]

here and one more feature is getting

[141:17]

added that is price now this price may

[141:20]

be the units may be in

[141:22]

millions somewhere Target should be here

[141:24]

or there it should be probably in

[141:27]

millions

[141:28]

or I cannot see it but it should be

[141:31]

somewhere here it should have definitely

[141:33]

said that it is probably in millions or

[141:36]

okay but that is not a problem I think

[141:37]

but mostly it'll be in millions

[141:39]

somewhere I think it should be

[141:42]

here okay I cannot see it but probably

[141:45]

if I put more time I'll be able to

[141:47]

understand it okay so over here what is

[141:49]

the main thing these all are my

[141:51]

independent features and this is my

[141:53]

dependent feature right so if I'm trying

[141:55]

to solve linear regression I have to

[141:57]

divide my independent and dependent

[141:58]

features properly now let's go to the

[142:01]

next step that

[142:03]

is

[142:05]

dividing the data

[142:07]

set dividing the oh my God dividing the

[142:12]

data

[142:14]

set

[142:17]

into

[142:19]

train and test first of all I'll try to

[142:22]

divide into

[142:24]

independent and dependent

[142:27]

features so I want my entire features

[142:30]

data set divided into independent and

[142:31]

dependent features X I will be using as

[142:34]

my independent featur so I will write

[142:35]

data set dot I will use iloc which is

[142:39]

present in data frames and understand

[142:41]

from which feature to which feature I

[142:42]

will be taking as my independent feature

[142:44]

to this feature till LSTAT so the best way

[142:48]

that basically means that I just need to

[142:49]

skip the last feature in order to skip

[142:52]

the last feature what I'm actually going

[142:54]

to do from all the columns I will just

[142:57]

skip the last column so this is how you

[142:59]

basically do an indexing with respect to

[143:02]

just skipping the last feature and this

[143:05]

will basically be my independent

[143:06]

features and here I will basically say Y

[143:08]

is equal to data set do iock and here I

[143:11]

just want the last feature so I will

[143:14]

write colon all the records I want and

[143:18]

see the first term that we are probably

[143:20]

writing over here this basically

[143:22]

specifies with respect to records here

[143:24]

this specifies with respect to columns

[143:26]

from all the columns I'm taking the last

[143:27]

column here I will just take the last

[143:29]

column and this will basically be my

[143:32]

dependent features dependent features so

[143:35]

here I have basically executed now if

[143:37]

you can go and probably see x. head here

[143:40]

you'll be able to find all my

[143:41]

independent features in y do head you'll

[143:43]

be able to find the dependent feature
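The setup walked through so far — building the DataFrame, attaching the target as a Price column, and the iloc split — can be sketched roughly as below. Note that `load_boston` was removed in scikit-learn 1.2, so the bundled diabetes data stands in here purely to keep the sketch runnable; the column names differ from the Boston set in the video.

```python
import pandas as pd
from sklearn.datasets import load_diabetes  # stand-in: load_boston was removed in scikit-learn 1.2

df = load_diabetes()
dataset = pd.DataFrame(df.data, columns=df.feature_names)
dataset["Price"] = df.target      # attach the target array as a new dependent-feature column

X = dataset.iloc[:, :-1]          # all rows, every column except the last -> independent features
y = dataset.iloc[:, -1]           # all rows, only the last column -> dependent feature
print(X.head())
print(y.head())
```

The first slot of `iloc` selects records and the second selects columns, which is why `:, :-1` means "every row, skip the last column".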

[143:45]

now let's go to the first algorithm that

[143:47]

is called as linear regression

[143:51]

always remember whenever I definitely

[143:53]

start with linear regression I'll

[143:55]

definitely not go directly with linear

[143:56]

regression instead what I will do is

[143:59]

that I'll try to go with Ridge

[144:00]

regression and uh lasso regression

[144:02]

because there you have a lot of options

[144:04]

with respect to hyperparameter tuning but I'll

[144:06]

just show you how linear regression is

[144:08]

done so basically you really really need

[144:11]

to use a lot of libraries okay over here

[144:13]

and based on this libraries this

[144:15]

libraries will try to install okay and

[144:18]

what are these libraries these are

[144:19]

basically the linear regression Library

[144:21]

so here I'm basically going to use two

[144:23]

specific thing one is linear regression

[144:25]

library so I will just use from sklearn

[144:28]

dot linear_model import linear

[144:32]

regression do you need to remember this

[144:35]

the answer is no because I also do the

[144:37]

Google and I try to find out where in

[144:39]

sklearn it is present okay so here is

[144:42]

my linear regression so I will try to

[144:44]

initialize linear reg is equal to

[144:47]

initialize with linear regression and

[144:49]

then here what I'm actually going to do

[144:51]

I'm going to basically apply something

[144:53]

called as cross validation cross

[144:55]

validation is very much important

[144:57]

because in Cross validation we divide

[144:59]

out train and test data in such a way

[145:01]

that every combination of the train and

[145:04]

test data is basically taken care of

[145:07]

by the model and whichever accuracy

[145:09]

is better that all entire thing is

[145:11]

basically combined so here what I'm

[145:13]

going to do I'm going to say mean square

[145:14]

error is equal to here I will import one

[145:17]

more library let's say from sklearn

[145:20]

dot model_selection I'm going to import

[145:25]

cross_val_

[145:26]

score so cross_val_score cross

[145:29]

validation score basically means it is

[145:31]

going to do a lot of train and test

[145:32]

split it's something like this one

[145:34]

example I will show it to you here only

[145:37]

so what does cross validation basically

[145:39]

do okay so in Cross validation what

[145:42]

happens what you do suppose this is your

[145:44]

entire data

[145:46]

set suppose this is 100 records if you

[145:48]

do five cross validation then in the

[145:51]

first this will be your test data and

[145:53]

remaining all will be your training data

[145:55]

if in the second cross validation this

[145:58]

will be your test data and remaining all

[145:59]

will be your test uh training data like

[146:01]

this five times you'll be doing cross

[146:03]

validation by taking the different

[146:05]

combination of train and test but I'm

[146:07]

not going to discuss much about it in

[146:09]

the future if you want a separate

[146:10]

session I will include that in one of

[146:11]

the session itself so this was uh

[146:13]

basically the plan with respect to cross

[146:15]

validation or cross_val_score so here

[146:17]

I'm going to basically take

[146:20]

cross_val_score

[146:21]

and here the first parameter that

[146:24]

I give is my model so linear regression

[146:27]

is my model and here I will take X and Y

[146:30]

I'm not doing a train test split

[146:32]

specifically over here I'm giving the

[146:34]

entire X and Y and probably based on

[146:36]

that I'm going to do a cross validation

[146:38]

over here you can also do train test

[146:39]

split initially and then just give the X

[146:42]

train and Y train over here to do the

[146:43]

cross validation it is up to you but the

[146:45]

best practices will be that first you do

[146:47]

the train test split and then only give

[146:49]

the train data over here to do the cross

[146:51]

validation I'm just going to use scoring

[146:53]

is equal to you can use mean squared

[146:56]

error negative mean squared error let's

[146:58]

say that I'm going to use negative mean

[147:00]

squared error again where do you find all

[147:02]

these things you will be able to see in

[147:04]

the sklearn page of uh cross_val_

[147:06]

score and then finally in the cross_val_

[147:08]

score you give cross validation value as

[147:10]

5 10 whatever you want so after this

[147:13]

what I'm actually going to do I'm just

[147:14]

going to basically from this how many

[147:17]

scores I will get the mean squared errors

[147:19]

will be five since I'm doing five cross

[147:21]

validation if you don't believe me just

[147:23]

see over here print mse so here you'll

[147:26]

be able to see five different values 1 2

[147:30]

3 4 5 right five different mean values

[147:34]

because we are doing cross five five

[147:36]

cross validation so here what I'm going

[147:37]

to write I'm just going to say np.mean

[147:40]

I want to take the average of all the

[147:41]

five so here will basically be my

[147:45]

mean_mse

[147:46]

okay and then probably I'll print I

[147:49]

will print my mean_mse so this will

[147:54]

be my average score with respect to this

[147:56]

the negative value is there because we

[147:58]

have used negative mean squared error but if

[148:00]

you just consider mean square error then

[148:01]

it is only 37.13 okay so this I have

[148:05]

actually shown you how to do cross

[148:06]

validation see with respect to linear

[148:08]

regression you can't modify much with

[148:10]

the parameter so that is the reason why

[148:12]

specifically in order to overcome

[148:14]

overfitting and do the feature selection

[148:16]

we use uh Ridge and lasso regression so here

[148:19]

I will show you how to do

[148:21]

ridge regression
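Before moving on, the cross-validation flow just described can be sketched as below (again using sklearn's bundled diabetes data as a stand-in for the Boston set, so the exact scores will differ from the video's -37):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = load_diabetes()
X, y = data.data, data.target

lin_reg = LinearRegression()
# cv=5 -> five different train/test folds, one negative-MSE score per fold
mse = cross_val_score(lin_reg, X, y, scoring="neg_mean_squared_error", cv=5)
print(mse)                 # five values, all negative
mean_mse = np.mean(mse)    # average across the five folds
print(mean_mse)
```

The scores are negated so that "bigger is better" holds for every scorer, which is why a value closer to zero means a better model here.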

[148:24]

now now in order to do the prediction

[148:26]

all you have to do is that just go over

[148:28]

here take the model okay what is the

[148:31]

model linear_reg and just say dot

[148:37]

predict so here you can see uh you'll be

[148:40]

getting a function called as do predict

[148:42]

and give the test value whatever you

[148:44]

want to predict automatically the

[148:45]

prediction will be done so I'm just

[148:46]

going to remove this and focus on Ridge

[148:48]

regression right now because I I want to

[148:50]

show how hyperparameter tuning is done

[148:52]

in Ridge regression so for Ridge regression the

[148:54]

simple thing is that I'll be using two

[148:56]

different libraries from sklearn dot

[149:01]

linear_model I'm going to

[149:05]

import Ridge so for the ridge it is also

[149:09]

present in linear underscore model for

[149:11]

doing the hyperparameter tuning I will

[149:12]

be using from sklearn dot model_

[149:17]

selection and then I'm going to import

[149:20]

GridSearchCV so these are the two

[149:22]

libraries that I'm actually going to use

[149:24]

GridSearchCV will be able to help you out

[149:26]

with the um okay will be able to help

[149:30]

you out with Hyper parameter tuning and

[149:32]

then probably you'll be able to do

[149:34]

that uh difference between MSE and

[149:37]

negative MSE not big thing guys if you

[149:39]

use MSE here mean squared error you'll be

[149:42]

getting 37 I've just used negation of

[149:45]

MSE it's okay anything is fine you can

[149:48]

go with MSE also means square error

[149:50]

there is also another uh another scoring

[149:53]

area which is like which focuses on

[149:54]

square root mean square uh sorry

[149:58]

root mean squared error okay so there are

[150:00]

different different things which you can

[150:01]

basically focus on okay now in order to

[150:05]

give you this specific good value I'm

[150:07]

actually going to do hyperparameter tuning

[150:10]

now let's go ahead with uh GridSearchCV so

[150:13]

here what I'm going to do again I'm

[150:14]

going to basically Define my model which

[150:16]

will be

[150:18]

Ridge okay so this this is what I have

[150:20]

actually imported now uh let me open the

[150:24]

ridge sklearn so sklearn

[150:28]

Ridge we need to understand what

[150:30]

all parameters are basically

[150:33]

used do you remember this Alpha value

[150:36]

guys do you remember this Alpha value

[150:38]

why do we use Alpha I I told you now

[150:40]

Alpha multiplied by slope square if you

[150:43]

remember in Ridge we specifically use

[150:45]

this right Ridge and lasso regression

[150:48]

Alpha so this is the alpha the this is

[150:50]

probably the best parameter we can

[150:52]

perform hyper parameter tuning the next

[150:54]

parameter that we can probably perform

[150:56]

is basically uh this Max iteration okay

[151:00]

Max iteration basically means how many

[151:01]

number of iteration how many number of

[151:03]

times we may probably change the Theta 1

[151:05]

value to get the right value so we can

[151:08]

do this so what I'm actually going to do

[151:10]

I'm going to select some Alpha values

[151:12]

I'm going to play with this apart from

[151:14]

that if I want I can also play with the

[151:16]

other parameters which are uh like kind

[151:19]

of uh you know probably you can you can

[151:21]

also play with the iteration parameter

[151:23]

it is up to you try whichever parameter

[151:25]

you want to change you can go ahead and

[151:26]

change it now let me show you how do we

[151:28]

write this and how do we make sure that

[151:31]

this specific thing is done now uh

[151:34]

before doing grid s CV uh let me do one

[151:36]

thing I will Define my parameters okay

[151:39]

so here is my Ridge now what I'm going

[151:41]

to do I'm going to say parameters and in

[151:44]

this parameter two important value that

[151:46]

I'm probably going to take is this one

[151:49]

that is my C value and I will try to

[151:51]

Define this in the form of dictionaries

[151:53]

so here the C value that I sorry not C

[151:57]

just a second

[151:58]

guys my mistake it is not C it is

[152:03]

Alpha let's see so how do I Define my

[152:05]

Alpha value we'll try to see so here the

[152:09]

parameters will be Alpha C is basically

[152:13]

for uh logistic regression I'll show you

[152:16]

so the alpha value I will just mention

[152:18]

some values like

[152:20]

1e-5 that basically means

[152:27]

0.00001 similarly I can write

[152:31]

1e-10 that again means

[152:35]

0.0000000001 I'm just making fun

[152:38]

okay so that you will also get

[152:39]

entertained 1e-8 okay

[152:43]

similarly I can write

[152:45]

1e-3 from this particular value now

[152:48]

I'm increasing this value see

[152:50]

1e-2 and then probably I can

[152:53]

have 1 5 10 um 20 something like this so

[152:58]

I'm going to play with all this

[152:59]

particular parameters for right now

[153:01]

because in GridSearchCV what they do is

[153:03]

that they take all the combination of

[153:04]

this Alpha value and wherever your uh

[153:07]

your your model performs well it is

[153:09]

going to take that specific parameter

[153:11]

and it is going to give you that okay

[153:13]

this is the best fit parameter that is

[153:14]

got selected so here I have got all

[153:16]

these things now what I'm going to do

[153:18]

I'm going to basically apply the grid search

[153:19]

CV so here I have uh ridge uh sorry

[153:24]

ridge I'm

[153:25]

saying ridge_regressor so I'm going to

[153:28]

use GridSearch

[153:31]

CV GridSearchCV and here I'm basically going

[153:34]

to take the parameters ridge okay Ridge

[153:36]

is my first model and then I will take

[153:39]

up all this params that I have actually

[153:40]

defined see in GridSearchCV if I press shift

[153:43]

tab I have to first of all execute this

[153:46]

then only it will be able to press shift

[153:48]

tab so here if I press shift tab here

[153:50]

you'll be able to see estimator and

[153:52]

parameter grid is my second parameter

[153:54]

then scoring and then all the other

[153:56]

parameters so here the first thing that

[153:58]

goes is your model then your parameters

[154:00]

which what you are actually playing then

[154:03]

the third parameter is basically your

[154:05]

scoring

[154:06]

scoring and again here I'm going to use

[154:09]

negative mean squared error some people are

[154:10]

saying that mean squared error is not

[154:13]

present so that is the reason why

[154:15]

negative mean squared error is done why it

[154:18]

may not be present because

[154:20]

uh they try to always create a generic

[154:22]

Library probably this kind of uh scoring

[154:24]

parameter may also get used in other

[154:26]

algorithms so that is the reason they

[154:28]

may not have created but if you want to

[154:30]

Deep dive into it Google

[154:33]

Google then ridge_regressor dot fit on

[154:38]

X comma y again I'm telling you you can

[154:40]

first of all do train test split on X

[154:42]

and Y and then probably only do this on

[154:44]

X train and Y train parameter is not oh

[154:47]

sorry

[154:49]

okay I get this okay parameter is not

[154:53]

and why it is not and oh yeah it has

[154:56]

become a

[154:58]

list I'm going to make this as

[155:00]

dictionary right now I'm fully focused

[155:02]

on implementing things if I get an error

[155:04]

I'll definitely make sure that it'll get

[155:07]

fixed anyhow if I get that error I will

[155:09]

not say oh Krish why why this error came

[155:12]

you

[155:13]

know why this error came I I'll not get

[155:15]

worried I'll get the error down only you

[155:17]

cannot give this as the one okay so try

[155:21]

to understand okay so this is your GridSearch

[155:23]

CV I've also done the fit and let's go

[155:27]

and select the best parameter so what I

[155:28]

can do I will write print

[155:32]

ridge_

[155:35]

regressor dot

[155:37]

params sorry there will be a parameter

[155:39]

called as best_params_ I'm going to print

[155:42]

this and I'm going to print ridge_

[155:46]

regressor dot

[155:50]

best_

[155:51]

score_ so these are all the values that

[155:53]

are got selected one is Alpha is equal

[155:55]

to 20 and the best score is - 32 so

[155:58]

initially I got -37 but because of

[156:00]

Ridge regression you can see that our

[156:02]

negative mean square error has

[156:04]

definitely become better there is a

[156:06]

minus sign don't worry but from 37 it

[156:08]

has come to 32 cross validation guys

[156:11]

over here inside GridSearchCV also when it

[156:13]

is probably taking the entire

[156:15]

combination over there the CV Value

[156:17]

Cross validation also we can use

[156:20]

so probably if I am probably considering

[156:23]

all these

[156:24]

things many people have a question Krish

[156:27]

if this minus value increased that

[156:29]

basically means you cannot use Ridge

[156:31]

regression you are right in this

[156:33]

particular case Ridge regression is not

[156:34]

helping you out so guys let me again

[156:36]

write it down everybody don't worry yeah

[156:41]

previous I got minus 32 right now I'm

[156:43]

getting - 37

[156:45]

right sorry previously I got what - 37

[156:52]

- 37 now I got - 32 so here you can see

[156:56]

this I got it from linear regression

[156:59]

this I got it from what Ridge which one

[157:02]

should I select I should select this

[157:03]

model only because it is performing well

[157:05]

than this but again understand Ridge

[157:08]

also tries to reduce the overfitting so

[157:11]

probably in this particular scenario we

[157:12]

cannot use Ridge because the performance

[157:14]

is becoming worse so what I will do I

[157:17]

will go and try with lasso regression
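The Ridge + GridSearchCV flow just walked through can be sketched as below; the parameter grid must be a dict (passing a plain list raises the very error hit later in the session), and the diabetes data again stands in for the Boston set, so the best alpha and score will differ from the video's:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

data = load_diabetes()
X, y = data.data, data.target

ridge = Ridge()
# dict of hyperparameters to sweep -> GridSearchCV tries every combination
parameters = {"alpha": [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 100]}
ridge_regressor = GridSearchCV(ridge, parameters,
                               scoring="neg_mean_squared_error", cv=5)
ridge_regressor.fit(X, y)
print(ridge_regressor.best_params_)   # the alpha that gave the best CV score
print(ridge_regressor.best_score_)    # best (least negative) mean neg-MSE
```

Raising `cv` to 10 or widening the alpha grid is exactly the "play with the parameters" step the session demonstrates next.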

[157:20]

now I'll copy and paste the same thing

[157:22]

so linear model import lasso then this

[157:25]

will basically be my

[157:27]

lasso let's see with lasso whether it

[157:29]

will increase or not let's

[157:33]

see this is my parameter that got

[157:35]

selected now let me write lasso_

[157:38]

regressor

[157:39]

dot best_params_ so this is Alpha is

[157:42]

equal to 1 that got selected over here

[157:44]

I'm just going to print it okay and then

[157:47]

I'm going to print the lasso

[157:49]

regressor dot best_score_ so

[157:52]

here I'm actually getting - 35 - 35 here

[157:56]

I'm actually getting - 32 so minus 35

[157:59]

still I will focus on linear regression
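The Lasso variant is the same copy-paste swap, and, anticipating the explicit train/test split and R² evaluation the session turns to shortly, a consolidated sketch of that end-to-end flow (same stand-in dataset and assumed alpha grid as before) might look like:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_diabetes()
X, y = data.data, data.target
# hold out 33% as test data, as in the session
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lasso = Lasso()
parameters = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20]}
lasso_regressor = GridSearchCV(lasso, parameters,
                               scoring="neg_mean_squared_error", cv=5)
lasso_regressor.fit(X_train, y_train)   # tune only on the training data
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

y_pred = lasso_regressor.predict(X_test)  # GridSearchCV refits on the best alpha
print(r2_score(y_test, y_pred))           # note: r2_score expects (y_true, y_pred)
```

Tuning on the training split only and scoring on the untouched test split is the "best practice" ordering the instructor recommends.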

[158:01]

now see what will happen if I add more

[158:04]

parameters if I add more parameters see

[158:06]

what will happen so now I'm going to

[158:08]

take Alpha different different values

[158:10]

see this I'm just going to remove this

[158:13]

and probably add Alpha value in this

[158:16]

way see here I have added more values 5

[158:19]

10 20 30 35 40 45 100 okay let's see

[158:23]

whether we our performance will increase

[158:25]

or not so here

[158:28]

uh first of all let me remove from here

[158:32]

in Ridge just take it down guys I'm I'm

[158:35]

adding more parameters like this just

[158:36]

take it down yeah CV is equal to 5

[158:40]

nobody okay you're not able to see it um

[158:43]

CV is equal to 5 now here it is uh what

[158:46]

you can basically focus on so here you

[158:49]

can see I have added some values like

[158:51]

this you can also

[158:52]

add and just try to execute and now if I

[158:56]

go and probably see this is my see first

[158:59]

I have tried for Ridge I'm getting minus

[159:02]

29 do you see after adding more

[159:04]

parameters what happened in Ridge after

[159:07]

adding more parameters what happened in

[159:09]

Ridge you can see minus 29 and the

[159:12]

alpha value that is got selected is 100

[159:14]

if you want try with cross validation

[159:17]

10 and just try to execute now

[159:20]

now so these are some hyper

[159:22]

parameters that we will definitely play

[159:24]

with here you can see - 29 so here you

[159:27]

can see minus 29 you can also increase

[159:30]

the cross validation

[159:32]

value over here also and probably

[159:34]

execute it but with lasso I don't know

[159:38]

whether it is improving or not it is

[159:39]

coming to minus 34 you just have to play

[159:42]

with this parameters as now for a bigger

[159:45]

problem statement the thing is not

[159:47]

limited to here right we try to take

[159:49]

multiples and many parameters multiples

[159:52]

and many parameters and try to do these

[159:54]

things it is up to you we play with

[159:56]

multiple parameters whichever gives us

[159:58]

the best result we are basically taking

[160:00]

it it's okay error is increased I know

[160:03]

that no error is increasing definitely

[160:06]

error is increasing even though by

[160:08]

trying with different different

[160:09]

parameters but about most of the

[160:11]

scenario see here I got -37 probably

[160:14]

what I can actually do is that uh try to

[160:17]

get better one with respect to this

[160:20]

now the best way what I can also do is

[160:22]

that I can basically take up train and

[160:25]

test split also and probably do these

[160:27]

things let's see let's see one example

[160:29]

so how do we do train and test from

[160:32]

sklearn dot I think model_selection

[160:35]

import train test split okay it's okay

[160:38]

guys you may get a different value okay

[160:40]

let's do one thing okay let's make your

[160:42]

problem statement little bit simpler now

[160:45]

what I'm going to do just tell me in

[160:46]

train test split what we need to do so

[160:48]

I'm going to take the same code I'm

[160:50]

going to paste it over here or let me do

[160:52]

one thing let me insert a cell below and

[160:55]

let me do it for train test split so in

[160:57]

train test split what we can do so I'm

[161:00]

just going to take the syntax paste it

[161:02]

over here let's say that I'm taking X

[161:04]

train y train and then I'm using train

[161:07]

test split with 33% now if I execute

[161:10]

with respect to X train and Y train so

[161:12]

here is my you can see this I have

[161:13]

written this code from sklearn dot model

[161:15]

selection uh train test split random

[161:17]

State can be anything whatever you write

[161:20]

it is fine then you basically give X and

[161:22]

Y with test sizes 33 uh this is

[161:25]

basically saying that the test will have

[161:27]

33% and the train data will be 67% so

[161:31]

this is what I'm actually getting with

[161:33]

respect to X train and Y train here what

[161:35]

I'm going to do I'm going to basically

[161:37]

take X train comma y train and now if I

[161:40]

go and probably see this here you can

[161:41]

see minus 25 understand this value

[161:44]

should go towards zero if it is going

[161:47]

towards zero that basically means the

[161:49]

performance is better now similarly I do

[161:52]

it for Ridge in Ridge what I'm actually

[161:54]

going to do here I'm going to write X

[161:56]

train and Y train and if I go and

[161:58]

probably select the best score than this

[162:00]

here you'll be able to see I'm getting

[162:03]

how much I'm getting minus

[162:06]

25.47 okay here I'm getting

[162:09]

25.8 here 25.47 that basically means

[162:12]

now still the Improvement is little bit

[162:15]

bad because here we are not going

[162:17]

towards zero so the next part again here

[162:20]

also you can basically do it for X train

[162:22]

and Y train X train and Y train so here

[162:25]

you have this one and let's go and

[162:27]

execute this so here you can see minus

[162:30]

25.47 now what you can also do is that

[162:33]

you can use this

[162:35]

lasso regressor do predict and you can

[162:39]

basically predict with respect to X test

[162:42]

so this is your y test value suppose

[162:44]

let's say that this is my y_pred

[162:47]

then what I can do from SK

[162:50]

learn I will be using R square and

[162:53]

adjusted R square if you remember SK

[162:55]

learn R square r² so this is my R2 score

[163:00]

so where it is present in sklearn dot

[163:02]

metrics so I'm going to write from sk

[163:04]

learn import let's say I'm saying from

[163:08]

sklearn dot metrics import r2_score now

[163:14]

what I'm going to do over here I'm

[163:16]

basically going to say my R2 score which

[163:20]

is my variable I'll say this is nothing

[163:22]

but R2 score here I'm just going to give

[163:24]

my y PR comma Yore test so if I go and

[163:28]

probably see the output here I will be

[163:30]

able to see print R2 score this is all I

[163:34]

have discussed guys there is also

[163:37]

adjusted R square score is there where is R2

[163:41]

R2 score one adjusted r² okay R2 score

[163:46]

is there but adjusted R square should be

[163:48]

here somewhere in some manner so this is

[163:52]

how your output looks like with respect

[163:53]

to by using this lasso regressor okay

[163:56]

which is very good okay it should be I

[163:59]

told it should be near 100% right now

[164:01]

I'm getting 67% if I want to tie with

[164:04]

the ridge you can also try that so you

[164:06]

can say Ridge regressor do predict and

[164:10]

here you can see 68% then you can also

[164:12]

try linear regressor and

[164:16]

predict what is the error saying the

[164:19]

regression is not fitted yet why why it

[164:22]

is not fitted why it is not

[164:25]

fitted let's say that I have fitted here

[164:28]

linear

[164:30]

regression dot fit on X train and Y

[164:33]

train X train and comma y train so I'm

[164:37]

just going to fit it now if I go and

[164:40]

probably try to do the

[164:41]

calculation so if I go and see my R2

[164:44]

score it is also coming somewhere around

[164:46]

68% 67% now since this is just a linear

[164:50]

regression you won't be able to get 100%

[164:52]

because you're drawing a straight line

[164:53]

right so for that you basically have to

[164:56]

other use other algorithms like XG boost

[164:58]

and all n bias so many algorithms are

[165:01]

there it's okay see you give y test over

[165:04]

here y_pred over here both are same right

[165:06]

they're

[165:07]

comparing by see at one limit you can

[165:10]

you can increase the performance after

[165:12]

that you cannot see again I'm telling

[165:14]

you in linear regression what we do

[165:15]

these are my points right I will be only

[165:17]

able to create one best line I cannot

[165:19]

create a curve line right over here so

[165:21]

obviously my accuracy will be only

[165:23]

limited let's go and do it logistic

[165:26]

practical

[165:27]

quickly and here uh in logistic also we

[165:31]

can do GridSearchCV now what I'm actually

[165:34]

going to do first of all let's go ahead

[165:35]

with the data set so I will quickly

[165:38]

implement logistic so from sklearn dot

[165:41]

linear

[165:42]

model I'm going to import logistic

[165:46]

regression so I'm going to use logistic

[165:48]

regression and apart from that we know

[165:50]

that let's take a new data set because

[165:52]

for logistic we need to solve using

[165:54]

classification problem so this is

[165:56]

basically my logistic regression I'll

[165:58]

take one data set so from sklearn dot data

[166:01]

sets import we'll take a data set which

[166:03]

is like uh breast cancer data set so

[166:05]

that is also present in SK learn with

[166:07]

respect to the breast cancer data set

[166:09]

I'm just going to use this see load breast

[166:12]

cancer data set I'm loading it and all

[166:14]

the independent features are in data and

[166:16]

my columns are feature names the same

[166:18]

thing like how we did previously okay so

[166:20]

this will basically be my

[166:23]

complete uh complete independent feature
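The breast-cancer setup being described here — independent features from the bundled dataset, the target wrapped as a DataFrame column, and the class-balance check that follows — can be sketched as:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer()
X = pd.DataFrame(df.data, columns=df.feature_names)   # independent features
y = pd.DataFrame(df.target, columns=["Target"])       # dependent feature: 0/1 cancer label
print(X.head())
print(y["Target"].value_counts())   # counts per class -> is the dataset balanced?
```

`value_counts()` reports 357 ones against 212 zeros, which is why the session treats this as reasonably balanced and skips upsampling.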

[166:25]

so if I go and probably see this x. head

[166:28]

here you'll be able to see that based on

[166:31]

this input features the independent

[166:32]

feature we need to determine whether the

[166:34]

person is having cancer or not these are

[166:37]

some of the features over here and this

[166:39]

is like many many features are actually

[166:40]

present so next thing that was my

[166:43]

independent feature now I'll take my

[166:45]

dependent feature dependent feature will

[166:47]

already present in DF Target okay this

[166:50]

particular data set that we have taken

[166:52]

in DF in DF do Target we will basically

[166:55]

have all our dependent feature these are

[166:56]

my independent features so what I'm

[166:58]

actually going to do I'm going to create

[166:59]

Y and I'm going to say PD do data frame

[167:04]

and here I'm going to say DF do Target

[167:07]

Target and this column name should be

[167:11]

Target right so this will be my column

[167:13]

name and now if I go and see my y y is

[167:16]

basically having zeros and one in the

[167:18]

target feature now the next thing that

[167:20]

we are going to do is that uh apply

[167:23]

basically apply the first of all we need

[167:26]

to check whether this data set is uh

[167:29]

this particular y column is balanced or

[167:31]

imbalanced okay in order to do that I

[167:33]

will just write y

[167:35]

Target if the data set is imbalanced

[167:38]

definitely we need to work on that and

[167:40]

try to perform upsampling so if I write

[167:42]

y Target dot value_counts if I execute

[167:46]

this so here you'll be able to see that

[167:48]

value_counts will basically give that

[167:50]

how many number of ones are and how many

[167:52]

number of zeros are so now total number

[167:54]

of ones are 357 and total number of

[167:57]

zeros are 212 so is this an imbalanced

[168:01]

data set probably this is a balanced

[168:03]

data set so here I'm actually going to

[168:04]

now do train test split train test split I

[168:08]

will try to do again train test split how

[168:10]

do we do we can quickly do copy the same

[168:14]

thing entirely I'll copy this entirely

[168:16]

over here and then I will get my X and Y

[168:20]

so here is my X train X test y train y

[168:22]

test so train test split obviously I'll

[168:24]

be doing it now in logistic regression

[168:26]
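The data-loading, target-construction, and splitting steps described so far can be sketched as follows — a minimal sketch assuming scikit-learn's built-in breast-cancer data set, which is what the session appears to be using:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the breast-cancer data and build X (independent features)
# and y (dependent/target feature).
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=["Target"])

# Check whether the target column is balanced or imbalanced.
counts = y["Target"].value_counts()  # 357 ones vs. 212 zeros

# Hold out a test split for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

With 357 ones against 212 zeros the data set is reasonably balanced, so no upsampling is needed here.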

if I go and search for

[168:28]

logistic regression sklearn I will be

[168:31]

able to see this what all parameters are

[168:33]

there this is basically the L1 Norm or

[168:35]

L2 Norm or L1 regularization or L2

[168:37]

regularization with respect to whatever

[168:39]

things we have discussed in logistic and

[168:41]

then the C value these two parameter

[168:43]

values are very much important if I

[168:45]

probably show you over here the penalty

[168:49]

what kind of penalty whether you want to

[168:50]

add L2 penalty L1 penalty you can use L2

[168:53]

or L1 the next thing is C this is

[168:56]

nothing but inverse of regularization

[168:57]

strength this basically says 1 by Lambda

[169:01]

something like that this parameter is

[169:02]

also very much important guys class

[169:04]

weight suppose if your data set is not

[169:06]

balanced at that point of time you can

[169:09]

apply weights to your classes if

[169:11]

probably your data set is imbalanced you

[169:14]

can directly use class weight is equal

[169:16]

to balanced other than that you can use

[169:18]

other other weight which you basically

[169:19]

want so this is specifically some of

[169:22]

this right no this is not Ridge or lasso

[169:25]

okay this is logistic in logistic also

[169:28]

you have L1 norm and L2

[169:30]

Norms understand probably I missed that

[169:32]

particular part in the theory but here

[169:35]

also you have an L2 penalty norm and L1

[169:37]

penalty Norm I probably did not teach

[169:39]

you in theory because if you look see

[169:43]

logistic regression can be learned by

[169:45]

two different ways one is through

[169:47]

probabilistic method and one is through

[169:49]

geometric method if you go and probably

[169:51]

see my video that is present with

[169:52]

respect to logistic regression right now

[169:54]

in my YouTube channel there I have

[169:56]

explained you about this L1 and L2 Norms

[169:58]

also over there so in this also it is

[169:59]

basically present it is a kind of

[170:01]

penalty again just for uh using for this

[170:05]

kind of classification problem so what

[170:08]

I'm actually going to do let's go and

[170:10]

play with the parameters that I am

[170:12]

looking at so I will play with two

[170:14]

parameters one is params C value here

[170:17]

I'm defining 1 10 20 anything that you

[170:20]

can Define one set of values you can

[170:22]

Define and there was one more parameter

[170:24]

which is called as Max iteration this is

[170:26]

specifically for GridSearchCV okay that

[170:28]

I'm specifically going to apply so I

[170:30]

will just try to execute this this will

[170:32]

be my params now I'm going to quickly

[170:34]

Define my model one which will be my

[170:36]

logistic regression model so my logistic

[170:39]

regression here by default one value

[170:41]

I'll give for C and max iter let's say

[170:45]

I'm giving this value later on what I

[170:47]

will do for this model I'll apply it to

[170:49]

GridSearchCV so I'm just going to say

[170:51]

GridSearchCV and I'm going to apply it for

[170:55]

model one param grid is equal to params

[170:59]

this parameter that I'm specifically

[171:01]

trying to apply since this is a

[171:02]

classification problem and I am not

[171:04]

pretty sure that whether true positive

[171:06]

is important or true negative is

[171:08]

important I'm going to use F1 scoring

[171:10]

okay F1 scoring is basically again the

[171:13]

parametric term which we discussed

[171:14]

yesterday which is nothing but

[171:16]

performance metrics and then I'm going

[171:18]

to use CV is equal to 5 so this will be

[171:21]

entirely my model with respect to

[171:24]

GridSearchCV and I'll be executing this then I

[171:27]

will do model. fit on my X train and Y

[171:32]

train data so once I execute it here you

[171:34]

can see all the output along with

[171:36]

warnings a lot of warnings will be

[171:38]

coming I don't know because this many

[171:40]

parameters are there and finally you can

[171:42]

see that this has got selected now if

[171:44]

you really want to find out what is your

[171:46]

best param score model

[171:49]

dot best params so here you can see Max

[171:52]

iteration as

[171:54]

150 and what you can actually do with

[171:58]

respect to your best score model do best

[172:03]

score is 95 percentage but still we want

[172:06]
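Putting the grid search together end to end: the sketch below follows the session's recipe, though the exact parameter values (C of 1, 10, 20 and max_iter of 100, 150) are illustrative rather than the session's exact grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# C is the inverse of regularization strength (roughly 1/lambda);
# penalty picks L1 or L2 regularization (L2 is the default).
params = {"C": [1, 10, 20], "max_iter": [100, 150]}

model = GridSearchCV(
    LogisticRegression(),
    param_grid=params,
    scoring="f1",  # F1 is a safe default when we are unsure whether
    cv=5,          # true positives or true negatives matter more
)
model.fit(X_train, y_train)

best_params = model.best_params_  # which C / max_iter combination won
best_score = model.best_score_    # mean cross-validated F1
```

As in the session, the solver may emit convergence warnings for the smaller max_iter values on unscaled data; they do not stop the search.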

to test it with test data so can we do

[172:09]

it yes we can definitely do it I'll say

[172:11]

model dot score or I'll say model dot

[172:15]

predict on my X test data and this will

[172:18]

basically be my y pred so this will be my

[172:21]

y pred all the Y prediction that I'm

[172:23]

actually getting so if you go and see y

[172:26]

pred so these are my ones and zeros with

[172:28]

respect to the Y

[172:30]

prediction and finally after getting the

[172:32]

prediction values I can apply confusion

[172:35]

Matrix I hope I have taught you about

[172:36]

confusion matrix so from sklearn dot

[172:39]

confusion matrix sorry sklearn.metrics

[172:43]

I'm going to import confusion_matrix

[172:46]

classification report and the next thing

[172:49]

that I would like to do is this two I

[172:52]

will try to import confusion Matrix and

[172:54]

classification report now if you want to

[172:56]

see the confusion Matrix with respect to

[172:58]

your I can just write

[173:00]

y_pred or y_test whatever you want

[173:04]

go ahead with it and this is basically

[173:06]

my confusion Matrix if I put this

[173:09]

forward no difference will be there only

[173:11]

this thing will be moving that also I

[173:13]

showed you 63 118 3 and 4 now finally if

[173:17]

I want to accuracy score I can also

[173:19]

import accuracy score over here so here

[173:21]

you can see accuracy score is imported I

[173:23]

can also find out my accuracy score

[173:25]

which is my the total accuracy with

[173:28]

respect to this I we can give y test and

[173:31]

y_pred which we have discussed

[173:34]

yesterday this is giving

[173:35]

96% if you want detailed Precision

[173:38]

recall all the score then at that point

[173:40]

of time I can use this classification

[173:43]

report and here I can give y test

[173:45]

and y pred here is what I'm actually

[173:47]

getting so here you can see with respect

[173:50]

to F1 F1 score Precision recall since

[173:52]

this is a balanced data set obviously

[173:54]

the performance will be best yes you can

[173:57]
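The evaluation calls used here can be illustrated on a small hand-made prediction vector (toy labels for illustration, not the session's actual test split):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy ground truth and predictions for a binary problem.
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

cm = confusion_matrix(y_test, y_pred)  # rows = true class, cols = predicted
acc = accuracy_score(y_test, y_pred)   # fraction of correct predictions
report = classification_report(y_test, y_pred)  # precision/recall/F1 per class
```

Swapping the two arguments of confusion_matrix transposes the off-diagonal cells, which is why the transcript notes "only this thing will be moving".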

also use Roc see I'll also show you how

[174:00]

to use Roc and probably you'll be able

[174:01]

to see this you have to probably

[174:03]

calculate false positive rate true

[174:05]

positive rate but don't worry about Roc

[174:07]

I will first of all explain you the

[174:08]

theoretical part now let's go ahead and

[174:10]

discuss about naive Bayes naive Bayes is an

[174:13]

important algorithm so here I'm just

[174:16]

going to go ahead so now let's go ahead

[174:18]

and discuss about naive Bayes and here we

[174:21]

are going to discuss about the intuition

[174:23]

so naive Bayes is another amazing

[174:26]

algorithm which is specifically used for

[174:29]

classification and this specifically

[174:31]

works on something called as Bayes

[174:34]

theorem now what exactly is Bayes theorem

[174:36]

first of all we need to understand about

[174:38]

Bayes theorem let's say that guys I have

[174:41]

Bayes theorem let's say that I have an

[174:43]

experiment which is called as rolling a

[174:45]

dice now in rolling a dice how many number

[174:47]

of elements do I have so if I say what

[174:49]

is the probability of 1 then obviously

[174:51]

you'll be saying 1 by 6 if I say

[174:53]

probability of two then also here you'll

[174:55]

say 1 by 6 if I say probability of three

[174:58]

then I will definitely say it is 1 by 6 so

[175:01]

here you know that this kind of events

[175:04]

are basically called as independent

[175:06]

events now rolling a dice why it is

[175:08]

called as an independent event because

[175:10]

getting one or two in every experiment

[175:12]

one is not dependent on two two is not

[175:14]

dependent on three so they are all

[175:16]

independent that is the reason why we

[175:18]

specifically say is an independent event

[175:20]

but if I take an example of dependent

[175:22]

events let's consider that I have a bag

[175:24]

of marbles okay in this marble I

[175:28]

basically have three red marbles and I

[175:31]

have two green marbles now tell me what

[175:33]

is the probability of suppose I have a

[175:36]

event in the first event I take out a

[175:38]

red marble so what is the probability of

[175:40]

taking out a red marble so here you can

[175:43]

definitely say that it is

[175:44]

3 by 5 okay so this is my first event now

[175:47]

in the second event let's say that in

[175:49]

this you have taken out the red marble

[175:51]

now what is the second second time again

[175:53]

you are taking out the second red marble

[175:55]

or forget about the second red marble now

[175:57]

you want to take out the green marble

[175:59]

now what is the probability with respect

[176:01]

to taking out a green marble so here

[176:03]

you'll be definitely saying that okay

[176:05]

one red marble has been removed then the

[176:07]

total number of marbles that are left

[176:09]

are four so here you can definitely

[176:11]

write that probability of getting a

[176:12]

green marble is nothing but 2 by 4 which is

[176:14]

nothing but 1 by 2 so here what is

[176:16]

happening in the first event you took

[176:18]

out the first marble right from

[176:20]

the first event you took

[176:21]

out a red marble from the second event you

[176:23]

took out a green marble these

[176:25]

two are dependent events because

[176:28]

the number of marbles are getting

[176:29]

reduced as you take out from them so if

[176:32]

I tell you what is the probability of

[176:35]

taking out a red marble and then a green

[176:39]

marble so it's the simple the formula

[176:42]

will be very much simple right which we

[176:43]

have already discussed in stats it is

[176:45]

nothing but probability of probability

[176:47]

of red multiplied by probability of

[176:50]

green given Red so this specific thing

[176:53]

is called as conditional probability

[176:55]
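The marble numbers can be checked directly — 3 red and 2 green marbles, drawn without replacement:

```python
from fractions import Fraction

p_red = Fraction(3, 5)              # first draw: 3 red out of 5 marbles
p_green_given_red = Fraction(2, 4)  # second draw: 2 green out of 4 remaining

# P(red then green) = P(red) * P(green | red)
p_red_then_green = p_red * p_green_given_red
```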

here understand what is happening

[176:57]

probability of green marble given the

[176:59]

red marble event has occurred here both

[177:01]

the events are dependent now let me

[177:03]

write it down very nicely so I can write

[177:05]

probability of A and B is equal to

[177:08]

probability of a multiplied by probability

[177:12]

of B given a let's

[177:16]

go and derive something can can I write

[177:18]

probability of A and B is equal to

[177:21]

probability of b and a so answer is yes

[177:24]

we can definitely say we can definitely

[177:26]

say if you go and do the calculation

[177:27]

you'll be able to get the answer you

[177:29]

should not say no now what is the

[177:32]

formula for probability of A and B so

[177:34]

here you can basically write probability

[177:36]

of a multiplied by probability of B

[177:39]

given a if I take out probability of

[177:42]

green what is probability of green in

[177:43]

this particular case 2 by 5 what is

[177:46]

probability of red given green 3 by 4 right now

[177:49]

let's consider this now this part I can

[177:51]

definitely write as this part I can

[177:54]

definitely write as probability of B

[177:56]

multiplied by probability of a given B

[178:00]

this one is probability of

[178:02]

B and this one is probability of a

[178:04]

given B so I can definitely write this

[178:06]

much with respect to all this

[178:08]

information now can I derive probability

[178:10]

of a given B is equal to probability of a

[178:14]

multiplied by probability of B given a

[178:18]

divided by probability of B so probability

[178:21]

of a given B is probability of a times

[178:24]

probability of B given a divided by

[178:27]

probability of B and this is

[178:28]

specifically called as Bayes theorem and

[178:31]

this is the crux behind naive Bayes

[178:34]

understand this is the crux behind

[178:35]

naive Bayes now let's go ahead and

[178:38]
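Bayes' theorem, P(A|B) = P(A) · P(B|A) / P(B), can be sanity-checked on the same marble setup — taking A = red on the first draw and B = green on the second (labels chosen here for illustration):

```python
from fractions import Fraction

p_a = Fraction(3, 5)          # P(red first)
p_b_given_a = Fraction(2, 4)  # P(green second | red first)

# P(green second) by total probability over the first draw:
# red first then green, or green first then green.
p_b = Fraction(3, 5) * Fraction(2, 4) + Fraction(2, 5) * Fraction(1, 4)

# Bayes' theorem: probability the first draw was red, given green came second.
p_a_given_b = p_a * p_b_given_a / p_b
```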

let's discuss about how we are using

[178:40]

this to solve let's take some examples

[178:43]

and probably make you understand let's

[178:45]

say that I have some features like X1 X2

[178:49]

X3 X4 X5 like this till xn and I have my

[178:54]

output y so these are my independent

[178:56]

features these all are my independent

[178:58]

features these all are my independent

[179:00]

features so here I'm going to write

[179:02]

independent features and this is my

[179:04]

output feature which is also my

[179:05]

dependent feature now what is happening

[179:08]

if I say probability of b or a what does

[179:10]

this basically mean I need to really

[179:12]

find what is the probability of Y and

[179:15]

you know that guys I will have some

[179:17]

values over here and basically I'll have

[179:19]

some output value over here so based on

[179:21]

this input values I need to predict what

[179:23]

is the output initially on a training

[179:25]

data set I will have your input and then

[179:28]

your output initially my model will get

[179:30]

trained on this now let's consider what

[179:32]

this entire terminology is I will try to

[179:34]

write in terms of this equation so I

[179:36]

will say probability of Y given X1 comma X2 comma

[179:41]

X3 up till xn then this equation will

[179:44]

become probability of Y see probability

[179:46]

of Y given X X1 X2 X3 xn this a is

[179:50]

nothing but X1 X2 X3 xn and I'm trying

[179:52]

to find out what is the probability of Y

[179:54]

and then I will write probability of b b

[179:57]

is nothing but y but before that what

[180:00]

I'll write probability of a / B right a

[180:03]

given b or probability of B probability

[180:06]

of B is nothing but y multiplied by

[180:08]

probability of a given B probability of

[180:12]

a given B basically means probability of

[180:15]

X1 comma X2 comma Xn given B B is given

[180:20]

right so I'm able to find this entire

[180:22]

value now just a second I made some

[180:24]

mistakes I guess now it is correct sorry

[180:26]

I I just missed one term that is this

[180:29]

given y this is how it will become and

[180:32]

this will be equal to probability of a

[180:36]

that is X1 comma X2 like this up to Xn

[180:39]

so probability of Y multiplied by

[180:41]

probability of a given y now if I try to

[180:44]

expand this then this will basically

[180:46]

become something like this see

[180:48]

probability of Y multiplied by

[180:51]

probability of X1 given y

[180:56]

multiplied by probability

[181:00]

of X2 given y probability of x3 given Y

[181:06]

and like this it will be probability of

[181:08]

xn given y so this will also be y1 Y2 Y3

[181:12]

YN this I can expand it like this and

[181:15]

then this will basically become

[181:16]

probability of X1 multiplied by

[181:18]

probability of X2 multiplied by

[181:21]

probability of x3 like this up to

[181:23]

probability of xn so this is with

[181:26]

respect to all the probability y will be

[181:28]

different see here for this particular

[181:30]

record y will be different for this y

[181:32]

will be different for this y will be

[181:34]

different but why output it may be yes

[181:37]

or no right it may be yes or no okay I

[181:41]

I'll solve a problem it will make

[181:43]

everything understand and this will

[181:45]

probably be probability of Y it can be

[181:47]

binary multiclass whatever things you

[181:49]

want I'll solve a problem in front of

[181:51]
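Because the denominator is the same for every class, the factorized expression boils down to "prior times a product of per-feature likelihoods". A generic sketch (the numbers are made up for illustration):

```python
import math

def nb_score(prior, likelihoods):
    """Unnormalized naive Bayes score: P(y) * prod_i P(x_i | y)."""
    return prior * math.prod(likelihoods)

# Two classes, three features; whichever score is larger wins.
score_yes = nb_score(0.6, [0.5, 0.4, 0.9])  # 0.6 * 0.18 = 0.108
score_no = nb_score(0.4, [0.2, 0.7, 0.3])   # 0.4 * 0.042 = 0.0168
```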

you now let's say that I have my y as

[181:54]

let's say that I have a lot of features

[181:56]

X1 X2 X3 X X4 with respect to this let's

[182:02]

say in my one of my data set I have this

[182:03]

many x1s this many features and this is

[182:06]

my y so these are my feature number and

[182:09]

this is my y let's say that in y I have

[182:11]

yes or no so how I will probably write

[182:15]

we really need to understand this okay I

[182:17]

will basically

[182:18]

say what is the probability of Y is

[182:21]

equal to yes given this x of I this is

[182:25]

my first record first record of X of I

[182:27]

this is my second record of X of I so I

[182:30]

may write like this what is the

[182:31]

probability of Y being yes if x of I is

[182:34]

given to you X of I basically means X1

[182:37]

X2 X3 X4 so here you'll obviously write

[182:39]

what kind of equation you'll basically

[182:41]

say probability of yes multiplied by

[182:45]

probability of X1 given

[182:50]

yes multiplied by probability of X2

[182:53]

given yes probability of x3 given yes

[182:58]

and probability of X4 given yes divided

[183:03]

by probability of X1 multiplied by

[183:06]

probability of X2 multiplied by

[183:08]

probability of x3 multiplied by

[183:10]

probability of X4 Y is fixed it may be

[183:13]

yes or it may be no but with respect to

[183:15]

different different records this value

[183:17]

may change similarly if I write

[183:18]

probability of Y is equal to no given X

[183:22]

of I what it will be then it will be

[183:26]

probability of no multiplied by

[183:30]

probability of X1 given no then

[183:33]

probability of

[183:35]

X2 given

[183:37]

no probability of

[183:39]

x3 given

[183:42]

no and probability of X4 given no so

[183:46]

here because every any input that I give

[183:49]

any input X of I that I give I may

[183:51]

either get yes or no so I need to find

[183:53]

both the probability so probability of

[183:54]

X1 multiplied by probability of X2

[183:57]

multiplied by probability of x3

[183:59]

multiplied by probability of X4 see with

[184:02]

respect to Any X of I the output can be

[184:05]

yes or no and I really need to find out

[184:07]

the probabilities so both the formula is

[184:09]

written over here what is the

[184:11]

probability of with respect to yes and

[184:13]

what is the probability with respect to

[184:14]

no now in this case one common thing you

[184:17]

see that this this denominator is fixed

[184:20]

this is definitely fixed it is fixed it

[184:22]

is it is not going to change for both of

[184:24]

them and I can consider that this is a

[184:27]

constant so what I can do I can

[184:30]

definitely ignore so here I can

[184:32]

definitely ignore these things ignore

[184:34]

this also ignore this Al because see

[184:36]

this is constant so I don't want to

[184:38]

consider this in the next time I'll just

[184:40]

use this specific formula to calculate

[184:42]

the probability now let's say that if my

[184:46]

first probability for a specific data

[184:49]

set yes of X of I is let's say that I'm

[184:52]

getting

[184:53]

0.13 and similarly probability of no

[184:56]

with respect to X of I if I get

[185:00]

0.05 you know that in a binary

[185:02]

classification any values if it get

[185:04]

greater than or equal to 0.5 we are going

[185:06]

to consider it as 1 and if it is less

[185:09]

than 0.5 I'm going to consider it as

[185:10]

zero now I'm getting values like this 0.13

[185:13]

and 0.05 obviously I'm getting 0.13 and 0.05

[185:18]

so we do something called as

[185:20]

normalization it says that if I really

[185:23]

want to find out the probability of yes

[185:24]

given X of I if I do normalization it is

[185:27]

nothing but 0.13 divided by 0.13 plus

[185:31]

0.05 which is 0.72 this is nothing but

[185:35]

72% and similarly if I do for

[185:37]

probability of no given X of I here

[185:39]

obviously it will say 1 minus 0.72 which will

[185:42]

be your remaining answer that is 0.28

[185:44]

which is nothing but 28% so your final

[185:47]

answer will be this one this formulas

[185:49]
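The normalization step — turning 0.13 and 0.05 into roughly 72% and 28% — is just dividing each unnormalized score by their sum:

```python
score_yes, score_no = 0.13, 0.05

total = score_yes + score_no
p_yes = score_yes / total  # about 0.72
p_no = score_no / total    # about 0.28
```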

you have to remember now we'll solve a

[185:50]

problem let's solve a problem this will

[185:52]

be a very very interesting problem let's

[185:54]

say I have a data set which has like

[185:56]

this feature day let me just copy this

[185:59]

data set okay for you all now in this

[186:02]

data set I want to take out some

[186:04]

information let's take out Outlook

[186:08]

table now based on this output Outlook

[186:11]

feature see over here Outlook my day

[186:14]

outlook temperature humidity wind are

[186:17]

the input features independent feature

[186:19]

this is my output feature this one that

[186:22]

you are probably seeing play tennis is

[186:24]

my output feature which is specifically

[186:26]

a binary

[186:27]

classification so what I'm actually

[186:29]

going to do I'm basically going to take

[186:31]

my Outlook feature and based on this

[186:33]

Outlook feature I will just try to

[186:34]

create a smaller table which will give

[186:36]

some information now based on Outlook

[186:39]

first of all try to find out how many

[186:40]

categories are there in Outlook one is

[186:43]

sunny one is

[186:45]

overcast and one is rain right three

[186:48]

categories are there so I'm going to

[186:50]

write it down over here Sunny overcast

[186:53]

and rain so these three are my features

[186:56]

with respect to Sunny uh with Outlook I

[186:58]

have three categories one is sunny one

[187:00]

is overcast and one is rain here I'm going

[187:02]

to basically say with respect to Sunny

[187:05]

how many yes are there and how many no

[187:08]

are there and what is the probability of

[187:11]

yes and probability of no so I'm going

[187:13]

to again write it over here so this is

[187:16]

my Outlook feature

[187:18]

and then I have categories first yes no

[187:23]

Sunny overcast rain yes no then

[187:28]

probability of yes and probability of no

[187:31]

now the next thing that we need to find

[187:33]

out is that with respect to Sunny how

[187:37]

many of them are yes see yes we have so

[187:40]

when we have sunny over here the answer

[187:42]

is no so I will increase the count over

[187:44]

here one then again I have sunny again

[187:47]

answer is no so I'm going to increase

[187:49]

the count to two with this sunny this is

[187:52]

basically no okay so again I'm going to

[187:54]

increase the count to three now with

[187:56]

sunny how many of them are yes one and

[188:00]

two so I have this one and this one so I

[188:03]

have two so I'm going to say with

[188:05]

respect to Sunny I have two

[188:07]

yes understand Outlook is my X1 X1

[188:11]

feature let's consider now the next

[188:13]

thing is that let's see with respect to

[188:16]

overcost with overcast how many of them

[188:18]

are yes so this overcast is there yes 1

[188:22]

2 3 and four so total four yes are there

[188:26]

with respect to overcast then with

[188:28]

respect to overcast how many are on no

[188:31]

you can go ah and find out it is

[188:32]

basically zero NOS then with respect to

[188:35]

rain how many of them are yes so here

[188:37]

you can see with respect to rain yes

[188:40]

yes no no so this is nothing but 3 yes 2

[188:46]

no let's try to find out whether there are three

[188:47]

yes and two no

[188:48]

here also one more yes is there right

[188:52]

so 3 yes two nos so the total number of

[188:55]

yes and NOS if you count it there are

[188:58]

nine yes and five NOS this is my total

[189:01]

count so if you totally count this 9 + 5

[189:04]

is 14 you'll be able to compare that

[189:06]

there will be 9 yes and five NOS what is

[189:08]

the probability of yes when Sunny is

[189:10]

given so here you have 2 by 9 here you

[189:14]

have 4 by 9 here you have 3 by 9 now if I

[189:17]

say what is the probability of no given

[189:20]

Sunny now see probability of yes given

[189:23]

Sunny probability of yes given overcast

[189:26]

probability of yes given rain so it is

[189:28]

basically that I will just try to write

[189:30]

it in a simpler manner so that you'll

[189:31]

not get confused okay so this is my

[189:33]

probability of yes and this is my

[189:35]

probability of no but understand what

[189:37]

does this basically mean this

[189:39]

terminology basically means probability

[189:41]

of yes given Sunny probability of yes

[189:44]

given overcast probability of yes given

[189:46]

rain similarly what is probability of no

[189:49]

probability of no obviously you know

[189:50]

that 3 by 5 is my first probability then

[189:54]

you have 0 by 5 and then you have 2 by 5 now

[189:58]

with respect to the next feature let's

[190:00]

consider that I'm going to consider one

[190:01]

more feature and in this feature I will

[190:03]

say let's consider

[190:05]

temperature okay let's consider

[190:07]

temperature now in temperature how many

[190:10]

features I have or how many categories I

[190:12]

have I have hot you can see hot mild

[190:17]

and cool now with respect to hot mild

[190:19]

cool here also I will be having yes no

[190:23]

probability of yes and probability of no

[190:26]

now try to find out with respect to hot

[190:28]

how many are yes so here no is there

[190:31]

here also no is there two NOS uh 1 yes

[190:36]

uh 2 yes so two yes and two NOS probably

[190:39]

then similarly with respect to mild mild

[190:42]

how many are there 1 yes 1 no 2 yes 3 yes

[190:48]

4 yes so 4 yes and two nos okay so here you

[190:51]

basically go and calculate 4 yes and two

[190:54]

nos with respect to cool how many are

[190:57]

there with cool 1 yes 1 no 2 yes

[191:03]

3 yes so 3 yes and 1 no so here I have

[191:07]

specifically have 3 yes and 1 no again the

[191:10]

total number is 9 and five which will be

[191:12]

equal to the same thing that what we

[191:15]

have got now really go ahead with

[191:16]

finding probability of yes given hot so

[191:19]

it will be 2 by 9 over here then here it

[191:22]

will be how much 4 by 9 here it will be 3 by

[191:26]

9 again here what will be the

[191:28]

probability of no given given hot so

[191:31]

it'll be 2 by 5 2 by 5 1 by 5 so these two

[191:36]

tables has already been created and

[191:37]

finally with respect to play the total

[191:39]

number of plays are yes is 9 no is five

[191:44]

and the total is 14 if I say

[191:47]

what is the probability of yes only yes

[191:50]

then it is nothing but 9 by 14 what is the

[191:54]

probability of no it is nothing but

[191:56]

5 by 14 okay so these two values also you

[191:59]

require now let's say that you get a new

[192:02]
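Both frequency tables can be reproduced from the classic play-tennis data; the 14 rows below are the standard version of that data set, which matches the counts worked out in the session:

```python
from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Count (category, label) pairs and the label totals: 9 Yes, 5 No.
pair_counts = Counter(zip(outlook, play))
label_counts = Counter(play)

# Conditional probabilities read straight off the table.
p_sunny_given_yes = pair_counts[("Sunny", "Yes")] / label_counts["Yes"]        # 2/9
p_overcast_given_yes = pair_counts[("Overcast", "Yes")] / label_counts["Yes"]  # 4/9
p_sunny_given_no = pair_counts[("Sunny", "No")] / label_counts["No"]           # 3/5

# Priors: P(Yes) = 9/14, P(No) = 5/14.
p_yes = label_counts["Yes"] / len(play)
```

The same counting works for the temperature feature, or any other categorical column.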

data set you need get a new data set

[192:05]

let's say you get a new test data where

[192:08]

it says that suppose if you are having

[192:11]

sunny and hot tell me what is the output

[192:16]

so this is my problem statement so let

[192:18]

me write it down so here I will write

[192:20]

probability of yes given Sunny comma hot

[192:25]

then here I will write probability of

[192:27]

yes multiplied by probability of so here

[192:31]

I will write probability of Sunny given

[192:34]

yes multiplied by probability of hot

[192:38]

given yes divided by what is it

[192:42]

probability of Sunny multiplied by

[192:45]

probability of hot we can ignore this

[192:50]

denominator in the equation because it is a

[192:52]

constant because probability of no also

[192:55]

I'll be getting the same denominator value so

[192:58]

probability of yes I'm going to replace

[193:00]

it with 9

[193:02]

by 14 multiplied by 2 by 9 then probability

[193:06]

of hot given yes so I am going to get 2

[193:09]

by 9 so

[193:12]

here the 9s cancel so 9 by 14 times 2 by 9 is 2 by 14 that is 1 by 7 then this is

[193:17]

nothing but 2 by

[193:21]

63 which is 0.031 I read this statement a little bit

[193:23]

wrong it should be probability of Sunny

[193:25]

given yes now go ahead and calculate go

[193:28]

ahead and calculate what is probability

[193:30]

of no given sunny and hot so here you

[193:33]

have probability of no multiplied by

[193:36]

probability of Sunny given

[193:38]

no multiplied by probability of hot

[193:43]

given

[193:44]

no divided by probability of Sunny

[193:50]

multiplied by probability of hot this

[193:53]

will get cancelled denominator is a

[193:55]

constant guys this is a constant so what

[193:58]

is probability of no so probability of

[194:00]

no is nothing but 5 by 14 so I will write

[194:03]

over here 5 by 14 multiplied by

[194:07]

probability of Sunny given no what is

[194:09]

probability of Sunny given no

[194:11]

probability of Sunny given no is nothing

[194:13]

but 3 by 5 so here I'm going to

[194:15]

get 3 by 5 multiplied by probability of hot

[194:17]

given no that is nothing but 2 by 5 so 2 by

[194:22]

5 is here 3 by 5 is there five and five

[194:25]

will get cancelled so 5 by 14 times 3 by 5

[194:28]

is 3 by 14 times 2 by 5 and then I'm

[194:32]

getting 3 by 35 which is nothing but

[194:35]

if I use a calculator I'm actually getting

[194:37]

3 divided by 35 it's nothing but

[194:41]

0.0857 I will write it down again

[194:44]

probability of yes given Sunny comma hot

[194:49]

which is my independent feature is

[194:51]

nothing but

[194:52]

0.031

[194:54]

0.031 and this is probability of no given

[194:57]

Sunny comma hot 0.085 now we'll try to

[195:00]

normalize this 0.085 divided by 0.031

[195:06]

plus 0.085 which is 0.73 this is nothing but 73% and

[195:11]

here I can basically say 1 minus 0.73 which is

[195:14]

0.27 which is nothing but 27% if the

[195:18]

input comes as sunny and hot if the

[195:21]

weather is sunny and hot what will the

[195:23]

person do whether he will play or not

[195:26]

the answer is no okay now my next

[195:29]
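The sunny-and-hot example can be reproduced numerically from the table values derived earlier (P(Yes) = 9/14, P(Sunny|Yes) = 2/9, P(Hot|Yes) = 2/9, P(No) = 5/14, P(Sunny|No) = 3/5, P(Hot|No) = 2/5):

```python
# Unnormalized naive Bayes scores; the shared denominator is dropped.
score_yes = (9 / 14) * (2 / 9) * (2 / 9)  # 2/63, about 0.031
score_no = (5 / 14) * (3 / 5) * (2 / 5)   # 3/35, about 0.085

# Normalize so the two probabilities sum to one.
p_no = score_no / (score_yes + score_no)   # about 0.73 -> prediction is No
p_yes = 1 - p_no                           # about 0.27
```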

question will be that if your new data

[195:31]

is overcast and Mild now tell me what

[195:34]

will be the probability using naive Bayes

[195:37]

now you can add any number of features

[195:39]

let's say that I will say that okay

[195:42]

let's let's say that I will I will

[195:44]

probably say we can consider humidity

[195:47]

and wind also you basically create this

[195:49]

kind of table to find it out but this

[195:50]

will be an assignment just do

[195:53]

it overcast and Mild if it is with

[195:56]

respect to naive Bayes try to solve it so the

[195:58]

second algorithm that we are going to

[196:00]

discuss about is something called as KNN

[196:02]

algorithm KNN algorithm is a very simple

[196:05]

algorithm okay which can be used

[196:09]

to solve both classification and

[196:11]

regression so KNN basically means K

[196:14]

nearest neighbor let's first of all

[196:16]

discuss about classification problem

[196:18]

number one classification problem let's

[196:20]

say that I have a binary classification

[196:22]

problem which looks like this I have two

[196:23]

data points like this one and this is

[196:26]

another one suppose a new data point

[196:29]

suppose a new data point which comes

[196:31]

over

[196:32]

here then how do I say that whether this

[196:35]

belongs to this category or whether it

[196:36]

belongs to this category if I probably

[196:38]

create a logistic regression I may

[196:40]

divide a line but in this particular

[196:42]

scenario how do we Define or how do we

[196:44]

come to a conclusion that

[196:47]

whether this will belong to this

[196:48]

category or this category so for here we

[196:50]

basically use something called as K

[196:52]

nearest neighbor let's say that I say

[196:55]

that my K value is five so what it is

[196:57]

going to do it is going to basically

[196:58]

take the five nearest closest point

[197:01]

let's say from this you have two nearest

[197:03]

closest point and from here you have

[197:05]

three nearest closest point so here we

[197:07]

basically see from the distance the

[197:09]

distance that which is my nearest point

[197:11]

now in this particular case you see that

[197:13]

maximum number of points are from Red

[197:15]

categories from Red from Red categories

[197:18]

I'm getting three points and from White

[197:21]

categories I'm getting two points now in

[197:23]

this particular scenario maximum number

[197:25]

of categories from where it is coming we

[197:27]

basically categorize that into that

[197:29]

particular class just with the help of

[197:30]

distance which all distance we

[197:31]

specifically use we use two distance one

[197:33]

is Euclidean distance and the other one is

[197:36]

something called as Manhattan distance

[197:37]

so Euclidean and Manhattan distance now what

[197:40]

does Euclidean distance basically say suppose

[197:42]

if these are your two points which are

[197:44]

denoted by X1 y1

[197:47]

x2, y2 Euclidean distance in order to

[197:50]

calculate we apply a formula which looks

[197:52]

like this square root of (x2 - x1)² + (y2 - y1)² whereas

[197:58]

in the case of Manhattan distance suppose

[198:00]

these are my two points then we calculate

[198:03]

the distance in this way we calculate

[198:05]

the distance from here then here right

[198:07]

this is the distance we calculate we

[198:09]

don't calculate the hypotenuse distance

[198:10]

so this is the basic difference between

[198:11]

Euclidean and Manhattan distance now you may be
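The two distance measures just contrasted can be written as a short sketch (the sample points here are illustrative, not from the lecture):

```python
import math

# Euclidean distance: the straight-line (hypotenuse) distance between two points.
def euclidean(p, q):
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Manhattan distance: sum of the absolute horizontal and vertical distances,
# i.e. the two legs of the triangle rather than the hypotenuse.
def manhattan(p, q):
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

print(euclidean((1, 1), (4, 5)))  # 5.0 (a 3-4-5 right triangle)
print(manhattan((1, 1), (4, 5)))  # 7
```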

[198:14]

thinking Krish then fine that is for

[198:15]

classification problem for regression

[198:17]

what do we do for regression also it is

[198:19]

very much simple suppose I have all the

[198:22]

data points which looks like this now

[198:24]

for a new data point like this if I want

[198:26]

to calculate then we basically take up

[198:28]

the nearest Five Points let's say my K

[198:30]

is five K is a hyperparameter which we

[198:33]

play with now suppose let's say that it

[198:35]

finds the nearest point over here here

[198:38]

here here and here so if we need to find

[198:42]

out the point for this particular output

[198:44]

with respect to the K is equal to 5 it

[198:46]

will try to calculate the average of all

[198:48]

the points once it calculates the

[198:51]

average of all the points that becomes

[198:53]

your output so regression and

[198:55]

classification that is the only

[198:56]

difference because this K is actually a

[198:58]

hyperparameter we try with K is equal

[199:00]

to 1 to 50 and then we probably try to

[199:03]

check the error rate and if the error

[199:06]

rate is less then only we select the

[199:08]

model now two more things with respect
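Both KNN modes described above (majority vote for classification, average of the neighbours' values for regression) fit in one minimal from-scratch sketch; the sample points are illustrative, not from the lecture.

```python
import math
from collections import Counter

def knn_predict(train, query, k=5, mode="classify"):
    # Sort training points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    labels = [label for _, label in nearest]
    if mode == "classify":
        # Majority vote among the k nearest neighbours.
        return Counter(labels).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return sum(labels) / len(labels)

# Illustrative data: three "red" points near the query, two "white" farther out.
points = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
          ((5, 5), "white"), ((6, 5), "white")]
print(knn_predict(points, (1.5, 1.5), k=5))  # red
```

In practice K is tuned exactly as described: try a range of values, measure the error on held-out data, and keep the K with the lowest error.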

[199:10]

to K nearest neighbor K nearest neighbor

[199:12]

works very bad with respect to two

[199:15]

things one is outliers and one is

[199:17]

imbalanced data set now if I have an

[199:19]

outlier let's say I have an outlier over

[199:22]

here this is one of my categories like

[199:24]

this and this is my another category

[199:26]

let's consider that I have some outliers

[199:28]

which looks like this now if I'm trying

[199:29]

to find out the point for this you can

[199:32]

see that the nearest point is basically

[199:35]

blue only and it belongs to the blue

[199:37]

category but because this outlier you

[199:39]

know it'll consider that the nearest

[199:40]

neighbor is this so then this will be

[199:42]

basically treated in this group only

[199:44]

formula for Manhattan distance it uses

[199:46]

modulus: |x2 - x1| + |y2 - y1| that is mod of x2 - x1

[199:53]

plus mod of y2 - y1 uh this was it from my side guys

[199:55]

and yes I've also made detailed videos

[199:57]

about whatever topics we have discussed

[199:59]

today you can directly go and search for

[200:01]

that particular

[200:03]

topic so this is the agenda of this

[200:06]

session we will try to complete this all

[200:08]

things again here we are going to

[200:10]

understand the mathematical equations

[200:12]

and all uh so today's session we are

[200:14]

basically going to discuss about uh

[200:16]

decision tree okay and uh in this

[200:20]

session we are going to basically

[200:21]

understand what is the exact purpose of

[200:23]

decision tree with the help of decision

[200:25]

tree you are actually solving two

[200:27]

different problems one is regression and

[200:30]

the other one is

[200:32]

classification so we'll try to

[200:34]

understand both this particular part

[200:37]

well we will take a specific data set

[200:38]

and try to solve those problems now

[200:40]

coming to the decision tree one thing

[200:42]

you need to understand I'll say that if

[200:45]

age is less than 18 let's say I'm writing

[200:48]

this condition if age is less than or

[200:51]

equal to 18 I'm going to say print go to

[200:55]

college here I'm printing print college

[200:58]

and then I'll write else if age is

[201:02]

greater than 18 and age is less than or

[201:05]

equal to 35 I'll say print work then

[201:09]

again I'll write else if age is let me

[201:12]

let me put this condition little bit

[201:14]

better then I'll write here elif if age

[201:17]

is greater than 18 and age is less than

[201:22]

or equal to 35 I'm going to say print

[201:25]

work basically people needs to work in

[201:27]

this age else I'm just going to consider

[201:30]

print retire so here is my if-else
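The nested condition the lecture writes out can be run directly; the function name here is just illustrative.

```python
# The lecture's nested if/else, wrapped in a function so it can be reused.
def life_stage(age):
    if age <= 18:
        return "college"   # go to college
    elif age > 18 and age <= 35:
        return "work"      # people need to work in this age range
    else:
        return "retire"    # everyone else

print(life_stage(15), life_stage(25), life_stage(60))  # college work retire
```

Each branch of this code corresponds to one path from the root node to a leaf node in the decision tree drawn next.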

[201:34]

condition over here now whenever we have

[201:36]

this kind of nested if-else condition

[201:38]

what we can do is that we can also

[201:40]

represent this in the form of decision

[201:42]

trees we'll also we can actually form

[201:45]

this in the form of a decision tree and the

[201:46]

decision tree here first of all we will

[201:48]

have a specific root node let's say this

[201:51]

is my root node now in this root node

[201:52]

the first condition is less than or

[201:54]

equal to 18 so here obviously I will be

[201:56]

having two conditions saying that if it

[201:59]

is less than or equal to 18 and one

[202:02]

condition will be yes one condition will

[202:03]

be no so if this is yes and if this is

[202:06]

no right if this condition is true that

[202:09]

basically means we'll go in this side if

[202:11]

it is true then here we will basically

[202:14]

have something like college so this is

[202:17]

your Leaf node similarly when I have no

[202:22]

okay no no in this particular case we

[202:24]

will go to the next condition in this

[202:26]

next condition I will again create a

[202:28]

node and I'll say that okay this is less

[202:30]

than 18 and greater than sorry less than

[202:33]

or equal to 35 so if this is also there

[202:38]

then again I'll have two conditions

[202:39]

which is basically yes or no now when I

[202:42]

create this yes or no over here you'll

[202:43]

be able to see that basically means here

[202:46]

again two condition will be there if it

[202:48]

is yes I will say print work so this

[202:50]

will again be my leaf

[202:52]

node and again for no again I will do

[202:55]

the further splitting which is retire so

[202:59]

here you can see that this entire

[203:00]

algorithm this entire code that I have

[203:02]

actually written you can see that it has

[203:05]

got converted to this kind of

[203:08]

trees where you specifically able to

[203:10]

take decisions yes or no so can we solve

[203:15]

a classification

[203:17]

problem sorry this is greater than 18

[203:21]

again if it is greater than 18 and less

[203:23]

than or 35 so can we solve a

[203:28]

regression and a classification problem

[203:31]

regression and classification problem

[203:34]

using this decision trees by creating

[203:37]

this kind of

[203:38]

nodes so in short whenever we talk about

[203:41]

decision

[203:42]

trees whenever we talk about decision

[203:45]

trees

[203:47]

you will be seeing that decision trees

[203:49]

are nothing but decision trees are

[203:52]

nothing but by using this nested if-else

[203:56]

condition we can definitely solve some

[203:58]

specific problem statement but here in

[204:00]

the visualized way we will specifically

[204:02]

create this decision tree in the form of

[204:04]

nodes now you need to understand that

[204:07]

what type of maths we will probably use

[204:10]

okay so let's do one thing let's take a

[204:12]

specific data set which I will

[204:14]

definitely do it over here in front of

[204:15]

you

[204:17]

okay and we will try to solve this

[204:18]

particular data set and this will

[204:20]

basically give you an idea like how we

[204:23]

can probably solve these problems so uh

[204:26]

let me just open my snippet tool so this

[204:29]

is my data set that I have let's

[204:31]

consider that I have this specific data

[204:33]

set now this data set are pretty much

[204:35]

important because this probably in

[204:39]

research papers also probably people who

[204:41]

have come up with this algorithm they

[204:43]

usually take this they take this thing

[204:46]

but but right now this particular

[204:47]

problem statement if I talk about this

[204:49]

is a classification problem statement

[204:51]

okay but don't worry I will also help

[204:53]

you to explain I'll also explain you

[204:56]

about regression also how decision tree

[204:58]

regression will definitely work so let's

[205:01]

go ahead and let's try to understand

[205:03]

suppose if I have this specific problem

[205:05]

statement how do we solve this this is

[205:07]

my output feature play tennis yes or no

[205:10]

okay whether the person is going to play

[205:12]

tennis or not tomorrow or the day after

[205:14]

tomorrow or whenever you want so if I

[205:17]

have this input features like Outlook

[205:19]

temperature humidity and wind is the

[205:22]

person going to play tennis or not this

[205:24]

is what my model should predict with the

[205:26]

help of decision tree so how decision

[205:28]

tree will work in this particular case

[205:29]

first of all let's consider any any any

[205:33]

specific uh feature let's say that

[205:35]

Outlook is my feature so this will be my

[205:37]

first

[205:38]

feature which is specifically Outlook

[205:41]

now just tell me how many are basically

[205:45]

having no and how many are basically

[205:48]

having yes in the case of Outlook there

[205:51]

you'll be able to find out there are

[205:52]

nine yes see 1 2 3 4 5 6 7 8 9 and how

[205:58]

many NOS are there 1 2 3 4 5 I think 1 2

[206:04]

3 4 5 so nine yes and five NOS what we

[206:09]

are going to do in this specific thing

[206:11]

now we have nine yes and five nos and the

[206:13]

first node that I have actually taken

[206:17]

is basically Outlook so Outlook feature

[206:20]

now just try to find out we are focusing

[206:22]

on this specific feature now in this

[206:24]

feature how many categories I have I

[206:26]

have one Sunny category you can see over

[206:29]

here I have Sunny one category then I

[206:31]

have another category called as

[206:33]

overcast then I have another category as

[206:37]

rain so I have three unique categories

[206:40]

So based on these three categories I

[206:42]

will try to create three nodes so here

[206:45]

is my one node here is my second node

[206:49]

here is my third node so these are my

[206:52]

three categories so this category is

[206:53]

basically called as Sunny this category

[206:57]

is basically called as overcast and this

[207:00]

category is basically called as rain

[207:03]

based on these three categories so I'm

[207:04]

splitting it now just go ahead and see

[207:07]

in Sunny how many yes and how many no

[207:10]

are there how many yes with respect to

[207:12]

Sunny are there see in sunny I have two

[207:14]

NOS see one and two no uh one more no is

[207:18]

there three NOS so here you can see this

[207:21]

is my one no then this is my two no this

[207:25]

is my three no and yes are two so this

[207:30]

one and this one so how many total

[207:33]

number of yes so here you can see that

[207:36]

there are 1 2 2 yes and three no let's

[207:41]

say that I have randomly selected one

[207:43]

feature which is Outlook why can't I

[207:45]

when like see it is up to it it is up to

[207:49]

the decision tree to select any of the

[207:51]

feature here I have specifically taken

[207:53]

Outlook later on I'll explain why it it

[207:57]

can basically select how it selects the

[207:59]

feature okay I'll I'll talk about it

[208:00]

don't worry so in the Outlook we have

[208:04]

two yes sorry in the case of Sunny we

[208:06]

have two yes and three NOS now the next

[208:08]

thing is that let's go and see for

[208:10]

overcast in overcast I have 1 yes uh 2 yes

[208:14]

um 3 yes and 4 yes I don't have any no in

[208:18]

overcast so over here my thing will be

[208:21]

that four yes and zero nos and then

[208:24]

finally when we go to the Rain part see

[208:26]

in Rain how many features are there in

[208:29]

rain if you go and probably see it how

[208:31]

many number of yes and NOS are there go

[208:33]

and see in one one yes in row rain two

[208:36]

yes then one no then again you have one

[208:39]

yes and one no right so here you can

[208:43]

basically say that in rain in the case

[208:45]

of rain if I take it as an example how

[208:47]

many number of yes and NOS are there it

[208:49]

will be 3 yes and two

[208:52]

nos if you understand this

[208:57]

algorithm then everything you'll be

[209:00]

able to understand now let's go ahead

[209:03]

and try to see this for sunny sunny

[209:05]

definitely has 2 yes and three NOS this

[209:08]

has four yes and zero NOS here you have

[209:10]

three yes and two nos now if I probably
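The per-category tally just done by hand can be checked in a few lines. The rows below are illustrative, constructed only to match the counts read off above (sunny 2 yes/3 no, overcast 4 yes/0 no, rain 3 yes/2 no), not the full original play-tennis table.

```python
from collections import Counter

# Outlook values paired with the play-tennis label, matching the lecture's counts.
rows = [("sunny", "no")] * 3 + [("sunny", "yes")] * 2 \
     + [("overcast", "yes")] * 4 \
     + [("rain", "yes")] * 3 + [("rain", "no")] * 2

counts = Counter(rows)
for outlook in ("sunny", "overcast", "rain"):
    print(outlook, counts[(outlook, "yes")], "yes /", counts[(outlook, "no")], "no")
# sunny 2 yes / 3 no
# overcast 4 yes / 0 no   <- pure split
# rain 3 yes / 2 no
```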

[209:13]

take overcast here you need to

[209:15]

understand understand about two things

[209:17]

one is pure

[209:18]

split and one is impure split now what

[209:22]

does pure split basically mean pure split

[209:25]

basically means that now see in this

[209:26]

particular scenario in overcast in

[209:29]

overcast I have either yes or no so here

[209:32]

you can see that I have four yes and zero

[209:35]

NOS so that basically means this is a

[209:37]

pure split anybody tomorrow in my data

[209:40]

set if I just take this Outlook feature

[209:43]

suppose in one day in day 15 the Outlook

[209:46]

is Outlook is basically overcast then I

[209:50]

know directly it is the person is going

[209:52]

to play so this part is already created

[209:54]

and this node is called as pure

[209:58]

node understand this why it is called as

[210:00]

pure node because either you have all

[210:03]

yes and zero nos or zero yes and all nos

[210:08]

like that in this particular case I have

[210:10]

all yes so if I take this specific path

[210:13]

I know that with respect to overcast my

[210:16]

final decision which is yes it is always

[210:17]

going to become yes so this is what it

[210:19]

basically says so I don't have to split

[210:22]

further so from here I will probably not

[210:25]

split I will definitely not split more

[210:28]

because I don't require it because I

[210:31]

have it is a pure leaf node okay you can

[210:34]

also say that this is a pure leaf node

[210:37]

so I'm just going to mention it again

[210:39]

this one I'm specifically talking about

[210:41]

now let's talk about sunny in the case

[210:43]

of Sunny you have two yes and three NOS

[210:45]

so this is obviously impure so what we

[210:48]

do we take next feature and again how do

[210:52]

we calculate that which feature we

[210:54]

should take next I'll discuss about it

[210:56]

let's say that after this I take up

[211:00]

temperature I take up temperature and I

[211:02]

start splitting again since this is

[211:04]

impure okay and this split will happen

[211:08]

until we get finally a pure split

[211:11]

similarly with respect to rain we will

[211:13]

go ahead and take another feature and

[211:15]

we'll keep on splitting unless and until

[211:18]

we get a leaf node which is completely

[211:21]

pure I hope you understood how this

[211:23]

exactly work now two questions two

[211:27]

questions is that Kish the first thing

[211:29]

is that how do we calculate this

[211:32]

Purity and how do we come to know that

[211:35]

this is a pure split just by seeing

[211:38]

definitely I can say I can definitely

[211:41]

say by just seeing that how many number

[211:43]

of yes or NOS are there based on that I

[211:45]

can definitely say it is a pure split or

[211:47]

not so for this we use two different

[211:50]

things one is

[211:53]

entropy and the other one is something

[211:55]

called as Gini coefficient so we will

[211:58]

try to understand how does entropy work

[212:01]

and how does Gini coefficient work in

[212:04]

decision tree which will help us to

[212:06]

determine whether the split is pure

[212:09]

split or not or whether this node is

[212:11]

leaf node or not then coming to the

[212:13]

second thing okay coming to the second

[212:16]

thing one is with respect to Purity

[212:18]

second thing your first most important

[212:20]

question which you had asked why did I

[212:22]

probably select Outlook how the features

[212:24]

are selected and here you have a topic

[212:27]

which is called as Information Gain and

[212:29]

if you know this both your problem is

[212:32]

solved so now let's go ahead and let's

[212:35]

understand about entropy or Gini

[212:38]

coefficient or Information Gain entropy

[212:40]

or Gini coefficient oh sorry Gini

[212:42]

coefficient I'm saying Gini impurity

[212:44]

also you can say over here

[212:46]

I'll write it as Gini impurity not

[212:48]

coefficient also I'll just say it as

[212:50]

Gini impurity but I hope everybody has

[212:53]

understood till here let's go ahead and

[212:55]

let's discuss about the first thing that

[212:57]

is

[212:58]

entropy how does entropy work and how we

[213:01]

are going to use the formula so entropy

[213:04]

here I will just write Gini so we are

[213:07]

going to discuss about this both the

[213:09]

things let's say that the entropy

[213:12]

formula which is given by I will write h

[213:14]

of s is equal to so h of s is equal to

[213:17]

minus P plus I'll talk about what is

[213:20]

minus what is p plus log base 2 p

[213:26]

plus minus p

[213:28]

minus log base 2 p minus so this is the

[213:32]

formula and in Gini impurity the

[213:34]

formula is 1 minus summation of i equal

[213:39]

1 to n of p i squared I'll even talk about when you

[213:43]

should use Gini impurity when you

[213:44]

should not use Gini impurity

[213:46]

when you should use entropy you know by

[213:48]

default the decision tree regression or

[213:51]

classific sorry decision tree

[213:53]

classification uses Gini impurity now
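Both impurity measures just defined fit in a short sketch; the worked numbers (3 yes/0 no as a pure split, 3 yes/3 no as a completely impure split) come from the example below.

```python
import math

# Entropy H(S) = -p+ log2(p+) - p- log2(p-); the 0 * log(0) term is taken as 0.
def entropy(n_yes, n_no):
    total = n_yes + n_no
    h = 0.0
    for count in (n_yes, n_no):
        p = count / total
        if p > 0:
            h -= p * math.log2(p)
    return h

# Gini impurity = 1 - sum of p_i squared over the classes.
def gini(n_yes, n_no):
    total = n_yes + n_no
    return 1 - (n_yes / total) ** 2 - (n_no / total) ** 2

print(entropy(3, 0))  # 0.0 -> pure split
print(entropy(3, 3))  # 1.0 -> completely impure split
print(gini(3, 3))     # 0.5 -> Gini's maximum for two classes
```

As the lecture notes, scikit-learn's `DecisionTreeClassifier` does default to `criterion="gini"`; pass `criterion="entropy"` to switch.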

[213:56]

let's take one specific example so my

[213:58]

example is that I have a feature one my

[214:00]

root node I have a feature one which is

[214:03]

my root node and let's say that in this

[214:05]

root node I have six yes and three NOS

[214:08]

very simple let's say that this has two

[214:11]

categories and based on this two

[214:13]

categories of split has happened that is

[214:16]

a C1 let's say in this I have 3 yes 3 nos

[214:20]

and here I have 3 yes 0 nos and this is my

[214:24]

second category always understand if I

[214:26]

do the summation 3 yes and 3 yes is 6 yes see

[214:30]

this summation if I do 3 + 3 is

[214:33]

obviously 6 and 3 + 0 is obviously 3 so this

[214:36]

you need to understand based on the

[214:38]

number of root nodes only almost it'll

[214:40]

be same now let's go ahead and let's

[214:44]

understand how do we calculate let's

[214:46]

take this example how do we calculate

[214:48]

the entropy of this so I have already

[214:50]

shown you the entropy formula over here

[214:52]

now let's understand the components I

[214:55]

will write h of s is equal to minus sign

[214:59]

is there what is p+ p+ basically means

[215:03]

that what is the probability of yes what

[215:07]

is the probability of yes this is a

[215:10]

simple thing for you all out of this

[215:13]

what is the probability of yes yes out

[215:16]

of this so obviously how you'll write if

[215:19]

you want to find out the probability of

[215:20]

yes out of this see when I say plus that

[215:24]

basically means yes when I say minus

[215:27]

that basically means no so what is the

[215:29]

probability of yes so it is nothing

[215:32]

but yes plus and minus are specifically

[215:35]

for binary

[215:37]

class this can be positive negative so

[215:40]

the probability with respect to yes can

[215:42]

I write 3 by 3 only for this what is the

[215:45]

probability out of this total number of

[215:48]

this is there 3 by 3 similarly if I go and

[215:51]

see the next term log to the base 2 p+

[215:54]

so again if I go ahead and write over

[215:56]

here log to the base 2 p+ p+ is again

[216:03]

3 by 3 so then again we have minus and this

[216:07]

is now P minus what is p minus 0 by 3

[216:11]

log base 2 0 by 3 this obviously will

[216:15]

become zero this will obviously become 0

[216:18]

because 0 divided by anything is zero what

[216:21]

will this be 1 times log base 2 of 1 what is

[216:25]

this this is nothing but zero log to the

[216:28]

base 2 of 1 is nothing but zero tell me

[216:31]

whether this is a pure split or impure

[216:35]

split so this is a pure split whenever

[216:38]

we have a pure split the answer of the

[216:41]

entropy is going to come to zero so here

[216:44]

I'm going to Define one graph

[216:46]

this is H of s and let's say this is p+

[216:49]

or P minus if my probability of plus see

[216:53]

when I say probability of plus is 0.5

[216:56]

what will be probability of minus it

[216:57]

will also be 0.5 right because it's

[217:01]

just like P is equal to 1 - Q right if p

[217:04]

is 0.5 then Q will be 1 - P same thing

[217:07]

right so when it

[217:09]

is 0.5 obviously my h of s will be 1 let's

[217:14]

say so this is this is the graph that

[217:16]

will basically get formed let's go ahead

[217:19]

and try to calculate the entropy of this

[217:21]

guys what will be the entropy of this

[217:24]

node so here I'm going to just make a

[217:26]

graph h of s minus what is p+ p+ is

[217:31]

nothing but 3 by 6 log base 2 3 by 6

[217:37]

minus three nos are there 3 by 6 log base 2

[217:43]

3 by 6 so if you compute this

[217:46]

log base 2 of 2 to the power of 1 if you do the

[217:50]

calculation here I'm actually going to

[217:52]

get one so when I'm getting one when I'm

[217:55]

actually getting one when you have three

[217:57]

yes and three NOS what is the

[217:59]

probability it is 50/50% right so when

[218:02]

your p+ is 0.5 that basically means your h

[218:06]

of s is coming as one so from this graph

[218:09]

you can see that I'm getting one if this

[218:11]

is zero this is one this is zero and

[218:13]

this is one I hope everybody is able to

[218:15]

to understand guys zero and one if your p+

[218:20]

is

[218:21]

zero or if your p+ is one that basically

[218:24]

means it becomes a pure split so in h of

[218:26]

s you are going to get

[218:29]

zero so always understand your entropy

[218:33]

will be between 0 to

[218:36]

1 if I have an impure split this is a

[218:39]

completely impure split because here you

[218:42]

have 50% probability of getting yes 50%

[218:45]

probability of getting no h ofs is

[218:48]

entropy this is entropy for the sample H

[218:52]

ofs notation that I'm using is H ofs so

[218:56]

if whenever the split is happening the

[218:59]

first thing is done the purity test the

[219:02]

purity test is done with the help of

[219:04]

entropy right now I'll also show Gini

[219:07]

Gini impurity don't worry so with the

[219:09]

entropy you'll be able to find if I am

[219:11]

getting one that basically means it is a

[219:14]

impure split and if I'm getting zero it

[219:18]

is pure split so this is the graph okay

[219:22]

this is the graph and this graph is

[219:24]

basically the entropy graph again

[219:26]

understand if your probability of

[219:28]

getting yes or no is 0.5 that basically

[219:30]

means 50/50 is there three yes and three nos

[219:34]

then your entropy is going to be 1 h of

[219:37]

s if your probability is completely one

[219:39]

that basically means either you're

[219:40]

getting completely yes or completely no

[219:43]

so your your entropy will be zero that

[219:46]

basically means it is pure split so in

[219:48]

the case of probability .5 you're

[219:50]

getting plus one then it'll keep on

[219:52]

reducing now let's go ahead and let's

[219:54]

try to understand so here you have

[219:56]

understood about purity test definitely

[219:58]

you'll use entropy try to find out

[220:00]

whether it is pure or impure if it is

[220:02]

impure you go ahead with the further

[220:04]

split further division of the categories

[220:08]

again you take another feature divide it

[220:10]

because here from this two which split

[220:13]

you will do further you will do this

[220:14]

split as further if you are getting 0.66

[220:18]

as this specific value then you probably

[220:20]

go and draw over here this is your

[220:23]

entropy if your probability is here

[220:25]

which

[220:26]

is 0.3 then you will go here and create

[220:29]

this this may be 0.4 or 0.3 something like

[220:32]

this it will be between 0 to 1 let's go

[220:35]

ahead and discuss about the second issue

[220:37]

I hope everybody has understood we

[220:40]

have discussed about checking the pure

[220:42]

split or not and we have understood this

[220:45]

much but the next thing is that okay

[220:47]

fine Krish this is very good we have

[220:49]

explained well I know many people will

[220:51]

say that but there are some people I

[220:53]

can't help let's say that I have some

[220:55]

features okay now coming to the second

[220:58]

problem how do we consider which node to

[221:02]

keep which feature to take and

[221:05]

split because here I may have one

[221:08]

split so again let's see that what is

[221:10]

the second problem which feature to take

[221:14]

to split right this is the second

[221:16]

problem that we are trying to solve

[221:18]

let's say that I have one feature one

[221:19]

over here and I have two categories

[221:22]

let's say this is there C1 and C2 here

[221:25]

let's say that I have 9 yes 5 nos and

[221:29]

then I have 6 yes 2 nos here I have

[221:32]

basically three yes and three NOS let's

[221:34]

say and in my data set I have features

[221:36]

like F1 F2 F3 now let's say that

[221:40]

another split I can actually start with

[221:42]

feature two also and in feature two I

[221:45]

may have probably three categories like

[221:47]

C1 C2 C3 so with respect to the root

[221:52]

node and all the other features because

[221:54]

after this also I may have to split

[221:56]

right I may have to take another feature

[221:58]

and keep on splitting right based on the

[222:01]

Pure or impure split how do I decide

[222:03]

should I take F1 first or F2 first or

[222:07]

F3 first or any other feature first how

[222:10]

should I decide that which feature

[222:12]

should I take and probably do the split

[222:15]

that is the major question so for this

[222:18]

we specifically use something called as

[222:20]

Information Gain so here I'm just going

[222:22]

to say here we basically use Information

[222:26]

Gain now what is this Information Gain

[222:29]

I'll talk about it so Information Gain

[222:31]

first of all I will write the formula we

[222:33]

basically write gain with sample first

[222:37]

with feature one I will compute so first

[222:40]

with feature one I will compute suppose

[222:42]

this is my first split of my data and

[222:44]

probably I'm Computing over here this

[222:46]

can be written as h of s I'll discuss

[222:50]

about each and every parameter don't

[222:51]

worry summation of V belong to values s

[222:56]

of V don't worry guys if you have not

[222:58]

understood the formula I will explain it

[223:01]

then the sample size H of SV I'll

[223:04]

discuss about each and every parameter

[223:06]

let's say that I'm taking this feature

[223:09]

one split I have you have already seen

[223:11]

what is feature one so this is my

[223:13]

feature one I have two categories C1 C2

[223:18]

this has 9 yes 5 nos this has 6 yes and two

[223:24]

Nos and this has 3 yes and three NOS now

[223:27]

I will try to calculate the information

[223:29]

gain of this specific split now I will

[223:32]

go ahead and probably take this up now

[223:35]

see over here we'll try to understand

[223:37]

what is this now if I want to compute

[223:40]

the gain of s comma F1 the first

[223:43]

thing that I need to find out is H of s

[223:45]

now this h of s is specifically of the

[223:48]

root node so I need to first of all

[223:50]

calculate what is h of s h ofs is

[223:52]

nothing but entropy entropy of the root

[223:56]

node so if I want to compute the entropy

[223:58]

of the root node tell me how should I

[224:00]

compute h of s is equal to minus p + log

[224:04]

base 2 p+ calculate guys along with me -

[224:07]

P minus log base to P minus so I hope

[224:11]

everybody knows this so here I'm going

[224:13]

to compute what is the probability of plus

[224:16]

over here in this specific root node it

[224:18]

is nothing but 9 by 14 then I have log

[224:22]

base 2 again 9

[224:24]

by 14 then I have P minus what is p minus

[224:28]

5 by 14 log base 2 5 by 14 so this calculation

[224:34]

I will probably get it as

[224:36]

0.94 so approximately 0.94 just check

[224:40]

it whether you're getting this or not

[224:42]

again you can use calculator if you want

[224:44]

now now I have definitely found out this

[224:47]

this is specifically for the root node

[224:50]

now let's see the next thing the next

[224:51]

important thing which is this part what

[224:54]

is s of v and what is s and what is h of

[224:57]

SV now very important just have a look

[225:01]

everybody see this graph okay see this

[225:05]

graph I will talk about h of SV first of

[225:07]

all I'll talk about h of SV okay this

[225:10]

one this is the entropy of category one

[225:13]

you need to find and entropy of category

[225:15]

2 you need to find so if I write h of SV

[225:19]

of category 1 so what is category 1 for

[225:22]

this I'll write SC1 let's say I'm going

[225:25]

to write like this quickly calculate the

[225:28]

H of SV of this and this separately you

[225:31]

need to calculate so h of SV of C1 okay

[225:35]

so here again you'll write minus 6 by 8 log

[225:38]

base 2 6 by

[225:41]

8 minus 2 by 8 log base 2 2 by 8 I hope

[225:46]

everybody knows this how we got it so h

[225:50]

of SV basically means I'm going to

[225:51]

compute the entropy of this category and

[225:54]

this category so for that I will

[225:56]

basically write h of so here I will

[225:59]

write minus 6 by 8 log base 2 6 by 8 minus 2 by 8 log

[226:08]

base 2 2 by 8 so if I get it I'm actually

[226:12]

going to get 0.81 and similarly if I

[226:15]

calculate h of C2 quickly calculate how

[226:18]

much you are going to get guys 3 by 6 3 by 6

[226:21]

with respect to this we need to find out

[226:24]

so now we have all these values we'll

[226:25]

start equating them to this equation so

[226:29]

here we have finally gain of s comma

[226:33]

fub1 so let's say that here I'm going to

[226:36]

basically add

[226:38]

0.94 minus see minus summation of okay

[226:42]

summation of what is s s of V understand

[226:46]

s of V basically means that how many

[226:48]

samples I have over here let's say for

[226:51]

category one how many samples I have for

[226:54]

category one over here simple if you

[226:56]

really want to just calculate it is

[226:58]

nothing but eight and total number of

[227:01]

sample is how much if I go and see over

[227:03]

here there are 9 yes's five no's okay 9

[227:07]

yes's and five no's that basically means

[227:10]

14 total sample here you have eight

[227:13]

sample Okay so this will become

[227:17]

8 by 14 then you multiply by what see

[227:21]

from this equation you multiply by h of

[227:23]

SV so h of SV is nothing but the entropy

[227:26]

of category 1 so entropy of category 1

[227:29]

is nothing but 0.81 plus then you go again

[227:33]

back to the graph and try to see that

[227:36]

for C2 how much how many total number of

[227:39]

samples are there 3 + 3 is 6 so 6 by 14

[227:42]

it will

[227:43]

become multiplied by 1 right so this is

[227:49]

your entire thing so here after all the

[227:52]

calculation you are going to get

[227:54]

0.048 so this is my gain of s comma F1
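Plugging the pieces into the gain formula Gain(S, F1) = H(S) - Σ |Sv|/|S| · H(Sv) can be sketched like this (the split counts are the ones from the example):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c > 0)

# Root: 9 yes / 5 no. Split on F1 -> category 1 (6 yes / 2 no),
# category 2 (3 yes / 3 no).
h_s = entropy(9, 5)                       # ~0.94
gain = h_s - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain, 3))
```

With exact entropies this comes out near 0.048; the same computation is repeated for every other candidate feature, and the split happens on the feature with the largest gain.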

[227:59]

so here I have got this value amazing I

[228:02]

did this with feature one only what

[228:05]

about feature two let's say that this

[228:07]

was my split for feature two and suppose

[228:10]

I get the gain for S comma feature 2 as

[228:17]

0.51 if I get this now tell

[228:21]

me in using which feature should I start

[228:25]

splitting first whether it should be

[228:28]

F1 or whether it should be F2 based

[228:31]

on this value you know that over here

[228:35]

the gain the information gain of s comma

[228:38]

F2 is greater than gain of s comma F1

[228:43]

so your answer is very much simple we

[228:45]

will definitely use feature 2 to start

[228:48]

the split the thing over here you are

[228:51]

trying to understand that if I really

[228:52]

want to select which feature to select

[228:54]

to start my splitting then I have to

[228:58]

basically calculate the information gain

[229:00]

and go throughout the all the paths and

[229:03]

whichever path has the highest

[229:04]

Information

[229:05]

Gain then we will select that specific

[229:09]

thing now the question arises Krish

[229:12]

obviously this is good but you had

[229:14]

written about Gini impurity what is

[229:16]

the purpose of that please explain us

[229:19]

and why Gini impurity is basically

[229:20]

used so let me go ahead with Gini

[229:22]

impurity I told that yes you can

[229:25]

obviously

[229:26]

use you can obviously use entropy but

[229:29]

why Gini impurity so Gini impurity

[229:32]

formula which I have specifically

[229:34]

written as 1 minus summation of i equal 1

[229:38]

to n

[229:41]

p i squared now what is this p i squared suppose let's say

[229:45]

that in my n n is the number of outputs

[229:47]

right now how many outputs I have I have

[229:49]

two outputs yes or no so I will expand

[229:52]

this 1 minus since this is summation I

[229:55]

equal to 1 to n I'm basically going to

[229:57]

basically say that okay fine I will

[230:00]

write probability of plus whole

[230:03]

Square uh plus probability of minus

[230:07]

whole Square so this is the formula for

[230:10]

Gini impurity now you may be thinking

[230:14]

okay fine the calculation will be

[230:16]

obviously very much equally easy right

[230:18]

suppose if I have a node sorry if I have

[230:21]

a node which has 2 yes 2 no's now

[230:25]

in this particular case how do I

[230:26]

calculate my this probability if I have

[230:29]

two yes and two no's suppose let's say

[230:31]

that I have a node over here which is my

[230:33]

split and this is having two yes and two

[230:36]

no so how do I calculate I will write 1

[230:38]

minus what is probability of plus square 1 by 2

[230:41]

square

[230:45]

yeah 1 by 2 squared plus 1 by 2

[230:49]

squared right then I will say 1 by 4 plus 1 by

[230:54]

4 which is nothing but 2 by 4 which is nothing

[230:56]

but 1 by 2 so I will be getting 0.5 now
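The Gini computation for that 2-yes / 2-no node, and the contrast with entropy at the same node, can be sketched as:

```python
import math

def gini(pos, neg):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def entropy(pos, neg):
    total = pos + neg
    return -sum((c / total) * math.log2(c / total)
                for c in (pos, neg) if c > 0)

# completely impure node: 2 yes, 2 no
print(gini(2, 2))     # -> 0.5  (Gini peaks at 0.5)
print(entropy(2, 2))  # -> 1.0  (entropy peaks at 1)
```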

[231:00]

here here you understand this is a

[231:02]

complete impure split right if you have

[231:06]

an impure split in entropy the output

[231:10]

you getting it as one whereas in the

[231:13]

case of Gini impurity you will get it

[231:15]

as zero sorry

[231:17]

0.5 so if I go ahead with the graph that

[231:21]

I probably had created here so my Gini

[231:24]

impurity line will look something like

[231:27]

this so it will be looking something

[231:29]

like this for zero obviously I'll be

[231:31]

getting zero but whenever my probability

[231:34]

of plus is 0.5 I'm going to get 0.5 over

[231:38]

here and that is the difference between

[231:40]

Gini

[231:42]

impurity and entropy but you

[231:45]

may be asking Krish when to use what now

[231:48]

let's understand that when to use Gini

[231:51]

and when to use entropy tell me guys if

[231:55]

I consider this formula of Gini

[231:58]

impurity and if I probably

[232:01]

consider if I consider entropy this

[232:05]

formula where do you think more time

[232:09]

will take for execution for this

[232:11]

particular formula whether for entropy

[232:14]

it will take or for Gini impurity it

[232:18]

will take more time where it will

[232:21]

probably take for the execution purpose

[232:24]

see understand decision tree is having a

[232:29]

worst time complexity because if you

[232:32]

have 100 features probably you'll keep

[232:34]

on comparing by dividing many many

[232:37]

features then probably compute a

[232:38]

Information Gain like this if you have

[232:40]

just 100 features so which is faster

[232:43]

entropy

[232:45]

or Gini impurity understand in entropy

[232:48]

you have log function here you have log

[232:52]

function here you have simple maths the

[232:56]

more amount of time out of entropy and

[232:59]

Gini impurity the more amount of time

[233:01]

basically is taken

[233:03]

by

[233:06]

entropy so if you have huge number of

[233:10]

features like 100 200 features and you

[233:12]

are planning to apply decision tree I

[233:15]

would suggest try to use Gini impurity

[233:18]

rather than entropy if you have a small set of

[233:20]

features then you can go ahead with

[233:23]

entropy so over here definitely with

[233:25]

respect to speed Gini is faster than

[233:31]

entropy now let's go ahead and
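The speed argument is visible in the formulas themselves: for a binary node with P(+) = p, Gini is 1 - p² - (1-p)² = 2p(1-p), plain arithmetic, while entropy needs two log evaluations per node. A small illustrative sweep:

```python
import math

def gini_p(p):
    # no log: just multiplications and subtractions
    return 1 - p * p - (1 - p) * (1 - p)

def entropy_p(p):
    if p in (0.0, 1.0):
        return 0.0
    # two log2 calls per evaluation
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(gini_p(p), 3), round(entropy_p(p), 3))
# both are 0 at p=0 and p=1; at p=0.5 Gini peaks at 0.5, entropy at 1.0
```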

[233:33]

understand with respect to you may be

[233:36]

thinking Krish okay fine you have

[233:38]

basically explained us about categorical

[233:41]

variables over here see over here you

[233:44]

have you have explained about

[233:45]

categorical variables what if I have

[233:47]

numerical feature let's say I have F1

[233:51]

over here which is a numerical

[233:53]

feature I have an F1 feature which is

[233:56]

numerical feature and I may have values

[233:58]

let's say that I have sorted all the

[234:00]

values over here okay let's say that I

[234:02]

have F1 and output okay so this F1 let's

[234:06]

say that I have values

[234:07]

like in sorted order values I'm sorting

[234:10]

this features I'm basically doing this

[234:12]

let's say that initially I have this

[234:15]

features like this and let's say I have

[234:17]

values like 2.3 1.3 4 5 7 3 let's say I

[234:23]

have this features now this is a

[234:26]

continuous

[234:27]

feature this is a continuous feature so

[234:29]

for a continuous feature how probably

[234:32]

the decision tree entropy will be

[234:34]

calculated and the Information Gain will

[234:37]

get calculated so here you'll be able to

[234:39]

see that I will first of all sort these

[234:41]

values so in F1 the decision tree will

[234:44]

basically first of all sort this values

[234:45]

so I have 1.3 then you have 2.3 then you

[234:49]

have three then

[234:53]

you have four then you have five and

[234:55]

then you have seven now whenever you have

[234:57]

a continuous feature so how the

[234:59]

continuous feature will basically work

[235:01]

in this case first of all your decision

[235:04]

tree node will say

[235:06]

that we'll take this one only one first

[235:10]

record and say that if it is less than

[235:12]

or equal to 1.3

[235:14]

okay if it is less than or equal to 1.3

[235:16]

so you here you'll be getting two

[235:18]

branches yes or no so yes and no

[235:22]

definitely your output over here will be

[235:25]

put over here right and then for the no

[235:28]

here you'll be having another node over

[235:30]

here how many number of Records you'll

[235:31]

be having in this particular case you'll

[235:33]

be having one record in this particular

[235:35]

case you will be having around five to

[235:36]

six records and here also you'll be able

[235:38]

to see right how many yes and NOS are

[235:40]

there definitely this will be a leaf

[235:42]

node so in the first instance they will

[235:45]

go ahead and calculate the information

[235:47]

gain of this then probably once the

[235:50]

Information Gain Is got then what

[235:51]

they'll do they will take the first two

[235:54]

records and again create a new decision

[235:57]

tree let's say that this will be my

[236:00]

suggestion where they'll say it is less

[236:02]

than or equal to 2.3 so I will get one

[236:05]

and one over here so in this now you'll

[236:07]

be having two records which will

[236:09]

basically say how many yes and no are

[236:10]

there and remaining all records will

[236:12]

come over here then again Information

[236:16]

Gain will be computed here then again

[236:17]

what will happen they'll go to the next

[236:19]

record then then again they'll create

[236:21]

another feature where they'll say less

[236:22]

than or equal to three and they will

[236:24]

create this many nodes again they'll try

[236:28]

to understand that how many yes or no

[236:29]

are there and then they'll again compute

[236:31]

The Information Gain like this they'll

[236:34]

do it for each and every record and

[236:36]

finally whichever Information Gain is

[236:38]

higher they will select that specific

[236:40]

value in that feature and they'll split

[236:42]

the node so in a continuous feature

[236:45]

whenever you have a continuous feature

[236:47]

this is how it will basically have and

[236:50]

then it will try to compute who is

[236:51]

having the highest Information Gain the

[236:54]

best Information Gain will get selected

[236:57]

and from there the splitting will

[236:59]

happen now let's go ahead and understand
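The threshold-scanning procedure just described can be sketched as follows; the feature values are the ones from the example, while the yes/no labels are hypothetical:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Sort the numeric feature, try an 'x <= v' split at every value,
    and keep the split with the highest information gain."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    h_root = entropy(ys)
    best_thr, best_gain = None, -1.0
    for i in range(1, len(ys)):
        left, right = ys[:i], ys[i:]
        gain = (h_root
                - len(left) / len(ys) * entropy(left)
                - len(right) / len(ys) * entropy(right))
        if gain > best_gain:
            best_thr, best_gain = pairs[i - 1][0], gain
    return best_thr, best_gain

# feature values from the example; labels are made up for illustration
thr, gain = best_threshold([2.3, 1.3, 4, 5, 7, 3],
                           ["no", "no", "yes", "yes", "yes", "no"])
print(thr, round(gain, 3))  # -> 3 1.0
```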

[237:01]

about the next topic is that how this

[237:04]

entirely things work in decision tree

[237:07]

regressor because in decision tree

[237:09]

regressor my output is an continuous

[237:13]

variable so suppose if I have one

[237:15]

feature one feature two and this output

[237:17]

is a continuous feature it will be

[237:20]

continuous any value can be there so in

[237:23]

this particular case how do I split it

[237:27]

so let's say that F1 feature is getting

[237:30]

selected now in this F1 feature what

[237:32]

value will come when it is getting

[237:34]

selected first of all the entire mean

[237:38]

will get calculated of the output mean

[237:40]

will get calculated so here I will have

[237:42]

the mean and here here the cost function

[237:45]

that is used is not Gini coefficient

[237:48]

or Gini impurity or entropy here we

[237:51]

use mean squared

[237:53]

error or you can also use mean absolute

[237:56]

error now what is mean squared error if

[237:58]

you remember from our linear

[238:00]

regression how do we calculate 1 by 2 m

[238:03]

summation of i equal 1 to n y hat of i minus y of i

[238:08]

whole square

[238:12]

this is what is mean squared error

[238:14]

so what it will do first based on F1

[238:17]

feature it will try to assign a mean

[238:20]

value and then it will compute the MSE

[238:23]

value and then it'll go ahead and do the

[238:26]

splitting now when it is doing splitting

[238:29]

based on categories of continuous

[238:31]

variable I will be having different

[238:33]

different categories now in this

[238:35]

categories what will happen after split

[238:37]

some records will go over

[238:40]

here then I will be having a mean value

[238:42]

of this over here

[238:45]

that will be my output and then again

[238:47]

the MSE will get calculated over here as

[238:50]

the MSE gets reduced that basically

[238:53]

means we are reaching near the leaf

[238:55]

node and the same thing will happen over

[238:57]

here so finally when you follow this

[239:00]

path whatever mean value is present over

[239:02]

here that will be your output this is

[239:05]

the difference between the decision tree

[239:06]

regressor and the classifier here

[239:09]

instead of using entropy and all you use

[239:12]

mean squared error or mean absolute error

[239:14]

and this is the formula of mean square

[239:16]

error now let's go to the one more topic
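For the regressor, the node's prediction is the mean of the targets that reach it, and the split quality is scored by MSE against that mean; a minimal sketch (targets borrowed from the example that follows):

```python
def node_mse(targets):
    """MSE of a node when it predicts the mean of its targets."""
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)

# targets reaching a node
targets = [20, 24, 26, 28, 30]
prediction = sum(targets) / len(targets)   # leaf output = mean
print(prediction, round(node_mse(targets), 2))  # -> 25.6 11.84
```

As splits push similar targets into the same child node, each child's MSE drops, which is the regression analogue of impurity going down toward a leaf.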

[239:19]

which is called as the hyperparameters

[239:22]

tell me decision tree if I keep on

[239:25]

growing this to any depth what kind of

[239:28]

problem it will face regressor part you

[239:31]

want me to explain okay let's

[239:33]

see okay let's let's do the

[239:36]

regression decision

[239:39]

tree

[239:41]

regressor let's say I have feature F1

[239:44]

and this is my output let's say I have

[239:46]

values like 20 24 26 28 30 and this is

[239:53]

my feature one with category one

[239:56]

category one let's

[239:58]

say some categories are there let's say

[240:01]

I have done

[240:03]

the division by

[240:06]

F1 that is this feature initially tell

[240:09]

me what is the mean of this that mean

[240:12]

value will get assigned over here then

[240:14]

using MSE that is mean squared error here

[240:18]

you will try to calculate suppose I get

[240:20]

an MSE of some 37 or 47 something like

[240:23]

this and then I will try to split this

[240:27]

then I will be getting two more nodes or

[240:29]

three more nodes it depends then that

[240:31]

specific nodes will be the part of this

[240:33]

again the mean will change again the

[240:36]

mean will change over here suppose this

[240:38]

two is there this two records goes here

[240:41]

right then again MSE will get calculated

[240:44]

I'm just taking as an example over here

[240:46]

just try to assume this thing now if I

[240:48]

talk about hyper parameters see this is

[240:51]

what is the formula that gets applied

[240:52]

over MSE now let's see in this hyper

[240:56]

parameter always understand decision

[240:58]

tree leads to overfitting because we are

[241:00]

just going to divide the nodes to

[241:03]

whatever level we want so this obviously

[241:06]

will lead to

[241:07]

overfitting now in order to prevent

[241:10]

overfitting we perform two important

[241:12]

steps one is post pruning and one is

[241:16]

pre- pruning so this two post pruning

[241:18]

and pre pruning is a condition let's say

[241:21]

that I have done some

[241:23]

splits I have done some splits let's say

[241:26]

over here I have seven yes and two

[241:28]

no and again probably I do the further

[241:31]

split like this now in this particular

[241:33]

scenario you know that if 7 yes and two

[241:35]

no's are there there is a maximum there

[241:37]

is more than 80% chance that this node

[241:40]

is saying that the output is yes so

[241:43]

should we further do more

[241:46]

pruning the answer is no we can close it

[241:49]

and we can cut the branch from here this

[241:52]

technique is basically called as post

[241:54]

pruning that basically means first of

[241:57]

all you create your decision tree then

[241:59]

probably see the decision tree and see

[242:01]

that whether there is an extra Branch or

[242:03]

not and just try to cut it there is one

[242:06]

more thing which is called as

[242:07]

pre-pruning now pre-pruning is decided

[242:10]

by hyperparameters what kind of hyper

[242:13]

parameters you can basically say that

[242:15]

how many number of decision tree needs

[242:17]

to be used not number of decision tree

[242:20]

sorry over here you may say that what is

[242:22]

the max

[242:24]

depth what is the max depth how many Max

[242:27]

Leaf you can

[242:28]

have so this all parameters you can set

[242:31]

it with grid search

[242:33]

CV and you can try it and you can

[242:36]

basically come up with a pre- pruning

[242:38]

technique so this is the idea about
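Pre-pruning via hyperparameters can be wired up with GridSearchCV roughly like this; the parameter grid here is just an example, not the one from the session:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# candidate pre-pruning settings -- values are illustrative
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The best combination found by cross-validation then becomes the pre-pruned tree's configuration.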

[242:41]

decision tree uh regressor yes yes it is

[242:44]

possible your Gini value will be one

[242:45]

no this graph is there

[242:47]

no Gini value are you talking about

[242:50]

this Gini it will not be one

[242:51]

it will always be between 0

[242:53]

to 0.5 so the first thing first as usual

[242:57]

what we should do we should import the

[242:59]

libraries so here I will go ahead and

[243:02]

import the librar so I'll say

[243:04]

import pandas as pd import matplotlib

[243:10]

dot pyplot as plt

[243:14]

uh

[243:16]

import so this basic things I have with

[243:19]

me so I will go and take any data set

[243:22]

that I want from SK

[243:24]

learn dot datasets import let's say that

[243:28]

I'm going to take load Iris data set and

[243:31]

then I'm going to upload the iris data

[243:33]

set so I'm going to write load Iris

[243:36]

there is my Iris data set then the next

[243:38]

step uh once you get your iris data set

[243:41]

so this is my iris dot data

[243:45]

okay these are all my features the four

[243:47]

features will be there these four

[243:49]

features are petal length petal width

[243:51]

sepal length and sepal width this is my

[243:54]

independent features then if I really

[243:56]

want to apply

[243:58]

for classifier so decision tree

[244:03]

classifier so I can first of all import

[244:06]

from

[244:08]

sklearn dot tree import decision let's see

[244:16]

where decision tree is present in sklearn

[244:16]

decision tree

[244:17]

classifier the name is absolutely fine

[244:20]

but I was not getting over here

[244:23]

so this has got no module SK okay SK

[244:29]

sklearn

[244:31]

sklearn so here you have

[244:35]

classifier right now I'm just going to

[244:37]

overfit the data then I'll probably show

[244:38]

you how you can go ahead with uh

[244:42]

pruning so by default what are the

[244:44]

parameters over here if you probably go

[244:46]

and see in in the classifier over here

[244:49]

you have criterion see this the first

[244:52]

parameter is criterion by default it is

[244:54]

gini then you have splitter splitter

[244:57]

basically means how you're going to

[244:58]

split and there also you have two types

[245:01]

best and random you can randomly select

[245:04]

the features and do it okay you should

[245:06]

always go with

[245:07]

best max depth is a hyper parameter

[245:11]

minimum samples leaf is a hyper parameter

[245:13]

max features how many number of

[245:14]

features we are going to take in order

[245:16]

to fix that that is also an hyper

[245:17]

parameter so all these things are hyper

[245:19]

parameter okay so I will just by default

[245:22]

executed whatever is giving me in

[245:24]

decision tree and the next thing that

[245:26]

I'm actually going to do is create a

[245:28]

decision tree so for this I will be

[245:31]

using plt dot figure inside

[245:35]

figure I have this fig

[245:38]

size okay and I will probably show in

[245:41]

some better figure size so that

[245:43]

everybody body will be able to see it so

[245:45]

here let me say that I'm going to take

[245:47]

an area of

[245:49]

15 by 10 and then probably I'm going to say

[245:51]

tree Dot

[245:54]

Plot and here I'm going to say a

[245:57]

classifier and it should be filled the

[246:00]

coloring should be filled with this so

[246:04]

tree sorry tree

[246:09]

it should be classifier tree dot plot

[246:12]

okay I have to also import uh tree so I

[246:16]

have to basically import tree so from SK

[246:20]

learn

[246:22]

import tree again I'm getting

[246:26]

error has no attribute plot

[246:29]

why let me just see the documentation

[246:32]

guys so this plot function is plot

[246:34]

underscore tree so tree dot plot underscore tree now what

[246:40]

is the error we are getting okay not

[246:42]

fitted yet

[246:44]

sorry so I'm going to say

[246:47]

classifier do fit on data what data

[246:53]

iris.

[246:55]

data and then I'm going to fit with Iris

[246:58]

dot

[247:00]

Target so once this is done I think now

[247:03]

it will get

[247:04]

executed so this is how your graph will

[247:07]

look like guys so here you can see this

[247:10]

is how your graph looks like now if I

[247:12]

show you the graph over here see you can

[247:14]

see some amazing things over here three

[247:18]

outputs are actually there in this when

[247:21]

you see in this left hand side this

[247:23]

become a leaf node so this first one is

[247:25]

probably versicolor uh versicolor flower

[247:29]

okay if you go on the right hand side

[247:31]

here you can see 50/50 is there so based

[247:32]

on one feature based on one feature here

[247:35]

you'll be able to see that you are

[247:37]

getting a leaf node based on another

[247:39]

Branch here you are getting

[247:41]

0 50 50 so again you have two more

[247:44]

features getting splitted over here so

[247:46]

here you have 49 and 5 here you have

[247:48]

47 and 1 do we require this split anybody

[247:51]

tell me from here do we require any any

[247:54]

more split just try to think this is

[247:56]

after post pruning I want to find out

[247:59]

whether more splits are required or not

[248:01]

now in this particular case you see this

[248:03]

after this do you require any

[248:05]

split you do not require right here you

[248:08]

are basically getting 47 and one I guess

[248:11]

after this also you require no split

[248:14]

understand this so this is basically

[248:15]

post pruning so you can then decide your

[248:19]

level and probably do it Gini value is

[248:22]

more than

[248:24]

0.5 okay this side H this is coming as

[248:29]

0.5 greater than 0.5 it should not happen

[248:33]

here it is

[248:34]

0.5 no maximum 0.5 can come 0 to 0.5 only

[248:39]

should come I don't know why this is

[248:41]

coming as 0.667

[248:44]

I'll have a look onto this guys but

[248:47]

anywhere you see other than that you're

[248:50]

everywhere you're getting less

[248:51]

than 0.5 the plotting graph is very much

[248:54]

easy you use SK learn import tree then

[248:57]

you basically pass the classifier and

[248:59]

filled is equal to true and you can just

[249:02]

do this so the agenda let me Define the

[249:05]

agenda what all things are there first

[249:08]

we'll understand about

[249:11]

ensemble techniques in this ensemble

[249:13]

techniques we are basically going to

[249:15]

discuss about what is the difference

[249:17]

between

[249:19]

bagging and boosting

[249:22]

second what we are basically going to

[249:24]

discuss about is so uh the agenda of

[249:27]

this session is ensemble techniques bagging

[249:29]

and boosting then we are probably going

[249:31]

to cover random forest and then probably

[249:35]

we will try to cover AdaBoost and if I

[249:39]

have more energy I will also try to

[249:40]

cover XGBoost so all these algorithms

[249:43]

we'll discuss about it so let's go ahead

[249:46]

and let's start the

[249:48]

topics the first topic that we are going

[249:50]

to discuss is about ensemble

[249:52]

techniques now what exactly is ensemble

[249:55]

techniques and we are going to discuss

[249:58]

about it okay so ensemble techniques what

[250:01]

exactly is ensemble techniques till now we

[250:03]

have solved two different kind of

[250:04]

problem statement one is

[250:07]

classification and regression and you

[250:09]

have learned about different different

[250:11]

algorithms like uh linear regression

[250:13]

logistic regression we have discussed

[250:15]

about KNN we have discussed about

[250:17]

yesterday what disc what did we discuss

[250:19]

about n bias different different

[250:21]

algorithms we have already finished now

[250:24]

with respect to classification

[250:25]

regression Problem whatever algorithm we

[250:27]

are discussing there was only one

[250:28]

algorithm at a time we were discussing

[250:31]

one algorithm at a time we are

[250:32]

discussing and we are trying to either

[250:33]

solve a classification or a regression

[250:35]

problem now the next thing is over here

[250:38]

is that can we use multiple algorithms

[250:42]

mul multiple algorithm to solve a

[250:44]

problem multiple algorithms basically

[250:46]

means can we I'll just talk about it

[250:49]

okay now the if I ask this specific

[250:52]

question can we use multiple algorithms

[250:54]

to solve a problem at that point of time

[250:57]

I will definitely say yes we can because

[250:59]

we are going to use something called as

[251:00]

ensemble techniques there now what this

[251:03]

ensemble techniques is okay so ensemble

[251:06]

techniques in ensemble techniques we

[251:08]

specifically use two different ways one

[251:12]

is one one way is that we specifically

[251:15]

use and the other one I'll just go to

[251:16]

write it over here so one that we

[251:19]

basically use is something called as

[251:20]

bagging technique and the other one we

[251:23]

specifically use is something called as

[251:25]

boosting technique so in bagging

[251:27]

Technique we what exactly we can do and

[251:31]

in boosting technique what we can

[251:32]

actually do and how we are combining

[251:34]

multiple models to solve a problem so

[251:36]

let's first of all discuss about bagging

[251:39]

now how does bagging work let's say that

[251:42]

I have a specific data set so this is my

[251:44]

data set with uh with features rows

[251:48]

columns everything like this I have this

[251:50]

specific data set just imagine I have

[251:52]

many many features over here like this

[251:54]

F1 F2 F3 and probably I have my output

[251:57]

so this is my data set D let's consider

[251:59]

it now what we do in bagging is that we

[252:04]

create models and this model can be

[252:06]

anything it can be logistic it can be

[252:08]

linear for a classification problem

[252:10]

let's say that this is logistic model so

[252:12]

this is my model M1 let's say I have

[252:14]

another model M2 then I may have another

[252:17]

model M3 let's say that this is

[252:20]

logistic and this is probably the other

[252:23]

model which is like decision tree and

[252:25]

then probably we use this model as KNN

[252:29]

classification and this model can again

[252:31]

be decision tree it's fine let's use

[252:34]

another decision tree so now here you

[252:36]

can see that we have used so many models

[252:39]

okay so many models are there now with

[252:41]

respect to this particular model what I

[252:42]

will do is that the first step that I

[252:44]

will do from this particular data set I

[252:46]

will just take up some rows so I'll

[252:48]

basically do row

[252:50]

sampling and I'll take a row sampling of

[252:53]

D dash D dash basically means this D dash

[252:55]

is always less than D some of the rows

[252:58]

I'll push it to M1 okay I can also use n

[253:01]

fine so what I'll do is that some of the

[253:03]

rows I'll push it to model one this

[253:05]

model one will be training let's say

[253:07]

that out of these 10,000 records a thousand

[253:09]

rows I'm actually doing a row sampling

[253:11]

of thousand rows and giving it to M1 to train

[253:14]

it then what I'm actually going to do

[253:16]

over here I'm basically going to give

[253:18]

this specific model M2 and again I'm

[253:21]

going to do row row sampling and I'm

[253:24]

again going to sample some of the rows

[253:25]

and give it to model two and again

[253:27]

remember some of the rows may get

[253:29]

repeated from this D dash to the next D double

[253:31]

dash similarly I will do row sampling

[253:33]

and give it to this and again I may have

[253:35]

D triple dash and D4 dash so different

[253:38]

different different different rows data

[253:41]

points when I say row sampling basically

[253:42]

I'm talking about data points different

[253:45]

different data points I will give it to

[253:47]

separate separate model and this model

[253:49]

will specifically train when I say D

[253:52]

Dash that basically means uh suppose I

[253:54]

say 10,000 are my total number of

[253:56]

data points when I say D Dash This D

[253:59]

dash may be a thousand points then D double

[254:02]

dash may be another thousand points and

[254:04]

some of the rows may get repeated over

[254:05]

here D double dash here also I can basically

[254:08]

use so here specifically row sampling

[254:10]

will be used now when I have this many
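Row sampling with replacement (the bootstrap step) is easy to sketch; dataset and sample sizes mirror the 10,000 / 1,000 example, and note that rows may repeat across samples:

```python
import random

random.seed(0)
dataset = list(range(10_000))             # stand-in for 10,000 rows

def row_sample(data, k):
    """Draw k rows with replacement -- rows can repeat across samples."""
    return [random.choice(data) for _ in range(k)]

d1 = row_sample(dataset, 1_000)           # D'  -> fed to model M1
d2 = row_sample(dataset, 1_000)           # D'' -> fed to model M2
print(len(d1), len(d2))
```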

[254:12]

specific each and every model will be

[254:14]

trained with different kind of data now

[254:17]

how the inferencing will happen for the

[254:18]

test data so first thing first let's say

[254:21]

that I'm going to get a new test data

[254:23]

over here now new test data will be

[254:25]

passed to M1 and this M1 suppose it

[254:28]

gives zero as my output suppose let's

[254:30]

say that I'm doing a binary

[254:31]

classification it gives a zero as an

[254:33]

output so this is my output of zero next

[254:37]

M2 for the new test data gives one M3

[254:40]

gives one and M4 also gives one as the

[254:43]

the output now in this particular case

[254:46]

in this particular case what will happen

[254:49]

now you can see over here it's simple

[254:51]

what what do you think the output may be

[254:53]

in this particular case now M1 has

[254:55]

predicted for this particular test data

[254:56]

as zero the model M2 has predicted 1 M3

[255:00]

has predicted 1 and M4 has predicted one

[255:02]

so finally all these outputs are going

[255:04]

to get

[255:06]

aggregated are going to get aggregated

[255:08]

and a simple thing that gets applied is

[255:11]

majority voting majority voting so tell

[255:14]

me what will be the output for with

[255:16]

respect to this the output will

[255:18]

obviously be one because the majority

[255:19]

voting that you can see three people are

[255:21]

basically saying it as one so my output

[255:24]

over here will be one okay this is the
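A minimal sketch of the majority-vote aggregation step (the model predictions below are the 0, 1, 1, 1 from this example):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate base-model outputs by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

# outputs from models M1..M4 for one test point
print(majority_vote([0, 1, 1, 1]))  # -> 1
```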

[255:26]

concept of bagging wherein you are

[255:29]

providing different different rows with

[255:31]

probably all the features in this case

[255:33]

and giving it to different different

[255:34]

model again which is a classification

[255:36]

model and then finally you are combining

[255:38]

them based on majority voting and you're

[255:40]

getting the answer as one so this step

[255:43]

is called as bootstrap aggregator that

[255:45]

basically means you're aggregating all

[255:48]

the output that is basically coming from

[255:50]

all the specific models all the specific
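The inference step just described — every bagged model predicts, then the predictions are aggregated by majority voting — can be sketched in a few lines of Python (a minimal illustration, not tied to any particular library; `bagging_predict` is a hypothetical helper name):

```python
from collections import Counter

def bagging_predict(model_outputs):
    # Bootstrap aggregation at inference time: collect every bagged
    # model's prediction and return the majority-vote class.
    votes = Counter(model_outputs)
    return votes.most_common(1)[0][0]

# The session's example: M1 predicts 0, M2/M3/M4 predict 1.
print(bagging_predict([0, 1, 1, 1]))  # -> 1
```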

[255:52]

models now many people will say Krish

[255:54]

what about a tie guys like this kind of

[255:56]

situation you know we will be having

[255:58]

more than 100 to 200 models so it is

[256:01]

very very difficult that it will be a

[256:03]

tie those who are repeating questions they

[256:05]

will be put up in time out so what if

[256:09]

you're saying that if the 50% of model

[256:12]

says yes 50% of our models says no

[256:14]

always understand guys we will be having

[256:17]

more than 100 to 200 plus models so in

[256:19]

this particular case there will be high

[256:21]

probability that always there will be a

[256:23]

majority voting available it will always

[256:25]

not be a tie in that specific scenario so this

[256:28]

was the concept about bagging now some

[256:30]

people will be saying that Krish why are

[256:31]

you using different different models

[256:34]

guys I'm not discussing about random

[256:35]

Forest over here random Forest uses only

[256:37]

one type of model that is decision tree

[256:39]

but if we think as an concept of bagging

[256:43]

you can have different different models

[256:44]

over here and you can basically combine

[256:46]

them so this is one of the ensemble

[256:49]

techniques and this is basically called

[256:51]

as bagging okay now tell me one point I

[256:54]

missed out fine this is with respect to

[256:56]

the classification problem with respect

[256:58]

to the regression problem what will

[257:00]

happen in case of a regression problem

[257:02]

let's say that I got here 120 here 140

[257:06]

here 122 here 148 as my output so in

[257:09]

regression what will happen is that the

[257:11]

entire mean will be taken mean will be

[257:15]

taken the output mean will be basically

[257:18]

taken and that will be your output of

[257:20]

the model average or mean very simple

[257:22]

right so average or mean will be

[257:25]

basically taken up and here based on the

[257:27]

average you'll be able to solve the

[257:29]

regression problem great now let's go
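The regression case above — aggregating the 120, 140, 122 and 148 outputs by their average — is just a mean over the model predictions; a minimal sketch (`bagging_regress` is an illustrative name):

```python
def bagging_regress(model_outputs):
    # For regression the bagged predictions are aggregated by their mean.
    return sum(model_outputs) / len(model_outputs)

# The session's example outputs from the four models:
print(bagging_regress([120, 140, 122, 148]))  # -> 132.5
```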

[257:31]

ahead and try to understand with respect

[257:34]

to bagging and boosting how many

[257:36]

different types of algorithms are there but

[257:37]

before that I need to make you

[257:39]

understand what exactly is boosting now

[257:41]

here in bagging you have seen that you

[257:43]

have parallel models right one one one

[257:46]

independent you have parallel models

[257:48]

you're giving some row samples in

[257:49]

different different models and basically

[257:51]

are able to find out the output now in

[257:53]

case of boosting boosting is a

[257:56]

sequential combination of models like

[257:59]

this you have lot of sequential models

[258:03]

like this and one after the model like

[258:06]

first I'll give my training data to this

[258:07]

particular model then it will go to this

[258:09]

data then this model then this model so

[258:12]

this will be my M1 M2 M3 M4 and finally

[258:16]

I will be getting my output so here you

[258:18]

can basically say that boosting is all

[258:21]

about and this M1 M2 M3 we basically

[258:24]

mention it as weak Learners so this will

[258:26]

be weak learner weak learner weak

[258:29]

learner weak learner and finally when we

[258:32]

go till here if I combine all

[258:35]

these weak learners weak

[258:38]

learner weak learner okay once I combine

[258:41]

all this weak learner it becomes a it

[258:43]

becomes a strong learner finally if I

[258:46]

try to combine this this will basically

[258:47]

become a strong learner so here you have

[258:50]

all the models sequentially one after

[258:52]

the other and then you will probably try

[258:55]

to provide your uh input from one model

[258:58]

to the next model to the next model and

[259:00]

these all models will be very simple

[259:01]

weak learner model which will not be

[259:03]

able to predict properly but when you

[259:05]

combine all this particular models

[259:08]

together sequentially it becomes a

[259:09]

strong learner how this specifically

[259:11]

works I'll take an example of Ada

[259:13]

Boost XG Boost I will show you that okay

[259:16]

weak learner basically means the

[259:17]

prediction is very bad but as you go

[259:19]

sequentially you combine them they

[259:21]

become a strong learner okay one example
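The sequential weak-learner idea can be caricatured as learners whose small contributions are added one after another into a single strong prediction (a toy sketch only — the actual weighting schemes of AdaBoost and XGBoost come later in the session; the lambda learners below are hypothetical):

```python
def boosted_predict(weak_learners, x):
    # Boosting combines weak learners sequentially: each one adds a
    # small correction on top of what the earlier learners produced.
    prediction = 0.0
    for learner in weak_learners:
        prediction += learner(x)
    return prediction

# Toy weak learners: a rough base guess plus two corrections.
learners = [lambda x: 100.0, lambda x: 20.0, lambda x: 2.5]
print(boosted_predict(learners, None))  # -> 122.5
```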

[259:24]

I want to give you let's say that you

[259:26]

are a data scientist right let's say

[259:30]

that this model one may be a teacher

[259:33]

with respect to physics then this model

[259:35]

two may be a teacher with respect to

[259:37]

chemistry let's say model 3 is basically

[259:40]

a teacher of maths and model four is a

[259:43]

teacher of geography now suppose if you

[259:46]

are trying to solve one problem

[259:48]

obviously if the physics teacher is not

[259:50]

able to solve that particular problem

[259:51]

then probably chemistry can help or

[259:54]

maths can help or geography can help or

[259:56]

someone can help so when we combine this

[259:58]

many expertise together they will be

[260:01]

able to give you the output in an

[260:03]

efficient way Sumit I'll talk about it

[260:05]

where whether all the features are

[260:07]

basically passed to all the models or

[260:08]

not I'll just talk about it just give me

[260:10]

some time okay but I just want to give

[260:12]

you an idea about in short if someone

[260:14]

asks you in an interview what exactly is

[260:17]

boosting okay boosting is you can just

[260:21]

say that it is a sequential set of all

[260:23]

the models combined together and these

[260:25]

all models that I initialized are

[260:27]

usually weak Learners and when they are

[260:29]

combined together they become a strong

[260:30]

learner and based on the strong learner

[260:32]

it gives an amazing output and right

[260:35]

now if I say in most of the kaggle

[260:37]

competition they use different types of

[260:39]

boosting or bagging technique so we have

[260:42]

basically as I said

[260:44]

bagging and boosting in bagging what

[260:47]

kind of algorithm we specifically use we

[260:49]

use something called as random forest

[260:54]

classifier and the second model that we

[260:57]

specifically use is something called as

[260:59]

random

[261:00]

Forest regressor so we specifically use

[261:04]

these two kind of models which I'm

[261:05]

actually going to discuss right now

[261:06]

after this and then in boosting we

[261:09]

basically use techniques like AdaBoost

[261:12]

gradient Boost number three is Extreme

[261:15]

gradient boost which we also say it as

[261:17]

XG boost extreme gradient boost so let's

[261:20]

go ahead and let's discuss about the

[261:22]

first algorithm which is called as

[261:24]

random forest classifier and regressor

[261:28]

now first thing first let's understand

[261:31]

some things from the yesterday's class I

[261:33]

hope uh what is the main problem with

[261:35]

respect to decision tree whenever we

[261:37]

create a decision tree without any

[261:39]

hyperparameter does it not lead to

[261:42]

overfit

[261:43]

does it not lead to overfitting uh

[261:45]

whenever you probably have a decision

[261:48]

tree right it leads to something like

[261:50]

overfitting why overfitting because it

[261:53]

completely splits all the feature till

[261:55]

its complete depth overfitting

[261:57]

basically means for training data the

[261:58]

accuracy is high for test data the

[262:00]

accuracy is low so training data when

[262:02]

the accuracy is high I may basically say

[262:04]

it as high bias and then I may basically

[262:07]

say it as sorry not high bias low bias

[262:11]

and high variance so low bias and high

[262:14]

variance yes obviously we can do pruning

[262:16]

and all guys but again understand

[262:18]

pruning is an extensive task probably if

[262:21]

you have 100 features if you

[262:23]

have data points which is like 1 million

[262:25]

to do pruning also it is very much

[262:27]

difficult yes pre pruning can be done

[262:29]

but again we cannot confirm that it may

[262:31]

work well or not so right now with

[262:33]

respect to decision tree you have this

[262:35]

specific problem that is low bias and

[262:37]

high variance now in low bias and high

[262:39]

variance you know that my model is

[262:41]

basically the generalized model that I

[262:43]

should get it should have low bias and

[262:46]

low variance so if somebody asks you why

[262:49]

do you use random Forest you can

[262:51]

basically explain about decision trees

[262:52]

like this now my main aim is to convert

[262:54]

this High variance to low variance now I

[262:58]

will be able to convert this High

[262:59]

variance to low variance using random

[263:01]

forest classifier or random Forest

[263:03]

regressor now what does random Forest do

[263:06]

random Forest is a bagging technique

[263:08]

similarly I have a data set over here

[263:10]

let's say that I have this data set

[263:13]

and then here I will be having multiple

[263:15]

models like

[263:16]

M1

[263:19]

M2

[263:21]

M3 M4 let's say I have this four models

[263:24]

like this we have many many models now

[263:27]

with respect to this models this models

[263:29]

all the models are actually decision

[263:31]

Tree in random forest all are decision

[263:34]

trees you don't have a different model

[263:37]

over there so over here you can see that

[263:39]

all the models are decision trees that

[263:41]

is going to get used in random

[263:43]

Forest so decision trees always gets

[263:45]

used in random Forest the first thing

[263:47]

that you should know now whenever we are

[263:49]

using decision trees you know that

[263:51]

decision tree if I by default if we try

[263:53]

to create it it may lead to overfitting

[263:56]

and because of that every decision tree

[263:58]

will basically create low bias and

[264:01]

high variance but if we combine in the

[264:04]

form of bootstrap aggregator this High

[264:07]

variance will be getting converted to

[264:08]

low variance because why because

[264:10]

majority of voting we will be taking

[264:12]

from this particular decision trees like

[264:14]

there will be many many decision tree so

[264:16]

they lot of outputs will be coming and

[264:19]

with the help of majority voting

[264:20]

classifier this High variance will get

[264:22]

converted to low variance now in random

[264:24]

Forest how it works in the first case if

[264:27]

I talk about random Forest over here two

[264:29]

things basically happen with respect to

[264:30]

the D- data set let's say in first model

[264:34]

we do some kind of row

[264:36]

sampling plus

[264:38]

feature

[264:40]

sampling that basically means we have to

[264:42]

select some set of rows and some set of

[264:45]

features and give it to M1 similarly you

[264:48]

do row sampling and feature sampling and

[264:50]

give it to M2 then you do row sampling

[264:52]

and feature sampling you give it to M3

[264:54]

and then you do row sampling and feature

[264:56]

sampling you give it to M4 now when you
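The row sampling plus feature sampling done per tree can be sketched with the standard library's `random` module (the function name, the with-replacement row draw, and the feature-count choice are illustrative assumptions, not scikit-learn's API):

```python
import random

def sample_for_tree(n_rows, feature_names, seed):
    rng = random.Random(seed)
    # Row sampling WITH replacement: rows (records) may repeat,
    # exactly as described for the bootstrap step.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Feature sampling: each tree sees only a random subset of columns,
    # and different trees may get one, two, three, or all features.
    k = rng.randint(1, len(feature_names))
    features = rng.sample(feature_names, k)
    return rows, features

rows, feats = sample_for_tree(10_000, ["f1", "f2", "f3", "f4"], seed=42)
print(len(rows), sorted(feats))
```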

[265:00]

do this so what will happen

[265:01]

independently you're giving some

[265:03]

features along with some rows now there

[265:05]

may be a situation that your features

[265:07]

may also get repeated it may also get

[265:09]

repeated your records or data points may

[265:11]

also get repeated so when you are

[265:13]

probably training your model with this

[265:15]

specific data sets and specific features

[265:18]

this model become expert in predicting

[265:20]

something right as I said one example

[265:23]

over here I'm giving a physics model

[265:25]

some data I'm giving chemistry data

[265:27]

chemistry model with some data similarly

[265:29]

here I'm giving some information to some

[265:31]

model so the model will be an expert

[265:33]

with respect to that specific data So

[265:36]

based on all this particular data

[265:38]

whenever I get a new test data so what

[265:40]

will happen suppose let's say that this

[265:42]

this is a classification problem the M1

[265:44]

model will be predicting zero this will

[265:46]

be predicting one this will be

[265:47]

predicting zero and this will be

[265:49]

predicting zero now in this particular

[265:51]

case again the majority voting

[265:53]

classifier or majority voting will

[265:55]

happen in the case of classification

[265:57]

problem and then here you will be

[266:01]

specifically able to get the output as

[266:03]

zero so I hope everybody is able to

[266:06]

understand all the models over here are

[266:07]

decision trees and based on that you

[266:10]

will be doing see in an interview you

[266:12]

should be very clear the things

[266:15]

that I'm telling you over here is all

[266:17]

all the points are very much important

[266:19]

and similarly if you tell the

[266:21]

interviewer definitely your interview is

[266:22]

cracked in this kind of algorithm I've

[266:25]

seen some of my students saying that

[266:26]

okay uh Kish um when the interviewer

[266:29]

asked me that which is my favorite

[266:30]

algorithm I said random Forest I told

[266:32]

why did you say like that because he

[266:34]

said that because that person let me let

[266:36]

him ask any questions in random Forest

[266:38]

I'm very much confident about it and I'm

[266:40]

also going to prove him you know

[266:42]

why they are very very good so with this

[266:45]

specific case here you can basically see

[266:47]

that because of the overfitting

[266:49]

condition of the decision tree you're

[266:50]

combining multiple decision tree so that

[266:52]

you get a generalized model which has

[266:54]

low bias and low variance so I hope

[266:57]

everybody is able to understand

[266:59]

feature sampling basically means suppose

[267:00]

if I have 1 2 3 four feature for the

[267:04]

first model I may give two features for

[267:06]

the second model I may give three

[267:07]

features for the fourth model I may give

[267:09]

four features or uh any one feature also

[267:12]

I can give to a specific model so

[267:13]

internally that random Forest it takes

[267:15]

care of over here these things are

[267:18]

there and this is how random Forest

[267:19]

Works only the difference between random

[267:21]

Forest classifier and regressor is that

[267:23]

in regression again whatever output you

[267:25]

are basically getting you basically do

[267:26]

the mean that's it average you just do

[267:29]

the average you'll be able to get the

[267:31]

output based on all the models output

[267:33]

that you are actually getting now let's

[267:34]

talk about some of the important points

[267:36]

in random Forest the first thing first

[267:38]

question is that is normalization

[267:41]

required in random Forest then the next

[267:43]

question is that in KNN is normalization required

[267:47]

when I say normalization or

[267:49]

standardization I I'll just talk about

[267:51]

standardization is standardization is

[267:54]

required so this will be my another

[267:56]

question so is normalization required in

[267:59]

random forest or decision tree you here

[268:01]

you can also say it as decision tree is

[268:03]

it required so for this the answer will

[268:06]

be no because understand decision tree

[268:09]

will basically do the splits if you

[268:12]

minimize the data also that split won't

[268:14]

be that much important but if I talk

[268:17]

about KNN whether standardization

[268:19]

normalization required over here the

[268:21]

answer is yes because here we use two

[268:23]

things one is Euclidean distance and

[268:26]

Manhattan distance because of this you

[268:28]

definitely have to apply standardization

[268:30]

so that the computation or distance

[268:32]

becomes easy so this is one of the most
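The standardization that KNN needs before computing Euclidean or Manhattan distances is the usual z-score transform — subtract the mean, divide by the standard deviation, so every feature is on a comparable scale; a minimal stdlib sketch:

```python
import math

def standardize(values):
    # Z-score standardization: (x - mean) / std, so distance
    # computations in KNN are not dominated by one large-scale feature.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print([round(v, 2) for v in standardize([10.0, 20.0, 30.0])])  # -> [-1.22, 0.0, 1.22]
```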

[268:34]

common interview questions that is

[268:36]

basically asked in random Forest coming

[268:38]

to the third question is random Forest

[268:40]

impacted by outlier

[268:43]

over here the answer will be no just

[268:46]

check it out outside basically means

[268:48]

Google and check it out check it out in

[268:50]

Google okay perfect so I hope I've

[268:53]

covered most of the things in random

[268:54]

Forest is random Forest impacted by

[268:57]

outliers this is the third question is

[268:59]

KNN impacted by

[269:00]

outliers is this KNN algorithm impacted

[269:04]

by outliers is KNN impacted by outliers the

[269:07]

answer is yes big yes perfect so so

[269:12]

these all are the interview questions

[269:13]

that needs to be covered now let's go

[269:15]

ahead and discuss about AdaBoost now

[269:18]

in bagging most of the time we

[269:20]

specifically use random forest or you

[269:23]

can also create custom bagging

[269:25]

techniques custom bagging techniques

[269:27]

means whatever algorithm you want use

[269:29]

the combination of them and try to give

[269:32]

the output this also you can do it

[269:33]

manually with the help of hands okay

[269:36]

guys so second thing uh we are going to

[269:38]

discuss about is boosting technique in

[269:40]

this

[269:42]

the first thing that uh first algorithm

[269:44]

that we are going to discuss about is

[269:45]

AdaBoost so AdaBoost we are going to

[269:48]

discuss about how does AdaBoost uh

[269:50]

work now let's solve uh the first

[269:53]

boosting technique which is called as

[269:54]

AdaBoost okay and uh this is a

[269:57]

boosting technique um in the boosting

[270:00]

technique you have heard that we have to

[270:02]

basically solve in a sequential way this

[270:05]

at least you know I know there is a lot

[270:07]

of confusion within you all but we'll

[270:09]

try to solve a problem let's say so

[270:11]

suppose I have a data set which looks

[270:12]

like this F1 F2 F3 F4 so these are my

[270:16]

features and probably these are my

[270:18]

output okay so let's say that I'm having

[270:20]

this features like this and this is my

[270:22]

output like yes or no like this so let's

[270:25]

say that how many records I have over

[270:27]

here three

[270:30]

4 5 6 and one more is there 7 so this

[270:36]

seven records are there now in AdaBoost

[270:38]

the first thing is that

[270:40]

specifically with AdaBoost uh you

[270:42]

really need to understand that what all

[270:43]

things we can basically do how do we

[270:45]

solve this classification problem that

[270:47]

we are going to understand the first

[270:49]

thing first is that we Define a weight

[270:51]

and the weight is very much simple

[270:53]

initially to all the records to all this

[270:55]

input records we provide an equal weight

[270:58]

now how do we provide an equal weight we

[270:59]

just go and count how many number of

[271:01]

records are there now in this particular

[271:03]

case the total number of records are one

[271:06]

2 3 4 5 6 7 now every record I have to

[271:12]

provide an equal weight that is between

[271:15]

0 to 1 so the overall sum should be one

[271:19]

so in this particular case what I can do

[271:20]

if I make 1/7 1/7 1/7 to everyone

[271:24]

this will definitely become

[271:26]

equal weights to all right and if I do

[271:30]

the total sum it will obviously be one

[271:32]

let's go to the next one now after this

[271:34]

what do we do okay after this in AdaBoost

[271:37]

the first thing that we do is that we

[271:39]

take any of this feature how do you

[271:41]

decide which feature to take whether we

[271:42]

should go with F1 or whether we should

[271:44]

go with FS2 or whether we should go with

[271:46]

F3 this we can do it with the help of

[271:49]

Information Gain and Information Gain

[271:53]

and entropy or Gini right based on

[271:56]

this we can definitely understand

[271:57]

whether we should start making decision

[271:59]

here also you specifically make decision

[272:01]

trees so here what you do is that you

[272:04]

probably have to determine by using

[272:06]

which feature I have to start my

[272:07]

decision tree so suppose out of all this

[272:09]

feature one feature two feature three

[272:11]

you have selected that okay the

[272:12]

information gain and entropy of feature

[272:13]

one is higher so I'm going to use

[272:15]

feature one and probably divide this

[272:17]

into decision trees now when I divide

[272:21]

this into decision tree let's say that

[272:22]

I'm dividing like this into decision

[272:23]

tree this decision tree depth will be

[272:26]

only one one depth and this depth since

[272:29]

it has only one depth we basically call

[272:31]

it as stumps so what we do over here

[272:34]

specifically we will create a decision

[272:36]

Tre by taking only one feature and we

[272:37]

will only divide it to one level okay

[272:39]

one level or one depth that's it

[272:42]

and this is specifically called as stump

[272:45]

what we are going to do next is that

[272:46]

from this particular stump okay the

[272:48]

stump is basically getting created only

[272:51]

one so that is AdaBoost right we say

[272:52]

it as weak Learners because this is weak

[272:54]

learner weak learner why there is a

[272:57]

reason we say this as weak learner so

[273:00]

only weak learner so that is the first

[273:02]

thing with respect to uh this particular

[273:06]

AdaBoost so the first step is that

[273:07]

this is a weak learner so for the weak

[273:09]

learner we basically create a stump

[273:12]

stump basically means one level decision

[273:14]

tree that's it based on the information

[273:17]

gain and entropy I have selected the

[273:18]

feature and then I just made a decision

[273:21]

tree with only one level why it is

[273:24]

called as it is called as weak learner

[273:27]

okay so that is the reason we use only

[273:28]

stump that is just a one level decision

[273:31]

tree now the next step happens is that

[273:33]

we provide all the specific records to

[273:36]

this F1 and we train this specific model

[273:39]

only with one level decision tree we

[273:41]

train them

[273:42]

now after we train them let's say that

[273:44]

we are going to pass all these

[273:45]

particular records to find out how many

[273:47]

are correct and how many are wrong this

[273:49]

decision this decision tree is basically

[273:51]

giving so let's say that out of this

[273:53]

entire records one

[273:55]

record one record was just given as

[273:59]

wrong let's say that this is the this is

[274:01]

the record which was given as wrong okay

[274:04]

so let's say that this record output was

[274:07]

predicted wrong from this particular

[274:09]

model only one wrong was there after

[274:11]

training the model now what we need to

[274:14]

do in this specific case understand a

[274:16]

very important thing so let's say that

[274:18]

we have done this and probably after

[274:20]

this what we are actually going to do we

[274:22]

are going to calculate the total error

[274:24]

so how many error this particular model

[274:26]

made let's say that in this particular

[274:28]

case only one was wrong so this was only

[274:31]

wrong right one was wrong so if I want

[274:35]

to calculate the total error how will I

[274:37]

calculate how many how many of them are

[274:39]

wrong how many of them are wrong only

[274:40]

one is wrong what is the weight of this

[274:42]

so I will go and write 1/7 so this is

[274:45]

specifically my total error out of this

[274:47]

specific model which is my stump over

[274:49]

here okay which is my F1 stump now this

[274:53]

is my first

[274:54]

step the second step is that I need to

[274:57]

see the performance of stump which stump

[274:59]

this specific stump and the performance

[275:02]

is basically checked by a formula which

[275:04]

is 1/2 log e 1 minus total error divided by

[275:09]

total error why we are doing this

[275:11]

everything will make sense okay in just

[275:13]

time every every in just a small time

[275:16]

everything will make sense the first

[275:18]

step that we do in AdaBoost is that

[275:20]

we try to find out the total error the

[275:22]

second step we try to find out the

[275:24]

performance of stump now in this

[275:26]

particular case it will be 1/2 log e 1

[275:29]

minus 1/7 divided by 1/7 so once I calculate it

[275:35]

it will be coming as

[275:37]

0.895 F2 and F3 see again understand out

[275:42]

of all these features I found out from

[275:43]

Information Gain and entropy that this

[275:45]

is the best feature let's say that I

[275:47]

have calculated this

[275:49]

as 0.895 so this is my second step the

[275:51]

first step is find out the total error

[275:53]

the second step is performance of stump

[275:55]

what is te te basically means total

[275:57]

error te basically means total error now
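The two steps so far — total error, then performance of stump — can be checked numerically. With one wrong record out of seven, 1/2 × ln((1 − 1/7)/(1/7)) = 1/2 × ln(6) ≈ 0.896, matching the value in the session (a sketch; `performance_of_stump` is an illustrative name):

```python
import math

def performance_of_stump(total_error):
    # AdaBoost's "amount of say" for a stump:
    # 1/2 * log_e((1 - TE) / TE).
    return 0.5 * math.log((1 - total_error) / total_error)

total_error = 1 / 7   # one misclassified record out of seven
print(round(performance_of_stump(total_error), 3))  # -> 0.896
```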

[276:01]

see see the steps okay see the steps

[276:03]

whenever I'm discussing about boosting

[276:05]

I'm going to combine weak Learners

[276:07]

together to get a strong learner now

[276:09]

what is the next step out of this now

[276:11]

what what will be my third step

[276:13]

understand over here my third step will

[276:16]

be to update all these weights and that

[276:19]

is the reason why I'm calculating this

[276:20]

total error and performance of stump so

[276:23]

my third step will basically be new

[276:26]

sample weight from the decision tree one

[276:29]

which is my stump so I'll say new sample

[276:32]

weight is equal to I need to update all

[276:34]

these weights why I need to update all

[276:36]

these weights again understand I'll I'll

[276:39]

talk about it just a second so if I want

[276:41]

to update the sample weights first

[276:44]

update I will do it for correct records

[276:46]

see for correct records whichever are

[276:49]

correct like these all records are

[276:51]

correct these all records are correct

[276:53]

now when I update the weights of this

[276:55]

update the weights of this particular

[276:57]

record it should reduce and when the

[277:00]

the wrong records that I have this

[277:02]

update should increase why because

[277:06]

because if I increase this weights then

[277:08]

the wrong records that are there that

[277:11]

record should go to the next weak

[277:12]

learner that is the reason why I'm doing

[277:14]

it now how to update this particular

[277:17]

weights for correct records for correct

[277:19]

records the formula looks something like

[277:21]

this weight multiplied by weight

[277:25]

multiplied by e to the power of

[277:28]

minus this specific performance okay

[277:31]

this specific performance so e to the

[277:33]

power of PS I'll write performance of

[277:35]

stump and then I will basically be able

[277:38]

to write 1/7 * e to the power of minus

[277:43]

0.895 if I do the calculation everybody

[277:45]

try to do it the answer will be

[277:48]

0.05 now this is for correct records what

[277:50]

about incorrect records for the

[277:52]

incorrect

[277:53]

records the weights

[277:56]

the formula that we are going to apply is

[277:58]

multiplied by e to the power of plus PS not

[278:02]

minus PS plus PS so here I'll write 1 by

[278:05]

7 multiplied by e to the power of

[278:08]

0.895 so if I go and probably calculate this

[278:12]

I'm going to get it

[278:13]

as 0.349 so these two are the weights that
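The weight-update formulas above — w × e^(−performance) for correct records, w × e^(+performance) for wrong ones — can be verified directly (an illustrative helper, not a library function; 1/7 × e^(−0.896) ≈ 0.058, which the session rounds to 0.05, and 1/7 × e^(+0.896) ≈ 0.350):

```python
import math

def new_sample_weight(weight, performance, correct):
    # Correct records: w * e^(-performance) -> weight shrinks.
    # Wrong records:   w * e^(+performance) -> weight grows, so the
    # misclassified record is emphasized for the next weak learner.
    sign = -1.0 if correct else 1.0
    return weight * math.exp(sign * performance)

perf = 0.896
print(round(new_sample_weight(1 / 7, perf, correct=True), 3))   # shrinks to ~0.058
print(round(new_sample_weight(1 / 7, perf, correct=False), 3))  # grows to ~0.350
```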

[278:18]

I have got that basically means all

[278:20]

these records now which are correct 1/7

[278:23]

the new updated weights will be 0.05 0.05

[278:28]

0.05

[278:30]

0.05 sorry not for the wrong

[278:33]

records then this will be 0.05 then 0.05 and

[278:38]

0.05 so let me just see what is 1/7 so

[278:41]

here you can see initially it was 0.142

[278:45]

now it has got reduced to 0.05 because all

[278:47]

these records are correct but the wrong

[278:50]

record value is 0.349 so my weights will

[278:53]

now become over here as 0.349 now I will

[278:56]

just go and go ahead and write over here

[278:58]

my new weight my new weight is nothing

[279:01]

but 0.05

[279:06]

0.05

[279:07]

0.05 0.05 0.05 1 2 how many 1 2 3 okay fourth

[279:14]

record is here fourth record is there 1

[279:18]

2 3 4 0.05 0.05 okay how many records are

[279:22]

there 1 2 3 4 5 6 7 so my fourth record

[279:27]

will basically become the new value that

[279:30]

I'm having is something called as

[279:34]

0.349 now tell me guys if I do the

[279:37]

summation of all these weights is this

[279:39]

is it one

[279:41]

no I don't think so it is one because if

[279:44]

I try to add it up it is not one but if

[279:46]

I go and see over here these all are one

[279:48]

if I combine all the things 1 2 3 4 5 6

[279:51]

7 these all are one so here I need

[279:53]

to find out my normalized weight now in

[279:55]

order to find out the normalized

[279:57]

weight all I have to do is that what I

[280:00]

have to do because the entire summation

[280:03]

should be one so we have to

[280:05]

normalize now in order to normalize all

[280:08]

you have to do is that go and find out

[280:10]

what is the sum of all this things the

[280:12]

summation of all these things will be

[280:15]

0.649 all you have to do is that divide

[280:18]

all the numbers

[280:20]

by 0.649 divided by

[280:24]

0.649

[280:26]

0.649 like this divide all the numbers by

[280:28]

0.649 and tell me what will be the answer

[280:30]

that you'll be getting so here your

[280:32]

normalized weight will now look like

[280:35]

0.077 0.077 and this value will be somewhere

[280:39]

around uh

[280:41]

0.537 I guess in this case then this will

[280:44]

be 0.077

[280:47]

0.077 here we are going to divide by all

[280:50]

this by 0.649 now this is my normalized

[280:53]

weight now after you get a normalized
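The normalization step — divide every updated weight by the 0.649 total so the new weights again sum to one — as a quick sketch using the session's rounded values (0.05 for the six correct records, 0.349 for the wrong one; `normalize_weights` is an illustrative name):

```python
def normalize_weights(weights):
    # Divide every weight by the total so the new weights sum to 1.
    total = sum(weights)
    return [w / total for w in weights]

# Six correct records at the rounded 0.05 and one wrong record at
# 0.349, as in the session (total = 0.649).
updated = [0.05] * 3 + [0.349] + [0.05] * 3
normalized = normalize_weights(updated)
print([round(w, 3) for w in normalized])  # -> 0.077 for correct, 0.538 for wrong
```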

[280:56]

weight we will try to create something

[280:57]

called as buckets because see one

[281:00]

decision tree we have already created

[281:02]

which is a stump and you know from this

[281:04]

particular stump what you're going to get

[281:06]

okay as an output then in the sequential

[281:09]

model we will go and combine another

[281:11]

model over here now it's the time that I

[281:13]

have to create this specific model now

[281:15]

in order to create this specific model I

[281:17]

need to provide some specific rows only

[281:19]

to this model to train because this

[281:21]

model is giving one wrong now what I

[281:24]

have to do is that whatever is wrong

[281:26]

along with other data points I need to

[281:28]

provide this specific model with those

[281:30]

records so that this model will be able

[281:33]

to train on this and probably be able to

[281:35]

get the output now let's create buckets

[281:38]

now based on buckets how the buckets

[281:39]

will be created over here I will take 0.07

[281:43]

until

[281:45]

sorry whatever is the value over here

[281:48]

normalized weight value okay so I will start

[281:50]

creating my buckets buckets basically

[281:52]

from 0 to

[281:53]

0.07 what did I say now for this decision

[281:57]

tree or stump I need to provide some

[282:00]

records so the maximum number of record

[282:02]

that should be going should be the wrong

[282:05]

records that should go over here now how

[282:07]

do we decide that okay there should be a

[282:09]

way that we should be able to say that

[282:11]

that specific number of wrong records

[282:13]

should go to that decision tree so for

[282:16]

that purpose what we do is that this

[282:18]

decision tree will randomly create some

[282:20]

numbers between 0 to 1 randomly create

[282:25]

those numbers between 0 to 1 and

[282:27]

whichever bucket it will come in like 0.07

[282:30]

to 0.14, 0.14 to 0.21 basically means 0.21

[282:37]

then 0.21 see how the bucket is

[282:40]

getting created this value is getting added

[282:42]

to this so that becomes this bucket 0.21

[282:45]

+ 0.537 how much it is it is nothing but

[282:50]

0.747 then 0.747

[282:55]

to

[282:57]

0.751 like this you create all the buckets

[283:00]

okay you can create all the buckets now

[283:02]

tell me which record is basically having

[283:04]

the biggest bucket size obviously this

[283:07]

record so if I randomly create a number

[283:10]

between 0 to one what is the highest

[283:13]

probability that the values will be

[283:15]

going in so in this particular case most

[283:17]

of the wrong records will be passed

[283:18]

along with the other records obviously

[283:20]

other records there are chances that

[283:22]

other records will go to the next

[283:24]

decision tree but understand maximum

[283:26]

number will go with the wrong records

[283:28]

because the bucket is high over here so

[283:31]

the bucket is high over here so most of

[283:32]

the time this specific record will get

[283:35]

selected and then it will be sent

[283:37]

to the second tree now suppose I have

[283:40]

this all records
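The bucket construction and weighted resampling described above can be sketched as follows. The normalized weights are illustrative, and `pick_record` is a hypothetical helper written for this sketch, not a library function:

```python
import random
from itertools import accumulate

# Illustrative normalized weights: the third record is the misclassified
# one, so its bucket (cumulative range) is by far the widest.
normalized = [0.077, 0.077, 0.537, 0.077, 0.077, 0.077, 0.077]

# Bucket upper edges are the running (cumulative) sums of the weights:
# [0, 0.077), [0.077, 0.154), [0.154, 0.691), ...
edges = list(accumulate(normalized))

def pick_record(r):
    """Index of the bucket a random number r in [0, 1) falls into."""
    for i, upper in enumerate(edges):
        if r < upper:
            return i
    return len(edges) - 1  # weights sum to ~1, guard against rounding

# Draw as many rows as the dataset has to train the next stump; the
# misclassified record (index 2) is selected most often.
rng = random.Random(0)
sample = [pick_record(rng.random()) for _ in range(7)]
```

Because the misclassified record owns the widest cumulative range, a uniform random draw lands in its bucket more often than in any other, which is exactly why the next stump sees mostly the wrong records.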

[283:41]

so this is my first stump this is my

[283:44]

second stump this is my third stump

[283:47]

similarly the third stump from the

[283:48]

second stump whichever wrong records

[283:50]

will be going maximum number of Records

[283:52]

will go over here then again it will be

[283:54]

trained like this we'll be having lot of

[283:56]

stumps minimum 100 decision trees can be

[283:59]

added you know that every decision tree

[284:01]

will give one output for a new test data

[284:03]

new test data this weak learner will

[284:05]

give one output this weak learner will

[284:07]

give one output this weak learner and

[284:09]

this weak learner will be giving

[284:10]

one output obviously the time complexity

[284:12]

will be more now from this particular

[284:14]

output suppose it is a binary

[284:16]

classification I will be getting 0 1 1 1

[284:19]

so again over here majority voting will

[284:21]

happen and the output will be one in

[284:24]

case of regression problem I will be

[284:25]

having a continuous value over here and

[284:28]

for this the average average will be

[284:31]

computed and that will give me an output

[284:33]

over here so for regression the average

[284:36]

will be done for classification what

[284:39]

will happen majority voting will be

[284:41]

happening so everywhere that same part
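The two aggregation rules just described — majority vote for classification, averaging for regression — can be sketched as:

```python
from collections import Counter
from statistics import mean

def aggregate_classification(outputs):
    """Majority vote over the weak learners' class predictions."""
    return Counter(outputs).most_common(1)[0][0]

def aggregate_regression(outputs):
    """Average of the weak learners' continuous predictions."""
    return mean(outputs)

# For the 0, 1, 1, 1 example above, the majority vote is 1.
vote = aggregate_classification([0, 1, 1, 1])
```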

[284:43]

will be going on buckets is very much

[284:45]

simple guys buckets basically means

[284:47]

based on this weights normalized weight

[284:49]

we are going to create bucket so that

[284:51]

whichever records has the highest bucket

[284:53]

based on this randomly created number you

[284:55]

know it will select those specific

[284:57]

records and put them into the next stump

[284:59]

understand why this bucket size is Big

[285:02]

the other wrong records which are

[285:03]

present right suppose there are more

[285:05]

than four to five wrong records their

[285:06]

bucket size will also be bigger and

[285:08]

because based on this randomly creating

[285:10]

numbers between 0 and 1 most of the wrong

[285:12]

records will be selected and given to

[285:14]

the second stump similarly this

[285:16]

particular decision tree will be doing

[285:17]

some mistakes then that wrong records

[285:19]

will get updated all the weights will

[285:20]

get updated and it will be passed to the

[285:22]

next decision tree guys when I say wrong

[285:24]

record the output will be same only no

[285:26]

zero and one so interesting everyone I

[285:29]

hope you understood so much of maths in

[285:31]

adab boost and how adab boost actually

[285:33]

work three main things one is total

[285:35]

error one is performance of stump and

[285:37]

one is the new sample weight these

[285:39]

things are getting calculated extensively

[285:41]

the normalized weight was basically used

[285:43]

because the sum of all these weights are

[285:45]

approximately equal to one when boosting

[285:48]

why not take the last output no no no we

[285:50]

have to give the importance of every

[285:52]

decision tree output every decision tree

[285:55]

output are important okay let me talk

[285:57]

about one model which is called as

[285:59]

blackbox model versus white box what is

[286:03]

the difference between blackbox model

[286:04]

and white box if I take an example of

[286:07]

linear regression tell me what kind of

[286:09]

model it is is it a white box model

[286:12]

or black box if I take an example of

[286:14]

random

[286:15]

Forest is this a white box or black box

[286:18]

if I take an example of decision tree it

[286:21]

is a white box or blackbox model if I

[286:23]

take an example of a Ann is it a white

[286:26]

box or blackbox model linear regression

[286:28]

is basically called as a white box model

[286:30]

because here you can basically visualize

[286:33]

how the Theta value is basically

[286:35]

changing and how it is coming to a

[286:36]

global Minima and all those things in

[286:38]

random Forest I will say this as

[286:40]

blackbox model because it is impossible

[286:42]

to see all the decision tree how it is

[286:44]

working so that is the reason the maths

[286:46]

is so complex inside this if I talk

[286:49]

about decision tree this is basically a

[286:50]

white box model because in decision tree

[286:52]

we know how the split are basically

[286:54]

happening with the help of paper and pen

[286:55]

you'll be able to do it in the case of

[286:58]

an Ann this is a blackbox model because

[287:00]

here you don't know like how many

[287:02]

neurons are there how they are

[287:03]

performing and how the weights are

[287:05]

getting updated so this is the basic

[287:07]

difference between the blackbox and uh

[287:10]

uh white box model this entire thing is

[287:13]

the agenda of today's session so let's

[287:15]

start uh the first algorithm that we are

[287:17]

probably going to discuss today is

[287:19]

something called as K

[287:21]

means

[287:23]

clustering K means clustering and this

[287:26]

is a kind of unsupervised machine

[287:28]

learning now always remember

[287:31]

unsupervised machine learning basically

[287:33]

means that uh the one and the most

[287:35]

important thing is that in unsupervised

[287:38]

machine learning

[287:41]

in unsupervised ml you don't have any

[287:44]

specific output so you don't have any

[287:46]

specific output so suppose you have

[287:48]

feature one and feature two and suppose

[287:50]

you have datas different different data

[287:53]

you know and based on this data what we

[287:55]

do we basically try to create clusters

[287:58]

this clusters basically says what are

[288:00]

the similar kind of data so this is what

[288:03]

we basically do from uh clustering and

[288:06]

there are various techniques like K

[288:08]

means uh there is hierarchical clustering and all

[288:10]

so first of all we'll try to understand

[288:12]

about K means and how does it

[288:14]

specifically work it's simple uh suppose

[288:17]

you have a data points like this okay

[288:20]

let's say that this is your F1 feature

[288:21]

F2 feature and based on this in two

[288:23]

dimensional probably I will be plotting

[288:26]

this points and suppose this is my

[288:28]

another points so our main purpose is

[288:31]

basically to Cluster together in

[288:34]

different different groups okay so this

[288:36]

will be my one group and probably the

[288:38]

other group will be this group right so

[288:40]

two groups because obviously you can see

[288:42]

from this clusters here you have two

[288:44]

similar kind of data which is basically

[288:47]

grouped together right this is my

[288:49]

cluster one and this is my cluster 2 let

[288:51]

me talk about this and why specifically

[288:54]

it'll be very much useful then we'll try

[288:56]

to understand about math intuition also

[288:58]

now always understand guys uh where does

[289:00]

clustering gets used okay in most of the

[289:03]

Ensemble techniques I told you about

[289:05]

custom ensemble technique right so custom

[289:08]

ensemble techniques in custom ensemble

[289:11]

techniques you know whenever we are

[289:13]

probably creating a model first of all

[289:15]

on our data set what we do is that we

[289:18]

create clusters so suppose this is my

[289:20]

data set during my model creation the

[289:22]

first algorithm we will probably apply

[289:24]

will be clustering algorithm and after

[289:26]

that it is obviously good that we can

[289:28]

apply regression or classification

[289:30]

problem suppose in this clustering I

[289:32]

have two or three groups let's say that

[289:34]

I have two or three groups over here for

[289:36]

each group we can apply a separate

[289:40]

supervised machine learning algorithm if

[289:42]

we know the specific output that we

[289:44]

really want to take ahead I'll talk

[289:46]

about this and uh give you some of the

[289:48]

examples as I go ahead now let's go on

[289:51]

go ahead and focus more on understanding

[289:53]

how does K-means clustering algorithm work

[289:56]

so let's go over here the word K means

[289:59]

has this K value this K is nothing but

[290:02]

this K basically means centroids K

[290:05]

basically means centroids so suppose if

[290:08]

I have a data set which looks like this

[290:10]

let's say that this is my data set now

[290:12]

over here just by seeing the data set

[290:14]

what are the possible groups you think

[290:16]

definitely you'll be saying K is equal

[290:18]

to 2 So when you say k is equal to two

[290:20]

that basically means you will be able to

[290:22]

get two groups like this and each and

[290:24]

every group will be having a centroid a

[290:28]

centroid Point here also there will be a

[290:30]

centroid point so this centroid will

[290:32]

determine basically this is a separate

[290:34]

group over here this is a separate group

[290:36]

over here so over here here you can

[290:38]

definitely say that fine this is two

[290:40]

groups but but how do we come to a

[290:41]

conclusion that there are only two groups

[290:44]

okay we cannot just directly say that

[290:46]

okay we'll try to just by seeing the

[290:48]

data because your data will be having a

[290:50]

high dimension data right right now I'm

[290:52]

just showing your two Dimension data but

[290:55]

for a high dimension data definitely

[290:56]

you'll not be able to see the data

[290:58]

points how it is plotted so how do you

[291:00]

come to a conclusion that only two

[291:02]

groups are there so for this there is

[291:03]

some steps that we basically perform in

[291:05]

K means the first step is that we try

[291:08]

with different K values we try with

[291:11]

different K values and which is the

[291:13]

suitable K value K is nothing but

[291:15]

centroids okay it is nothing but

[291:18]

centroids we try with different

[291:20]

different centroids in this particular

[291:22]

case let's say that I have this

[291:24]

particular data point and I actually

[291:27]

start with k is equal 1 or 2 or 3 any

[291:29]

one you want let's say that I'm going to

[291:31]

start with k is equal 2 how to come up

[291:34]

with this K is equal to 2 as a perfect

[291:37]

value that I'll talk about it we need to

[291:39]

know there is a concept which is called

[291:41]

as within cluster sum of square so when

[291:43]

we try different K values let's say that

[291:45]

for K is equal to 2 what will happen the

[291:47]

first step we select a we try K values

[291:50]

so let's say that we are considering K

[291:52]

is equal to 2 the second step is that we

[291:54]

initialize K number of centroids now in

[291:57]

this particular case I know my K value

[291:59]

is 2 so we will be initializing randomly

[292:02]

let's say that K is equal to 2 so what

[292:05]

we can actually do let's say that this

[292:07]

is this is my one centroid I will I'll

[292:09]

put it in another color so this will be

[292:11]

my one centroid and let's say that this

[292:13]

is my another centroid so I have

[292:15]

initialized two centroids randomly in

[292:17]

this space now after this particular

[292:19]

centroid what we have to do is that

[292:21]

after initializing this centroid what we

[292:23]

have to do is that we have to basically

[292:26]

find out which points are near to the

[292:29]

centroid and which points are near to

[292:31]

this centroid now in order to find out

[292:33]

it is a very easy step we can basically

[292:35]

use Euclidean distance to find out the

[292:38]

distance between the points in an easy

[292:40]

way if I really want to show you that

[292:44]

you know like how many points I want to

[292:46]

in an easy way what I can do I can

[292:48]

basically draw a straight line over here

[292:50]

let's say that I'm drawing a straight

[292:51]

line over here in another color I can

[292:54]

draw a straight line and I can also draw

[292:56]

one parallel line like this so This

[292:58]

basically indicates that whichever

[293:01]

points you see over here suppose if I

[293:03]

draw a straight line in between all

[293:05]

these points you will be able to see

[293:07]

that let's say that I'm drawing one more

[293:09]

parallel line

[293:11]

which is intersecting together so from

[293:14]

this you can definitely find out let's

[293:16]

say that these are all my points that

[293:17]

are nearer to this green line Green

[293:20]

Point so what I'm actually going to do

[293:21]

in this particular case all these points

[293:24]

that you are seeing near the green it

[293:26]

will become green color so that

[293:28]

basically means this is basically nearer

[293:30]

to this centroid and whichever points

[293:33]

are nearer to this particular point that

[293:35]

will become red point so that basically

[293:38]

means this belongs to this group okay

[293:40]

this belongs to this group so I hope

[293:42]

everybody's clear till here then what

[293:44]

will happen is that this summation of

[293:48]

all the values then we initialize the K

[293:51]

number of centroids that is done then we

[293:53]

try to calculate the distance we try to

[293:55]

find out which all points is nearer to

[293:57]

the centroid let's say that this is my

[293:58]

one centroid this is my another centroid

[294:01]

and we have seen that okay these all

[294:02]

points belong to this centroid it near

[294:05]

to this particular centroid so this is

[294:07]

becoming red so that is based on the

[294:09]

shortest distance and here it is

[294:11]

becoming green now the next step let's

[294:13]

see what is the next step after this so

[294:15]

I am going to remove this thing now the

[294:17]

next step will be that the entire points

[294:20]

that is in red color all the average

[294:22]

will be taken so here again the average

[294:25]

will be taken now third step here I'm

[294:28]

going to write here we are going to

[294:30]

compute the average the reason we

[294:32]

compute the average is that because we

[294:34]

need to update the centroid so compute

[294:37]

the average to update centroid to update

[294:40]

centroids so here you'll be able to see

[294:42]

that what I'm actually doing as soon as

[294:45]

we compute the average this centroid is

[294:47]

going to move to some other location so

[294:50]

what location it will move it will

[294:51]

obviously become somewhere in Center so

[294:53]

here now I'm going to rub this and now

[294:56]

my new centroid will be this point where

[294:58]

I am actually going to draw like this

[295:00]

let's say this is my new centroid now

[295:02]

similarly this thing will happen with

[295:04]

respect to the green color so with

[295:06]

respect to the green color also it will

[295:08]

happen and this green will also get

[295:10]

updated so I'm going to rub this and

[295:12]

this will be my new Green Point which

[295:14]

will get updated over here then again

[295:16]

what will happen again the distance will

[295:18]

be calculated and again a perpendicular

[295:20]

line will be calculated here you can see

[295:22]

that now all the points are towards

[295:25]

there okay again the centroid based on

[295:27]

this particular distance again it will

[295:29]

be calculated and here you can see that

[295:31]

all the points are in its own location

[295:33]

so here now no update will actually

[295:36]

happen let's say that there was one

[295:38]

point which was red color over here

[295:41]

then this would have become green color

[295:42]

but since the updation has happened

[295:44]

perfectly we are not going to update it

[295:46]

and we are not going to update the

[295:48]

centroid right so now you can understand

[295:51]

that yes now we have actually got the

[295:53]

perfect centroid and now this will be

[295:56]

considered as one group and this will be

[295:58]

basically considered as the another

[296:00]

group it will not intersect but right by

[296:02]

default here intersection is happening

[296:05]

so I hope everybody's understood the

[296:07]

steps that you have actually followed in

[296:09]

initializing the centroids in updating

[296:12]

the centroids and in updating the points
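The loop just walked through — assign points to their nearest centroid, recompute each centroid as the average of its points, repeat until nothing changes — can be sketched in plain Python. This is a minimal illustration, not a production implementation: the starting centroids are passed in explicitly here, whereas in practice they are initialized randomly as described:

```python
import math

def kmeans(points, init_centroids, iters=20):
    """Lloyd-style K-means: assign each point to its nearest centroid by
    Euclidean distance, recompute each centroid as the average of its
    assigned points, and stop when no update happens."""
    centroids = list(init_centroids)
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:  # average of the assigned points -> updated centroid
                new_centroids.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:        # empty cluster: keep the old centroid
                new_centroids.append(old)
        if new_centroids == centroids:  # no point changed -> converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups in 2D, with K = 2 starting centroids.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, [(0, 0), (10, 10)])
```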

[296:14]

is it clear everybody with respect to K

[296:17]

means now let's discuss about one

[296:20]

point how do we decide this K value okay

[296:24]

how do we decide this K value so for

[296:26]

deciding the K value there is a concept

[296:27]

which is called as elbow method so here

[296:31]

I'm going to basically Define my elbow

[296:32]

method now elbow method says something

[296:35]

very much important because this will

[296:37]

actually help us to find out what is the

[296:40]

optimized K value whether the K value

[296:42]

should be two whether uh the K value is

[296:45]

going to be three whether the K value is

[296:47]

going to become four and always

[296:49]

understand suppose this is my data set

[296:51]

suppose this is my data set initially

[296:53]

let's say that I have my data points

[296:54]

like this we cannot go ahead and

[296:57]

directly say that okay K is equal to

[296:59]

2 is going to work so obviously we are

[297:01]

going to go with iteration for I is

[297:04]

equal to probably 1 to 10 I'm going to

[297:06]

move towards iteration from 1 to 10

[297:09]

let's say so for every iteration we will

[297:11]

construct a graph with respect to K

[297:14]

value and with respect to something

[297:16]

called as WCSS now what is this WCSS

[297:20]

WCSS basically means within cluster sum

[297:23]

of

[297:24]

square okay this is the meaning of wcss

[297:27]

within cluster sum of square now let's

[297:30]

say that initially we start with one

[297:33]

centroid so one centroid let's say it is

[297:35]

initialized here one centroid is

[297:37]

basically initialized here if we go and

[297:39]

compute the distance

[297:40]

between each and every points to the

[297:43]

centroid and if we try to find out the

[297:45]

distance will the distance value be

[297:47]

greater or it will be smaller will it be

[297:50]

smaller or greater tell me if you try to

[297:53]

calculate this distance from this

[297:55]

centroid to every point this is what is

[297:57]

within cluster sum of square it will

[298:00]

always be very very much greater so

[298:02]

let's say that my first point has come

[298:04]

somewhere here it is going to be

[298:06]

obviously greater let's say that my

[298:07]

first point is coming over here find

[298:10]

So within K is equal to 1 initially we

[298:12]

took and we found out the distance of w

[298:14]

CSS and it is a very huge value okay

[298:17]

because we're going to compute the

[298:18]

distance between each and every point to

[298:20]

the centroid now the next thing that I'm

[298:23]

actually going to do is that now we'll

[298:26]

go with next value that is K is equal to

[298:28]

2 now in K is equal to 2 I will

[298:31]

initialize two points okay I will

[298:34]

initialize two points and then probably

[298:36]

I will do the entire process which I

[298:38]

have written on the top now tell me

[298:40]

whichever points is nearer to this green

[298:42]

point if we compute the distance and

[298:46]

whichever points is nearer to the red

[298:48]

point if you compute the distance like

[298:52]

this now this summation of the distance

[298:55]

will be lesser than the previous WCSS

[298:57]

or not obviously it is going to be

[299:00]

lesser than the previous WCSS so what

[299:02]

I'm actually going to do probably with K

[299:04]

is equal to 2 your value may come

[299:06]

somewhere here then with K is equal to 3

[299:09]

your value May come somewhere here then

[299:10]

K is equal to 4 will come here to 5 6

[299:13]

like this it will go so here if I

[299:15]

probably join this line you'll be able

[299:17]

to see that there will be an abrupt

[299:19]

change in the WCSS value in the WCSS

[299:23]

value there will be an abrupt change

[299:25]

and this this is basically called as

[299:27]

elbow curve now why we say it as elbow

[299:30]

curve because it is in the shape of

[299:32]

elbow and here at one specific point

[299:34]

there will be an Abrupt change and then

[299:36]

it will be straight so that is the

[299:38]

reason why we basically say this as

[299:41]

elbow okay so this is a very important

[299:43]

thing see in finding the K value we use

[299:46]

elbow method but for validating purpose

[299:49]

how do we validate that this model is

[299:52]

performing well we use silhouette score that

[299:54]

I'll show you just in some time but

[299:57]

understand that in K means clustering we

[300:00]

need to update the centroids and based

[300:02]

on that we calculate the distance and as

[300:05]

the K value keep on increasing you'll be

[300:07]

able to see that the distance will

[300:09]

become normal or the WCSS value will

[300:12]

become normal and then we really need to

[300:14]

find out which is the right K value where

[300:17]

the abrupt change see over here suppose

[300:20]

abrupt change is there and then it is

[300:21]

normal then I will probably take this as

[300:24]

my K value so obviously the model

[300:26]

complexity will be high because we are

[300:28]

going to check with respect to different

[300:30]

different K values and WCSS values and

[300:33]

this basically means that the value that

[300:36]

we'll probably get first of all we need

[300:38]

to construct this elbow curve then see

[300:40]

the changes where it is basically

[300:42]

happening we'll need to find out the

[300:43]

abrupt change and once we get the abrupt

[300:46]

change we basically say that this may be

[300:49]

the K value so K is equal to 4 as an

[300:52]

example I'm telling you so unless and
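The WCSS quantity driving the elbow curve can be sketched as below. The points and centroid positions are illustrative; the abrupt drop from K = 1 to the "right" K = 2 is what the elbow shows:

```python
import math

def wcss(points, centroids):
    """Within-cluster sum of squares: each point contributes the squared
    Euclidean distance to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# K = 1: a single centroid in the middle, so every distance is large.
k1 = wcss(pts, [(5.5, 5.5)])
# K = 2: one centroid per natural group, WCSS drops abruptly -- the elbow.
k2 = wcss(pts, [(1/3, 1/3), (31/3, 31/3)])
```

Plotting WCSS against K for K = 1, 2, 3, … and looking for where the curve stops dropping sharply is exactly the elbow method described above.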

[300:54]

until if you really want to find the

[300:56]

cluster it is very much simple we take a

[300:59]

k value we initialize K number of

[301:01]

centroids we compute the average to

[301:03]

update the centroids then again we try

[301:05]

to find out the distance try to see that

[301:07]

whether any points has changed and

[301:08]

continue that process unless and until

[301:10]

we get separate groups okay so this is

[301:14]

the entire funda of K-means clustering

[301:16]

so finally you'll be able to see that

[301:18]

with respect to the K value we will be

[301:20]

able to get that many number of groups

[301:22]

if my K value is four that basically

[301:24]

means I will be probably getting four

[301:26]

different groups like this 1 two right

[301:30]

three like this and four I will be

[301:32]

getting four groups like this with K is

[301:34]

equal to 4 that basically means K is

[301:35]

equal to four clusters and every group

[301:38]

will be having its own centroids okay

[301:41]

every group will be having one okay

[301:42]

centroids are very much important yes

[301:45]

I'll try to show you in the coding also

[301:47]

guys let's go towards the second

[301:48]

algorithm the second algorithm that we

[301:51]

will be probably discussing is called as

[301:54]

hierarchical clustering now hierarchical

[301:56]

clustering is very much simple guys all

[301:58]

you have to do is that let's say this is

[302:00]

your data points this is your data

[302:01]

points and this is my P1 let's say P2

[302:04]

now hierarchical clustering says that we will

[302:07]

go step by step the first thing is that

[302:10]

we will try to find out the most nearest

[302:12]

Value let's say this is my X and Y let's

[302:15]

say these are my points like this is my

[302:18]

P1 point this is my P2 point this is my

[302:21]

P3 point this is my P4 Point P5 Point P6

[302:25]

point p7 point okay so these are my

[302:28]

points that I have actually named over

[302:29]

here let's say that this may be the

[302:31]

nearest point to each other so what it

[302:32]

will do it will combine this together

[302:34]

into one cluster this we have computed

[302:37]

the distance so it will create one

[302:39]

cluster now what will happen on the

[302:41]

right hand side there will be another

[302:42]

notation which you may be using in

[302:45]

connecting all the points one so suppose

[302:46]

this is my P1 this is my P2 this is my

[302:50]

P3 P4 let's say that I have this many

[302:53]

points and probably I will also try to

[302:56]

make

[302:57]

p7 so these are my points p7 now you

[303:00]

know that the nearest point that we are

[303:02]

having okay this will probably be

[303:04]

distance 1 2 3 this is distance okay 4 5

[303:09]

6 like this we have lot of distance so

[303:12]

hierarchical clustering will first of all find

[303:14]

out the nearest point and try to compute

[303:17]

the distance between them and just try

[303:18]

to combine them together into one what

[303:21]

do we do we basically combine them into

[303:23]

one group okay so P1 and P2 has been

[303:26]

combined let's say then it'll go and

[303:29]

find out the other nearest point so

[303:31]

let's say P6 and p7 are near so they are

[303:33]

also going to combine into one group so

[303:35]

once they combine into one group then we

[303:37]

have P6 and p7 which will be obviously

[303:40]

greater than the previous distance and

[303:42]

we may get this kind of computation and

[303:44]

another combination or cluster will form

[303:47]

get formed over here then you have seen

[303:49]

that okay P3 and P5 are nearer to each

[303:52]

other so we are going to combine this so

[303:54]

I'm going to basically combine P3 and

[303:57]

P5 okay and let's say that this distance

[303:59]

is greater than the previous one because

[304:02]

we are basically going to start with

[304:03]

the shortest distance and then we are

[304:05]

going to capture the longest distance

[304:07]

now this is done now you can see that

[304:08]

the next point that is near right to

[304:11]

this particular group is P4 so we are

[304:13]

going to combine this together into one

[304:15]

group so once we combine this into one

[304:17]

group this P4 will get connected like

[304:20]

this let's say it is getting connected

[304:23]

like this P4 has got connected then what

[304:25]

is the nearest Point whether it is P6 p7

[304:28]

group or P1 P2 obviously here you can

[304:30]

see that P1 P2 is there so I am probably

[304:32]

going to combine this group together

[304:34]

that basically means P1 P2 let's say I'm

[304:38]

just going to combine this group group

[304:40]

together again circle is coming so I

[304:42]

will make a dot let's say I'm going to

[304:43]

combine this group together because

[304:45]

these are my nearest groups so what will

[304:47]

happen P1 and P2 will get combined to P5

[304:50]

sorry P4 P5 this one so I will be

[304:53]

getting another line like this and then

[304:55]

finally you'll be seeing that P6 p7 is

[304:57]

the nearest group to this so this will

[305:00]

totally get combined and it may look

[305:02]

something like this so this will become

[305:05]

a total group like

[305:07]

this so all the groups are combined so

[305:10]

finally you'll be able to see that there

[305:11]

will be one more line which will get

[305:13]

combined like

[305:14]

this is basically called as

[305:17]

a dendrogram okay which is like

[305:21]

bottom root to top now the question

[305:24]

arises is that how do you find that how

[305:25]

many groups should be here how do you

[305:27]

find out that how many groups should be

[305:29]

here the funa is very much Clear guys in

[305:32]

this is that you need

[305:34]

to find the longest

[305:41]

vertical line you need to find out the

[305:43]

longest vertical line that has no

[305:46]

horizontal line pass through it no

[305:49]

horizontal

[305:51]

line passed through it this is very much

[305:54]

important that has no horizontal line

[305:56]

pass through it now what this is

[305:58]

basically meaning is that I will try to

[306:00]

find out the longest line longest

[306:03]

vertical line in such a way that none of

[306:06]

the horizontal line passes through it

[306:07]

what is horizontal line suppose if I

[306:09]

consider this vertical line This

[306:11]

vertical line over here if you see that

[306:13]

if I extend this green line it is

[306:15]

passing through this if I extend this

[306:17]

line it is passing through this right if

[306:20]

I'm extending this line it is passing

[306:21]

through this right so out of this the

[306:25]

longest line that may be passing in such

[306:27]

a way that no horizontal line probably

[306:29]

is this line that I can actually see so

[306:31]

what you do over here is that you

[306:33]

basically just create a straight line

[306:35]

over this and then you try to find out

[306:37]

that how many clusters it will be there

[306:39]

by understanding that how many lines it

[306:41]

is passing through if it is passing

[306:42]

through this one line two line three

[306:44]

line four line that basically means your

[306:47]

clusters will be four

[306:49]

clusters this is how we basically do the

[306:52]

calculation in hierarchical clustering again
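
The dendrogram-cut idea described above can be sketched with scipy; the data points and the requested number of groups below are hypothetical, just to show the mechanics:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D data with two well-separated groups (hypothetical example)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# linkage builds the bottom-to-top merge tree behind the dendrogram;
# scipy.cluster.hierarchy.dendrogram(Z) would draw it
Z = linkage(X, method="ward")

# cutting at the longest vertical stretch with no horizontal merge line
# crossing it amounts to picking a distance threshold; here we simply
# ask for the resulting number of groups directly
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Counting how many vertical lines the cut crosses gives the cluster count, which is what `criterion="maxclust"` encodes.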

[306:56]

here it may not be the perfect line I've

[306:58]

just drawn with some assumptions but if

[307:00]

you are trying to do this probably you

[307:02]

have to do in this specific way okay

[307:04]

I've already uploaded a lot of practical

[307:06]

videos with respect to hierarchical

[307:08]

clustering and all now now tell me

[307:11]

maximum effort or maximum time is taken

[307:15]

by is taken

[307:18]

by K

[307:20]

means or hierarchical clustering this is a

[307:25]

question for you yes guys number of

[307:26]

clusters may be three but here I'm just

[307:29]

showing you that how many lines it may

[307:31]

be passed by how do you basically

[307:34]

determine whether maximum time will be

[307:36]

taken by K means or hierarchical clustering this is

[307:38]

an interview question the maximum time

[307:40]

that will be taken is by hierarchical

[307:45]

clustering why because let's say that I

[307:48]

have many many many data points at that

[307:51]

point of time hierarchical clustering will

[307:53]

keep on constructing this kind of

[307:55]

dendrograms and it will be taking many

[307:58]

many many time a lot of time right so hierarchical

[308:02]

clustering will take more time maximum

[308:05]

time that it is going to basically take

[308:07]

so it is very much important that that

[308:09]

you understand which is making basically

[308:12]

taking more time so if your data set is

[308:15]

small you may go ahead with hierarchical

[308:18]

clustering if your data set is large go

[308:21]

with K means clustering go with K means

[308:23]

clustering in short both will take more

[308:25]

time but K means will perform better than

[308:28]

hierarchical clustering see guys you will be

[308:30]

forming this kind of dendrograms right

[308:33]

and just imagine if you have 10 features

[308:34]

and many data points how you're going to

[308:37]

do it it will be a cumbersome process

[308:40]

you'll not be even able to see this

[308:42]

dendrogram properly and manually

[308:44]

obviously you cannot do it so this was

[308:46]

with respect to K means clustering and

[308:49]

hierarchical clustering I hope everybody's
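
The small-data/large-data trade-off just discussed can be sketched like this; the synthetic data size is an assumption and the timings will vary by machine:

```python
import time
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))  # many points, 10 features (hypothetical)

t0 = time.perf_counter()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
kmeans_secs = time.perf_counter() - t0

t0 = time.perf_counter()
# agglomerative (hierarchical) clustering works on pairwise merges,
# which scales roughly quadratically with the number of points
agg = AgglomerativeClustering(n_clusters=3).fit(X)
hier_secs = time.perf_counter() - t0

print(f"KMeans: {kmeans_secs:.3f}s  hierarchical: {hier_secs:.3f}s")
```

On large datasets the hierarchical run typically takes noticeably longer, which is the point made above.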

[308:51]

understood now the next topic that we'll

[308:53]

focus on is that how do we

[308:56]

validate see how do we validate a

[308:59]

classification problem we use

[309:00]

performance metric like confusion Matrix

[309:03]

accuracy um different different true

[309:05]

positive rate Precision recall but how

[309:07]

do we validate clustering models

[309:10]

we are going to use something called as

[309:12]

so we are going to basically use

[309:14]

something called as

[309:15]

silhouette score I'll show you what silhouette score

[309:19]

is I'm going to just open the Wikipedia

[309:21]

so this is how a silhouette score looks like a

[309:25]

very very amazing topic okay how do we

[309:28]

validate whether my model basically has

[309:32]

perfect three or four clusters perfect

[309:35]

three suppose if I find out my K value

[309:37]

is three how do we find out now see one

[309:40]

more one more issue with K means one

[309:42]

issue with K means which I forgot to

[309:44]

tell you let's say that I have a data

[309:46]

point which looks like this and suppose

[309:49]

I have some data points like this I have

[309:51]

some data points which looks like this

[309:55]

let's say I have like this now in this

[309:58]

one issue will be that suppose I try to

[310:01]

make a cluster over here obviously

[310:03]

you'll be saying my K value will be two

[310:05]

okay in this particular case suppose

[310:07]

this is one cluster this is my another

[310:08]

cluster

[310:10]

right because of my wrong initialization

[310:13]

of the points okay understand because

[310:16]

suppose if I initialize just randomly

[310:18]

some centroids like this then what may

[310:20]

happen is that there is a possibility

[310:22]

that we may also have three clusters

[310:24]

like like like this kind of clusters one

[310:27]

cluster will be here one cluster will be

[310:29]

here one cluster will be here so this

[310:32]

initialization of the centroids one

[310:35]

condition is that it should be very very

[310:37]

far if we initialize our centroids very

[310:41]

very far at that point of time we will

[310:43]

be able to find the centroid exactly in

[310:46]

the center because it will keep on

[310:47]

updating it'll keep on going ahead right

[310:50]

but if we don't initialize that very far

[310:53]

then there will be a situation that

[310:55]

probably if I wanted to get only the

[310:57]

real thing was to get only two centroids

[310:59]

I was probably getting three centroids

[311:01]

right so this is a problem so for this

[311:04]

there is an algorithm which is called as

[311:06]

K means++ and what this K

[311:08]

means++ will do which I will probably show

[311:10]

you in Practical this will make sure

[311:12]

that all the centroids that are

[311:14]

initialized it is very very

[311:16]

far okay all the in centroids that is

[311:19]

basically there it is initialized very

[311:21]

very far we'll see that in practical

[311:23]

application where specifically those

[311:26]

centroids are basically used now let me
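
In scikit-learn this is just the `init` argument of `KMeans`; the blob data below is made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init="k-means++" seeds the centroids far apart from each other,
# which avoids the bad-initialization problem described above where
# two true groups get split into three clusters
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)   # → (3, 2)
```

`init="k-means++"` is the default in scikit-learn, so you usually get this behaviour without asking for it.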

[311:28]

go ahead and let me show

[311:30]

you with respect to silhouette score now

[311:34]

what is the silhouette score I'm going

[311:36]

to explain you in an amazing way this is

[311:38]

important

[311:39]

if someone says you how do we validate

[311:43]

how do we validate cluster

[311:46]

model then at that point of time we

[311:48]

basically use this silhouette score it will be used

[311:51]

in it will be used with respect

[311:55]

to it will be used with respect to K

[311:58]

means it can be used in hierarchical clustering

[312:00]

right if you want to validate how do we

[312:03]

validate okay that is what we are

[312:04]

basically going to see over here now in

[312:08]

silhouette scoring

[312:09]

what are the most important things the

[312:12]

first and the most important thing is

[312:13]

that we will try to find out we will try

[312:16]

to find out a of I we will try to find

[312:19]

out a of I now what is this a of I see

[312:22]

this a of I that you basically see a of I

[312:25]

is nothing but see three major steps

[312:28]

happens in order to validate cluster

[312:30]

model with the help of silhouette first thing

[312:33]

is that I will probably take one cluster

[312:36]

okay there will be one point

[312:39]

which will be my centroid let's say and

[312:42]

then what I'm going to do I'm just going

[312:44]

to whatever points are there inside this

[312:46]

cluster I'm going to compute the

[312:49]

distance between them so I'm going to do

[312:52]

the summation and I'm also going to do

[312:54]

the average of all this distance so here

[312:57]

you can see that when I said distance of

[312:59]

I comma J I basically means this point J

[313:03]

basically means all these points I is

[313:06]

nothing but it is the centroid so here

[313:08]

is nothing but this this is the centroid

[313:09]

let's say that I'm having the centroid

[313:11]

so I'm going to compute all the distance

[313:13]

over here which is mentioned by this and

[313:15]

this value that you see that I'm

[313:17]

actually dividing by C of I minus one in

[313:20]

Short I am actually trying to calculate

[313:22]

the average

[313:24]

distance so this is the first point

[313:26]

where I'm actually computing the a of I

[313:28]

now similarly what I will do is

[313:31]

that what I will do is that the next

[313:34]

point will be that suppose I have

[313:36]

computed a of I the next the next that we

[313:39]

need to compute is b of I now what is b

[313:41]

of I b of I is nothing but there will be

[313:44]

multiple clusters in a k means problem

[313:47]

statement we will try to find out the

[313:50]

nearest cluster okay suppose let's say

[313:52]

that this is the nearest cluster and in

[313:54]

this I have all the variety of points

[313:58]

then b of I basically says that I will

[314:00]

try to compute the distance between each

[314:03]

point and the other point in this

[314:06]

centroid sorry in this cluster so this

[314:08]

is my cluster one this is my cluster two

[314:12]

so what I'm actually going to do is that

[314:14]

here I'm going to compute the distance

[314:16]

between this point to this point then

[314:17]

this point to this point then this point

[314:20]

to this point this point to this point

[314:22]

this point to this point this point to

[314:24]

this point every point I'm actually

[314:26]

going to compute the distance once this

[314:28]

point is done we will go ahead with the

[314:30]

next point and we'll try to compute the

[314:31]

distance and once we get all this

[314:34]

particular distance what we are going to

[314:35]

do we are going to do the average of

[314:37]

them average

[314:39]

now tell me if I try to find out the

[314:42]

relationship between a of I and B of I

[314:45]

if my cluster model is good will a of

[314:50]

I will be greater than b of I or

[314:54]

will b of I be greater than a of I

[314:58]

if I have a good clustering model if I

[315:01]

have a good clustering model will a of I

[315:05]

is greater than b of I will be greater

[315:08]

than b of I or whether B of I will be

[315:10]

greater than a of I out of this if we

[315:13]

have a really good model obviously the

[315:16]

distance between B of I will be greater

[315:19]

than a of I in a good model that

[315:22]

basically means if I talk about silhouette

[315:24]

clustering the values will be between -1

[315:27]

to +1 the more the value is towards +1

[315:32]

that basically means the good the model

[315:34]

is the good the clustering model is the

[315:37]

more the values towards negative one

[315:39]

that basically means this condition is

[315:40]

getting applied now what does this

[315:42]

condition basically say that basically

[315:43]

means that this distance is far than the

[315:46]

cluster distance this is what this

[315:48]

information is getting portrayed and

[315:51]

this is the importance of silhouette

[315:53]

clustering finally when we apply the

[315:55]

formula of silhouette clustering you'll be able

[315:57]

to see that silhouette clustering is nothing

[316:00]

but let me rub this everything guys for

[316:03]

you let me just show you what is silhouette

[316:05]

clustering the silhouette formula will

[316:08]

be something like this this B of I so

[316:11]

here you have silhouette clustering this is

[316:13]

the formula B of I minus a of I Max of a

[316:18]

of I comma B of I if C of I is greater

[316:21]

than one right so by this you will be

[316:24]

getting the value between -1 to + 1 and

[316:28]

more the value is towards + one the more

[316:31]

good your model is more the values

[316:33]

towards minus1 more bad your model is

[316:36]

because if it is towards minus1 that

[316:38]

basically means your a of I is obviously

[316:41]

greater than b of I so this is the

[316:43]

outcome with respect to silhouette clustering

[316:46]

if s is equal to zero that basically

[316:47]

means still your model needs to be uh

[316:50]

per basically the clustering needs to be

[316:52]

improved what is I over here I is

[316:54]

nothing but one data point you you can

[316:56]

just read this guys data point in I in

[316:59]

the cluster C of I so I hope everybody's

[317:01]

understood this now let's go ahead and
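
The a(i)/b(i) recipe just described can be checked against scikit-learn's `silhouette_samples`; the tiny hand-made dataset below is an assumption for illustration:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def s(i):
    own = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
    rest = [j for j in range(len(X)) if labels[j] != labels[i]]
    a = np.mean([np.linalg.norm(X[i] - X[j]) for j in own])   # average intra-cluster distance
    b = np.mean([np.linalg.norm(X[i] - X[j]) for j in rest])  # average distance to the nearest other cluster
    return (b - a) / max(a, b)                                # silhouette formula from the lecture

manual = np.array([s(i) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))  # → True
```

With more than two clusters, b(i) is the minimum of the averages over the other clusters; with only two clusters here the single average suffices.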

[317:03]

let's discuss about the next topic we

[317:05]

have obviously finished up silhouette

[317:07]

clustering over here let's discuss about

[317:09]

something called as DB

[317:11]

scan so for DB scan clustering this is

[317:14]

an amazing clustering algorithm we'll

[317:17]

try to understand how to actually do DB

[317:20]

clustering and probably you'll be able

[317:22]

to understand a lot of things from this

[317:24]

now in DB scan clustering what are the

[317:27]

important things so let's start with

[317:29]

respect to DB scan clustering and let's

[317:32]

understand some of the important points

[317:33]

over here the first point that you

[317:35]

really need to remember is something

[317:37]

called as core points I'll also

[317:39]

talk about when do you say core points

[317:42]

or when do you say other points as such

[317:44]

so the first point that I will probably

[317:46]

discuss about is something called as Min

[317:49]

points the second point that I will

[317:51]

probably discuss about is something

[317:53]

called as core points the third thing

[317:56]

that I will probably discuss about is

[317:57]

something called as border points and

[318:00]

the fourth point that I will definitely

[318:02]

talk about is something called as noise

[318:04]

Point okay guys now tell me in K means

[318:07]

clustering

[318:09]

if I have this kind of groups don't you

[318:11]

think with the help of two different

[318:14]

clusters I may combine this two like

[318:16]

this with the help of two different

[318:18]

clusters I may combine something like

[318:22]

this right but understand over here what

[318:25]

what problem is basically happening with

[318:27]

the second clustering this is actually

[318:30]

an outlier let's say that let's say one

[318:32]

thing very nicely I will put okay let's

[318:35]

say I have one point over here I have

[318:38]

one point over here here so if I do

[318:39]

clustering probably I will get one

[318:41]

cluster

[318:43]

here and I may get another cluster which

[318:45]

is somewhere here now understand one

[318:47]

thing this point is definitely an

[318:50]

outlier even though this is an outlier

[318:53]

with the help of K means what I'm

[318:54]

actually doing I'm actually grouping

[318:56]

this into another group so can we have a

[318:59]

scenario wherein a kind of clustering

[319:01]

algorithm is there where we can leave

[319:03]

the outlier separately and this outlier

[319:06]

in this particular algorithm and this is

[319:08]

B basically uh we will be using DB scan

[319:11]

to leave the outlier out and this point

[319:13]

will be called as a noisy Point noisy

[319:15]

point or I can also say it as an outlier

[319:18]

so this will be a noise point for this

[319:20]

kind of algorithm where you want to skip

[319:22]

the outliers we can definitely use DB

[319:25]

scan that is density based spatial

[319:27]

clustering of applications with noise a

[319:31]

very amazing algorithm and definitely I

[319:33]

have tried using this a lot nowadays I

[319:36]

don't use K means or hierarchical clustering instead

[319:38]

use this kind of algorithm now see this

[319:41]

what are the important things over here

[319:42]

first of all you need to go ahead with

[319:44]

Min points Min points so first thing is

[319:47]

that you need to have Min points this

[319:50]

Min points is a kind of

[319:52]

hyperparameter this basically says what

[319:55]

does hyper parameter says and there is

[319:57]

also a value which is called as

[319:59]

Epsilon which I forgot I will write it

[320:01]

down over here this is called as Epsilon

[320:04]

now what does epsilon mean Epsilon

[320:06]

basically means if I have a point like

[320:08]

this

[320:09]

and if I take Epsilon this is nothing

[320:11]

but the radius of that specific Circle

[320:13]

radius of that specific Circle okay so

[320:16]

Epsilon is nothing but radius over here

[320:19]

in this specific case what does minimum

[320:21]

points is equal to 4 mean let's say that

[320:24]

I have I have taken a point over here

[320:26]

let's say that this is my

[320:28]

point and I have drawn a circle which

[320:31]

looks like this and let's say that this

[320:33]

is my Epsilon

[320:34]

value okay this is my Epsilon value if I

[320:37]

say my Min point point is equal to 4

[320:40]

which is again a hyper

[320:41]

parameter that basically means I can if

[320:45]

I have four at least four points over

[320:47]

here near to this particular Circle

[320:49]

based on this Epsilon value then what

[320:52]

will happen is that this point this red

[320:55]

point will actually become a core

[320:58]

point a core point which is basically

[321:01]

given over here if it has at least that

[321:04]

many number of Min points inside or near

[321:07]

to this particular within this

[321:09]

Epsilon okay within this particular

[321:11]

cluster suppose this is my cluster with

[321:14]

the help of Epsilon I have actually

[321:15]

created it is there a particular unit of

[321:17]

Epsilon or we simply take the unit of

[321:19]

distance no Epsilon value will also get

[321:21]

selected through some way I I'll show

[321:23]

you I'll show you in the practical

[321:24]

application don't worry now the next

[321:26]

thing is that let's say let's say I have

[321:28]

another another point over here let's

[321:30]

say that I have another point over here

[321:32]

and this is my circle with respect to

[321:35]

Epsilon I have created it let's say that

[321:38]

here I have only one

[321:41]

point I have only one point inside this

[321:45]

particular cluster at that point this

[321:48]

point becomes something called as border

[321:52]

Point border Point border point also we

[321:55]

have discussed over here right so border

[321:58]

point is also there so here I'm saying

[322:00]

that at least one at least one if it is

[322:04]

only one it is present then it will

[322:06]

become a border point if it has four

[322:08]

definitely this will become a core Point

[322:10]

core Point like how we have this red

[322:11]

color so and there will be one more

[322:14]

scenario suppose I have this one cluster

[322:16]

let's say this is my Epsilon and suppose

[322:19]

if I don't have any points near this

[322:21]

then this will definitely become my

[322:23]

noise point and this noise point will

[322:26]

nothing be but this will be a

[322:28]

cluster okay so here I have actually

[322:30]

discussed about the noise point also so

[322:33]

I hope everybody is able to understand

[322:34]

the key terms now what is basically
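
A minimal hand-rolled sketch of those three point types, with made-up data and made-up eps / min-points values (note: some definitions count the point itself toward min points; this sketch counts only neighbours):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],  # dense group
              [2.0, 1.5],                                   # reachable from one core point only
              [10, 10]])                                    # isolated point
eps, min_points = 1.5, 4   # hypothetical hyperparameter choices

def neighbours(i):
    # indices of all other points within the eps-radius circle of point i
    return [j for j in range(len(X)) if j != i and np.linalg.norm(X[i] - X[j]) <= eps]

core = {i for i in range(len(X)) if len(neighbours(i)) >= min_points}
border = {i for i in range(len(X)) if i not in core and any(j in core for j in neighbours(i))}
noise = set(range(len(X))) - core - border
print(sorted(core), sorted(border), sorted(noise))  # → [0, 1, 2, 3, 4] [5] [6]
```

The isolated point ends up as noise, exactly the outlier-skipping behaviour described above.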

[322:36]

happening is that whenever we have a

[322:39]

noise Point like in this particular

[322:40]

scenario we have a noise point and we

[322:42]

don't find any points inside this any

[322:45]

core point or border point if you don't

[322:47]

find inside this then it is going to

[322:49]

just get neglected that basically means

[322:52]

this is basically treated as an outlier

[322:55]

I hope everybody is able to understand

[322:57]

here this point will be treated as an

[322:59]

outlier or it can also be treated as a

[323:02]

noise point and this will never be taken

[323:05]

inside a group okay it will never never

[323:08]

be taken inside a group suppose I have

[323:10]

this set of points which you see

[323:12]

basically over here red core and all and

[323:14]

there is also a border Point by making

[323:18]

multiple circles over here here you can

[323:20]

definitely say that how we are defining

[323:22]

core points and the Border points and

[323:24]

this can be combined into a single group

[323:27]

okay this can be combined into a single

[323:29]

group because how the connection is now

[323:31]

see this this yellow line is basically

[323:33]

created by one sorry this yellow point

[323:35]

is basically created by one Epsilon and

[323:37]

we have one One Core point over here

[323:40]

remember over here it should be at least

[323:43]

one core Point okay not one point but

[323:47]

one core point at least if it is having

[323:50]

one core point then it will become a

[323:52]

border point this will become a border

[323:54]

point that basically means yes this can

[323:56]

be the part of this specific group so

[323:59]

what we are doing Whenever there is a

[324:00]

noise we are going to neglect it

[324:02]

wherever there is a border and core

[324:03]

points we are going to combine it so

[324:05]

I'll show you one more diagram which is

[324:06]

an amazing diagram which will help you

[324:09]

understand more in this a k means

[324:10]

clustering and hierarchical clustering now

[324:12]

see this everybody now the right hand

[324:15]

side of diagram that you see is based on

[324:19]

DB scan clustering and the left hand

[324:21]

side is basically your traditional

[324:23]

clustering method let's say that this is

[324:25]

K means which one do you think is better

[324:28]

over here you see this these all

[324:30]

outliers are not combined inside a group

[324:34]

But whichever are nearer as a core point

[324:37]

and the border point separate separate

[324:38]

groups are actually

[324:40]

created right so this is how amazing a

[324:44]

DB scan clustering is a DB scan

[324:47]

clustering is pretty much amazing that

[324:50]

is basically the outcome of this here in

[324:53]

K means clustering you can see this all

[324:54]

these points has also been taken as blue

[324:57]

color as one group because I'll be

[324:58]

considering this as one group but here

[325:00]

we are able to determine this in a

[325:03]

amazing groups so I'm saying you guys

[325:06]

directly use DB scan with without

[325:08]

worrying about anything so now let's
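
With scikit-learn that is just `DBSCAN(eps=..., min_samples=...)`, and noise points come back labelled -1; the blob data and the injected outlier below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=42)
X = np.vstack([X, [[50.0, 50.0]]])   # add one far-away outlier

# eps is the circle radius, min_samples the "min points" hyperparameter
db = DBSCAN(eps=1.0, min_samples=4).fit(X)
print(db.labels_[-1])   # → -1, the outlier is left out as noise
```

Unlike K means, the outlier is never forced into a group; it simply keeps the noise label -1.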

[325:10]

focus on the Practical part uh I'm just

[325:12]

going to give you a GitHub link

[325:14]

everybody download the code guys I've

[325:16]

given you the GitHub link quickly

[325:18]

download and keep your file ready I'm

[325:20]

going to open my anaconda prompt

[325:23]

probably open my jupyter notebook we'll

[325:25]

do one practical problem I've given you

[325:28]

the link guys please open it so this is

[325:30]

what we are going to do today this will

[325:32]

be amazing here you'll be able to see

[325:34]

amazing things how do you come to know

[325:37]

that over fitting or underfitting is

[325:39]

happening you don't know the real value

[325:41]

right so in in clustering there will not

[325:43]

be any underfitting or overfitting so uh

[325:46]

what all things we'll be importing first

[325:48]

is that we'll try K means clustering we'll

[325:50]

do silhouette scoring and then probably we'll

[325:52]

see the output and um and we'll do DB

[325:56]

scan Also let's say DB scan is also

[325:58]

there so uh what are the things we have

[326:00]

basically imported one is the K means

[326:03]

clustering one is the silhouette samples and

[326:05]

silhouette scores these all are present in the

[326:08]

SK learn and it is present in metrics

[326:11]

that basically means we use this

[326:12]

specific parameter to validate

[326:15]

clustering models okay now we'll try to

[326:18]

execute this and apart from that mat

[326:20]

plot lib we are just trying to import

[326:22]

numai we are trying to import and all

[326:24]

here we are executing it perfectly the

[326:26]

next thing is that here the next step is

[326:29]

that generating the sample data from

[326:31]

make underscore blobs first of all we

[326:33]

are just trying to generate some samples

[326:35]

with some two features and we are saying

[326:37]

that okay should have four centroids or

[326:39]

C centroids itself with some features

[326:43]

I'm trying to generate some X and Y data

[326:45]

randomly and this particular data set

[326:47]

will basically be used in performing

[326:50]

clustering algorithms okay forget about

[326:52]

range undor ncore clusters because we

[326:54]

need to try with different different

[326:55]

clusters and try to find out the silhouette

[326:57]

score so right now I just initialized

[326:59]

with 2 3 4 5 6 values it is very simple

[327:02]

so if I go and probably see my X data so

[327:05]

my X data will look something like this

[327:07]

so this is my X data with two features

[327:09]

and this is my Y data with one feature

[327:12]

which is my output which belongs to a

[327:13]

specific class okay so that you can

[327:16]

actually do with the help of make

[327:17]

underscore blobs let's say how to apply
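
The data-generation step being described is roughly the following; the sample count is an assumption:

```python
from sklearn.datasets import make_blobs

# two features, four centres, as in the walkthrough
X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)
print(X.shape, y.shape)   # → (500, 2) (500,)
```

`X` is the two-feature input and `y` holds the class each sample was drawn from, which is only used here to sanity-check the clustering.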

[327:21]

K means clustering algorithm so as I said

[327:23]

that I will be using W CSS W CSS

[327:26]

basically means within cluster sum of

[327:28]

square so I'm going to import K means

[327:30]

over here for I in range 1A 11 that

[327:33]

basically means I'm going to use

[327:35]

different different K values or centroid

[327:37]

values and try to C which is having the

[327:39]

minimal wcss value and I'll try to draw

[327:42]

that graph which I had actually shown

[327:44]

you with respect to Elbow method so here

[327:47]

I will basically be also using K means

[327:50]

number of clusters will be I and

[327:52]

initialization technique I will will be

[327:54]

using K means++ so that the points

[327:57]

the centroids that are initialized those

[327:59]

those points are very very far and then

[328:01]

you have random state is equal to zero

[328:03]

then we do fit and finally we do wcss dot

[328:06]

append kmeans dot inertia okay this dot

[328:10]

inertia will give you the distance

[328:13]

between the centroids and all the other

[328:16]

points and this is what I'm going to

[328:18]

append in this wcss value and finally

[328:20]

I'll just plot it now here you can see

[328:22]

that I'm just plotting it obviously by

[328:25]

seeing this graph this graph looks like

[328:27]

an elbow okay this graph looks like an

[328:29]

elbow so the point that I'm actually

[328:31]

going to consider over here see which is

[328:34]

the last abrupt change so if I talk

[328:36]

about the last abrupt change here I have

[328:38]

the specific value with respect to this

[328:41]

okay I have one specific value with

[328:43]

respect to this this is my abrupt change

[328:45]

from here the changes are normal so I'm

[328:48]

going to basically select K is equal to

[328:50]

4 now what I'm actually going to do with

[328:52]

the help of silhouette with the help of silhouette

[328:57]

score we are going to compare whether K

[329:00]

is equal to 4 is valid or not so that is

[329:03]

what we are going to do valid or not so

[329:06]

here we are going to do this now let's

[329:09]

go ahead and let's try to see it how we

[329:11]

are going to do it so here you can see n

[329:13]

clusters is equal to 4 then I'm actually

[329:15]

able to find out the prediction and this

[329:17]

is specifically my output okay this is

[329:19]

done now see this code okay this code is

[329:23]

a huge code I have actually taken this

[329:24]

code directly from the SK learn page of

[329:28]

silhouette if you go and see this this code is

[329:30]

directly given over there but I'm just

[329:33]

going to talk about like what are the

[329:35]

important things we need to see over

[329:37]

here with respect to different different

[329:39]

clusters see see this clusters 2 3 4 5 6

[329:43]

I'm going to basically compare whether

[329:46]

the K value should be four or not with

[329:48]

the help of solid scoring so let's go

[329:51]

here and here you can see that I'm

[329:54]

applying this one first I will go with

[329:56]

respect to for Loop for ncore clusters

[329:59]

in range underscore clusters different

[330:00]

different cluster values are there first

[330:02]

we'll start with two so here you can see

[330:04]

initialize the cluster with and cluster

[330:06]

value and a random generator seed of 10

[330:09]

for reproducibility so ncore clusters

[330:12]

first I take took it as two and then I

[330:14]

did fit predict on X after I did fit

[330:17]

predictor on X I'm using this score on X

[330:21]

comma cluster label now what this is

[330:23]

going to do understand in Solo what did

[330:25]

we discuss it will it will try to find

[330:27]

out all the Clusters the Clusters over

[330:30]

here like this and it'll try to

[330:32]

calculate the distance between them

[330:34]

which is the a of I then it'll try to

[330:36]

compute the B of I then finally it'll

[330:39]

try to compute the score and if the

[330:41]

value is between minus1 to +1 the more

[330:43]

the Valu is towards + one the more

[330:45]

better it is right so these all things

[330:47]

we have already discussed and that is

[330:49]

what this specific function will do and

[330:51]

this will give my silhouette average value

[330:53]

over here silhouette value will be over here

[330:55]

okay this we have done and then we can

[330:58]

continuously do it for another another

[331:00]

things you can actually find it over

[331:02]

here and this value that you see this

[331:05]

code that you see is nothing nothing so

[331:08]

complex okay this is just to display the

[331:11]

data properly in the form of graphs okay

[331:15]

in the form of graphs so again I'm

[331:17]

telling you I did not write this code

[331:18]

I've directly taken it from the uh SK

[331:22]

learn page of silhouette okay so just try to

[331:25]

see this particular uh plotting diagrams

[331:27]

and all that you can definitely figure

[331:29]

out but let's see I will try to execute

[331:31]

it and try to find out the output now

[331:33]

see for ncore cluster is equal to 2 the

[331:37]

average silhouette score is 0.70 I told you the

[331:40]

value will be between -1 to +1 and I'm

[331:43]

actually getting 0.704 which is very very

[331:46]

good and then for ncore cluster is equal

[331:48]

to 3 it is 0.588 then ncore cluster is equal to

[331:52]

4 I'm getting 0.65 which is pretty much

[331:54]

amazing and then for ncore cluster equal

[331:57]

to 5 the average score is 0.563 and ncore

[332:00]

cluster is equal to 6 you are saying

[332:02]

0.45 here directly you can actually say

[332:05]

that fine for _ cluster equal to 2 I'm

[332:08]

getting an amazing score of

[332:10]

0.704 obviously you're getting the

[332:12]

highest value over this so should we

[332:14]

select ncore cluster isal to two Okay we

[332:17]

should not directly conclude from it

[332:19]

because here we need to also see that

[332:21]

any feature value or any cluster value

[332:24]

is also coming as negative value that

[332:26]

also we need to check so here we will go

[332:28]

down over here you will see the first

[332:30]

one over here with respect to the first

[332:32]

one you see that I'm get getting the

[332:35]

value from 0 to 1 it is not going going

[332:38]

to negative values so definitely two clusters

[332:40]

was able to solve the problem so I'll

[332:43]

keep it like this with me I definitely

[332:45]

have a chance that this may this may

[332:48]

perform well I may have a chance that

[332:50]

this K uh K is equal to 2 May perform

[332:53]

well okay so I may have a chance let's

[332:55]

see to the next one to the next one over

[332:57]

here you can see that for one of the

[332:59]

cluster the value is negative if the

[333:01]

value is negative that basically means

[333:03]

the a of I is obviously greater than b of I

[333:06]

so I'm not going to prer this because it

[333:08]

is having some negative values even

[333:10]

though my cluster looks better but again

[333:13]

understand what is the problem with

[333:14]

respect to this cluster is that if I

[333:17]

take this cluster and probably compute

[333:19]

the distance between this point to this

[333:20]

point and if I probably compute from

[333:22]

this point to this point or this point

[333:24]

to this point this point is obviously

[333:26]

nearer to this right it is obviously

[333:29]

nearer to this so that is the reason why

[333:31]

I'm getting a negative value over here

[333:33]

okay negative value over here this is my

[333:36]

uh output my score this point that you

[333:40]

see dotted points this is my score 58

[333:43]

what whatever it is this is basically my

[333:45]

score so obviously this basically

[333:46]

indicates that this point is near the

[333:48]

other cluster point is nearer to this so

[333:50]

I'm actually getting a negative value

[333:52]

right so this you really need to

[333:54]

understand okay now similarly if I go

[333:56]

with respect to n_clusters equal

[333:58]

to 4 this looks good because here I

[334:00]

don't have any negative value and here

[334:03]

you can see how cleanly it has basically

[334:06]

divided the points amazingly with the

[334:08]

help of k equal to 4 right and similarly

[334:11]

if I go with five obviously you can see

[334:13]

some negative values are here some

[334:15]

dotted line negative value are there

[334:17]

with respect to six you also have some

[334:18]

negative values so definitely I'll not

[334:21]

go with six I may either go with four or

[334:23]

I may either go with two now whenever

[334:26]

you have this options always take a

[334:27]

bigger number instead of two take four

[334:30]

because four is greater than two because

[334:32]

it will be able to create a generalized

[334:34]

model so from this I'm actually going to

[334:37]

take n equal to 4 K is equal to 4

[334:39]

now should we compare this with the

[334:41]

elbow method here also I got four right

[334:44]

so both are actually matching so this

[334:47]

indicates that with the help of this

[334:49]

clustering this silhouette score we can

[334:51]

definitely come to a conclusion and

[334:53]

validate our clustering model in an

[334:55]

amazing way so I hope everybody is able

[334:57]

to understand and this way you basically

[335:00]

validate a model and definitely you can

[335:03]

try it out you can understand this code

[335:04]

definitely I but till here you have

[335:06]

understood that here I'm going to get

[335:08]

the average value then for n_clusters

[335:12]

whatever cluster this is matching it is

[335:14]

just mapping over there and it is

[335:16]

basically giving so this was the session
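The per-point silhouette logic described here, s(i) = (b(i) - a(i)) / max(a(i), b(i)), negative whenever a(i) > b(i), can be sketched in plain Python. The lecture's actual plots come from sklearn's silhouette_score / silhouette_samples, so this is only an illustrative re-implementation on made-up points:

```python
from math import dist  # Python 3.8+

def silhouette_values(points, labels):
    """Per-point silhouette s(i) = (b_i - a_i) / max(a_i, b_i)."""
    scores = []
    clusters = set(labels)
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a_i = sum(same) / len(same)  # mean intra-cluster distance
        b_i = min(                   # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c) /
            labels.count(c)
            for c in clusters if c != labels[i])
        scores.append((b_i - a_i) / max(a_i, b_i))
    return scores

# Two well-separated clusters -> all scores close to 1; a point assigned
# to the wrong cluster would come out negative because a_i > b_i.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_values(pts, [0, 0, 1, 1]))
```

As in the lecture, a negative value for some points means those points sit nearer to another cluster, so that choice of K should not be preferred even if the average score looks high.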

[335:19]

and uh yes in today's session we

[335:22]

efficiently covered many topics we

[335:24]

covered k-means hierarchical clustering silhouette

[335:27]

score DBSCAN clustering in tomorrow's

[335:29]

session the topics that are probably

[335:31]

pending is first I'll start with svm and

[335:34]

svr second I will go ahead with XG boost

[335:37]

and third I will cover up PCA let's

[335:40]

see whether I'll be able to complete

[335:41]

this session uh one amazing thing

[335:45]

that I want to teach you guys because

[335:46]

many people ask me the definition of

[335:48]

bias and variance so guys uh many people

[335:52]

get confused when we talk about bias and

[335:55]

variance you know because let's say that

[335:57]

uh I have a model for the training data

[335:59]

set it gives us somewhere around 90%

[336:02]

accuracy let's say I'm getting a 90%

[336:04]

accuracy for the test data I may

[336:07]

probably getting somewhere around 70%

[336:10]

accuracy now tell me which scenario is

[336:12]

basically this most of the people will

[336:14]

be saying that okay fine it is

[336:16]

overfitting now when I say overfitting I

[336:19]

basically mention overfitting by low

[336:23]

bias and high

[336:25]

variance right so many people get

[336:28]

confused Krish tell me just the exact

[336:30]

definition of bias and variance low bias

[336:33]

obviously you are saying that because

[336:34]

the training is performed like the model

[336:37]

is performing well with the help of

[336:39]

training data set but with respect to

[336:41]

the test data set the model is not

[336:43]

performing well with respect to training

[336:45]

data set why do we always say bias and

[336:48]

with respect to test data set why do we

[336:50]

always say variance so for this you need

[336:52]

to understand the definition of bias so

[336:54]

let me write down the definition of bias

[336:56]

over here so here I can definitely write

[336:59]

that bias it is a

[337:02]

phenomena that

[337:05]

skews the

[337:08]

result of an

[337:13]

algorithm in

[337:15]

favor in favor or against an

[337:20]

idea against an idea I'll make you

[337:23]

understand the definition uh um but

[337:27]

understand the understand understand

[337:28]

what I have actually written over here

[337:30]

it is a phenomena that skewes the result

[337:32]

of an algorithm in favor or against an

[337:34]

idea whenever I say this specific idea

[337:37]

this idea I will just talk about the

[337:39]

training data set initially now when we

[337:42]

train a specific model suppose if I have

[337:44]

this specific model over

[337:46]

here and I'm training with this specific

[337:49]

training data set so this is my training

[337:52]

data set now based on the definition

[337:54]

what does it basically say it is a

[337:55]

phenomenon that skews the result of an

[337:57]

algorithm in favor or against an idea or

[338:00]

a this specific training data set so

[338:03]

even though I'm training this particular

[338:04]

model with this training data set

[338:07]

with this data set it may it may be in

[338:11]

favor of that or it may be against of

[338:12]

that that basically means it may perform

[338:14]

well it may not perform well if it is

[338:15]

not performing well that basically means

[338:17]

the accuracy is down if the accuracy is

[338:19]

better at that point of time what will

[338:21]

say see if the accuracy is better that

[338:23]

time what we'll say we we'll come up

[338:25]

with two terms from here obviously you

[338:27]

understand okay there are two scenarios

[338:28]

of bias now here if it is in favor that

[338:32]

basically means it is performing well

[338:33]

with respect to the training data set I

[338:35]

will basically say that it has low bias

[338:38]

if it is not able to perform well with

[338:40]

the training data set then here I will

[338:42]

say it as high

[338:44]

bias I hope everybody is able to

[338:46]

understand in this specific thing

[338:47]

because many people have this

[338:49]

kind of confusion now similarly if

[338:51]

I talk about variance let's say about

[338:53]

variance because you need to understand

[338:55]

the definition a definition is very much

[338:59]

important okay if I if I just talk about

[339:01]

the definition of variance I'm just

[339:03]

going to refer like this the variance

[339:07]

refers to the changes in the model when

[339:13]

using when using different

[339:16]

portion of the

[339:20]

training or test

[339:23]

data now let's understand this

[339:25]

particular

[339:27]

definition variance refers to the

[339:29]

changes in the model when using

[339:31]

different proportion of the test

[339:32]

training data or test data we obviously

[339:34]

know that whenever initially if I have a

[339:38]

model understand from the definition

[339:39]

everything will make sense I am

[339:41]

basically training initially with the

[339:43]

training

[339:44]

data okay because we divide our data set

[339:47]

see our data set whenever we are working

[339:49]

with we divide this into two parts one

[339:52]

is our train data and test data okay

[339:56]

because this is a tra test data is a

[339:58]

part of that particular data set right

[340:00]

and suppose in this particular training

[340:02]

data it gets trained and performs well

[340:04]

here I'm actually talking about bias but

[340:07]

when we come with respect to the

[340:09]

prediction of the specific model at that

[340:12]

point of time I can use other training

[340:14]

data that basically means that training

[340:15]

data may not be similar or I can also

[340:18]

use test data now in this test data what

[340:21]

we do we do some kind of predictions

[340:23]

these are my predictions and in this

[340:25]

prediction again I may get two

[340:27]

scenario I may get two scenario which is

[340:30]

basically mentioned by variance it

[340:32]

refers to the changes in the model when

[340:34]

using when using different portion of

[340:37]

the training or test data refers to the

[340:40]

changes basically means whether it is

[340:42]

able to give a good prediction or wrong

[340:44]

predictions that's it so in this

[340:46]

particular scenario if it gives a good

[340:48]

prediction I may definitely say it as

[340:50]

low variance that basically means the

[340:53]

accuracy with the accuracy with respect

[340:55]

to the test data is also very good if I

[340:58]

probably get a bad if I probably get a

[341:01]

bad accuracy at that time I basically

[341:04]

say it as high variance so if I talk

[341:06]

about three scenarios over here let's

[341:08]

say this is my model one and this is my

[341:11]

model

[341:12]

two and this is my model

[341:15]

three now in this scenario let's

[341:18]

consider that my model one has the

[341:21]

training

[341:23]

accuracy of 90% and test accuracy of

[341:30]

75% similarly I have here as my train

[341:33]

accuracy of 60% and my test accuracy

[341:38]

of

[341:40]

55% now similarly if I have my train

[341:44]

accuracy of 90% And my test accuracy of

[341:49]

92% now tell me what what things you

[341:51]

will be getting here obviously you can

[341:54]

directly say that fine your training

[341:56]

accuracy is better now you're talking

[341:58]

about bias so this basically indicates

[342:00]

that this has low

[342:02]

bias and since your test accuracy is bad

[342:07]

because it is when compared to the train

[342:08]

accuracy it is less so here you are

[342:10]

basically going to say high

[342:14]

variance understand with respect to the

[342:16]

definition similarly over here what

[342:18]

you'll say high

[342:20]

bias High variance because obviously it

[342:22]

is not performing

[342:25]

well this is another scenario last the

[342:28]

last scenario is that this is the

[342:30]

scenario that we want because it is low

[342:33]

bias and low variance
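The three scenarios just worked through can be summarized with a tiny helper. The thresholds below are illustrative assumptions for this example, not fixed rules from the lecture:

```python
def diagnose(train_acc, test_acc, good=0.80, gap=0.04):
    # Bias reflects how well the model performs on the training data;
    # variance reflects how much performance changes on test data.
    bias = "low bias" if train_acc >= good else "high bias"
    variance = ("low variance" if abs(train_acc - test_acc) <= gap
                else "high variance")
    return bias, variance

print(diagnose(0.90, 0.75))  # model one  -> overfitting
print(diagnose(0.60, 0.55))  # model two  -> underfitting
print(diagnose(0.90, 0.92))  # model three -> generalized model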

[342:37]

okay many many people have basically

[342:39]

asked me the definition with respect to

[342:41]

bias and variance and here I've actually

[342:43]

discussed and this indicates this gives

[342:45]

me a generalized model and this is what

[342:50]

is our aim when we are working as a data

[342:53]

scientist so I hope you have understood

[342:55]

the basic difference between bias and

[342:57]

variance and I was able to give you lot

[343:00]

of examples lot of understanding with

[343:02]

respect to this so I hope you have

[343:05]

actually got this particular uh

[343:08]

understanding of this uh two terms which

[343:10]

we specifically talk about high bias low

[343:12]

bias High variance low variance right so

[343:16]

this was it from my side guys uh and uh

[343:19]

I hope you have understood

[343:22]

this

[343:29]

okay so let's take let's consider a data

[343:34]

set credit

[343:37]

and let's say this is a

[343:39]

approval so we are going to take this

[343:42]

sample data set and understand how does

[343:43]

XG boost work suppose salary is less

[343:47]

than or equal to 50 and the credit is

[343:50]

bad so approval the loan approval will

[343:52]

be zero that basically means he or

[343:54]

she will not get if it is less than or

[343:56]

equal to 50 if the credit score is good

[344:00]

then probably approval will be one if it

[344:02]

is less than or equal to 50 if it is

[344:06]

good

[344:07]

again then it is going to get one if it

[344:10]

is greater than

[344:12]

50 and if it is bad then obviously

[344:16]

approval will be

[344:19]

zero if it is greater than

[344:22]

50 if it is good we are going to get it

[344:25]

as one if it is greater than

[344:29]

50k and probably if it is normal then

[344:33]

also we are going to get

[344:35]

it so this is this is my data set so how

[344:38]

does XG boost classifier work understand

[344:41]

the full form of XG boost is

[344:44]

Extreme gradient

[344:47]

boosting extreme gradient boosting so we

[344:50]

will basically understand about extreme

[344:52]

gradient boosting now extreme gradient

[344:55]

boosting uh will be actually used to

[344:58]

solve both classification and the

[345:00]

regression problem statement so first of

[345:02]

all let's understand how it is basically

[345:05]

XG boost basically works how it actually if

[345:08]

you just talk about XG boost you

[345:10]

understand that it is a boosting

[345:11]

technique and internally it tries to use

[345:13]

decision tree so how does this decision

[345:16]

tree is basically getting constructed in

[345:18]

the case of XG boost and how it is

[345:20]

basically solved we are going to discuss

[345:21]

about it so whenever we start XG boost

[345:24]

classifier understand that first of all

[345:26]

we create a specific base model suppose

[345:29]

if I say this is my base model and this

[345:32]

base model will be a weak learner okay

[345:36]

and this base model will always give an

[345:38]

output of probability of 0.5 in the case

[345:42]

of classification problem so suppose if

[345:45]

I say this is probability 0.5 then I

[345:48]

will try to create a field over here

[345:50]

this field is called as residual field

[345:53]

so first base model what I'm going to do

[345:55]

any data set that you give from here to

[345:57]

train it will always give you the output

[345:59]

as 0.5 so this is just a dummy base

[346:02]

model now tell me if my probability

[346:06]

output is 0.5 if I want to calculate

[346:08]

the residual that basically means I need

[346:10]

to subtract approval minus this

[346:12]

particular value so what will be the

[346:16]

value over here 0 -.5 will be

[346:20]

-.5, 1 - .5 will be .5, 1 - .5 will

[346:25]

be .5 and 0 - .5 will be -.5 and this 1 - .5

[346:31]

will

[346:32]

be uh 0.5 and this will also be 0.5

[346:36]

let's consider that I have one more

[346:37]

record uh and this specific record can

[346:40]

be anything uh because I want to keep

[346:43]

some more records over here so let's

[346:45]

consider that I have one more record

[346:46]

which is less than or equal to 50K and

[346:49]

if the credit score is normal you're

[346:51]

going to get zero so here also if I try

[346:53]

to find out the residual it will be

[346:55]

minus 0.5 now the first step I hope
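The residual column just filled in is easy to reproduce. A minimal sketch, assuming the approval labels follow the example rows in the order given above:

```python
# The XG boost base model is a dummy that outputs p = 0.5 for every row;
# the residual for each row is the observed label minus that prediction.
approvals = [0, 1, 1, 0, 1, 1, 0]  # approval column from the example rows
base_p = 0.5
residuals = [y - base_p for y in approvals]
print(residuals)  # [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
```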

[346:58]

everybody's understood we have to create

[346:59]

a base model okay this base model is

[347:01]

very much important because we have to

[347:04]

create all the decision Tree in a

[347:06]

sequential manner so the first

[347:09]

sequential base tree which is again this

[347:11]

is also a decision tree kind of thing

[347:12]

you can consider but this is a base

[347:15]

model which takes any inputs and gives

[347:17]

by default the probability as 0.5 now

[347:20]

let's go ahead and understand what are

[347:22]

the steps in constructing decision tree

[347:24]

after creating the base model the first

[347:27]

step is that create uh binary decision

[347:31]

tree so I'm going to write it down all

[347:34]

the steps please make sure that you note

[347:35]

it down so create a binary tree

[347:39]

binary decision tree using the features

[347:43]

the second step we basically

[347:47]

say okay what we do is we

[347:50]

actually calculate the similarity weight

[347:54]

we calculate the similarity weight I'll

[347:57]

talk about this similarity weight what

[347:59]

exactly it is if I want to use this a

[348:02]

formula it is summation of residual

[348:05]

Square

[348:07]

divided

[348:08]

by summation of probability 1 minus

[348:13]

probability plus Lambda I'll talk about

[348:16]

this what is exactly Lambda it is the

[348:18]

kind of hyperparameter again so that it

[348:20]

does not overfit the third thing is that

[348:23]

we calculate the Information Gain okay

[348:26]

Information Gain so these are the steps

[348:28]

we basically use in constructing or in

[348:32]

solving uh in creating an XG boost

[348:34]

classifier the first step is that we

[348:36]

create a binary decision tree using the

[348:38]

feature then we go ahead with

[348:40]

calculating the similarity weight and

[348:42]

finally we go ahead and calculate the

[348:43]

information gain so how does it go ahead

[348:46]

let's understand over here and let's try

[348:47]

to find out okay now let's go ahead and

[348:50]

let's try to construct the decision tree

[348:53]

as I said that let's consider that I'm

[348:55]

considering salary feature So based on

[348:58]

using salary feature what I'm actually

[348:59]

going to do I am going to take this as

[349:02]

my node and I'm going to split this up

[349:05]

and remember whenever we are creating

[349:07]

decision Tree in this particular case it

[349:09]

will be a binary decision tree let's say

[349:13]

that in salary one is less than or equal

[349:15]

to 50 one is greater than 50 so this two

[349:18]

you obviously have in the case of binary

[349:20]

in case of credit where there are three

[349:22]

categories I'll also show you how that

[349:25]

further split will happen and how that

[349:27]

will get converted into a binary tree so

[349:29]

here you have less than or equal to 50K

[349:32]

and greater than 50k now let's go ahead

[349:35]

and understand how many values are there

[349:37]

in this salary so if I see before the

[349:40]

split you can definitely see that I'm

[349:42]

going to use this residual and probably

[349:45]

train this entire model now if I really

[349:48]

wanted to find out the residual

[349:49]

initially these are my residuals over

[349:51]

here so one residual is -.5 then I have 0.5

[349:56]

over here then I have .5 then again I

[349:59]

have -.5 then again I have 0.5 then

[350:03]

again I have 0.5 and finally I have

[350:06]

minus .5 so these are my total residuals

[350:09]

that are there suppose if I make this

[350:11]

split less than or equal to 50 First

[350:14]

less than or equal to 50 the residuals

[350:16]

what are things are there so here I'm

[350:18]

going to have minus .5 then less than or

[350:21]

equal to 50 again I'm going to have .5

[350:23]

then again less than or equal to 50 I'm

[350:25]

going to have 0.5 and less than or equal

[350:27]

to again one more 0.5 is there I'm just

[350:30]

going to remove this the last one which is

[350:33]

nothing but minus .5 so I hope you

[350:35]

understood this split so half of the

[350:37]

things came over here the remaining half

[350:40]

will be greater than or equal to greater

[350:41]

than 50 so you have one value here one

[350:44]

value here one value here so it will be

[350:46]

minus .5 then you have 0.5 and then

[350:50]

finally you have 0.5 residuals how do we

[350:53]

get it guys see from the base model

[350:55]

which is by default giving 0.5 first my

[350:58]

data goes over here by default

[351:00]

probability I'm going to get 0.5 so

[351:02]

residual is basically calculated from

[351:04]

this probability and approval so this

[351:07]

probability minus approval so if you

[351:09]

subtract 0 -.5 sorry I'm just going to

[351:12]

rub this so if you subtract 0 -.5 you're

[351:16]

going to get -.5 1 -.5 you're going to

[351:19]

get .5 1 -.5 you're going to get .5 so

[351:22]

everybody I hope is very much clear with

[351:24]

respect to this so this is the first

[351:26]

step we constructed a binary tree now in

[351:28]

the second step it says calculate the

[351:30]

similarity weight now how to calculate

[351:33]

the similarity weight similarity weight

[351:35]

formula is sum of residual Square now

[351:37]

what is residual Square let's say that

[351:39]

I'm going to calculate the

[351:43]

I'm going to calculate for this okay

[351:45]

similarity weight now in this particular

[351:47]

case if I go and calculate my similarity

[351:49]

weight it will be summation of residual

[351:52]

Square this is my residual values this

[351:55]

is my residual value so I'm going to do

[351:57]

the summation of this Square okay this

[352:01]

value square you can see over here sum

[352:03]

of residual Square everybody you can see

[352:06]

sum of residual squares so what do

[352:08]

you think sum of residual squares will

[352:09]

be in this particular case how I have to

[352:12]

do it I will just take up this all

[352:14]

values like

[352:16]

-.5

[352:17]

+ .5

[352:20]

+ .5 and

[352:22]

-.5 whole square right I'm just going to

[352:24]

do the squaring of this divided by

[352:27]

understand what it is divided by it is

[352:29]

divided by probability of 1 minus

[352:31]

probability now where do we get this

[352:33]

probability value where do we get this

[352:35]

probability value value we get this

[352:37]

probability value from our base model

[352:40]

right so here I'm basically going to say

[352:42]

that we are going to do the summation of

[352:44]

probability of 1 minus probability 1

[352:47]

minus probability that basically means

[352:50]

for each and every point for each and

[352:52]

every Point what is the probability see

[352:54]

probability is basically coming from the

[352:56]

base model so for each Pro each point

[352:59]

I'm going to come compute two things one

[353:01]

is the probability and then 1 minus

[353:04]

probability and this I'm going to do the

[353:06]

summ

[353:07]

like this I will do it four times 1 -.5

[353:10]

then .5 * 1 -.5 and finally you'll be

[353:15]

able to see one more will be there which

[353:17]

is

[353:18]

.5 * (1 - .5) so this will be your total

[353:21]

things with respect to this so I hope

[353:24]

you have understood till here uh where

[353:26]

you are able to understand that what we

[353:28]

have done this is summation of uh

[353:31]

residual square and this is the

[353:33]

remaining probability multiplied by 1

[353:35]

minus probability now tell me what are

[353:39]

you able to find out from this if you

[353:41]

cancel this and this this and this this

[353:44]

value is going to become zero so this

[353:47]

entire value is going to become zero

[353:48]

because 0 divided by anything is zero so

[353:51]

here I hope everybody is understood what

[353:53]

is the similarity weight of this

[353:55]

specific node if I want to write it is

[353:57]

nothing but zero now you may be

[353:59]

considering where is Lambda

[354:01]

value okay we will initially initialize

[354:04]

Lambda by 1 I'll talk about this hyper

[354:05]

parameter let's consider it as 1 so here

[354:09]

+ 1 or plus 0 let's let's consider

[354:12]

Lambda value 0 let's say for right now

[354:14]

okay I'm just going to make it Lambda is

[354:16]

equal to0 I'm just going to talk about

[354:19]

it because it is a kind of hyper

[354:21]

parameter so -.5 -.5 + .5 + .5 if I do the

[354:28]

summation if I do the summation here you

[354:31]

will be able to see that I'm going to

[354:32]

get zero so this calculation we have

[354:34]

done and we have got the similarity

[354:36]

weight is equal to zero and let's go ahead

[354:39]

and calculate the similarity

[354:40]

weight of the next node no no no it's

[354:43]

not first square it is whole square so

[354:46]

here also if I do so it is -.5 + .5 now let's

[354:51]

do it for this if I want to find out the

[354:53]

similarity weight again see I'm going to

[354:55]

repeat it (-.5 + .5 + .5) whole square and since

[355:00]

there are three points so I'm going to

[355:01]

basically use probability 1 minus

[355:04]

probability for one point then plus

[355:08]

probability 1 minus probability second

[355:11]

point and then probability and 1 minus

[355:14]

probability for the third point and

[355:16]

Lambda is zero so I'm not going to write

[355:18]

anything now go let's go and do the

[355:20]

calculation for this node so -.5 + .5 it

[355:24]

becomes zero then .5 whole square right

[355:27]

so here I'm going to get 0.25 here if

[355:30]

you do the calculation here you are

[355:31]

going to get .75 so this value is going

[355:34]

to be 1 by 3 which is nothing but .33 so

[355:37]

the similarity weight for this node for

[355:40]

this node

[355:42]

is .33 so here you can see probability of

[355:45]

multiplied by 1 minus

[355:47]

probability okay now the next step that

[355:50]

we do is that calculate the information

[355:53]

gain now you know how to calculate the

[355:55]

information gain but before that let's

[355:57]

do the computation for this also for

[355:59]

this root node also go ahead and

[356:01]

calculate the similarity weight of

[356:04]

this okay they

[356:06]

why the base model probability is .5

[356:09]

because it is just understand that it is

[356:11]

a dummy model I have just put an if

[356:14]

condition there saying that it is going

[356:15]

to give 0.5 now do it for this one guys

[356:17]

root node what it will be see I can

[356:20]

calculate from here only minus .5 gone

[356:23]

this is also gone this is also gone this

[356:25]

will be .25 divided by something now

[356:29]

tell me guys what should be for the root

[356:32]

node what is the similarity similarity

[356:34]

weight what is the similarity weight for

[356:36]

for this do this calculation everyone up

[356:39]

one I know it will be .25 divided by

[356:44]

this will be 1.75 are you getting this

[356:48]

similarity weight which will be nothing

[356:50]

but 1 by 7 and if I divide 1 by 7 if I

[356:54]

say what is 1 by 7 it

[356:57]

is .1428 so it is nothing but .14 if I want

[357:00]

to calculate the root node similarity

[357:02]

weight over here

[357:05]

is .14 so I know 0.14 here 0 here .33 now

[357:09]

see over here we calculate the

[357:11]

Information Gain Next Step the third

[357:13]

step what we do is that we calculate the

[357:15]

information gain now Information Gain is

[357:19]

nothing but in this particular case the

[357:21]

root node similarity weight we'll try to

[357:24]

add up so I will be getting

[357:27]

0.33 minus this particular Top Root node

[357:31]

whatever split has happened that

[357:33]

similarity weight I'll take 0 + .33

[357:36]

minus .14 so point

[357:39]

minus .14 and if I do it it is nothing but

[357:42]

just open your calculator again and

[357:46]

.33

[357:48]

minus .14 so it is nothing but .19 I'm getting

[357:52]

.19 as my information gain the

[357:56]

information gain of this specific tree I

[357:59]

got it

[358:00]

as .19 obviously you know how the features
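The similarity-weight and gain arithmetic just walked through can be verified with a short sketch, assuming the simplified setup used on the board: every probability fixed at 0.5 from the base model and lambda set to 0:

```python
def similarity_weight(residuals, p=0.5, lam=0.0):
    # XG boost similarity weight of a node:
    # (sum of residuals)^2 / (sum of p*(1-p) over the points + lambda)
    return sum(residuals) ** 2 / (len(residuals) * p * (1 - p) + lam)

root  = [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]  # all seven residuals
left  = [-0.5, 0.5, 0.5, -0.5]                  # salary <= 50K branch
right = [-0.5, 0.5, 0.5]                        # salary > 50K branch

gain = similarity_weight(left) + similarity_weight(right) - similarity_weight(root)
print(round(similarity_weight(root), 2),   # 0.14
      round(similarity_weight(left), 2),   # 0.0
      round(similarity_weight(right), 2),  # 0.33
      round(gain, 2))                      # 0.19
```

This reproduces the lecture's numbers: the root node gives 0.25/1.75 = .14, the left leaf cancels to 0, the right leaf gives 0.25/0.75 = .33, and the information gain of the split is .19.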

[358:03]

will get selected based on the

[358:06]

Information Gain but let's say that the

[358:08]

highest Information Gain that is given

[358:10]

by salary okay now we will go ahead and

[358:13]

do the further split let's go ahead and

[358:16]

do the further split so I know my

[358:18]

information gain now it is .19 and

[358:20]

Information Gain is basically used to

[358:23]

select that specific node through which

[358:26]

the split will happen now I'll further

[358:27]

go and do the split let's say that I'm

[358:29]

going to do the further split with the

[358:31]

next feature that is which one credit so

[358:33]

I'm going to take credit over here I'm

[358:36]

going to take credit over here and again

[358:39]

I have to do a binary split again but

[358:42]

you may be asking Krish here are

[358:43]

only three categories how we are going

[358:45]

to basically do this particular split

[358:48]

right because we don't know how to do

[358:50]

the split because we have three

[358:51]

categories over here so in this case

[358:53]

what I will do is that we what we can

[358:56]

definitely do is that in this particular

[358:58]

case the split that we are probably

[359:00]

going to do is that let's consider two

[359:02]

categories like good and normal at one

[359:04]

side bad at one side so here it becomes

[359:06]

a binary split again now let's go ahead

[359:09]

and let's try to see that how many data

[359:11]

points will fall here and how many data

[359:12]

points will fall here so for writing

[359:14]

down the data points let's say if it is

[359:17]

less than or see go to the path if it is

[359:19]

less than or equal to 50 it'll go this

[359:21]

path and if it is B then we are probably

[359:24]

going to get how much is the residual we

[359:26]

are going to get one residual over here

[359:28]

first of all so this is my one residual

[359:31]

that is -.5 then similarly if I see less

[359:34]

than or equal to 50 good is there right

[359:37]

good or normal is there so here again

[359:39]

0.5 will come I hope everybody is able

[359:42]

to understand see the second record less

[359:44]

than or equal to 50 we go in this path

[359:45]

but it is good we come over here again

[359:48]

less than or equal to 50 good again we

[359:50]

are going to get 1

[359:51]

more .5 then go with respect to greater

[359:55]

than or equal to 50 which is coming over

[359:57]

here we'll not worry about it right now

[359:59]

again less than or equal to 50 normal

[360:01]

again it is

[360:03]

-.5 right so this many records

[360:06]

definitely coming over here only one

[360:08]

record is basically coming over here

[360:10]

then again we will start the same

[360:12]

process again we will start the same

[360:14]

process now for the same process what we

[360:16]

are going to do again try to calculate

[360:18]

the similarity weight now in order to

[360:20]

calculate the similarity weight what I

[360:22]

will do I will basically say this is my

[360:24]

similarity weight this will become .25

[360:28]

divided by .25 why because this whole

[360:31]

square right this whole Square residual

[360:33]

square right summation of residual

[360:36]

square but here I have only one residual

[360:38]

so this Square it will become and then

[360:40]

what I'm actually going to do I'm going

[360:41]

to basically write .5 * (1 - .5) this is

[360:45]

nothing for only for one data point so

[360:47]

this is nothing but .5 * .5 which is

[360:50]

nothing but 0.25 right now in this

[360:53]

particular case I will get similarity

[360:54]

weight as I hope everybody I'm getting

[360:56]

it as one now what about this similarity

[360:58]

weight if you want to compute it is

[361:00]

again very very simple this and this

[361:02]

will get cancelled then again it will be

[361:03]

.25 divided by um if I say one like this

[361:08]

.25 then again it will be .75 then this

[361:11]

will also be 1 by 3 that is nothing but

[361:13]

.33 so similarity weight will

[361:16]

be .33 then again I have to calculate the

[361:19]

information gain of this node what I

[361:21]

will do I will add this up see 1

[361:24]

+ .33 I'll add like 1

[361:27]

+ .33 minus 0 why zero because the

[361:30]

information gain the similarity weight

[361:32]

of this uh the up one is basically 0

[361:37]

right for this particular credit node

[361:39]

similarity weight is zero so 1

[361:41]

+ .33 minus 0 this will be 1.33 so like

[361:45]

this further split will again happen

[361:47]

over here with different different node

[361:49]

and we will only be getting a binary

[361:51]

split but we will be comparing based on

[361:54]

Information Gain which one is coming

[361:55]

good now let's say that I have created

[361:57]

this path I have I have designed I have

[362:00]

developed my entire binary decision tree

[362:02]

which is a speciality in XG boost now

[362:06]

what I'm going to do over here is that

[362:08]

see everybody what I'm going to do let's

[362:10]

consider the inferencing part let's say

[362:12]

this record is going to go how we are

[362:15]

going to calculate the output so this

[362:17]

first of all went to this base model now

[362:21]

let's go ahead and see how the

[362:22]

inferencing will happen suppose This

[362:24]

Record is going right so first of all

[362:26]

this record will go to this base model

[362:29]

the base model is giving the probability

[362:30]

as 0.5 so the first base model is

[362:34]

basically giving 0.5 now based on

[362:36]

this 0.5 how do we calculate the real

[362:39]

probability how do we calculate the real

[362:41]

probability in this okay so we apply

[362:43]

something called as log odds so we basically

[362:45]

say log of P / (1 - P) so this is the

[362:49]

formula we basically apply in only the

[362:52]

case of base model so if we try to see

[362:55]

this it is nothing but log

[362:57]

of 0.5 / 0.5 which is nothing but zero log

[363:01]

of one is nothing but zero so in the

[363:03]

first case whenever any record goes I

[363:05]

will be getting the zero value over here

[363:08]

okay zero value over here then plus why

[363:11]

plus I'm doing because it will now go to

[363:13]

the binary decision tree now this record

[363:15]

will go to my binary decision Tre

[363:17]

whatever value I'm getting from this I'm

[363:19]

actually adding that up and now it will

[363:21]

go over here now when it goes over here

[363:24]

first of all let's see which branch it

[363:25]

is following it is following less than

[363:27]

or equal to 50 Branch first Branch over

[363:29]

here then this is bad it'll go and

[363:32]

follow here so here I can see that the

[363:34]

similarity weight is one now the

[363:36]

similarity weight is basically one in

[363:38]

this case so what we do in the case of

[363:40]

this we pass it to a learning rate

[363:44]

parameter so this specifically is my

[363:46]

learning rate multiplied by 1 one

[363:49]

because why similarity weight is one

[363:51]

over here so this will basically be my

[363:54]

first references and Alpha over here is

[363:57]

my learning rate it can be a very small

[363:59]

value based on the learning parameter

[364:01]

that we use like how we have defined

[364:04]

learning parameters elsewhere on top of

[364:06]

this we apply an activation function

[364:09]

which is called as sigmoid since this is

[364:11]

a classification problem we apply an

[364:14]

activation function which is called as

[364:15]

sigmoid and I hope you know what is the

[364:17]

use of sigmoid based on this based on

[364:20]

the alpha value based on this the output

[364:22]

will be between 0 to 1 now I hope you

[364:25]

getting it guys this is how the entire

[364:27]

inferencing will probably happen now
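The inference walk-through above, base-model log odds plus learning rate times each tree's leaf output, then sigmoid, can be sketched like this; the function name and the learning rate value are illustrative assumptions, not the library's API.

```python
import math

def predict_proba(leaf_outputs, base_prob=0.5, learning_rate=0.3):
    # base model contributes log(p / (1 - p)), which is 0 when p = 0.5
    log_odds = math.log(base_prob / (1 - base_prob))
    # each tree adds learning_rate * its leaf similarity output
    for out in leaf_outputs:
        log_odds += learning_rate * out
    # sigmoid squashes the summed log odds into (0, 1)
    return 1 / (1 + math.exp(-log_odds))

print(predict_proba([]))     # 0.5: only the base model contributes
print(predict_proba([1.0]))  # sigmoid(0.3), a bit above 0.5
```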

[364:30]

similarly what I will do I will try to

[364:32]

construct this kind of decision tree

[364:33]

parall so we we can also write our

[364:37]

entire function will look something like

[364:40]

this Alpha 0 + alpha 1 and this will be

[364:46]

your decision tree 1 output then Alpha 2

[364:50]

your decision tree 2 output Alpha 3 your

[364:53]

decision tree 3 output like this Alpha 4 your

[364:57]

decision tree 4 output fourth decision tree

[365:00]

like this it will be alpha n your

[365:02]

decision tree n output and this will be

[365:06]

your output finally when you're trying

[365:09]

to inference from any new

[365:12]

record now the reason why we say this as

[365:15]

boosting because see understand we are

[365:17]

going to add each and every decision

[365:19]

tree output slowly to finally get our

[365:22]

output with respect to the working of

[365:23]

the decision tree this is how XG boost

[365:26]

actually work don't credit further needs

[365:28]

to be simplified yes see like this

[365:31]

similarly we can split credit with the

[365:33]

help of like we can make blue green one

[365:35]

side normal at one side But whichever

[365:37]

will be giving the information gain more

[365:40]

that will be taken into consideration

[365:41]

right and this is how your entire XG

[365:43]

boost classifier works it is very very

[365:46]

difficult to basically calculate all

[365:48]

those things so that is the reason we

[365:50]

say that XG boost is also a blackbox

[365:53]

model so this is basically a black-box

[365:56]

model is it prone to overfitting see

[365:59]

at one stage we also need to perform

[366:02]

hyperparameter tuning and this we

[366:05]

specifically say pre-pruning we tend to

[366:08]

do pre pruning and since we are

[366:10]

combining multiple decision trees no no

[366:14]

this decision tree this decision tree is

[366:17]

this one this independent decision tree

[366:19]

which I have created now parall after

[366:21]

this what I'll do I'll create one more

[366:22]

decision tree so it'll be looking like

[366:24]

this see finally how it will look so

[366:26]

this is my base model then my data then

[366:29]

my data will go to this decision tree

[366:31]

which I have actually done as a binary

[366:33]

split on different different records

[366:36]

then again we will make another decision

[366:38]

tree which will again be a binary tree

[366:40]

the splits will look like this then this

[366:43]

is my base model where I'm getting the

[366:45]

value as zero this will be alpha 1

[366:47]

multiplied by decision tree 1 which is

[366:50]

this then this is Alpha 2 multiplied by

[366:53]

decision tree 2 which is this and like

[366:55]

this we will keep on continuously adding

[366:58]

more decision trees unless and until

[367:00]

this entire things becomes a very strong

[367:04]

learner so this is how how we basically

[367:06]

do the combination of all these things

[367:08]

so I hope everybody is able to

[367:10]

understand about the XG boost classifier

[367:14]

now you may be thinking how does

[367:15]

regressor work do you want a regressor

[367:17]

problem statement also the decision tree

[367:19]

will get constructed based on

[367:21]

Independent features and again Lambda

[367:23]

value is a hyperparameter we basically

[367:26]

set up Lambda value with the help of

[367:28]

cross validation now uh let's go ahead

[367:30]

and discuss about XG boost regressor the

[367:33]

second algorithm that we we will

[367:35]

probably discuss about is something

[367:37]

called as XG boost regressor and how

[367:41]

does XG boost regressor actually work

[367:43]

is the same fundamental followed in random

[367:45]

Forest no in random Forest it is

[367:47]

completely different there bagging

[367:49]

happens bagging happens so over here

[367:52]

let's go ahead with the regressor so

[367:54]

here I'm going to take some example

[367:56]

let's say that I have this many

[367:57]

experience this many Gap and based on

[368:00]

that we need to determine the salary my

[368:02]

salary is my output feature let's say

[368:04]

the experience is 2 2.5 3 4 4.5 okay now

[368:10]

in this Gap let's say it is yes

[368:13]

yes no no yes and let's say that the

[368:17]

salary is somewhere around 40K it is

[368:20]

41k

[368:22]

52k and uh let's see some more data set

[368:25]

over here 60k and 62k now the first step

[368:29]

in classifier we created a base model

[368:32]

here also we'll try to create a base

[368:33]

model first of all this base model what

[368:36]

output it will give it will give the

[368:38]

average of all these values what is the

[368:40]

average of all these values okay what is

[368:42]

the average of all these value 40 41 52

[368:45]

60 62 if I just do the average it is

[368:48]

nothing but 51k so by default I will

[368:50]

create a base model which will take any

[368:52]

input and just give the output as 51

[368:54]

this is the first step now based on this

[368:56]

I will try to calculate my residual now

[368:58]

how do I calculate my residual I will

[369:00]

just subtract 40 by 51k so this will

[369:03]

basically be - 11k

[369:06]

and uh this will be - 10 K and

[369:11]

this will be 1 this will be 9 and this

[369:16]

will be 11 I hope everybody's able to

[369:18]

get this let's say that I I make this as

[369:21]

42k okay for just making my calculation

[369:23]

little bit easy so I have 9 over here so

[369:26]

this is my residual then again the first
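The base-model step just described can be sketched as follows; per the lecture, 41k is adjusted to 42k while the base prediction is kept at 51k to make the arithmetic easy.

```python
# Base model outputs the average salary; residual = actual - predicted.
salaries = [40, 42, 52, 60, 62]   # in thousands, with 41k adjusted to 42k
base_prediction = 51              # average used in the worked example
residuals = [s - base_prediction for s in salaries]
print(residuals)  # [-11, -9, 1, 9, 11]
```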

[369:28]

step is that I construct my uh decision

[369:32]

tree now let's say say that I'm going to

[369:35]

use The Experience over here so this is

[369:37]

my experience node and based on this

[369:39]

experience node I have my features over

[369:42]

here so here I will take up all my

[369:44]

residuals - 11 - 9 1 9 11 and then how

[369:50]

do I do the split based on experience

[369:52]

this is a continuous feature so I have

[369:56]

to basically do split with respect to

[369:58]

continuous feature which I have already

[369:59]

shown you in decision tree how do we do

[370:01]

so here is my residual here it is 40

[370:04]

minus this

[370:05]

is - 11 K - 9 K uh this is 1 K this is 9

[370:12]

K and

[370:14]

11k so now I will just take

[370:17]

up my first node here I'm going to use

[370:20]

my experience feature I know my values

[370:23]

what all things are going to come - 11 in

[370:25]

the root node - 9 1 9 and 11 now what we

[370:30]

are going to do over here is that so I'm

[370:32]

going to do again a binary split over

[370:34]

here now the binary split will happen

[370:36]

based on the continuous feature that is

[370:38]

experienced so two types of Records I

[370:40]

may get one is less than or equal to two

[370:42]

and one is greater than 2 less than or

[370:46]

equal to two and one is greater than two

[370:48]

now less than or equal to two when I do

[370:49]

the split let's see how many values we

[370:51]

are getting less than or equal to two I

[370:53]

will get only one value that is - 11 and

[370:56]

here I'm actually going to get all the

[370:58]

other values - 9 1 9 11 now what we are

[371:02]

going to do after this is that calculate

[371:04]

the similarity weight now here the

[371:06]

similarity weight will little bit the

[371:08]

formula will change with respect to

[371:10]

regression so similarity weight is

[371:12]

nothing but square of the summation of

[371:15]

residuals divided by number of residuals

[371:18]

plus Lambda again here we are going to

[371:20]

consider Lambda is zero because this is

[371:22]

a hyper parameter tuning more the value

[371:25]

of Lambda that basically means more more

[371:27]

we are penalizing with respect to the

[371:29]

residuals so this will be the formula

[371:31]

that we are going to apply okay so let's

[371:33]

see for the first number that that we

[371:35]

want to apply so how this will get

[371:37]

applied again I'm going to write this

[371:39]

formula here it'll be better let's say

[371:42]

here similarity weight is equal to

[371:46]

summation of residuals whole square and here

[371:49]

you have number of residuals plus Lambda

[371:52]

see previously we were using probability

[371:54]

and then all those things we are using

[371:56]

so if you want to calculate the

[371:58]

similarity weight of this this will

[371:59]

become 121 divided by number of residual

[372:03]

is 1 plus Lambda is 0 so this is going

[372:06]

to be 121 so here we are going to

[372:08]

calculate the similarity weight which is

[372:10]

nothing but 121 if if we probably take

[372:13]

Lambda let's let's do one thing if we

[372:15]

probably take uh if if we probably take

[372:19]

Lambda is equal to 1 then what will

[372:22]

happen if you take Lambda is equal to 1

[372:22]

just think over here what will what may

[372:23]

happen we may directly penalize the

[372:26]

similarity weight right by just adding

[372:28]

one okay so let's do that also suppose I

[372:30]

say I'm going to take Lambda is equal to

[372:32]

1 so what will happen this will not be

[372:35]

the formula now now what will become 121

[372:38]

divided number of residual is 1 + 1 this

[372:41]

is nothing but 60.5 let's say that I now

[372:44]

have 60.5 as my similarity weight now

[372:47]

similarly I will go ahead and compute

[372:49]

the similarity weight for the next one

[372:52]

so here it will become - 9 + 1 + 9 + 11

[372:58]

whole square divided by 4 + 1 so this and

[373:01]

this will get cancelled 12 square is

[373:04]

nothing but 144 144 divided by 5 so if I go

[373:07]

ahead and calculate 144 divided by 5 it is

[373:10]

nothing but 28.8 so here I get

[373:15]

28.8 so the similarity weight for this

[373:18]

is

[373:20]

28.8 similarly I can go ahead and

[373:22]

calculate the similarity weight for this

[373:24]

for the top one so it'll be nothing but

[373:27]

what it will be sorry - 11 - 9

[373:34]

+ 1 + 9 + 11 whole square divided by 5 + 1 which

[373:41]

is 6 so this is getting cancelled the sum

[373:44]

will be 1 anyhow this will be whole

[373:46]

square right so anyhow it will be 1 by 6

[373:48]

only so 1 by 6 will be my similarity

[373:51]

weight over here okay now

[373:54]

finally The Information Gain that we

[373:56]

need to compute will be very much simple

[373:58]

what will be the Information Gain 60.5 +

[374:03]

28.8

[374:06]

minus 1 by 6 so try to get it whatever we

[374:09]

are trying to get it over here just tell

[374:11]

me what will be the output is it 98.34 no

[374:35]

60.5 + 28.8 minus 1 by 6 then this will change

[374:40]

just a second 89.13 understand you

[374:44]

don't have to worry about calculation

[374:46]

automatically that things will be doing

[374:48]

it okay so you don't have to worry now
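The regression similarity weights and information gain computed above, with lambda = 1 as in the example, can be sketched like this; the function name is my own.

```python
# Regression similarity weight:
# (sum of residuals)^2 / (number of residuals + lambda)
def sim_weight(residuals, lam=1.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

left = sim_weight([-11])                 # 121 / 2 = 60.5
right = sim_weight([-9, 1, 9, 11])       # 144 / 5 = 28.8
root = sim_weight([-11, -9, 1, 9, 11])   # 1 / 6
print(round(left + right - root, 2))  # 89.13
```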

[374:50]

see we have now further the decision

[374:52]

tree can be splitted into any number of

[374:54]

times probably the next split what we

[374:56]

can do is that we can we can do next

[374:58]

split something like this this will be

[375:00]

my experience the two splits that may

[375:03]

happen with respect to less than or

[375:05]

equal to 2.5 less than or equal to 2.5

[375:08]

or greater than 2.5 now if this probably

[375:11]

gives the Information Gain better then

[375:13]

the split will happen like this

[375:14]

otherwise whichever gives the better

[375:16]

information again the split will

[375:17]

basically happen like this I hope like

[375:20]

let's say that this is this is the split

[375:22]

that is required - 11 and - 9 are over

[375:25]

here and then we have 1 comma 9 comma 11 okay

[375:28]

because less than or equal to 2.5 this

[375:30]

two records will definitely go over here

[375:32]

and this two This Record will definitely

[375:34]

go over here now if I try to calculate

[375:36]

the similarity weight for this it will

[375:38]

be nothing but - 11 - 9 whole square

[375:43]

divided by 2 + 1 right now in this particular

[375:46]

case it will be - 20 square / 3 which is

[375:51]

nothing but 400 20 into 20 is 400

[375:55]

divided by 3 so if I go and

[375:57]

probably use a

[375:59]

calculator and show it to you

[376:02]

400 / 3 which is nothing but

[376:06]

133.33 so the similarity weight for this

[376:08]

is

[376:10]

133.33 similarly I can go ahead and

[376:12]

compute for this it will be 1 + 9 + 11

[376:15]

whole square / 3 + 1 right so it will be 10 +

[376:19]

11 10 + 11 is nothing but 21 whole square / 4

[376:24]

so what it is 21 whole square if I open

[376:27]

my calculator 21 square 21 * 21 which is

[376:33]

nothing but 441 divided by 4 so

[376:37]

this will probably be

[376:41]

110.25 and similarly I can go ahead and

[376:44]

compute for this so if I want to compute

[376:46]

for this what it will be the same thing

[376:49]

that we have got over here that is 1x 6

[376:51]

so this will basically be 1X 6 so

[376:53]

finally if I compute the information

[376:55]

again it will be what it will be

[377:01]

133.33 +

[377:03]

110.25 - 1 by 6 obviously this value will be

[377:06]

greater than the previous one what we

[377:08]

have got that is

[377:10]

89.13 so definitely we are going to use

[377:12]

this split which is better than the

[377:14]

previous split right let's say that this
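Comparing the two candidate splits by information gain, as done above with lambda = 1, can be sketched like this; the helper name is my own.

```python
# Similarity weight for regression: (sum of residuals)^2 / (count + lambda)
def sw(residuals, lam=1.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = sw([-11, -9, 1, 9, 11])                            # 1 / 6
gain_exp_le_2 = sw([-11]) + sw([-9, 1, 9, 11]) - root     # ≈ 89.13
gain_exp_le_2_5 = sw([-11, -9]) + sw([1, 9, 11]) - root   # ≈ 243.42
# the experience <= 2.5 split wins because its gain is higher
print(gain_exp_le_2_5 > gain_exp_le_2)  # True
```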

[377:17]

split has been considered finally how do

[377:20]

we see the output okay I hope everybody

[377:23]

is able to understand right let's say

[377:24]

that this split has worked well so I'm

[377:26]

going to rub all these things

[377:29]

110.25 is there now suppose I want to do

[377:33]

the inferencing how the inferencing will

[377:35]

be done

[377:37]

133.33 here 110.25 now suppose any record

[377:41]

comes from here first of all any record

[377:43]

that will go it will go to the base

[377:45]

model so the base model whenever it goes

[377:47]

the value is 51 51 plus alpha 1 this is

[377:51]

my learning rate one suppose if it goes

[377:54]

in this route then what we have we have

[377:56]

- 11 - 9 whenever we go in this route

[377:59]

which has - 11 and - 9 the average of

[378:02]

both these numbers will be considered

[378:03]

what is average of both these numbers -

[378:05]

11 - 9 / 2 this is nothing but - 10

[378:10]

right so - 10 will get multiplied here

[378:13]

suppose if it goes in this route then

[378:15]

here what will happen here will 1 + 9 +

[378:18]

11 divide by 3 average will be taken so

[378:20]

21 divid 3 7 will be there so this will

[378:23]

get replaced by 7 so similarly anything

[378:27]

that you are doing this is with respect

[378:28]

to decision tree 1 like this we will

[378:30]

again construct decision tree separately

[378:33]

and again it will become Alpha 2 by

[378:35]

decision tree 2 Alpha 3 by decision tree 3

[378:39]

and like this you will be doing till

[378:42]

Alpha n and decision tree n and once you

[378:45]

calculate this this will be your

[378:47]

specific output in a regression tree so
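Regressor inference as described, base prediction plus learning rate times each tree's leaf output, with the leaf output being the mean of its residuals, can be sketched as follows; the function name and learning rate value are illustrative assumptions.

```python
# a regression leaf outputs the average of the residuals routed to it
def leaf_output(residuals):
    return sum(residuals) / len(residuals)

base_prediction = 51
learning_rate = 0.3  # illustrative value

left_leaf = leaf_output([-11, -9])    # -10.0
right_leaf = leaf_output([1, 9, 11])  # 7.0

# a record routed to the right leaf of tree 1:
print(round(base_prediction + learning_rate * right_leaf, 1))  # 53.1
```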

[378:49]

in this particular case what will happen

[378:51]

you're just trying to play with

[378:53]

parameters and you're trying to use in a

[378:55]

different way to compute all this things

[378:57]

everybody clear but again it is a

[378:59]

blackbox model you cannot visualize all

[379:02]

this things now let's go to the third

[379:03]

algorithm which is called as SVM see

[379:05]

SVM is almost like uh logistic

[379:08]

regression okay so the major aim of svm

[379:12]

is

[379:13]

that major aim of svm is that suppose if

[379:16]

I have a do data points like this okay

[379:20]

we obviously use uh logistic regression

[379:23]

to split this data points right like

[379:25]

this we try to create a best fit line

[379:28]

which looks like this and probably based

[379:30]

on this best fit line we try to divide

[379:32]

the point now in svm what we do is that

[379:36]

we not only create a best fit line but

[379:40]

instead we also create planes which are

[379:44]

called as marginal

[379:45]

planes so like this we create some

[379:48]

marginal

[379:49]

plane so this is your hyper plane and

[379:53]

this is your marginal plane and

[379:55]

whichever plane has this maximum

[379:58]

distance will be able to divide the

[380:01]

points more efficiently but usually in

[380:05]

in a normal scenario you know whenever

[380:07]

we talk about hyper plane or whenever we

[380:10]

talk about marginal plane there will be

[380:11]

lot of overlapping of points right

[380:13]

suppose if I have some specific points I

[380:16]

have one point which looks like this I

[380:18]

may also have another points which may

[380:20]

overlap so it is very difficult to get

[380:23]

an exact straight marginal planes and

[380:26]

split the point based on this now this

[380:28]

specific marginal plane should be

[380:30]

maximum because we can create any type

[380:32]

best fit line and probably

[380:35]

uh use this marginal plane now if we

[380:38]

have this overlapping right if for what

[380:40]

do we call for this kind of plane this

[380:42]

kind of plane is basically called as

[380:44]

hard marginal plane so this is basically

[380:47]

called as hard marginal plane okay and

[380:51]

similarly if any points are overlapping

[380:54]

suppose this yellow points can also get

[380:56]

overlapped over here and there may be

[380:58]

some kind of Errors so for this

[381:00]

particular case we basically say as soft

[381:02]

marginal plane because here we will be

[381:05]

able to see that errors will be there

[381:07]

now in SVM what we focus on doing is

[381:10]

that we focus on creating this marginal

[381:13]

plane with maximum distance even though

[381:15]

there are some errors we consider it in

[381:17]

solving it by providing some kind of

[381:19]

hyper parameter now how do we go ahead

[381:22]

and basically create this all marginal

[381:24]

planes and how do we go ahead with this

[381:26]

it's very much simple uh just imagine in

[381:29]

this specific way that initially let's

[381:32]

consider that I have this data point

[381:33]

suppose this is my

[381:35]

best fit line how do we give this best

[381:38]

fit line as equation we basically say

[381:40]

y equal to mx + C right we we basically say

[381:43]

this equation as y = mx + C no hard

[381:47]

marginal plane it is impossible in a normal

[381:50]

data set obviously you'll not be able to

[381:52]

get it but definitely we go ahead with

[381:55]

creating a soft marginal plan now Y is

[381:56]

equal to MX plus C what does this m

[381:59]

indicate m is nothing but slope and C

[382:02]

indicates nothing but intercept

[382:05]

can I say that this both equations are

[382:07]

same ax + by + c equal to 0 can I also say

[382:12]

that this is the equation of a straight

[382:14]

line can I say that this is also the

[382:16]

equation of straight line I will say

[382:18]

that both of them are equal can I say

[382:20]

both of them are equal see if I try to

[382:22]

prove this to you if I take this

[382:24]

equation and try to find out y it will

[382:26]

be nothing but minus c

[382:30]

minus a x and this will be

[382:34]

divided by b the whole thing will be divided by

[382:37]

b so here you

[382:40]

can see that it is almost the same in

[382:42]

this particular case my M value will be

[382:44]

- A by B and my C will basically be

[382:47]

minus C by B so both the equation are

[382:49]

almost same

[382:51]

so let's consider that this is my

[382:53]

equation and I am actually and whenever

[382:57]

I say Y is equal to mx + C can I also

[383:00]

write something like this Y is equal to

[383:03]

W1

[383:05]

X1 + W2 X2 plus like this plus C or plus

[383:10]

b same thing no so here also we can

[383:13]

write y = w transpose x + b same equation

[383:17]

right we are basically using same

[383:19]

equation yes we can also write it in a

[383:21]

different way but at the end of the day

[383:23]

we are also treating something like this

[383:25]

let's say that this slope is in this

[383:28]

direction if this slope is in this

[383:30]

direction then I can basically say that

[383:32]

let's consider that the slope is minus

[383:33]

one

[383:35]

let's say that this slope is minus one

[383:36]

see it is in the negative Direction

[383:38]

let's say that this slope is minus one

[383:40]

I'm just trying to prove that this slope

[383:42]

is negative value let's consider this

[383:44]

now suppose this is one of my point - 4 comma

[383:48]

0 and obviously this particular equation

[383:50]

is given by this particular line is

[383:52]

given by this equation now if I really

[383:55]

want to find out the Y value let's say

[383:57]

that this is my

[383:59]

X1 this is my X1 and this is my X2 let's

[384:03]

say that

[384:05]

I want to find out I want to find out

[384:08]

this W transpose x + b the Y value based

[384:12]

on this line if I want to compute the y-

[384:14]

value based on this line how will I

[384:16]

compute W transpose X basically means

[384:18]

what w value what all things will be

[384:20]

there one value is B right B is

[384:23]

intercept right now intercept is passing

[384:25]

from origin can I say my B will be zero

[384:28]

obviously I can assume that b will be

[384:30]

zero now in this particular case if I

[384:32]

talk about w w in this case is minus one

[384:35]

which I have initialized over here so if

[384:37]

I want to do this matrix multiplication

[384:39]

it will be W transpose can be written as

[384:41]

like this and this x value can be

[384:44]

written as - 4 comma 0 that is - 4 and 0

[384:49]

right so I can basically write like this

[384:52]

now if I do this multiplication what

[384:54]

will my value I get I will basically get

[384:57]

four right so this is a positive

[385:01]

value this is a positive value Now

[385:04]

understand since this is a positive

[385:05]

value any points that are below this

[385:08]

line any points that I consider below

[385:11]

this line and if I try to calculate the

[385:13]

Y can I say that it will always be

[385:15]

positive yes or no similarly if I could

[385:18]

probably consider one point over here as

[385:21]

4 comma 4 now tell me in this 4 comma 4 if I

[385:25]

calculate the Y value what will you get

[385:27]

whether you'll get a positive value or a

[385:29]

negative value if I try to calculate the

[385:30]

Y value in this case because here only

[385:32]

positive values will'll be getting right

[385:34]

so if I calculate the Y value will the Y

[385:37]

value be negative or positive just try

[385:39]

to calculate how do you calculate again

[385:41]

I will use y equation this time again my

[385:44]

slope is minus1 my intercept is zero and

[385:46]

here I will have 4 comma

[385:49]

4 now here minus

[385:51]

4 and then this is + 0 this will be minus

[385:54]

-4 right so this will be a negative

[385:57]

value negative value guys negative see -

[386:00]

4 + 0 negative so any point that I will

[386:05]

probably have in top of this any

[386:08]

points Above This Plane right and if I

[386:12]

try to calculate the Y value it will

[386:13]

always be negative so what two things

[386:16]

you are able to get positive and

[386:17]

negative so you can consider this

[386:19]

entirely one category this another

[386:22]

category at least these two things you

[386:24]

can basically

[386:25]

consider guys I hope everybody is able

[386:27]

to understand this so this will be my

[386:29]

one

[386:30]

category and this will be my another

[386:32]

category obviously so that basically

[386:34]

means I can definitely use a plane and

[386:35]

split this point I hope everybody is
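The sign argument above, points on opposite sides of a separating line w·x + b = 0 get opposite signs, can be sketched like this; the weight vector is my own illustrative choice for the line y = -x, picked so the two example points reproduce the positive-below, negative-above pattern from the lecture.

```python
# Evaluate w.x + b; with w = (-1, -1) and b = 0 the zero set is the
# line y = -x through the origin, as in the example above.
def decision(x1, x2, w=(-1.0, -1.0), b=0.0):
    return w[0] * x1 + w[1] * x2 + b

print(decision(-4, 0))  # 4.0: below the line, one category
print(decision(4, 4))   # -8.0: above the line, the other category
```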

[386:37]

able to understand now let's go ahead

[386:39]

and let's see how this marginal plane

[386:41]

will get created and what is the cost

[386:44]

function to basically do this or what is

[386:46]

the cost function in making sure that

[386:48]

the marginal plane will definitely work

[386:50]

right it becomes difficult right so

[386:52]

suppose let's consider an

[386:55]

example suppose I say that this is my

[386:58]

lines let's say uh I want to basically

[387:01]

create a kind of I have two variety of

[387:03]

points one is this point let's say I

[387:06]

have all this points like this and the

[387:07]

other points I have somewhere here let's

[387:10]

consider I am just using directly good

[387:13]

number of points so that I can split it

[387:15]

okay because I will try to talk about it

[387:17]

what I'm actually trying to prove so

[387:20]

obviously this is my best fit line that

[387:21]

splits and apart from that what I will

[387:24]

do is that I'll also create a marginal

[387:26]

points so in order to create the

[387:27]

marginal point I may use some different

[387:30]

color let's see which color this will be

[387:32]

my one marginal point remember it will

[387:35]

be to the nearest point over here and

[387:38]

basically we will construct like like

[387:40]

this and similarly here we will be

[387:43]

constructing like this I've already told

[387:45]

you guys this equation can be mentioned

[387:48]

at w transpose x + B = 0 right I can

[387:51]

definitely say this because ax + b y + C

[387:55]

is equal to 0 so this I can also write

[387:57]

it as W transpose x equal to 0 sorry

[388:00]

plus b plus b equal to 0 so both are

[388:03]

same okay this I don't have to prove it

[388:05]

I hope everybody's clear with this now

[388:08]

what I'm going to do let's represent

[388:10]

this line also with some equation so

[388:12]

this line if I want to represent this

[388:14]

will be W transpose x + B what value

[388:17]

will come over here positive or negative

[388:19]

C from this line anything above this

[388:21]

plane right any any any distance that we

[388:24]

try to find out it will always be

[388:25]

negative so let's say that I'm using it

[388:27]

as minus one to just read as it is a

[388:30]

negative value and this line that I am

[388:32]

going to mention it it will be W

[388:34]

transpose x + B is equal to + 1 Min -1

[388:37]

above + 1 because we have already

[388:39]

discussed from this point if you're

[388:41]

trying to calculate the Y value it is

[388:43]

always going to be + one this is going

[388:45]

to be minus one here I should definitely

[388:48]

say this as K okay but I'm not

[388:50]

mentioning K in many articles you'll see

[388:53]

it as minus one uh many research paper

[388:55]

also they use it as minus one but I

[388:57]

would like to specify uh minus and plus

[388:59]

K but here let's go and write minus1 and

[389:02]

plus now my aim is to increase this

[389:05]

distance okay this distance I really

[389:07]

want to increase this distance now in

[389:09]

order to increase this if I increase

[389:11]

this distance that basically means my

[389:13]

model is performing well so let's say I

[389:16]

want to find this distance first of all

[389:18]

so if I write w transpose x + b equal to

[389:20]

1 and here I will write w transpose x +

[389:23]

b equal to minus 1 so what I'm going to do

[389:25]

I'm going to do the computation and

[389:28]

subtract it like this so here obviously

[389:31]

this will be my X1 this will be my X2

[389:34]

okay because these are my another points

[389:35]

X2 and X1 so I can write w transpose X1

[389:40]

-

[389:42]

X2 b and b will get cancelled and here I

[389:45]

will be writing two right so from here

[389:49]

we can definitely write two different

[389:50]

things let's see what all things we can

[389:52]

write so here this is nothing but the

[389:54]

difference between my this plane and

[389:56]

this plane which is given by like this

[389:58]

okay now always understand whenever we

[390:01]

consider any any vectors right any

[390:06]

vectors right it also has something

[390:07]

called as

[390:09]

magnitude so if I want to remove this

[390:12]

magnitude I can divide this by W this

[390:16]

magnitude of w then only my Vector will

[390:18]

remain which is indicated like this so

[390:20]

I'm going to basically divide by this

[390:22]

particular operation both both the side

[390:24]

I'm dividing by this magnitude of w and

[390:27]

I don't care about the directions over

[390:29]

here right now we just care about the

[390:30]

vectors now when I write like this what

[390:33]

is our aim our aim is to can I say our

[390:36]

aim is to our aim is to

[390:40]

maximize 2 by magnitude of w can I say this guys yes

[390:43]

or

[390:46]

no what is our aim our aim is to

[390:49]

basically maximize this right by

[390:52]

updating W comma B value I need to

[390:56]

maximize this yes everybody's clear with

[390:59]

this can I say that yes I want to

[391:01]

maximize this yes or no everybody I want

[391:05]

to maximize this if I maximize this that

[391:07]

basically means my marginal plane will

[391:08]

become bigger my marginal plane will be

[391:10]

bigger okay now can I write along with

[391:13]

this that such that y of I my output

[391:17]

will be dependent on two different

[391:18]

things one is I can say that my y of I

[391:22]

is plus one when w transpose

[391:26]

x + B is greater than or equal to 1

[391:29]

everybody see in this equation what I'm

[391:31]

actually trying to specify such that y

[391:33]

of I is + 1 when w transpose x + B is

[391:36]

greater than 1 and when it is minus 1

[391:38]

that basically means w transpose x plus

[391:40]

b is less than or equal to minus 1 now

[391:42]

what does this basically mean see all my

[391:46]

values whenever I compute W transpose x

[391:49]

+ B is greater than or equal to 1 I'm

[391:51]

obviously going to get this + one when w

[391:54]

transpose x + b is less than or equal to

[391:56]

minus 1 I'm always going to get the output as

[391:58]

minus one I hope that is the reason why

[392:00]

I have actually written like this so

[392:02]

these two we have already discussed why

[392:03]

we are specifically writing we want to

[392:05]

increase the marginal plane which is

[392:07]

this this is my marginal plane and I'm

[392:09]

writing one condition that my Yi value

[392:11]

will be plus one when w transpose x plus b

[392:14]

is greater than or equal to 1 otherwise

[392:16]

it when it is less than or equal to

[392:17]

minus one it is going to be very much

[392:18]

clear with this transpose condition we

[392:20]

have already done it everybody clear

[392:22]

with this now on top of it we can add

[392:25]

one more very important Point instead of

[392:28]

writing such that and all you can also

[392:30]

say that our major

[392:32]

aim our major aim is that if I multiply

[392:36]

y i multiplied by W transpose X of I + B

[392:41]

if I multiply these two this will always

[392:44]

be greater than or equal to 1 for

[392:48]

correct points right for correct points

[392:52]

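What he is saying here can be checked with a few lines of code: for every correctly classified point on or beyond its marginal plane, y_i times (w transpose x_i + b) comes out greater than or equal to 1. The separator w = (1, 1), b = -3 and the points below are made-up values, not from the session:

```python
# Assumed toy separator: w = (1, 1), b = -3, so the decision plane is x1 + x2 = 3.
w = (1.0, 1.0)
b = -3.0

def margin_score(x, y):
    """y_i * (w.x_i + b): >= 1 for points correctly beyond their marginal plane."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)

# Correct points on either side: minus into minus also gives a positive score.
print(margin_score((5.0, 0.0), +1))   # +1 side: 1 * (5 + 0 - 3) = 2.0
print(margin_score((0.0, 1.0), -1))   # -1 side: -1 * (0 + 1 - 3) = 2.0
# A point inside the margin (or misclassified) scores below 1.
print(margin_score((2.0, 1.5), +1))   # 1 * (2 + 1.5 - 3) = 0.5
```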
because understand if it is minus one if

[392:55]

I'm multiplying with this and if it is a

[392:57]

correct Point minus into minus will

[392:59]

obviously be greater than or equal to

[393:01]

one only right similarly for this it

[393:03]

will be greater than 1 so I can also

[393:05]

definitely say that my major aim is if I

[393:07]

multiply y of I with this it will be

[393:10]

always greater than or equal to plus 1

[393:12]

which is definitely saying that it will

[393:14]

be a positive value so this is just a

[393:16]

representation guys but understand what

[393:19]

is the cost function this is

[393:21]

my cost function to maximize

[393:23]

now I'm going to again

[393:26]

write it down

[393:28]

maximize W comma B maximize W comma b 2

[393:33]

by magnitude of w I can also write

[393:37]

something like this minimize W comma B

[393:40]

and I can just inverse this which looks

[393:43]

like this are these both are same or not

[393:45]

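The equivalence he is asking about, maximizing 2 by magnitude of w versus minimizing its inverse, can be sanity-checked numerically; the two weight vectors below are made-up examples:

```python
import math

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1 is 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# A smaller ||w|| gives a wider margin, so maximizing 2/||w|| over w and b
# is the same problem as minimizing ||w|| (equivalently ||w|| / 2).
w_small = (0.5, 0.0)   # ||w|| = 0.5
w_large = (2.0, 0.0)   # ||w|| = 2.0
print(margin_width(w_small))  # 4.0
print(margin_width(w_large))  # 1.0
```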
because always understand in machine

[393:48]

learning algorithm why do we write

[393:51]

minimize things because we are trying to

[393:54]

minimize something okay both are

[393:57]

equivalent these both are equivalent and

[393:59]

why we specifically write minimization

[394:01]

because in the back propagation when we

[394:03]

are continuously updating the weights

[394:05]

of w and B so we can definitely write

[394:08]

like this so here my main target is to

[394:12]

minimize this particular value by

[394:14]

changing W and B and I will start adding

[394:17]

some more parameters over here this is

[394:19]

fine till here I think everybody has got

[394:22]

it this is our aim and we are going to

[394:23]

do this but I'm going to add two more

[394:26]

parameters in this Optimizer one is C

[394:29]

I and one is summation of I equal 1 to n

[394:33]

and here I will use something called as

[394:35]

eta of i first of all I'll tell what

[394:38]

is C see if I have this specific

[394:41]

data point let's say if some of my

[394:44]

points are over here then is it a right

[394:47]

prediction or wrong prediction if

[394:49]

some of my points are over here is it a

[394:51]

right prediction or wrong prediction

[394:54]

obviously it is a wrong prediction if my

[394:56]

points are somewhere here is it a wrong

[394:58]

prediction it is an incorrect

[394:59]

prediction right so this C value

[395:02]

basically says that how many errors we

[395:04]

can have how many errors we can have if

[395:06]

it says that fine we can have six errors

[395:08]

or seven errors how many errors we can

[395:11]

have even though we are using the

[395:13]

marginal plane how many errors we can

[395:16]

have so here I'm specifically writing

[395:18]

how many errors we can have this is what

[395:21]

is specified by C eta of i basically

[395:24]

says that what is the summation of I'm

[395:26]

going to write it down since we are

[395:28]

doing the summation this entire term

[395:31]

basically mentions the summation

[395:34]

of the distances the distances

[395:37]

of the wrong points and how do we

[395:39]

calculate the distance from here to here

[395:42]

suppose this is a wrong point I will try

[395:44]

to calculate the distance from here to

[395:45]

here I will do the summation of this

[395:47]

I'll do the summation of this I will do

[395:49]

the summation of this similarly for the

[395:51]

green point another summation will

[395:53]

happen from here to here like this here

[395:56]

to here and we are going to do that specific

[395:57]

summation so we are telling that fine if

[396:01]

you are not able to fit properly try to

[396:05]

apply these two hyperparameters and try

[396:07]

to make sure that this many errors are

[396:10]

also there it is well and good no

[396:11]

problem we will go ahead with that try

[396:14]

to do the summation of the data points

[396:15]

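The soft-margin objective being described, magnitude of w by 2 plus C times the summation of the slacks eta of i (the standard textbook symbol is xi_i, and the norm term is often written ||w||^2 / 2), can be sketched as below; the separator and points are illustrative values, not from the session:

```python
import math

def slack(x, y, w, b):
    """eta_i = max(0, 1 - y_i (w.x_i + b)): zero for points beyond their
    marginal plane, positive for points inside the margin or misclassified."""
    return max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b))

def soft_margin_objective(points, w, b, C):
    """||w|| / 2 + C * sum of slacks -- the quantity minimized over w and b."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return norm_w / 2.0 + C * sum(slack(x, y, w, b) for x, y in points)

w, b = (1.0, 1.0), -3.0
points = [((5.0, 0.0), +1),   # correct and beyond the margin -> slack 0
          ((2.0, 1.5), +1),   # inside the margin -> slack 0.5
          ((0.0, 4.0), -1)]   # misclassified -> slack 2.0
print(soft_margin_objective(points, w, b, C=1.0))
```

A larger C punishes each error more, so the optimizer tolerates fewer of them; a smaller C allows a wider margin with more slack.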
and based on that try to construct the

[396:18]

best fit line along with the marginal

[396:20]

plane like this even though there are

[396:23]

some errors over here or errors over

[396:25]

here we are good to go with that one

[396:27]

more thing is there which is called as

[396:28]

SVR in SVR only one thing is getting

[396:32]

changed in svr only this value will get

[396:36]

changed so I want you all to explore and

[396:38]

just let me know this will be one

[396:40]

assignment for you only this value will

[396:42]

be changing remaining everything are

[396:43]

same so if you change this

[396:46]

particular value that becomes an svr

[396:49]

just try to explore and just try to find

[396:51]

out and just try to let me know so

[396:52]

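As a hint for the SVR exploration he assigns: the term that usually changes is the error term, which in standard SVR becomes the epsilon-insensitive loss on a continuous target instead of the classification slack. This is a hedged sketch of that loss, not his official answer:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR slack: zero while the prediction stays inside the epsilon tube
    around the target, and linear in the error once it leaves the tube."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

print(epsilon_insensitive_loss(3.0, 3.05))   # inside the tube -> 0.0
print(epsilon_insensitive_loss(3.0, 3.5))    # outside: 0.5 - 0.1 = 0.4
```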
overall uh did you like the entire

[396:55]

session everyone okay in this one more

[396:57]

thing is there which is called as kernel

[396:59]

Matrix svm kernel we say it as svm

[397:02]

kernel now in svm kernel what happens

[397:04]

suppose if I have a specific data points

[397:06]

which looks like this which looks like

[397:08]

this so we obviously cannot use a

[397:10]

straight line and try to divide it so

[397:11]

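The fix he describes next, pushing the 2-D points into a third dimension so that a flat plane can separate them, can be sketched with one common choice of lifting, z = x1 squared plus x2 squared (an assumed mapping picked for illustration):

```python
def lift(x1, x2):
    """Map a 2-D point to 3-D by adding the coordinate z = x1^2 + x2^2."""
    return (x1, x2, x1 * x1 + x2 * x2)

inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.3)]   # one class, near the origin
outer = [(2.0, 0.0), (0.0, -2.0), (1.5, 1.5)]    # other class, on a surrounding ring

# No straight line separates these in 2-D, but after lifting, the flat plane
# z = 1 splits them: every inner point lands below it, every outer point above.
assert all(lift(*p)[2] < 1.0 for p in inner)
assert all(lift(*p)[2] > 1.0 for p in outer)
print("separable by the plane z = 1")
```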
what we do we convert these two dimensions

[397:14]

into three dimensions and then probably

[397:17]

we push our Point like this one point

[397:19]

will go like this and the white point

[397:21]

will go down and then we can basically

[397:24]

use a plane to split it so I uploaded a

[397:26]

video around uh around that and uh you

[397:29]

can definitely have a look onto that and

[397:31]

I have also shown you practically how to

[397:33]

do it that is the reason I've created

[397:35]

that specific video so great uh this was

[397:37]

it from my side I hope you like this

[397:39]

session so thank you everyone have a

[397:41]

great day keep on rocking keep on

[397:43]

learning and never give up
