So, for today's session, what are the things we are going to discuss? First of all, we are going to discuss the different types of machine learning algorithms. Understand that the purpose of taking this session is to clear interviews. Once you go for data science interviews, the main purpose is to clear them, and I've seen that people who knew the machine learning algorithms in a proper way were definitely able to clear it, because they explained the algorithms in a better way to the recruiter and got hired.

First of all is the introduction to machine learning; here I'm specifically going to talk about AI versus ML versus DL versus data science. The second thing we are going to talk about is the difference between supervised ML and unsupervised ML. The third thing we are going to discuss is something called linear regression, where we will clearly understand the maths and the geometric intuition. The next thing we are going to discuss is R-squared and adjusted R-squared. The fifth topic is Ridge and Lasso regression.

So the first topic we are going to discuss is AI versus ML versus DL versus data science. If you really want to understand the difference between them, we will go in this specific format.
Just imagine the entire universe; this entire universe I will call AI. When I say AI, this means artificial intelligence. Whatever role you are in, whether you are working as a machine learning developer, a deep learning developer, a computer vision developer, a data scientist, or an AI engineer, at the end of the day you are actually creating AI applications. If I really want to define what artificial intelligence is, you can just say that it is a process wherein we create some kind of application which will be able to do its task without any human intervention. That basically means a person need not monitor this AI application; automatically it will be able to make decisions, perform its task, and do many other things. That is what an AI application is.

Some examples I would definitely like to consider. The first example is an AI module: Netflix has an AI module. Suppose you watch action movies for some time; then the kind of AI work implemented over here is something called recommendation. Through this application, when you are continuously watching action movies, the AI module present inside Netflix will make sure it gives us recommendations for action movies. Second, if I take the example of comedy movies: if I continuously watch comedy movies, then it will give us recommendations for comedy movies too. Through this, it understands your behavior and is able to do its task without asking you anything.

The second example I would like to take up is amazon.in. On amazon.in, if you buy an iPhone, it may recommend headphones to you. This kind of recommendation is also part of an AI module integrated with the amazon.in website. The ads that you see when you open my channel, through which I get paid a little bit for the hard work that I do on YouTube, are also driven by an AI engine included in YouTube itself. Understand, it is a business-driven thing that we basically do with the help of AI.

One more example I would like to give is self-driving cars. If you take the example of Tesla: in a self-driving car, based on the road, the car is able to drive automatically. Who is doing that? There is an AI application integrated with the car itself. So if I consider all these things, these are all AI applications. At the end of the day, whatever role you are in, you are going to create an AI application; this is the common point people miss. For example, our CEO Sudhanshu Kumar has written in his profile that he is an AI engineer; that basically means his goal is to create AI applications. Probably in product-based companies you'll be seeing this kind of role, called AI engineer.
Now let's go to the next term, which is called machine learning. Where does machine learning come into the picture? Machine learning is a subset of AI. And what is the role of machine learning? It provides statistical tools to analyze the data, visualize the data, and, apart from that, to do predictions and forecasting. You will be seeing a lot of machine learning algorithms, and internally the equations those algorithms use are a kind of statistical tool or technique, because whenever we work with data, statistics is definitely very important. This exactly is what is called machine learning. It is a subset of AI; this is very important to understand: ML is a subset of AI, so here you can see that it is a part of it.
Now let's go to the next one, which is called deep learning. Deep learning is again a subset of ML. Why did deep learning come into existence? Because in the 1950s and 60s, scientists thought: can we make machines learn the way we human beings learn? For that particular purpose, deep learning came into existence. Here the plan is to mimic the human brain; when I say mimicking the human brain, that means we are trying to mimic how the brain implements and learns things. For this you use something called multi-layered neural networks. So this is what deep learning is: it is a subset of machine learning, its main aim is to mimic the human brain, and for that we create multi-layered neural networks, which help you train the machines or applications we are trying to create. And deep learning has really done amazing work; with its help we are able to solve very complex use cases, which we will be discussing as we go ahead.

Now, if I come to data science: see, this is the thing, guys. If you want to call yourself a data scientist, tomorrow you may be given a business use case, and a situation may come where you have to solve that use case with the help of machine learning algorithms or deep learning algorithms; again, the final goal is to create an AI application. You cannot say, "I am a data scientist and I'll just work in machine learning," or "I'll only work in deep learning," or "I don't know how to analyze the data." No, you cannot do that. When I was working at Panasonic, I got various kinds of tasks: sometimes I was told to use Power BI to visualize and analyze the data, sometimes I was given a machine learning project, sometimes a deep learning project. So if I consider where the data scientist falls in this picture, it will be a part of everything.

So, if I talk about machine learning and deep learning with respect to any kind of problem statement that we solve, the majority of business use cases will fall into two sections: one is supervised machine learning and one is unsupervised machine learning. Most of the problems you are solving belong to these two types of machine learning, that is, supervised machine learning and unsupervised machine learning. If I talk about supervised machine learning, there are two major problem statements you are solving: one is the regression problem, and the other is something called the classification problem. And in the case of unsupervised machine learning, you are solving two different types of problems: one is clustering and one is dimensionality reduction. There is also one more type, called reinforcement learning; I will definitely talk about reinforcement learning, but not right now. Right now we are just focusing on all these things.
Now, understand what happens in supervised machine learning. Let's consider a data set. Here I have a data set with two features: age and weight. Let's say I have values like 24 with 62, 25 with 63, 21 with 72, and many more data points. Let's say my task is to take this particular data and create a model: first we train the model with this data, and then, whenever it takes a new age, it should be able to give us the output weight. This particular model is also called a hypothesis; I'll discuss that today when we discuss linear regression.

Now, what are the important components whenever we have this kind of problem statement? First of all, you need to understand there are two important things: one is independent features, and the other is something called dependent features. Let's discuss what an independent feature is. Independent features are, in this particular case, the inputs on which I am training; all those features become independent features, so here age is my independent feature. And whatever I am actually predicting (I know this is my output, this is what I have to make my model give as output), that is my dependent feature, which in this case is weight. Why do we specifically call it a dependent feature? Because it is completely dependent on the age value: whenever age increases or decreases, this value changes accordingly. That is why we speak of independent and dependent features whenever we are solving a problem. In the case of supervised machine learning, remember: there will be one dependent feature, and there can be any number of independent features.
number of independent features now let's
go ahead and let's discuss about
regression and classification what is
the difference between them now let
let's go ahead and let's discuss about
two things one
is let's say I want a regression problem
statement suppose I take the same
example as age and weight so I have
values like as discussed 24 72 23
71 uh 24 or 25
71.5 okay so this kind of data I have
see this is my output variable which is
my dependent feature now in this
particular dependent feature now
whenever I'm trying to find out the
output and in this particular output you
have a continuous variable when you have
a continuous variable then this becomes
a regression problem statement now one
example I would like to give suppose
this is my data set right this is my age
this is my weight suppose I am
populating this particular data set with
the help of scatter plot then in order
to basically solve this problem what
we'll do suppose if I take an example of
linear regression I will try to draw a
straight line and this particular line
is my equation which is called as yal mx
+ C and with the help of this particular
equation I will try to find out the
predicted points so this will be my
predicted point this will be my
predicted point this this any new points
that I see over here will basically be
my predicted point with respect to Y so
in this way we basically solve a
regression problem statement so this is
very much important to understand let's
go to the always understand in a
regression problem statement your output
will be a continuous variable the second
one is basically a classification
problem now in classification problem
suppose I have a data set let's say that
number of hours study number of study
hours number of play
hours so this is my independent feature
let's say a number of sleeping hours and
finally I have my output which will will
be pass or fail so in this I have all
this as my independent features and this
is my dependent feature so I will be
having some values like this and here
either you'll be pass or fail or pass or
fail now whenever you have in your
output fixed number of categories then
that becomes a classification problem
suppose it just has two outputs then it
becomes a binary classification if you
have more than two different categories
at that time it becomes a multiclass
classification so this is the difference
between regression problem statement and
the classification problem statement now
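The rule of thumb just described can be sketched as a tiny helper function (purely illustrative, not any library's API): a fixed set of categories means classification, and the count of distinct categories decides binary versus multiclass.

```python
def classification_kind(outputs):
    """Decide binary vs multiclass from the number of distinct categories."""
    categories = set(outputs)
    return "binary" if len(categories) == 2 else "multiclass"

print(classification_kind(["pass", "fail", "pass", "pass"]))  # binary
print(classification_kind(["cat", "dog", "bird"]))            # multiclass
```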
Now let's go ahead and discuss something called unsupervised machine learning, which is my second main topic over here. What exactly is unsupervised machine learning? Here there are two main problem statements that we solve: one is clustering and one is dimensionality reduction. Let's take the example of a specific data set with salary and age. In this scenario we don't have any output variable: no output variable, no dependent variable. So what kind of conclusions can we draw from this data set? With salary and age as my values, in this particular case I would like to do something called clustering.

Now, why is clustering used? Let's say I am going to do something called customer segmentation. What does customer segmentation do? Clustering basically means that, based on this data, I will try to find similar groups of people. Suppose this is one group, this is another group, and this is a third group; these groups are clusters, call them cluster 1, 2, 3. Each and every cluster will convey some information: one cluster may indicate people who are very young but earning an amazing salary; another may indicate people who are older and getting a good salary; another may indicate people from a middle-class background, where the salary is not increasing that much with age. So what are we doing here? Clustering: we are grouping them together. The main thing is grouping; this word is very important.

Now, why do we use this? Suppose my company launches a product and I want to target it only at rich people; say product one is for rich people and product two is for middle-class people. If I make these kinds of clusters, I will be able to target my ads only at those kinds of people: to the rich-people cluster or the middle-class cluster, I can direct that particular ad, product, or message. That is basically called ad targeting, and it uses something called customer segmentation, a very important example. And based on this customer segmentation, we can later apply a regression or classification kind of problem statement.
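As a sketch of the customer-segmentation idea, here is K-means (an algorithm the session lists later) applied to a few made-up age/salary rows using scikit-learn; the data and the choice of two clusters are assumptions for illustration only.

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [age, salary]
X = [[22, 25000], [25, 27000], [24, 26000],
     [48, 95000], [52, 99000], [50, 97000]]

# Group into 2 clusters (e.g. "middle class" vs "rich") purely by similarity;
# note there is no output column here, only the input features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # same label = same segment; which label is which is arbitrary
```

Once the segments exist, a campaign could be targeted at only the rows carrying one of the two labels.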
Now, coming to the second one after clustering, which is called dimensionality reduction. What are we focusing on in dimensionality reduction? Suppose we have 1000 features: can we reduce those features to lower dimensions? Let's say I want to convert those 1000 features to 100 features, a lower dimension. Can we do that? Yes, it is possible with the help of dimensionality reduction algorithms; there are algorithms like PCA, which I'll also try to cover as we go ahead. Understand: clustering is not a classification problem; clustering is a grouping algorithm. There is no output feature, no dependent variable in clustering, or rather in unsupervised ML generally. And yes, I will also try to cover LDA; we'll cover PCA and all as we go ahead.
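The 1000-features-to-100 idea can be sketched at a smaller scale with scikit-learn's PCA; here 5 hypothetical features are reduced to 2 (the data is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))        # 10 samples, 5 original features

X_reduced = PCA(n_components=2).fit_transform(X)  # project down to 2 dimensions
print(X_reduced.shape)              # (10, 2)
```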
So, with respect to supervised and unsupervised: the first thing we are going to cover is linear regression. The second algorithm, after linear regression, is Ridge and Lasso regression. Third is logistic regression. Fourth is decision tree; decision tree includes both classification and regression. Fifth is AdaBoost. Sixth is random forest. Seventh is gradient boosting. Eighth is XGBoost. Ninth is naive Bayes. Then, when we go to the unsupervised machine learning algorithms, the first algorithm we will do is K-means; then we also have DBSCAN; then we are also going to do hierarchical clustering; there is also something called K-nearest neighbors; fifth we'll look at PCA, then LDA. So we will try to cover many different things. Yes, SVM I have missed here; I'm going to include SVM, and KNN will also get covered, so I have that in my list. I may miss one or two, but we are going to cover everything.
So let's start our first algorithm, linear regression. The linear regression problem statement is very simple, guys. Suppose I have two features, X and Y; let's say X is nothing but age and Y is nothing but weight. Based on these two features, I have some data points present over here. In linear regression, what we try to do is create a model with the help of this training data set. What I am going to do is train a model, and this model is nothing but a kind of hypothesis, which takes a new age and gives the output weight; then, with the help of performance metrics, we try to verify whether this model is performing well or not. In short, what we are going to do in linear regression is find a best-fit line which will actually help us do the prediction. That basically means: if I get a new age over here, what should be my output with respect to Y? Whenever we draw a diagram like this, I can say that Y is a linear function of X. Now, understand how we are going to create this best-fit line; this is very important. Whenever we say linear regression, it basically means that we are going to create a linear line. You may be thinking, "Sir, why create a linear line, why not a non-linear line?" I'll discuss that as we go ahead and see other algorithms.
So, to begin with, consider the line that you see over here. This line can be written with multiple equations: some people write y = mx + c; some people write y = β0 + β1·x; some people write hθ(x) = θ0 + θ1·x. Many equations exist for this straight line, with many different kinds of notation. The first treatment of linear regression I learned was from Andrew Ng; I would definitely like to give him the entire credit, and based on his notation, whatever he has explained, I'll try to explain it over here. So the credit for this algorithm specifically goes to Andrew Ng. In order to create this straight line, I will use the equation hθ(x). This is the equation of a straight line, and I can write it many ways: y = mx + c, y = β0 + β1·x, or hθ(xᵢ) = θ0 + θ1·xᵢ, where xᵢ denotes a data point. Let's take this last equation for now, the one through which I have also studied, though I will definitely be adding some points that Andrew Ng may not have mentioned in his video; I'll try my level best, but obviously he is the best and I cannot even compare myself to him. So: hθ(x) = θ0 + θ1·x.
Now, let's understand what θ0 and θ1 are. As I said, suppose I have a problem statement over here: this is my X, this is my Y, and these are my data points. Now I am trying to create a best-fit line through them, and this best-fit line is given by the equation hθ(x) = θ0 + θ1·x. What does θ0 indicate? θ0 over here is something called the intercept. What exactly is the intercept? It means that when your x is zero, hθ(x) = θ0. So in this particular case, the intercept indicates at what point the line meets the y-axis: when x = 0, you'll see that the line intersects the y-axis, and whatever value that is, that is your intercept.

The second thing is θ1. What is θ1? It is nothing but the slope, or coefficient. What does it indicate? Say I move one unit along the x-axis; the corresponding movement along the y-axis is the slope. In other words, for one unit of movement along the x-axis, the slope tells you the amount of movement along the y-axis. So those are the two things, θ0 and θ1, and xᵢ is definitely your data points.

Now, our main aim is to create the best-fit line in such a way (I'll just show it to you; let's understand the aim of linear regression) that the distance between the data points that I have and the predicted points is very, very small. Suppose I am creating a best-fit line: with respect to a data point, the actual point was here, but my predicted point is this point on the line. If I sum up all those distances, the total should be minimal; only then will I be able to say that this is the best-fit line. I cannot just declare that a given line is exactly the best-fit line or not. How will I say it? When I calculate the difference between each actual point and the predicted point (the points on the line are my predicted points), my aim is that, if I sum up all those distances, the total should be minimal.
So what can I do for that? See, you may also be thinking, "Krish, why not just do one thing: if these are my data points, why not just play around and create multiple lines and compare?" What we could do is create multiple lines like this, and then whichever gives the minimal total, I go and select that one. But how many iterations will you do? How will you come to know that a particular line is the best line? For that specific purpose, we should start at one point and proceed towards finding the best-fit line: start at one point, and then move towards the best fit. For this particular purpose, we create something called a cost function.

I have already shown you my hypothesis function: my best-fit line equation is given as hθ(x) = θ0 + θ1·x. That is my hypothesis. Now, coming to the cost function, which is super, super important. Why is it so important? Remember the distances I mentioned: when I sum them up, the total should be minimal. If I really want to measure that distance, I need one more equation. How can I capture the gap between the predicted and the real point? I can write hθ(x) − y. What does hθ(x) − y mean? y is my real point, and hθ(x) gives my predicted point. And then I am going to square it, because I may get a negative value, and squaring takes care of that. Now, understand one more thing: I also need to do the summation from i = 1 to m, where m is the number of data points, because I need to calculate the distance for all the points, predicted versus real. After this, I also need to multiply by 1/(2m). Why? First of all, let me show why we divide by m: 1/m gives us the average of all the values that we have. The specific reason we additionally divide by 2 is for the derivation; it helps make our equation much simpler later on, when we are updating the weights. When I say weights, I mean updating θ0 and θ1; at that point of time, you'll see that when we take the derivative, this factor helps. I'm going to repeat it and write it down for you.
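The squared-error cost being described can be written directly in Python. This is a sketch following the session's notation; the data in the usage line is made up for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2),
    where h(x) = theta0 + theta1 * x is the hypothesis."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A line that passes through every point has zero cost;
# a line with the wrong slope has a positive cost
print(cost(0.0, 1.0, [1, 2, 3], [1, 2, 3]))  # 0.0
print(cost(0.0, 2.0, [1, 2, 3], [1, 2, 3]))  # non-zero: y = 2x misses the points
```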
Now, in order to find the best-fit line, I need to keep changing θ0 and θ1 until I get the best fit; unless and until I get the best-fit line, I need to keep updating θ0 and θ1. And if I need to keep updating θ0 and θ1, I require a cost function. What will this cost function do? I'll just tell you. The cost function over here I will write as J(θ0, θ1). What is the cost function here? The distance I told you about, between hθ(x) and y: if I do the summation of all of these, it needs to be minimal, it needs to be small, because with respect to each x point there is a corresponding y point. So I am going to use a cost function, and in this cost function my main aim is to write (hθ(xᵢ) − yᵢ)². Why do I say i? Because this runs from i = 1 to m, where m is the total number of points over here. Apart from this, I divide by 1/(2m). I'll tell you why. First of all, by dividing by m I get an average output, an average cost function, because I am iterating over m points. The reason I divide by 2 is that it helps us in the derivation. Why? Say I have x²; if I take the derivative of x² with respect to x, what do I get? I get 2x. That is the formula: the derivative of xⁿ is n·xⁿ⁻¹. That is why I put in the 1/2, so that when the 2 comes down, the two 2s cancel. I hope everybody is able to understand. So this is my cost function:

J(θ0, θ1) = (1/2m) · Σᵢ₌₁ᵐ (hθ(xᵢ) − yᵢ)²

What is this called? This entire equation is called the squared error function. Mathematical simplicity basically means that when we update θ0 and θ1, we take derivatives of the cost function, and that is why we set it up this way; the squaring is done so that we don't get any negative values. Squared error function. Now, towards what we need to solve: this is my cost function, and I need to minimize this value, that is, minimize (1/2m) Σᵢ₌₁ᵐ (hθ(xᵢ) − yᵢ)², by adjusting the parameters θ0 and θ1. This entire thing is nothing but J(θ0, θ1), and we really need to minimize it. That is our task. Now let's go ahead and compare two different things: one is the hypothesis, and one is the cost function. Okay, let's take an example.
Right now, my equation of the hypothesis is hθ(x) = θ0 + θ1·x. If θ0 is 0, what does this indicate? Can I say that the best-fit line passes through the origin, and the hypothesis is nothing but hθ(x) = θ1·x? Obviously, I can definitely say that. So my equation will be like this; for right now, let's consider that θ0 = 0. This is where we are: we have written the equation, the line passes through the origin, and hθ(x) = θ1·x is the equation I am actually getting.

Now let's take one example and try to solve it. I have hθ(x) = θ1·x as my new hypothesis, considering that the line passes through the origin. With respect to this, let's say I will create one line over here, and these are my data points as (x, y) pairs. Let's say I have three data points: (1, 1), (2, 2), and (3, 3). So (1, 1) is one data point, (2, 2) is another, and (3, 3) is the third; these are my data points from the data set that I have. Now, if I consider θ1 = 1, where do you think the straight line will pass? It will definitely pass through all the points, and each of these same points becomes a prediction point as well. When θ1 = 1 (θ1 is nothing but the slope), in this scenario the line passes through all the points. Now go ahead and calculate your J(θ).
points. Now go ahead and calculate your J of theta. What will J of theta 1 look like? Because theta 0 is 0, we can write it as 1/2m times the summation from i = 1 to 3 (there are three points, right) of (h theta of x i minus y i) squared. Now let's go ahead and compute. In this scenario, h theta of x 1 is 1 and y 1 is also 1, so the first term is (1 - 1) squared; plus, because we are doing a summation, the next point falls on (2, 2), so (2 - 2) squared; plus (3 - 3) squared. In total this becomes zero. So when theta 1 is 1, J of theta 1 is zero. So what is this J of theta 1? It
is the cost function so let me draw the
cost function graph over here. Let's say this axis is my theta 1, marked 0.5, 1, 1.5, 2, 2.5, and this axis is my J of theta 1, marked 0.5, 1, 1.5, 2, 2.5 as well. Right now my theta 1 is 1, and at that particular point J of theta 1 is zero, so this will be my first point. Guys, I have already discussed why the factor is 1/2m: the 2 is there to make the derivative calculation simpler, and the 1/m is there to average the summation that we are doing over here. Now let's
go ahead and let's take the second
scenario. In the second scenario, let's consider that my theta 1 is now 0.5. If my theta 1 is 0.5, then what are the points that I will get? For x = 1, 0.5 times 1 comes out as 0.5 over here; similarly, for x = 2, 0.5 times 2 is nothing but 1 over here; and for x = 3, 0.5 times 3 is 1.5, so the next point will come over here. Now when I create this line, here is my next line, which I will draw in green color. For this green line the slope has definitely decreased. So if I go ahead and calculate my J of theta, let's see what I'll get. J of theta 1 is again 1/2m times the summation from i = 1 to 3 of (h theta of x i minus y i) squared. Now let's do the summation: this is the predicted point and this is the real point, right? So the first term is (0.5 - 1) squared; how am I getting that? The real point is 1 and the predicted point is 0.5. The second term will be (1 - 2) squared, and finally (1.5 - 3) squared. So if I do this calculation: 1 divided by 2 times 3, which is 6, times (0.25 + 1 + 2.25), which comes out to approximately 0.58. So with theta 1 as 0.5 we are able to get about 0.58. Theta 1 is 0.5 over here and 0.58 will be coming somewhere here, so this is my next
point which will be again in green color
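The two cost values worked out above can be checked with a tiny sketch (assuming, as in the example, theta 0 = 0 and the toy data set (1, 1), (2, 2), (3, 3)):

```python
# Minimal sketch of the cost J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2),
# assuming theta0 = 0 and the toy data set {(1,1), (2,2), (3,3)}.
def cost(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(cost(1.0))  # slope 1 fits every point exactly, so the cost is 0.0
print(cost(0.5))  # (0.25 + 1 + 2.25) / 6, roughly 0.58
```

Running it reproduces J = 0 for slope 1 and approximately 0.58 for slope 0.5, the two points plotted on the cost graph.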
now let's go ahead and calculate the
third condition. In the third condition I'm going to set theta 1 to 0. At that point, just assume: what is 0 multiplied by x? It will obviously be zero, so I will be getting three predictions of zero, and my next line will be the x-axis itself, and these are all my points. Now if I go ahead and calculate J of theta 1 in this particular case, when my theta 1 is equal to 0: 1/2m times (0 - 1) squared plus (0 - 2) squared plus (0 - 3) squared. So this will become 1/6 times (1 + 4 + 9), which is 14/6, approximately 2.33. So with theta 1 as 0 we are getting about 2.33, and if I draw this, with respect to zero I'm plotting roughly 2.33. This is my point. Similarly, when I start constructing points with other values like theta 1 = 2, I may get some point over here.
So here, when I join these points together, you will see that I get this kind of curve. This curve is the cost function curve, J of theta 1 plotted against theta 1, and gradient descent is the algorithm that will play a very, very important role in making sure that you get the right theta 1 value, or the right slope value. Now which is the most suitable point? The most suitable point is to come over here, because this point is called the global minima. Because see, out of all these
three lines, which is the best fit line? This one is the best fit line, right. When I had this best fit line, the cost point that came over here was this one, and I want to come to this region because this is my global minima. When I am over here, the distance between the predicted and the real points is very, very small. So this specific point is called the global minima. But still I did not
discuss one thing. Krish, you have assumed theta 1 is 1, theta 1 is 0.5, theta 1 is 0; you're assuming many values, calculating each one, and tracing out this cost curve. But really, you should start at one point over here and then move towards the minimum. So how do you do that? How do I first come to a point and then move towards this global minima? For that we will be using a convergence algorithm, because once I start from one specific point, I just need to keep updating theta 1 instead of trying different theta 1 values. So for this we use something called a convergence
algorithm. So here the convergence algorithm basically says: repeat until convergence, which basically means I'm in a while loop, let's say, and here I'm going to update my theta value with this notation, which is continuous updation: theta j := theta j minus alpha (I'll talk about this alpha, don't worry) times the derivative with respect to theta j of J of theta 0, theta 1. So this should happen; that basically means after performing this particular operation repeatedly, we should be able to come to the global minima. And this specific symbol that you see is called a derivative. A derivative basically means I'm trying to find out the slope, so I can also call it the slope. This equation will definitely work,
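Written out cleanly, the update rule just described is:

```latex
% Gradient descent update rule, applied simultaneously for j = 0 and j = 1
\text{repeat until convergence:}\quad
\theta_j := \theta_j \;-\; \alpha \,\frac{\partial}{\partial \theta_j}\, J(\theta_0, \theta_1)
```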
guys trust me this will definitely work
why it will work I'll just draw it show
it to you let's say that this is my cost
function; let's say that I've got this cost curve, and let's say that my first
point is somewhere here but I have to
reach somewhere here right now when I
reach this this is my Theta 1 and this
is my J of theta 1 suppose I reach at
this specific point and I will also have
another cost curve which looks like this; let's say that initially I reach the point over here. How will we come to this global minima? By using this equation. I'll talk
about Alpha also don't worry now this is
also my Theta 1 this is also my J of
theta 1 now let's say suppose I came to
this particular point right after coming
to this particular point I will
basically apply this derivative on this
J of theta 1 okay now when I find out a
derivative that basically means we are
trying to find out the slope and in
order to find the slope we just create a
straight line like this, a tangent, which will look like this. So I'll try to create a slope like this. Now if you look at this, this is a positive slope. How do we indicate that? Understand: the right-hand side of this line is pointing in the upward direction, and that is the easiest way to find out whether it is a positive slope or a
negative slope now in this particular
case this is a positive slope now when I
get a positive slope that basically
means I will update my weights or Theta
1 as Theta 1 let's say I'm writing it
over here so I will just apply this
convergence algorithm see Theta
1 := theta 1 minus this learning rate, which is called alpha (this is my learning rate, I'll talk about it, don't worry), times this derivative
value in this particular case since I'm
having a positive slope I will be
getting a positive value let's say that
for this Theta value I got this slope
initially now I need to come to this
location so for that I have to reduce
Theta 1 so that I come to this main
point now here you can see that I am I
subtracting Theta 1 with something which
is a positive number
right this is a positive number so
definitely I know that after some n
number of iteration I will be able to
come to the global Minima similarly if I
take the right hand side and if I try to
draw the slope in this particular case
my slope will be
negative so similarly I can write the
equation as Theta
1 = to Theta 1 minus learning rate
multiplied by a negative number so minus
into minus will be positive right
suppose initially my theta 1 was here; now I'll keep
on updating the weight to come to this
Global Minima so minus into minus is
positive, so I will basically get theta 1 plus alpha times a positive number, because minus into minus is plus. So this will definitely work: we will be able to come to the global minima whether the slope is positive or negative. Now what is this learning rate? Based on the learning rate, by what speed should I come from this point to the global minima? Usually we select a learning rate like 0.01. If I select a small
number then it'll start taking small
small steps to move towards the optimal
minima. But if I take a huge alpha value, then what will happen? The updation of theta 1 will keep jumping here and there, and the situation will be that it never reaches the global minima. So it is a good decision to take a small alpha value. But it should also not be an extremely small value: if it becomes extremely small, it will take tiny steps and take forever to reach the global minima, which basically means my model will keep on training forever. So definitely this algorithm is going to work.
scenario one scenario will be that what
if my cost function has a local minima? Because here, if I come over here, this is a local minima. Suppose one of my points comes over here and I end up reaching this region; what will happen in this particular case? In this case you'll see that my equation will simply be theta 1 := theta 1 minus alpha times the slope, and at this local minima the slope is zero, so my theta 1 will stay equal to theta 1. Now you may be thinking: if this is the scenario, then we will be stuck in the local minima. But usually, with the cost function and the equation that we are using here, we do not get stuck in a local minima, because our cost curve in this particular scenario will always look like this, a convex curve. But yes, in deep learning, when we are learning about gradient descent and ANNs, at that point we have a lot of local minima, and because of that we have different gradient descent algorithms like RMSprop and the Adam optimizer, which will
solve that specific problem. I wanted to mention this point because tomorrow, if someone asks you an interview question like "do you see any local minima in linear regression?", you can just say that the cost function we use will definitely not give us a local minima, but in deep learning techniques like ANNs we have different kinds of optimizers which solve that particular problem. That is the answer you basically have to give.
now let me go ahead and write with
respect to the gradient descent
algorithm so here again I'm going to
write the gradient descent algorithm so
this will be my gradient descent
algorithm and remember guys gradient
descent is an amazing algorithm and you
you will definitely be using it so
please make sure that you know this
perfectly. Now, one question is: when will convergence stop? Convergence will stop when we come near this area where my J of theta is very, very small. Now I will repeat the gradient descent algorithm. What did I say? Repeat until convergence; we have written this algorithm here. Now let's apply it for theta 0 and theta 1, so I will write: theta j := theta j minus the learning rate times the derivative with respect to theta j of J of theta 0, theta 1, for j = 0 and j = 1. So this is my repeat-until-convergence step. Now we really need to work out what this derivative term is.
Now, if I really want to find out the derivative with respect to theta j of J of theta 0, theta 1, how do I write this? I can write it in an easy way. This will be the derivative with respect to theta j, and remember j will be 0 or 1, because we need to find it for both theta 0 and theta 1. What is J of theta 0, theta 1? Obviously my cost function, so I will write: 1/2m times the summation from i = 1 to m of (h theta of x i minus y i) squared. So if j is equal to 0, what happens? Here I specifically want the derivative with respect to theta 0 of J of theta 0, theta 1.
Now it's simple. Here what I will do is simply apply the derivative. See, guys, what this derivative does: consider something like 1/2m times x squared; if I differentiate, the 2 comes down, the 2 and 2 cancel, and I'm left with x/m. Similarly here, the square comes down via the chain rule and cancels the 1/2, so I'll have 1/m times the summation from i = 1 to m of (h theta of x i minus y i); note that the square is gone after differentiating. So this is my derivative with respect to theta 0. This
is what I got. Now the second case: when j is equal to 1, the derivative with respect to theta 1 of J of theta 0, theta 1. In this case, let me replace h theta of x with what it actually is: theta 0 plus theta 1 times x. When we differentiated with respect to theta 0, the chain rule just gave a factor of 1, because the derivative of theta 0 plus theta 1 times x with respect to theta 0 is 1. But with respect to theta 1, the derivative of theta 0 plus theta 1 times x is x, so the chain rule gives an extra factor of x i. So I will get 1/m times the summation from i = 1 to m of (h theta of x i minus y i) multiplied by x i. Again the square is gone: the square brought down a 2 that cancelled the 1/2. So this will now be my convergence algorithm. Let me write it down again: repeat until convergence, and finally your two updates
will be happening. One is theta 0: it will be theta 0 minus alpha (that is my learning rate) times 1/m times the summation from i = 1 to m of (h theta of x i minus y i). And similarly, if I want to update theta 1, it will be theta 1 minus alpha times 1/m times the summation from i = 1 to m of (h theta of x i minus y i) multiplied by x i. Alpha is your learning rate, guys; we have to initialize it with some small value like 0.01. And see where the extra x comes from: h theta of x is theta 0 plus theta 1 times x, and the derivative of theta 1 times x with respect to theta 1 is nothing but x, so that x comes over here.
let's discuss two important things. But first, similarly, note that you will have convex cost functions with more features too: if you have multiple features like x1, x2, x3, x4, then you will have a higher-dimensional bowl-shaped surface, and gradient descent on it is just like coming down a mountain. Now let's discuss two
performance metrics which is important
in this particular case one is R
square and adjusted R square
we usually use these performance metrics to verify how good our model is with respect to linear regression. So R square is a performance metric to check how good the specific model is, and it is given by the formula: R square = 1 minus (sum of residuals divided by total sum of squares). What is this sum of residuals? I can write it as the summation of (y i minus y i hat) squared, where y i hat is nothing but h theta of x i, the prediction. And the denominator is the summation of (y i minus y mean) squared, where y mean is the mean of y. That is the formula. I'll try to explain what this formula actually says. So
first things first, let's consider that this is my problem statement
that I'm trying to solve suppose these
are my data points and if I try to
create the best fit
line, this y i hat basically means this specific predicted point, and we are trying to find the difference between these things. Let's say these are my points; the points in green color are my predicted points, which I have denoted as y i hat. Always understand: the sum of residuals is nothing but the squared difference between this point and this point, this point and this point, and so on, and we are doing the summation of all of those. Now the next
point which is very much important here
is this y i minus y bar term. Y bar is nothing but the mean of y. If I calculate the mean of y, then I get a horizontal line that looks like this, and then I calculate the distance between each point and this mean line. The denominator will definitely be high, right? Obviously this value will be higher than the numerator, because the distances from the mean line will be larger than the residuals of a good fit. So we have 1 minus (a low value divided by a high value); low by high is a small number, and 1 minus a small number is a big number. So this basically shows that our model has fitted properly; we have got a very good R square. Now,
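The R square formula can be sketched in a few lines (a minimal version of what library functions such as scikit-learn's r2_score essentially compute):

```python
# R^2 = 1 - SS_res / SS_tot
#     = 1 - sum((y_i - y_hat_i)^2) / sum((y_i - y_mean)^2)
def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3], [1, 2, 3]))      # perfect fit -> 1.0
print(r_squared([1, 2, 3], [0.5, 1, 1.5]))  # worse than the mean line -> negative
```

The second call shows the point discussed next: when the predictions are worse than simply predicting the mean, the ratio exceeds 1 and R square goes negative.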
tell me can I get this entire R square a
negative number let's say that in this
particular case I got 90% can I get this
R square as a negative number? There will be situations, guys: what if I create a fit line which looks like this? Then the numerator will be quite high; R square goes negative only when the sum of residuals is higher than the total sum of squares, that is, when the model is worse than just predicting the mean. But in the usual scenario it will not happen, because obviously we'll try to fit a line which is at least reasonable; we don't want to create a best fit line that is worse than the mean line. Now here you'll be able to see one amazing feature about R square, which
is this. Let's say one scenario: suppose I have a data set where my feature is the number of bedrooms and my target is the price of the house. Now if I solve this problem I'll definitely get an R square value; let's say my R square is 85%. Now what if I
add one more feature the one more
feature basically says that okay if I
add
location location of the house will be
definitely correlated with price so
there is a definite chance that the R
square value will increase let's say
that R square will become 90% if I
probably have this two specific feature
and obviously it is basically increasing
the R square because this is also
correlated to price
So see: in the first case I got my R square as 85%, and as soon as I added location I got 90%. Now let's say I add one more feature: which gender is going to stay in the house, male or female. You know that gender is in no way correlated to price, but even though I add this feature, there is a scenario where my R square will still increase, and it may become 91%, even though the feature is not at all important. The R square formula works in
such a way that if I keep on adding
features and that are not nowhere
correlated this is obviously nowhere
correlated this is not correlated with
price then also what it does is that it
is basically increasing my r² so this
specific thing should not happen whether
a male will stay or female will stay
that does not matter at all still when
you do the calculation the R square will
still increase. And this creates a problem for model selection: right now I have a model with 90%, and as soon as I see an R square of 91% because it is considering this gender feature, that model will be picked, because it appears to perform better. But this should not happen, because gender is not at all correlated; the 90% model should have been picked. So in order to prevent this situation, we basically use something called
adjusted R square now what is this
adjusted R square and how it will work
I'll also show it to you very very nice
concept of adjusted R square. So adjusted R square is given by the formula: adjusted R square = 1 minus (1 minus R square) times (n minus 1) divided by (n minus p minus 1), where n is the total number of samples and p is the number of features, also called
predictors. Suppose in the first scenario my number of predictors was two, and in the second scenario my number of predictors was three. With two predictors I got the R square as 90%; in this scenario, after all the calculation, my adjusted R square will be a little bit less, let's say 86%. Now when I use three predictors, where one feature like gender is nowhere related, the R square increases to 91%, but the adjusted R square will not increase; it will in turn decrease, say to 82%. How does that happen? I'll show you; I've just considered some illustrative values, 86 and 82.
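The formula can be sketched directly (the sample size n = 50 and the R square values below are illustrative assumptions, not numbers from the lecture):

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Adding a useless predictor nudges R^2 up only slightly (0.90 -> 0.901, say),
# but the larger p in the denominator pulls adjusted R^2 down.
print(adjusted_r2(0.90, n=50, p=2))   # about 0.896
print(adjusted_r2(0.901, n=50, p=3))  # about 0.895, lower despite the higher R^2
```

This is exactly the behavior described: a tiny R square gain from an uncorrelated feature is not enough to offset the penalty from increasing p.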
here you can see that for R square there is an increase, while for adjusted R square there is a decrease. Now how is this happening? See this p value: if I put p = 3, then n minus p minus 1 becomes a smaller number. Let me talk through the equation. I hope everybody understood what R square is; n is the number of data points and p is the number of predictors. As p keeps increasing, n minus p minus 1 keeps decreasing. If the denominator keeps decreasing, then (n minus 1) divided by (n minus p minus 1) becomes a bigger number, and (1 minus R square) times a bigger number grows, so 1 minus that product decreases, unless the added feature raises R square enough to compensate. If my p value is two, the denominator n minus p minus 1 will be greater than when p is equal to
3 so with the help of P obviously R
square is there to support you. Whether the feature is correlated or not, always remember: when the added features are highly correlated with the target, your R square value will increase tremendously; if they are less correlated, there will only be a small increase. Now, when p goes from 2 to 3, n minus p minus 1 gets smaller, so (n minus 1) divided by (n minus p minus 1) gets bigger, and since the useless feature barely moved R square, the whole expression comes out lower. That basically means even though my adjusted R square was 86 with two predictors, after adding the uncorrelated feature I'm getting 82% because of this equation. I hope you are understanding this; it is a very,
very important property. A simple way to define it: as my p value (the number of predictors) keeps increasing, my R square gets adjusted downward, and the adjusted R square will always be less than the plain R square. There was one interview question asked to one of my students: between R square and adjusted R square, which will always be bigger? The student said R square, and then the interviewer asked him to explain why that happens. Now, the agenda: point one is Ridge and lasso
regression second is assumptions of
linear regression the third point that
we are probably going to discuss about
is logistic regression then the fourth
thing that we are going to discuss about
is something called as confusion
matrix, and the fifth thing that we are going to cover is practicals for linear, Ridge, lasso and logistic regression. So the first topic that we are going to discuss is Ridge and lasso
regression so let's understand about
Ridge and lasso regression if you
remember in our previous session what
all things we discussed linear
regression, and then we discussed the cost function, R square and adjusted R square, and gradient descent. The cost function was 1/2m times the summation from i = 1 to m of (h theta of x i minus y i) squared; this is the cost function we discussed yesterday, and it gives us the convex curve for J of theta 0, theta 1. Now let me give
you a scenario. Let's say that I just have two training points, which look like this. If I have these two specific points, I will try to create a best fit line, and the best fit line will pass exactly through both points. If I calculate the cost function, what will be the value of J of theta 0, theta 1? Let's say that in this particular case, since the line is passing through the origin, my theta 0 is zero. Since there is no difference between predictions and actuals, the cost will obviously become zero. Now understand: this data that you see is called training data; the two points I have plotted are specifically called training
data. Now what is the problem in this data? See, the line that is getting created through the hypothesis passes through every training point, and that is why the cost is zero; our main aim is to minimize the cost function, so that looks absolutely fine. The data this model is trained on is the training data. Now
just imagine that tomorrow new data points come in. If my new data point is here, and I want to predict for that point, let's say my predicted point lands over here: is the difference between the predicted and the real point quite huge? Yes or no? So this is basically
creating a condition which is called as
overfitting. That basically means that even though my model trained well on the training data, since every training point passes exactly through the best fit line, it causes something called overfitting. You really need to understand what overfitting is: overfitting basically means my model performs well with training data but fails to perform well with test data. What is the test data over here? The test data is these new points: the real answer was this point, but because my line is like this, I'm actually getting the predicted point over here, and this distance is quite huge. So whenever my model performs well with training data and fails to perform well with test data, we call that scenario overfitting. So
this scenario when the model performs
well with training data I have a
condition which is called as low bias
and when it fails to perform well with the test data, then it is called high variance. Very important; I will make everyone understand it one by one. If the model performs well with the training data, that is low bias, and whenever it fails to perform well with the
variance now similarly I may have
another scenario which is called as
underfitting so let's say that I have
something called as
underfitting. Now in underfitting, what is the scenario? The model fails to perform; it gives bad accuracy on both training and test data. Always
remember whenever I talk about bias then
you can understand that it is something
related to the training data whenever I
talk about test data at that point of
time you talk about variance and that
specifically whenever you talk about
variance that basically means we are
talking about the test data so for an
overfitting you will basically have low
bias and high variance low bias with
respect to the training data and high
variance with respect to the test data
now if the model accuracy is bad with
training data and the model accuracy is
also bad with test data in this scenario
we basically say it as underfitting so
these are the two conditions that are
with respect to underfitting that
basically means that both for the
training data also the model is giving
bad accuracy and again for the test data
also it is basically having a bad
accuracy so in this particular scenario
we can definitely say two things out of
underfitting one is high bias and high
variance so this is the condition with
respect to underfitting very super
important let me just explain you once
again suppose let's consider I have model one, model two and model three okay guys so suppose
let's say that I have my model my
training accuracy is let's say
90% And my let's say that my test
accuracy is 80% now in this particular
case let's say that my training accuracy
is
92% and my test accuracy is 91% and
let's say my model three is basically
having training accuracy as
70% and my test accuracy is 65% so if I
take this particular case it is
basically overfitting if I take this
particular thing this basically becomes
my generalized model and when I talk
about this this is my I'll just say that
okay I'll also put nice color so that uh
you'll be able to understand this this
becomes our generalized model and this
finally becomes our underfitting right, so here with my red color I will just mark it as underfitting
what are the main properties of this
overfitting as I said in this scenario
since it is performing well with the
training data so it will be low bias
High variance in this particular case it
will be low bias low variance and this
particular case it will be high bias and
high variance understand in this
terminology in this particular way
you'll be able to understand so why do
we require always a generalized model
because whenever our new data will
definitely come generalized model will
be able to give us very good output
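The three-model comparison above can be sketched as a small check; the thresholds here (85% as a "good" training accuracy, a 5-point train/test gap) are my own illustrative assumptions, not fixed rules:

```python
# Sketch: labelling the three models from the discussion by their
# train/test accuracy gap. The thresholds are illustrative assumptions.

def diagnose(train_acc, test_acc, good=0.85, max_gap=0.05):
    """Return a rough bias/variance diagnosis for one model."""
    if train_acc < good:                  # bad on training data -> high bias
        return "underfitting (high bias, high variance)"
    if train_acc - test_acc > max_gap:    # good on train, bad on test -> high variance
        return "overfitting (low bias, high variance)"
    return "generalized (low bias, low variance)"

# Model 1: 90% train / 80% test, Model 2: 92% / 91%, Model 3: 70% / 65%
print(diagnose(0.90, 0.80))  # overfitting (low bias, high variance)
print(diagnose(0.92, 0.91))  # generalized (low bias, low variance)
print(diagnose(0.70, 0.65))  # underfitting (high bias, high variance)
```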
let's go back to this particular example
here you'll be able to see this straight
line the red line that I have actually
created is basically overfitting so that
whenever I probably get the new points
which is having this real value and the
predicted points here you'll be able to
see the difference is quite huge so
because of this it will definitely be a
scenario of overfitting where it has low bias and high variance
so again let me go ahead and take this
example so this was my line which I have
actually drawn I had two points and when
I draw this line which was a best fit
line to which is passing through both
the points this scenario is basically
causing an overfitting problem and I've
also shown you my J of theta 1 will be
zero in this scenario since it is
passing exactly and the predicted point
is also over there now understand one
thing is that what can can we take out
from this what assumptions we can take
out from this definitely if I talk about
our cost function, our cost function here is nothing but 1 by 2m summation of i = 1 to m of h Theta of x of i minus y of i whole square
now let's consider that I am going to
use this H Theta X and I'm going to
basically write it as y hat okay let's
focus on this specific point so when I
take this I'm just going to focus on
this particular point so here I will
definitely write it as y hat of i minus y of i whole square, so this is
nothing but the difference between the
predicted value and the real value okay
this is what I'm actually trying to get
now in this scenario if I am adding this
values obviously I'm going to get the
value as zero now I have to make sure
that this value does not come to zero
because this is still over fitting so
that is where your Ridge regression will
come into picture Ridge and lasso will
come into picture now when I use Ridge
and lasso suppose if I use Ridge now in
Ridge what we say this this is also
called as L2
regularization now L2 regularization
what it does is that it basically adds a unique parameter, one more simple value, which is like Lambda multiplied by slope
Square now what is this slope whatever
slope of this particular line it is we
are just going to square it off now
suppose if I take my equation which
looks like this H Theta of X is equal to
Theta 0 + Theta 1 x now in this
particular case my Theta 0 was zero so
my H Theta of X is nothing but Theta 1
what is Theta 1 this is specifically
called as slope and I am basically
taking this Theta 1 I'm actually making
it as a square so always
understand I don't want to make this as
zero because if it becomes zero it may
lead to overfitting condition now what
will happen if I add this particular
equation if I add this particular
equation this will obviously come as
zero let's consider my Lambda value over
here my Lambda value is one I'll talk
about how do you set up Lambda value
okay let's consider that I'm
initializing it to one let's say my
Lambda value is 1 now what I will do is
that this l Lambda value is 1 Let's
consider our slope value initially is
two and because of this two I got this
best fit line I'm just going to consider
it so if I do the total sum over here, 0 plus 1 multiplied by 2 square, this value is 4 now the cost function will not stop over here because still it has to minimize, it has to reduce this value of 4, so what it will do, it will again change the Theta 1 value and let's say that my Theta 1 value has changed now
it got another best fit line which looks
something like this, this is my next
best fit line I'll talk about Lambda
Lambda is a hyper parameter guys what
exactly is Lambda I'll just talk about
it now when I basically change this line
now see why I'm getting this line let's
consider I have changed my Theta 1 value
since we need to minimize now when we
need to minimize what it will do we'll
again calculate the slope of this
particular line and then we will try to
create a new line. The earlier total was 0 plus 1 multiplied by 2 square, which is nothing but 4, so now my cost function will not stop
over here so we are going to still
reduce this now in order to reduce this
again Theta 1 value will get changed and
then we will get a next best fit line
for this point now what will happen in
this scenario once we have this best fit
line we will definitely get a kind of
small difference so now if I go ahead
and consider the new equation, my y hat of i minus y of i whole square plus Lambda multiplied by slope square, this value
will be a small value now because I have
some difference and then plus again 1
multiplied by now understand whether the
slope will increase in this particular
case or whether it will decrease in this
particular case there will be some slope
value let's say that I have got some
slope of this particular line in this
particular scenario again your slope
will definitely decrease so let's say, initially the slope was 2 and now my slope is a smaller value, let's say 1.5. So the penalty, 1 multiplied by 1.5 square, is 2.25, and 2.25 plus the small error value will still be less than 4. But understand what is happening: the total value is getting reduced from 4 to around 2.25, so this is the
importance of Ridge now what will happen
is that you will try to get a
generalized model which has low bias and
low variance instead of this overfitting
condition you know why specifically we
are adding Ridge L2 regularization it is
basically to prevent
overfitting because here you are not
stopping here you are trying to reduce
it unless and until you get a line which will be able to act as a generalized model now here you can see
now if I have my new points like how I
drew over here now the distance will be
less so now you'll be able to see that
it will be able to create a generalized
model guys this will be a small value
only see initially when we have this
line obviously we have zero if we try to
slightly move here and there so here
you'll be able to see that it will just
a slight movement but what this movement
is basically specifying it is specifying
that the slope should not be steep if we
probably have a steep slope it obviously
leads to most of the time overfitting
condition; it should not be steep, it should be less steep, but it should actually help you
to create a generalized model so you
will be seeing that after playing for
some amount of time this value will not
reduce after some point of time it'll
get almost it'll be a minimal value
it'll be a smaller value and for this
also you have to specify iterations how
many times you probably have to train
them now this iterations is also a
hyperparameter based on number of
iterations you will probably see your R
square or adjusted R square over here so
this iterations based on the number of
iterations it will never become zero
guys understand because zero it is not
possible if it becomes zero trust me it
is an overfitting model you cannot get
that is something zero now what is
Lambda coming to this Lambda this Lambda
is a
hyperparameter this is basically to
check how fast you want to lessen the
steepness or how fast you want to make a
steepness grow higher right and this
Lambda will also be selected by using
hyper parameter and this also I'll show
you today in Practical what do you mean
by iterations iteration basically means
how many time I want to change the Theta
1 value how many times you want to
change the Theta value that is the
convergence algorithm right
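The Lambda-times-slope-square walkthrough above can be sketched numerically; the two training points (lying exactly on y = 2x through the origin) and Lambda = 1 are assumptions matching the spoken example:

```python
# Sketch of the Ridge cost from the discussion: two training points
# perfectly fit by slope 2 through the origin, lambda = 1.
# The specific data points are an illustrative assumption.

def ridge_cost(theta1, xs, ys, lam=1.0):
    m = len(xs)
    mse = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return mse + lam * theta1 ** 2   # error term plus lambda * slope^2

xs, ys = [1.0, 2.0], [2.0, 4.0]      # perfectly fit by slope 2

print(ridge_cost(2.0, xs, ys))       # 0 error + 1 * 2^2 = 4.0 (the "4" above)
print(ridge_cost(1.5, xs, ys))       # small error + 1 * 1.5^2 = 2.5625, less than 4
```

So the penalty makes the perfectly-fitting steep slope more expensive than a slightly less steep one, which is exactly why the overfit line gets pulled back.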
convergence algorithm over here L2
regularization or Ridge is basically
used in such a way that you should never
overfit why we assume Theta 0 is equal
to 0 because I'm considering that it
passes through a origin right origin
over here Lambda is a hyper
parameter steep basically means how
steep the line is if I have this line
this line is quite steep if I have this
line This is less steep now if I go to
the next regularization which is called
as lasso regression this is
also called as L1
regularization now here the formula will
be changing a little bit, here you will be having y hat of i minus y of i whole square
here you'll be adding a parameter Lambda
but understand here you'll not be adding
slope square, no, here you'll be adding mod of slope, and what this mod of slope will do is that it will actually help you
to do feature selection now you may be thinking how feature selection happens; let's consider an equation over here
let's say that I have many many features okay so
my H Theta of X which I'm indicating
here as y hat let's say that I'm I'm
writing this equation apart from
preventing for overfitting it will also
help you to do feature selection here
let me just show you over here with an
example this H Theta of X which I'm
probably writing as y hat will basically
be indicated by something over here
you'll be able to see that it is nothing
but let's say that I have multiple
features like this now in this
particular features obviously there are
so many coefficients over here so many
slopes over here now mod of slope will be what, it will be nothing but mod of Theta 0 + Theta 1 + Theta 2 + Theta 3 + Theta 4 + Theta 5 like this up
to Theta n so here you'll be able to see
that this is how I will basically uh
I'll basically be calculating the slope
now as we go ahead guys whichever
features are probably not playing an
amazing role the Theta value the
coefficient value the slope value will
be very very small it is just like that
entire feature is neglected that entire
feature is neglected now in this
particular case we were doing squaring
because of the squaring that value was
also increasing but here because of the
mod that value will not increase
instead it will be a condition wherein
we are basically neglecting those
features that are not at all important
in this specific problem statement so
with the help of L1 regularization that
is lasso you are able to do two
important things one is preventing
overfitting and the second case is that
if you have many features and many of
the features are not that important okay
in basically finding out your slope or
your line or the best fit line in that
particular case it will also help you to
perform feature selection so this is the
importance of the entire thing; this is the importance of the Ridge and the lasso regression that we are doing here I'm
just going to write L1
regularization and obviously we have
discussed about L2 regularization also
now you have probably understood Lambda
is one hyperparameter okay which we will
specifically using okay and based on
this Lambda this will be found out
through cross
validation cross validation is a
technique wherein we will try to
probably train our model and try to find
out the specific things okay what should
be the exact value and there also we
play with multiple values in short what
we are doing we just trying to reduce
the cost function in such a way that uh
it will definitely never become zero but
it will basically reduce based on the
Lambda and the slope value in most of
the scenario if you ask me we should
definitely try both the regularization
and see that wherever the performance metric is good we should use that. What is cross validation? It basically means I will try to use different different Lambda values and basically use them
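As a sketch of picking Lambda through cross-validation, assuming scikit-learn is available (sklearn names the Lambda penalty `alpha`), with made-up data where the second feature is pure noise:

```python
# Sketch: lambda chosen by cross-validation, plus lasso's feature selection.
# Data is synthetic: feature 0 drives y, feature 1 is pure noise.
import numpy as np
from sklearn.linear_model import RidgeCV, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(100)

# Ridge: try several lambda (alpha) values, cross-validation picks one
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("chosen lambda:", ridge.alpha_)

# Lasso: the mod-of-slope penalty can push useless coefficients to (near) zero
lasso = Lasso(alpha=0.5).fit(X, y)
print("lasso coefficients:", lasso.coef_)  # second coefficient shrinks toward 0
```

The shrunken second coefficient is the "feature selection" effect discussed for lasso: the useless feature effectively drops out of the model.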
so in a short let me write it down again
for Ridge regression which is an L2 Norm
here I'm simply writing my cost function
in this particular case will be little
bit different here I can definitely
write my cost function as h Theta of x of i minus y of i whole square plus Lambda multiplied by slope square what is the purpose of this the
purpose is very simple here we are
preventing overfitting this was with
respect to the Ridge regression that is L2 norm now if I go ahead and discuss
about the next one which is called as
lasso regression which is also called as
L1 regularization in the case of lasso
regression your cost function will be h Theta of x of i minus y of i whole square plus Lambda multiplied by mod of slope so
here you have this specific thing and
what is the purpose the purpose are two
one is prevent overfitting and the
second one is something called as
feature selection so these two are the
outcomes of the entire thing see with
respect to this lasso right you have
slopes slopes here you'll be having
Theta 0 plus Theta 1 plus Theta 2 plus
theta 3 like this up to Theta n now when
you'll have this many number of thetas
when you have many number of features
and when you have many number of
features that basically means you'll
have multiple slopes right those
features that are not performing well or
that has no contribution in finding out
your output that coefficient value will
be almost nil right it will be very much
near to zero in short you are neglecting
that value by using modulus you're not
squaring them up you're not increasing
those values now I will continue and uh
probably I will also discuss about the
assumptions of linear regressions so
what are the assumptions of linear
regression in this particular scenario
so assumption is that number one point
linear regression if our features are in
normal or Gaussian distribution, if our features follow
this particular distribution it is
obviously good our model will get
trained well so there is one concept
which is called as feature
transformation now in feature transformation always understand what will happen: if a feature does not follow a Gaussian distribution, then we apply some kind of mathematical equation onto the data and try to convert it into a normal or Gaussian distribution the second
assumption that I would definitely like
to make is that we apply standard scaler or standardization; standardization is nothing but a kind of scaling of your data by using Z score I hope everybody
remembers Z score this is what we
basically apply there your mean is equal
to zero and standard deviation equal to
1 see guys wherever you have gradient
descent involved it is good to basically
do
standardization because if our initial
point is a small Point somewhere here
then reaching the global minima, our training, will happen quickly; otherwise if your values are quite huge then your graph may be very big and the starting point can come anywhere over
there and the third point is that this
linear regression works with respect to
linearity it works if your data is
linearly separable
I'll not say linearly separable but this
linearity will come into picture if your
data is too much linear it will
obviously be able to give a very good
answer like logistic regression also
which we are going to discuss today this
also has the same property now you may
be asking is it compulsory to do
standardization guys if you want to
speed up the training of your model or if you want to optimize your model I
would suggest go ahead and do
standardization now coming to the fourth
point here you really need to check about multicollinearity, this is also one kind of check we basically do. What is multicollinearity? Let's say I have X1, I have
X2 and this is my output feature I have
let's say X3 also now let's say that if
I try to see the collinearity of these two features, how correlated these two features are, let's say that these two features are 95% correlated with each other
but it is highly correlated with Y is it
necessary that we should use both the
feature in this particular scenario the
answer should be no we can drop this
particular feature okay, any one of the two features we can definitely drop, and
based on that I can just use one single
feature and basically we do the
prediction there is also a concept which
is called as variance inflation factor, I will try to make a dedicated video about this; multicollinearity is also solved with the help of the variance inflation factor
one more term is there, homoscedasticity, so that kind of terminology we also use as one more condition in this, but if you have almost satisfied these assumptions you will
definitely be able to outperform in
linear regression so you have got an
idea of the assumptions you have also
got an idea of multiple things okay now
let's go towards something called as
logistic regression now logistic
regression what logistic regression is
the first type of algorithm that we are
going to learn in classification let's
say that in classification I have one
example you know so suppose I have say
number of hours study hours and number
of play hours based on this I want to
predict whether a child is passing or
failing suppose these two are my
features I want to predict whether it is
pass or fail so here you'll be able to
see that I have some fixed number of
categories specifically in this
particular scenario I have two
categories binary logistic regression
works very well with binary
classification now the uh question comes
that can we solve multiclass
classification using logistic the answer
is simply yes you can definitely do it
so let's go ahead and let's try to
discuss about uh logistic regression now
what is the main purpose of the logistic
regression first of all let's
understand one scenario okay suppose I
have a feature which basically says um
number of study hours and this is like 1
2 3 4 5 6 7 and let's say that I have
pass this point is basically pass and
this point is basically
fail so I have this two conditions these
are my outcomes now what I'll do I will
just try to make some data points let's
say that if I study Less Than 3 hours I
will probably be fail if I study more
than 3 hours then probably I will pass
this I'll make it as fail and this I
will make it as pass so I will be having
points over here this 1 2 3 let's say
that this is my training data set now
the first question says that okay Chris
fine you have some data over here
whenever it is less than three, the person is failing; if it is greater than three, it is basically showing data points
with respect to pass now can't we solve
this problem first with linear
regression now with the help of linear
regression here the first point will be
that yes I can definitely draw a best
fit line my best fit line in this
particular scenario may be something
like this it may it may look something
like this so here fail is nothing but
zero pass is one the middle point is
basically 0.5 so obviously with the help
of linear
regression I'm able to create this best
fit line and I'll put a scenario that
whenever the output value is less than 0.5
let's say that new data point is this
and based on this I'll try to do the
prediction I'm actually able to get the
output over here now when I'm getting
the output over here this basically is
0.25 now in this particular scenario
obviously I'm able to say that yes the
person I'll write a condition over here
saying that if my H Theta of x value is
less than 0.5 then my output should be
zero let's say less than 0.5 I'll say
not less than or equal to less than5
then my output will be zero right so in
this particular case Zero basically
means fail similarly I'll have a
scenario where I'll say that if my H Theta of X is greater than or equal to 0.5 then this will basically be one
which is nothing but pass so this two
condition I can definitely write over
here this is my center point so that any
point that will probably come over here
let's say that this point is coming over
here right let's say new data point is
somewhere coming over here with this red
point
now what I'll do I'll basically draw a
straight line it will come over here I
will just extend this line
long I will extend this line over here
and I will extend this line over here
and here you can see that based on this
I'm actually getting this particular
prediction which is greater than 0.5 so
I will say that okay the person has
passed obviously this is fine, this is obviously working better, so what is
the problem why we are not using linear
regression okay in order to solve this
particular problem why you are
specifically having logistic regression
the answer is very much simple guys the
answer is that whenever let's say that
if I have an outlier which looks
something like this suppose I have an
outlier which comes like this over here
what is this value let's say that this
value is nothing but 7 8 9 10 let's say
that the number of study hours and I'm
studying for nine it is obviously pass
now in this particular scenario when I
have an outlier this entire line will
change now I will probably get my line
which looks something like this okay my
line will basically move like this, and now when it gets moved completely like this, now for even five
or even at any point that I am actually
predicting let's say that at this
particular point if I try to find out
it'll be showing less than 0.5 so
here this particular value or answer
will be wrong right because if we are
studying more than 5 hours obviously based on the previous line the person
had to pass but in this scenario it is
failing it is coming less than 0.5 but
the real value for this is basically
passed so I hope you are understanding
because of the outlier the entire line
is getting changed so how do we fix this
particular problem now in this two
scenarios are there first of all
obviously because of just an outlier
your entire line is getting shifted here
and there the second point is that over
here sometimes you're also getting values greater than one and you're also getting values less than zero suppose if I try to
calculate for this particular point if I
project it in behind I'll be getting
some negative value so we have to squash
this function if I squash this function
then it'll become a plain line right how
do we squash it and for this we use
something called as sigmoid activation
function or sigmoid function if somebody
ask you why don't you use linear regression in order to solve this
classification problem then your answer
should be very much simple you should
say this to specific points so we will
try to go ahead and solve some linear
regression now with the help of cost
function everything as such and we'll
try to understand how the cost function
will look for logistic regression second
reason I told you right, the line is going greater than one and below zero over here; I have only zero and one, and the line is going beyond them, but I have already told that our maximum and minimum values are 1 and zero so I hope
you have understood why linear regression
cannot be used okay I showed you all the
scenarios why linear regression should
not be used now we'll continue and
probably discuss about the other things
over here and uh we will now try to
understand fine what exactly logistic
regression is all about and how the
decision boundary is basically created
now we'll go ahead and discuss about
that specific thing so let's go ahead
our values should be always between 0 to
one over here in this particular case
because it is a binary classification
problem only this should be the answer
so let's go ahead and let's define our
decision boundary so my decision
boundary decision boundary in the case
of logistic regression first of all as
usual in logistic regression we defined
our hypothesis okay guys first of all
let's see if I'm writing my H Theta of X as Theta 0 + Theta 1 into X1 + Theta 2 into X2 like this up to Theta n into Xn
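The expanded hypothesis above can be sketched as a single dot product; the numbers here are made up purely for illustration:

```python
# Sketch: theta_0 + theta_1*x_1 + theta_2*x_2 as a dot product,
# with x_0 = 1 prepended so the intercept folds into the vector.
import numpy as np

theta = np.array([0.5, 2.0, -1.0])   # theta_0, theta_1, theta_2 (made-up values)
x = np.array([1.0, 3.0, 4.0])        # x_0 = 1 for the intercept, then x_1, x_2

z = theta @ x                         # theta transpose x
print(z)                              # 0.5 + 2*3 - 1*4 = 2.5
```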
now in this scenario can I write this
entire equation as Theta transpose X
obviously I can definitely write this
way right and this is what is the
notation that you will probably seeing
in many places so with respect to the
decision boundary of logistic regression
we can write it like this, okay, but since we have to consider the squashing of the line, how will that squashing basically happen? See, if I
have this line
we saw in the above right if I have this
line suppose I have some data points
over here and I have some data points
over here if I want to create the best
fit line how will I create I will
basically create like this but I have to
also do two things one is squash over
here and squash over here right squash
over here and squash over here now in
order to squash I'm saying squash squash
means
okay now in order to do this I use a
function which is called as sigmoid
activation function
that basically means what happens
obviously you know this line is
basically denoted by H Theta of x equal
to how do you denote this straight line
let me write it down nicely for you so
how do you denote this straight line the
straight line is obviously denoted by
Theta 0 + Theta 1 * X1 let's say now on
top of this value I have to apply something so that I can squash this line instead of just expanding it in this way so my hypothesis
will basically be now G of G is
basically a function on Theta 0 and
Theta 1 * X1 so here I'm trying to
basically what I'm trying to do I will
apply a mathematical formula on top of
this linear regression to squash this
line now let's go ahead and let's try to
find out what is this G okay what is
this G I will say let Z equal to Theta 0
+ Theta 1 * X I'm just initializing this
now my H Theta of X is nothing but G of
Z now we need to understand what is this
z g of Z and how do we basically specify
what is the G function so my G function
is nothing but H Theta of x equal to 1
by 1 + e ^ of minus Z which in short if
I try to substitute Z, it is 1 by 1 + e ^ of minus (Theta 0 + Theta 1 * X) so
this is what is my H Theta of X which is
my hypothesis and this obviously works
well because it is being able to squash
the function so this is basically my
hypothesis which I am definitely trying
to use it and this function that you are
actually able to see is called as
sigmoid or logistic function now you
need to understand what does this
sigmoid function look like in graph in
graph it looks something like this: this is my Z value and this is my G of Z, this is my 0.5 line; your sigmoid function will have this S-shaped curve from zero up to one
now from this we can make a lot of
assumptions what are the assumptions
that we can basically make: your G of Z is greater than or equal to 0.5 when your Z value is greater than or equal to zero this is the major
assumptions that we can basically make
that is, your G of Z is greater than or equal to 0.5 whenever your Z is greater than or equal to zero, and if your Z value is less than zero, what will it become? It will basically be less than
0.5 so you can write that specific
condition also you want so this is the
most important condition
over here why it is called as logistic
regression see guys with the help of regression you are creating this straight line, and with the help of the concept of the sigmoid or logistic function you are able to squash it, so they have probably combined those names and basically written it this way. Will squashing of the best fit line help to
overcome the outlier issues yes
obviously it'll be able to help you so
let's go ahead and let's try to solve
the problem statement now usually let's
consider my training set let's consider
my training set suppose I have some
training points like this x of 1 comma y
of 1
let's say x of 2 comma y of 2, x of 3 comma y of 3, like this I have a lot of training points and finally x of n comma y of n
let's say that this is my training data
so here uh my y y will belong to what
zero or 1 because I will only have two
outputs since we are solving a binary
classification problem here is my
training set with two outputs and I hope
everybody knows about J Theta of Z
it is nothing but 1 + e ^ of minus Z
here your Z is nothing but Theta 0 +
Theta 1 * X1 so this is your Theta 0 now
what we have to do we have to select
this Theta now in this particular case
let's consider that my Theta 0 is 0
because it is passing through the origin
just for time pass sake suppose my Z is
Theta 1 into X so now I need to change
what is my parameter my parameter is
Theta 1
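A minimal sketch of this hypothesis with Theta 0 = 0, where theta1 = 1 is just an assumed value: the sigmoid squashes theta1 * x into (0, 1) and the 0.5 rule from above gives the class.

```python
# Sketch: h(x) = g(theta_1 * x) with g the sigmoid, theta_0 = 0.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta1=1.0):
    h = sigmoid(theta1 * x)        # squashed output, always between 0 and 1
    return 1 if h >= 0.5 else 0    # the 0.5 decision rule

print(sigmoid(0))     # 0.5 exactly when z = 0
print(predict(4))     # z > 0 -> g(z) >= 0.5 -> class 1 (pass)
print(predict(-4))    # z < 0 -> g(z) < 0.5 -> class 0 (fail)
```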
I have to change parameter Theta 1 in
such a way that I get the best fit line
and along that I apply this sigmoid
activation function now let's go ahead
and let's first of all Define our cost
function because for this we definitely
require our cost
function now everything will be same
obviously you know the cost function of
linear regression because the first best
fit line that you are probably creating
is with the help of linear
regression now in this particular case
in the case of linear regression so here
you can basically write J of Theta 1 is nothing but 1 by 2m summation of i = 1 to m, and here you have h Theta of x of i minus y of i whole square, so this is your entire thing, if you remember
linear regression whatever things we
have discussed yesterday okay so this is
the cost function let's consider that
for linear regression, this is for the linear regression now for the
logistic regression what will happen for
your logistic regression I will take the
same cost function H Theta of X now you
know what h Theta of X is, it is nothing but 1 by 1 + e ^ of minus (Theta 0 + Theta 1 multiplied by X) right this
is my with respect to logistic
regression this is my entire equation
now similarly I will try to only put
this H Theta of X let's consider that
this is my cost function only only my H
Theta of X is changing in this
particular case so if I go ahead and
write my cost function I can basically
say 1x2 h Theta of X of i - y of
i² and in this particular scenario what
is h Theta of X it is nothing but 1 + 1
+ e ^ minus Theta 1 x so this is what
this is getting replaced and this is my
logistic regression cost function I'm
just considering this cost function part
see if I replace this H Theta of X with
the sigmoid it becomes a logistic
regression cost function the intercept
I'm considering as zero guys but there is
one problem we cannot use this cost function there is a
reason for this because this equation
that you're seeing 1 / (1 + e^-(Theta 1 * X))
inside the squared error this is a non-convex
function now you may be considering what
is a non-convex function so let me write
it down so here this this term this
terminology right it is a non-convex
function now what is this non-convex
function let me show you and let me
differentiate it with convex function
okay we'll try to understand what is the
difference between non-convex function
and convex function this is related to
gradient descent very important this is
related to gradient descent if you
remember with the help of linear
regression whatever gradient descent curve we are
actually getting it is a convex function
like this this is the convex function
which looks like a parabola curve
Parabola curve because of this Parabola
curve whenever we use this linear
regression cost function specifically
because here my H Theta of X is what it
is nothing but Theta 0 + Theta 1 into X
because of this this equ
will always give you a parabola curve
this kind of cost function or convex
function you can say but here your s
Theta of X is changing so in the case of
if I use that cost function you will be
getting some curves which looks like
this now what is the problem with this
curve here you have lot of local Minima
if local Minima is there you will never
reach This Global Minima so that is the
reason we cannot use that cost function now
mathematically you can also go and
probably search in the Google what is
the
what is the graph or what is a convex or
non-convex function but always remember
whenever we update Theta 1 with this
particular equation by finding the slope
the update can get stuck because here you have a
lot of local minima and because of these
local minima you will never be able to
reach the global minima this is your
Global Minima right in case
of linear regression you'll
reach this global minima but in this
case you will never reach it you may get
stuck over here or over
here okay so this has a local minima
problem so how do we solve this
understand in local Minima these are my
points right I have to come over here
this is my deepest point in this
particular case I don't have any local
minima now at a local minima also you'll
get slope is equal to zero so that is the
reason your Theta 1 will never get
updated so in order to solve this
problem you can see this diagram we have
something called as logistic regression
cost function so I can now write my
logistic regression cost function in a
different way so researchers thought of it and basically
came up with this proposal that the
logistic cost function should look
something like this so the entire cost
function of logistic regression that is
specifically H Theta of X of I comma y
this should be written something like
this and it should be written like this
see here I'm just going to write cost
function of J of theta 1 let's say that
I'm writing J of theta 1 okay so J of
theta 1 what are the different different
output that I'll be getting I'll be get
I'll be getting yal 1 or y equal to 0 So
based on this two scenarios our cost
function will look something like this
minus log of H of theta of X and I know
I hope you all know what is h Theta of x
h Theta of X is nothing but
1 / (1 + e^-(Theta 1 * X)) so this is what is my H Theta
of X and whenever Y is zero then you
basically have minus log of (1 - H Theta
of X of i) okay so this is how you
basically write your cost function in
this particular scenario now with the
help of this cost function since log is
basically getting used in this scenario
you'll always get a global minima that
is the reason why they have completely
neglected the earlier cost function and utilized
this cost function now what does this
cost function basically mean two
scenarios if Y is equal to 1 Let's
consider this is my cost function
graph I have H Theta of X and you know
that H Theta of x value will be ranging
between 0 to 1 since it is a
classification problem so it will be
ranging between 0 to 1 and this is
basically of J of theta 1 which is my
cost function so if Y is equal to 1 this
specific equation will be used and
whenever this equation is basically
used you get a curve see minus
log H Theta of X of i you get a curve which
looks something like this okay which
you'll get a curve which looks like this
now what does this curve basically
specify the curve come up with two
assumptions the cost will be zero if Y
is = 1 and H Theta of x equal to 1 that
basically means when your H Theta of X is 1
and the output y is one that
basically means you're going to assign
over here one right so in this
particular case you will be seeing that
your cost function will be zero cost is
zero so here is my zero it is meeting
over here if H Theta of x equal to 1 and Y
is equal to 1 so this is again a
convex function only then the next point
that you can probably discuss over here
is with respect to Y is equal to 0 if
your Y is zero then what kind of curve you
will be getting you'll get a different
kind of curve which will look like this
H Theta of x here your value will be 0
to one and here you'll be having a curve
which looks like this so when you
combine these two you'll be able to see
that you are able to get a kind of
convex shape for gradient descent so this will definitely
help us to create a cost function so I
hope everybody is able to understand
till here with respect to this and this
will definitely work so finally I can
also write my cost function in a
different way the cost function that I
will probably write over here so this
will be my J of theta 1
so I can come up with a cost function
which looks like this
cost of (H Theta of X of i comma y) = minus
log of H Theta of x if Y is equal to
1 and then minus
log of (1 - H Theta of x) if Y is equal to
0 now I can combine this both and
probably write something like this
I can combine this both and I can
basically write cost of (H Theta of X of
i comma y) is equal to - y log H Theta of X of
i minus (1 - y) log of (1 - H Theta of X) so
this will be my final cost
function and here also you can see that
if I replace y with one then
what will remain only this particular
value will remain right when
Y is equal to 1 this term only will
come you see over here replace y with
one and then
you'll be able to see so here I can now
write if Y is equal to 1 my cost
function will look something like this
which is nothing
but see Y is 1 then what will happen my
log of H Theta of X of I will come and
this 1 - 1 is 0 so 0 multiplied by anything
will be 0 if Y is equal to 0 then what
will happen my cost function will be so
when it is zero this - y will
become 0 and 0 multiplied by anything is zero so
here you'll be able to see that
I'll be having minus log of (1 - H Theta of
X of i) so both the conditions have been
proved by this cost function
so this is my cost function yes cost
function and loss function with respect
to the number of parameters will be
almost same so finally if I try to write
J of theta because I have that 1 by 2m
also right so 1 by 2m also I have so what
I'm actually going to do here you will
be able to see that I can write J of
theta 1 is equal to 1 by 2 m summation
of i = 1 to m and then write down the
entire equation that you have probably
over here so here you have minus y or I
I'll just remove this minus and put it
over here and this will become plus
sorry y of I
* log H Theta of X of I 1 - y of i y
log 1 - H Theta of X of I so this
becomes my entire first function and
obviously you know what is h thet of x H
Theta of X of I is nothing but 1 + 1 e^
minus Theta 1 * X and finally my
convergence algorithm I have to repeat
this to update Theta 1 repeat until
convergence this updation that is Theta
J is equal to Theta J minus learning
rate * derivative of J of theta 1 with respect to Theta J
this is my repeat until convergence so this is my
cost function this is my repeat
algorithm and here I will be updating my
entire Theta
1 and this solves your problem with
respect to logistic regression
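The whole derivation above can be sketched in a few lines of NumPy. This is a minimal illustration (not the instructor's code), assuming the single-feature, zero-intercept setup used above, the conventional 1/m averaging in place of 1/(2m), and hypothetical toy data:

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)), squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta1, X, y):
    # log loss: -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) -- convex, one global minima
    h = sigmoid(theta1 * X)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_descent(X, y, lr=0.1, iters=500):
    theta1 = 0.0
    for _ in range(iters):
        h = sigmoid(theta1 * X)
        grad = np.mean((h - y) * X)   # derivative of the log loss w.r.t. theta1
        theta1 -= lr * grad           # repeat-until-convergence update
    return theta1

# toy data: negative x -> class 0, positive x -> class 1
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
theta1 = gradient_descent(X, y)
```

Because this log loss is convex in theta, the update never gets trapped in a local minima, which is exactly why the squared-error cost was discarded.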
simple questions may come like how it is
different from linear regression how it
is not different from linear regression
can we say log likelihood a topic from
probability yes this is log
likelihood now I will discuss about
performance metrics and this is specific
to classification problem and binary
classification I'm talking about let's
consider I have a data
set which has X1 X2 and this is y and
obviously in logistic uh classification
you have outputs like 0 1 0 1 1 0 1 and
your y hat y hat is basically the output
of the predicted model now in this
particular scenario my y hat will
probably be 1 1 0 1 1 1 0 so in this
particular scenario this is my predicted
output and this is my actual output so
can we come to some kind of conclusions
wherein probably we will be able to
identify what may be the accuracy of
this specific model with respect to these
many data points because the confusion
matrix is all about dealing with this
we will first of all have to create a
confusion matrix now for a binary
classification problem the confusion
Matrix will look like this so here you
have 1 0 1 0 Let's say that this is
prediction let's say that these are my
actual value and these are my prediction
value okay these both are prediction
value these are my output value when my
actual value is zero my predicted value
is one what does this mean
wrong prediction right so when my actual
value is zero my predicted value is 1 so
here my count will increase to one let's
go to the second scenario when the
actual value is one and my predicted
value is one that basically means one
and one so here I'm going to increase my
count similarly when my actual value is
zero my predicted value is zero so that
basically means when my actual value is zero
my predicted value is zero I'm going to
increase the count by one if I go over
here 1 one again it is so instead of
writing one now this will become two I'm
going to increase the count similarly
I'll go over here one more one is there
so I'm going to increase the count three
then I have 01 01 basically means when
my actual value is zero I'm actually
getting it as one so I'm also going to
increase this particular value as two
and then finally I have 1 and zero where
I'm going to increase like this now what
does this basically mean now what does
this basically mean see with respect to
this kind of predictions whenever we are
discussing this basically says
so these are my actual values and I have
1 and zero and this is my predicted
values I also have 1 and zero this value
when one and one are there this is
called as true positive this value when
0 and 0 are there this is called as
true negative whenever your actual
value is zero and you have predicted one
this becomes false positive and whenever
your actual value is one you have
predicted zero this becomes false
negative now coming to this I really
need to find out the accuracy of this
model now if I really want to find out
and this is what is called as confusion
Matrix now in this confusion Matrix if I
really want to find out the accuracy the
accuracy of this model it is very much
simple the diagonal elements that you are
able to see will basically give us the
right output so this and this if I add
it up it will give us the right output
so here I'm going to get TP + TN divided
by TP + FP + FN + TN so once I calculate
this so I have 3 + 1
/ (3 + 2 + 1 + 1) so this is nothing but 4
by 7 and what is 4 by 7 it is
0.57 so am I getting 57 percent
accuracy yes I'm actually getting 57%
accuracy over here with respect to the
accuracy so this is how we basically
calculate with respect to basic accuracy
with the help of uh the confusion Matrix
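The counting just walked through can be reproduced with a quick NumPy sketch over the same seven labels (a minimal illustration, not the instructor's notebook):

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0, 1])   # actual outputs from the example
y_pred = np.array([1, 1, 0, 1, 1, 1, 0])   # y hat, the model's predictions

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actual 1, predicted 1 -> 3
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # actual 0, predicted 0 -> 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actual 0, predicted 1 -> 2
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actual 1, predicted 0 -> 1

accuracy = (tp + tn) / (tp + tn + fp + fn)       # (3 + 1) / 7, about 0.57
```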
okay so this is specifically called as
confusion Matrix now there are some more
things that you really need to specify
always remember our model aim should be
that we should try to reduce false
positive and false negative now let's
say that I want to discuss about two
topics what one is suppose in our data
set I have zeros and one category let's
say in my output if I say zeros are 900
and ones are 100 this becomes an
imbalanced data very clear right so this
become an imbalanced data set it is a
biased data suppose if I say zeros are
probably 600 and ones are probably 400 in this
particular scenario I will say that this
is balanced data because yes you have
fewer ones but it's okay it may not
impact many of the algorithms now see
guys for most of the algorithms that we will
be probably discussing if we
have an imbalanced data set it will
obviously affect the algorithms let me
talk about this let's say that I have
number of zeros as 900 and number of
ones is 100 now let's say that my model
I have created will directly predict
zero for all the
inputs that it is probably getting with
respect to this training data it'll just
output zero now in this particular
scenario what will be my accuracy my
accuracy will be 900 divided by 1,000
right so this is nothing but 90% so is
this a good
accuracy obviously it is a good accuracy
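That always-predict-zero model is easy to verify numerically (a hypothetical sketch matching the 900/100 split above):

```python
import numpy as np

# imbalanced labels: 900 zeros and 100 ones
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.zeros(1000, dtype=int)     # a "model" that outputs 0 for every input

accuracy = np.mean(y_true == y_pred)   # 900 / 1000 = 0.9, i.e. 90%
```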
but this is a biased data if my model is
basically just outputting 0 0 0 0
obviously most of the answers will be
zeros this will be a scenario where
it is just outputting one thing and then
also it is able to get 90% accuracy so
you should only not be dependent on
accuracy so there are lot of
terminologies that we will basically use
one terminology that we specifically use
is something called as Precision then
we'll also use recall what is precision
what is recall I'll write the formula
over here in Precision what do we need
to focus and then finally we will
discuss about F score so we have to use
different kinds of performance metrics sorry
different kinds of formulas whenever you
have an imbalanced data set you can also
do oversampling but again understand in
some of the
scenarios oversampling may work but we
have to focus on the type of performance
metrics that we are focusing on right
now I'll not say F1 score I'll say F
score the reason why I'm saying I'll
just let you know so let's talk about
recall recall formula is basically given
by true positive divided by true
positive plus false negative
Precision is given by true positive
divided by true positive plus false
positive and then I will probably
discuss about F score also or we
basically say F beta also now I'll just
draw this confusion Matrix again okay
which is having true positive true
negative so let me draw it over here so
this is my ones and zeros these are my
actual values and these are my predicted
values I have true positive I have true
negative false positive and false
negative now in this particular scenario
when I'm actually discussing understand
what is recall and what focus it is
basically given on so here whenever I
talk about recall recall basically says
TP divided by TP plus FN so I'm
actually focusing on this so what does
this basically say recall means out of
all the actual positive values how many
have been predicted correctly as positive
that is basically measured by
TP so this is what it
is basically saying and this scenario is
called as recall in this the false
negative is basically given more
priority and our focus should be that we
should try to reduce false positive
false negative sorry we should try to
reduce this now let's go ahead and let's
discuss about Precision in Precision
what we are doing we are basically
taking out of all the predicted values
out of all the predicted positive values
how many of them are actual true or
positive okay this is what Precision
basically means now suppose if I
consider spam classification suppose
this is my task tell me in this
particular case should we use Precision
or recall and one more use case I'm
saying that whether the person has
cancer or not in which case we have to
support recall and in which case we have
to go ahead with Precision has cancer or
not in spam what is important okay guys
the recall is also called as true
positive rate I can also say recall as
sensitivity so if I go with Spam
classification it should definitely go
with Precision why it should go with
Precision if I probably get a spam mail
the main aim should be that whenever I
get a spam mail it should be identified
as spam okay in that specific scenario
we should try to reduce the false positive
and in this scenario my false
positive talks about the spam
classification in a better way in
the case of cancer I should definitely
use recall let's focus on the
recall formula TP divided by TP plus FN if a
person has cancer actually he
has cancer it should be predicted as
one otherwise if we have FN the model is
basically predicting he does not have
cancer that is really a bad situation in
this case if a person does not have a
cancer and if the model
predicts okay fine he has cancer he
may go and further do the test and then
he'll come to know whether he has a
cancer or not but this scenario is very
dangerous if a person has a cancer but
he is being indicated that he does not
have that cancer
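Plugging the counts from the earlier seven-point confusion matrix into these formulas makes the trade-off concrete; the `f_beta` helper below follows the general F-beta formula the session defines shortly (a sketch, not the instructor's code):

```python
# counts from the earlier example: TP=3, TN=1, FP=2, FN=1
tp, tn, fp, fn = 3, 1, 2, 1

recall = tp / (tp + fn)     # 3/4 = 0.75: of all actual positives, how many were caught
precision = tp / (tp + fp)  # 3/5 = 0.60: of all predicted positives, how many were right

def f_beta(precision, recall, beta=1.0):
    # beta > 1 weights false negatives (recall) more, beta < 1 weights false positives (precision)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall)  # harmonic mean, 2PR / (P + R)
```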
so here false negative is given more
priority over here in the case of spam
classification false positive is given
more priority so this is something
important over here and you really need
to understand with respect to different
different problem statement let me give
you one more example tomorrow the stock
market is going to crash in this what we
need to focus on should we focus on
Precision or should we focus on recall
now here two things are there who is
solving what kind of problem see many
people will say recall or precision but
it depends on whose point
of view you are creating this model are
you creating this model for the industry
or are you creating this model for the
people for the people he should
definitely get identified that okay in
this particular scenario you need to
sell your stock because tomorrow stock
market is going to crash but for
companies this is very bad okay I hope
everybody is able to understand for
companies it is very very bad so in this
particular case sometime we need to
focus both on false positive and false
negative and again I'm telling you for
which problem statement you are solving
that indicates if you are solving for
people then they should be able to get
the notification saying that it is going
to crash if you're probably uh doing it
for companies at that time your
Precision recall may change but if I
consider for both the scenarios at that
point of time I will definitely use
something called as F score F score or
I'll also say it as F beta now how is
the F beta formula given I will talk about
it and here in the F score you have
three different formulas the first
Formula I will say basically as when
your beta value is 1 okay first of all
I'll just give a generic definition of f
score or F beta here you are basically going
to consider (1 + beta square) * precision
multiplied by recall divided by (beta square
* precision plus recall) whenever your
both false positive and false negative
are important we select beta as one so
if I select beta as 1 it becomes 1 + 1
which is 2 so this becomes 2 multiplied by
precision into recall divided by
precision plus recall so here you have
this is basically called as harmonic
mean probably you have
seen this kind of equation where you
have written 2xy / (x + y) the same type you
are able to see here this is called as
harmonic mean here the focus is on both
false positive and false negative let's
say that your false positive is more
important than false negative at that
point of time you will try to decrease
your beta
value let's say that I'm decreasing my
beta value to 0.5 then what will happen
(1 + 0.5 square)
and then you have P * R precision into
recall and here also you have 0.25 P + R
now in this particular scenario I'm
decreasing my Beta decreasing the beta
basically means that you are providing
more importance to false positive than
false negative and finally you'll be
able to see that if I consider beta
value as let me just say my notes if I
consider beta value as two that
basically means you are giving more
importance to false negative than false
positive so with this specific case you
can come up to a conclusion what value
you basically want to use now whenever I
use beta is equal to 1 it becomes F1
score if I use beta as 0.5 then this
basically becomes F0.5 score and with beta
as 2 this becomes your F2 score so based on which
is important okay which is important
whether your Precision or false positive
or false negative is important you can
consider those things F score will have
different values if you're using beta is
equal to 1 that basically means you are
giving importance to both precision and
recall if your false positive is more
important then at that point of time you
reduce beta value if false negative is
greater than false positive then your
beta value is increased
beta is a deciding parameter
to decide your F1 score or F2 score or
F0.5 score now first thing first what
is the agenda of today's session first
of all we will complete practicals for
all the algorithms that we have
discussed these all algorithms that we
have discussed we will cover the
practicals probably we will be doing
hyper parameter tuning everything the
second thing and again here we are going
to take just simple examples so yes uh
so today's session I said practicals
with simple examples where I'll probably
discuss about all the hyper parameter
tuning then the second one the second
algorithm that I'm going to discuss
about is something called as naive Bayes this
is a classification algorithm so we are
going to understand the intuition and
the third one that we are going to
probably discuss is the KNN algorithm so the KNN
algorithm is definitely there
so this is our today's plan I know I've
written very less but there is a lot of maths
involved in naive Bayes right we'll
understand probability theory again
over there there is something called as
Bayes theorem we'll try to understand and
then we'll try to solve a problem on
that so let's proceed and let's enjoy
today's session how do we enjoy first of
all we enjoy by creating a practical
problem so I am actually opening a
notebook file in front of you so here uh
we will try to solve it with the
help of linear regression ridge lasso
and try to solve some problems let's see
how much we will be able to solve it but
again the aim is that we learn in a
better way okay uh so that everybody
understands some basic basic things okay
so first of all as usual uh everybody
open your jupyter notebook file the
first algorithm that I'm going to
discuss about is something called as SK
learn linear regression so everybody I
hope everybody knows about this SK learn
let's see what all things are basically
there in this we will be using fit
intercept everything as such but here
the main aim is to find out the
coefficients which is basically
indicated by Theta 0 Theta 1 and all the
first thing we'll start with linear
regression and then we will go ahead and
discuss ridge and lasso I'm just going
to make this as
markdown there are many different libraries
for linear regression you can do it with
statsmodels you can do it with scipy you can do
it with many things okay so first thing
first let's first of all we require a
data set so for the data set what we are
going to do is that we are going to
basically take up some smaller smaller
data just let me do this so for this uh
we are going to take the house pricing
data set so we are going to solve house
pricing data set problem a simple data
set which is already present in SK learn
only now in order to import the data set
I will write a line of code which is
like from SK learn dot data sets data
sets
import load uncore Boston so we have
some Boston house pricing data set so
I'm just going to execute this I'm also
going to make a lot of cells so that I
don't have to again go ahead and create
all the cells again some basic libraries
that I probably want import numpy
as np
import pandas as pd okay
import seaborn as sns
and then I will also import
matplotlib.pyplot as plt and then
%matplotlib inline
and I will try to execute this
see this my typing speed has become a
little bit faster by writing by
executing this queries again and again
and uh let's go ahead uh so I have
imported all the necessary libraries
that is required which which will be
more than sufficient for you all to
start with now in order to load this
particular data set I will just use this
function called load_boston and
I'm going to just initialize this so if
you press shift tab you will be able to
see that return load and return the
Boston house prices data set it is a
regression problem it is saying and then
probably I'm just going to execute it
now once I execute it I will go and
probably see the type of DF so it is
basically saying sklearn.utils.Bunch now if
I go and probably execute DF you'll be
able to see that this will be in the
form of key value pairs okay like Target
is here data is here okay so data is
here Target is here and probably you'll
be able to find out feature names is
here so we definitely require feature
names we require our Target value and
our data value so we really need to
combine this specific thing in a proper
way in the form of a data frame so that
you will be able to see so what I'm
actually going to do over here I'm just
going to say PD do data frame I'll
convert this entirely into a data frame
and I will say DF do data see this is a
key value pair right so DF do data is
basically giving me all the features
value so if I write DF do data and just
execute it you'll be able to see that I
you will be able to get my entire data
set in this way my entire data set in
this way this is my feature one feature
two feature three feature four up to feature
13 I have 13 features over here and
based on that I have that specific value
now the next thing thing that I'm going
to do probably I should also be able to
add the target feature name over here so
what I will do I will just convert this
into DF and then I will also say DF do
columns and I'll set it to DF do Target
okay and let me change this to data set
so I'm going to change this to data set
and I'm going to say data set. columns
is equal to DF do Target so if I execute
this and now if I probably
print my data set do head you will be
able to see this specific thing okay it
is an error let's see expected axis has
13 element new values has
506 so Target okay I should not use
Target over here instead I had a column
which is called as features feature
names like if I go and probably see
DF DF over here you'll be able to see
there is one thing which is called as
feature names so I'm going to use DF do
feature names over here so here it is DF
do feature names I'm just going to paste
it over here and now if I go and write
here you can see print DF data set. head
if I go and execute without print you'll
be able to see my entire data set so
these are my features with respect to
different different things and this is
basically a house pricing data set so
initially I have these features CRIM ZN
INDUS CHAS NOX RM AGE DIS RAD TAX
PTRATIO B LSTAT so I have my
entire data set over here the same data
set I have basically put it over here
now here also you'll be able to see what
all this feature basically means this is
showing DIS the weighted distances to
five Boston employment centres RAD
basically means index of accessibility
to radial Highway tax basically means
full value property tax rate this much
PTRATIO basically means pupil teacher
ratio I don't know what the hell it
means but it's fine we have some kind of
data over here properly in front of you
so these are my independent features
what are these these all are my
independent features if you want the
features detail here you can see it
right everything what is CRIM this
basically means per capita crime rate by
town which is important ZN it is the
proportion of residential land zoned
for lots over 25,000 square feet so this
is my DF I did not do much I'm just
using data frame DF do data column
features name I'm getting this value
very much simple now let's go a little
bit slowly so that many people will be
able to also understand now this is my
data set. head now the thing is that I
obviously have taken all these
particular values but this is my
independent feature I still have my
dependent feature so what I'm actually
going to do I will create a new feature
which is like data set of price I'll
create my feature name price price of
the house and what I will assign this
particular value this value will be
assigned with this target this target
value this target value is basically the
sale the price of the houses right it is
again in the form of array so I'm going
to take this and put it as a dependent
feature so here you'll be able to see
that my price will be my dependent
feature so here I'll basically write DF
do Target so once I execute it and now
if I probably go and see my data set do
head you'll be able to see features over
here and one more feature is getting
added that is price now this price may
be the units may be in
millions somewhere in the description it
should have definitely said the units
probably in millions I cannot see it
right now but that is not a problem
mostly it'll be in millions
probably if I put more time I'll be able to
understand it okay so over here what is
the thing main thing this all are my
independent features and this is my
dependent feature right so if I'm trying
to solve linear regression I have to
divide my independent and dependent
features properly now let's go to the
next step that is dividing the data set
first of all I'll try to
divide it into independent and dependent
features so I want my entire features
data set divided into independent and
dependent features X I will be using as
my independent features so I will write
data set dot I will use iloc which is
present in data frames and understand
from which feature to which feature I
will be taking as my independent feature
to this feature till LSTAT so the best way
that basically means that I just need to
skip the last feature in order to skip
the last feature what I'm actually going
to do from all the columns I will just
skip the last column so this is how you
basically do an indexing with respect to
just skipping the last feature and this
will basically be my independent
features and here I will basically say Y
is equal to data set do iloc and here I
just want the last feature so I will
write colon all the records I want and
see the first term that we are probably
writing over here this basically
specifies with respect to records here
this specifies with respect to columns
from all the columns I'm taking the last
column here I will just take the last
column and this will basically be my
dependent features dependent features so
here I have basically executed now if
you can go and probably see x. head here
you'll be able to find all my
independent features in y do head you'll
be able to find the dependent feature
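the iloc based split just described can be sketched like this; the DataFrame below is a tiny made-up stand-in for the actual dataset in the notebook, but the indexing pattern is exactly the same:

```python
import pandas as pd

# tiny stand-in dataset: every column except the last is an independent
# feature, and the last column is the dependent feature
dataset = pd.DataFrame({
    "temp": [29, 31, 26, 25],
    "rh":   [57, 61, 82, 89],
    "ws":   [18, 13, 22, 13],
    "fwi":  [0.5, 0.4, 0.1, 0.0],   # dependent feature (last column)
})

X = dataset.iloc[:, :-1]   # all rows, skip the last column
y = dataset.iloc[:, -1]    # all rows, only the last column

print(X.head())
print(y.head())
```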
now let's go to the first algorithm that
is called as linear regression
always remember whenever I definitely
start with linear regression I'll
definitely not go directly with linear
regression instead what I will do is
that I'll try to go with Ridge
regression and uh lasso regression
because there you have a lot of options
with respect to hyperparameter tuning but I'll
just show you how linear regression is
done so basically you really need
to use a couple of libraries okay over here
and these
libraries you may need to install okay and
what are these libraries these are
basically the linear regression Library
so here I'm basically going to use two
specific thing one is linear regression
Library so I will just use from sklearn
dot linear_model import LinearRegression
do you need to remember this
the answer is no because I also
Google it and try to find out where in
sklearn it is present okay so here is
my linear regression so I will try to
initialize linear reg is equal to
initialize with linear regression and
then here what I'm actually going to do
I'm going to basically apply something
called as cross validation cross
validation is very much important
because in Cross validation we divide
our train and test data in such a way
that every combination of the train and
test data is basically taken
care of by the model and whichever accuracy
is better that entire thing is
basically combined so here what I'm
going to do I'm going to say mean square
error is equal to here I will import one
more library let's say from sklearn
dot model_selection I'm going to import
cross_val_score
so cross_val_score cross
validation score basically means it is
going to do a lot of train and test
split it's something like this one
example I will show it to you here only
so what does cross validation basically
do okay so in Cross validation what
happens what you do suppose this is your
entire data
set suppose this is 100 records if you
do five cross validation then in the
first this will be your test data and
remaining all will be your training data
if in the second cross validation this
will be your test data and remaining all
will be your training data like
this five times you'll be doing cross
validation by taking the different
combination of train and test but I'm
not going to discuss much about it in
the future if you want a separate
session I will include that in one of
the session itself so this was uh
basically the plan with respect to cross
validation or cross_val_score so here
I'm going to basically take
cross_val_score
and here the first parameter that
I give is my model so linear regression
is my model and here I will take X and Y
I'm not doing a train test split
specifically over here I'm giving the
entire X and Y and probably based on
that I'm going to do a cross validation
over here you can also do train test
split initially and then just give the X
train and Y train over here to do the
cross validation it is up to you but the
best practices will be that first you do
the train test split and then only give
the train data over here to do the cross
validation I'm just going to use scoring
is equal to you can use mean squared
error negative mean squared error let's
say that I'm going to use negative mean
squared error again where do you find all
these things you will be able to see in
the sklearn page of cross_val_score
and then finally in cross_val_score
you give the cross validation value as
5 10 whatever you want so after this
what I'm actually going to do I'm just
going to basically from this how many
scores I will get the mean squar error
will be five since I'm doing five cross
validation if you don't believe me just
see over here print msse so here you'll
be able to see five different values 1 2
3 4 5 right five different mean values
because we are doing cross five five
cross validation so here what I'm going
to write I'm just going to say np. mean
I want to take the average of all the
five so here will basically be my
mean_mse okay and then probably I
will print my mean_mse so this will
be my average score with respect to this
the negative value is there because we
have used negative mean squared error but if
you just consider mean squared error then
it is only 37.13 okay so this I have
actually shown you how to do cross
validation see with respect to linear
regression you can't modify much with
the parameter so that is the reason why
specifically in order to overcome
overfitting and do the feature selection
we use Ridge and lasso regression so here
I will show you how to do
ridge regression
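before moving on to Ridge, the LinearRegression plus cross_val_score flow described above can be sketched as follows; load_diabetes is only a stand-in here, since the notebook's own dataset isn't reproduced:

```python
import numpy as np
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

regression = LinearRegression()

# 5-fold cross validation; sklearn exposes MSE as "neg_mean_squared_error"
mse = cross_val_score(regression, X, y,
                      scoring="neg_mean_squared_error", cv=5)
print(mse)                 # five scores, one per fold
mean_mse = np.mean(mse)
print(mean_mse)            # average fold score (negative by convention)
```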
now in order to do the prediction
all you have to do is that just go over
here take the model okay what is the
model linear R and just say do
predict so here you can see uh you'll be
getting a function called as do predict
and give the test value whatever you
want to predict automatically the
prediction will be done so I'm just
going to remove this and focus on Ridge
regression right now because I I want to
show how hyperparameter tuning is done
in ridge regression so for ridge regression the
simple thing is that I'll be using two
different libraries from sklearn
dot linear_model I'm going to
import Ridge so for the ridge it is also
present in linear_model for
doing the hyperparameter tuning I will
be using from sklearn dot model_selection
and then I'm going to import
GridSearchCV so these are the two
libraries that I'm actually going to use
GridSearchCV will be able to help you out
with hyperparameter tuning and
then probably you'll be able to do
that uh the difference between MSE and
negative MSE is not a big thing guys if you
use MSE here mean squared error you'll be
getting 37 I've just used the negation of
MSE it's okay anything is fine you can
go with MSE also mean squared error
there is also another scoring
metric which focuses on
root mean squared error okay so there are
different different things which you can
basically focus on okay now in order to
give you this specific good value I'm
actually going to do hyperparameter tuning
now let's go ahead with GridSearchCV so
here what I'm going to do again I'm
going to basically Define my model which
will be
Ridge okay so this is what I have
actually imported now let me open the
sklearn Ridge documentation so sklearn
Ridge we need to understand what
all parameters are basically
used do you remember this Alpha value
guys do you remember this Alpha value
why do we use Alpha I I told you now
Alpha multiplied by slope square if you
remember in Ridge we specifically use
this right Ridge and lasso regression
Alpha so this is the alpha the this is
probably the best parameter we can
perform hyper parameter tuning the next
parameter that we can probably perform
is basically uh this Max iteration okay
Max iteration basically means how many
number of iteration how many number of
times we may probably change the Theta 1
value to get the right value so we can
do this so what I'm actually going to do
I'm going to select some Alpha values
I'm going to play with this apart from
that if I want I can also play with the
other parameters which are uh like kind
of uh you know probably you can you can
also play with the iteration parameter
it is up to you try whichever parameter
you want to change you can go ahead and
change it now let me show you how do we
write this and how do we make sure that
this specific thing is done now uh
before doing grid s CV uh let me do one
thing I will Define my parameters okay
so here is my Ridge now what I'm going
to do I'm going to say parameters and in
this parameter two important value that
I'm probably going to take is this one
that is my C value and I will try to
Define this in the form of dictionaries
so here the C value that I sorry not C
just a second
guys my mistake it is not C it is
Alpha let's see so how do I Define my
Alpha value we'll try to see so here the
parameters will be Alpha C is basically
for uh logistic regression I'll show you
so the alpha value I will just mention
some values like
1e-5 that basically means
0.00001 similarly I can write
1e-10 that again means a 1 after
ten zeros I'm just making fun
okay so that you will also get
entertained then 1e-8 okay
similarly I can write
1e-3 from this particular value now
I'm increasing this value see
1e-2 and then probably I can
have 1 5 10 20 something like this so
I'm going to play with all this
particular parameters for right now
because in GridSearchCV what they do is
that they take all the combinations of
this Alpha value and wherever your
model performs well it is
going to take that specific parameter
and it is going to give you that okay
this is the best fit parameter that
got selected so here I have got all
these things now what I'm going to do
I'm going to basically apply the
GridSearchCV so here I'm
saying ridge_regressor so I'm going to
use GridSearchCV
and here I'm basically going
to take the parameters okay Ridge
is my first model and then I will take
up all this params that I have actually
defined see in GridSearchCV if I press shift
tab I have to first of all execute this
then only I will be able to press shift
tab so here if I press shift tab here
you'll be able to see estimator and
parameter grid is my second parameter
then scoring and then all the other
parameters so here the first thing that
goes is your model then your parameters
which what you are actually playing then
the third parameter is basically your
scoring
scoring and again here I'm going to use
neg_mean_squared_error some people are
saying that mean squared error is not
present so that is the reason why
neg_mean_squared_error is used why it
may not be present because
they try to always create a generic
library probably this kind of scoring
parameter may also get used in other
algorithms so that is the reason they
may not have created it but if you want to
deep dive into it Google it
then I write ridge_regressor dot fit on
X comma y again I'm telling you you can
first of all do train test split on X
and Y and then probably only do this on
X train and Y train parameter is not oh
sorry
okay I get this okay parameter is not
and why it is not and oh yeah it has
become a
list I'm going to make this as
dictionary right now I'm fully focused
on implementing things if I get an error
I'll definitely make sure that it'll get
fixed anyhow if I get that error I will
not say oh Krish why did this error come
you know
why this error came I'll not get
worried I'll fix the error you
cannot give the params as a list okay so try
to understand okay so this is your
GridSearchCV I've also done the fit and let's go
and select the best parameter so what I
can do I will write print
ridge_regressor dot
params sorry there will be an attribute
called best_params_ I'm going to print
this and I'm going to print
ridge_regressor dot
best_score_
so these are all the values that
that got selected one is Alpha is equal
to 20 and the best score is -32 so
initially I got -37 but because of
Ridge regression you can see that our
negative mean squared error has
definitely become better there is a
minus sign don't worry but from 37 it
has come to 32 cross validation guys
over here inside grids s CV also when it
is probably taking the entire
combination over there the CV Value
Cross validation also we can use
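putting the GridSearchCV steps described above together, a minimal sketch with Ridge might look like this; the dataset is a stand-in and the exact alpha grid is illustrative:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

ridge = Ridge()
# candidate alpha values from very small to large, passed as a dict
# (a plain list would raise the error seen in the video)
params = {"alpha": [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20]}

ridge_regressor = GridSearchCV(ridge, param_grid=params,
                               scoring="neg_mean_squared_error", cv=5)
ridge_regressor.fit(X, y)

print(ridge_regressor.best_params_)   # alpha that performed best
print(ridge_regressor.best_score_)    # best (negative) mean squared error
```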
so probably if I am probably considering
all these
things many people have a question Krish
if this minus value increased does that
basically mean you cannot use Ridge
regression you are right in that
particular case Ridge regression is not
helping you out so guys let me again
write it down everybody don't worry yeah
sorry previously I got what -37
-37 now I got -32 so here you can see
this I got it from linear regression
this I got it from what Ridge which one
should I select I should select this
model only because it is performing well
than this but again understand Ridge
also tries to reduce the overfitting so
probably in this particular scenario we
cannot use Ridge because the performance
is becoming more bad so what I will do I
will go and try with lasso regression
now I'll copy and paste the same thing
so linear model import lasso then this
will basically be my
lasso let's see with lasso whether it
will increase or not let's
see this is my parameter that got
selected now let me write lasso
regressor
dot best_params_ so this is Alpha is
equal to 1 that got selected over here
I'm just going to print it okay and then
I'm going to print the lasso
regressor dot best_score_ so
here I'm actually getting -35 here
I'm actually getting -32 so minus 35
still I will focus on linear regression
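the Lasso run just compared is the same pattern with only the estimator swapped; a sketch under the same stand-in dataset assumption:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

lasso = Lasso()
# illustrative alpha grid (alpha = 0 is discouraged for Lasso)
params = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20]}

lasso_regressor = GridSearchCV(lasso, param_grid=params,
                               scoring="neg_mean_squared_error", cv=5)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```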
now see what will happen if I add more
parameters if I add more parameters see
what will happen so now I'm going to
take Alpha different different values
see this I'm just going to remove this
and probably add Alpha value in this
way see here I have added more values 5
10 20 30 35 40 45 100 okay let's see
whether our performance will increase
or not so here
uh first of all let me remove from here
in Ridge just take it down guys I'm
adding more parameters like this just
take it down yeah CV is equal to 5
okay you're not able to see it
CV is equal to 5 now here it is what
you can basically focus on so here you
can see I have added some values like
this you can also
add and just try to execute and now if I
go and probably see this is my see first
I have tried for Ridge I'm getting minus
29 do you see after adding more
parameters what happened in
Ridge you can see minus 29 and the
alpha value that is got selected is 100
if you want try with cross validation
10 and just try to execute now
now so these are some hyper
parameters that we will definitely play
with here you can see - 29 so here you
can see minus 29 you can also increase
the cross validation
value over here also and probably
execute it but with lasso I don't know
whether it is improving or not it is
coming to minus 34 you just have to play
with these parameters now for a bigger
problem statement the thing is not
limited to here right we try many
parameters and try to do these
things it is up to you we play with
multiple parameters and whichever gives us
the best result we are basically taking
it it's okay the error increased I know
that yes the error is increasing even
after trying with different
parameters but in most
scenarios see here I got -37 probably
what I can actually do is try to
get a better one with respect to this
now the best way what I can also do is
that I can basically take up train and
test split also and probably do these
things let's see let's see one example
so how do we do train and test from
sklearn dot model_selection
import train_test_split okay it's okay
guys you may get a different value okay
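a minimal sketch of the train test split flow being set up here, with the fit, predict and R2 evaluation at the end; the dataset is again a stand-in:

```python
from sklearn.datasets import load_diabetes          # stand-in dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# 33% of the records go to the test set, 67% stay for training;
# random_state just makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

regression = LinearRegression()
regression.fit(X_train, y_train)

y_pred = regression.predict(X_test)
score = r2_score(y_test, y_pred)   # closer to 1 is better
print(score)
```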
let's do one thing okay let's make your
problem statement little bit simpler now
what I'm going to do just tell me in
train test split what we need to do so
I'm going to take the same code I'm
going to paste it over here or let me do
one thing let me insert a cell below and
let me do it for train test split so in
train test split what we can do so I'm
just going to take the syntax paste it
over here let's say that I'm taking X
train y train and then I'm using train
test split with 33% now if I execute
with respect to X train and Y train so
here is my you can see this I have
written this code from sklearn dot model
selection train_test_split random
State can be anything whatever you write
it is fine then you basically give X and
Y with test sizes 33 uh this is
basically saying that the test will have
33% and the train data will be 67% so
this is what I'm actually getting with
respect to X train and Y train here what
I'm going to do I'm going to basically
take X train comma y train and now if I
go and probably see this here you can
see minus 25 understand this value
should go towards zero if it is going
towards zero that basically means the
performance is better now similarly I do
it for Ridge in Ridge what I'm actually
going to do here I'm going to write X
train and Y train and if I go and
probably select the best score than this
here you'll be able to see I'm getting
how much I'm getting minus
2.47 okay here I'm getting
25.8 here 25.47 that basically means
now still the improvement is a little bit
bad because here we are not going
towards zero so the next part again here
also you can basically do it for X train
and Y train X train and Y train so here
you have this one and let's go and
execute this so here you can see minus
2.47 now what you can also do is that
you can use this
lasso regressor dot predict and you can
basically predict with respect to X test
so this is your y test value suppose
let's say that this is my y_pred
then what I can do from
sklearn I will be using R square and
adjusted R square if you remember
sklearn R square r² so this is my R2 score
so where is it present in sklearn
dot metrics so I'm going to write
let's say I'm saying from
sklearn dot metrics import r2_score now
what I'm going to do over here I'm
basically going to say my R2 score which
is my variable I'll say this is nothing
but R2 score here I'm just going to give
my y_pred comma y_test so if I go and
probably see the output here I will be
able to see print R2 score this is all I
have discussed guys there is also
an adjusted R square score where is R2
R2 score and adjusted r² okay R2 score
is there but adjusted R square should be
here somewhere in some manner so this is
how your output looks like with respect
to by using this lasso regressor okay
which is very good okay it should be I
told it should be near 100% right now
I'm getting 67% if I want to try with
the ridge you can also try that so you
can say ridge regressor dot predict and
here you can see 68% then you can also
try linear regressor and
predict what is the error saying the
regression is not fitted yet why is it
not fitted why is it not
fitted let's say that I fit it here
linear
regression dot fit on X train and Y
train X train comma y train so I'm
just going to fit it now if I go and
probably try to do the
calculation so if I go and see my R2
score it is also coming somewhere around
68% 67% now since this is just a linear
regression you won't be able to get 100%
because you're drawing a straight line
right so for that you basically have to
use other algorithms like XGBoost
naive Bayes and so on so many algorithms are
there it's okay see you give y test over
here y pred over here both are the same right
they're
comparing see up to a limit you can
increase the performance after
that you cannot see again I'm telling
you in linear regression what we do
these are my points right I will be only
able to create one best line I cannot
create a curve line right over here so
obviously my accuracy will be only
limited let's go and do the logistic
practical
quickly and here uh in logistic also we
can do GridSearchCV now what I'm actually
going to do first of all let's go ahead
with the data set so I will quickly
implement logistic so from sklearn dot
linear
model I'm going to import logistic
regression so I'm going to use logistic
regression and apart from that we know
that let's take a new data set because
for logistic we need to solve using
classification problem so this is
basically my logistic regression I'll
take one data set so from sklearn dot
datasets import we'll take a data set which
is the breast cancer data set so
that is also present in sklearn with
respect to the breast cancer data set
I'm just going to use this see load_breast_cancer
I'm loading it and all
the independent features are in data and
my columns are feature names the same
thing like how we did previously okay so
this will basically be my
complete uh complete independent feature
so if I go and probably see this x. head
here you'll be able to see that based on
this input features the independent
feature we need to determine whether the
person is having cancer or not these are
some of the features over here and this
is like many many features are actually
present so next thing that was my
independent feature now I'll take my
dependent feature the dependent feature will
already be present in DF target okay this
particular data set that we have taken
in DF in DF do Target we will basically
have all our dependent feature these are
my independent features so what I'm
actually going to do I'm going to create
Y and I'm going to say PD do data frame
and here I'm going to say DF do Target
Target and this column name should be
Target right so this will be my column
name and now if I go and see my y y is
basically having zeros and one in the
target feature now the next thing that
we are going to do is first of all we need
to check whether this data set
this particular y column is balanced or
imbalanced okay in order to do that I
will just write df
target if the data set is imbalanced
definitely we need to work on that and
try to perform upsampling so if I write
y target dot value_counts if I execute
this so here you'll be able to see that
value_counts will basically give
how many ones there are and how many
zeros there are so now the total number
of ones is 357 and the total number of
zeros is 212 so is this an imbalanced
data set probably this is a balanced
data set so here I'm actually going to
now do train test split train test split I
will try to do again train test split how
do we do we can quickly do copy the same
thing entirely I'll copy this entirely
over here and then I will get my X and Y
so here is my X train X test y train y
test so train test split obviously I'll
be doing it now in logistic regression
if I go and search for
logistic regression sklearn I will be
able to see this what all parameters are
there this is basically the L1 Norm or
L2 Norm or L1 regularization or L2
regularization with respect to whatever
things we have discussed in logistic and
then the C value these two parameter
values are very much important if I
probably show you over here the penalty
what kind of penalty whether you want to
add L2 penalty L1 penalty you can use L2
or L1 the next thing is C this is
nothing but inverse of regularization
strength this basically says 1 by Lambda
something like that this parameter is
also very much important guys class
weight suppose if your data set is not
balanced at that point of time you can
apply weights to your classes if
your data set is imbalanced you
can directly use class weight is equal
to balanced other than that you can use
whatever other weights you basically
want so these are specifically some of
the parameters right no this is not Ridge or lasso
okay this is logistic in logistic also
you have L1 norm and L2
Norms understand probably I missed that
particular part in the theory but here
also you have an L2 penalty norm and L1
penalty Norm I probably did not teach
you in theory because if you look see
logistic regression can be learned by
two different ways one is through
probabilistic method and one is through
geometric method if you go and probably
see my video that is present with
respect to logistic regression right now
in my YouTube channel there I have
explained you about this L1 and L2 Norms
also over there so in this also it is
basically present it is a kind of
penalty again just for uh using for this
kind of classification problem so what
I'm actually going to do let's go and
play with the parameters that I am
looking at so I will play with two
parameters one is params C value here
I'm defining 1 10 20 anything that you
can Define one set of values you can
Define and there was one more parameter
which is called as Max iteration this is
specifically for GridSearchCV okay that
I'm specifically going to apply so I
will just try to execute this this will
be my params now I'm going to quickly
Define my model one which will be my
logistic regression model so my logistic
regression here by default one value
I'll give for C and Max itra let's say
I'm giving this value later on what I
will do for this model I'll apply it to
grid sear CV so I'm just going to say
grid s CV and I'm going to apply it for
model one param grid is equal to params
this parameter that I'm specifically
trying to apply since this is a
classification problem and I am not
pretty sure that whether true positive
is important or true negative is
important I'm going to use F1 scoring
okay F1 scoring is basically again the
parametric term which we discussed
yesterday which is nothing but
performance metrics and then I'm going
to use CV is equal to 5 so this will be
entirely my model with respect to
GridSearchCV and I'll be executing this then I
will do model. fit on my X train and Y
train data so once I execute it here you
can see all the output along with
warnings a lot of warnings will be
coming I don't know because this many
parameters are there and finally you can
see that this has got selected now if
you really want to find out what is your
best param score model
dot best params so here you can see Max
iteration as
150 and what you can actually do with
respect to your best score model do best
score is 95 percentage but still we want
to test it with test data so can we do
it yes we can definitely do it I'll say
model do core or I'll say model dot
predict on my X test data and this will
basically be my y red so this will be my
y red all the Y prediction that I'm
actually getting so if you go and see y
red so these are my ones and zeros with
respect to the Y
prediction and finally after getting the
prediction values I can apply a confusion
matrix I hope I have taught you about
the confusion matrix so from sklearn dot
confusion matrix sorry sklearn dot metrics
I'm going to import confusion_matrix
classification report and the next thing
that I would like to do is these two I
will try to import confusion_matrix and
classification_report now if you want to
see the confusion matrix with respect to
your predictions I can just write
y_pred or y_test whatever you want
go ahead with it and this is basically
my confusion Matrix if I put this
forward no difference will be there only
this thing will be moving that also I
showed you 63 118 3 and 4 now finally if
I want the accuracy score I can also
import accuracy score over here so here
you can see accuracy score is imported I
can also find out my accuracy score
which is my the total accuracy with
respect to this we can give y test and
y pred which we have discussed
yesterday this is giving
96% if you want detailed Precision
recall all the score then at that point
of time I can use this classification
report and here I can give y test
and y pred here is what I'm actually
getting so here you can see with respect
to F1 F1 score Precision recall since
this is a balanced data set obviously
the performance will be best yes you can
also use ROC see I'll also show you how
to use ROC and probably you'll be able
to see this you have to probably
calculate the false positive rate and true
positive rate but don't worry about ROC
I will first of all explain you the
theoretical part now let's go ahead and
discuss about naive Bayes naive Bayes is an
important algorithm so here I'm just
going to go ahead so now let's go ahead
and discuss about naive Bayes and here we
are going to discuss the intuition
so naive Bayes is another amazing
algorithm which is specifically used for
classification and this specifically
works on something called Bayes
theorem now what exactly is Bayes theorem
first of all we need to understand
Bayes theorem let's say guys I have
Bayes theorem let's say that I have an
experiment which is called rolling a
dice now in rolling a dice how many number
of elements do I have so if I say what
is the probability of 1 then obviously
you'll be saying 1/6 if I say
probability of 2 then also here you'll
say 1/6 if I say probability of 3
then I will definitely say it is 1/6 so
here you know that this kind of events
are basically called as independent
events now rolling a dice why it is
called as an independent event because
getting one or two in every experiment
one is not dependent on two two is not
dependent on three so they are all
independent that is the reason why we
specifically say is an independent event
but if I take an example of dependent
events let's consider that I have a bag
of marbles okay in this marble I
basically have three red marbles and I
have two green marbles now tell me what
is the probability of suppose I have a
event in the first event I take out a
red marble so what is the probability of
taking out a red marble so here you can
definitely say that it is
3/5 okay so this is my first event now
in the second event let's say that in
this you have taken out the red marble
now what about the second time again
you are taking out a second red marble
or forget about the second red marble now
you want to take out the green marble
now what is the probability with respect
to taking out a green marble so here
you'll be definitely saying that okay
one red marble has been removed then the
total number of marbles that are left
are four so here you can definitely
write that probability of getting a
green marble is nothing but 2/4 which is
nothing but 1/2 so here what is
happening in the first event you took
out the first marble
from the first event you took
out a red marble from the second event you
took out a green marble
these two are dependent events because
the number of marbles are getting
reduced as you take out from them so if
I tell you what is the probability of
taking out a red marble and then a green
marble the formula
will be very simple right which we
have already discussed in stats it is
nothing but probability
of red multiplied by probability of
green given red so this specific thing
is called as conditional probability
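written out, the marble computation just described is:

```latex
P(R) = \tfrac{3}{5}, \qquad
P(G \mid R) = \tfrac{2}{4} = \tfrac{1}{2}, \qquad
P(R \cap G) = P(R)\,P(G \mid R) = \tfrac{3}{5}\cdot\tfrac{1}{2} = \tfrac{3}{10}
```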
here understand what is happening
probability of the green marble given the
red marble event has occurred here both
the events are dependent now let me
write it down very nicely so I can write
probability of A and B is equal to
probability of A multiplied by probability
of B given A let's
go and derive something can I write
probability of A and B is equal to
probability of b and a so answer is yes
we can definitely say we can definitely
say if you go and do the calculation
you'll be able to get the answer you
should not say no now what is the
formula for probability of A and B so
here you can basically write probability
of a multiplied by probability of B
given a if I take out probability of
green what is probability of green in
this particular case 2/5 what is
probability of red given green 3/4 right now
let's consider this now this part I can
definitely write as probability of B
multiplied by probability of
A given B so I can definitely write this
much with respect to all this
information now can I derive probability
of A given B is equal to probability of A
multiplied by
probability of B given A divided by
probability of B and this is
specifically called Bayes theorem and
this is the crux behind naive Bayes
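written out cleanly, the derivation on the board is:

```latex
P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)
\quad\Longrightarrow\quad
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
```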
understand this is the Crux behind the
Bayes theorem now let's go ahead and
let's discuss about how we are using
this to solve let's take some examples
and probably make you understand let's
say that I have some features like X1 X2
X3 X4 X5 like this till xn and I have my
output y so these are my independent
features these all are my independent
features these all are my independent
features so here I'm going to write
independent features and this is my
output feature which is also my
dependent feature now what is happening
if I say probability of b or a what does
this basically mean I need to really
find what is the probability of Y and
you know that guys I will have some
values over here and basically I'll have
some output value over here so based on
this input values I need to predict what
is the output initially on a training
data set I will have your input and then
your output initially my model will get
trained on this now let's consider what
this entire terminology is I will try to
write in terms of this equation so I
will say probability of Y given x1a x2a
X3 up till xn then this equation will
become probability of Y see probability
of Y given X X1 X2 X3 xn this a is
nothing but X1 X2 X3 xn and I'm trying
to find out what is the probability of Y
and then I will write probability of b b
is nothing but y but before that what
I'll write probability of a / B right a
given b or probability of B probability
of B is nothing but y multiplied by
probability of a given B probability of
a given B basically means probability of
x1a X2 comma xn and given b b is given
right so I'm able to find this entire
value now just a second I made some
mistakes I guess now it is correct sorry
I I just missed one term that is this
given y this is how it will become and
this will be equal to probability of a
that is X1 comma X2 like this up to XL
so probability of Y multiplied by
probability of a given y now if I try to
expand this then this will basically
become something like this see
probability of Y multiplied by
probability of X1 given yes a given y
sorry given y multiplied by probability
of X2 given y probability of x3 given Y
and like this it will be probability of
xn given y so this will also be y1 Y2 Y3
YN this I can expand it like this and
then this will basically become
probability of X Y 1 multiplied by
probability of X2 multiplied by
probability of x3 like this up to
probability of xn so this is with
respect to all the probability y will be
different see here for this particular
record y will be different for this y
will be different for this y will be
different but why output it may be yes
or no right it may be yes or no okay I
I'll solve a problem it will make
everything understand and this will
probably be probability of Y it can be
binary multiclass whatever things you
want I'll solve a problem in front of
you now let's say that I have my y as
Let's say that in one of my datasets I have features X1, X2, X3, X4, and my y takes the values yes or no. How will I write this? We really need to understand this. I will basically ask: what is the probability of y = yes given x(i), where x(i) is my i-th record — x(i) basically means the values of X1, X2, X3, X4 for that record. So what kind of equation will you write? Using the expansion above:

P(yes | x(i)) = P(yes) × P(X1 | yes) × P(X2 | yes) × P(X3 | yes) × P(X4 | yes) / [P(X1) × P(X2) × P(X3) × P(X4)]

y itself is fixed — it may be yes or it may be no — but with respect to different records this value may change.
Similarly, if I write the probability of y = no given x(i), what will it be? It will be

P(no | x(i)) = P(no) × P(X1 | no) × P(X2 | no) × P(X3 | no) × P(X4 | no) / [P(X1) × P(X2) × P(X3) × P(X4)]

because for any input x(i) I may get either yes or no, so I need to find both probabilities. Both formulas are written here: the probability with respect to yes and the probability with respect to no. Now, one common thing you see is that the denominator is fixed — it is definitely the same for both of them and is not going to change — so I can consider it a constant, and I can definitely ignore it. From here on I'll just use the numerator of this formula to compare the probabilities.
Now let's say that for a specific record x(i) my first score, for yes, comes out as 0.13, and similarly the score for no given x(i) comes out as 0.05. You know that in a binary classification any value greater than or equal to 0.5 we consider as 1, and less than 0.5 as 0 — but here I am getting values like 0.13 and 0.05. So we do something called normalization. If I really want the probability of yes given x(i), normalization gives 0.13 / (0.13 + 0.05) ≈ 0.72, which is nothing but 72%. Similarly, for the probability of no given x(i), it will be 1 − 0.72 = 0.28, the remaining answer, which is nothing but 28%. That is your final answer — these formulas you have to remember. Now we'll solve a problem.
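That normalization step is nothing more than dividing each unnormalized score by their sum. A quick sketch with the 0.13 and 0.05 scores from above:

```python
# Unnormalised Naive Bayes scores from the example above
score_yes = 0.13
score_no = 0.05

# Normalise so the two posteriors sum to 1
p_yes = score_yes / (score_yes + score_no)   # about 0.72
p_no = 1 - p_yes                             # about 0.28
```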
Let's solve a problem — this will be a very, very interesting one. Let's say I have a dataset with the features day, outlook, temperature, humidity and wind; these are my input (independent) features, and play tennis is my output feature, which is specifically a binary classification (yes or no).

Now in this dataset I want to take out some information. What I'm actually going to do is take my Outlook feature and, based on it, create a smaller table that will give some information. First of all, find out how many categories there are in Outlook: one is sunny, one is overcast and one is rain — three categories. So I'm going to write them down: sunny, overcast, rain. For each of these I will record how many yes there are, how many no, and then the probability of yes and the probability of no.
now the next thing that we need to find
out is that with respect to Sunny how
many of them are yes see yes we have so
when we have sunny over here the answer
is no so I will increase the count over
here one then again I have sunny again
answer is no so I'm going to increase
the count to two with this sunny this is
basically no okay so again I'm going to
increase the count to three now with
sunny how many of them are yes one and
two so I have this one and this one so I
have two so I'm going to say with
respect to Sunny I have two
yes understand Outlook is my X1 X1
feature let's consider now the next
thing is that let's see with respect to
overcost with overcast how many of them
are yes so this overcast is there yes 1
2 3 and four so total four yes are there
with respect to overcast then with
respect to overcast how many are on no
you can go ah and find out it is
basically zero NOS then with respect to
rain how many of them are yes so here
you can see with respect to one rain yes
yes no no so this is nothing but 3 2
let's try to find out there are three is
two or
not one here also one yes is there right
so 3 yes two NOS so the total number of
yes and NOS if you count it there are
nine yes and five NOS this is my total
count so if you totally count this 9 + 5
is 14 you'll be able to compare that
there will be 9 yes and five NOS what is
the probability of yes when Sunny is
given so here you have 2X 9 here you
have 4X 9 here you have 3x 9 now if if I
say what is the probability of no given
Sunny now see probability of yes given
Sunny probability of yes given forecast
probability of yes given rain so it is
basically that I will just try to write
it in a simpler manner so that you'll
not get confused okay so this is my
probability of yes and this is my
probability of no but understand what
does this basically mean this
terminology basically means probability
of yes given Sunny probability of yes
given overcast probability of yes given
rain similarly what is probability of no
probability of no obviously you know
that 3x 5 is my first probability then
you have 0x 5 and then you have 2X 5 now
Now let's consider one more feature: temperature. In temperature, how many categories do I have? You can see hot, mild and cool. With respect to hot, mild and cool I will again have yes, no, probability of yes and probability of no. Now try to find out: with respect to hot, two no and two yes. Similarly with respect to mild: counting gives 4 yes and 2 no. With respect to cool: 3 yes and 1 no. Again the totals are 9 yes and 5 no, equal to the same thing we got before. Now go ahead and fill the probabilities: P(hot | yes) = 2/9, P(mild | yes) = 4/9, P(cool | yes) = 3/9; and for no, P(hot | no) = 2/5, P(mild | no) = 2/5, P(cool | no) = 1/5.

So these two tables have already been created. Finally, with respect to play itself, the total number of yes is 9 and no is 5 out of 14 in total. So if I ask what is the probability of yes alone, it is nothing but 9/14, and the probability of no is nothing but 5/14. These two values you also require.
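Both tables come from plain frequency counting. A sketch using the classic play-tennis rows (outlook, temperature, play) — these match the counts above; humidity and wind are omitted for brevity, and the helper name `likelihood` is my own:

```python
from collections import Counter

# Classic play-tennis dataset: (outlook, temperature, play) per day
data = [
    ("Sunny", "Hot", "No"), ("Sunny", "Hot", "No"), ("Overcast", "Hot", "Yes"),
    ("Rain", "Mild", "Yes"), ("Rain", "Cool", "Yes"), ("Rain", "Cool", "No"),
    ("Overcast", "Cool", "Yes"), ("Sunny", "Mild", "No"), ("Sunny", "Cool", "Yes"),
    ("Rain", "Mild", "Yes"), ("Sunny", "Mild", "Yes"), ("Overcast", "Mild", "Yes"),
    ("Overcast", "Hot", "Yes"), ("Rain", "Mild", "No"),
]

# Class counts: P(yes) = 9/14, P(no) = 5/14
label_counts = Counter(label for *_, label in data)

# Per-value counts for each feature, e.g. (Sunny, Yes) -> 2
outlook_counts = Counter((outlook, label) for outlook, _, label in data)
temp_counts = Counter((temp, label) for _, temp, label in data)

def likelihood(counts, value, label):
    """P(feature = value | class = label) by frequency counting."""
    return counts[(value, label)] / label_counts[label]
```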
Now let's say you get a new test data point where the outlook is sunny and the temperature is hot — tell me what the output is. This is my problem statement, so let me write it down:

P(yes | sunny, hot) ∝ P(yes) × P(sunny | yes) × P(hot | yes)

ignoring the denominator P(sunny) × P(hot) in the equation, because it is a constant — I'll get the same denominator when computing the probability of no. P(yes) I'm going to replace with 9/14, P(sunny | yes) is 2/9 and P(hot | yes) is 2/9, so the 9s cancel:

9/14 × 2/9 × 2/9 = 2/63 ≈ 0.031
Now go ahead and calculate: what is P(no | sunny, hot)? Here you have P(no) × P(sunny | no) × P(hot | no), divided by P(sunny) × P(hot) — which gets cancelled, because the denominator is a constant, guys. So what is P(no)? It is nothing but 5/14, so I'll write 5/14 here, multiplied by P(sunny | no), which is nothing but 3/5, multiplied by P(hot | no), which is nothing but 2/5. The fives cancel, and I'm getting 3/35, which on the calculator is nothing but ≈ 0.0857.
Let me write it down again: P(yes | sunny, hot), with sunny and hot as my independent features, is ≈ 0.031, and P(no | sunny, hot) is ≈ 0.0857. Now we'll try to normalize: 0.0857 / (0.031 + 0.0857) ≈ 0.73, which is nothing but 73%, and here I can basically say 1 − 0.73 = 0.27, which is 27%. So if the input comes as sunny and hot — if the weather is sunny and hot — what will the person do, will he play or not? The answer is no.
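The whole sunny-and-hot calculation, done with exact fractions (all values read off the tables derived above):

```python
from fractions import Fraction as F

# Priors and likelihoods from the play-tennis tables
p_yes, p_no = F(9, 14), F(5, 14)
p_sunny_yes, p_hot_yes = F(2, 9), F(2, 9)
p_sunny_no, p_hot_no = F(3, 5), F(2, 5)

# Unnormalised scores; the shared denominator P(sunny)*P(hot) is dropped
score_yes = p_yes * p_sunny_yes * p_hot_yes    # 2/63, about 0.031
score_no = p_no * p_sunny_no * p_hot_no        # 3/35, about 0.086

# Normalise: P(no | sunny, hot) comes out around 73%, so the prediction is "no"
p_no_posterior = score_no / (score_yes + score_no)
```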
Okay, now my next question: if your new data is overcast and mild, tell me what the probability will be using Naive Bayes. You can add any number of features — let's say we also consider humidity and wind; you would basically create the same kind of tables to find it out. But this will be an assignment, just do it: overcast and mild, with respect to Naive Bayes, try to solve it.

So the second algorithm that we are going to discuss is something called the KNN algorithm.
KNN is a very simple algorithm which can be used to solve both classification and regression. KNN basically means K nearest neighbours. Let's first discuss a classification problem. Say I have a binary classification problem with two clusters of data points, and suppose a new data point comes in between them. How do I say whether it belongs to this category or that category? If I built a logistic regression I might draw a dividing line, but in this particular scenario how do we come to a conclusion? Here we use K nearest neighbours.

Let's say my K value is five. What it is going to do is take the five closest points — say from one cluster you have two nearest points and from the other you have three. We see from the distances which points are nearest, and whichever class the maximum number of those points come from, we categorize the new point into that class. In this particular case the maximum number of points are from the red category — three points from red and two from white — so the point goes to red, just with the help of distance.

Which distances do we specifically use? We use two: one is the Euclidean distance and the other is something called the Manhattan distance. Suppose the two points are denoted (x1, y1) and (x2, y2). To calculate the Euclidean distance we apply the formula sqrt((x2 − x1)² + (y2 − y1)²). In the case of Manhattan distance we don't calculate the hypotenuse distance; we calculate the distance along the axes — from here to here, then here to here — which is |x2 − x1| + |y2 − y1|. So this is the basic difference between the Euclidean and Manhattan distances.
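The two distance measures can be written directly from these formulas — a minimal sketch that also generalizes to more than two coordinates:

```python
import math

def euclidean(p, q):
    """Hypotenuse distance: sqrt((x2 - x1)^2 + (y2 - y1)^2), generalised."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Axis-aligned distance: |x2 - x1| + |y2 - y1|, generalised."""
    return sum(abs(a - b) for a, b in zip(p, q))
```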
Now you may be thinking: Krish, fine, that is for classification — what do we do for regression? For regression also it is very simple. Suppose I have data points and a new point for which I want a prediction. Again we take the nearest five points — let's say my K is 5; K is a hyperparameter that we play with. The algorithm finds the five nearest points, and to produce the output for this particular point with K = 5, it calculates the average of all those points; once it calculates the average, that becomes your output. That averaging is the only difference between regression and classification. Because K is a hyperparameter, we try K from 1 to 50, check the error rate each time, and only select the model where the error rate is lowest.

Now, two more things with respect to K nearest neighbours: it works very badly with two things, one is outliers and one is an imbalanced dataset. Say I have an outlier: this is one of my categories, this is another, and consider an outlier from the first group sitting near the second. Now if I try to classify a new point there, you can see that the truly nearest points are basically blue, and it belongs to the blue category — but because of that outlier the algorithm considers the outlier to be the nearest neighbour, and the point gets treated as belonging to that group instead.
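Both KNN variants — majority vote for classification, average of the k nearest targets for regression — fit in a few lines. A sketch under those assumptions (the function name `knn_predict` is mine, not from the session):

```python
import math
from collections import Counter

def knn_predict(train, query, k=5, task="classification"):
    """train is a list of (point, target) pairs; point is a coordinate tuple."""
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    # Take the k closest training points by Euclidean distance
    neighbours = sorted(train, key=lambda pair: dist(pair[0]))[:k]
    targets = [t for _, t in neighbours]
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]  # majority vote
    return sum(targets) / k                           # average for regression
```

With three red points near the query and two white points far away, a K = 5 vote returns red, matching the walkthrough above.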
(To repeat, the formula for Manhattan distance uses the modulus: |x2 − x1| + |y2 − y1|.) That was it from my side, guys — and yes, I've also made detailed videos about whatever topics we have discussed today; you can directly go and search for that particular topic.

So this is the agenda of this session; we will try to complete all these things, and again we are going to understand the mathematical equations. In today's session we are basically going to discuss decision trees, and we are going to understand the exact purpose of a decision tree. With the help of a decision tree you are actually solving two different problems: one is regression and the other is classification. We'll try to understand both parts well; we will take a specific dataset and try to solve those problems.
age is less than 8 let's say I'm writing
this condition if age is less than or
equal to 18 I'm going to say print go to
college here I'm printing print college
and then I'll write else if age is
greater than 18 and pag is less than or
equal to 35 I'll say print work then
again I'll write else if age is let me
let me put this condition little bit
better then I'll write here L if if age
is greater than 18 and age is less than
or equal to 35 I'm going to say print
work basically people needs to work in
this age else I'm just going to consider
print retire so here is my ifls
condition over here now whenever we have
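The nested if-else above as runnable Python (the function name `life_stage` is my own, for illustration):

```python
def life_stage(age):
    # Same nested if/else rules a decision tree would encode
    if age <= 18:
        return "college"
    elif age <= 35:          # reached only when age > 18
        return "work"
    else:
        return "retire"
```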
Whenever we have this kind of nested if-else condition, we can also represent it in the form of a decision tree. In the decision tree we first have a root node. In this root node the first condition is age ≤ 18, so obviously I will have two branches, one saying yes and one saying no. If the condition is true we go down the yes side, where we basically have "college" — this is a leaf node. When the answer is no, we go to the next condition: I again create a node that checks age greater than 18 and less than or equal to 35, and again there are two branches, yes or no. If it is yes I print "work", so this is again a leaf node, and for no I do the further split, which is "retire".

So here you can see that this entire code that I have written has been converted into this kind of tree, where you specifically take yes/no decisions at each node. So can we solve a regression and a classification problem using decision trees by creating these kinds of nodes? In short, whenever we talk about decision trees, they are nothing but nested if-else conditions: with nested if-else we can definitely solve specific problem statements, but here we create the decision tree in a visualized way, in the form of nodes.

Now you need to understand what type of maths we will use. So let's do one thing: let's take a specific dataset, which I will definitely work through here in front of you, and we'll try to solve it; this will basically give you an idea of how we can solve these problems. Let me just open my snipping tool — so this is the dataset that I have.
set now this data set are pretty much
important because this probably in
research papers also probably people who
have come up with this algorithm they
usually take this they take this thing
but but right now this particular
problem statement if I talk about this
is a classification problem statement
okay but don't worry I will also help
you to explain I'll also explain you
about regression also how decision tree
regression will definitely work so let's
go ahead and let's try to understand
suppose if I have this specific problem
statement how do we solve this this is
my output feature play tennis yes or no
okay whether the person is going to pay
tennis or not yesterday or there after
yesterday or whenever you want so if I
have this input features like Outlook
temperature humidity and wind is the
person going to play tennis or not this
is what my model should predict with the
help of decision tree so how decision
tree will work in this particular case
first of all let's consider any any any
specific uh feature let's say that
Outlook is my feature so this will be my
first
feature which is specifically Outlook
now just tell me how many are basically
having no and how many are basically
having yes in the case of Outlook there
you'll be able to find out there are
nine yes see 1 2 3 4 5 6 7 8 9 and how
many NOS are there 1 2 3 4 5 I think 1 2
3 4 5 so nine yes and five NOS what we
are going to do in this specific thing
Now we have 9 yes and 5 no, and the first node I have taken is the Outlook feature. Focusing on this specific feature, how many categories do I have? One is sunny — you can see it over here — another is overcast, and another is rain, so three unique categories. Based on these three categories I will create three child nodes: one node for sunny, one for overcast and one for rain — so I am splitting on them.

Now just go ahead and see, within sunny, how many yes and how many no are there. Counting the rows where outlook is sunny: the answer is no once, twice, three times, and yes twice — so with respect to sunny I have 2 yes and 3 no. Understand, Outlook here is acting as my X1 feature, and I have selected it — you might ask why; it is up to the decision tree to select any of the features, and later on I'll explain how it selects, don't worry. Next, let's see overcast: in overcast I have 1, 2, 3, 4 yes and no no at all, so over here it will be 4 yes and 0 no. Then finally, when we go to the rain part: counting the yes and no rows for rain gives 3 yes and 2 no.
Let's go over it again: sunny definitely has 2 yes and 3 no, overcast has 4 yes and 0 no, and rain has 3 yes and 2 no. Now, looking at overcast, you need to understand two things: one is a pure split and one is an impure split. What does a pure split mean? In this particular scenario, in overcast I have either yes or no only — here you can see I have 4 yes and 0 no — so this is a pure split. Tomorrow, if in my dataset a new day — say day 15 — has outlook overcast, then I directly know the person is going to play. This node is called a pure node — why? Because it contains either all yes or all no; in this particular case I have all yes. So if I take this specific path, with respect to overcast my final decision is always going to be yes. That means I don't have to split further — from here I will definitely not split more, because I don't require it; it is a pure leaf node.

Now let's talk about sunny. In the case of sunny you have 2 yes and 3 no, so this is obviously impure. So what do we do? We take the next feature — and how do we calculate which feature to take next? I'll discuss that. Let's say after this I take temperature and start splitting again, since this node is impure, and this splitting will happen until we finally get a pure split. Similarly, with respect to rain we will go ahead and take another feature and keep splitting, unless and until we get a leaf node which is completely pure. I hope you understood how this exactly works.
exactly work now two questions two
questions is that Kish the first thing
is that how do we calculate this
Purity and how do we come to know that
this is a pure split just by seeing
definitely I can say I can definitely
say by just seeing that how many number
of yes or NOS are there based on that I
can def itely say it is a pure split or
not so for this we use two different
things one is
entropy and the other one is something
called as guine coefficient so we will
try to understand how does entropy work
and how does Guinea coefficient work in
decision tree which will help us to
determine whether the split is pure
split or not or whether this node is
leaf node or not then coming to the
second thing okay coming to the second
thing one is with respect to Purity
second thing your first most important
question which you had asked why did I
probably select Outlook how the features
are selected and here you have a topic
which is called as Information Gain and
if you know this both your problem is
solved so now let's go ahead and let's
understand about entropy or guinea
coefficient or Information Gain entropy
or guine coefficient oh sorry Guinea
coefficient I'm saying guine impurity
also you can say over here
I'll write it as guine impurity not
coefficient also I'll just say it as
Guinea impurity but I hope everybody is
understood till here let's go ahead and
let's discuss about the first thing that
is
How does entropy work, and how are we going to use the formula? The entropy formula is given by

H(S) = -p+ log2(p+) - p- log2(p-)

(I'll talk about what p+ and p- are in a moment), and the Gini impurity formula is

Gini impurity = 1 - sum over i = 1..n of (p_i)^2

I'll also talk about when you should use Gini impurity and when you should use entropy; note that by default decision tree classification uses Gini impurity.

Now let's take one specific example. I have feature 1 as my root node, and in this root node I have 6 yes and 3 no — very simple. Let's say this feature has two categories, and based on these two categories a split happens: in category C1 I have 3 yes and 3 no, and in the second category I have 3 yes and 0 no. Always understand that if you do the summation the child counts add back up to the root: 3 + 3 is obviously 6 yes, and 3 + 0 is obviously 3 no.
Now let's understand how we calculate: take this example and compute its entropy. I have already shown you the entropy formula; now let's understand the components. The minus signs are there; p+ basically means the probability of yes within the node, and p- means the probability of no (plus and minus are specifically for binary classes — you can call them positive and negative). For the pure child (3 yes, 0 no), the probability of yes is 3/3, so I can write

H(S) = -(3/3) log2(3/3) - (0/3) log2(0/3)

The second term obviously becomes zero — 0 divided by anything is zero, and 0 × log2(0) is taken as zero — and the first term is -1 × log2(1), and log2(1) is nothing but zero. So tell me, is this a pure split or an impure split? This is a pure split, and whenever we have a pure split the answer of the entropy is going to come to zero.
I'm going to define one graph: H(S) on the y-axis and p+ (or p-) on the x-axis. If my probability of plus is 0.5, what will the probability of minus be? It will also be 0.5, because it's just like p = 1 - q: if p is 0.5 then q is 1 - p, which is also 0.5. And when it is 0.5, my H(S) will be 1; that is the curve that gets formed. Now let's go ahead and try to calculate the entropy of this node, guys, the one with 3 yes and 3 nos: H(S) = -(3/6) log2(3/6) - (3/6) log2(3/6). If you do the calculation, each log2(1/2) is -1, so I'm actually going to get one. Why am I getting one? When you have three yes and three nos, the probability is 50/50, so when your p+ is 0.5 your H(S) comes out as one, and from the graph you can see the same thing. And if your p+ is zero or your p+ is one, that basically means it becomes a pure split, so for H(S) you are going to get zero. So always understand, your entropy will be between 0 and 1. This node is a completely impure split, because here you have a 50% probability of getting yes and a 50% probability of getting no. H(S) is the entropy, the entropy for the sample; that is the notation I'm using. So whenever the split is happening, the first thing done is the purity test, and the purity test is done with the help of entropy (I'll also show Gini impurity, don't worry). With entropy you'll be able to find out: if I'm getting one, that basically means it is an impure split, and if I'm getting zero, it is a pure split. So this is the graph, okay.
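The entropy calculation just described can be sketched as a small Python helper (the function name is my own):

```python
import math

def entropy(n_yes, n_no):
    """H(S) = -p+ * log2(p+) - p- * log2(p-); a 0 * log2(0) term counts as 0."""
    total = n_yes + n_no
    h = 0.0
    for count in (n_yes, n_no):
        p = count / total
        if p > 0:  # skip the 0 * log2(0) term, which is taken as zero
            h -= p * math.log2(p)
    return h

print(entropy(3, 0))  # pure split (3 yes, 0 no)   -> 0.0
print(entropy(3, 3))  # 50/50 split (3 yes, 3 no)  -> 1.0
```

Any mix of counts lands between 0 and 1, which is exactly the curve being drawn here.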
This is the graph, and this graph is basically the entropy graph. Again, understand: if your probability of getting yes or no is 0.5, that basically means a 50/50 split like 3 yes and 3 nos, and your entropy H(S) is going to be 1. If your probability is completely one, that basically means you're getting completely yes or completely no, so your entropy will be zero, which means it is a pure split. So at probability 0.5 you get the peak of one, and on either side it keeps on reducing. Now, so far you have understood the purity test: you use entropy to find out whether a split is pure or impure, and if it is impure you go ahead with a further division of the categories, taking another feature and dividing again; from these two children you would split the impure one further. And for any node you can plot its value on the graph: if its probability is, say, around 0.3, you go to the curve and read off the corresponding entropy, some value between 0 and 1. Let's go
ahead and discuss the second issue. We have discussed checking whether a split is pure or not, and we have understood that much. The next thing is, okay fine Krish, this is very good, you have explained it well, I know many people will say that, but there are some people I can't help. Now coming to the second problem: which feature do we take to split on? Because here I may have more than one possible split. So let's see, the second problem, which feature to take to split, is the problem that we are trying to solve.
Let's say that I have feature one over here, and it has two categories, C1 and C2. In the root let's say I have 9 yes and 5 nos, then in C1 I have 6 yes and 2 nos, and in C2 I have basically 3 yes and 3 nos. And in my data set I have features like F1, F2, F3. Now, another split I could start with is feature two, and in feature two I may have probably three categories, like C1, C2, C3. So with respect to the root node and all the other features, because after this I may also have to split again, taking another feature and splitting based on pure or impure splits, how do I decide: should I take F1 first, or F2 first, or F3 first, or any other feature first? How should I decide which feature to take and do the split with?
That is the major question, and for this we specifically use something called Information Gain. What is this Information Gain? I'll talk about it; first of all I will write the formula, computed first with feature one:

Gain(S, F1) = H(S) - sum over v in Values(F1) of (|Sv| / |S|) * H(Sv)

Don't worry, guys, if you have not understood the formula; I will explain each and every parameter, the entropy H(S), the sample sizes |Sv| and |S|, and the category entropies H(Sv).
Let's say that I'm taking this feature one split; you have already seen feature one. So this is my feature one, I have two categories C1 and C2, the root has 9 yes and 5 nos, C1 has 6 yes and 2 nos, and C2 has 3 yes and 3 nos. Now I will try to calculate the information gain of this specific split. To compute Gain(S, F1), the first thing I need to find out is H(S), and this H(S) is specifically of the root node; H(S) is nothing but the entropy of the root node. So tell me, how should I compute it? H(S) = -p+ log2(p+) - p- log2(p-); calculate along with me, I hope everybody knows this. What is the probability of plus in this root node? It is nothing but 9/14. So H(S) = -(9/14) log2(9/14) - (5/14) log2(5/14), and this calculation comes out as approximately 0.94. Just check whether you're getting this or not; you can use a calculator if you want. Now I have definitely found this out,
now now I have definitely found out this
this is specifically for the root node
now let's see the next thing the next
important thing which is this part what
is s of v and what is s and what is h of
SV now very important just have a look
everybody see this graph okay see this
graph I will talk about h of SV first of
all I'll talk about h of SV okay this
one this is the entropy of category one
you need to find and entropy of category
2 you need to find so if I write h of SV
of category 1 so what is category 1 for
this I'll write SC1 let's say I'm going
to write like this quickly calculate the
H of SV of this and this separately you
need to calculate so h of SV of C1 okay
so here again you'll write - 6X 8 log
base 2 6X
8us 2x 8 log base to 2x 8 I hope
everybody knows this how we got it so h
of SV basically means I'm going to
compute the entropy of this category and
this category so for that I will
basically write h of so here I will
write - 6 by8 log base 2 6X 8 - 2x 8 log
base 2 2x 8 so if I get it I'm actually
going to get 81 and similarly if I if I
calculate h of C2 quickly calculate how
much you are going to get guys 6X 8 6X 8
with respect to this we need to find out
So now we have all these values; we'll start plugging them into the equation. So here we finally have Gain(S, F1): I'm going to basically write 0.94 minus the weighted summation. Understand what |Sv| means: how many samples are in each category. For category one, if you really want to calculate it, it is nothing but eight; and the total number of samples, if I go and see the root with its 9 yes and 5 nos, is 14. So the first term becomes 8/14 multiplied by H(Sv), and H(Sv), the entropy of category one, is nothing but 0.81. Then you go back to the graph and see how many samples category two has: 3 + 3 is 6, so the second term becomes 6/14 multiplied by 1. So this is the entire thing, and after all the calculation:

Gain(S, F1) = 0.94 - (8/14 * 0.81 + 6/14 * 1) ≈ 0.048

So this is my gain for S with F1; here I have got this value. Amazing.
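The whole Gain(S, F1) computation above can be checked with a short sketch (the variable names are my own; the counts are the ones from the example):

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

root = [9, 5]                 # 9 yes / 5 no at the root
children = [[6, 2], [3, 3]]   # C1: 6 yes / 2 no, C2: 3 yes / 3 no

h_root = entropy(root)        # ~0.94
weighted = sum(sum(ch) / sum(root) * entropy(ch) for ch in children)
gain = h_root - weighted      # ~0.048

print(round(h_root, 2), round(entropy(children[0]), 2), round(gain, 3))
```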
I did this with feature one only; what about feature two? Let's say that this was my split for feature two, and suppose I get the gain for S with feature 2 as 0.51. If I get this, now tell me, with which feature should I start splitting first, F1 or F2? Based on these values you know that the information gain of (S, F2) is greater than the gain of (S, F1), so the answer is very simple: we will definitely use feature 2 to start the split. The thing to understand here is that if I really want to select which feature to start my splitting with, I have to calculate the information gain across all the candidate splits, and whichever has the highest Information Gain, we select that specific one. Now the question arises: Krish,
obviously this is good, but you had written about Gini impurity; what is the purpose of that, please explain, and why is Gini impurity used? So let me go ahead with Gini impurity. I told you that yes, you can obviously use entropy, but why Gini impurity? The Gini impurity formula I have specifically written as

Gini = 1 - sum from i = 1 to n of (p_i)^2

Now what is this p_i squared? Here n is the number of outputs, and how many outputs do I have? Two, yes or no. So I can expand the summation as 1 - (p+^2 + p-^2). So this is the formula for Gini impurity. Now you may be thinking,
okay fine, the calculation will obviously be very easy. Suppose I have a node which has 2 yes and 2 nos; in this particular case how do I calculate it? I will write 1 - ((1/2)^2 + (1/2)^2) = 1 - (1/4 + 1/4) = 1 - 2/4 = 1 - 1/2, so I will be getting 0.5. Now here, understand: this is a completely impure split, right? If you have an impure split, in entropy the output you get is one, whereas in the case of Gini impurity it is 0.5. So if I go ahead with the graph that I had created earlier, my Gini impurity curve will look something like this: at probability zero I'll obviously be getting zero, but whenever my probability of plus is 0.5 I'm going to get 0.5 at the peak, and that is the difference between Gini impurity and entropy.
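The Gini impurity examples above can be verified with a tiny sketch (the function name is my own):

```python
def gini(n_yes, n_no):
    """Gini impurity 1 - (p+^2 + p-^2) for a binary node."""
    total = n_yes + n_no
    p_yes, p_no = n_yes / total, n_no / total
    return 1 - (p_yes ** 2 + p_no ** 2)

print(gini(2, 2))  # completely impure split -> 0.5
print(gini(3, 0))  # pure split              -> 0.0
```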
But you may be asking, Krish, when to use what? Now let's understand when to use Gini and when to use entropy. Tell me guys, if I consider this formula of Gini impurity and if I consider the entropy formula, which one do you think will take more time to execute? See, understand that a decision tree already has a bad time complexity, because if you have 100 features you'll keep on comparing, dividing many features and computing the Information Gain for each, like this, even with just 100 features. So which is faster, entropy or Gini impurity? Understand: in entropy you have a log function, while here in Gini you have simple maths. Out of entropy and Gini impurity, the more time is taken by entropy. So if you have a huge number of features, like 100 or 200 features, and you are planning to apply a decision tree, I would suggest using Gini impurity rather than entropy; if you have a small set of features, then you can go ahead with entropy. So definitely, with respect to speed, Gini is faster than entropy. Now let's go ahead and
understand the next case. You may be thinking, Krish, okay fine, you have basically explained categorical variables over here, but what if I have a numerical feature? Let's say I have an F1 feature which is a numerical feature, together with the output column, and let's say initially I have values like 2.3, 1.3, 4, 5, 7, 3. This is a continuous feature. So for a continuous feature, how will the decision tree calculate the entropy and the Information Gain? Here you'll see that the decision tree will first of all sort these values, so in F1 I have 1.3, then 2.3, then 3, then 4, then 5, and then 7. Now whenever you have
a continuous feature, how will it basically work? First of all, your decision tree will take just the first sorted value and form the condition: is it less than or equal to 1.3? Here you'll be getting two branches, yes or no, and your outputs will be placed under each. In the yes branch you'll be having one record, and in the no branch you'll be having the remaining five records, and for each side you'll be able to see how many yes and no outputs are there; the yes side will definitely be a leaf node. So in the first instance the tree will go ahead and calculate the information gain of this split. Then, once that Information Gain is obtained, what
it will do is take the next candidate and create a new split, say less than or equal to 2.3; now the yes side will have two records, with their counts of yes and no, and all the remaining records will come to the other side, and again the Information Gain will be computed. Then it goes to the next value, forming the condition less than or equal to 3, creates those nodes, counts how many yes and no are there, and again computes the Information Gain. Like this it will do it for each and every candidate value, and finally, whichever split gives the highest Information Gain, it will select that specific value in that feature and split the node there. So whenever you have a continuous feature, this is how it basically works: the candidate with the best Information Gain gets selected, and from there the splitting happens.
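The threshold scan for a continuous feature, as just described, might be sketched like this (the helper names and the tiny toy labels are my own assumptions):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / len(labels)
        h -= p * math.log2(p)
    return h

def best_threshold(values, labels):
    """Try an 'x <= v' split at every sorted value; keep the highest information gain."""
    parent = entropy(labels)
    best_value, best_gain = None, -1.0
    for v in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        if not left or not right:  # a split must send records both ways
            continue
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if parent - weighted > best_gain:
            best_value, best_gain = v, parent - weighted
    return best_value, best_gain

values = [1.3, 2.3, 3, 4, 5, 7]  # the sorted feature values from the example
labels = ["no", "no", "no", "yes", "yes", "yes"]  # toy outputs, my own assumption
print(best_threshold(values, labels))  # -> (3, 1.0)
```

With these toy labels the candidate `x <= 3` separates the classes perfectly, so it wins with the maximum gain of 1.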
Now let's go ahead and understand the next topic: how this entire thing works in a decision tree regressor, because in a decision tree regressor my output is a continuous
variable so suppose if I have one
feature one feature two and this output
is a continuous feature it will be
continuous any value can be there so in
this particular case how do I split it
So let's say that the F1 feature is getting selected. Now, when this F1 feature gets selected, what value will the node carry? First of all, the mean of the entire output will get calculated, so here I will have the mean. And here the cost function that is used is not the Gini impurity or entropy; here we use mean squared error, or you can also use mean absolute error. Now, what is mean squared error? If you remember from our linear regression:

MSE = (1/n) * sum from i = 1 to n of (y_i - y_hat_i)^2

this is what mean squared error is.
So what it will do first, based on the F1 feature, is assign the mean value as the node's prediction, then compute the MSE value, and then go ahead and do the splitting. Now, when it is splitting based on the categories of the continuous variable, I will have different categories, and after the split some records will go to each child; each child then gets the mean value of its own records as its output, and again the MSE gets calculated there. As the MSE gets reduced, that basically means we are reaching near the leaf node, and the same thing happens on the other side. Finally, when you follow a path down, whatever mean value is present at the leaf, that will be your output. This is the difference between the decision tree regressor and the classifier: instead of using entropy and the rest, you use mean squared error or mean absolute error, and that is the formula of mean squared error.
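The MSE criterion and its reduction after a split can be sketched as follows (the names are my own; the outputs 20, 24, 26, 28, 30 are the illustrative values used next):

```python
def mse(ys):
    """Mean squared error around the node mean -- the regression-tree impurity."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def mse_reduction(parent, left, right):
    """How much a split reduces the weighted MSE (regression analogue of information gain)."""
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
    return mse(parent) - weighted

ys = [20, 24, 26, 28, 30]                      # illustrative output values
print(round(mse(ys), 2))                       # impurity at the root -> 11.84
print(round(mse_reduction(ys, [20, 24], [26, 28, 30]), 2))  # -> 8.64
```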
Now let's go to one more topic, which is called the hyperparameters. Tell me, for a decision tree, if I keep on growing it to any depth, what kind of problem will it face? You want me to explain the regressor part first? Okay, let's do the decision tree regressor. Let's say I have feature F1 and this is my output, with values like 20, 24, 26, 28, 30, and this feature one has some categories, category one and so on. Let's say I have done the division by F1, that is, this feature. Initially, tell me, what is the mean of the output? That mean value will get assigned at the root; then using MSE, that is mean squared error, you calculate the impurity here, and suppose I get an MSE of some value like 37 or 47, something like that. Then I will try to split this, and I will get two or three more nodes, it depends; those specific nodes will be part of this tree, and in each the mean will change again. Suppose these two records go here; then again the MSE will get calculated. I'm just taking an example over here, just try to picture this. Now if I
talk about hyperparameters: always understand, a decision tree leads to overfitting, because we will just keep dividing the nodes to whatever level we want, and this obviously leads to overfitting. Now, in order to prevent overfitting, we perform two important steps: one is post-pruning and one is pre-pruning. Let's say
splits I have done some splits let's say
over here I have seven yes and two
no and again probably I do the further
split like this now in this particular
scenario you know that if 7 yes and two
NOS are there there is a maximum there
is more than 80% chances that this node
is saying that the output is yes so
should we further do more
pruning the answer is no we can close it
and we can cut the branch from here this
technique is basically called as post
pruning that basically means first of
all you create your decision tree then
probably see the decision tree and see
that whether there is an extra Branch or
not and just try to cut it there is one
more thing which is called as
pre-pruning now pre-pruning is decided
by hyperparameters what kind of hyper
parameters you can basically say that
how many number of decision tree needs
to be used not number of decision tree
sorry over here you may say that what is
the max
depth what is the max depth how many Max
Leaf you can
have so this all parameters you can set
it with grid SE
CV and you can try it and you can
basically come up with a pre- pruning
technique so this is the idea about
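A pre-pruning search of this kind might look as follows with scikit-learn's GridSearchCV (the particular grid values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Pre-pruning: constrain the tree before it grows, searching the constraints with GridSearchCV
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_, round(search.best_score_, 3))
```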
On the question about the decision tree regressor: yes, that is possible. Someone asks whether the Gini value can be one; no, are you talking about this Gini graph? It will not be one, it will always be between 0 and 0.5. So, first things first, as usual,
what we should do is import the libraries. So here I will go ahead and import them; I'll say import pandas as pd, import matplotlib.pyplot as plt. These basic things I have with me. Then I will go and take any data set I want: from sklearn.datasets import load_iris; let's say I'm going to take the iris data set, and then I'm going to load it, so I call load_iris() and there is my iris data set. Then, as the next step, once you get your iris data set, this is my iris.data.
Okay, these are all my features; there will be four features: petal length, petal width, sepal length and sepal width. These are my independent features. Then, if I really want to apply a classifier, a decision tree classifier, I can first of all import it: from sklearn.tree import DecisionTreeClassifier. (Let me see where the decision tree is present in sklearn... the name was absolutely fine, but I was not getting it over here; this gave a 'no module named sklearn' error because of a typo in sklearn.) So here you have the classifier. Right now I'm just going to
overfit the data, then I'll probably show you how you can go ahead with pruning. By default, what are the parameters over here? If you go and see the classifier, you have criterion; see, the first parameter is criterion, and by default it is gini. Then you have splitter; splitter basically means how you're going to split, and there you have two types, best and random (random selects the features randomly); you should always go with best. max_depth is a hyperparameter, min_samples_leaf is a hyperparameter, and max_features, how many features we are going to consider, is also a hyperparameter. So all these things are hyperparameters. I will just execute with whatever the decision tree gives me by default, and the next thing that
I'm actually going to do is draw the decision tree. For this I will be using plt.figure with a figsize inside it, and I will use a bigger figure size so that everybody will be able to see it; let me take an area of (15, 10), so plt.figure(figsize=(15, 10)). Then I'm going to say tree.plot_tree with the classifier, and it should be filled, the coloring should be filled. Okay, I also have to import tree, so from sklearn import tree. Again I'm getting an error, 'has no attribute plot'; why? Let me just see the documentation, guys. So this plot function is plot_tree, with an underscore: tree.plot_tree. Now what is the error we are getting? Okay, 'not fitted yet', sorry; so I'm going to say classifier.fit on the data, what data, iris.data, and I'm going to fit it with iris.target. So once this is done, I think now it will get executed. So this is how the graph will
look, guys. So here you can see how your graph looks. Now if I show you the graph over here, you can see some amazing things: three output classes are actually there. When you look at the left hand side, this becomes a leaf node, and this first one is probably the versicolor flower. If you go to the right hand side, here you can see a 50/50 node; so based on one feature you'll see that you are getting a leaf node, and on another branch you are getting 50/50. Then you have two more features getting split over here, so here you have 49/5 and here you have 47/1. Do we require this split? Anybody tell me, from here, do we require any more splits? Just try to think; this is the post-pruning view, I want to find out whether more splits are required or not. In this particular case, after this, do you require any split? You do not, right? Here you are basically getting 47 and 1, and after this also you require no split; understand this. So this is basically post-pruning: you can then decide your level and prune accordingly. Someone says a Gini value is coming as more than 0.5; here it is coming as 0.667, and I said the maximum should be 0.5, only 0 to 0.5 should come. That bound, though, is for two classes; with three classes, as in iris, the Gini value can go up to 1 - 1/3 ≈ 0.667, which is why that node shows it. Everywhere else, with purer nodes, you're getting less than 0.5. Plotting the graph is very easy: you use from sklearn import tree, then call tree.plot_tree with the classifier and filled=True, and you can just do that.
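Putting the whole live-coded demo together, a cleaned-up version might look like this (the Agg backend and the output filename are my own choices for a headless run):

```python
import matplotlib
matplotlib.use("Agg")                      # headless backend, my own choice
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
classifier = DecisionTreeClassifier()      # defaults: criterion="gini", splitter="best"
classifier.fit(iris.data, iris.target)     # fit first, or plot_tree raises NotFittedError

plt.figure(figsize=(15, 10))
plot_tree(classifier, filled=True,
          feature_names=iris.feature_names, class_names=iris.target_names)
plt.savefig("iris_tree.png")               # output filename is my own choice
```

The fully grown tree fits the iris training data essentially perfectly, which is exactly the overfitting that the pruning discussion addresses.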
Now let me define the
agenda, what all the things are. First we'll understand ensemble techniques; within ensemble techniques we are basically going to discuss the difference between bagging and boosting. So the agenda of this session is ensemble techniques, bagging and boosting; then we are probably going to cover random forest, then AdaBoost, and if I have more energy I will also try to cover XGBoost. All these algorithms we'll discuss. So let's go ahead and start the
topics. The first topic we are going to discuss is ensemble techniques. Now, what exactly are ensemble techniques? Till now we have solved two different kinds of problem statements, classification and regression, and you have learned different algorithms: linear regression, logistic regression, we have discussed KNN, and yesterday we discussed naive Bayes; different algorithms we have already finished. With respect to those classification and regression problems, whatever algorithm we were discussing, there was only one algorithm at a time being used to solve the problem. Now the next thing is: can we use multiple algorithms to solve a problem? If I ask this specific question, I will definitely say yes we can, because we are going to use something called ensemble techniques. In ensemble techniques we specifically use two different ways: one is something called the bagging technique, and the other is something called the boosting technique. What exactly do we do in bagging, what do we do in boosting, and how do we combine multiple models to solve a problem? Let's first of all discuss bagging.
Now, how does bagging work? Let's say that I have a specific data set, with features, rows, columns, everything; just imagine I have many features over here, like F1, F2, F3, and I have my output. So this is my data set D, let's consider it. What we do in bagging is create models, and each model can be anything; for a classification problem it can be logistic regression, it can be a decision tree. Let's say model M1 is logistic, M2 is probably another model like a decision tree, M3 is a KNN classifier, and M4 can again be a decision tree; it's fine, let's use another decision tree. So here you can see that we have used many models. Now, with
respect to these models, what I will do first, from this particular data set, is just take up some rows; I'll basically do row sampling. I'll take a row sample D' of the data set D, and D' is always smaller than D; some of the rows I'll push to M1, and this model one will be trained on them. Let's say that out of 10,000 records I'm doing a row sampling of 1,000 rows and giving them to M1 to train. Then for model M2, again I'm going to do row sampling and give another sample of rows to model two, and remember, some of the rows may get repeated between this D' and the next D''. Similarly I will do row sampling for the others, so I may have D''', D'''' and so on, different rows for each. When I say row sampling, I'm basically talking about data points: different data points I will give to separate models, and each model will train on its own sample. If 10,000 is my total number of data points, then D' may be 1,000 points, D'' may be another 1,000 points, and some of the rows may get repeated across them. So here, specifically, row sampling
will be used. Now, with this many models, each and every model will be trained with a different chunk of data. So how will the inferencing happen for the test data? First things first, let's say that I'm going to get a new test data point over here; the new test data will be
passed to M1, and suppose this M1 gives zero as my output; let's say I'm doing a binary classification and it gives a zero. Next, M2 for the new test data gives one, M3 gives one, and M4 also gives one as the output. Now what will happen in this particular case? You can see over here it's simple; what do you think the output will be? M1 has predicted zero for this test data, M2 has predicted 1, M3 has predicted 1, and M4 has predicted 1. So finally, all these outputs are going to get aggregated, and the simple rule that gets applied is majority voting. So tell
respect to this the output will
obviously be one because the majority
voting that you can see three people are
basically saying it as one so my output
over here will be one okay this is the
concept of bagging wherein you are
providing different different rows with
probably all the features in this case
and giving it to different different
model again which is a classification
model and then finally you are combining
them based on majority voting and you're
getting the answer as one so this step
is called as bootstrap aggregator that
basically means you're aggregating all
the output that is basically coming from
all the specific models now many
people will say Krish
what about a tie guys like this kind of
situation you know we will be having
more than 100 to 200 models so it is
very very difficult that it will be a
tie so what if
you're saying that 50% of the models
say yes and 50% of the models say no
always understand guys we will be having
more than 100 to 200 plus models so in
this particular case there will be high
probability that always there will be a
majority voting available it will always
not be in that specific scenario so this
was the concept about bagging now some
people will be saying that Krish why are
you using different different models
guys I'm not discussing about random
Forest over here random Forest uses only
one type of model that is decision tree
but if we think of the concept of bagging
you can have different different models
over here and you can basically combine
them so this is one of the Ensemble
techniques and this is basically called
as bagging okay now tell me one point I
missed out fine this is with respect to
the classification problem with respect
to the regression problem what will
happen in case of a regression problem
let's say that I got here 120 here 140
here 122 here 148 as my output so in
regression what will happen is that the
entire mean will be taken mean will be
taken the output mean will be basically
taken and that will be your output of
the model average or mean very simple
right so average or mean will be
basically taken up and here based on the
average you'll be able to solve the
regression problem great now let's go
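Both aggregation rules just described, majority voting for classification and the mean for regression, can be checked with the lecture's own numbers:

```python
from collections import Counter
from statistics import mean

# Classification: predictions of M1..M4 for one test point (lecture values).
votes = [0, 1, 1, 1]
majority = Counter(votes).most_common(1)[0][0]
print(majority)        # majority voting gives 1

# Regression: the four models output 120, 140, 122, 148 (lecture values).
preds = [120, 140, 122, 148]
print(mean(preds))     # the mean, 132.5, is the bagged prediction
```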
ahead and try to understand with respect
to bagging and boosting how many
different types of algorithm are but
before that I need to make you
understand what exactly is boosting now
here in bagging you have seen that you
have parallel models right independent
parallel models
you're giving some row samples in
different different models and basically
are able to find out the output now in
case of boosting boosting is a
sequential combination of models like
this you have lot of sequential models
like this one after the other like
first I'll give my training data to this
particular model then it will go to this
model then this model then this model so
this will be my M1 M2 M3 M4 and finally
I will be getting my output so here you
can basically say that boosting is all
about and this M1 M2 M3 we basically
mention it as weak Learners so this will
be weak learner weak learner weak
learner weak learner and finally when we
go till here if I combine all
these weak learners okay once I combine
all these weak learners it becomes a
strong learner finally if I
try to combine this it will basically
become a strong learner so here you have
all the models sequentially one after
the other and then you will probably try
to provide your uh input from one model
to the next model to the next model and
these all models will be a very simpler
weak learner model which will not be
able to predict properly but when you
combine all this particular models
together sequentially it becomes a
strong learner how this specifically
works I'll take an example of AdaBoost
and XGBoost I will show you that okay
weak learner basically means the
prediction is very bad but as you go
sequentially you combine them they
become a strong learner okay one example
I want to give you let's say that you
are a data scientist right let's say
that this model one may be a teacher
with respect to physics then this model
two may be a teacher with respect to
chemistry let's say model 3 is basically
a teacher of maths and model four is a
teacher of geography now suppose if you
are trying to solve one problem
obviously if the physics teacher is not
able to solve that particular problem
then probably chemistry can help or
maths can help or geography can help or
someone can help so when we combine this
many expertise together they will be
able to give you the output in an
efficient way Sumit I'll talk about it
where whether all the features are
basically passed to all the models or
not I'll just talk about it just give me
some time okay but I just want to give
you an idea about in short if someone
asks you in an interview what exactly is
boosting okay boosting is you can just
say that it is a sequential set of all
the models combined together and these
all models that I initialized are
usually weak Learners and when they are
combined together they become a strong
learner and based on the strong learner
they give an amazing output and right
now if I say in most of the kaggle
competition they use different types of
boosting or bagging technique so we have
basically as I said
bagging and boosting in bagging what
kind of algorithm we specifically use we
use something called as random forest
classifier and the second model that we
specifically use is something called as
random
Forest regress so we specifically use
these two kind of models which I'm
actually going to discuss right now
after this and then in boosting we
basically use techniques like AdaBoost
gradient boost and number three extreme
gradient boost which we also say it as
XGBoost so let's
go ahead and let's discuss about the
first algorithm which is called as
random forest classifier and regressor
now first thing first let's understand
some things from the yesterday's class I
hope uh what is the main problem with
respect to decision tree whenever we
create a decision tree without any
hyperparameter does it not lead to
overfitting uh
whenever you probably have a decision
tree right it leads to something like
overfitting why overfitting because it
completely splits all the feature till
it's complete depth overfitting
basically means for training data the
accuracy is high for test data the
accuracy is low so training data when
the accuracy is high I may basically say
it as low bias and when the test data
accuracy is low high variance so low bias and high
variance yes obviously we can do pruning
and all guys but again understand
pruning is an extensive task probably if
you have 100 features and if you
have data points which is like 1 million
to do pruning also it is very much
difficult yes pre pruning can be done
but again we cannot confirm that it may
work well or not so right now with
respect to decision tree you have this
specific problem that is low bias and
high variance now with low bias and high
variance you know that my model is
basically the generalized model that I
should get it should have low bias and
low variance so if somebody asks you why
do you use random Forest you can
basically explain about decision trees
like this now my main aim is to convert
this High variance to low variance now I
will be able to convert this High
variance to low variance using random
forest classifier or random Forest
regressor now what does random Forest do
random Forest is a bagging technique
similarly I have a data set over here
let's say that I have this data set
and then here I will be having multiple
models like
M1
M2
M3 M4 let's say I have this four models
like this we have many many models now
with respect to these models all the
models are actually decision trees
in random forest all are decision
trees you don't have a different model
over there so over here you can see that
all the models are decision trees that
are going to get used in random
Forest so decision trees always gets
used in random Forest the first thing
that you should know now whenever we are
using decision trees you know that
decision tree if I by default if we try
to create it it may lead to overfitting
and because of that every decision tree
will basically give low bias and
high variance but if we combine in the
form of bootstrap aggregator this High
variance will be getting converted to
low variance because why because
majority of voting we will be taking
from this particular decision trees like
there will be many many decision tree so
they lot of outputs will be coming and
with the help of majority voting
classifier this High variance will get
converted to low variance now in random
Forest how it works in the first case if
I talk about random Forest over here two
things basically happen with respect to
the D data set let's say in the first model
we do some kind of row sampling plus
feature sampling that basically means we have to
select some set of rows and some set of
features and give it to M1 similarly you
do row sampling and feature sampling and
give it to M2 then you do row sampling
and feature sampling you give it to M3
and then you do row sampling and feature
sampling you give it to M4 now when you
do this so what will happen
independently you're giving some
features along with some rows now there
may be a situation that your features
may also get repeated and
your records or data points may
also get repeated so when you are
probably training your model with this
specific data sets and specific features
this model become expert in predicting
something right as I said one example
over here I'm giving a physics model
some data I'm giving chemistry data
chemistry model with some data similarly
here I'm giving some information to some
model so the model will be an expert
with respect to that specific data So
based on all this particular data
whenever I get a new test data so what
will happen suppose let's say that this
this is a classification problem the M1
model will be predicting zero this will
be predicting one this will be
predicting zero and this will be
predicting zero now in this particular
case again the majority voting
classifier or majority voting will
happen in the case of classification
problem and then here you will be
specifically able to get the output as
zero so I hope everybody is able to
understand all the models over here are
decision trees and based on that you
will be doing see in an interview
the things that I'm telling you over here
all the points are very much important
and similarly if you tell the
interviewer definitely your interview is
cracked in this kind of algorithm I've
seen some of my students saying that
okay uh Krish um when the interviewer
asked me that which is my favorite
algorithm I said random Forest I asked
why did you say like that and he
said because that way the interviewer will let
him ask any questions on random Forest and
I'm very much confident about it and I'm
also going to prove to him you know
why it is very very good so with this
specific case here you can basically see
that because of the overfitting
condition of the decision tree you're
combining multiple decision tree so that
you get a generalized model which has
low bias and low variance so I hope
everybody is able to understand now
feature sampling basically means suppose
if I have 1 2 3 4 features for the
first model I may give two features for
the second model I may give three
features for the fourth model I may give
four features or uh any one feature also
I can give to a specific model so
internally random Forest takes
care of these things
and this is how random Forest
Works only the difference between random
Forest classify and regression is that
in regression again whatever output you
are basically getting you basically do
the mean that's it average you just do
the average you'll be able to get the
output based on all the models output
that you are actually getting now let's
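Row sampling plus feature sampling per tree can be sketched like this (a simplified illustration of the idea, not sklearn's implementation; the dataset shape and seed are made up):

```python
import random

random.seed(1)

n_rows, features = 8, ["F1", "F2", "F3", "F4"]

# Each tree gets rows drawn WITH replacement (rows may repeat) and a
# random subset of the features (subsets may overlap between trees).
trees = []
for _ in range(4):
    row_sample = random.choices(range(n_rows), k=n_rows)
    feat_sample = random.sample(features, k=random.randint(2, 4))
    trees.append((row_sample, feat_sample))

for i, (rows, feats) in enumerate(trees, start=1):
    print(f"tree {i}: rows={rows} features={feats}")
```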
talk about some of the important points
in random Forest the first thing first
question is that is normalization
required in random Forest then the next
question is that in KNN is normalization
or standardization required
when I say normalization or
standardization I'll just talk about
standardization is it
required so this will be my another
question so is normalization required in
random forest or decision tree you here
you can also say it as decision tree is
it required so for this the answer will
be no because understand decision tree
will basically do the splits even if you
scale down the data the splits won't
change much but if I talk
about KNN whether standardization
normalization required over here the
answer is yes because here we use two
things one is Euclidean distance and
Manhattan distance because of this you
definitely have to apply standardization
so that the computation or distance
becomes easy so this is one of the most
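Why distance-based KNN needs scaling while tree splits do not can be seen with two made-up points where one feature (say a salary) has a much larger scale:

```python
from math import dist

# Two points: feature 1 is small-scale, feature 2 is salary-scale.
a, b = (1.0, 50_000.0), (2.0, 52_000.0)
print(dist(a, b))   # Euclidean distance is dominated by the salary feature

# After standardizing each feature (toy means and stds for illustration),
# both features contribute comparably to the distance.
def standardize(p, means=(1.5, 51_000.0), stds=(0.5, 1_000.0)):
    return tuple((x - m) / s for x, m, s in zip(p, means, stds))

print(dist(standardize(a), standardize(b)))
```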
common interview questions that is
basically asked in random Forest coming
to the third question is random Forest
impacted by outlier
over here the answer will be no you can
also verify this yourself just
Google it and check it out
okay perfect so I hope I've
covered most of the things in random
Forest is random Forest impacted by
outliers this is the third question is
KNN impacted by
outliers is this KNN algorithm impacted
by outliers is KNN impacted Byers the
answer is yes big yes perfect so so
these all are the interview questions
that needs to be covered now let's go
ahead and discuss about adab boost now
in bagging most of the time we
specifically use random forest or you
can also create custom bagging
techniques custom bagging techniques
means whatever algorithm you want use
the combination of them and try to give
the output this also you can do it
manually with the help of hands okay
guys so second thing uh we are going to
discuss about is boosting technique in
this
the first thing that uh first algorithm
that we are going to discuss about is
adab Boost so adab boost we going to
discuss about how does adab Boost uh
work now let's solve uh the first
boosting technique which is called as
adab boost okay and uh this is a
boosting technique um in the boosting
technique you have heard that we have to
basically solve in a sequential way this
at least you know I know there is a lot
of confusion within you all but we'll
try to solve a problem let's say so
suppose I have a data set which looks
like this F1 F2 F3 F4 so these are my
features and probably these are my
output okay so let's say that I'm having
this features like this and this is my
output like yes or no like this so let's
say that how many records I have over
here 1 2 3
4 5 6 and one more is there 7 so these
seven records are there now in adab
boost the first thing is that
specifically with adab Boost uh you
really need to understand that what all
things we can basically do how do we
solve this classification problem that
we are going to understand the first
thing first is that we Define a weight
and the weight is very much simple
initially to all the records to all this
input records we provide an equal weight
now how do we provide an equal weight we
just go and count how many number of
records are there now in this particular
case the total number of records are one
2 3 4 5 6 7 now every record I have to
provide an equal weight that is between
0 to 1 so the overall sum should be one
so in this particular case what I can do
if I make 1/7 1/7 1/7 to everyone
this will definitely become
equal weights to all right and if I do
the total sum it will obviously be one
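The equal initial sample weights can be verified in a couple of lines:

```python
from fractions import Fraction

n_records = 7
# AdaBoost starts by giving every record the same weight 1/N,
# so the seven weights sum to exactly 1.
weights = [Fraction(1, n_records)] * n_records
print(weights[0], sum(weights))
```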
let's go to the next one now after this
what do we do okay after this in AdaBoost
the first thing that we do is that we
take any of this feature how do you
decide which feature to take whether we
should go with F1 or whether we should
go with FS2 or whether we should go with
F3 this we can do it with the help of
Information Gain and Information Gain
and entropy or guinea right based on
this we can definitely understand
whether we should start making decision
here also you specifically make decision
trees so here what you do is that you
probably have to determine by using
which feature I have to start my
decision tree so suppose out of all this
feature one feature two feature three
you have selected that okay the
information gain and entropy of feature
one is higher so I'm going to use
feature one and probably divide this
into decision trees now when I divide
this into decision tree let's say that
I'm dividing like this into decision
tree this decision tree depth will be
only one one depth and this depth since
it has only one depth we basically call
it as stumps so what we do over here
specifically we will create a decision
Tre by taking only one feature and we
will only divide it to one level okay
one level or one depth that's it
and this is specifically called as a stump
what we are going to do next is that
from this particular stump okay the
stump is basically getting created only
one so that is AdaBoost right we say
it as weak Learners there is a
reason we say this as a weak learner
so that is the first
thing with respect to uh this particular
adab boost so the first step is that
this is a weak learner so for the weak
learner we basically create a stump
stump basically means one level decision
tree that's it based on the information
gain and entropy I have selected the
feature and then I just made a decision
tree with only one level that is why
it is called a weak learner
okay so that is the reason we use only
a stump that is just a one level decision
tree now the next step happens is that
we provide all the specific records to
this F1 and we train this specific model
only with one level decision tree we
train them
now after we train them let's say that
we are going to pass all these
particular records to find out how many
are correct and how many are wrong this
decision tree is basically
giving so let's say that out of this
entire records one
record was just given as
wrong let's say that this is
the record which was given as wrong okay
so let's say that this record output was
predicted wrong from this particular
model only one wrong was there after
training the model now what we need to
do in this specific case understand a
very important thing so let's say that
we have done this and probably after
this what we are actually going to do we
are going to calculate the total error
so how many error this particular model
made let's say that in this particular
case only one was wrong right so if I want
to calculate the total error how will I
calculate it how many of them are
wrong only
one is wrong and what is the weight of this
so I will go and write 1/7 so this is
specifically my total error out of this
specific model which is my stump over
here okay which is my F1 stump now this
is my first
step the second step is that I need to
see the performance of stump which stump
this specific stump and the performance
is basically checked by a formula which
is 1/2 log e of 1 minus total error
divided by total error why we are doing this
everything will make sense okay in just
a small time
everything will make sense the first
step that we do in adaab boost is that
we try to find out the total error the
second step we try to find out the
performance of stump now in this
particular case it will be 1/2 log e of
(1 - 1/7) / (1/7) so once I calculate it
it will be coming as
0.896 see again understand out
of all these features F1 F2 F3 I found out from
Information Gain and entropy that this
is the best feature let's say that I
have calculated this
as 0.896 so this is my second step the
first step is find out the total error
the second step is performance of stump
what is te te basically means total
error te basically means total error now
see see the steps okay see the steps
whenever I'm discussing about boosting
I'm going to combine weak Learners
together to get a strong learner now
what is the next step out of this now
what what will be my third step
understand over here my third step will
be to update all these weights and that
is the reason why I'm calculating this
total error and performance of stump so
my third step will basically be new
sample weight from the decision tree one
which is my stump so I'll say new sample
weight is equal to I need to update all
these weights why I need to update all
these weights again understand I'll
talk about it in just a second so if I want
to update the sample weights first
update I will do it for correct records
see for correct records whichever are
correct like all these records are
correct
now when I update the weights of
these particular
records they should reduce and for
the wrong records that I have this
update should increase why because
because if I increase this weights then
the wrong records that are there that
record should go to the next week
learner that is the reason why I'm doing
it now how to update this particular
weights for correct records for correct
records the formula looks something like
this weight
multiplied by e to the power of
minus this specific performance okay
so e to the
power of minus PS where PS is performance of
stump and then I will basically be able
to write 1/7 * e to the power of minus
0.896 if I do the calculation everybody
try to do it the answer will be about
0.05 now this is for correct records what
about incorrect records for the
incorrect
records the formula that we are going
to apply is weight
multiplied by e to the power of plus PS not
minus PS so here I'll write 1/7
multiplied by e to the power of
0.896 so if I go and probably calculate this
I'm going to get it
as 0.349 so these two are the weights that
I have got that basically means all
these records which are correct the new
updated weights will be 0.05 not for the
wrong record though so let me just see
what 1/7 is so
here you can see initially it was 0.142
now it has got reduced to 0.05 because all
these records are correct but the wrong
record value is 0.349 so my weight will
now become over here 0.349 now I will
just go ahead and write over here
my new weights
0.05 0.05 0.05 then the fourth
record 1 2 3 4 which is the wrong one
and then 0.05 0.05 0.05 so out of
1 2 3 4 5 6 7 records my fourth record
will basically get the new value of
0.349 now tell me guys if I do the
summation of all these weights is
it one
no I don't think so because if
I try to add it up it is not one but if
I go and see the original weights 1/7 each
if I combine all the things 1 2 3 4 5 6
7 those all sum to one so here I need
to find out my normalized weight now in
order to find out the normalized
weight because the entire summation
should be one we have to
normalize now in order to normalize all
you have to do is go and find out
the sum of all these things the
summation of all these things will be
0.649 all you have to do is divide
all the numbers
by 0.649
divide all the numbers by
0.649 and tell me what will be the answer
that you'll be getting so here your
normalized weights will now look like
0.077 0.077 0.077 and the wrong record's
value will be somewhere around
0.537 then again 0.077
0.077 0.077 here we are dividing everything
by 0.649 now this is my normalized
weight now after you get a normalized
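The whole weight-update pipeline, total error then performance of stump then shrinking the correct weights and growing the wrong one then normalizing, can be reproduced in a few lines. Note the lecture rounds intermediate values (0.05, 0.349, 0.649), which is why it lands on 0.077/0.537; with full precision the normalized wrong-record weight comes out to exactly 0.5 and each correct record to 1/12, about 0.083:

```python
from math import exp, log

n = 7
w = [1 / n] * n                  # step 0: equal initial weights 1/7
wrong = {3}                      # index of the one misclassified record

te = sum(w[i] for i in wrong)            # step 1: total error = 1/7
perf = 0.5 * log((1 - te) / te)          # step 2: performance ~= 0.896

# step 3: multiply correct weights by e^-perf, the wrong one by e^+perf
w = [wi * exp(perf if i in wrong else -perf) for i, wi in enumerate(w)]

# step 4: normalize so the new weights sum to 1 again
total = sum(w)
w = [wi / total for wi in w]

print(round(perf, 3), [round(wi, 3) for wi in w])
# -> 0.896 [0.083, 0.083, 0.083, 0.5, 0.083, 0.083, 0.083]
```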
weight we will try to create something
called as buckets because see one
decision tree we have already created
which is a stump and you know from this
particular stump what you're going to get
okay as an output then in the sequential
model we will go and combine another
model over here now it's the time that I
have to create this specific model now
in order to create this specific model I
need to provide some specific rows only
to this model to train because this
model is giving one wrong now what I
have to do is that whatever is wrong
along with other data points I need to
provide this specific model with those
records so that this model will be able
to train on this and probably be able to
get the output now let's create buckets
now based on buckets how the buckets
will be created over here I will take 0.077
sorry whatever is the
normalized weight value okay so I will start
creating my buckets buckets basically
from 0 to
0.077 what did I say now for this decision
tree or stump I need to provide some
records so the maximum number of record
that should be going should be the wrong
records that should go over here now how
do we decide that okay there should be a
way that we should be able to say that
that specific wrong number of Records
should go to that decision tree so for
that purpose what we do is that this
decision tree will randomly create some
numbers between 0 to 1 randomly create
those numbers between 0 to 1 and
whichever bucket it will come in like 0
to 0.077 then 0.077 to 0.154 then 0.154
to 0.231 see how the bucket is
getting created each normalized weight is
getting added to the previous boundary so
for the wrong record the bucket becomes
0.231 + 0.537 which is nothing but
0.231 to 0.768 then 0.768
to
0.845 like this you create all the buckets
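The buckets are just cumulative sums of the normalized weights, and picking a record means drawing a random number in [0, 1) and finding its bucket, which is weighted sampling. A small sketch (the seed is an arbitrary choice):

```python
import random
from bisect import bisect
from itertools import accumulate

random.seed(42)

# Normalized weights from above; record index 3 is the misclassified one.
weights = [0.077, 0.077, 0.077, 0.537, 0.077, 0.077, 0.077]
edges = list(accumulate(weights))   # bucket upper edges: 0.077, 0.154, ...

# Draw 7 random numbers in [0, 1) and map each to its bucket index.
picks = [bisect(edges, random.random()) for _ in range(7)]
print(picks)   # index 3, owning the widest bucket, tends to dominate
```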
okay you can create all the buckets now
tell me which record is basically having
the biggest bucket size obviously this
record so if I randomly create a number
between 0 to one what is the highest
probability that the values will be
going in so in this particular case most
of the wrong records will be passed
along with the other records obviously
other records there are chances that
other records will go to the next
decision tree but understand maximum
number will go with the wrong records
because the bucket is high over here so
the bucket is high over here so most of
the time this specific record will get
selected and then it will go
to the second tree now suppose I have
this all records
so this is my first stump this is my
second stump this is my third stump
similarly the third stump from the
second stump whichever wrong records
will be going maximum number of Records
will go over here then again it will be
trained like this we'll be having lot of
stumps minimum 100 decision trees can be
added you know that every decision tree
will give one output for a new test data
new test data this weak learner will
give one output this weak learner will
give one output and
this weak learner will be giving
one output obviously the time complexity
will be more now from this particular
output suppose it is a binary
classification I will be getting 0 1 1 1
so again over here majority voting will
happen and the output will be one in
case of regression problem I will be
having a continuous value over here and
for this the average average will be
computed and that will give me an output
over here so for regression the average
will be done for classification what
will happen majority voting will be
happening so everywhere that same part
will be going on buckets is very much
simple guys buckets basically means
based on this weights normalized weight
we are going to create bucket so that
whichever records has the highest bucket
based on this randomly created number you
know it will select those specific
buckets and pass those records to the next stump
understand why this bucket size is Big
the other wrong records which are
present right suppose there are more
than four to five wrong records their
bucket size will also be bigger and
because based on this randomly creating
num between 0 to 1 most of the wrong
records will be selected and given to
the second stum similarly this
particular decision tree will be doing
some mistakes then that wrong records
will get updated all the weights will
get updated and it will be passed to the
next decision tree guys when I say wrong
record the output will be same only no
zero and one so interesting everyone I
hope you understood so much of maths in
adab boost and how adab boost actually
work three main things one is total
error one is performance of stump and
one is the new sample weight these
things are getting calculated extensively
the normalized weight was basically used
because the sum of all these weights should be
approximately equal to one when boosting
why not take the last output no no no we
have to give the importance of every
decision tree output every decision tree
output are important okay let me talk
about one model which is called as
blackbox model versus white box what is
the difference between blackbox model
and white box if I take an example of
linear regression tell me what kind of
model it is is is it a white box model
or black box if I take an example of
random
Forest is this a white box or black box
if I take an example of decision tree it
is a white box or blackbox model if I
take an example of an ANN is it a white
box or blackbox model linear regression
is basically called as a white box model
because here you can basically visualize
how the Theta value is basically
changing and how it is coming to a
global Minima and all those things in
random Forest I will say this as
blackbox model because it is impossible
to see all the decision tree how it is
working so that is the reason the maths
is so complex inside this if I talk
about decision tree this is basically a
white box model because in decision tree
we know how the splits are basically
happening with the help of paper and pen
you'll be able to do it in the case of
an Ann this is a blackbox model because
here you don't know like how many
neurons are there how they are
performing and how the weights are
getting updated so this is the basic
difference between the blackbox and
white box model this entire thing is
the agenda of today's session so let's
start uh the first algorithm that we are
probably going to discuss today is
something called as K
means
clustering K means clustering and this
is a kind of unsupervised machine
learning now always remember
unsupervised machine learning basically
means that uh the one and the most
important thing is that in unsupervised
machine learning
in unsupervised ml you don't have any
specific output so you don't have any
specific output so suppose you have
feature one and feature two and suppose
you have different different data
you know and based on this data what we
do we basically try to create clusters
this clusters basically says what are
the similar kind of data so this is what
we basically do from uh clustering and
there are various techniques like K
means uh hierarchical clustering and all
so first of all we'll try to understand
about K means and how does it
specifically work it's simple uh suppose
you have a data points like this okay
let's say that this is your F1 feature
F2 feature and based on this in two
dimensional probably I will be plotting
this points and suppose this is my
another points so our main purpose is
basically to Cluster together in
different different groups okay so this
will be my one group and probably the
other group will be this group right so
two groups because obviously you can see
from this clusters here you have two
similar kind of data which is basically
grouped together right this is my
cluster one and this is my cluster 2 let
me talk about this and why specifically
it'll be very much useful then we'll try
to understand about math intuition also
now always understand guys uh where does
clustering gets used okay in most of the
Ensemble techniques I told you about
custom Ensemble techniques right so in
custom Ensemble
techniques you know whenever we are
probably creating a model first of all
on our data set what we do is that we
create clusters so suppose this is my
data set during my model creation the
first algorithm we will probably apply
will be clustering algorithm and after
that it is obviously good that we can
apply regression or classification
problem suppose in this clustering I
have two or three groups let's say that
I have two or three groups over here for
each group we can apply a separate
supervised machine learning algorithm if
we know the specific output that we
really want to take ahead I'll talk
about this and uh give you some of the
examples as I go ahead now let's go on
go ahead and focus more on understanding
how does K means clustering algorithm work
so let's go over here the word K means
has this K value this K are nothing but
this K basically means centroids K
basically means centroids so suppose if
I have a data set which looks like this
let's say that this is my data set now
over here just by seeing the data set
what are the possible groups you think
definitely you'll be saying K is equal
to 2 So when you say k is equal to two
that basically means you will be able to
get two groups like this and each and
every group will be having a centroid a
centroid Point here also there will be a
centroid point so this centroid will
determine basically this is a separate
group over here this is a separate group
over here so over here here you can
definitely say that fine this is two
groups but but how do we come to a
conclusion that there is only two groups
okay we cannot just directly say that
okay we'll try to just by seeing the
data because your data will be having a
high dimension data right right now I'm
just showing your two Dimension data but
for a high dimension data definitely
you'll not be able to see the data
points how it is plotted so how do you
come to a conclusion that only two
groups are there so for this there is
some steps that we basically perform in
K means the first step is that we try
with different K values we try with
different K values and which is the
suitable K value K is nothing but
centroids okay it is nothing but
centroids we try with different
different centroids in this particular
case let's say that I have this
particular data point and I actually
start with k is equal 1 or 2 or 3 any
one you want let's say that I'm going to
start with k is equal 2 how to come up
with this K is equal to 2 as a perfect
value that I'll talk about it we need to
know there is a concept which is called
as within cluster sum of square so when
we try different K values let's say that
for K is equal to 2 what will happen the
first step we select a we try K values
so let's say that we are considering K
is equal to 2 the second step is that we
initialize K number of centroids now in
this particular case I know my K value
is 2 so we will be initializing randomly
let's say that K is equal to 2 so what
we can actually do let's say that this
is this is my one centroid I will I'll
put it in another color so this will be
my one centroid and let's say that this
is my another centroid so I have
initialized two centroids randomly in
this space now after this particular
centroid what we have to do is that
after initializing this centroid what we
have to do is that we have to basically
find out which points are near to the
centroid and which points are near to
this centroid now in order to find out
it is a very easy step we can basically
use Euclidean distance to find out the
distance between the points in an easy
way if I really want to show you that
you know like how many points I want to
in an easy way what I can do I can
basically draw a straight line over here
let's say that I'm drawing a straight
line over here in another color I can
draw a straight line and I can also draw
one parallel line like this so This
basically indicates that whichever
points you see over here suppose if I
draw a straight line in between all
these points you will be able to see
that let's say that I'm drawing one more
parallel line
which is intersecting together so from
this you can definitely find out let's
say that these are all my points that
are nearer to this green line Green
Point so what I'm actually going to do
in this particular case all these points
that you are seeing near the green it
will become green color so that
basically means this is basically nearer
to this centroid and whichever points
are nearer to this particular point that
will become red point so that basically
means this belongs to this group okay
this belongs to this group so I hope
everybody's clear till here then what
will happen so first we try a K
value then we initialize the K
number of centroids that is done then we
try to calculate the distance we try to
find out which all points is nearer to
the centroid let's say that this is my
one centroid this is my another centroid
and we have seen that okay these all
points belong to this centroid it near
to this particular centroid so this is
becoming red so that is based on the
shortest distance and here it is
becoming green now the next step let's
see what is the next step after this so
I am going to remove this thing now the
next step will be that the entire points
that is in red color all the average
will be taken so here again the average
will be taken now third step here I'm
going to write here we are going to
compute the average the reason we
compute the average is that because we
need to update the centroid so compute
the average to update centroid to update
centroids so here you'll be able to see
that what I'm actually doing as soon as
we compute the average this centroid is
going to move to some other location so
what location it will move it will
obviously become somewhere in Center so
here now I'm going to rub this and now
my new centroid will be this point where
I am actually going to draw like this
let's say this is my new centroid now
similarly this thing will happen with
respect to the green color so with
respect to the green color also it will
happen and this green will also Al get
updated so I'm going to rub this and
this will be my new Green Point which
will get updated over here then again
what will happen again the distance will
be calculated and again a perpendicular
line will be calculated here you can see
that now all the points are towards
there okay again the centroid based on
this particular distance again it will
be calculated and here you can see that
all the points are in its own location
so here now no update will actually
happen let's say that there was one
point which was red color over here
then this would have become green color
but since the updation has happened
perfectly we are not going to update it
and we are not going to update the
centroid right so now you can understand
that yes now we have actually got the
perfect centroid and now this will be
considered as one group and this will be
basically considered as the another
group it will not intersect but right by
default here intersection is happening
so I hope everybody's understood the
steps that you have actually followed in
initializing the centroids in updating
the centroids and in updating the points
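The steps just described (pick K, initialize K centroids, assign each point to its nearest centroid by Euclidean distance, recompute each centroid as the average, repeat until nothing changes) can be sketched in a few lines of NumPy; the points and the helper name below are made up for illustration, not from the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: initialize k centroids, assign, average, repeat."""
    rng = np.random.default_rng(seed)
    # step 2: initialize k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 3: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: update each centroid to the average of its assigned points
        # (a production version would also guard against empty clusters)
        updated = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(updated, centroids):   # no centroid moved -> converged
            break
        centroids = updated
    return labels, centroids

# two well-separated groups in two dimensions (features F1 and F2)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, k=2)
```

Running this on the six points above puts the first three points in one cluster and the last three in the other, with the two centroids near the center of each group.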
is it clear everybody with respect to K
means now let's discuss about one
point how do we decide this K value okay
how do we decide this K value so for
deciding the K value there is a concept
which is called as elbow method so here
I'm going to basically Define my elbow
method now elbow method says something
very much important because this will
actually help us to find out what is the
optimized K value whether the K value
should be two whether uh the K value is
going to be three whether the K value is
going to become four and always
understand suppose this is my data set
suppose this is my data set initially
let's say that I have my data points
like this we cannot go ahead and
directly say that okay K is equal to
2 is going to work so obviously we are
going to go with iteration for I is
equal to probably 1 to 10 I'm going to
move towards iteration from 1 to 10
let's say so for every iteration we will
construct a graph with respect to K
value and with respect to something
called as W CSS now what is this W CSS W
CSS basically means within cluster sum
of
square okay this is the meaning of wcss
within cluster sum of square now let's
say that initially we start with one
centroid so one centroid let's say it is
initialized here one centroid is
basically initialized here if we go and
compute the distance
between each and every points to the
centroid and if we try to find out the
distance will the distance value be
greater or it will be smaller will it be
smaller or greater tell me if you try to
calculate this distance from this
centroid to every point this is what is
within cluster sum of square it will
always be very very much greater so
let's say that my first point has come
somewhere here it is going to be
obviously greater let's say that my
first point is coming over here fine
so with K is equal to 1 initially we
took and we found out the distance of w
CSS and it is a very huge value okay
because we're going to compute the
distance between each and every point to
the centroid now the next thing that I'm
actually going to do is that now we'll
go with next value that is K is equal to
2 now in K is equal to 2 I will
initialize two points okay I will
initialize two points and then probably
I will do the entire process which I
have written on the top now tell me
whichever points is nearer to this green
point if we compute the distance and
whichever points is nearer to the red
point if you compute the distance like
this now this summation of the distance
will be lesser than the previous W CSS
or not obviously it is going to be
lesser than the previous W CSS so what
I'm actually going to do probably with K
is equal to 2 your value may come
somewhere here then with K is equal to 3
your value May come somewhere here then
K is equal to 4 will come here to 5 6
like this it will go so here if I
probably join this line you'll be able
to see that there will be an Abrupt
changes in the W CSS value in the wcss
value there will be an Abrupt changes
and this is basically called as the
elbow curve now why we say it as elbow
curve because it is in the shape of
elbow and here at one specific point
there will be an Abrupt change and then
it will be straight so that is the
reason why we basically say this as
elbow okay so this is a very important
thing see in finding the K value we use
elbow method but for validating purpose
how do we validate that this model is
performing well we use the silhouette score that
I'll show you just in some time but
understand that in K means clustering we
need to update the centroids and based
on that we calculate the distance and as
the K value keep on increasing you'll be
able to see that the distance will
become normal or the wcss value will
become normal and then we really need to
find out which is the right K value where
the abrupt change see over here suppose
abrupt change is there and then it is
normal then I will probably take this as
my K value so obviously the model
complexity will be high because we are
going to check with respect to different
different K values and wcss values and
this basically means that the value that
we'll probably get first of all we need
to construct this elbow curve then see
the changes where it is basically
happening we'll need to find out the
abrupt change and once we get the abrupt
change we basically say that this may be
the K value so K is equal to 4 as an
example I'm telling you so unless and
until if you really want to find the
cluster it is very much simple we take a
k value we initialize K number of
centroids we compute the average to
update the centroids then again we try
to find out the distance try to see that
whether any points has changed and
continue that process unless and until
we get separate groups okay so this is
the entire funda of K means clustering
so finally you'll be able to see that
with respect to the K value we will be
able to get that many number of groups
if my K value is four that basically
means I will be probably getting four
different groups like this 1 two right
three like this and four I will be
getting four groups like this with K is
equal to 4 that basically means K is
equal to four clusters and every group
will be having its own centroids okay
centroids are very much important yes
I'll try to show you in the coding also
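As a minimal sketch of the elbow method (assuming scikit-learn is available; its `inertia_` attribute is exactly the WCSS described above, and the four-center synthetic data is my own illustration, not the lecture's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with four real groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

wcss = []
for k in range(1, 11):                        # try K = 1..10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                  # within-cluster sum of squares

# plotting range(1, 11) against wcss gives the elbow curve: WCSS keeps
# dropping as K grows, and the abrupt change flattens out near the true K
```

The WCSS list is strictly shrinking at the start, and the "elbow" is the K after which the decrease becomes marginal.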
guys let's go towards the second
algorithm the second algorithm that we
will be probably discussing is called as
hierarchical clustering now hierarchical
clustering is very much simple guys all
you have to do is that let's say this is
your data points this is your data
points and this is my P1 let's say P2
now hierarchical clustering says that we will
go step by step the first thing is that
we will try to find out the most nearest
Value let's say this is my X and Y let's
say these are my points like this is my
P1 point this is my P2 point this is my
P3 point this is my P4 Point P5 Point P6
point p7 point okay so these are my
points that I have actually named over
here let's say that this may be the
nearest point to each other so what it
will do it will combine this together
into one cluster this we have computed
the distance so it will C create one
cluster now what will happen on the
right hand side there will be another
notation which you may be using in
connecting all the points one so suppose
this is my P1 this is my P2 this is my
P3 P4 let's say that I have this many
points and probably I will also try to
make
p7 so these are my points p7 now you
know that the nearest point that we are
having okay this will probably be
distance 1 2 3 this is distance okay 4 5
6 like this we have lot of distance so
hierarchical clustering will first of all find
out the nearest point and try to compute
the distance between them and just try
to combine them together into one what
do we do we basically combine them into
one group okay so P1 and P2 has been
combined let's say then it'll go and
find out the other nearest point so
let's say P6 and p7 are near so they are
also going to combine into one group so
once they combine into one group then we
have P6 and p7 which will be obviously
greater than the previous distance and
we may get this kind of computation and
another combination or cluster will
get formed over here then you have seen
that okay P3 and P5 are nearer to each
other so we are going to combine this so
I'm going to basically combine P3 and
P5 okay and let's say that this distance
is greater than the previous one because
we are basically going to start with
the shortest distance and then we are
going to capture the longest distance
now this is done now you can see that
the next point that is near right to
this particular group is P4 so we are
going to combine this together into one
group so once we combine this into one
group this P4 will get connected like
this let's say it is getting connected
like this P4 has got connected then what
is the nearest Point whether it is P6 p7
group or P1 P2 obviously here you can
see that P1 P2 is there so I am probably
going to combine this group together
that basically means P1 P2 let's say I'm
just going to combine this group group
together again circle is coming so I
will make a dot let's say I'm going to
combine this group together because
these are my nearest groups so what will
happen P1 and P2 will get combined to P5
sorry P4 P5 this one so I will be
getting another line like this and then
finally you'll be seeing that P6 p7 is
the nearest group to this so this will
totally get combined and it may look
something like this so this will become
a total group like
this so all the groups are combined so
finally you'll be able to see that there
will be one more line which will get
combined like
this this is basically called as a
dendrogram okay which is built from the
bottom to the top now the question
arises is that how do you find that how
many groups should be here how do you
find out that how many groups should be
here the funda is very much clear guys in
this is that you need
to find the longest
vertical line you need to find out the
longest vertical line that has no
horizontal line pass through it no
horizontal
line passed through it this is very much
important that has no horizontal line
pass through it now what this is
basically meaning is that I will try to
find out the longest line longest
vertical line in such a way that none of
the horizontal line passes through it
what is horizontal line suppose if I
consider this vertical line This
vertical line over here if you see that
if I extend this green line it is
passing through this if I extend this
line it is passing through this right if
I'm extending this line it is passing
through this right so out of this the
longest line that may be passing in such
a way that no horizontal line probably
is this line that I can actually see so
what you do over here is that you
basically just create a straight line
over this and then you try to find out
that how many clusters it will be there
by understanding that how many lines it
is passing through if it is passing
through this one line two line three
line four line that basically means your
clusters will be four
clusters this is how we basically do the
calculation in heral clustering again
here it may not be the perfect line I've
just drawn with some assumptions but if
you are trying to do this probably you
have to do in this specific way okay
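The bottom-up merging and the dendrogram cut described above can be sketched with SciPy (assuming it is installed; the P1..P7 coordinates are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# seven points: P1/P2 close together, P3/P4/P5 close, P6/P7 close
points = np.array([[1.0, 1.0], [1.1, 1.0],              # P1, P2
                   [5.0, 5.0], [5.2, 5.1], [5.1, 4.9],  # P3, P4, P5
                   [9.0, 1.0], [9.1, 1.1]])             # P6, P7

Z = linkage(points, method="single")  # repeatedly merge the nearest pair/group
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 groups
# scipy.cluster.hierarchy.dendrogram(Z) would draw the bottom-to-top tree
```

With these points the three-group cut recovers exactly the P1/P2, P3/P4/P5, and P6/P7 groups.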
I've already uploaded a lot of practical
videos with respect to hierarchical
clustering and all now tell me
maximum effort or maximum time is taken
by is taken
by K
means or hierarchical clustering this is a
question for you yes guys number of
clusters may be three but here I'm just
showing you that how many lines it may
be passed by how do you basically
determine whether maximum time will be
taken by K means or hierarchical clustering this is
an interview question the maximum time
that will be taken is by hierarchical
clustering why because let's say that I
have many many data points at that
point of time hierarchical clustering will
keep on constructing this kind of
dendrograms and it will be taking a
lot of time right so hierarchical
clustering will take more time maximum
time that it is going to basically take
so it is very much important that
you understand which one is basically
taking more time so if your data set is
small you may go ahead with hierarchical
clustering if your data set is large go
with K means clustering in short both will
take time but K means will perform better than
hierarchical clustering see guys you will be
forming this kind of dendrograms right
and just imagine if you have 10 features
and many data points how you're going to
do it it will be a cumbersome process
you'll not be even able to see this
dendrogram properly and manually
obviously you cannot do it so this was
with respect to K means clustering and
hierarchical clustering I hope everybody's
understood now the next topic that we'll
focus on is that how do we
validate see how do we validate a
classification problem we use
performance metric like confusion Matrix
accuracy um different different true
positive rate Precision recall but how
do we validate a clustering model so
we are going to basically use
something called as the silhouette
score I'll show you what the silhouette score
is I'm going to just open the Wikipedia
so this is how the silhouette score looks like a
very very amazing topic okay how do we
validate whether my model basically has
the perfect three or four clusters
suppose if I find out my K value
is three how do we find out now see one
more issue with K means which I forgot to
tell you let's say that I have a data
point which looks like this and suppose
I have some data points like this I have
some data points which looks like this
let's say I have like this now in this
one issue will be that suppose I try to
make a cluster over here obviously
you'll be saying my K value will be two
okay in this particular case suppose
this is one cluster this is my another
cluster
right because of my wrong initialization
of the points okay understand because
suppose if I initialize just randomly
some centroids like this then what may
happen is that there is a possibility
that we may also have three clusters
like this kind of clusters one
cluster will be here one cluster will be
here one cluster will be here so this
initialization of the centroids one
condition is that it should be very very
far if we initialize our centroids very
very far at that point of time we will
be able to find the centroid exactly in
the center because it will keep on
updating it'll keep on going ahead right
but if we don't initialize that very far
then there will be a situation that
probably where the
real aim was to get only two centroids
I was probably getting three centroids
right so this is a problem so for this
there is an algorithm which is called as
K means++ and what this K means++
will do which I will probably show
you in Practical this will make sure
that all the centroids that are
initialized it is very very
far okay all the centroids that are
basically there are initialized very
very far we'll see that in the practical
application where specifically those
centroids are basically used okay
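As a quick sketch (assuming scikit-learn), k-means++ is simply an initialization option of `KMeans`; in fact it is the default, and it is what spreads the starting centroids far apart:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# two real groups in synthetic data (illustrative assumption)
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.7, random_state=0)

# init="k-means++" picks starting centroids that are far from each other,
# avoiding the bad random initialization discussed above
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
```

After fitting, `km.cluster_centers_` holds the two final centroids and `km.labels_` the group of every point.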
now let me go ahead and let me show
you with respect to the silhouette score now
what is the silhouette score I'm going
to explain you in an amazing way this is
important
if someone says you how do we validate
how do we validate a clustering
model then at that point of time we
basically use this silhouette score it will
be used with respect to K
means it can be used in hierarchical clustering
right if you want to validate how do we
validate okay that is what we are
basically going to see over here now in
silhouette scoring
what are the most important things the
first and the most important thing is
that we will try to find out we will try
to find out a of i we will try to find
out a of i now what is this a of i see
this a of i that you basically see a of i
is nothing but see three major steps
happen in order to validate a cluster
model with the help of silhouette first thing
is that I will probably take one cluster
okay there will be one point
which will be my point i let's say and
then what I'm going to do I'm just going
to take whatever other points are there inside this
cluster and I'm going to compute the
distance between them so I'm going to do
the summation and I'm also going to do
the average of all this distance so here
you can see that when I said distance of
i comma j i basically means this point and j
basically means all the other points in the
same cluster note that i is a data point
and not the centroid
so I'm going to compute all the distances
over here which is mentioned by this and
this value that you see that I'm
actually dividing by C of i minus one in
short I am actually trying to calculate
the average
distance so this is the first point
where I'm actually computing the a of i
now similarly what I will do is
that the next
thing will be that suppose I have
computed a of i the next thing that we
need to compute is b of i now what is b
of i b of i is nothing but there will be
multiple clusters in a k means problem
statement we will try to find out the
nearest cluster okay suppose let's say
that this is the nearest cluster and in
this I have all the variety of points
then B ofi basically says that I will
try to compute the distance between each
point and the other point in this
centroid sorry in this cluster so this
is my cluster one this is my cluster two
so what I'm actually going to do is that
here I'm going to compute the distance
between this point to this point then
this point to this point then this point
to this point this point to this point
this point to this point this point to
this point every point I'm actually
going to compute the distance once this
point is done we will go ahead with the
next point and we'll try to compute the
distance and once we get all this
particular distance what we are going to
do we are going to do the average of
them average
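To make the two steps concrete, here is a tiny hand computation of a(i) and b(i) for one point i (the coordinates are made up for illustration):

```python
import numpy as np

own_cluster  = np.array([[1.0, 1.0], [1.2, 1.0], [1.0, 1.3]])  # cluster containing i
near_cluster = np.array([[4.0, 4.0], [4.1, 4.2]])              # nearest other cluster
i = own_cluster[0]                                             # the point i itself

# a(i): average distance from i to the other points of its own cluster
a_i = np.mean([np.linalg.norm(i - p) for p in own_cluster[1:]])
# b(i): average distance from i to every point of the nearest other cluster
b_i = np.mean([np.linalg.norm(i - p) for p in near_cluster])

s_i = (b_i - a_i) / max(a_i, b_i)   # silhouette value of i, between -1 and +1
```

Here a(i) is small (i sits tightly inside its own group) and b(i) is large, so s(i) comes out close to +1, which is exactly the "good clustering" case discussed next.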
now tell me if I try to find out the
relationship between a of I and B of I
if my cluster model is good will a of
i be greater than b of i or
will b of i be greater than a of i
if I have a good clustering model
out of this if we
have a really good model obviously the
distance b of i will be greater
than a of i in a good model that
basically means if I talk about the silhouette
score the values will be between -1
to +1 the more the value is towards +1
that basically means the good the model
is the good the clustering model is the
more the values towards negative one
that basically means this condition is
getting applied now what does this
condition basically say that basically
means that a of i is farther than b of i so the
point sits closer to the other cluster this is what this
information is getting portrayed and
this is the importance of the silhouette
score finally when we apply the
formula of the silhouette score you'll be able
to see that the silhouette score is nothing
but let me rub this everything guys for
you let me just show you the silhouette
score formula it will
be something like this this b of i so
here you have the silhouette score this is
the formula b of i minus a of i Max of a
of I comma B of I if C of I is greater
than one right so by this you will be
getting the value between -1 to + 1 and
more the value is towards + one the more
good your model is more the values
towards minus1 more bad your model is
because if it is towards minus1 that
basically means your a of I is obviously
greater than b of I so this is the
outcome with respect to cot crust string
if s is equal to zero that basically
means still your model needs to be uh
per basically the clustering needs to be
improved what is I over here I is
nothing but one data point you you can
just read this guys data point in I in
the cluster C of I so I hope everybody's
understood this now let's go ahead and
let's discuss about the next topic we
have obviously finished up solart
clustering over here let's discuss about
something called as DB
scan so for DB scan clustering this is
an amazing clustering algorithm we'll
try to understand how to actually do DB
clustering and probably you'll be able
to understand a lot of things from this
now in DB scan clustering what are the
important things so let's start with
respect to DB scan clustering and let's
understand some of the important points
over here the first point that you
really need to remember is something
called as core points I'll also
talk about when do you say core points
or when do you say other points as such
so the first point that I will probably
discuss about is something called as Min
points the second point that I will
probably discuss about is something
called as core points the third thing
that I will probably discuss about is
something called as border points and
the fourth point that I will definitely
talk about is something called as noise
Point okay guys now tell me in K means
clustering
if I have this kind of groups don't you
think with the help of two different
clusters I may combine this two like
this with the help of two different
clusters I may combine something like
this right but understand over here what
what problem is basically happening with
the second clustering this is actually
an outliers let's say that let's say one
thing very nicely I will put okay let's
say I have one point over here I have
one point over here here so if I do
clustering probably I will get one
cluster
here and I may get another cluster which
is somewhere here now understand one
thing this point is definitely an
outlier even though this is an outlier
with the help of K means what I'm
actually doing I'm actually grouping
this into another group so can we have a
scenario wherein a kind of clustering
algorithm is there where we can leave
the outlier separately and this outlier
in this particular algorithm and this is
basically uh we will be using DB scan
to leave the outlier out and this point
will be called as a noisy Point noisy
point or I can also say it as an outlier
so this will be a noise point for this
kind of algorithm where you want to skip
the outliers we can definitely use DB
scan that is density based spatial
clustering of applications with noise a
very amazing algorithm and definitely I
have tried using this a lot nowadays I
don't use K means or hierarchical clustering instead
use this kind of algorithm now see this
what are the important things over here
first of all you need to go ahead with
Min points Min points so first thing is
that you need to have Min points this
Min points is a kind of
hyperparameter this basically says what
does hyper parameter says and there is
also a value which is called as
Epsilon which I forgot I will write it
down over here this is called as Epsilon
now what does epsilon mean Epsilon
basically means if I have a point like
this
and if I take Epsilon this is nothing
but the radius of that specific Circle
radius of that specific Circle okay so
Epsilon is nothing but radius over here
in this specific case what does minimum
points is equal to 4 mean let's say that
I have I have taken a point over here
let's say that this is my
point and I have drawn a circle which
looks like this and let's say that this
is my Epsilon
value okay this is my Epsilon value if I
say my Min points value is equal to 4
which is again a hyper
parameter that basically means if
I have at least four points over
here near to this particular Circle
based on this Epsilon value then what
will happen is that this point this red
point will actually become a core
point a core point which is basically
given over here if it has at least that
many number of Min points inside or near
to this particular within this
Epsilon okay within this particular
cluster suppose this is my cluster with
the help of Epsilon I have actually
created it is there a particular unit of
Epsilon or we simply take the unit of
distance no Epsilon value will also get
selected through some way I I'll show
you I'll show you in the practical
application don't worry now the next
thing is that let's say let's say I have
another another point over here let's
say that I have another point over here
and this is my circle with respect to
Epsilon I have created it let's say that
here I have only one
point I have only one point inside this
particular cluster at that point this
point becomes something called as border
Point border Point border point also we
have discussed over here right so border
point is also there so here I'm saying
that at least one at least one if it is
only one it is present then it will
become a border point if it has Force
definitely this will become a core Point
core Point like how we have this red
color so and there will be one more
scenario suppose I have this one cluster
let's say this is my Epsilon and suppose
if I don't have any points near this
then this will definitely become my
noise point and this noise point will
nothing be but this will be a
cluster okay so here I have actually
discussed about the noise point also so
I hope everybody is able to understand
the key terms now what is basically
happening is that whenever we have a
noise Point like in this particular
scenario we have a noise point and we
don't find any points inside this any
core point or border point if you don't
find inside this then it is going to
just get neglected that basically means
this is basically treated as an outlier
I hope everybody is able to understand
here this point will be treated as an
outlier or it can also be treated as a
noise point and this will never be taken
inside a group okay it will never never
be taken inside a group suppose I have
this set of points which you see
basically over here red core and all and
there is also a border Point by making
multiple circles over here here you can
definitely say that how we are defining
core points and the Border points and
this can be combined into a single group
okay this can be combined into a single
group because how the connection is now
see this this yellow line is basically
created by one sorry this yellow point
is basically created by one Epsilon and
we have one One Core point over here
remember over here it should be at least
one core Point okay not one point but
one core point at least if it is having
one core point then it will become a
border point this will become a border
point that basically means yes this can
be the part of this specific group so
what we are doing whenever there is a
noise we are going to neglect it
wherever there is a border and core
point we are going to combine it so
I'll show you one more diagram which is
an amazing diagram which will help you
understand this better than k means
clustering and hierarchical clustering now
see this everybody now the right hand
side of diagram that you see is based on
DB scan clustering and the left hand
side is basically your traditional
clustering method let's say that this is
K means which one do you think is better
over here you see this these all
outliers are not combined inside a group
but whichever are nearer as a core point
and the border point separate
groups are actually
created right so this is how amazing a
DB scan clustering is a DB scan
clustering is pretty much amazing that
is basically the outcome of this here in
k means clustering you can see all
these points have also been taken as
blue color as one group because it is
considering this as one group but here
we are able to determine these
amazing groups so I'm saying you guys
can directly use DBSCAN without
worrying about anything
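the core, border and noise behavior described above can be sketched with scikit-learn's DBSCAN; the tiny hand-made data set, eps and min_samples values here are illustrative assumptions, not tuned ones:

```python
# minimal DBSCAN sketch, assuming scikit-learn is installed;
# eps (the epsilon radius) and min_samples (the minimum points)
# are the two key terms discussed above
import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups of points plus one far-away point
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # group 1
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # group 2
    [10.0, 0.0],                                       # isolated point
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_                      # cluster id per point, -1 means noise/outlier
n_core = len(db.core_sample_indices_)    # core points: >= min_samples within eps
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, labels.tolist())       # → 2 [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

the isolated point gets label -1, exactly the noise/outlier case discussed above, and is never pulled into a group.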
focus on the Practical part uh I'm just
going to give you a GitHub link
everybody download the code guys I've
given you the GitHub link quickly
download and keep your file ready I'm
going to open my anaconda prompt
probably open my jupyter notebook we'll
do one practical problem I've given you
the link guys please open it so this is
what we are going to do today this will
be amazing here you'll be able to see
amazing things how do you come to know
that over fitting or underfitting is
happening you don't know the real value
right so in in clustering there will not
be any underfitting or overfitting so uh
what all things we'll be importing first
is that we'll try k means clustering we'll
do silhouette scoring and then probably
we'll see the output and um and we'll do
DBSCAN also let's say DBSCAN is also
there so uh what are the things we have
basically imported one is the KMeans
clustering one is silhouette_samples and
silhouette_score these all are present in
sklearn and they are present in
sklearn.metrics that basically means we
use this specific metric to validate
clustering models okay now we'll try to
execute this and apart from that
matplotlib we are just trying to import
numpy we are trying to import and all
here we are executing it perfectly the
next thing is that here the next step is
that generating the sample data from
make underscore blobs first of all we
are just trying to generate some samples
with some two features and we are saying
that okay it should have four centroids
or four centers itself with some features
I'm trying to generate some X and Y data
randomly and this particular data set
will basically be used in performing
clustering algorithms okay for now forget
about range_n_clusters because we
need to try with different different
clusters and try to find out the silhouette
score so right now I just initialized it
with 2 3 4 5 6 values it is very simple
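the data generation just described can be sketched like this, assuming scikit-learn; 500 samples is an illustrative choice, the two features and four centers match the lecture's description:

```python
# generate a toy data set with make_blobs, assuming scikit-learn;
# two features and four centroids as described in the lecture
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
range_n_clusters = [2, 3, 4, 5, 6]    # candidate K values for the silhouette check
print(X.shape, y.shape)               # → (500, 2) (500,)
```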
so if I go and probably see my X data so
my X data will look something like this
so this is my X data with two features
and this is my Y data with one feature
which is my output which belongs to a
specific class okay so that you can
actually do with the help of make
underscore blobs now let's see how to
apply the k means clustering algorithm
so as I said I will be using wcss
wcss basically means within cluster sum
of squares so I'm going to import KMeans
over here for i in range 1 to 11 that
basically means I'm going to use
different different K values or centroid
values and try to see which is having
the minimal wcss value and I'll try to
draw that graph which I had actually
shown you with respect to the elbow
method so here I will basically be
using KMeans the number of clusters will
be i and as the initialization technique
I will be using k-means++ so that the
centroids that are initialized are
very very far apart and then
you have random state is equal to zero
then we do fit and finally we do
wcss.append(kmeans.inertia_) okay this dot
inertia will give you the distance
between the centroids and all the other
points and this is what I'm going to
append in this wcss value and finally
I'll just plot it now here you can see
that I'm just plotting it obviously by
seeing this graph this graph looks like
an elbow okay this graph looks like an
elbow so the point that I'm actually
going to consider over here see which is
the last abrupt change so if I talk
about the last abrupt change here I have
the specific value with respect to this
okay I have one specific value with
respect to this this is my abrupt change
from here the changes are normal so I'm
going to basically select K is equal to
4 now what I'm actually going to do with
the help of the silhouette
score we are going to compare whether K
is equal to 4 is valid or not so that is
what we are going to do
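the elbow loop just described can be sketched as follows, assuming scikit-learn; the make_blobs data set is a stand-in for the one generated above:

```python
# WCSS / elbow method sketch: fit KMeans for K = 1..10 with
# k-means++ initialization and collect inertia_ (the within
# cluster sum of squares), assuming scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)   # sum of squared distances to the closest centroid

print([round(w) for w in wcss])    # WCSS drops sharply up to the elbow, then flattens
```

plotting wcss against K with matplotlib gives the elbow graph shown in the lecture; the last abrupt drop marks the K to pick.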
go ahead and let's try to see it how we
are going to do it so here you can see n
clusters is equal to 4 then I'm actually
able to find out the prediction and this
is specifically my output okay this is
done now see this code okay this code is
a huge code I have actually taken this
code directly from the sklearn page of
silhouette if you go and see this this
directly given over there but I'm just
going to talk about like what are the
important things we need to see over
here with respect to different different
clusters see see this clusters 2 3 4 5 6
I'm going to basically compare whether
the K value should be four or not with
the help of silhouette scoring so let's go
here and here you can see that I'm
applying this one first I will go with
respect to the for loop for n_clusters
in range_n_clusters different
different cluster values are there first
we'll start with two so here you can see
initialize the clusterer with the
n_clusters value and a random generator
seed of 10 for reproducibility so
n_clusters first I took it as two and
then I did fit_predict on X after I did
fit_predict on X I'm using this score on X
comma cluster label now what this is
going to do understand in silhouette
what did we discuss it will try to find
out all the clusters the clusters over
here like this and it'll try to
calculate the distance between them
which is the a(i) then it'll try to
compute the b(i) then finally it'll
try to compute the score and the
value is between -1 to +1 the more the
value is towards +1 the
better it is right so these all things
we have already discussed and that is
what this specific function will do and
this will give my silhouette average
value over here
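a minimal sketch of this validation step, assuming scikit-learn; it loops over the candidate cluster counts and prints the average silhouette score for each, on a stand-in make_blobs data set:

```python
# silhouette score sketch: the score lies in [-1, +1] and the
# closer to +1 the better, assuming scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)

for n_clusters in [2, 3, 4, 5, 6]:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)
    cluster_labels = clusterer.fit_predict(X)
    score = silhouette_score(X, cluster_labels)   # mean of (b(i) - a(i)) / max(a(i), b(i))
    print(n_clusters, round(score, 3))
```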
okay this we have done and then we can
continuously do it for another another
things you can actually find it over
here and this value that you see this
code that you see is nothing nothing so
complex okay this is just to display the
data properly in the form of graphs okay
in the form of graphs so again I'm
telling you I did not write this code
I've directly taken it from the uh
sklearn page of silhouette okay so just try
see this particular uh plotting diagrams
and all that you can definitely figure
out but let's see I will try to execute
it and try to find out the output now
see for n_clusters equal to 2 the
average silhouette score is 0.70 I told
you the value will be between -1 to +1
and I'm actually getting 0.704 which is
very very good and then for n_clusters
equal to 3 it is 0.588 then for n_clusters
equal to 4 I'm getting 0.65 which is
pretty much amazing and then for
n_clusters equal to 5 the average score
is 0.563 and n_clusters equal to 6 you
are seeing 0.45 here directly you can
actually say that fine for n_clusters
equal to 2 I'm getting an amazing score
of 0.704 obviously you're getting the
highest value over this so should we
select ncore cluster isal to two Okay we
should not directly conclude from it
because here we need to also see that
any feature value or any cluster value
is also coming as negative value that
also we need to check so here we will go
down over here you will see the first
one over here with respect to the first
one you see that I'm getting the
values from 0 to 1 it is not going
to -0.1 so definitely two clusters
was able to solve the problem so I'll
keep it like this with me I definitely
have a chance that this may
perform well I may have a chance that
K is equal to 2 may perform
well okay so I may have a chance let's
see to the next one to the next one over
here you can see that for one of the
cluster the value is negative if the
value is negative that basically means
the a(i) is obviously greater than b(i)
so I'm not going to prefer this because it
is having some negative values even
though my cluster looks better but again
understand what is the problem with
respect to this cluster is that if I
take this cluster and probably compute
the distance between this point to this
point and if I probably compute from
this point to this point or this point
to this point this point is obviously
nearer to this right it is obviously
nearer to this so that is the reason why
I'm getting a negative value over here
okay negative value over here this is my
output my score these dotted points that
you see this is my score 0.58
or whatever it is this is basically my
score so obviously this basically
indicates that this point is nearer to
the other cluster's points so
I'm actually getting a negative value
right so this you really need to
understand okay now similarly if I go
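the per-cluster negative-value check being described can be sketched with silhouette_samples, assuming scikit-learn and a stand-in make_blobs data set:

```python
# per-point silhouette values via silhouette_samples; a negative
# value means a(i) > b(i), i.e. the point sits nearer to another
# cluster than to its own
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=10).fit_predict(X)

values = silhouette_samples(X, labels)              # one value per point, in [-1, +1]
for k in range(4):
    print(k, round(values[labels == k].min(), 3))   # a negative minimum flags bad points
```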
with respect to n_clusters equal
to 4 this looks good because here I
don't have any negative value and here
you can see how nicely it has basically
divided the points amazingly with the
help of k equal to 4 right and similarly
if I go with five obviously you can see
some negative values are here some
dotted line negative value are there
with respect to six you also have some
negative values so definitely I'll not
go with six I may either go with four or
I may either go with two now whenever
you have this options always take a
bigger number instead of two take four
because four is greater than two because
it will be able to create a generalized
model so from this I'm actually going to
take n equal to 4 that is K equal to 4
now should we compare with this with the
elbow method here also I got four right
so both are actually matching so this
indicates that with the help of this
clustering this silhouette score we can
definitely come to a conclusion and
validate our clustering model in an
amazing way so I hope everybody is able
to understand and this way you basically
validate a model and definitely you can
try it out you can understand this code
definitely but till here you have
understood that here I'm going to get
the average value then for n_clusters
whatever cluster it is matching it is
just mapping over there and it is
basically giving so this was the session
and uh yes in today's session we
efficiently covered many topics we
covered k means hierarchical clustering
silhouette score DBSCAN clustering in tomorrow's
session the topics that are probably
pending is first I'll start with svm and
svr second I will go ahead with XG boost
and third I will cover up PCA let's
see whether I'll be able to complete
this session uh one one amazing thing
that I want to teach you guys because
many people ask me the definition of
bias and variance so guys uh many people
get confused when we talk about bias and
variance you know because let's say that
uh I have a model for the training data
set it gives us somewhere around 90%
accuracy let's say I'm getting a 90%
accuracy for the test data I may
probably getting somewhere around 70%
accuracy now tell me which scenario is
basically this most of the people will
be saying that okay fine it is
overfitting now when I say overfitting I
basically mention overfitting by low
bias and high
variance right so many people get
confused Krish tell me just the exact
definition of bias and variance low bias
obviously you are saying that because
the training is performed like the model
is performing well with the help of
training data set but with respect to
the test data set the model is not
performing well with respect to training
data set why do we always say bias and
with respect to test data set why do we
always say variance so for this you need
to understand the definition of bias so
let me write down the definition of bias
over here so here I can definitely write
that bias it is a phenomenon that skews
the result of an algorithm in favor of
or against an idea I'll make you
understand the definition uh um but
understand
what I have actually written over here
it is a phenomenon that skews the result
of an algorithm in favor of or against
an idea whenever I say this specific idea
this idea I will just talk about the
training data set initially now when we
train a specific model suppose if I have
this specific model over
here and I'm training with this specific
training data set so this is my training
data set now based on the definition
what does it basically say it is a
phenomenon that skews the result of an
algorithm in favor or against an idea or
a this specific training data set so
even though I'm training this particular
model with this training data set
with this data set it may it may be in
favor of that or it may be against of
that that basically means it may perform
well it may not perform well if it is
not performing well that basically means
the accuracy is down if the accuracy is
better at that point of time what will
say see if the accuracy is better that
time what we'll say we we'll come up
with two terms from here obviously you
understand okay there are two scenarios
of bias now here if it is in favor that
basically means it is performing well
with respect to the training data set I
will basically say that it has low bias
if it is not able to perform well with
the training data set then here I will
say it has high
bias I hope everybody is able to
understand in this specific thing
because many many many people has this
kind kind of confusion now similarly if
I talk about variance let's say about
variance because you need to understand
the definition a definition is very much
important okay if I if I just talk about
the definition of variance I'm just
going to refer like this the variance
refers to the changes in the model when
using different
portions of the
training or test
data now let's understand this
particular
definition variance refers to the
changes in the model when using
different portions of the
training data or test data we obviously
know that whenever initially if I have a
model understand from the definition
everything will make sense I am
basically training initially with the
training
data okay because we divide our data set
see our data set whenever we are working
with we divide this into two parts one
is our train data and test data okay
because this is a tra test data is a
part of that particular data set right
and suppose in this particular training
data it gets trained and performs well
here I'm actually talking about bias but
when we come with respect to the
prediction of the specific model at that
point of time I can use other training
data that basically means that training
data may not be similar or I can also
use test data now in this test data what
we do we do some kind of predictions
these are my predictions and in this
prediction again I may get two
scenario I may get two scenario which is
basically mentioned by variance it
refers to the changes in the model when
using when using different portion of
the training or test data refers to the
changes basically means whether it is
able to give a good prediction or wrong
predictions that's it so in this
particular scenario if it gives a good
prediction I may definitely say it as
low variance that basically means the
accuracy with the accuracy with respect
to the test data is also very good if I
probably get a bad if I probably get a
bad accuracy at that time I basically
say it as high variance so if I talk
about three scenarios over here let's
say this is my model one and this is my
model
two and this is my model
three now in this scenario let's
consider that my model one has the
training
accuracy of 90% and test accuracy of
75% similarly I have here as my train
accuracy of 60% and my test accuracy
of
55% now similarly if I have my train
accuracy of 90% And my test accuracy of
92% now tell me what what things you
will be getting here obviously you can
directly say that fine your training
accuracy is better now you're talking
about bias so this basically indicates
that this has low
bias and since your test accuracy is bad
because it is when compared to the train
accuracy it is less so here you are
basically going to say high
variance understand with respect to the
definition similarly over here what
you'll say high
bias High variance because obviously it
is not performing
well this is another scenario last the
last scenario is that this is the
scenario that we want because it is low
bias and low variance
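as an illustrative sketch (not from the lecture), the three scenarios can be reproduced in code by comparing train versus test R² for polynomial fits of different complexity; the data set and the degrees 0, 5 and 9 are assumptions purely for demonstration:

```python
# bias/variance illustration: compare train vs test fit quality
# for polynomial models of different complexity (numpy only)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + rng.normal(0, 0.2, 60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def r2_scores(degree):
    # least-squares polynomial fit on the training split only
    coeffs = np.polyfit(x_tr, y_tr, degree)
    def r2(xs, ys):
        pred = np.polyval(coeffs, xs)
        return 1 - np.sum((ys - pred) ** 2) / np.sum((ys - ys.mean()) ** 2)
    return r2(x_tr, y_tr), r2(x_te, y_te)

# degree 0: poor on both              -> high bias
# degree 5: good on both              -> low bias, low variance (generalized)
# degree 9: best on train, can drop
#           on test                   -> low bias, possibly high variance
for degree in [0, 5, 9]:
    train_r2, test_r2 = r2_scores(degree)
    print(degree, round(train_r2, 2), round(test_r2, 2))
```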
okay many many people have basically
asked me the definition with respect to
bias and variance and here I've actually
discussed and this indicates this gives
me a generalized model and this is what
is our aim when we are working as a data
scientist so I hope you have understood
the basic difference between bias and
variance and I was able to give you a lot
of examples lot of understanding with
respect to this so I hope you have
actually got this particular uh
understanding of this uh two terms which
we specifically talk about high bias low
bias High variance low variance right so
this was it from my side guys uh and uh
I hope you have understood
this
okay so let's consider a data set with
salary and credit columns
and let's say this is the loan
approval so we are going to take this
sample data set and understand how does
XG boost work suppose salary is less
than or equal to 50 and the credit is
bad so approval the loan approval will
be zero that basically means he he or
she will not get if it is less than or
equal to 50 if the credit score is good
then probably approval will be one if it
is less than or equal to 50 if it is
good
again then it is going to get one if it
is greater than
50 and if it is bad then obviously
approval will be
zero if it is greater than
50 if it is good we are going to get it
as one if it is greater than
50k and probably if it is normal then
also we are going to get it as one
so this is my data set so how
does XG boost classifier work understand
the full form of XG boost is
Extreme gradient
boosting extreme gradient boosting so we
will basically understand about extreme
gradient boosting now extreme gradient
boosting uh will be actually used to
solve both classification and the
regression problem statement so first of
all let's understand how it basically
works if
you just talk about XG boost you
understand that it is a boosting
technique and internally it tries to use
decision trees so how does this decision
tree basically get constructed in
the case of XG boost and how it is
basically solved we are going to discuss
about it so whenever we start the XG boost
classifier understand that first of all
we create a specific base model suppose
if I say this is my base model and this
base model will be a weak learner okay
and this base model will always give an
output of probability of 0.5 in the case
of classification problem so suppose if
I say this is probability 0.5 then I
will try to create a field over here
this field is called as residual field
so first base model what I'm going to do
any data set that you give from here to
train it will always give you the output
as 0.5 so this is just a dummy base
model now tell me if my probability
output is is 0.5 if I want to calculate
the residual that basically means I need
to subtract approval minus this
particular value so what will be the
value over here 0 - 0.5 will be -0.5
1 - 0.5 will be 0.5 1 - 0.5 will be 0.5
and 0 - 0.5 will be -0.5 and this 1 - 0.5
will be 0.5 and this will also be 0.5
let's consider that I have one more
record uh and this specific record can
be anything uh because I want to keep
some more records over here so let's
consider that I have one more record
which is less than or equal to 50K and
if the credit score is normal you're
going to get zero so here also if I try
to find out the residual it will be
-0.5 now the first step I hope
everybody's understood we have to create
a base model okay this base model is
very much important because we have to
create all the decision Tree in a
sequential manner so the first
sequential base tree which is again
a decision tree kind of thing
you can consider but this is a base
model which takes any inputs and gives
by default the probability as 0.5 now
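the base-model step just described can be sketched with pandas; the table re-creates the seven lecture rows, and the residual column is approval minus the constant 0.5 prediction:

```python
# base model sketch: every row is predicted probability 0.5,
# residual = approval - 0.5 (assuming pandas is installed)
import pandas as pd

df = pd.DataFrame({
    "salary":   ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"],
    "credit":   ["bad", "good", "good", "bad", "good", "normal", "normal"],
    "approval": [0, 1, 1, 0, 1, 1, 0],
})
base_prob = 0.5                            # the dummy base model's constant output
df["residual"] = df["approval"] - base_prob
print(df["residual"].tolist())             # → [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]
```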
let's go ahead and understand what are
the steps in constructing decision tree
after creating the base model the first
step is that we create a binary decision
tree so I'm going to write down all
the steps please make sure that you note
them down so step one create a
binary decision tree using the features
second step what we do is we
actually calculate the similarity weight
I'll
talk about this similarity weight what
exactly it is if I want to use a
formula it is the summation of residuals
whole square
divided
by the summation of probability into 1
minus probability plus Lambda I'll talk
about what exactly Lambda is it is a
kind of hyperparameter again so that it
does not overfit the third thing is that
we calculate the Information Gain okay
Information Gain so these are the steps
we basically use in creating an XG boost
classifier the first step is that we
create a binary decision tree using the
features then we go ahead with
calculating the similarity weight and
finally we go ahead and calculate the
information gain so how does it go ahead
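written out, the two quantities the steps above rely on look like this (note the numerator is the square of the summed residuals, as the speaker clarifies later, not the sum of squared residuals; the p_i come from the previous model and λ is the regularization hyperparameter):

```latex
\text{Similarity Weight} = \frac{\left(\sum_i r_i\right)^2}{\sum_i p_i\,(1 - p_i) + \lambda},
\qquad
\text{Gain} = SW_{\text{left}} + SW_{\text{right}} - SW_{\text{root}}
```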
let's understand over here and let's try
to find out okay now let's go ahead and
let's try to construct the decision tree
as I said that let's consider that I'm
considering salary feature So based on
using salary feature what I'm actually
going to do I am going to take this as
my node and I'm going to split this up
and remember whenever we are creating
decision Tree in this particular case it
will be a binary decision tree let's say
that in salary one split is less than or
equal to 50 and one is greater than 50 so
these two you obviously have in the
binary case in the case of credit where
there are three
categories I'll also show you how that
further split will happen and how that
will get converted into a binary tree so
here you have less than or equal to 50K
and greater than 50K now let's go ahead
and understand how many values are there
in this salary so if I see before the
split you can definitely see that I'm
going to use this residual and probably
train this entire model now if I really
wanted to find out the residuals
initially these are my residuals over
here so one residual is -0.5 then I have
0.5 over here then I have 0.5 then again
I have -0.5 then again I have 0.5 then
again I have 0.5 and finally I have
-0.5 so these are my total residuals
that are there suppose if I make this
split less than or equal to 50 First
less than or equal to 50 the residuals
what all things are there so here I'm
going to have -0.5 then less than or
equal to 50 again I'm going to have 0.5
then again less than or equal to 50 I'm
going to have 0.5 and for the last less
than or equal to 50 record which is
nothing but -0.5 so I hope you
understood this split so half of the
things came over here the remaining half
will be greater
than 50 so you have one value here one
value here one value here so it will be
-0.5 then you have 0.5 and then
finally you have 0.5 so these residuals how do we
get it guys see from the base model
which is by default giving 0.5 first my
data goes over here by default
probability I'm going to get 0.5 so
residual is basically calculated from
this probability and approval so this is
approval minus probability so if you
subtract 0 - 0.5 you're
going to get -0.5 1 - 0.5 you're going to
get 0.5 1 - 0.5 you're going to get 0.5 so
everybody I hope is very much clear with
respect to this so this is the first
step we constructed a binary tree now in
the second step it says calculate the
similarity weight now how to calculate
the similarity weight similarity weight
formula is the sum of residuals whole
square now let's say that
I'm going to calculate it for this node
okay the similarity weight now in this
particular case if I go and calculate my
similarity weight it will be the
summation of residuals whole square
these are my residual values so I'm
going to do the summation and then
square it okay so what do
you think the sum of residuals whole
square will be in this particular case
how I have to do it I will just take up
all these values like
-0.5 + 0.5 + 0.5 and
-0.5 whole square right I'm just going to
do the squaring of this divided by
understand what it is divided by it is
divided by probability of 1 minus
probability now where do we get this
probability value where do we get this
probability value value we get this
probability value from our base model
right so here I'm basically going to say
that we are going to do the summation of
probability of 1 minus probability 1
minus probability that basically means
for each and every point for each and
every Point what is the probability see
probability is basically coming from the
base model so for each Pro each point
I'm going to come compute two things one
is the probability and then 1 minus
probability and this I'm going to do the
summation
like this I will do it four times 0.5 *
1 - 0.5 then 0.5 * 1 - 0.5 and finally
you'll be able to see one more will be
there which
is 0.5 * 1 - 0.5 so this will be your total
things with respect to this so I hope
you have understood till here uh where
you are able to understand that what we
have done this is summation of uh
residual square and this is the
remaining probability multiplied by 1
minus probability now tell me what are
you able to find out from this if you
cancel this and this and this and this
the numerator is going to become zero so
this entire value is going to become zero
because 0 divided by anything is 0 so
here I hope everybody has understood what
is the similarity weight of this
specific node if I want to write it it
is nothing but zero now you may be
wondering where is the Lambda
value okay we will initially initialize
Lambda by 1 I'll talk about this hyper
parameter let's consider it as 1 so here
plus 1 or plus 0 actually let's consider
the Lambda value as 0 for right
now okay I'm just going to make Lambda
equal to 0 I'm just going to talk about
it because it is a kind of hyper
parameter so -0.5 - 0.5 + 0.5 + 0.5 if I
do the summation here you
will be able to see that I'm going to
get zero so this calculation we have
done and we have got the similarity
weight equal to 0 and let's go ahead
and calculate the similarity weight of
the next node and remember it's not
square first it is the whole square so
here also it is 0.5 + 0.5 now let's
do it for this if I want to find out the
similarity weight again see I'm going to
repeat it -0.5 + 0.5 + 0.5 whole square
and since
there are three points so I'm going to
basically use probability into 1 minus
probability for one point then plus
probability into 1 minus probability for
the second point and then probability
into 1 minus probability for the third
point and Lambda is zero so I'm not
going to write anything now let's go and
do the calculation for this node so
-0.5 + 0.5 becomes zero then 0.5 whole
square right so here I'm going to get
0.25 and in the denominator if you do
the calculation you are going to get
0.75 so this value is going to be 1 by 3
which is nothing but 0.33 so
the similarity weight for this node for
this node
is 0.33 so here you can see probability
multiplied by 1 minus
probability okay now the next step that
we do is that calculate the information
gain now you know how to calculate the
information gain but before that let's
do the computation for this also for
this root node also go ahead and
calculate the similarity weight of
this okay and
why is the base model probability 0.5
because understand that it is
a dummy model I have just put an if
condition there saying that it is going
to give 0.5 now do it for this one guys
the root node what will it be see I can
calculate from here only -0.5 and 0.5
cancel out this is also gone this is
also gone so the numerator
will be 0.25 divided by something now
tell me guys what should be for the root
node what is the similarity
weight
for this do this calculation everyone
I know it will be 0.25 divided by
this will be 1.75 are you getting this
similarity weight which will be nothing
but 1 by 7 and if I divide 1 by 7 if I
say what is 1 by 7 it
is 0.142 so it is nothing but 0.14 this
is the root node similarity
weight over here
0.14 so I know 0.14 here 0 here and 0.33 here now
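the numbers just derived can be reproduced with a short sketch; the helper function is hypothetical (the real XGBoost library does this internally) and uses the base probability 0.5 and λ = 0 exactly as in the lecture:

```python
# similarity weight = (sum of residuals)^2 / (sum of p*(1-p) + lambda)
# with p = 0.5 for every point and lambda = 0, as in the lecture
def similarity_weight(residuals, p=0.5, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) * p * (1 - p) + lam)

root  = [-0.5, 0.5, 0.5, -0.5, 0.5, 0.5, -0.5]   # all seven residuals
left  = [-0.5, 0.5, 0.5, -0.5]                    # salary <= 50K branch
right = [-0.5, 0.5, 0.5]                          # salary > 50K branch

sw_root = similarity_weight(root)     # 0.25 / 1.75 = 1/7
sw_left = similarity_weight(left)     # 0    / 1.00 = 0
sw_right = similarity_weight(right)   # 0.25 / 0.75 = 1/3
gain = sw_left + sw_right - sw_root
print(round(sw_root, 2), sw_left, round(sw_right, 2), round(gain, 2))   # → 0.14 0.0 0.33 0.19
```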
see over here we calculate the
Information Gain next step the third
step what we do is that we calculate the
information gain now the Information Gain
in this particular case is nothing but the
child node similarity weights added up
minus the root node similarity weight so
I will be getting 0 + 0.33 for the split
nodes minus the root node similarity
weight 0.14 so
0.33 - 0.14 and if I do it it is nothing
but just open your calculator again and
0.33 - 0.14
is nothing but 0.19 I'm getting
0.19 as my information gain the
information gain of this specific tree I
got it as 0.19
obviously you know how the features
will get selected based on the
Information Gain but let's say that the
highest Information Gain that is given
by salary okay now we will go ahead and
do the further split let's go ahead and
do the further split so I know my
information gain now it is 0.19 and the
Information Gain is basically used to
select that specific node through which
the split will happen now I'll further
go and do the split let's say that I'm
going to do the further split with the
next feature that is which one credit so
I'm going to take credit over here I'm
going to take credit over here and again
I have to do a binary split again but
you may be wondering Krish here are
only three categories how are we going
to basically do this particular split
right because we don't know how to do
the split since we have three
categories over here so in this case
what I will do is that we what we can
definitely do is that in this particular
case the split that we are probably
going to do is that let's consider two
categories like good and normal at one
side bad at one side so here it becomes
a binary split again now let's go ahead
and let's try to see that how many data
points will fall here and how many data
points will fall here so for writing
down the data points: see, go through the path. If it is less than or equal to 50, it'll go this path, and if it is bad, then how much is the residual? We are going to get one residual over here first of all, so this is my one residual, that is -0.5. Then similarly, if I see less than or equal to 50 and good is there, right, good or normal is there, so here again 0.5 will come. I hope everybody is able to understand. See the second record: less than or equal to 50, we go in this path, but it is good, so we come over here. Again less than or equal to 50, good, again we are going to get one more 0.5. Then with respect to greater than 50, which is coming over here, we'll not worry about it right now. Again less than or equal to 50, normal, again it is -0.5. Right, so this many records
definitely coming over here only one
record is basically coming over here
then again we will start the same
process again we will start the same
process now for the same process what we
are going to do again try to calculate
the similarity weight now in order to
calculate the similarity weight what I
will do: I will basically say this is my similarity weight. This will become 0.25 divided by 0.25. Why? Because of this whole square, right, the summation of residuals whole square; but here I have only one residual, so the square will become (-0.5)², that is 0.25. And then in the denominator I'm going to basically write 0.5 × (1 - 0.5); this is for only one data point, so it is nothing but 0.5 × 0.5, which is nothing but 0.25. Right, so in this particular case I will get the similarity weight, I hope everybody is getting it, as 1. Now what about this similarity weight? If you want to compute it, it is again very, very simple: this and this will get cancelled, so the numerator will again be 0.25, divided by 0.25 + 0.25 + 0.25, which is 0.75; so this will be 1/3, that is nothing but 0.33, so the similarity weight will be 0.33. Then again I have to calculate the information gain of this node. What I will do, I will add this up, see: 1 + 0.33, I'll add like 1 + 0.33, minus 0. Why zero? Because the similarity weight of the one up, this particular credit node, is basically 0. So 1 + 0.33 - 0, this will be 1.33. So like
this further split will again happen
over here with different different node
and we will only be getting a binary
split but we will be comparing based on
Information Gain which one is coming
good now let's say that I have created
this path, I have designed and developed my entire binary decision tree, which is a speciality of XGBoost. Now
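As a side note, the similarity-weight and information-gain arithmetic used for these splits can be sketched in plain Python. This is only an illustration of the formulas from the walk-through, not actual XGBoost library code; the previous-round probability 0.5 and lambda = 0 are taken from the example:

```python
# Sketch of the XGBoost classification split math from the walk-through.
# Similarity weight = (sum of residuals)^2 / (sum of p*(1-p) + lambda),
# with every record's previous probability p = 0.5 in this example.

def similarity_weight(residuals, prev_prob=0.5, lam=0.0):
    numerator = sum(residuals) ** 2
    denominator = len(residuals) * prev_prob * (1 - prev_prob) + lam
    return numerator / denominator

def information_gain(left, right, parent, **kw):
    # gain = left-leaf similarity + right-leaf similarity - parent similarity
    return (similarity_weight(left, **kw) + similarity_weight(right, **kw)
            - similarity_weight(parent, **kw))

# the credit split: "bad" on one side, "good"/"normal" on the other
left = [-0.5]
right = [0.5, 0.5, -0.5]
print(round(similarity_weight(left), 2))                      # 1.0
print(round(similarity_weight(right), 2))                     # 0.33
print(round(information_gain(left, right, left + right), 2))  # 1.33
```

Whichever candidate split gives the highest gain is the one the tree keeps, which is exactly how salary and then credit were chosen above.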
what I'm going to do over here is that
see everybody what I'm going to do let's
consider the inferencing part let's say
this record is going to go how we are
going to calculate the output so this
first of all went to this base model now
let's go ahead and see how the
inferencing will happen suppose This
Record is going right so first of all
this record will go to this base model
the base model is giving the probability
as 0.5 so the first base model is
basically giving 0.5. Now based on this 0.5, how do we calculate the real probability? Okay, so we apply something called log odds, so we basically say log of P / (1 - P); this is the formula we apply only in the case of the base model. So if we try to see this, it is nothing but log of (0.5 / 0.5), and log of 1 is nothing but zero. So in the first case, whenever any record goes, I will be getting the zero value over here, okay, zero value over here. Then plus. Why
plus I'm doing because it will now go to
the binary decision tree now this record
will go to my binary decision tree;
whatever value I'm getting from this I'm
actually adding that up and now it will
go over here now when it goes over here
first of all let's see which branch it
is following it is following less than
or equal to 50 Branch first Branch over
here then this is bad it'll go and
follow here so here I can see that the
similarity weight is one now the
similarity weight is basically one in
this case so what we do in the case of
this we pass it to a learning rate
parameter so this specifically is my
learning rate multiplied by 1 one
because why similarity weight is one
over here. So this will basically be my first inference, and alpha over here is my learning rate; it can be a very small value based on the learning parameter that we use, like how we have defined learning rates elsewhere. On top of
this we apply an activation function
which is called as sigmoid since this is
a classification problem we apply an
activation function which is called as
sigmoid and I hope you know what is the
use of sigmoid. Based on this alpha value, the output will be between 0 and 1. Now I hope you are getting it, guys; this is how the entire
inferencing will probably happen now
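That inference walk can be sketched like this (a hedged illustration; the leaf value 1 comes from the tree above, and the learning rate 0.3 is just an assumed alpha):

```python
import math

# Sketch of XGBoost classification inference from the walk-through:
# start from the base model's log odds, add learning_rate * leaf value
# for every tree, then squash with sigmoid to land between 0 and 1.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(base_prob, leaf_values, learning_rate=0.3):
    score = math.log(base_prob / (1 - base_prob))  # 0 when base_prob = 0.5
    for leaf in leaf_values:                       # one leaf value per tree
        score += learning_rate * leaf
    return sigmoid(score)

# record lands in the leaf whose value is 1 in the single tree built so far
print(predict_proba(0.5, [1.0]))  # about 0.574
```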
similarly what I will do I will try to
construct this kind of decision tree
parall so we we can also write our
entire function will look something like
this: alpha 0 + alpha 1 times your decision tree 1 output, then alpha 2 times your decision tree 2 output, alpha 3 times your decision tree 3 output, alpha 4 times your fourth decision tree output, and like this till alpha n times your decision tree n output. And this will be your output finally when you're trying
to inference from any new
record now the reason why we say this as
boosting because see understand we are
going to add each and every decision
tree output slowly to finally get our
output with respect to the working of
the decision tree this is how XG boost
actually works. "Does credit need to be split further?" Yes, see, like this: similarly we can split credit, for example good and bad at one side, normal at one side, but whichever split gives more information gain, that will be taken into consideration, right. And this is how your entire XGBoost classifier works. It is very, very difficult to calculate all those things by hand, so that is the reason we say that XGBoost is also a blackbox model; this is basically a blackbox model. "Is it prone to overfitting?" See,
at one stage we also need to perform
hyperparameter tuning and this we
specifically say pre-pruning, we tend to do pre-pruning, and since we are
combining multiple decision trees? No, no: this decision tree is this one, this independent decision tree which I have created separately. After this, what I'll do is I'll create one more decision tree, so it'll be looking like this. See finally how it will look: so
this is my base model then my data then
my data will go to this decision tree
which I have actually done as a binary
split on different different records
then again we will make another decision
tree which will again be a binary tree
the splits will look like this then this
is my base model where I'm getting the
value as zero this will be alpha 1
multiplied by decision tree 1 which is
this then this is Alpha 2 multiplied by
decision tree 2 which is this and like
this we will keep on continuously adding
more decision trees unless and until
this entire things becomes a very strong
learner so this is how how we basically
do the combination of all these things
so I hope everybody is able to
understand about the XG boost classifier
now you may be thinking, how does the regressor work? "For a regression problem statement also, will the decision tree get constructed based on independent features?" Yes, and again the lambda value is a hyperparameter; we basically set up the lambda value with the help of cross validation. Now let's go ahead and discuss XGBoost regressor. The second algorithm that we will discuss is something called XGBoost regressor, and how does XGBoost regressor actually work? "Is the same fundamental followed in random forest?" No, in random forest it is completely different: there bagging happens, bagging happens. So over here
let's go ahead with the regressor so
here I'm going to take some example
let's say that I have this many
experience this many Gap and based on
that we need to determine the salary my
salary is my output feature let's say
the experience is 2 2.5 3 4 4.5 okay now
in this Gap let's say it is yes
yes no no yes and let's say that the
salary is somewhere around 40K it is
41k
52k and uh let's see some more data set
over here 60k and 62k now the first step
in classifier we created a base model
here also we'll try to create a base
model first of all this base model what
output it will give it will give the
average of all these values what is the
average of all these values, okay? What is the average of all these values: 40, 41, 52, 60, 62? If I just do the average, it is nothing but 51k. So by default I will create a base model which will take any input and just give the output as 51; this is the first step. Now based on this I will try to calculate my residual. How do I calculate my residual? I will just subtract 51k from 40k, so this will basically be -11k, and this will be -10k, this will be 1, this will be 9, and this will be 11. I hope everybody's able to
get this. Let's say that I make this as 42k, okay, just for making my calculation a little bit easy, so I have -9 over here. So these are my residuals. Then again, the first step is that I construct my decision tree. Now let's say that I'm going to use the experience feature over here, so this is my experience node, and based on this experience node I have my residuals over here. So here I will take up all my residuals: -11, -9, 1, 9, 11. And then how do I do the split based on experience? This is a continuous feature, so I have to basically do the split with respect to a continuous feature, which I have already shown you in decision trees. So here are my residuals: -11k, -9k, 1k, 9k and 11k. So now I will just take up my first node; here I'm going to use my experience feature. I know what values are going to come: -11 in the root node, -9, 1, 9 and 11. Now what we
are going to do over here is that so I'm
going to do again a binary split over
here now the binary split will happen
based on the continuous feature, that is experience. So two types of records I
may get one is less than or equal to two
and one is greater than 2 less than or
equal to two and one is greater than two
now less than or equal to two when I do
the split let's see how many values we
are getting. Less than or equal to two, I will get only one value, that is -11, and here I'm actually going to get all the other values: -9, 1, 9, 11. Now what we are
going to do after this is that calculate
the similarity weight. Now here the formula for the similarity weight will change a little bit with respect to
regression. So the similarity weight is nothing but the summation of residuals, whole square, divided by the number of residuals plus lambda. Again, here we are going to consider lambda as zero because this is a hyperparameter; the more the value of lambda, the more we are penalizing with respect to the residuals. So this will be the formula
that we are going to apply okay so let's
see for the first number that that we
want to apply so how this will get
applied again I'm going to write this
formula here it'll be better let's say
here similarity weight is equal to the summation of residuals, whole square, and here in the denominator you have the number of residuals plus lambda.
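A tiny sketch of this regression formula in plain Python (an illustration of the walk-through, not library code), reproducing the numbers that come up next:

```python
# Regression similarity weight from the walk-through:
# (sum of residuals)^2 / (number of residuals + lambda)

def similarity_weight(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

print(similarity_weight([-11]))                  # 121.0  (lambda = 0)
print(similarity_weight([-11], lam=1))           # 60.5   (penalized)
print(similarity_weight([-9, 1, 9, 11], lam=1))  # 28.8
```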
see previously we were using probability
and then all those things we are using
so if you want to calculate the
similarity weight of this: it will become 121 divided by (number of residuals is 1, plus lambda is 0), so this is going to be 121. So here we are going to calculate the similarity weight, which is nothing but 121. Now let's do one thing: if we probably take lambda equal to 1, then what will happen? Just think over here what may happen: we may directly penalize the similarity weight by just adding one, okay. So let's do that also. Suppose I say I'm going to take lambda equal to 1; then this will not be the value now. It will become 121 divided by (number of residuals is 1, plus 1), which is nothing but 60.5. Let's say that I now have 60.5 as my similarity weight. Now
similarly I will go ahead and compute the similarity weight for the next one. So here it will become (-9 + 1 + 9 + 11) whole square, divided by 4 + 1. So this -9 and this +9 will get cancelled; 12 squared is nothing but 144, and 144 divided by 5, if I go ahead and calculate it, is nothing but 28.8. So here I get 28.8; the similarity weight for this is 28.8. Similarly I can go ahead and calculate the similarity weight for the top one. It will be nothing but (-11 - 9 + 1 + 9 + 11) whole square, divided by 5 + 1, which is 6. The -9 and +9, -11 and +11 get cancelled, so the numerator will be 1, whole square; so anyhow it will be 1/6 only. So 1/6 will be my similarity weight over here, okay. Now finally the information gain that we need to compute will be very simple. What will be the information gain? 60.5 + 28.8 minus 1/6. So try to get it; just tell me what the output will be: 60.5 + 28.8 - 1/6 is 89.13. Understand, you don't have to worry about calculation;
automatically these things will be done, okay, so you don't have to worry. Now
see, now the decision tree can further be split any number of times. Probably the next split we can do is something like this: this will be
my experience the two splits that may
happen with respect to less than or
equal to 2.5 less than or equal to 2.5
or greater than 2.5 now if this probably
gives the Information Gain better then
the split will happen like this
otherwise whichever gives the better
information again the split will
basically happen like this. Let's say that this is the split that is required: -11 and -9 are over here, and then we have 1, 9 and 11, okay,
because less than or equal to 2.5, these two records will definitely go over here, and these three records will definitely go over here. Now if I try to calculate the similarity weight for this, it will be nothing but (-11 - 9) whole square, divided by 2 + 1. Right, in this particular case it will be (-20)² / 3; 20 into 20 is 400, divided by 3. So if I go and probably use a calculator and show it to you: 400 / 3 is nothing but 133.33. So the similarity weight for this is 133.33. Similarly I can go ahead and
compute for this: it will be (1 + 9 + 11) whole square, divided by 3 + 1. Right, so 1 + 9 is 10, and 10 + 11 is nothing but 21, whole square, divided by 4. So what is 21 whole square? If I open my calculator, 21 × 21 is nothing but 441, divided by 4, so this will be 110.25. And similarly I can go ahead and compute for the root: it will be the same thing that we have got over here, that is 1/6, so this will basically be 1/6. So finally, if I compute the information gain, what will it be? 133.33 + 110.25 - 1/6. Obviously this value will be
greater than the previous one that we have got, that is 89.13. So definitely we are going to use
this split which is better than the
previous split right let's say that this
split has been considered finally how do
we see the output okay I hope everybody
is able to understand right let's say
that this split has worked well so I'm
going to rub all these things out; 110.25 is there. Now suppose I want to do the inferencing, how will the inferencing be done? 133.33 here, 110.25 here. Now suppose any record
comes from here. First of all, any record that goes will go to the base model, so whenever it goes the value is 51: 51 plus alpha 1, this is my learning rate one. Suppose it goes in this route, then what we have is -11 and -9; whenever we go in this route, which has -11 and -9, the average of both these numbers will be considered. What is the average of both these numbers? (-11 - 9) / 2, this is nothing but -10, right, so -10 will get multiplied here. Suppose it goes in the other route, then what will happen: 1 + 9 + 11 divided by 3, the average will be taken, so 21 divided by 3, 7 will be there, so this will get replaced by 7.
that you are doing this is with respect
to decision tree 1 like this we will
again construct decision trees separately, and again it will become alpha 2 times decision tree 2, alpha 3 times decision tree 3, and like this you will be doing till alpha n times decision tree n, and once you calculate this, this will be your specific output in a regression tree. So
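The split comparison and the final regression prediction can be sketched as follows (lambda = 1 on every node as in the calculation above; the learning rate 0.1 is an assumed value):

```python
# Sketch: compare the two candidate splits on experience, then run inference.

def similarity_weight(residuals, lam=0.0):
    return sum(residuals) ** 2 / (len(residuals) + lam)

root = [-11, -9, 1, 9, 11]

# split at experience <= 2: {-11} vs {-9, 1, 9, 11}
gain_2 = (similarity_weight([-11], 1) + similarity_weight([-9, 1, 9, 11], 1)
          - similarity_weight(root, 1))
# split at experience <= 2.5: {-11, -9} vs {1, 9, 11}
gain_25 = (similarity_weight([-11, -9], 1) + similarity_weight([1, 9, 11], 1)
           - similarity_weight(root, 1))
print(round(gain_2, 2), round(gain_25, 2))  # 89.13 243.42 -> the 2.5 split wins

# inference: base prediction 51 plus learning_rate * leaf average
alpha = 0.1                      # assumed learning rate
leaf_output = (-11 + -9) / 2     # record falls in the left leaf: average -10
print(51 + alpha * leaf_output)  # 50.0
```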
in this particular case, what happens is that you're just trying to play with the parameters and use them in a different way to compute all these things. Everybody clear? But again, it is a blackbox model; you cannot visualize all these things. Now let's go to the third
algorithm, which is called SVM. See, SVM is almost like logistic regression, okay. The major aim of SVM is this: suppose I have data points like this, okay; we obviously use logistic regression to split these data points, right. Like
this we try to create a best fit line
which looks like this and probably based
on this best fit line we try to divide
the points. Now in SVM what we do is that we not only create a best fit line, but we also create planes which are called marginal planes. So like this we create some
marginal
plane so this is your hyper plane and
this is your marginal plane and
whichever plane has this maximum
distance will be able to divide the
points more efficiently but usually in
in a normal scenario you know whenever
we talk about hyper plane or whenever we
talk about marginal plane there will be
lot of overlapping of points right
suppose if I have some specific points I
have one point which looks like this I
may also have another points which may
overlap so it is very difficult to get
an exact straight marginal planes and
split the point based on this now this
specific marginal distance should be maximum, because we can create any type of best fit line and use this marginal plane. Now if we
have this overlapping right if for what
do we call for this kind of plane this
kind of plane is basically called as
hard marginal plane so this is basically
called a hard marginal plane, okay, and
similarly if any points are overlapping
suppose this yellow points can also get
overlapped over here and there may be
some kind of Errors so for this
particular case we basically say as soft
marginal plane because here we will be
able to see that errors will be there
now in SVM what we focus on doing is
that we focus on creating this marginal
plane with maximum distance even though
there are some errors we consider it in
solving it by providing some kind of
hyper parameter now how do we go ahead
and basically create this all marginal
planes and how do we go ahead with this
it's very much simple uh just imagine in
this specific way that initially let's
consider that I have this data point
suppose this is my
best fit line how do we give this best
fit line as an equation? We basically say y = mx + c, right. "No hard margin?" A hard margin is impossible in a normal data set; obviously you'll not be able to get it, but definitely we go ahead with creating a soft marginal plane. Now, y = mx + c: what does this m
indicate m is nothing but slope and C
indicates nothing but intercept
can I say that both these equations are the same: ax + by + c = 0? Can I also say that this is the equation of a straight line? I will say that both of them are equal. See, if I try to prove this to you: if I take this equation and try to find out y, it will be nothing but y = (-ax - c) / b. So here you can see that it is almost the same; in this particular case my m value will be -a/b and my intercept will basically be -c/b, so both the equations are the same.
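A quick numeric check of this equivalence (the coefficients are hypothetical, chosen only for illustration):

```python
# ax + by + c = 0 describes the same line as y = mx + k
# with slope m = -a/b and intercept k = -c/b.

a, b, c = 2.0, 4.0, -8.0
m = -a / b
k = -c / b

for x in [-3.0, 0.0, 5.0]:
    y = m * x + k              # point taken from y = mx + k
    print(a * x + b * y + c)   # 0.0 every time: the point satisfies ax+by+c=0
```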
so let's consider that this is my
equation and I am actually and whenever
I say Y is equal to mx + C can I also
write something like this: y = w1·x1 + w2·x2 + ... + c, or plus b, same thing, no? So here also we can write y = w transpose x + b, the same equation,
right we are basically using same
equation yes we can also write it in a
different way but at the end of the day
we are also treating something like this
let's say that this slope is in this
direction if this slope is in this
direction then I can basically say that
let's consider that the slope is minus
one
let's say that this slope is minus one
see it is in the negative Direction
let's say that this slope is minus one
I'm just trying to prove that this slope
is negative value let's consider this
now suppose this is one of my points, (-4, 0), and obviously this particular line is given by this equation. Now if I really
want to find out the Y value let's say
that this is my
X1 this is my X1 and this is my X2 let's
say that
I want to find out I want to find out
this W transpose x + b the Y value based
on this line if I want to compute the y-
value based on this line how will I
compute W transpose X basically means
what w value what all things will be
there one value is B right B is
intercept right now intercept is passing
from origin can I say my B will be zero
obviously I can assume that b will be
zero now in this particular case if I
talk about w, w in this case is minus one, which I have initialized over here. So if I want to do this matrix multiplication, w transpose can be written like this, and this x value can be written as (-4, 0), right. So I can basically write it like this. Now if I do this multiplication, what value will I get? I will basically get four, right, so this is a positive
understand since this is a positive
value any points that are below this
line any points that I consider below
this line and if I try to calculate the
Y can I say that it will always be
positive yes or no similarly if I could
probably consider one point over here as (4, 4). Now tell me, for this (4, 4), if I calculate the value, what will you get? Will you get a positive value or a negative value? Because here only positive values we'll be getting, right. So if I calculate the value, will it be negative or positive? Just try to calculate. How do you calculate? Again I will use the same equation; this time again my slope is minus one, my intercept is zero, and here I will have (4, 4). Now here it is -4, and then plus 0, this will be -4, right, so this will be a negative
value. Negative, guys, see: -4 + 0, negative. So for any point that I will probably have on top of this, any points above this plane, right, if I try to calculate the value, it will
always be negative so what two things
you are able to get positive and
negative so you can consider this
entirely one category this another
category at least these two things you
can basically
consider guys I hope everybody is able
to understand this so this will be my
one
category and this will be my another
category obviously so that basically
means I can definitely use a plane and
split this point I hope everybody is
able to understand now let's go ahead
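Before that, the side-of-the-plane check just described can be sketched like this (w = [-1, -1] and b = 0 are assumed values that describe the line y = -x and reproduce the signs from the walk-through):

```python
# Which side of the plane w.x + b = 0 a point falls on is just the sign
# of w.x + b; one sign is one category, the other sign is the other.

def decision_value(w, x, b=0.0):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w = [-1.0, -1.0]                        # the line x + y = 0, i.e. y = -x
print(decision_value(w, [-4.0, 0.0]))   # 4.0  -> positive side
print(decision_value(w, [4.0, 4.0]))    # -8.0 -> negative side
```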
and let's see how this marginal plane
will get created and what is the cost
function to basically do this or what is
the cost function in making sure that
the marginal plane will definitely work
right it becomes difficult right so
suppose let's consider an
example suppose I say that this is my
lines let's say uh I want to basically
create a kind of I have two variety of
points one is this point let's say I
have all this points like this and the
other points I have somewhere here let's
consider I am just using directly good
number of points so that I can split it
okay because I will try to talk about it
what I'm actually trying to prove so
obviously this is my best fit line that
splits and apart from that what I will
do is that I'll also create a marginal
points so in order to create the
marginal point I may use some different
color let's see which color this will be
my one marginal point remember it will
be to the nearest point over here and
basically we will construct like like
this and similarly here we will be
constructing like this I've already told
you guys this equation can be written as w transpose x + b = 0, right. I can definitely say this because ax + by + c = 0, so this I can also write as w transpose x + b = 0; both are the same, okay. This I don't have to prove;
I hope everybody's clear with this now
what I'm going to do let's represent
this line also with some equation so
this line, if I want to represent it, will be w transpose x + b equal to what value over here, positive or negative? See, from this line, anything above this plane, any distance that we try to find out will always be negative. So let's say that I'm using it as minus one, to just read it as a negative value, and this other line that I am going to mention will be w transpose x + b = +1. Minus one above, plus one below, because we have already discussed that from this point, if you're trying to calculate the value, it is always going to be plus one, and this is going
to be minus one. Here I should definitely say this as k, okay, but I'm not mentioning k; in many articles you'll see it as minus one, many research papers also use minus one, but I would like to specify minus and plus k. Here, though, let's go and write minus one and plus one. Now my aim is to increase this
distance okay this distance I really
want to increase this distance now in
order to increase this if I increase
this distance that basically means my
model is performing well so let's say I
want to find this distance first of all
so if I write w transpose x1 + b = 1, and here I will write w transpose x2 + b = -1, what I'm going to do is the computation: I'm going to subtract them like this. So here obviously this will be my x1, this will be my x2, okay, because these are my points on the two planes, x2 and x1. So I can write w transpose (x1 - x2); b and b will get cancelled, and here I will be writing 2, right. So from here we can definitely write two different
things let's see what all things we can
write so here this is nothing but the
difference between my this plane and
this plane which is given by like this
okay, now always understand: whenever we consider any vectors, right, a vector also has something called magnitude. So if I want to remove this magnitude, I can divide by the magnitude of w; then only the unit vector will remain, which is indicated like this. So I'm going to divide both sides by this magnitude of w, and I don't care about the direction over here right now; we just care about the vectors. Now when I write it like this, what is our aim? Our aim is to maximize 2 divided by the magnitude of w. Can I say this, guys, yes
or
no what is our aim our aim is to
basically maximize this right by
updating W comma B value I need to
maximize this yes everybody's clear with
this can I say that yes I want to
maximize this yes or no everybody I want
to maximize this if I maximize this that
basically means my marginal plane will
become bigger my marginal plane will be
bigger, okay. Now can I write along with this: such that y of i, my output, will be dependent on two different things. One, I can say that my y of i is +1 when w transpose x + b is greater than or equal to 1. Everybody, see in this equation what I'm actually trying to specify: such that y of i is +1 when w transpose x + b is greater than or equal to 1, and when it is -1, that basically means w transpose x + b is less than or equal to -1. Now
what does this basically mean? See, whenever I compute w transpose x + b greater than or equal to 1, I'm obviously going to get this +1; when w transpose x + b is less than or equal to -1, I'm always going to get the output as -1. That is the reason why I have actually written it like this. So these two we have already discussed: we want to increase the marginal plane, which is this, this is my marginal plane, and I'm writing one condition that my yi value will be +1 when w transpose x + b is greater than or equal to 1, otherwise, when it is less than or equal to -1, it is going to be -1. It is very much clear with this condition; we have already done it. Everybody clear
with this now on top of it we can add
one more very important Point instead of
writing such that and all you can also
say that our major
aim is that if I multiply yi by (w transpose xi + b), this product will always be greater than or equal to 1 for correct points, right, for correct points. Because understand, if it is minus one and I'm multiplying with this, and if it is a correct point, minus into minus will obviously be greater than or equal to one only, right. Similarly for the other side it will be greater than or equal to 1. So I can also definitely say that my major aim: if I multiply y of i with this, it will always be greater than or equal to +1, which is definitely saying that it will be a positive value. So this is just a
representation guys but understand what
is the cost function: this is my maximized cost function. Now I'm going to again write it down: maximize over w, b of 2 divided by the magnitude of w. I can also write something like this: minimize over w, b, and I can just inverse it, which looks like magnitude of w divided by 2. Are these both the same or not?
because always understand in machine
learning algorithm why do we write
minimize things because we are trying to
minimize something okay both are
equivalent these both are equivalent and
why we specifically write minimization
because, like in backpropagation, we are continuously updating the values of w and b, so we can definitely write
like this so here my main target is to
minimize this particular value by
changing W and B and I will start adding
some more parameters over here this is
fine till here I think everybody has got
it this is our aim and we are going to
do this but I'm going to add two more
parameters in this optimizer: one is C, and one is the summation from i = 1 to n of something called eta of i, the slack for each error point (usually written ξi). First of all I'll tell you what C is. See, if I have this specific data point, let's say some of my points are over here, then is it a right prediction or a wrong prediction? If some of my points are over here, is it a right prediction or a wrong prediction? Obviously it is a wrong prediction. If my points are somewhere here, is it a wrong prediction? Wrong, incorrect prediction, right. So this C value basically says how many errors we can have: if it says fine, we can have six errors or seven errors, that is how many errors we can have even though we are using the marginal plane. So here I'm specifically writing how many errors we can have; this is what is specified by C. Eta of i basically says, since we are doing the summation, this entire term basically mentions the summation of the distances, the distance
of the wrong points and how do we
calculate the distance from here to here
suppose this is a wrong point I will try
to calculate the distance from here to
here I will do the sumission of this
I'll do the sumission of this I will do
the sumission of this similarly for the
Green Point another sumission will
happen from here to here like this here
to here and we going to do that specific
sumission so we are telling that fine if
you are not able to fit properly try to
apply this two hyperparameters and try
to make sure that this many errors are
also there it is well and good no
problem we will go ahead with that try
to do the submission of the data points
and based on that try to construct the
best fit line along with the marginal
plane like this even though there are
some errors over here or errors over
here we are good to go with respect one
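For reference, the objective being described is: minimize ½·||w||² + C·Σᵢ ξᵢ. As a sketch of my own (not from the session; it assumes scikit-learn and a made-up toy dataset), the C hyperparameter of sklearn.svm.SVC plays exactly this role:

```python
# Not from the session: a sketch of how C in the soft-margin objective
#   minimize 0.5 * ||w||^2 + C * sum(xi_i)
# trades margin width against tolerated training errors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so a perfect linear separation is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A small C tolerates many slack violations (wide margin, many
    # support vectors); a large C penalizes every violation heavily.
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.2f}")
```

Lowering C widens the margin and lets more points violate it; raising C forces the fit to respect nearly every training point.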
One more concept is there, which is called SVR, support vector regression. In SVR only one thing gets changed: only this loss term will be different; everything else remains the same. I want you all to explore this and let me know; this will be one assignment for you. If you change that particular term, the formulation becomes SVR, so just try to explore it, find it out, and let me know.
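As a starting point for that assignment (my own sketch, not from the session; scikit-learn and a synthetic dataset assumed): in SVR the changed term is the ε-insensitive loss, so residuals inside an epsilon-wide tube around the fit cost nothing.

```python
# Not from the session: SVR with the epsilon-insensitive loss.
# Residuals inside the epsilon tube around the fitted line cost nothing,
# so only points on or outside the tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, 80)).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0.0, 0.2, 80)  # noisy y = 2x + 1

model = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)

# Most residuals fall inside the tube, so far fewer support vectors
# than training samples.
print(len(model.support_), "support vectors out of", len(X))
print("prediction at x=2:", model.predict([[2.0]])[0])
```

Try shrinking epsilon toward 0 and watch the support-vector count grow; that is the one term that distinguishes SVR from the classification formulation above.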
So overall, did you like the entire session, everyone? Okay, one more thing is there, which is called the SVM kernel, the kernel trick. Now, in the SVM kernel, what happens? Suppose I have data points that look like this, say one class forming a ring around the other. We obviously cannot use a straight line to divide them. So what do we do? We convert these two dimensions into three dimensions and push the points apart: one class goes up, the other class goes down, and then we can basically use a plane to split them. I uploaded a video around that, and you can definitely have a look at it; I have also shown you practically how to do it, and that is the reason I've created that specific video. So great, this was it from my side. I hope you liked this session. Thank you everyone, have a great day; keep on rocking, keep on learning, and never give up.
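To close with a concrete illustration of the kernel idea above (my own sketch, not from the session; scikit-learn assumed): concentric rings cannot be split by a straight line in 2-D, but an RBF kernel implicitly lifts the data so a separating plane exists.

```python
# Not from the session: the 2-D -> higher-dimension kernel idea in code.
# make_circles produces one class as a ring around the other, which no
# straight line can separate; the RBF kernel handles it easily.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))
print("rbf kernel accuracy:", rbf.score(X, y))
```

The linear kernel stays near chance on this data while the RBF kernel separates the rings, which is the same push-one-class-up, push-one-class-down picture described in the session.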
On the cost-function graph, the x-axis is theta 1 with ticks at 0.5, 1, 1.5, 2, 2.5, and the y-axis is J(theta 1) with the same ticks. Right now theta 1 = 1 and at that value J(theta 1) = 0, so that is my first point on the graph. I have already discussed why the factor is 1/(2m): dividing by m averages the summation, and the extra 1/2 makes the calculation simpler when we differentiate. Now let's take the second scenario: theta 1 = 0.5. Then for x = 1 the prediction is 0.5 * 1 = 0.5; for x = 2 it is 0.5 * 2 = 1; and for x = 3 it is 0.5 * 3 = 1.5. When I draw this best-fit line (the green one), the slope has clearly decreased. If I calculate J(theta 1) with the same equation, each term is (predicted point minus real point) squared: the first is (0.5 - 1)^2, because the real point is 1 and the predicted point is 0.5; the second is (1 - 2)^2; and the third is (1.5 - 3)^2. So I get (1/(2*3)) * (0.25 + 1 + 2.25) = 3.5/6, which is approximately 0.58. So with theta 1 = 0.5 I get J(theta 1) ≈ 0.58, which gives me the next point on the graph, again in green. Now the third condition: theta 1 = 0. Then 0 multiplied by any x is 0, so all three predictions are 0 and the "line" is simply the x-axis. J(theta 1) becomes (1/6) * ((0 - 1)^2 + (0 - 2)^2 + (0 - 3)^2) = (1/6) * (1 + 4 + 9) = 14/6, which is approximately 2.33. So with theta 1 = 0 I get roughly 2.33,
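The three scenarios above can be reproduced with a minimal sketch of the cost function, using the lecture's toy points (1,1), (2,2), (3,3) and theta 0 fixed at 0:

```python
# Cost J(theta1) for h(x) = theta1 * x on the lecture's toy data.
# J(theta1) = (1 / (2m)) * sum((theta1 * x - y)^2)

def cost(theta1, xs, ys):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
for t in (1.0, 0.5, 0.0):
    print(t, round(cost(t, xs, ys), 2))
# theta1 = 1   -> 0.0
# theta1 = 0.5 -> 0.58
# theta1 = 0   -> 2.33
```

Plotting `cost(t, xs, ys)` for many values of `t` traces out exactly the U-shaped curve described next.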
so with theta 1 = 0 my point sits at 2.33. Similarly, if I compute with theta 1 = 2 I get another point over on the right, and when I join all these points together you will see I get this U-shaped curve. This is the cost-function curve, and gradient descent, which we are about to discuss, plays a very, very important role in making sure you get the right theta 1, the right slope value. Now which is the most suitable point? It is the bottom of this curve, which is called the global minimum, because out of the three lines we drew, the best-fit line is the one whose cost landed there: at that point the distance between the predicted and the real points is the smallest. But so far, Krish, you have just assumed theta 1 = 1, then 0.5, then 0, calculated each cost, and drawn the curve. What we really want is to start at one point on the curve and then move toward the global minimum. For that we use a convergence algorithm, because once we reach some starting point we just need to keep updating theta 1 instead of trying different theta 1 values by hand. The convergence algorithm says: repeat until convergence (think of it as a while loop), updating theta_j := theta_j - alpha * d/d(theta_j) J(theta 0, theta 1). I'll talk about this alpha shortly; the ":=" means continuous updating, and the d/d(theta_j) term is the derivative, which is nothing but the slope. This equation will definitely work, trust me, and I'll draw it to show you why. Let's say on the cost curve my first point lands over on one branch, but I have to reach the bottom; this axis is theta 1 and that one is J(theta 1). Suppose I arrive at a point on the right branch: I apply the derivative to J(theta 1) at that point, which means finding the slope, and to find the slope we draw the tangent line there. With respect to that tangent, this is a positive slope. How do we tell? Because the right-hand side of the line points in the upward direction; that is the easiest way to decide whether a slope is positive or negative.
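The idea that "the derivative is the slope, and its sign tells you which side of the minimum you are on" can be checked numerically. This is an illustrative sketch using a central finite difference on the same toy cost function (not part of the lecture itself):

```python
# Estimate the slope of J(theta1) numerically; its sign says which way to move.

def cost(theta1, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def slope(theta1, h=1e-6):
    # central difference approximation of dJ/dtheta1
    return (cost(theta1 + h) - cost(theta1 - h)) / (2 * h)

print(slope(2.0))   # positive: right of the minimum, so decrease theta1
print(slope(0.5))   # negative: left of the minimum, so increase theta1
```

For this data the minimum is at theta 1 = 1, so the slope is positive to the right of 1 and negative to the left, exactly matching the tangent-line picture.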
Now, in this particular case the slope is positive, so I update theta 1 with the convergence algorithm: theta 1 := theta 1 - alpha * (derivative), where alpha is the learning rate (I'll get to it, don't worry). Since the slope is positive, I am subtracting a positive number from theta 1, so theta 1 decreases, and after some n iterations I will come down to the global minimum. Similarly, if I take the left-hand side and draw the tangent there, the slope is negative, so the equation becomes theta 1 := theta 1 - alpha * (negative number). Minus into minus is plus, so theta 1 increases, again moving toward the global minimum. So whether the slope is positive or negative, this rule works and we reach the global minimum. Now what is this learning rate? It controls the speed at which we move from our current point toward the global minimum. Usually we select a small learning rate such as 0.01: with a small value the algorithm takes small, steady steps toward the optimum. But if alpha is a huge value, the updates to theta 1 will keep jumping here and there, and the situation will be that it never reaches the global minimum. It should also not be extremely small, because then it takes such tiny steps that it takes forever to converge, meaning the model keeps on training itself. So choosing a modest alpha is a very good decision. Now let me talk about one more scenario: what if my cost function has a local minimum, and one of my points ends up there? At a local minimum the slope is zero, so the update becomes theta 1 := theta 1 - alpha * 0, that is, theta 1 stays equal to theta 1, and you may think we will be stuck in that local minimum. But with the cost function and gradient descent equation we use for linear regression, we do not get stuck, because this cost curve always looks like a single convex bowl. In deep learning, though, when we learn about gradient descent and ANNs, there are lots of local minima, and for that we have different gradient-descent variants like RMSprop and the Adam optimizer which solve that specific problem. I mention this because tomorrow, if someone asks you the interview question "do you see any local minima in linear regression?", you can answer that the cost function we use will definitely not give us local minima.
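The learning-rate discussion above can be made concrete with a one-parameter gradient descent on the same toy problem. This is a sketch: the two alpha values are illustrative choices, and `descend` is a made-up helper name.

```python
# Gradient descent on J(theta1) for the toy data (1,1), (2,2), (3,3).
# A modest alpha converges to the global minimum at theta1 = 1;
# a huge alpha overshoots every step and diverges.

def grad(theta1):
    xs, ys = (1, 2, 3), (1, 2, 3)
    m = len(xs)
    return sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m

def descend(theta1, alpha, steps=100):
    for _ in range(steps):
        theta1 = theta1 - alpha * grad(theta1)
    return theta1

print(descend(0.0, alpha=0.1))   # settles at the global minimum, theta1 = 1
print(descend(0.0, alpha=0.5))   # jumps back and forth, moving ever farther away
```

With alpha = 0.1 each step shrinks the distance to the minimum; with alpha = 0.5 each step multiplies it by a factor larger than 1, which is exactly the "keeps jumping here and there" behaviour described above.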
But with deep learning techniques such as ANNs we have different kinds of optimizers that solve that particular problem; that is the answer you have to give. Now let me write out the gradient descent algorithm in full. Remember, guys, gradient descent is an amazing algorithm and you will definitely be using it, so please make sure you know it perfectly. One common question: when will convergence stop? Convergence stops when we come near the region where J(theta) is very, very small. So again: repeat until convergence, theta_j := theta_j - alpha * d/d(theta_j) J(theta 0, theta 1), for j = 0 and j = 1, since we need both theta 0 and theta 1. Now we really need to find out what that derivative is. J(theta 0, theta 1) is our cost function, so d/d(theta_j) J(theta 0, theta 1) = d/d(theta_j) [ (1/(2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2 ]. Guys, think of it like the derivative of (1/(2m)) * x^2: that is (2/(2m)) * x = x/m, because the 2 coming down from the square cancels the 2 in 1/(2m). So for j = 0 we get d/d(theta 0) J = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)): the square has gone and the 2s have cancelled. For j = 1, remember h_theta(x) = theta 0 + theta 1 * x, so by the chain rule we additionally multiply by the derivative of theta 1 * x with respect to theta 1, which is just x. That gives d/d(theta 1) J = (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x^(i). So, repeating until convergence, the two final updates are: theta 0 := theta 0 - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)), and theta 1 := theta 1 - alpha * (1/m) * sum_{i=1..m} (h_theta(x^(i)) - y^(i)) * x^(i).
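The two update rules can be sketched as a small batch-gradient-descent loop. The data set here (points roughly on y = 2x + 1) and the hyperparameters are made up for illustration; `fit` is a hypothetical helper name.

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x, applying both
# update rules simultaneously on every pass over the data.

def fit(xs, ys, alpha=0.05, steps=5000):
    theta0 = theta1 = 0.0
    m = len(xs)
    for _ in range(steps):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m                             # d/d(theta0) J
        g1 = sum(e * x for e, x in zip(errs, xs)) / m  # d/d(theta1) J
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

t0, t1 = fit([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
print(round(t0, 3), round(t1, 3))         # approaches theta0 = 1, theta1 = 2
```

Note that both parameters are updated together from the same batch of errors, which is what the "repeat until convergence" block with two simultaneous updates expresses.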
Alpha is the learning rate, guys, which we initialize to some small value like 0.01. And again, h_theta(x) = theta 0 + theta 1 * x, so the derivative of theta 1 * x with respect to theta 1 is nothing but x, which is exactly where the extra x^(i) in the theta 1 update comes from. One more note: if you have multiple features like x1, x2, x3, x4, the cost function becomes a higher-dimensional convex surface, and gradient descent on it is just like coming down a mountain. Now let's discuss two important performance metrics for linear regression: R square and adjusted R square. We use these to verify how good our model is. R square is given by the formula R^2 = 1 - (sum of residuals / sum of totals), where the sum of residuals is SS_res = sum (y_i - y_i_hat)^2 (y_i_hat is nothing but h_theta(x^(i)), the prediction) and the sum of totals is SS_tot = sum (y_i - y_mean)^2. Let me explain what this formula says. Suppose these are my data points and I create the best-fit line; y_i_hat is the predicted point on that line (the green points), and the sum of residuals is the sum of squared differences between each real point and its predicted point. Next, y_mean (y bar) is the mean of y: if I calculate it, I get a horizontal line, and SS_tot is the sum of squared distances from each point to that mean line. For a good fit, the numerator SS_res will be low and the denominator SS_tot will be high, because distances to the mean line are obviously larger than distances to a well-fitted line. Low divided by high is a small number, and 1 minus a small number is a big number, which shows our model has fitted properly: a very good R square. Now tell me, can R square be negative? Say in the good case I got 90%. There will be situations where it is negative: if I create a terrible best-fit line, SS_res can become higher than SS_tot, and then 1 minus something greater than 1 is negative. In the usual scenario this will not happen, because we fit a line that is at least better than the mean line; it is not just pulling a line from somewhere. Now here is one notable property of R square. Suppose I am predicting the price of a house and my feature is the number of bedrooms, and I get an R square of, let's say, 85%. If I add one more feature, location, which is definitely correlated with price, there is a definite chance R square increases, say to 90%. But now suppose I add another feature: the gender of who is going to stay, male or female. You know gender is in no way correlated with price, yet even then there is a scenario where R square still increases, maybe to 91%. The R square formula works in such a way that if I keep adding features, even ones that are nowhere correlated with the target, it keeps increasing my R square. This should not happen, because whether a male or a female stays does not matter at all.
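The R square formula, including the worse-than-the-mean-line case that makes it negative, can be sketched directly; the data and prediction values below are invented for illustration.

```python
# R^2 = 1 - SS_res / SS_tot, where SS_res uses the model's predictions
# and SS_tot uses the horizontal mean line.

def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y, p)[0] ** 0 * (y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [1, 2, 3, 4]
good = [1.1, 1.9, 3.2, 3.9]  # close to the data: R^2 near 1
bad = [4, 1, 5, 0]           # worse than predicting the mean: R^2 below 0
print(r_squared(ys, good))
print(r_squared(ys, bad))
```

The second call shows exactly the negative case discussed above: SS_res exceeds SS_tot, so the ratio is greater than 1 and R square drops below zero.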
Still, when you do the calculation, R square will increase. And that is the danger: right now my model with two features gives 90%, and as soon as I add gender I see R square of 91%, so that model would get picked because it appears to perform better and gives a higher R square, even though gender is not at all correlated; the two-feature model should have been picked. To prevent this situation we use something called adjusted R square, a very nice concept. Adjusted R square is given by the formula R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the total number of samples and p is the number of features, or predictors. In the first scenario my number of predictors was two (bedrooms and location) and R square was 90%; after the calculation my adjusted R square comes out a little bit less, let's say 86%. Now with three predictors, where one feature (gender) is nowhere related, R square increases to 91%, but adjusted R square does not increase; it in turn decreases, say to 82% (86 and 82 are just illustrative values I have considered). So R square went up while adjusted R square came down. How does that happen? Look at the formula: if I put p = 3, then n - p - 1 becomes a smaller number, so (n - 1) / (n - p - 1), a bigger number divided by a smaller number, becomes bigger; multiplying (1 - R^2) by that bigger factor gives a bigger number, and 1 minus a bigger number gives a decreasing value. R square is only there to offset this if the new feature genuinely helps: always remember, when the features are highly correlated with the target your R square increases tremendously, but if a feature is barely correlated there is only a small increase, not a huge one. So with p = 2 the factor (n - 1)/(n - p - 1) is smaller than with p = 3; when p = 3 the denominator shrinks further, and since the tiny bump in R square from an uncorrelated feature cannot compensate, there may well be a scenario where adjusted R square falls.
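Plugging the lecture's example numbers into the adjusted-R-square formula shows the effect; the sample size n = 10 is an assumed value for illustration (the lecture does not state one), so the exact outputs differ from the 86%/82% figures quoted above.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
# Adding a useless predictor nudges R^2 from 0.90 to 0.91,
# yet adjusted R^2 goes down because the (n - p - 1) penalty grows.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 10  # assumed number of samples
print(adjusted_r2(0.90, n, p=2))  # two predictors: bedrooms, location
print(adjusted_r2(0.91, n, p=3))  # plus the uncorrelated gender feature
```

Even though the second call is fed a higher plain R square, it returns a lower adjusted value, which is precisely the penalty mechanism described above.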
So even though my adjusted R square was 86% before, because gender is nowhere correlated I basically end up with 82% from this entire equation. I hope you are understanding this; it is a very, very important property. The simple way to state it: as p, the number of predictors, keeps increasing, whatever R square I get keeps getting adjusted downward, and the adjusted R square will always be less than the plain R square. There was an interview question asked of one of my students: between R square and adjusted R square, which will always be bigger? The student said R square, and was then asked to explain adjusted R square and why that happens. Now, today's agenda: one, Ridge and Lasso regression; two, the assumptions of linear regression; three, logistic regression; four, the confusion matrix; and five, practicals for linear, Ridge, Lasso and logistic regression. So the first topic is Ridge and Lasso regression; let's understand it. If you remember, in our previous session we discussed linear regression, its cost function, R square and adjusted R square, and gradient descent. The cost function was (1/(2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2, and this cost function gave us the convex curve with respect to J(theta 0, theta 1). Now let me give you a scenario: let's say I just have two data points, which look like this.
Okay, now if I have these two specific points, I will create a best-fit line, and that line will definitely pass through both points exactly. If I calculate the cost function, what is the value of J(theta 0, theta 1)? Let's say the line passes through the origin, so theta 0 is zero; since there is no difference between predicted and actual anywhere, the cost is obviously zero. Now understand: this data that you see is called training data; the two points I plotted are the training data. So what is the problem with this? Right now the line created by the hypothesis passes through every point, which is why the cost is zero, and since our main aim is to minimize the cost function, that seems absolutely fine. But imagine that tomorrow new data points come in, and a new point lands away from this line. If I predict for that point, my predicted value sits on the line, and the difference between the predicted and the real point is quite huge, yes or no? This creates a condition called overfitting: even though my model trained well on the training data, since every training point lies exactly on the best-fit line, it causes something called overfitting.
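The two-point scenario can be sketched in a few lines: the line fits the training points perfectly (zero training cost), yet a new point off the line gets a large error. The coordinates here, including the hypothetical test point, are invented for illustration.

```python
# Fit a line exactly through two training points, then evaluate a new point.

def line_through(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return lambda x: intercept + slope * x

h = line_through((1, 1), (2, 4))        # two training points: fit is exact
train_error = (h(1) - 1) ** 2 + (h(2) - 4) ** 2
test_error = (h(3) - 4.5) ** 2          # hypothetical new point (3, 4.5)
print(train_error, test_error)          # 0.0 on training, large on the new point
```

Zero error on the training data alongside a big error on unseen data is exactly the overfitting condition defined next.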
you really need to understand what is overfitting now what does overfitting mean overfitting basically means my model performs well with training data but it fails to perform well with test data now what is the test data over here the test data is basically this points the real test data answer was this points but because the my line is like this I'm actually getting the predicted point over here so this distance if I try to calculate it is quite huge so in this scenario whenever I say my model performs well with training data and it fails to perform well with test data then this scenario we say it as overfitting so this scenario when the model performs well with training data I have a condition which is called as low bias and when it fails to perform with the test data then it is basically called as high High variance very important okay I will make each and everyone understand one by one if it is performing well with the training data that is basically low bias and whenever it performs well with the test sorry fails to perform well with the fails to perform well with the test data then it is basically High variance now similarly I may have another scenario which is called as underfitting so let's say that I have something called as underfitting now in this underfitting what is the scenario the model fails to perform it gives bad accuracy I say that model always remember whenever I talk about bias then you can understand that it is something related to the training data whenever I talk about test data at that point of time you talk about variance and that specifically whenever you talk about variance that basically means we are talking about the test data so for an overfitting you will basically have low bias and high variance low bias with respect to the training data and high variance with respect to the test data now if the model accuracy is bad with training data and the model accuracy is also bad with test data in this scenario we basically say it as 
underfitting so these are the two conditions that are with respect to underfitting that basically means that both for the training data also the model is giving bad accuracy and again for the test data also it is basically having a bad accuracy so in this particular scenario we can definitely say two things out of underfitting one is high bias and high variance so this is the condition with respect to underfitting very super important let me just explain you once again suppose let's consider I have one model I have model two this is model one this is model one this is model two and this is model 3 okay guys so suppose let's say that I have my model my training accuracy is let's say 90% And my let's say that my test accuracy is 80% now in this particular case let's say that my training accuracy is 92% and my test accuracy is 91% and let's say my model three is basically having training accuracy as 70% and my test accuracy is 65% so if I take this particular case it is basically overfitting if I take this particular thing this basically becomes my generalized model and when I talk about this this is my I'll just say that okay I'll also put nice color so that uh you'll be able to understand this this becomes our generalized model and this finally becomes our underfitting right under under fitting so here is my red color I will just say it as underfitting what are the main properties of this overfitting as I said in this scenario since it is performing well with the training data so it will be low bias High variance in this particular case it will be low bias low variance and this particular case it will be high bias and high variance understand in this terminology in this particular way you'll be able to understand so why do we require always a generalized model because whenever our new data will definitely come generalized model will be able to give us very good output let's go back to this particular example here you'll be able to see this straight line the red line 
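The model 1 / model 2 / model 3 comparison above can be turned into a tiny rule of thumb. This is just a sketch: the function name `diagnose` and the cutoffs (a 5-point train/test gap, an 80% accuracy floor) are my own illustrative choices, not anything standard.

```python
def diagnose(train_acc, test_acc, gap=0.05, floor=0.80):
    """Rough bias/variance reading from train vs test accuracy.
    The gap and floor thresholds are arbitrary illustrative choices."""
    if train_acc < floor and test_acc < floor:
        return "underfitting (high bias, high variance)"
    if train_acc - test_acc > gap:
        return "overfitting (low bias, high variance)"
    return "generalized (low bias, low variance)"

# The three models from the lecture:
print(diagnose(0.90, 0.80))  # model 1
print(diagnose(0.92, 0.91))  # model 2
print(diagnose(0.70, 0.65))  # model 3
```

The three calls reproduce the lecture's labels: overfitting, generalized, underfitting.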
The red line that I created is overfitting: whenever I get a new point, the difference between its real value and the predicted value is quite huge, so it is a scenario of overfitting with low bias and high variance. Again, with my two training points, the best fit line passing through both of them causes the overfitting problem, and I've already shown you that J(θ1) is zero in this scenario, since the line passes exactly through the points. Now, what can we take out of this? Our cost function is J(θ) = (1/2m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)². Let me write hθ(x⁽ⁱ⁾) as ŷ⁽ⁱ⁾, so each term becomes (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² — the squared difference between the predicted value and the real value. In this scenario, if I add up these values I get zero, and I have to make sure this value does not come out to zero, because zero here means overfitting. That is where Ridge regression comes into the picture — Ridge and Lasso. Ridge is also called L2 regularization, and what it does is add one more term to the cost: λ multiplied by the slope squared.

What is this slope? Take my equation hθ(x) = θ0 + θ1x. In this case θ0 is zero, so hθ(x) = θ1x, and θ1 is the slope; that is what I square. Always understand: I don't want the cost to become zero, because if it becomes zero it indicates an overfitting condition. Let's say λ = 1 (I'll talk later about how λ is chosen — it is a hyperparameter) and the slope of the overfitting best fit line is 2. The squared-error part is zero, so the total cost is 0 + 1 × 2² = 4. The cost function will not stop here, because it still has to be minimized, so the convergence algorithm changes θ1 and I get another best fit line. This new line no longer passes exactly through the points, so there is now a small residual difference. Say the slope has decreased to 1.5: the new cost is (small residual) + 1 × 1.5² = 2.25 plus a small value, which is less than 4. So the cost is getting reduced from 4 toward a smaller value, and that is the importance of Ridge: instead of the overfitting condition, you end up with a generalized model that has low bias and low variance. That is specifically why we add the Ridge (L2) penalty — to prevent overfitting: you keep reducing the cost until you get a line that behaves as a generalized model. Now if new points come, like the ones I drew earlier, the distance to the line will be small.

What the penalty is really specifying is that the slope should not be too steep; a very steep slope most of the time leads to overfitting. The line should be less steep, but still able to serve as a generalized model. After training for some time the cost will stop reducing much and settle at a small, minimal value. You also have to specify the number of iterations — how many times to update θ1 via the convergence algorithm — and that too is a hyperparameter; based on it you check your R² or adjusted R². And understand: the cost will never become exactly zero — if it does, trust me, it is an overfitting model. What is λ? λ is a hyperparameter controlling how strongly you penalize steepness, and it is selected through hyperparameter tuning, which I will show in today's practical. Why did we assume θ0 = 0? Because I'm considering that the line passes through the origin. "Steep" simply means how steep the line is: this line is quite steep, that one is less steep. Now let's go to the next regularization, which is called Lasso regression.
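The whole Ridge walkthrough above (θ0 = 0, cost = Σ(θ1x − y)² + λθ1²) has a closed-form answer you can check numerically. A minimal sketch, assuming the lecture's one-feature, through-the-origin setup; `ridge_slope` is my own helper name and the data values are made up.

```python
import numpy as np

def ridge_slope(x, y, lam):
    """Minimizer of sum((theta1*x - y)^2) + lam * theta1^2.
    Setting the derivative to zero gives theta1 = x.y / (x.x + lam)."""
    return float(x @ y / (x @ x + lam))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # the unpenalized best slope is exactly 2

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, ridge_slope(x, y, lam))
# slope shrinks toward 0 as lambda grows, but never reaches exactly 0
```

The printed slopes shrink from 2.0 through about 1.94 and 1.5 down to about 0.46 — exactly the "steepness control, never exactly zero" behaviour described above.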
Lasso regression is also called L1 regularization. Here the formula changes a little: you still have (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² and you still add a parameter λ, but instead of the slope squared you add the modulus of the slope, λ·|slope|. And what this modulus of the slope does is help you perform feature selection. You may be thinking: how does it do feature selection? Apart from preventing overfitting, Lasso also helps you select features; let me show you with an example. Consider my hθ(x), which I'm writing as ŷ, with many, many features — so there are many coefficients, many slopes. The penalty is then the sum of the absolute values of those coefficients: |θ1| + |θ2| + |θ3| + … + |θn| (note: it is the θ coefficients that are penalized, not the data points x1, x2, …). As training goes ahead, whichever features are not playing an important role, their coefficient — the slope value — gets driven to zero, and it is as if that entire feature is neglected. With the squared penalty (Ridge), coefficients are only shrunk and never reach exactly zero; but with

the modulus, the unimportant coefficients land at exactly zero, so we are effectively neglecting the features that are not at all important for this problem statement. So with L1 regularization, that is Lasso, you achieve two important things: one, preventing overfitting; and two, if you have many features and many of them are not that important in finding your best fit line, it also performs feature selection for you. That is the importance of Ridge and Lasso regression — here I'm writing L1 regularization, and we have already discussed L2 regularization. You have also understood that λ is a hyperparameter, and its value is found through cross-validation: a technique where we train the model with multiple candidate λ values and find which works best. In short, we are reducing the cost function in such a way that it never becomes zero, but keeps shrinking based on λ and the slope. In most scenarios, if you ask me, we should try both regularizations and use whichever gives the better performance metrics. So, to write it down again in short: for Ridge regression, the L2 norm, the cost function is J(θ) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ·(slope)². What is its purpose? The purpose
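Why does the |slope| penalty zero coefficients out while slope² only shrinks them? Inside coordinate-descent Lasso solvers this shows up as the soft-thresholding operator. A sketch with made-up numbers — the function name `soft_threshold` and the inputs are mine, not from the lecture.

```python
def soft_threshold(rho, lam):
    """Soft-thresholding: the per-coefficient update at the heart of
    Lasso solvers. Signals weaker than lam are cut to exactly 0;
    stronger ones are kept but shrunk by lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

print(soft_threshold(5.0, 1.0))   # strong feature: kept, shrunk to 4.0
print(soft_threshold(0.3, 1.0))   # weak feature: exactly 0 -> dropped
```

The weak signal lands at exactly 0.0 (feature neglected), which is the feature-selection behaviour described above; a squared penalty would only have scaled it down.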
is very simple: here we are preventing overfitting. That was Ridge regression, the L2 norm. Now the next one, Lasso regression, also called L1 regularization: in the case of Lasso the cost function is J(θ) = (1/2m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + λ·|slope|, and its purpose is twofold — one, prevent overfitting; and two, something called feature selection. Those two are the outcomes. With Lasso, when you have many features you have many slopes — θ1, θ2, θ3, up to θn — and those features that have no real contribution in finding your output end up with coefficient values that are essentially nil, very close to zero; in short, you are neglecting them. Because you take the modulus rather than squaring, those values can actually reach zero.

Now let me continue and discuss the assumptions of linear regression. Assumption number one: linear regression works well if the features follow a normal, or Gaussian, distribution; if our features follow this distribution, our model will train well. There is a related concept called feature transformation: if a feature does not follow a Gaussian distribution, we apply some mathematical function to the data to try to convert it toward a normal/Gaussian distribution. The second point is standardization — scaling your data using the Z-score (I hope everybody remembers the Z-score), so that the mean becomes 0 and the standard deviation becomes 1. See, guys, wherever gradient descent is involved, it is good to standardize: if our starting point is already somewhere near the global minimum, training happens quickly; otherwise, if your feature values are huge, the cost surface becomes very large and the starting point can land anywhere on it. The third point is linearity: linear regression works well when the relationship in your data is linear — I won't say "linearly separable," but if your data is very linear it will give a very good answer (logistic regression, which we are going to discuss today, has the same property). Now, you may ask: is it compulsory to do standardization? If you want to reduce the training time of your model, or optimize it, I would suggest you go ahead and do it. Coming to the fourth point: you really need to check for multicollinearity. What is multicollinearity? Say I have features X1, X2, X3 and the output Y. If I look at the collinearity of two features — how correlated they are — and find that X1 and X2 are, say, 95% correlated with each other, is it a wise decision to use both of them, even if each is highly correlated with Y? The answer should be no: we can drop one of those two features.
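The Z-score standardization mentioned above is a one-liner; here is a minimal sketch, with made-up study-hours values for illustration.

```python
import numpy as np

def standardize(x):
    """Z-score scaling from the lecture: subtract the mean and divide by
    the standard deviation, so the result has mean 0 and std 1."""
    return (x - x.mean()) / x.std()

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = standardize(hours)
print(z.mean(), z.std())   # mean ~0, std ~1
```

This is exactly what scaling utilities such as scikit-learn's StandardScaler do per feature, and it is the form gradient descent likes.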
We keep just one of the correlated features and do the prediction with it. There is also a concept called the variance inflation factor (VIF): multicollinearity can be detected with the help of VIF, and I will try to make a dedicated video about it. One more term you will hear in this context is homoscedasticity — another condition people check — but if you satisfy the assumptions above, you will definitely be able to do well with linear regression. So now you have an idea of the assumptions and of several other things. Let's move toward something called logistic regression, the first algorithm we are going to learn for classification. In classification, take one example: suppose I have number of study hours and number of play hours, and based on these I want to predict whether a student passes or fails. So here I have a fixed number of categories — in this scenario two, which makes it binary classification, and logistic regression works very well for binary classification. The question then comes: can we solve multiclass classification using logistic regression? The answer is simply yes, you definitely can. So let's go ahead and discuss logistic regression; first, let's understand one scenario. Suppose I have a single feature, number of study hours — 1, 2, 3, 4, 5, 6, 7 — and the outcome is pass or fail; these are my two conditions. Now let me make some data points: if I study less than 3 hours I will probably fail, and if I study more than 3 hours I will probably pass.
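The VIF mentioned above can be computed with nothing but least squares: regress one feature on the others and look at how well they explain it. A sketch under my own naming (`vif`) and with made-up data; statsmodels ships a ready-made `variance_inflation_factor`, but the calculation is just this.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of X: regress X[:, j] on the
    remaining columns (plus an intercept) and return 1 / (1 - R^2).
    A common rule of thumb treats VIF > 5-10 as multicollinearity."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + 0.01 * rng.normal(size=50)   # nearly a copy of x1
x3 = rng.normal(size=50)                      # independent feature
X = np.column_stack([x1, x2, x3])
print(vif(X, 0), vif(X, 2))   # huge for x1, near 1 for x3
```

The near-duplicate feature gets an enormous VIF (drop one of the pair), while the independent feature stays close to 1.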
I mark the points below 3 hours as fail and the points above as pass, and let's say this is my training dataset. Now the first question: can't we solve this problem with linear regression? With linear regression, yes, I can definitely draw a best fit line through these points. Here fail is 0, pass is 1, and the middle value is 0.5. So I create the best fit line and put a rule on top of it: if hθ(x) is less than 0.5, the output is 0, which means fail; if hθ(x) is greater than or equal to 0.5, the output is 1, which means pass. Say a new data point comes and its prediction on the line is 0.25 — less than 0.5, so I say the person fails. The 0.5 level is my center point, so for any new point — say this red one — I draw a straight

line up to the best fit line, extend it across, read off the value, and if it is greater than 0.5 I say the person has passed. This is obviously working fine. So what is the problem? Why do we need logistic regression at all? The answer is very simple. Suppose I have an outlier: a student who studies 9 hours — 7, 8, 9, 10, somewhere out there — and obviously passes. When I add this outlier, the entire best fit line changes: it rotates and flattens. Now even at 5 study hours the prediction comes out less than 0.

5, so here the answer will be wrong: based on the previous line a person studying 5 hours would pass, but in this scenario the prediction falls below 0.5 and says fail, while the real value is pass. I hope you are understanding: because of one outlier, the entire line shifts. So how do we fix this? There are two problems here. First, just one outlier shifts your whole line here and there. Second, the line is unbounded: if I project points far enough to the right I get predictions greater than 1, and far enough to the left I get negative values — but our outputs can only be 0 and 1. So we have to squash this function so that its ends flatten out, and for that we use something called the sigmoid function (sigmoid activation function). If somebody asks you why you don't use linear regression to solve a classification problem, your answer should be exactly these two points. I have shown you all the scenarios for why linear regression should not be used; now we will continue, understand what exactly logistic regression is all about, and see how its decision boundary is created.
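The outlier story above can be reproduced numerically: fit a least-squares line to the 0/1 labels, find where it crosses the 0.5 threshold, then add one 9-hour passer and watch the boundary move. All numbers here are made up for illustration.

```python
import numpy as np

hours = np.array([1.0, 2.0, 4.0, 5.0])
label = np.array([0.0, 0.0, 1.0, 1.0])     # 0 = fail, 1 = pass

slope, intercept = np.polyfit(hours, label, 1)
boundary = (0.5 - intercept) / slope        # hours where prediction hits 0.5

hours2 = np.append(hours, 9.0)              # one outlier: 9 hours, passes
label2 = np.append(label, 1.0)
slope2, intercept2 = np.polyfit(hours2, label2, 1)
boundary2 = (0.5 - intercept2) / slope2

print(boundary, boundary2)  # boundary shifts right, from 3.0 to about 3.48
```

A student at 3.2 hours is predicted "pass" by the first line but "fail" by the second — one outlier flipped predictions near the threshold, which is exactly the fragility the lecture is pointing at.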
Let's go ahead and discuss that. Our output values should always be between 0 and 1 here, because this is a binary classification problem. So let's define our decision boundary. As usual, in logistic regression we first define the hypothesis: if I write hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn, I can write this entire equation compactly as θᵀx — obviously I can, and that is the notation you will see in many places. But we have to handle one more thing: squashing the line. If I have data points at 0 and data points at 1 and I create the best fit line as before, I also need to squash it at the top and at the bottom so it flattens out instead of running off. To do that, I use a function called the sigmoid activation function. The straight line is denoted hθ(x) = θ0 + θ1x1; on top of this value I apply something so that the line flattens instead of just

expanding without limit. So my hypothesis becomes g(θ0 + θ1x1), where g is a function applied on top of the linear regression to squash the line. What is this g? Let z = θ0 + θ1x1 — I'm just defining this — so hθ(x) = g(z), and g(z) = 1 / (1 + e^(−z)). Substituting z back, hθ(x) = 1 / (1 + e^(−(θ0 + θ1x1))). This is my hypothesis, and it works well because it squashes the function; this g is called the sigmoid, or logistic, function. What does it look like on a graph? With z on the horizontal axis and g(z) on the vertical axis, the sigmoid is an S-shaped curve running between 0 and 1 and crossing 0.5 at z = 0. From this we can make the major assumption: g(z) ≥ 0.5 whenever z ≥ 0, and g(z) < 0.5 whenever z < 0 — you can write that condition down as well. This is the most important condition here.
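The properties just listed — output bounded in (0, 1), g(0) = 0.5, g(z) ≥ 0.5 exactly when z ≥ 0 — are easy to verify directly. A minimal sketch of the sigmoid from the lecture:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1),
    with g(0) = 0.5 and g increasing in z."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-10, -1, 0, 1, 10]:
    print(z, sigmoid(z))
# negative z -> below 0.5; z = 0 -> 0.5; positive z -> above 0.5
```

Note the symmetry g(z) + g(−z) = 1, which is why the 0.5 threshold on g(z) corresponds exactly to the sign of z.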
Why is it called logistic regression? See, guys: with regression you create the straight line, and with the sigmoid (logistic) concept you squash it — the two names have been combined. Will squashing the best fit line help overcome the outlier issue? Yes, obviously it helps. So let's go ahead and set up the problem. Consider my training set: points (x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), (x⁽³⁾, y⁽³⁾), and so on up to (x⁽ᵐ⁾, y⁽ᵐ⁾), where y belongs to {0, 1}, because we are solving a binary classification problem with only two outputs. And we know g(z) = 1 / (1 + e^(−z)), where z = θ0 + θ1x1. Now we have to select the parameters. Let's again consider θ0 = 0 — the line passes through the origin, just for simplicity — so z = θ1x. So my parameter is θ1: I have to change θ1 in such a way that I get the best fit line, with the sigmoid applied on top of it. Now let's define our cost function, because we definitely need one. Everything starts the same way — you already know the cost function of linear regression, since the first best fit line you create is with linear regression: J(θ1) = (1/2m) Σᵢ₌₁ᵐ

(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², the whole thing we discussed yesterday. That is the cost function for linear regression. For logistic regression, suppose I take the same cost function but substitute the new hypothesis, hθ(x) = 1 / (1 + e^(−θ1x)): J(θ1) = (1/2m) Σ (1/(1 + e^(−θ1x⁽ⁱ⁾)) − y⁽ⁱ⁾)², with the intercept taken as zero, guys. When I replace the hypothesis this way it becomes a logistic regression cost function — but there is one problem: we cannot use it. The reason is that with the sigmoid 1/(1 + e^(−θ1x)) inside, this squared-error expression is a non-convex function of θ. Now you may be wondering what a non-convex function is, so let me write it down and differentiate it from a convex function. This is related to gradient descent, and it is very
important this is related to gradient desent if you remember with the help of linear regression whatever gradient Dent we are actually getting it is a convex function like this this is the convex function which looks like a parabola curve Parabola curve because of this Parabola curve whenever we use this linear regression cost function specifically because here my H Theta of X is what it is nothing but Theta 0 + Theta 1 into X because of this this equ will always give you a parabola curve this kind of cost function or convex function you can say but here your s Theta of X is changing so in the case of if I use that cost function you will be getting some curves which looks like this now what is the problem with this curve here you have lot of local Minima if local Minima is there you will never reach This Global Minima so that is the reason we cannot use that c function now mathematically you can also go and probably search in the Google what is the what is the graph or what is a convex or non-convex function but always remember whenever we updates Theta 1 with this within this particular equation by finding the slope then this way it will not be differentiable and here you have lot of local Minima and because of this local Minima you will never be able to reach the global Minima this is your Global Minima right in case of in case of linear regression you'll reach This Global Minima but in this case you will never reach never never you'll be stuck over here or you may get stuck over here you may get stuck over here okay so this has a local Minima problem so how do we solve this understand in local Minima these are my points right I have to come over here this is my deepest point in this particular case I don't have any local Minima now in local Minima also you'll get slope is equal to Z so that is the reason your Theta 1 will never get updated so in order to solve this problem you can see this diagram we have something called as logistic regression cost function so 
I can now write my logistic regression cost function in a different way. Researchers thought about this problem and came up with this proposal: the cost for a single example, Cost(h_θ(x), y), should be defined piecewise, because in binary classification y can only be 1 or 0. So, writing the cost for the two scenarios:

Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0

where h_θ(x) is the sigmoid, 1 / (1 + e^(-θ1·x)). With this cost function, since the log is being used, you will always get a single global minimum; that is the reason the squared-error cost was completely rejected and this one is used instead. Now what does this cost function actually mean? Two scenarios. If y = 1, consider the cost graph with h_θ(x) on the horizontal axis; since this is a classification problem, h_θ(x) ranges between 0 and 1, and the vertical axis is the cost J(θ1). When y = 1 the equation -log(h_θ(x)) is used, and that curve comes with two properties: the cost is 0 when h_θ(x) = 1, that is, when the predicted probability matches the label y = 1, and the cost shoots toward infinity as h_θ(x) approaches 0, a confidently wrong prediction. This piece is again a convex curve. The next point to discuss is y = 0: then you get a different kind of curve, -log(1 - h_θ(x)), where again h_θ(x) ranges from 0 to 1, the cost is 0 at h_θ(x) = 0, and it blows up as h_θ(x) approaches 1. When you combine these two, gradient descent behaves well, so this definitely gives us a usable cost function. I hope everybody is able to follow so far. Finally, I can also write the cost function in a combined way instead of the piecewise form:

Cost(h_θ(x⁽ⁱ⁾), y⁽ⁱ⁾) = -y⁽ⁱ⁾ · log(h_θ(x⁽ⁱ⁾)) - (1 - y⁽ⁱ⁾) · log(1 - h_θ(x⁽ⁱ⁾))

Let's verify it. If I replace y with 1, the second term vanishes, because 1 - 1 = 0 and 0 multiplied by anything is 0, leaving only -log(h_θ(x⁽ⁱ⁾)). If y = 0, then -y becomes 0, the first term vanishes, and I am left with -log(1 - h_θ(x⁽ⁱ⁾)). So both conditions of the piecewise definition are proved by this single cost function. And yes, the cost function and the loss function are almost the same thing here; the loss is per example and the cost averages over the data. Averaging over the m training examples, the full cost function becomes

J(θ1) = -(1/m) · Σ_{i=1..m} [ y⁽ⁱ⁾ · log(h_θ(x⁽ⁱ⁾)) + (1 - y⁽ⁱ⁾) · log(1 - h_θ(x⁽ⁱ⁾)) ]

(the extra factor of 1/2 we carried in linear regression is not needed here), and obviously h_θ(x⁽ⁱ⁾) = 1 / (1 + e^(-θ1·x⁽ⁱ⁾)). Finally, the convergence algorithm is the same as before: repeat until convergence the update θ_j := θ_j - α · ∂J(θ1)/∂θ_j, with α the learning rate, and this is how θ1 gets updated. So this is my cost function, this is my update rule, and this solves the problem for logistic regression. Simple interview questions may come from this, like how it is different from linear regression. Can we call this log likelihood, a topic from probability? Yes, this is the log likelihood: minimizing this cost is the same as maximizing the log likelihood. Now I will discuss performance metrics, and these are specific to classification problems, binary classification in particular.
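The combined cost function above takes only a few lines of NumPy (a minimal sketch; the eps guard and the sample values are my additions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta1, x, y):
    """J(theta1) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), h = sigmoid(theta1*x)."""
    h = sigmoid(theta1 * x)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# The two piecewise properties from the lecture:
# y = 1 with h close to 1 gives a cost near 0;
# y = 1 with h close to 0 gives a huge penalty.
print(round(log_loss(5.0, np.array([2.0]), np.array([1])), 4))
print(log_loss(-5.0, np.array([2.0]), np.array([1])) > 5)
```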
I am talking about a concrete example. Let's consider I have a data set with features X1, X2 and an output y, and obviously in binary classification the outputs look like 0, 1, 0, 1, 1, 0, 1. And ŷ, y hat, is the predicted output of the model; in this particular scenario my ŷ is 1, 1, 0, 1, 1, 1, 0. So this is my predicted output and this is my actual output. Can we come to some kind of conclusion about the accuracy of this model with respect to these data points? This is exactly what the confusion matrix deals with, so first of all we have to create a confusion matrix. For a binary classification problem the confusion matrix looks like a 2x2 table: the actual values, 1 and 0, along one side, and the predicted values, 1 and 0, along the other. Now walk through the pairs one by one. Actual 0, predicted 1: what does this mean? A wrong prediction, so that cell's count becomes 1. Second pair: actual 1, predicted 1, a correct prediction, so I increase that count. Actual 0, predicted 0: correct again, so I increase that count by one. Another 1 and 1, so instead of 1 that cell becomes 2, and one more 1 and 1 makes it 3. Then 0 and 1 again, so that cell also goes to 2, and finally actual 1, predicted 0 puts a 1 in the remaining cell. Now what do these cells mean? When actual and predicted are both 1, that is a true positive (TP). When both are 0, that is a true negative (TN). Whenever the actual value is 0 and you predicted 1, that is a false positive (FP), and whenever the actual value is 1 and you predicted 0, that is a false negative (FN). This whole table is what is called the confusion matrix. Now I really want to find the accuracy of this model, and from the confusion matrix it is very simple: the diagonal elements, the correct predictions, divided by everything. So

accuracy = (TP + TN) / (TP + FP + FN + TN)

Once I calculate this, I have (3 + 1) / (3 + 2 + 1 + 1) = 4/7, and 4/7 is about 0.57, so I am getting roughly 57% accuracy. That is how we calculate basic accuracy with the help of the confusion matrix. Always remember, our model's aim should be to reduce both false positives and false negatives. Now there are some more things you really need to understand. Suppose in our data set the output has 900 zeros and 100 ones; this is an imbalanced data set, biased data, very clear. Suppose instead the zeros are around 600 and the ones around 400; in that scenario I would say this is reasonably balanced data, because yes, one class has 100-odd fewer examples, but it may not hurt most algorithms. An imbalanced data set, though, will definitely affect the algorithms. Let me show you why. Say the number of zeros is 900 and the number of ones is 100, and I build a model that simply predicts 0 for every input it gets from this training data. What will its accuracy be? 900 divided by 1,000, which is 90%. Is that a good accuracy? Numerically, obviously yes, but the data is biased and the model is just outputting 0, 0, 0, 0 for everything; it still gets 90% accuracy while being useless. So you should not depend only on accuracy. There are other terminologies we use: one metric is precision, another is recall, and finally we will discuss the F-score. I am deliberately saying F-score rather than F1 score; the reason I will explain shortly. Whenever you have an imbalanced data set you have to use these other kinds of metrics; you can also do oversampling, and in some scenarios oversampling may work, but the main focus should be on the type of performance metric you choose. So let's write the formulas. Recall is given by recall = TP / (TP + FN). Precision is given by precision = TP / (TP + FP).
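The worked example above can be checked with scikit-learn's metrics. One thing to note: sklearn's confusion_matrix orders the cells as [[TN, FP], [FN, TP]], with labels sorted ascending.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# The worked example from the lecture
y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

# sklearn layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                      # 3 1 2 1

acc = (tp + tn) / (tp + tn + fp + fn)
print(round(acc, 2))                       # 0.57, the 4/7 from the lecture
print(np.isclose(acc, accuracy_score(y_true, y_pred)))  # True

print(round(precision_score(y_true, y_pred), 2))  # TP/(TP+FP) = 3/5 = 0.6
print(round(recall_score(y_true, y_pred), 2))     # TP/(TP+FN) = 3/4 = 0.75
```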
Next, let me discuss the F-score, which we also call the F-beta score. But first I'll draw the confusion matrix again: actual values 1 and 0, predicted values 1 and 0, giving the true positive, true negative, false positive and false negative cells. Now understand what recall focuses on. Recall says TP / (TP + FN). What does this mean? Out of all the actual positive values, how many have been predicted correctly as positive? That is recall, and in recall the false negative is given more priority: our focus should be to try to reduce false negatives. By the way, recall is also called the true positive rate, or sensitivity. Now let's discuss precision. In precision we ask: out of all the values the model predicted as positive, how many are actually positive? That is what precision basically means, and here the false positive is given more priority. Now suppose I consider spam classification as the task. Tell me, in this particular case should we use precision or recall? And one more use case: predicting whether a person has cancer or not. In which case do we go with recall, and in which with precision? For spam classification we should definitely go with precision. Why precision? Because a false positive means a genuine, important mail gets classified as spam and you may never see it; that is the costly mistake here, so the false positive is what we must try to reduce, and precision is exactly the metric that focuses on false positives. In the case of cancer I should definitely use recall. Let's focus on the recall formula, TP / (TP + FN). If a person actually has cancer, label 1, it should be predicted as 1; a false negative means the model tells a person who has cancer that he does not have cancer, and that is a really dangerous situation. The other direction is less harmful: if a person does not have cancer and the model predicts that he does, he will go for further tests and come to know the truth. So here the false negative is given more priority, while in spam classification the false positive is given more priority. This is something important, and you really need to reason like this for each different problem statement. Let me give you one more example: the model predicts that tomorrow the stock market is going to crash. Should we focus on precision or on recall? Here two things matter: who is solving the problem, and from whose point of view. Many people will say recall, many will say precision, but ask yourself: are you creating this model for the people or for the industry? For the people, a missed warning is terrible; they should definitely be notified so they can sell their stock before the crash, so false negatives matter. For companies, a false alarm is very, very bad, so false positives matter. So in this particular case we sometimes need to focus on both false positives and false negatives, and again, it depends on which problem statement you are solving: if you are solving for the people, they should get the notification that the market is going to crash; if you are doing it for companies, your precision and recall priorities may change. And when I have to consider both scenarios at the same time, I will definitely use something called the F-score, or F-beta score. How is the F-beta formula given? Generically, you consider

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

There are three common settings. Whenever both false positives and false negatives are equally important, we select β = 1. If I substitute β = 1, then 1 + 1 gives 2, and it becomes F1 = 2 · (precision · recall) / (precision + recall). This is what is called the harmonic mean; you have probably seen equations of the form 2xy / (x + y), and this is the same type. Here the focus is on both false positives and false negatives. Now let's say your false positive is more important than your false negative; at that point you decrease the β value. Say I decrease β to 0.5: then it becomes F0.5 = (1 + 0.25) · P · R / (0.25 · P + R). Decreasing β basically means you are giving more importance to false positives, that is, to precision. And finally, if I consider the β value as 2, that basically means you are giving more importance to false negatives than to false positives. So with this you can come to a conclusion about which value to use: β = 1 gives the F1 score, β = 0.5 gives the F0.5 score, and β = 2 gives the F2 score. β is the deciding parameter; choose it based on whether false positives, false negatives, or both are important for your problem. Now, first things first: what is the agenda of today's session? First of all, we will complete the practicals for all the algorithms we have discussed, with simple examples, and we will probably do hyperparameter tuning and everything. The second algorithm I am going to discuss is something called naive Bayes; this is a classification algorithm, so we are going to understand its intuition, and there is a fair amount of maths involved: we will revisit probability theory, there is something called Bayes' theorem that we will try to understand, and then we will solve a problem on it. The third one we are going to discuss is the KNN algorithm. So this is today's plan; I know I have written very little, but there is a lot in it. So let's proceed and let's enjoy today's session.
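Before the practicals, the F-beta variants discussed above can be checked against scikit-learn's fbeta_score, reusing the toy labels from the confusion-matrix example earlier:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

p = precision_score(y_true, y_pred)  # 3/5
r = recall_score(y_true, y_pred)     # 3/4

def f_beta(p, r, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (0.5, 1.0, 2.0):
    manual = f_beta(p, r, beta)
    sk = fbeta_score(y_true, y_pred, beta=beta)
    print(beta, round(manual, 4), round(sk, 4))

# beta = 1 is the harmonic mean 2PR/(P+R); beta < 1 leans on precision
# (false positives), beta > 1 leans on recall (false negatives).
```

Since recall (0.75) is higher than precision (0.6) here, F2 comes out above F0.5, which is exactly the weighting behaviour described above.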
First of all, we enjoy by creating a practical problem, so I am opening a notebook file in front of you. Here we will try to solve a problem with the help of linear regression, ridge and lasso; let's see how much we are able to cover, but again, the aim is to learn in a better way, so that everybody understands the basic things. So first of all, as usual, everybody open your Jupyter notebook. The first thing I am going to discuss is scikit-learn's LinearRegression. Let's see what is in it; we will use fit_intercept and so on as-is, but the main aim here is to find the coefficients, which are indicated by our θ0, θ1 and so on. We will start with linear regression and then go ahead with ridge and lasso (I am just making this cell a markdown heading). How many different libraries are there for linear regression? You can do it with statsmodels, with SciPy, with many things, but we will use scikit-learn. First things first, we require a data set, and we are going to take a small one: the Boston house pricing data set, which is already present inside scikit-learn itself. (A note if you are following along today: load_boston was deprecated and has been removed in recent scikit-learn releases, 1.2 and above, so on a new version use another built-in regression data set such as fetch_california_housing.) To import the data set I write one line of code: from sklearn.datasets import load_boston. I am also going to create a number of cells up front so I don't have to keep adding them, with the basic libraries I want: import numpy as np, import pandas as pd, import seaborn as sns, import matplotlib.pyplot as plt, and the magic command %matplotlib inline, and execute. My typing speed has become a little faster from writing these imports again and again. So I have imported all the necessary libraries, which will be more than sufficient for you to start with. Now, to load this data set I just call that function and initialize it: df = load_boston(). If you press Shift+Tab you will see the docstring saying 'load and return the Boston house-prices dataset'; it is a regression problem. Once I execute it, type(df) shows sklearn.utils.Bunch, and if I display df you will see it is in the form of key-value pairs: target is here, data is here, and feature_names is here. We definitely require the feature names, the target values and the data values, and we need to combine them properly into a data frame. So what I am going to do is pd.DataFrame(df.data); remember this is a key-value pair, so df.data gives me all the feature values. If I execute df.data alone you will see the entire raw array: feature one, feature two, feature three and so on, 13 features in all, with their values. The next thing is to add the column names. At first I wrote dataset = pd.DataFrame(df.data) and then dataset.columns = df.target, and executing that gives an error: expected axis has 13 elements, new values have 506. Of course: target holds the 506 output values, not the column names. What I should use instead is feature_names; if you look at df again you will see there is a key called feature_names. So: dataset.columns = df.feature_names. Now if I print dataset.head() you can see the whole data set properly: the features CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT. The feature descriptions are available too: CRIM means per capita crime rate by town, which is important; ZN is the proportion of residential land zoned for lots over 25,000 sq. ft.; DIS is the weighted distance to five Boston employment centres; RAD is an index of accessibility to radial highways; TAX is the full-value property-tax rate; PTRATIO is the pupil-teacher ratio by town. So this is my DataFrame; I did not do much, just pd.DataFrame(df.data) plus the feature names, very simple. Now let's go a little slowly so that everyone can follow. This is dataset.head(), but the thing is, these are all my independent features, and I still need my dependent feature. So I will create a new column, dataset['Price'], the price of the house, and assign it df.target; the target array holds the sale prices of the houses, again in the form of an array, and I am taking it as the dependent feature. Once I execute that and look at dataset.head() again, you will see the features plus one more column being added, Price. About the units: for this data set the target is the median home value in thousands of dollars, not millions; you can confirm it in the data set description, df.DESCR. So, the main thing: all of these are my independent features and Price is my dependent feature, and if I am solving linear regression I have to divide them properly. Now let's go to the next step, dividing the data set; first of all I will divide it into independent and dependent features. X I will be using as my
independent featur so I will write data set dot I will use an iock which is present in data frames and understand from which feature to which feature I will be taking as my independent feature to this feature till lat so the best way that basically means that I just need to skip the last feature in order to skip the last feature what I'm actually going to do from all the columns I will just skip the last column so this is how you basically do an indexing with respect to just skipping the last feature and this will basically be my independent features and here I will basically say Y is equal to data set do iock and here I just want the last feature so I will write colon all the records I want and see the first term that we are probably WR writing over here this basically specifies with respect to records here this specifies with respect to columns from all the columns I'm taking the last column here I will just take the last column and this will basically be my dependent features dependent features so here I have basically executed now if you can go and probably see x. 
head here you'll be able to find all my independent features in y do head you'll be able to find the dependent feature now let's go to the first algorithm that is called as linear regression always remember whenever I definitely start with linear regression I'll definitely not go directly with linear regression instead what I will do is that I'll try to go with Ridge regression and uh lasso regression because there you are lot of options with respect to hyper pment T but I'll just show you how linear regression is done so basically you really really need to use a lot of libraries okay over here and based on this libraries this libraries will try to install okay and what are these libraries these are basically the linear regression Library so here I'm basically going to use two specific thing one is linear regression Library so I will just use from SK learn do linear uncore model import linear regression do you need to remember this the answer is no because I also do the Google and I try to find out where in escal and it is present okay so here is my linear regression so I will try to initialize linear reg is equal to initialize with linear regression and then here what I'm actually going to do I'm going to basically apply something called as cross validation cross validation is very much important because in Cross validation we divide out train and test data in such a way that every combination of the train and test data is basically taken by care is taken by the model and whoever accuracy is better that all entire thing is basically combined so here what I'm going to do I'm going to say mean square error is equal to here I will import one more Library let's say from SK learn dot model selection I'm going to import cross Val score so cross Val score cross validation score basically means it is going to do a lot of train and test split it's something like this one example I will show it to you here only so what does cross validation basically do okay so in Cross 
Suppose your entire dataset has 100 records. If you do 5-fold cross-validation, then in the first fold one fifth is your test data and the rest is your training data; in the second fold a different fifth becomes the test data and the remainder the training data; and so on, five times, each time with a different combination of train and test. I'm not going to discuss it in more depth here — if you want a separate session on it, I'll include that later. So I take `cross_val_score`, and the first parameter I give is my model, the linear regression; then I give X and y. I'm not doing a train/test split here — I'm giving the entire X and y and doing the cross-validation on that. You can also do the train/test split first and pass only `X_train` and `y_train`; it is up to you, but the best practice is to split first and cross-validate only on the training data. For `scoring` I'm going to use `neg_mean_squared_error` — again, you can find all the available options on the scikit-learn page for `cross_val_score` — and finally you give the cross-validation value, `cv=5` or 10, whatever you want. Since I'm doing 5-fold cross-validation I will get five scores back. If you don't believe me, just `print(mse)` and you'll see five different values, one per fold, because we are doing 5-fold cross-validation.
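A minimal sketch of this cross-validation step, using synthetic data in place of the session's dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real dataset
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

lin_reg = LinearRegression()

# 5-fold CV: each fifth of the data takes a turn as the test set
mse = cross_val_score(lin_reg, X, y, scoring="neg_mean_squared_error", cv=5)

print(mse)           # five negative-MSE values, one per fold
print(np.mean(mse))  # the averaged score
```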
Next I take `np.mean` to average all five scores, call it `mean_mse`, and print it. That is my average score; the value is negative because we used `neg_mean_squared_error`, but as a plain mean squared error it is about 37.13. So that is how you do cross-validation. With linear regression you can't tune much, which is exactly why, to overcome overfitting and to do feature selection, we use ridge and lasso regression. Before that, note that to make predictions all you have to do is take the model — `lin_reg` — and call `.predict()` with whatever test values you want; the prediction happens automatically. Now let me focus on ridge regression, because I want to show how hyperparameter tuning is done there. For ridge I'll use two imports: from `sklearn.linear_model` import `Ridge` (ridge also lives in `linear_model`), and for the hyperparameter tuning, from `sklearn.model_selection` import `GridSearchCV`. These are the two libraries I'm going to use; `GridSearchCV` will help you with the hyperparameter tuning. By the way, the difference between MSE and negative MSE is not a big thing, guys — if you use plain mean squared error you get 37; I've just used its negation. Either is fine; you can go with plain MSE as well.
There is also another scoring option based on the square root — root mean squared error — so there are different metrics you can focus on. Now, to find a good value, I'm going to do the hyperparameter tuning with `GridSearchCV`. First I define my model, `Ridge()` — that's what I imported. Let me open the scikit-learn page for `Ridge` so we understand what parameters it uses. Do you remember the alpha value, guys? Why do we use alpha? I told you: in ridge we add alpha multiplied by the square of the slope to the cost. That alpha is probably the best parameter on which to perform hyperparameter tuning. The next parameter we can tune is `max_iter` — the maximum number of iterations, i.e., how many times we may update the theta values to reach the right value. So I'm going to select some alpha values and play with those; apart from that, you can also play with the iteration parameter or other parameters if you want — try whichever parameter you want to change. Now let me show how to write this. Before running `GridSearchCV`, let me define my parameters. Here is my `Ridge`; now I'll say `parameters` and define the important values in the form of a
dictionary. (Sorry — not `C`, my mistake; `C` is the parameter for logistic regression, which I'll show later. For ridge it is `alpha`.) For the alpha values I'll mention some like 1e-5, which means 0.00001; similarly 1e-10 and 1e-8; then 1e-3 and, increasing from there, 1e-2; and then 1, 5, 10, 20, something like that. I'm going to play with all these values, because what `GridSearchCV` does is take every combination of the alpha values, find where your model performs best, and return that as the best-fit parameter that got selected. Now I'll apply `GridSearchCV`: I'll call it `ridge_regressor = GridSearchCV(...)`, where `Ridge` is the first argument — my model — and then I pass all the `params` I defined. If I press Shift+Tab on `GridSearchCV` (you have to execute the import first, then Shift+Tab works), you can see the signature: `estimator` first, `param_grid` second, then `scoring` and the other parameters. So the first thing that goes in is your model, then the parameters you are playing with, and the third is the scoring — and again I'm going to use `neg_mean_squared_error`. Some people are saying that plain mean squared error
is not present as a scoring option, and that is exactly why `neg_mean_squared_error` is used: scikit-learn keeps its scoring interface generic so the same scorer conventions work across algorithms. If you want to dig deeper into it, Google it. Then I call `ridge_regressor.fit(X, y)` — and again, you can first do a train/test split and fit only on `X_train` and `y_train`. (I hit an error here — the parameters had become a list, so I made them a dictionary again. When I get an error I don't get worried; I just fix it.) With the grid search fitted, let's select the best parameter: I print `ridge_regressor.best_params_` — and `ridge_regressor.best_score_`. The values selected are `alpha = 20` with a best score of -32. Initially I got -37, so because of ridge regression our negative mean squared error has definitely become better — there is a minus sign, don't worry, but it has moved from -37 to -32. And note, inside `GridSearchCV`, while it tries the whole combination, you can also set the cross-validation value `cv`. Many people are asking: Krish, if this negative value had instead moved further from zero, would that mean you cannot use ridge regression? You're right — in that scenario ridge would not be helping you out.
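The ridge hyperparameter search sketched end to end — synthetic data again, and the alpha grid mirrors the style of values used in the session:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# alpha is the ridge penalty strength (the alpha * slope^2 term in the cost)
params = {"alpha": [1e-5, 1e-3, 1e-2, 1, 5, 10, 20]}

ridge_regressor = GridSearchCV(Ridge(), params,
                               scoring="neg_mean_squared_error", cv=5)
ridge_regressor.fit(X, y)

print(ridge_regressor.best_params_)  # the alpha that scored best
print(ridge_regressor.best_score_)   # its mean negative MSE across folds
```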
Guys, let me write it down again so everybody is clear: previously, with plain linear regression, I got -37; now, with ridge, I got -32. Which one should I select? The ridge model, because it is performing better — closer to zero — and ridge also tries to reduce overfitting. Now let me also try lasso regression. I'll copy and paste the same code: from `linear_model` import `Lasso`, build my lasso regressor, and see whether it improves things. The parameter selected is `alpha = 1` — I print `lasso_regressor.best_params_` and then `lasso_regressor.best_score_` — and I'm getting -35, versus -32 for ridge, so for now ridge still wins. Now see what happens if I add more parameters: I'll extend the alpha grid with values like 5, 10, 20, 30, 35, 40, 45, 100, keep `cv=5` (yes, CV is 5 — take it down), and execute again.
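Swapping in lasso only changes the estimator; the grid-search scaffolding stays identical. A sketch, again on synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# Same style of grid, now for the lasso (L1) penalty strength
params = {"alpha": [1e-3, 1e-2, 1, 5, 10, 20, 30, 35, 40, 45, 100]}

lasso_regressor = GridSearchCV(Lasso(max_iter=10000), params,
                               scoring="neg_mean_squared_error", cv=5)
lasso_regressor.fit(X, y)

print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
```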
Do you see what happened in ridge after adding more parameters? It went to -29, and the alpha value selected is 100. If you want, try cross-validation with 10 folds and execute again — these are some of the hyperparameters we will definitely play with. You can also increase the cross-validation value and re-run. With lasso, though, I don't know whether it is improving — it comes to about -34. You just have to play with these parameters; for a bigger problem statement the search is not limited to this. We try many, many parameters, and whichever combination gives the best result, we take it. And yes, sometimes the error increases even after trying different parameters — that's fine; in most scenarios you keep iterating to get something better than the baseline of -37. The other thing I can do is bring in a proper train/test split and repeat all of this. Let's see one example: how do we split? From `sklearn.model_selection` import `train_test_split`. (It's okay, guys — you may get slightly different values than mine.) Let me make the problem statement a little simpler: I'll insert a cell below, take the same code, and redo it for the train/test split. I take `X_train` and `y_train` (and their test counterparts) from `train_test_split` with a 33% test size, and then execute with `X_train` and `y_train`. So here is my split setup.
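The split itself in minimal form — any `random_state` works; 42 is just the one I use here:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)

# 33% of the rows become the test set, the remaining 67% stay in the train set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)  # (67, 3) (33, 3)
```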
You can see I've written this code with `train_test_split` from `sklearn.model_selection`; the `random_state` can be anything you like. Then you pass X and y with `test_size=0.33`, which means the test set gets 33% of the data and the train set the remaining 67%. That gives me `X_train` and `y_train`, and I redo the cross-validation on `X_train` and `y_train`. Now I see about -25 — and understand, this value should move towards zero; the closer it goes to zero, the better the performance. Similarly for ridge: I fit the grid search on `X_train` and `y_train`, select the best score, and I'm getting about -25.47, so the improvement is still a little disappointing because we are not getting very close to zero. The same goes for lasso on `X_train` and `y_train`. Next, you can use `lasso_regressor.predict(X_test)` to get predictions to compare against your `y_test` values; suppose I call them `y_pred`. Then from scikit-learn I'll use the R² metric — remember R² and adjusted R²?
The `r2_score` function is in `sklearn.metrics`, so I write `from sklearn.metrics import r2_score`. Then I say my `score` variable is nothing but `r2_score(y_pred, y_test)` and print it. (I was also looking for an adjusted R² — scikit-learn only ships `r2_score`, but adjusted R² can be computed from it.) The output with this lasso regressor looks decent — ideally it should be near 100%, but right now I'm getting about 67%. If you want to try it with ridge, you can: `ridge_regressor.predict(...)` gives about 68%. You can also try the plain linear regressor — and if you see the error saying the regressor is not fitted yet, it's because we never called fit. So I fit it with `lin_reg.fit(X_train, y_train)`, predict, and the R² also comes out around 67–68%. Since this is just linear regression you won't get to 100%, because you are drawing a straight line; for that you'd use other algorithms like XGBoost, naive Bayes, and so on. (And yes — you give `y_test` and `y_pred`, and the metric compares the two.) See, beyond a limit you cannot increase the performance: in linear regression, given my points, I can only draw one best-fit straight line — I cannot draw a curve — so my accuracy will be limited. Now let's quickly do the logistic regression practical.
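Before moving on, the R² evaluation above as a sketch — synthetic data, so the score will not match the session's 67%; the adjusted-R² line is my own addition, since scikit-learn has no built-in scorer for it:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

lin_reg = LinearRegression().fit(X_train, y_train)  # fit first, or predict raises
y_pred = lin_reg.predict(X_test)

score = r2_score(y_test, y_pred)  # order: true values first, then predictions
print(score)

# Adjusted R^2 is not built into sklearn, but follows directly from r2_score:
n, p = X_test.shape
adj = 1 - (1 - score) * (n - 1) / (n - p - 1)
print(adj)
```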
In logistic regression too we can use `GridSearchCV`. First, the dataset: I'll quickly implement logistic regression — from `sklearn.linear_model` I import `LogisticRegression`. For logistic we need a classification problem, so let's take a new dataset: from `sklearn.datasets` import `load_breast_cancer`, which is also present in scikit-learn. I load the breast cancer dataset; all the independent features are in `.data` and the column names are in `.feature_names` — the same thing we did previously. So that becomes my complete set of independent features, and if you look at `X.head()` you'll see that based on these input features we need to determine whether the person has cancer or not; there are many, many features here. That was the independent side; the dependent feature is already present in `df.target` — in this dataset, `df.target` holds our dependent feature. So I create y as `pd.DataFrame(df.target)` with the column name `target`, and if you look at y, it contains zeros and ones in the target column. The next thing to do, before anything else, is check whether this target column is balanced or imbalanced.
If the dataset is imbalanced we definitely need to work on that, for example by upsampling. If I write `y['target'].value_counts()` and execute, it tells me how many ones and how many zeros there are: the total number of ones is 357 and the total number of zeros is 212. So is this an imbalanced dataset? No — this is a reasonably balanced dataset. Now I'll do the train/test split again — I can quickly copy the same code entirely and get my `X_train`, `X_test`, `y_train`, `y_test`. Next, if I search for logistic regression in the scikit-learn docs, I can see what parameters it has. There is the penalty — the L1 norm or L2 norm, i.e., L1 or L2 regularization, exactly what we discussed for logistic — and then the `C` value; these two parameters are very important. The penalty decides what kind of regularization you add — you can use L2 or L1. `C` is the inverse of the regularization strength — roughly saying, 1/lambda — and this parameter is also very important, guys. There is also `class_weight`: if your dataset is not balanced, you can apply weights to your classes at that point — you can directly use `class_weight='balanced'`, or supply whatever other weights you want. And no, this is not ridge or lasso — this is logistic regression, but in logistic regression you also have L1 and L2 norms. I probably missed that particular part in the theory, but the L1 and L2 penalty norms exist here too.
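Loading the dataset and checking the class balance, as described above:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer()

# Independent features come from .data, column names from .feature_names
X = pd.DataFrame(df.data, columns=df.feature_names)
# The dependent feature comes from .target
y = pd.DataFrame(df.target, columns=["target"])

print(y["target"].value_counts())  # 357 ones vs 212 zeros -> roughly balanced
```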
I didn't cover it in the theory because logistic regression can be learned in two different ways — through the probabilistic method and through the geometric method. If you watch my video on logistic regression, present right now on my YouTube channel, I explain the L1 and L2 norms there as well; here too it is a kind of penalty, used for this classification problem. So let's play with the parameters. I will tune two of them: the `C` value — I define one set of values like 1, 10, 20; anything you like — and one more parameter called `max_iter`. This grid is specifically for `GridSearchCV`, which I'm going to apply, so I execute that as my `params`. Now I quickly define my model, `model1`, a `LogisticRegression` — by default I give it one value each for `C` and `max_iter` — and later I apply it to the grid search: `GridSearchCV(model1, param_grid=params, ...)`. Since this is a classification problem and I'm not sure whether true positives or true negatives matter more, I'm going to use `scoring='f1'` — the F1 score is the performance metric we discussed yesterday — and then `cv=5`. That is my entire model with `GridSearchCV`, and I execute it.
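A sketch of this logistic grid search on the breast cancer data. Convergence warnings are expected with the smaller `max_iter` values on unscaled features, just as in the session:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# Two parameters to tune: C (inverse regularization strength) and max_iter
params = {"C": [1, 10, 20], "max_iter": [100, 150, 200]}

model1 = LogisticRegression(C=1, max_iter=100)
model = GridSearchCV(model1, param_grid=params, scoring="f1", cv=5)
model.fit(X_train, y_train)

print(model.best_params_)
print(model.best_score_)
```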
Then I fit it on my `X_train` and `y_train` data. Once it runs you see the output along with a lot of warnings — because of how many parameter combinations there are — and finally which parameters got selected. To find the best parameters I use `model.best_params_` — here `max_iter` of 150 was chosen — and `model.best_score_` is about 95%. But we still want to test it on the test data; can we? Yes, definitely: I call `model.predict(X_test)` and store it as `y_pred` — all the one/zero predictions I'm getting. After getting the predictions I can apply the confusion matrix — I hope I've taught you about the confusion matrix — so from `sklearn.metrics` I import `confusion_matrix` and `classification_report`; those are the two things I want. To see the confusion matrix I pass `y_test` and `y_pred` in whichever order you prefer — if you swap them, only the layout flips, as I showed you — and the values are 63, 118, 3, and 4. Finally, for the accuracy I also import `accuracy_score` and compute the total accuracy on `y_test` and `y_pred`, which we discussed yesterday — this gives 96%. If you want detailed precision and recall scores, at that point use `classification_report` with `y_test` and `y_pred`.
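Evaluating on the held-out test data with the metrics just mentioned (a plain fitted model stands in for the grid-search winner here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```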
Here you can see the F1 score, precision, and recall; since this is a balanced dataset, the performance is naturally good. Yes, you can also use ROC — I'll show you how later; you have to compute the false positive rate and true positive rate, but don't worry about ROC, I will first explain the theoretical part. Now let's go ahead and discuss naive Bayes. Naive Bayes is an important algorithm — another amazing algorithm used specifically for classification — and it works on something called Bayes' theorem. So first we need to understand Bayes' theorem. Let's say, guys, that I have an experiment called rolling a die. In rolling a die, if I ask the probability of getting a 1, you'll obviously say 1/6; if I ask the probability of a 2, you'll also say 1/6; and for a 3 it is again 1/6. Events like these are called independent events. Why is rolling a die an independent event? Because across the rolls, getting a 1 is not dependent on getting a 2, and a 2 is not dependent on a 3 — they are all independent, which is why we specifically call them independent events. For dependent events, consider an example: a bag of marbles, containing three red marbles and two green marbles. Now tell me — suppose in the first event I take out a red marble. What is the probability of taking out a red marble? You can definitely say it is 3/5; that is my first event. Now, for the second event, suppose
the red marble is already out and — forget a second red marble — you want to take out a green marble. What is that probability? You'll say: one red marble has been removed, so four marbles are left in total, and the probability of getting a green marble is 2/4, which is 1/2. So look at what is happening: from the first event you took out a red marble, and from the second event a green marble, and these two are dependent events, because the number of marbles goes down as you draw from the bag. So if I ask what is the probability of taking out a red marble and then a green marble, the formula is very simple — we already discussed it in statistics: P(red and green) = P(red) × P(green | red). That second factor is called a conditional probability — the probability of the green marble given that the red-marble event has already occurred. Let me now write it down very cleanly: P(A and B) = P(A) × P(B | A). Let's go and derive something. Can I write P(A and B) = P(B and A)? The answer is yes, definitely — if you do the calculation you'll get the same answer; you should not say no. And what is the formula for P(A and B)? You can write it as P(A) × P(B | A). In our example, if I take out the green marble first, P(green) is 2/5, and
P(red | green) is 3/4. Now, the other side, P(B and A), I can definitely write as P(B) × P(A | B). Setting the two expansions equal — P(A) × P(B | A) = P(B) × P(A | B) — I can derive (sorry, let me write it correctly): P(B | A) = [P(B) × P(A | B)] / P(A). This is what is called Bayes' theorem, and this is the crux behind naive Bayes — understand, this is the crux behind it. Now let's discuss how we use this to solve problems; let me take an example to make you understand. Say I have features X1, X2, X3, X4, X5, ..., up to Xn, and I have my output Y. These are all my independent features, and Y is my output feature, which is also my dependent feature. Now, what does P(B | A) mean here? I need to find the probability of Y: based on the input values I need to predict the output — and initially, on the training dataset, the model sees both your inputs and your output and gets trained on them. Let me write the whole thing in terms of the equation: I want P(Y | X1, X2, ..., Xn), where A is the feature vector X1, X2, ..., Xn and B is Y.
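The derivation above, written out cleanly:

```latex
P(A \cap B) = P(A)\,P(B \mid A), \qquad
P(B \cap A) = P(B)\,P(A \mid B)

\text{Since } P(A \cap B) = P(B \cap A):\quad
P(A)\,P(B \mid A) = P(B)\,P(A \mid B)

\Rightarrow\; P(B \mid A) \;=\; \frac{P(B)\,P(A \mid B)}{P(A)}
\qquad \text{(Bayes' theorem)}
```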
Applying the theorem: P(Y | X1, X2, ..., Xn) = [P(Y) × P(X1, X2, ..., Xn | Y)] / P(X1, X2, ..., Xn). (Just a second — I made a small mistake on the board and missed a "given Y" term; this is the correct form.) Now, expanding under the naive assumption that the features are conditionally independent given Y, the numerator becomes P(Y) × P(X1 | Y) × P(X2 | Y) × P(X3 | Y) × ... × P(Xn | Y), and the denominator becomes P(X1) × P(X2) × P(X3) × ... × P(Xn). The Y differs per record's class — for this record Y may be one value, for that record another — but the output is, say, yes or no; it can equally be binary or multiclass, whatever you want. I'll solve a problem in front of you and it will make everything clear. So let's say I have features X1, X2, X3, X4 in my dataset, and my y takes the values yes or no. How do I write it? We really need to understand this: I ask, what is P(y = yes | xi) for a given record xi?
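The classifier in symbols — the conditional-independence assumption is exactly what makes it "naive":

```latex
P(y \mid x_1, \ldots, x_n)
  \;=\; \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}
  \;=\; \frac{P(y)\, \prod_{i=1}^{n} P(x_i \mid y)}
             {\prod_{i=1}^{n} P(x_i)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```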
I this is my second record of X of I so I may write like this what is the probability of Y being yes if x of I is given to you X of I basically means X1 X2 X3 X4 so here you'll obviously write what kind of equation you'll basically say probability of yes multiplied by probability of yes multiplied by probability of X of 1 given yes multiplied by probability of X2 given yes probability of x3 given yes and probability of X4 given yes divided by probability of X1 multiplied by probability of X2 multiplied by probability of x3 multiplied by probability of X4 Y is fixed it may be yes or it may be no but with respect to different different records this value may change similarly if I write probability of Y is equal to no given X of I what it will be then it will be probability of no multiplied by probability of X1 given no then probability of X2 given no probability of x3 given no and probability of X4 given no so here because every any input that I give any input X of I that I give I may either get yes or no so I need to find both the probability so probability of X1 multiplied by probability of X2 multiplied by probability of x3 multiplied by probability of X4 see with respect to Any X of I the output can be yes or no and I really need to find out the probabilities so both the formula is written over here what is the probability of with respect to yes and what is the probability with respect to no now in this case one common thing you see that this this denominator is fixed this is definitely fixed it is fixed it is it is not going to change for both of them and I can consider that this is a constant so what I can do I can definitely ignore so here I can definitely ignore these things ignore this also ignore this Al because see this is constant so I don't want to consider this in the next time I'll just use this specific formula to calculate the probability now let's say that if my first probability for a specific data set yes of X of I is let's say that I'm getting 
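As a minimal sketch, here is how that comparison looks in code. The helper name and the probability values are made up for illustration; in practice the conditional probabilities come from the frequency tables we build next.

```python
# Naive Bayes: the denominator P(x1)*...*P(xn) is the same for both classes,
# so we only compare the numerators: prior * product of likelihoods.

def numerator(prior, likelihoods):
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Hypothetical conditional probabilities for one record x(i):
score_yes = numerator(0.6, [0.5, 0.4])   # P(yes) * P(x1|yes) * P(x2|yes)
score_no  = numerator(0.4, [0.3, 0.2])   # P(no)  * P(x1|no)  * P(x2|no)

# Normalize so the two scores behave like probabilities summing to 1:
p_yes = score_yes / (score_yes + score_no)
p_no  = 1 - p_yes
print(round(p_yes, 3), round(p_no, 3))   # prints 0.833 0.167
```

Dropping the shared denominator never changes which class wins, because dividing both scores by the same positive constant preserves their order.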
Suppose for a specific record x(i) the numerator for yes comes out as 0.13, and similarly the numerator for no comes out as 0.05. You know that in binary classification, if a probability is greater than or equal to 0.5 we consider the output as 1, and if it is less than 0.5 we consider it as 0. But 0.13 and 0.05 are not real probabilities yet, because we dropped the denominator, so we do something called normalization. For the probability of yes given x(i), normalization is nothing but 0.13 divided by (0.13 + 0.05), which is about 0.72, that is 72%. Similarly, for the probability of no given x(i), it will be 1 - 0.72, which is the remaining 0.28, nothing but 28%. So your final answer will be yes. These formulas you have to remember.

Now we'll solve a problem, and this will be a very interesting problem. Let's say I have a data set with features like day, outlook, temperature, humidity and wind; let me just copy this data set for you all. From this data set I want to take out some information, so let's build a table from the Outlook feature. See, over here day, outlook, temperature, humidity and wind are the input (independent) features, and play tennis, the one you see at the end, is my output feature, which is specifically a binary classification: yes or no.

So what I'm going to do is take my Outlook feature and, based on it, create a smaller table which will give some information. First of all, find out how many categories there are in Outlook: one is sunny, one is overcast and one is rain, right? Three categories. So I'm going to write down sunny, overcast and rain, and with respect to each category I'll count how many yes and how many no there are.
So this is my Outlook table, and the columns are: yes count, no count, probability of yes and probability of no, for the categories sunny, overcast and rain.

Now, the next thing we need to find out: with respect to sunny, how many of them are yes and how many are no? Go through the rows: when outlook is sunny the answer is no, so I increase the count to one; again sunny and the answer is no, count two; again sunny with no, count three. And with sunny, how many are yes? There are two. So with respect to sunny I have 2 yes and 3 no. Understand, Outlook is my X1 feature here, let's consider it that way.

Next, with respect to overcast: count the yes rows, 1, 2, 3, 4, so four yes; and if you go and find out, there are zero no with respect to overcast. Then with respect to rain: if you count, there are 3 yes and 2 no. So the totals, if you count them all, are nine yes and five no, and 9 + 5 is 14 records in total.

Now, what is the probability of yes when sunny is given? Here you have 2/9. For overcast you have 4/9, and for rain you have 3/9. Let me write it in a simpler manner so you don't get confused: one column is my probability of yes and the other is my probability of no, and what these entries mean is P(yes | sunny), P(yes | overcast) and P(yes | rain); we divide by 9 because there are nine yes in total. Similarly for the probability of no column: P(no | sunny) is 3/5, then you have 0/5 for overcast, and 2/5 for rain.

Now let's do the same with one more feature; let's consider temperature. In temperature, how many categories do I have? Hot, mild and cool. With respect to hot, mild and cool, here also I will have yes, no, probability of yes and probability of no. Now find out: with respect to hot, there are 2 yes and 2 no. Similarly with respect to mild, there are 4 yes and 2 no. With respect to cool, there are 3 yes and 1 no. Again the totals come to 9 yes and 5 no, equal to the same thing we got before. Now go ahead and find the probabilities: P(yes | hot) is 2/9, P(yes | mild) is 4/9, P(yes | cool) is 3/9; and P(no | hot) is 2/5, P(no | mild) is 2/5, P(no | cool) is 1/5. So these two tables have now been created.

And finally, with respect to play tennis itself: the total number of yes is 9, no is 5, total 14. So if I ask what is the probability of yes by itself, it is nothing but 9/14, and the probability of no is nothing but 5/14. These two values you also require.
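As a sketch, these tables can be built mechanically from the data columns. Here I've written out a 14-row Outlook column and the play-tennis labels so that they match the counts worked out above; the `likelihood` helper name is my own.

```python
from collections import Counter

# Outlook column and play-tennis label, matching the counts in the table above
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

pair_counts  = Counter(zip(outlook, play))   # e.g. ("Sunny", "Yes") -> 2
label_counts = Counter(play)                 # {"Yes": 9, "No": 5}

def likelihood(category, label):
    """P(category | label) = count(category, label) / count(label)."""
    return pair_counts[(category, label)] / label_counts[label]

print(likelihood("Sunny", "Yes"), likelihood("Sunny", "No"))
```

The same two lines of counting work for temperature, humidity and wind; each feature gets its own small table.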
Suppose you get a new test data point where outlook is sunny and temperature is hot. Tell me, what is the output? This is my problem statement, so let me write it down. Here I will write the probability of yes given (sunny, hot). By our formula that is P(yes) * P(sunny | yes) * P(hot | yes), divided by P(sunny) * P(hot), and we can drop the denominator from the equation because it is a constant: for the probability of no I would divide by the same value. So: P(yes) I'm going to replace with 9/14, multiplied by P(sunny | yes), which is 2/9, multiplied by P(hot | yes), which is again 2/9. Cancel the nines: 9/14 * 2/9 * 2/9 = 4/126, which is about 0.031.

Now go ahead and calculate the probability of no given (sunny, hot). Here you have P(no) * P(sunny | no) * P(hot | no), divided by P(sunny) * P(hot), and again, guys, this denominator is a constant, so it gets cancelled. What is P(no)? It is nothing but 5/14, so I will write 5/14 multiplied by P(sunny | no), which is nothing but 3/5, multiplied by P(hot | no), which is nothing but 2/5. Five and five get cancelled: 5/14 * 3/5 * 2/5 = 3/35, and if you put 3 divided by 35 into a calculator, it is about 0.0857.

Let me write it down again: the probability of yes given (sunny, hot), my independent features, is nothing but 0.031, and the probability of no given (sunny, hot) is 0.0857. Now we'll try to normalize these two values.
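The arithmetic just done can be checked in a couple of lines, using the same numbers read off the two tables:

```python
p_yes_prior, p_no_prior = 9/14, 5/14      # priors from the play column

score_yes = p_yes_prior * (2/9) * (2/9)   # P(Yes) * P(Sunny|Yes) * P(Hot|Yes)
score_no  = p_no_prior  * (3/5) * (2/5)   # P(No)  * P(Sunny|No)  * P(Hot|No)

# Normalize the winning side so the two scores sum to 1:
p_no_given_x = score_no / (score_yes + score_no)
print(round(score_yes, 3), round(score_no, 4), round(p_no_given_x, 2))
# prints 0.032 0.0857 0.73
```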
The normalized probability of no is 0.0857 divided by (0.031 + 0.0857), which is about 0.73, nothing but 73%; and for yes I can basically say 1 - 0.73, which is 0.27, nothing but 27%. So if the input comes as sunny and hot, if the weather is sunny and hot, will the person play or not? The answer is no.

Okay, now my next question will be: if your new data is overcast and mild, tell me what the probability will be using Naive Bayes. You can add any number of features; we could also consider humidity and wind, and you basically create the same kind of tables to find it out. But this will be an assignment: overcast and mild, try to solve it with Naive Bayes.

So the second algorithm that we are going to discuss is something called the KNN algorithm. KNN is a very simple algorithm which can be used to solve both classification and regression, and KNN basically means K nearest neighbors. Let's first discuss a classification problem. Say I have a binary classification problem that looks like this: I have one cluster of data points here and another cluster there. Suppose a new data point comes over here; how do I say whether it belongs to this category or to that category? If I fit a logistic regression I may draw a dividing line, but in this scenario how do we come to a conclusion about which category the point belongs to? For this we basically use K nearest neighbors. Let's say my K value is five. What it is going to do is take the five nearest, closest points: say two nearest points from this category and three nearest points from that one. So here we basically see, from the distances, which are my nearest points. Now, in this particular case,
you see that the maximum number of points are from the red category: from red I'm getting three points and from white I'm getting two points. Whichever class contributes the maximum number of neighbors, we categorize the new point into that class, just with the help of distance. Which distances do we specifically use? We use two: one is Euclidean distance and the other is something called Manhattan distance. What does Euclidean distance say? Suppose these are your two points, denoted by (x1, y1) and (x2, y2); to calculate the Euclidean distance we apply the formula sqrt((x2 - x1)^2 + (y2 - y1)^2). Whereas in the case of Manhattan distance, for the same two points we calculate the distance along the axes, from here to here and then from here to here; we do not take the hypotenuse. That is the basic difference between Euclidean and Manhattan distance.

Now you may be thinking: Krish, fine, that was for a classification problem; for regression, what do we do? For regression also it is very simple. Suppose I have data points that look like this, and for a new data point I want to calculate a value. We again take the nearest five points; let's say my K is five, and K is a hyperparameter which we tune. Suppose it finds the nearest points here, here, here, here and here. Then, with K equal to 5, to find the output for this point it calculates the average of those points' values, and that average becomes the output. So between regression and classification that is the only difference. Because K is a hyperparameter, we try K from 1 to 50, check the error rate for each, and select the model where the error rate is least.
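A minimal sketch of KNN classification with both distances; the helper names here are my own, not a library API.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line (hypotenuse) distance between two 2D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    """Axis-aligned distance: no hypotenuse, just the two legs."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def knn_predict(train, query, k=5, dist=euclidean):
    """train is a list of ((x, y), label); vote among the k nearest points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
         ((5, 5), "white"), ((5, 6), "white")]
print(knn_predict(train, (0.5, 0.5), k=5))   # 3 red vs 2 white -> "red"
```

For KNN regression, the same `nearest` list would be averaged instead of voted on.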
Now, two more things with respect to K nearest neighbors: it works very badly with two things, one is outliers and the other is an imbalanced data set. Say I have one category here, like this, and another category there, and let's consider that I have an outlier sitting over here. Now, if I'm trying to classify a point near it, you can see that the truly nearest cluster is the blue one and the point belongs to the blue category, but because of this outlier the algorithm will consider the outlier as its nearest neighbor, so the point will be treated as part of that group instead. And one note on the formula for Manhattan distance: it uses the modulus, |x2 - x1| + |y2 - y1|.

So this was it from my side, guys, and yes, I've also made detailed videos about whatever topics we discussed today; you can directly go and search for that particular topic.

So this is the agenda of this session; we will try to complete all these things, and again we are going to understand the mathematical equations as well. In today's session we are basically going to discuss decision trees, and we are going to understand the exact purpose of a decision tree. With the help of a decision tree you can solve two different problems: one is regression and the other is classification. We'll try to understand both parts well; we will take a specific data set and solve those problems.

Now, coming to the decision tree, one thing you need to understand. Let's say I write this condition: if age is less than or equal to 18, I print "college". Then, else if age is greater than 18 and age is less than or equal to 35 (let me put the condition a little better this time), I print "work"; basically people need to work at this age. Else, I'm just going to print "retire". So here is my if-else condition.
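The nested if-else just described can be written straight down as code; the function name is made up for illustration.

```python
def life_stage(age):
    if age <= 18:
        return "college"
    elif age <= 35:      # i.e. age > 18 and age <= 35
        return "work"
    else:                # age > 35
        return "retire"

print(life_stage(15), life_stage(25), life_stage(60))
# prints: college work retire
```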
Now, whenever we have this kind of nested if-else condition, we can also represent it in the form of a decision tree. First of all we will have a root node. In this root node the first condition is age less than or equal to 18, so from it I will have two branches, one for yes and one for no. If this condition is true, we go to this side, and here we will have "college"; this is a leaf node. Similarly, when it is no, we go to the next condition: I again create a node and say age greater than 18 and less than or equal to 35 (sorry, I said less than 18 earlier; it should be greater than 18). Again I'll have two branches, yes or no. If it is yes I print "work", so that again becomes a leaf node, and for no I do the further split, which is "retire". So you can see that this entire code I had written has got converted into this kind of tree, where you are able to take decisions, yes or no. So can we solve a regression and a classification problem using decision trees by creating this kind of node structure? Yes.

In short, whenever we talk about decision trees, they are nothing but this nested if-else logic: using nested if-else conditions we can definitely solve a specific problem statement, but here, in a visualized way, we create the decision tree in the form of nodes. Now you need to understand what type of maths we will use. So let's do one thing: let's take a specific data set, which I will work through in front of you, and we will try to solve it; this will give you an idea of how we can solve these problems. Let me just open my snipping tool. So this is the data set I have. This data set is pretty important, because in research papers also, the people who came up with this algorithm usually take this one. Right now this particular problem statement is a classification problem, but don't worry, I will also explain how decision tree regression works.

So let's go ahead and understand how we solve this. The output feature is play tennis, yes or no: whether the person is going to play tennis or not. If I have the input features outlook, temperature, humidity and wind, is the person going to play tennis or not? That is what my model should predict with the help of the decision tree. How will the decision tree work in this case? First of all, let's consider one specific feature; let's say Outlook is my feature, so this will be my first node, which is Outlook. Now just tell me: in the whole data set, how many records have no and how many have yes?
You'll be able to find out there are nine yes (count them: 1, 2, 3, 4, 5, 6, 7, 8, 9) and five no. So nine yes and five no overall, and the first node that I have taken is Outlook. Now, in this feature, how many categories do I have? One category is sunny, you can see over here; then I have another category called overcast; then another category, rain. So I have three unique categories, and based on these three categories I will create three child nodes: this one is called sunny, this one is called overcast, and this one is called rain. That is how I'm splitting.

Now just go ahead and see: in sunny, how many yes and how many no are there? In sunny I have three no (see, one, two, three) and two yes (this one and this one). So with respect to sunny there are 2 yes and 3 no. By the way, I have randomly selected Outlook here; it is actually up to the decision tree to select the feature, and later on I'll explain how it selects which feature. I'll talk about it, don't worry. The next thing: let's check overcast. In overcast I have 1, 2, 3, 4 yes, and I don't have any no in overcast, so over here it will be 4 yes and 0 no. And finally the rain part: go and see how many yes and no there are in rain; if I take it as an example, there are 3 yes and 2 no. Understand the algorithm first, and then you'll be able to understand everything.

So, to recap: sunny definitely has 2 yes and 3 no, overcast has 4 yes and 0 no, and rain has 3 yes and 2 no. Now here you need to understand two things: one is a pure split and one is an impure split. What does a pure split mean? See, in this particular scenario, in overcast I have either all yes or all no; here I have 4 yes and 0 no, so that basically means this is a pure split. Tomorrow, if in my data set, say on day 15, the outlook is overcast, then I know directly that the person is going to play. This path is already decided, and this node is called a pure node. Why is it called a pure node? Because either you have all yes and zero no, or zero yes and all no. In this particular case I have all yes, so if I take this path I know that with respect to overcast my final decision is always going to be yes. So I don't have to split further from here; I will definitely not split more, because I don't require it. It is a pure leaf node; you can also say this is a pure leaf node. This overcast one is the node I'm specifically talking about.
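Counting the yes/no per category and checking purity can be sketched like this, using a 14-row Outlook column written out to match the counts in the table; the helper names are my own.

```python
from collections import Counter

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
           "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]
play    = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
           "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def split_counts(feature_col, labels):
    """For each category of the feature, count the labels in that child node."""
    table = {}
    for cat, lab in zip(feature_col, labels):
        table.setdefault(cat, Counter())[lab] += 1
    return table

def is_pure(counts):
    """A node is pure when only one class is present in it."""
    return len(counts) == 1

table = split_counts(outlook, play)
print(table["Overcast"], is_pure(table["Overcast"]), is_pure(table["Sunny"]))
```

Running this shows the overcast node holding only yes labels (pure), while the sunny node mixes 2 yes and 3 no (impure).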
Now let's talk about sunny. In the case of sunny you have 2 yes and 3 no, so this is obviously impure. So what do we do? We take the next feature (again, how we calculate which feature to take next, I'll discuss shortly); let's say after this I take up temperature, and I start splitting again, since this node is impure. This splitting happens until we finally get a pure split. Similarly, with respect to rain we will go ahead, take another feature, and keep on splitting unless and until we get a leaf node which is completely pure. I hope you understood how this exactly works.

Now, two questions. The first is: Krish, how do we calculate this purity, and how do we come to know that a split is pure? Just by seeing, I can definitely say; by counting how many yes and no there are, I can say whether it is a pure split or not. But the algorithm needs a measure, and for this we use two different things: one is entropy and the other is something called Gini impurity. So we will understand how entropy works and how Gini impurity works in a decision tree, which will help us determine whether a split is pure and whether a node is a leaf node. Then, coming to the second thing, your most important question from before: why did I select Outlook? How are the features selected? Here you have a topic called Information Gain, and if you know both of these, your problem is solved.

So now let's go ahead and understand entropy, Gini impurity and Information Gain. I'll call it Gini impurity, not coefficient. I hope everybody has understood till here. Let's go ahead and discuss the first thing, that is
entropy: how does entropy work, and how do we use the formula? So, entropy, and here I'll also write Gini, since we are going to discuss both. The entropy formula is given by

H(S) = -p+ log2(p+) - p- log2(p-)

and the Gini impurity formula is

Gini = 1 - sum over i from 1 to n of (p_i)^2.

I'll also talk about when you should use Gini impurity and when you should use entropy; note that by default, decision tree classification uses Gini impurity.

Now let's take one specific example. I have a feature 1 as my root node, and let's say in this root node I have 6 yes and 3 no, very simple. Say this feature has two categories, and based on these two categories a split happens: in category C1 I have 3 yes and 3 no, and in category C2 I have 3 yes and 0 no. Always understand: if I do the summation, 3 + 3 is obviously the 6 yes, and 3 + 0 is obviously the 3 no. The child counts must add up to the root node's counts; this you need to understand.

Now let's go ahead and calculate the entropy of the C2 node, the one with 3 yes and 0 no. I've already shown you the entropy formula over here; now let's understand the components. I will write H(S) = -p+ log2(p+) - p- log2(p-). What is p+? It basically means the probability of yes out of this node, and p- means the probability of no: when I say plus, that basically means yes, and when I say minus, that basically means no. Plus and minus are specifically for a binary class, positive and negative.
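Both formulas are easy to put into code for the binary case. This is just a sketch; `p_plus` is the fraction of yes in the node, and the convention 0 * log2(0) = 0 is handled explicitly.

```python
import math

def entropy(p_plus):
    """H(S) = -p+ log2(p+) - p- log2(p-), with 0*log2(0) taken as 0."""
    h = 0.0
    for p in (p_plus, 1 - p_plus):
        if p > 0:
            h -= p * math.log2(p)
    return h

def gini_impurity(p_plus):
    """1 - sum of squared class probabilities."""
    return 1 - (p_plus ** 2 + (1 - p_plus) ** 2)

print(entropy(3/3), entropy(3/6))   # pure node -> 0.0, 50/50 node -> 1.0
```

For the C2 node above (3 yes, 0 no), entropy(3/3) gives 0, confirming a pure split, exactly as we are about to calculate by hand.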
So, what is the probability of yes for this C2 node? Out of the 3 total, all 3 are yes, so p+ is 3/3. Similarly, the next term: log2(p+) is log2(3/3). Then we have the minus term with p-: p- is 0/3, and the term (0/3) log2(0/3) obviously becomes zero, because 0 divided by anything is zero. And what is the first term? 3/3 is 1, and log2(1) is nothing but zero. So

H(S) = -(3/3) log2(3/3) - (0/3) log2(0/3) = 0.

Tell me, is this a pure split or an impure split? It is a pure split, and whenever we have a pure split, the entropy is going to come out as zero.

Here I'm going to define one graph: on the vertical axis H(S), and on the horizontal axis p+ (or p-). See, when the probability of plus is 0.5, what will the probability of minus be? It will also be 0.5, right? Because it's just like p = 1 - q: if p is 0.5 then q is 1 - p, the same thing. And when it is 0.5, my H(S) will be 1, as we'll verify; that is the curve that gets formed.

Now let's calculate the entropy of the other node, guys, the C1 node with 3 yes and 3 no. What is p+? It is nothing but 3/6, so

H(S) = -(3/6) log2(3/6) - (3/6) log2(3/6).

If you do the calculation (log2 of 1/2 is -1), here I'm actually going to get one. So when am I getting one? When you have three yes and three no, the probability is 50/50, right? So when your p+ is 0.5, your H(S) comes out as one. From the graph you can see this: at p+ = 0.5 the entropy is one, and at p+ = 0 or p+ = 1 it is zero. I hope everybody is able to understand, guys: if your p+ is zero or your p+ is one, that basically means it is a pure split, so H(S) is going to be zero. Always understand: your entropy will be between 0 and 1. This C1 node is a completely impure split, because you have 50% probability of getting yes and 50% probability of getting no. H(S) is the entropy for the sample; that is the notation I'm using.

So, whenever a split happens, the first thing done is the purity test, and the purity test is done with the help of entropy. I'll also show Gini impurity, don't worry. With entropy, if I'm getting one, that basically means it is an impure split, and if I'm getting zero, it is a pure split. So this is the graph, and this graph is basically the entropy graph. Again, understand: if your probability of getting yes or no is 0.5, that is 50/50, three yes and three no, then your entropy is going to be 1; and if your probability is completely one, that basically means
either you're getting completely yes or completely no, so your entropy will be zero; that basically means it is a pure split. So at probability 0.5 you get the peak value of one, and on either side it keeps on reducing.

So here you have understood the purity test: you use entropy to find whether a node is pure or impure, and if it is impure you go ahead with the further division of the categories; you take another feature and divide again. And you can read intermediate values off the entropy graph too: if your probability is, say, 0.3, you go to the curve and read off an entropy somewhere between 0 and 1 (at p+ = 0.3 it comes to about 0.88).

Let's go ahead and discuss the second issue. I hope everybody has followed; we have discussed checking whether a split is pure, and we have understood this much. But the next thing is: okay, fine, Krish, this is very good, you have explained it well. I know many people will say that, but there are some people I can't help. Let's say that I have some features. Coming to the second problem: how do we decide which feature to take and split on? That is the second problem we are trying to solve. Let's say I have feature 1 over here with two categories, C1 and C2. At the root I have 9 yes and 5 no, and after the split I have 6 yes and 2 no in one child and 3 yes and 3 no in the other. And in my data set I have features like F1, F2, F3; another split could instead start with feature 2, and feature 2 may have three categories, C1, C2, C3. So with respect to the root node and all the other features,
because after this I may also have to take another feature and keep on splitting based on whether each split is pure or impure, how do I decide whether to take F1 first, or F2 first, or F3 first, or any other feature? That is the major question, and for this we use something called Information Gain. First I will write the formula: Gain(S, F1) = H(S) − Σ_{v ∈ Values(F1)} (|S_v| / |S|) · H(S_v). Don't worry if you have not understood it yet; I will explain each and every parameter. Let's take the feature one split you have already seen: two categories C1 and C2, where the root has 9 yes and 5 no, C1 has 6 yes and 2 no, and C2 has 3 yes and 3 no. I will now calculate the information gain of this specific split. To compute Gain(S, F1), the first thing I need is H(S), the entropy of the root node. How do we compute that? H(S) = −p₊ log₂(p₊) − p₋ log₂(p₋). Calculate along with me: what is the probability of plus over
here in this specific root node? It is nothing but 9/14, so the first term is −(9/14) log₂(9/14), and p₋ is 5/14, giving −(5/14) log₂(5/14). This calculation comes out to approximately 0.94; check whether you're getting that, and use a calculator if you want. So I have found the entropy of the root node. Now let's look at the remaining parts of the formula: what are S_v, S, and H(S_v)? H(S_v) is the entropy of each category: you need to find the entropy of category one and of category two separately. For category C1, with 6 yes and 2 no, H(S_C1) = −(6/8) log₂(6/8) − (2/8) log₂(2/8), which comes out to about 0.81. Similarly calculate H(S_C2) for category C2, with 3 yes and 3 no; since it is an even split, you get exactly 1. Now we have all these values, so we can start substituting them into the equation: Gain(S, F1) = 0.94 minus the weighted summation. And what is S_v in that summation? It is simply how many samples fall in each category: for category one it is 8 samples, out of 14 in total.
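As a quick check on these numbers, the entropy calculation can be sketched in a few lines of Python; the counts below are the 9-yes/5-no root and the two categories from the example.

```python
import math

def entropy(p_plus, p_minus):
    # H(s) = -p+ log2(p+) - p- log2(p-); a term with p == 0 contributes 0
    h = 0.0
    for p in (p_plus, p_minus):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(round(entropy(9/14, 5/14), 3))  # root node (9 yes, 5 no)   -> 0.94
print(round(entropy(6/8, 2/8), 3))    # category C1 (6 yes, 2 no) -> 0.811
print(round(entropy(3/6, 3/6), 3))    # category C2 (3 yes, 3 no) -> 1.0
```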
I go and see over here, there are 9 yes and 5 no, which basically means 14 total samples, and category one has 8 samples, so this term becomes 8/14. Then, from the equation, you multiply by H(S_v), the entropy of category one, which is 0.81. Then go back to the diagram: category two has 3 + 3 = 6 samples, so that term is 6/14 multiplied by its entropy of 1. So the entire thing is Gain(S, F1) = 0.94 − (8/14 × 0.81 + 6/14 × 1), and after the calculation you get approximately 0.048. Amazing, but I did this with feature one only. What about feature two? Suppose for the feature two split I compute Gain(S, F2) and get 0.51. Now tell me: with which feature should I start splitting, F1 or F2? You can see that Gain(S, F2) is greater than Gain(S, F1), so the answer is simple: we will definitely use feature 2 to start the split. What you should understand here is that to select the feature to start splitting with, you calculate the information gain along all the candidate splits, and whichever has the highest information gain gets selected. Now the question arises: Krish, this is good, but you had also written about Gini impurity; what is its purpose, and why is it used? So let me go ahead with Gini impurity. Yes, you can obviously use entropy, but why Gini impurity? The Gini impurity formula is 1 − Σ_{i=1}^{n} p_i², where n is the number of output classes.
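Putting those pieces together, here is a small sketch that reproduces the whole Gain(S, F1) computation directly from yes/no counts (the exact value lands near 0.048):

```python
import math

def entropy(counts):
    # Entropy of a node given its per-class counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # Gain(S, F) = H(S) - sum over categories v of |S_v|/|S| * H(S_v)
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# F1 splits the root (9 yes, 5 no) into C1 (6 yes, 2 no) and C2 (3 yes, 3 no)
gain = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain, 3))  # -> 0.048
```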
Right now, how many outputs do I have? Two: yes or no. So I can expand it as 1 − (P(+)² + P(−)²). That is the formula for Gini impurity, and you can see the calculation is obviously much easier. Suppose I have a node with 2 yes and 2 no; how do I calculate it? 1 − ((1/2)² + (1/2)²) = 1 − (1/4 + 1/4) = 1 − 1/2 = 0.5. Now understand: this is a completely impure split. For a completely impure split, entropy gives you an output of one, whereas Gini impurity gives 0.5. So if I go back to the graph I created earlier, my Gini impurity curve looks similar but lower: at probability zero it is obviously zero, but when my probability of plus is 0.5, I get 0.5 instead of 1. That is the difference between Gini impurity and entropy. But you may be asking, Krish, when do we use which? The key is execution time, because a decision tree repeats this computation for a huge number of candidate splits.
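The Gini side of the comparison is a one-liner; the same 2-yes/2-no node that gives entropy 1 gives Gini 0.5:

```python
def gini(counts):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([2, 2]))  # completely impure binary node -> 0.5
print(gini([4, 0]))  # pure node -> 0.0
```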
With something like 100 features you'll keep on comparing and dividing across many, many candidate splits and computing an information gain like this each time. So which is faster, entropy or Gini impurity? In entropy you have a log function; in Gini you have simple arithmetic. Of the two, the one that takes more time is entropy. So if you have a huge number of features, like 100 or 200, and you are planning to apply a decision tree, I would suggest using Gini impurity rather than entropy; if you have a small set of features, you can go ahead with entropy. So with respect to speed, Gini beats entropy. Now you may be thinking: Krish, fine, you have explained categorical variables, but what if I have a numerical feature? Let's say I have a numerical feature F1 and an output column, with values like 2.3, 1.3, 4, 5, 7, 3. This is a continuous feature, so how will the decision tree calculate entropy and information gain for it? First of all it will sort these values, so in F1 I get 1.3, then 2.3, then 3, then 4, then 5, and then 7. Now, whenever you have a continuous
feature, this is how it will basically work. First, the decision tree takes the first value and forms the condition: is it less than or equal to 1.3? That gives you two branches, yes and no. On the yes side you'll have one record, and on the no side the remaining five records, and you can count how many yes and no labels land on each side; the yes side here will definitely be a leaf node. In this first instance the tree calculates the information gain of that split. Once that information gain is obtained, it takes the first two records and forms the next condition, less than or equal to 2.3, so now two records fall on one side, with their yes/no counts, and all the remaining records go to the other side; again the information gain is computed. Then it moves to the next value, less than or equal to 3, creates those nodes, counts the yes and no labels, and computes the information gain once more. It does this for each and every record, and finally whichever threshold gives the highest information gain is the value selected for that feature, and the node is split there. So whenever you have a continuous feature, this is how the best information gain gets found, and from there the splitting happens.
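That threshold search can be sketched as follows. The yes/no labels here are made up for illustration, and the candidate cut points are midpoints between consecutive sorted values, which is how scikit-learn does it; the "less than or equal to each value" scheme from the walkthrough behaves the same way.

```python
import numpy as np

def best_threshold(values, labels):
    # Try "x <= t" for every candidate threshold t and keep the one
    # with the highest information gain.
    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)               # sort the feature first
    values, labels = values[order], labels[order]

    parent = entropy(labels)
    best_t, best_gain = None, -1.0
    for t in (values[:-1] + values[1:]) / 2:  # midpoints as candidate cuts
        left, right = labels[values <= t], labels[values > t]
        n = len(labels)
        gain = parent - len(left)/n*entropy(left) - len(right)/n*entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

f1 = [2.3, 1.3, 4, 5, 7, 3]   # the transcript's unsorted values
y  = [0, 0, 1, 1, 1, 0]       # hypothetical yes/no labels
t, g = best_threshold(f1, y)
print(t, g)  # best cut is 3.5, with information gain 1.0 (a perfect split)
```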
Now let's go ahead and understand how this all works in a decision tree regressor, because in a decision tree regressor my output is a continuous variable. Suppose I have feature one, feature two, and a continuous output; any value can be there. How do I split in this case? Let's say the F1 feature gets selected. When it is selected, first of all the mean of the output values in that node gets calculated, and that mean becomes the node's prediction. And here the cost function used is not Gini impurity or entropy; here we use mean squared error (or you can also use mean absolute error). What is mean squared error? MSE = (1/n) Σ_{i=1}^{n} (ŷᵢ − yᵢ)², where ŷ is the node's mean prediction. So first, based on the F1 feature, it assigns a mean value and computes the MSE, and then it goes ahead and does the splitting. After a split, some records go to each child node; each child gets its own mean value as its output, and the MSE is calculated again there. As the MSE gets reduced, that basically means we are getting near the leaf node, and the same thing happens on the other branches. Finally, when you follow a path down the tree, whatever mean value is present at the leaf you reach is your output. That is the difference between the decision tree regressor and the classifier: instead of entropy and so on, you use mean squared error or mean absolute error.
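The "each node predicts its mean, and splits are scored by MSE" idea can be sketched like this; the numbers are just an illustration using the output values from the next example.

```python
import numpy as np

def split_mse(y_left, y_right):
    # Each child node predicts its own mean; the split's cost is the
    # sample-weighted MSE of the two children.
    def mse(y):
        y = np.asarray(y, dtype=float)
        return np.mean((y - y.mean()) ** 2)

    n = len(y_left) + len(y_right)
    return len(y_left)/n * mse(y_left) + len(y_right)/n * mse(y_right)

y = [20, 24, 26, 28, 30]
print(round(split_mse(y[:2], y[2:]), 2))  # -> 3.2
```

The tree would evaluate this weighted MSE for every candidate split and pick the one with the lowest value.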
Now let's go to one more topic, which is hyperparameters. Tell me: if I keep on growing a decision tree to any depth, what kind of problem will it face? But first, the regressor part once more, since someone asked. Let's do a decision tree regressor: I have feature F1 and an output with values like 20, 24, 26, 28, 30, and feature one has some categories. Say I have done the division by F1. Initially, the mean of all the output values gets assigned to the root, and using mean squared error you compute the cost, suppose some value like 37 or 47. Then I split and get two or three more nodes, depending on the categories; in each of those nodes the mean is recalculated, say when two records go to one side, and the MSE gets calculated again. I'm just taking this as an example, so try to picture it. Now, about hyperparameters: always understand that a decision tree leads to overfitting, because by default we keep dividing the nodes to whatever depth we want. In order to prevent overfitting we perform two important techniques: post-pruning and pre-pruning. Let's say I have done some splits and at one node I have 7 yes and 2 no, and I could split further. In this scenario you know that with 7 yes and 2 no there is close to an 80% chance that this node is saying the output is yes. So should we prune further down? The answer is no: we can stop here and cut the branch. This technique is
basically called post-pruning. That basically means you first create your full decision tree, then look at it, see whether there is an unnecessary branch, and just cut it. There is one more thing, called pre-pruning. Pre-pruning is decided by hyperparameters. What kind of hyperparameters? Not the number of decision trees; here you can set things like the max depth and the maximum number of leaf nodes. You can tune all these parameters with GridSearchCV, try combinations, and come up with a pre-pruning setup. So that is the idea of the decision tree regressor. (Question from chat: is it possible for the Gini value to be one?) No; for a binary problem it will always be between 0 and 0.5. Now, first things first, as usual we should import the libraries: import pandas as pd, import matplotlib.pyplot as plt, and so on. Then I will take any data set I want: from sklearn.datasets import load_iris, and load the iris data set by calling load_iris(). Then the next step, once you have your iris data set, is to look at iris.data.
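The pre-pruning idea just described, tuning max depth and leaf count with GridSearchCV, might look like this on the iris data used below; the particular parameter grid is just an example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Pre-pruning: constrain tree growth via hyperparameters,
# chosen by cross-validated grid search
param_grid = {"max_depth": [2, 3, 4, None],
              "max_leaf_nodes": [3, 5, 10, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(iris.data, iris.target)
print(grid.best_params_, round(grid.best_score_, 3))
```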
These are all my features; the four features are petal length, petal width, sepal length, and sepal width. This is my set of independent features. Then, if I want to apply a classifier, a decision tree classifier, I first import it: from sklearn.tree import DecisionTreeClassifier. (I misspelled sklearn at first and got a "no module" error, but once the spelling is right it imports fine.) Right now I'm just going to overfit the data, and then I'll show you how you can go ahead with pruning. What are the default parameters? If you look at the classifier, the first parameter is criterion, which is gini by default. Then you have splitter, which controls how you're going to split; there are two types, best and random (random selects the features randomly), and you should generally go with best. max_depth is a hyperparameter, min_samples_leaf is a hyperparameter, and max_features, how many features to consider when fitting, is also a hyperparameter. So all of these are hyperparameters. I will just execute the decision tree with the defaults, and the next thing I'm going to do is draw the decision tree. For this I use plt.figure with a figure size, say figsize=(15, 10), so everybody can see it better, and then tree.plot_tree(classifier, filled=True) so the node colouring is filled in.
Okay, I also have to import tree, so: from sklearn import tree. Again I'm getting an error, "has no attribute plot"; checking the documentation, the function is plot_tree, with an underscore. Now what is the next error? "Not fitted yet", sorry, so I fit first: classifier.fit(iris.data, iris.target). Once that is done it executes, and this is how your graph will look. You can see some amazing things here: there are three output classes. On the left-hand side you get a leaf node straight away; this first one is probably the versicolor flower. On the right-hand side you can see a 50/50 node, so based on one feature you get a leaf node, and based on another branch you get the 50/50 split, which then splits on two more features. Here you have 49 and 5, and here you have 47 and 1. Do we require this split? Anybody tell me: from here, do we require any more splits? Just think; this is the post-pruning view, where I want to find out whether more splits are required or not. After a 49-and-5 node, do you require any more splits? You do not. Here you are getting 47 and 1; after this also you require no further split. Understand this: that is basically post-pruning, so you can decide your level and cut accordingly. (Question: the Gini value here is more than 0.5?) Yes, one node shows 0.667, and that is because with three output classes the maximum Gini is 1 − 3 × (1/3)² ≈ 0.667, so the 0-to-0.5 bound only applies to binary problems; everywhere else you can see you're getting less than 0.5. So plotting the graph itself is very easy.
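Putting the fixes from that walkthrough together (the tree import, the fit call before plotting), a cleaned-up version of the script might look like this; saving to a file instead of plt.show() is my addition so it also runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop these two lines in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier()   # defaults: criterion="gini", splitter="best"
clf.fit(iris.data, iris.target)  # must fit first, or plot_tree raises NotFittedError

plt.figure(figsize=(15, 10))
plot_tree(clf, filled=True, feature_names=iris.feature_names)
plt.savefig("iris_tree.png")
```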
You use from sklearn import tree, then call tree.plot_tree on the fitted classifier with filled=True, and you can just do this. So let me define the agenda, what all things are there. First we'll understand ensemble techniques, and within ensemble techniques we are going to discuss the difference between bagging and boosting. Then we are going to cover random forest, then AdaBoost, and if I have more energy I will also try to cover XGBoost. We'll discuss all these algorithms, so let's go ahead and start the topics. The first topic is ensemble techniques. What exactly are ensemble techniques? Till now we have solved two different kinds of problem statement, classification and regression, and you have learned different algorithms: linear regression, logistic regression, KNN, and yesterday we discussed naive Bayes. With respect to any classification or regression problem, we were discussing only one algorithm at a time and trying to solve the problem with that single algorithm. Now the next question is: can we use multiple algorithms to solve a problem? If I ask that, I will definitely say yes, we can, because we are going to use something called ensemble techniques. Now, what are these ensemble techniques?
In ensemble techniques we use two different approaches: one is something called the bagging technique, and the other is something called the boosting technique. What exactly do we do in bagging, what do we do in boosting, and how do we combine multiple models to solve a problem? Let's first discuss bagging. How does bagging work? Let's say I have a data set D with rows and columns, many features like F1, F2, F3, and an output. Now, in bagging we create models, and each model can be anything: for a classification problem, say model M1 is logistic regression, M2 is another model like a decision tree, M3 is a KNN classifier, and M4 is again a decision tree; that's fine. So you can see we have used several models. Now, with respect to these models, the first step is that from the data set I take up some rows: I do row sampling and create a sample D′, where D′ is always smaller than D, and I push those rows to M1, which trains on them. Let's say that out of 10,000 records I'm doing a row sampling of 1,000
rows and giving them to M1 to train on. Then for model M2 I again do row sampling, sample some of the rows, and give them to it; and remember, some rows may get repeated between this D′ and the next D′′. Similarly I do row sampling for the other models, so I may have D′′′ and D′′′′ as well: different data points (when I say row sampling, I'm talking about data points) go to separate models, and each model trains on its own sample. So if 10,000 is my total number of data points, D′ may be 1,000 points, D′′ may be another 1,000 points with some rows repeated, and so on; here specifically row sampling is used, and each model gets trained on a different slice of the data. Now, how does inferencing happen for test data? First things first, let's say I get a new test record. It is passed to M1, and suppose M1 gives zero as its output (say we're doing binary classification). Next, for the same test record, M2 gives one, M3 gives one, and M4 also gives one. Now what happens in this case? M1 has predicted zero, M2 has predicted one, M3 one, and M4 one, so finally all these outputs get aggregated, and the simple rule that gets applied is majority voting.
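A from-scratch sketch of this row-sampling-plus-voting scheme follows; the data set and the specific model choices are made up for illustration, and ties here fall to class 0 (with 100+ models, ties are rare, as noted below).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, random_state=0)

# Heterogeneous models, as in the example: logistic, decision trees, KNN
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=1)]

# Row sampling: each model trains on its own D' (rows drawn with
# replacement, so records can repeat across samples)
for m in models:
    idx = rng.integers(0, len(X), size=len(X))
    m.fit(X[idx], y[idx])

# Bootstrap aggregation: majority vote across the models' predictions
preds = np.array([m.predict(X[:5]) for m in models])
majority = (preds.sum(axis=0) > len(models) / 2).astype(int)
print(majority)
```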
So tell me, what will the output be with respect to this? The output will obviously be one, because by majority voting you can see three models are saying one, so my output here will be one. This is the concept of bagging: you are providing different rows (with, in this case, all the features) to different models, which here are classification models, and then you are combining them by majority voting and getting the answer, one. This aggregation step is what makes it a bootstrap aggregator: you're aggregating the outputs coming from all the individual models. Now many people will ask: Krish, what about a tie, where 50% of the models say yes and 50% say no? Always understand, guys, that in this kind of setup we will have 100 to 200-plus models, so there is a very high probability that a clear majority will always be available; an exact tie is an unlikely scenario. Some people will also say: Krish, why are you using different models? I am not discussing random forest here; random forest uses only one type of model, the decision tree. But as a general concept of bagging, you can have different models and combine them. So this is one family of ensemble techniques, and it is basically called bagging. Now, one point I missed: this was with respect to the classification problem. What will happen in the case of a regression problem? Let's say I got 120 here, 140 here, 122 here, and 148 here as my outputs. In regression, what will
happen is that the mean will be taken: the mean of the outputs will be your model's final output. Average or mean, very simple, right? So the average is taken, and based on that you solve the regression problem. Great. Now let's go ahead and look at which algorithms fall under bagging and boosting; but before that I need to make you understand what exactly boosting is. In bagging you have seen that you have parallel, independent models: you give row samples to the different models and combine their outputs. In boosting, by contrast, you have a sequential combination of models: a chain of models one after another, so the training data goes to the first model, then on to the next, and the next, and so on; these are my M1, M2, M3, M4, and finally I get my output. These M1, M2, M3 are what we call weak learners, and once I combine all these weak learners, the result becomes a strong learner. So you have all the models sequentially one after the other, you pass from one model to the next, and each of these models on its own is a simple weak learner that will not predict well; but when you combine them sequentially, you get a strong learner. How exactly this works I'll show with the examples of AdaBoost and XGBoost. A weak
learner basically means the prediction on its own is quite bad, but as you go sequentially and combine them, they become a strong learner. One example I want to give you: let's say model one is a physics teacher, model two is a chemistry teacher, model three is a maths teacher, and model four is a geography teacher. If you are trying to solve a problem and the physics teacher is not able to solve it, then perhaps the chemistry teacher can help, or the maths or geography teacher. When we combine that much expertise together, they will be able to give you the output in an efficient way. (Sumit, I'll come to whether all the features are passed to all the models or not; just give me some time.) In short, if someone asks in an interview what boosting is, you can say: it is a sequential set of models combined together; each individual model is a weak learner, and combined they become a strong learner that gives an excellent output. And right now, most Kaggle competition entries use some type of boosting or bagging technique. So, as I said, we have bagging and boosting. In bagging, the algorithms we specifically use are the random forest classifier and the random forest regressor, which I'm going to discuss right after this. In boosting, we use techniques like AdaBoost, Gradient Boost, and number three, Extreme Gradient Boosting, which we also call XGBoost. So let's go ahead and discuss
the first algorithm which is called as random forest classifier and regressor now first thing first let's understand some things from the yesterday's class I hope uh what is the main problem with respect to decision tree whenever we create a decision tree without any hyperparameter it does it not lead to overit does it not lead to overfitting uh whenever you probably have a decision tree right it leads to something like overfitting why overfitting because it completely splits all the feature till it's complete depth overfitting basically means for training data the accuracy is high for test data the accuracy is low so training data when the accuracy is high I may basically say it as high bias and then I may basically say it as sorry not high bias low bias and high V variance so low bias and high variance yes obviously we can do pruning and all guys but again understand pruning is an extensive task probably if your if you have 100 features if you have data points which is like 1 million to do pruning also it is very much difficult yes pre pruning can be done but again we cannot confirm that it may work well or not so right now with respect to decision tree you have this specific problem that is low bias and high variance now in low Biance and high variance you know that my model is basically the generalized model that I should get it should have low bias and low variance so if somebody asks you why do you use random Forest you can basically explain about decision trees like this now my main aim is to convert this High variance to low variance now I will be able to convert this High variance to low variance using random forest classifier or random Forest regressor now what does random Forest do random Forest is a bagging technique similarly I have a data set over here let's say that I have this data set and then here I will be having multiple models like M1 M2 M3 M4 let's say I have this four models like this we have many many models now with respect to this models 
this models all the models are actually decision Tree in random forest all are decision trees you don't have a different model over there so over here you can see that all the models are decision trees that is going to get used used in random Forest so decision trees always gets used in random Forest the first thing that you should know now whenever we are using decision trees you know that decision tree if I by default if we try to create it it may lead to overfitting and because of that every decision tree will basically create low V low bias and high variance but if we combine in the form of bootstrap aggregator this High variance will be getting converted to low variance because why because majority of voting we will be taking from this particular decision trees like there will be many many decision tree so they lot of outputs will be coming and with the help of majority voting classifier this High variance will get converted to low variance now in random Forest how it works in the first case if I talk about random Forest over here two things basically happen with respect to the D- data set let's say in first model we do some kind of row sampling plus Feature Feature sampling that basically means we have to select some set of rows and some set of features and give it to M1 similarly you do row sampling and feature sampling and give it to M2 then you do row sampling and feature sampling you give it to M3 and then you do row sampling and feature sampling you give it to M4 now when you do this so what will happen independently you're giving some features along with some rows now there may be a situation that your features may also get repeated it may also get repeated your records or data points may also get repeated so when you are probably training your model with this specific data sets and specific features this model become expert in predicting something right as I said one example over here I'm giving a physics model some data I'm giving chemistry data 
chemistry model with some data similarly here I'm giving some information to some model so the model will be an expert with respect to that specific data So based on all this particular data whenever I get a new test data so what will happen suppose let's say that this this is a classification problem the M1 model will be predicting zero this will be predicting one this will be predicting zero and this will be predicting zero now in this particular case again the majority voting classifier or majority voting will happen in the case of classification problem and then here you will be specifically able to get the output as zero so I hope everybody is able to understand all the models over here are decision trees and based on that you will be doing see when in I interview should be very very uh things the things that I'm telling you over here is all all the points are very much important and similarly if you tell the interviewer definitely your interview is cracked in this kind of algorithm I've seen some of my students saying that okay uh Kish um when the interviewer asked me that which is my favorite algorithm I said random Forest I told why did you say like that because he said that because that person let me let him ask any questions in random Forest I'm very much confident about it and I'm also going to prove him you know why they are very very good so with this specific case here you can basically see that because of the overfitting condition of the decision tree you're combining multiple decision tree so that you get a generalized model which has low bias and low variance so I hope everybody is able to understand boost feature sampling basically means suppose if I have 1 2 3 four feature for the first model I may give two features for the second model I may get three features for the fourth model I may give four features or uh any one feature ALS I can give to a specific model so internally that random Forest it take carees of over here these things are there 
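The row sampling, feature sampling, and majority-voting steps just described can be sketched by hand. This is a minimal illustration, not the real `RandomForestClassifier` (which you would normally use directly from scikit-learn); the toy data set, number of learners, and sample sizes are invented for the example.

```python
# Hand-rolled bagging sketch: bootstrap rows + a random subset of features
# per decision tree, then a majority vote over all trees.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                 # 4 features, toy data
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # only features 0 and 1 matter

models = []
for _ in range(25):                           # 25 bagged weak-ish learners
    rows = rng.choice(len(X), size=len(X), replace=True)   # row sampling (with repetition)
    feats = sorted(rng.choice(4, size=2, replace=False))   # feature sampling
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[np.ix_(rows, feats)], y[rows]) # each tree sees its own slice
    models.append((feats, tree))

def predict_one(x):
    # Every tree votes; the majority class wins (classification case).
    votes = [int(t.predict(x[f].reshape(1, -1))[0]) for f, t in models]
    return Counter(votes).most_common(1)[0][0]

acc = np.mean([predict_one(x) == label for x, label in zip(X, y)])
print(f"bagged ensemble training accuracy: {acc:.2f}")
```

For the regressor variant, the only change is the aggregation step: replace the majority vote with the mean of the tree outputs.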
And this is how random forest works. The only difference between the random forest classifier and the random forest regressor is that in regression, instead of a majority vote, you take the mean of all the model outputs: you just average them, and that average is your final prediction.

Now let's talk about some important interview points around random forest. First question: is normalization required in random forest? (When I say normalization or standardization, I'll just say standardization.) You can ask the same for a decision tree. The answer is no, because a decision tree works by splitting on features; if you scale the data down, the splits carry the same information, so it does not matter much. But if I ask whether standardization or normalization is required in KNN, the answer is yes, because KNN uses two distance measures, Euclidean distance and Manhattan distance, and you definitely have to apply standardization so that the distance computation behaves properly. This is one of the most common interview questions asked around random forest. Third question: is random forest impacted by outliers? The answer is no; do go and check this out on Google as well. Fourth question: is the KNN algorithm impacted by outliers? The answer is a big yes. So these are all interview questions that need to be covered.

Now let's go ahead and discuss AdaBoost. In bagging, most of the time we use random forest, or you can also create custom bagging techniques: take whatever combination of algorithms you want and combine their outputs; this you can even do manually, by hand. So the second thing we are going to discuss is the boosting technique, and the first boosting algorithm we will cover is AdaBoost; we are going to see how AdaBoost works. In boosting, you have heard that we solve things in a sequential way; I know there is a lot of confusion around this, so let's work through a problem. Suppose I have a data set with features f1, f2, f3, f4 and an output column with values like yes or no, and suppose I count the records: 1, 2, 3, 4, 5, 6, 7, so there are seven records.

The first thing we do in AdaBoost is define a weight, and it is very simple: initially, we give all the input records an equal weight. How? We count how many records there are, seven in this case; every record gets an equal weight between 0 and 1 such that the overall sum is 1. So if I give 1/7 to every record, that is an equal weight for everyone, and if I do the total sum it will obviously be 1.

Now, what do we do after this? The first thing in AdaBoost is that we take one of the features to split on. How do you decide whether to go with f1, f2, or f3? We do it with the help of information gain along with entropy or Gini impurity, exactly as in a decision tree, because here too you make decision trees. Suppose out of feature one, feature two, feature three, you select feature one because its information gain is highest, so you use f1 and split on it. But when I divide on it, this decision tree's depth will be only one, and since it has only one level of depth we call it a stump. So what we do here is create a decision tree by taking only one feature and dividing only to one level, one depth, and that is called a stump. And that stump is what AdaBoost uses as its weak learner; there is a reason we call it a weak learner, which is exactly that it is only a one-level tree. So the first point about AdaBoost: the weak learner is a stump, a one-level decision tree, where the feature is selected based on information gain and entropy.

The next step: we pass all the records through this stump built on f1 and train it, training with only this one-level decision tree. After training, we pass all the records through again to find out how many this decision tree gets correct and how many wrong. Let's say that out of the entire set of records, exactly one record was predicted wrong by this model. Now what do we do? We calculate the total error: how many of them are wrong, weighted by their weights. Only one is wrong, and its weight is 1/7, so the total error (TE) of this stump, my f1 stump, is 1/7.

That was the first step. The second step is to check the performance of this stump, and the performance is checked by a formula: performance of stump = (1/2) log_e((1 - TE) / TE). Why we are doing this, everything will make sense in just a little while. So the first step in AdaBoost is to find the total error, and the second step is to find the performance of the stump. In this case it is (1/2) log_e((1 - 1/7) / (1/7)), and once I calculate it, it comes out to approximately 0.895.

Now see the steps. Whenever I'm discussing boosting, I'm going to combine weak learners together to get a strong learner. So what will my third step be? The third step is to update all these weights, and that is exactly the reason why I calculated the total error and the performance of the stump. So I'll say: new sample weights from decision tree one, which is my stump. Why do I need to update all these weights? For the correct records, whichever were classified correctly, the updated weight should reduce; and for the wrong records, the update should increase the weight. Why? Because if I increase the weight of a wrong record, that record should go to the next weak learner. So how do we update the weights? For correct records, the formula looks like this: new weight = weight x e^(-performance of stump). So I write 1/7 x e^(-0.895), and if you do the calculation, everybody try it, the answer will be approximately 0.05.
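The total error, performance-of-stump, and weight-update arithmetic can be checked numerically. This sketch follows the lecture's 7-record example and also runs the normalization and bucket steps the lecture walks through; note that if you keep full precision instead of rounding 0.0583 down to 0.05, the normalized weights come out as roughly 0.083 per correct record and 0.50 for the wrong record (the lecture's 0.077 and 0.537 come from the rounding). The choice of which record is the wrong one is just illustrative.

```python
# AdaBoost bookkeeping for one stump on a 7-record data set,
# with exactly one misclassified record.
import math

n = 7
w = [1 / n] * n                 # step 0: equal initial sample weights (sum = 1)
wrong = 3                       # index of the misclassified record (illustrative)

total_error = w[wrong]          # step 1: TE = sum of weights of wrong records
perf = 0.5 * math.log((1 - total_error) / total_error)   # step 2: performance of stump

# Step 3: update weights -- e^(-perf) shrinks correct records,
# e^(+perf) grows the wrong record.
new_w = [w[i] * math.exp(perf if i == wrong else -perf) for i in range(n)]

# Step 4: the updated weights no longer sum to 1, so normalize them.
s = sum(new_w)
norm_w = [x / s for x in new_w]

# Step 5: cumulative "buckets" over [0, 1]; a uniform draw in [0, 1)
# lands in the wrong record's (widest) bucket most often, so that record
# is preferentially resampled for the next stump.
buckets, acc = [], 0.0
for x in norm_w:
    buckets.append((acc, acc + x))
    acc += x

print(f"TE = {total_error:.4f}, performance = {perf:.4f}")
print(f"correct-record weight = {new_w[0]:.4f}, wrong-record weight = {new_w[wrong]:.4f}")
print(f"normalized wrong-record weight = {norm_w[wrong]:.4f}")
```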
Now that was for correct records; what about the incorrect records? For incorrect records, the formula we apply is: new weight = weight x e^(+performance of stump), with a plus instead of a minus. So here I write 1/7 x e^(0.895), and if I calculate this, I get approximately 0.349. So these are the two updated weights I have got. All the records which are correct started at 1/7; their new updated weight is 0.05. You can see that initially the weight was about 0.142, and now it has reduced to 0.05 because all these records are correct; but the wrong record's value has increased to 0.349. So, counting through my seven records with the fourth one being the wrong record, my new weight column is 0.05, 0.05, 0.05, 0.349, 0.05, 0.05, 0.05.

Now tell me, guys: if I do the summation of all these new weights, is it 1? No, I don't think so; if you add them up, it is not 1, whereas the original weights did sum to 1. So here I need to find my normalized weights. To normalize, because the entire summation should be 1, all you have to do is find the sum of all these values, which will be about 0.649, and divide every number by 0.649. Tell me the answers you get: the normalized weights will look like 0.077 for each correct record and about 0.537 for the wrong record. This is my normalized weight column.

After you get the normalized weights, we create something called buckets. See, one decision tree, the stump, has already been created, and you know what output you get from it; now, in the sequential model, we attach another model after it. To create this second model, I need to provide some specific rows for it to train on, and because the first model got one record wrong, I want to feed the second model that wrong record along with some of the other data points, so that it can train on them and get the output right. So let's create the buckets. The buckets are cumulative ranges built from the normalized weights: the first bucket runs from 0 to 0.077, the next from 0.077 to 0.154, then 0.154 to 0.231; then comes the wrong record's bucket, from 0.231 to 0.768, because 0.231 + 0.537 = 0.768; and like this you keep adding each record's normalized weight to create all the buckets until you reach 1. To decide which records go to the next stump, we randomly generate numbers between 0 and 1, and whichever bucket each number falls into, that record gets selected. Now tell me, which record has the biggest bucket size? Obviously the wrong record. So if I randomly generate numbers between 0 and 1, where is the highest probability that the values land? In this case, most of the wrong records will be passed along, together with some of the other records; there are chances the other records also go to the next decision tree, but understand that the maximum share goes with the wrong records, because their bucket is the biggest. So most of the time this particular wrong record will get selected, and it will go to the second stump.

Suppose I now have all these records, and this is my first stump, this is my second stump, this is my third stump. In the same way, whichever records come out wrong from the second stump will go in maximum numbers to the third stump, and again it gets trained. Like this we will have a lot of stumps; a minimum of around 100 decision trees can be added. You know that for a new test data point, every decision tree, every weak learner, gives one output (obviously the time complexity will be more). From these outputs, suppose it is a binary classification and I get 0, 1, 1, 1: again a majority vote happens, and the output is 1. In the case of a regression problem, I will have continuous values, and for these the average will be computed, and that gives me the output. So for regression the average is taken, and for classification the majority vote happens; everywhere the same pattern carries on. Buckets are very simple, guys: based on the normalized weights we create the buckets, so that whichever records have the biggest buckets are the ones the random draws are most likely to land in; based on this random number generation it will select those specific records and
pass them into the next stump. Understand why this bucket size matters: if there were more wrong records, say four or five of them, their bucket sizes would all be bigger, and based on the random numbers drawn between 0 and 1, most of the wrong records would get selected and given to the second stump. Similarly, this second decision tree will make some mistakes, those wrong records will have their weights updated, and they will be passed to the next decision tree. And guys, when I say a wrong record is passed on, its output label stays the same, still just 0 or 1. Interesting, everyone? I hope you understood; that was a lot of maths in AdaBoost and how AdaBoost actually works. Three main things are being calculated: the total error, the performance of the stump, and the new sample weights; and the normalized weights are used because the raw updated weights no longer sum to 1. Someone asked: in boosting, why not just take the last model's output? No, no, no; we have to give importance to every decision tree's output; every decision tree's output is important.

Okay, let me talk about one more concept: the black box model versus the white box model. What is the difference? If I take the example of linear regression, tell me what kind of model it is: white box or black box? What about random forest? A decision tree? An ANN? Linear regression is called a white box model, because you can visualize how the theta values are changing and how they are coming to the global minimum, and all those things. Random forest I would call a black box model, because it is practically impossible to see how all the decision trees are working; that is why the maths inside it is so complex. If I talk about a decision tree, it is a white box model, because we know how the splits are happening; with paper and pen you would be able to work it out. In the case of an ANN, it is a black box model, because you don't easily see how many neurons there are, how they are performing, and how the weights are getting updated. So that is the basic difference between black box and white box models, and that entire discussion was the agenda of the earlier session.

Now let's start. The first algorithm we are going to discuss today is something called K-means clustering, and this is a kind of unsupervised machine learning. Always remember: in unsupervised ML, the one most important thing is that you don't have any specific output. So suppose you have feature one and feature two, and you have lots of different data points; based on this data, what we do is create clusters, and these clusters tell us which data points are similar to each other. That is what we get from clustering, and there are various techniques like K-means, hierarchical clustering, and so on. First we'll try to understand K-means and how it specifically works. It's simple: suppose you plot your data points in two dimensions, F1 against F2, and suppose there is one batch of points here and another batch there. Our main purpose is to cluster them into different groups: this will be one group and that will be the other group, two groups, because from the clusters you can see two similar kinds of data grouped together. This is my cluster one and this is my cluster two. Let me talk about this and why
specifically it's very useful, and then we'll also try to understand the math intuition. Now always understand, guys, where does clustering get used? In most of the ensemble techniques, the custom ensemble techniques I told you about: whenever we are creating a model, the first thing we can do on our data set is create clusters. So suppose this is my data set; during model creation, the first algorithm we apply can be a clustering algorithm, and after that it is perfectly fine to apply a regression or classification model. Suppose the clustering gives two or three groups: for each group we can apply a separate supervised machine learning algorithm, if we know the specific output we really want to predict. I'll talk about this and give you some examples as I go ahead.

Now let's go ahead and focus more on understanding how the K-means clustering algorithm works. The word K-means has this K value, and the K is nothing but the number of centroids. So suppose I have a data set that looks like this; just by seeing the data set, what are the possible groups you think there are? Definitely you'll say K = 2. When you say K = 2, that means you will get two groups, and each and every group will have a centroid point; the centroid determines that this is one separate group over here and that is another separate group over there. So here you can definitely say these are two groups. But how do we come to the conclusion that there are only two groups? We cannot just decide by looking at the data, because your data will have high dimensionality; right now I'm showing you two-dimensional data, but for high-dimensional data you will definitely not be able to see how the data points are plotted. So how do you come to the conclusion that there are only two groups? For this, there are some steps that we perform in K-means.

The first step is that we try different K values, different numbers of centroids, and find which K value is suitable; for this we need to know a concept called the within-cluster sum of squares, which I'll come to. So, step one: we try K values; let's say we are considering K = 2 (how to confirm that K = 2 is the right value, I'll talk about shortly). The second step is that we initialize K centroids. In this particular case my K value is 2, so I will initialize two centroids randomly in the space: let's say this one, which I'll put in one color, is my first centroid, and this other one is my second centroid. After initializing these centroids, what we have to do is find out which points are nearer to the first centroid and which points are nearer to the second. That is a very easy step: we can use the Euclidean distance to find the distance between each point and each centroid. And if I really want to show you this in an easy, visual way, I can basically draw a straight line
over here let's say that I'm drawing a straight line over here in another color I can draw a straight line and I can also draw one parallel line like this so This basically indicates that whichever points you see over here suppose if I draw a straight line in between all these points you will be able to see that let's say that I'm drawing one more parallel line which is intersecting together so from this you can definitely find out let's say that these are all my points that are nearer to this green line Green Point so what I'm actually going to do in this particular case all these points that you are seeing near the green it will become green color so that basically means this is basically nearer to this centroid and whichever points are nearer to this particular point that will become red point so that basically means this belongs to this group okay this belongs to this group so I hope everybody's clear till here then what will happen is that this summation of all the values then we initialize the K number of centroids that is done then we try to calculate the distance we try to find out which all points is nearer to the centroid let's say that this is my one centroid this is my another centroid and we have seen that okay these all points belong to this centroid it near to this particular centroid so this is becoming red so that is based on the shortage distance and here it is becoming green now the next step let's see what is the next step after this so I am going to remove this thing now the next step will be that the entire points that is in red color all the average will be taken so here again the average will be taken now third step here I'm going to write here we are going to compute the average the reason we compute the average is that because we need to update the centroid so compute the average to update centroid to update centroids so here you'll be able to see that what I'm actually doing as soon as we compute the average this centroid is going to move 
to some other location so what location it will move it will obviously become somewhere in Center so here now I'm going to rub this and now my new centroid will be this point where I am actually going to draw like this let's say this is my new centroid now similarly this thing will happen with respect to the green color so with respect to the green color also it will happen and this green will also Al get updated so I'm going to rub this and this will be my new Green Point which will get updated over here then again what will happen again the distance will be calculated and again a perpendicular line will be calculated here you can see that now all the points are towards there okay again the centroid based on this particular distance again it will be calculated and here you can see that all the points are in its own location so here now no update will actually happen let's say that there was one point which was red color over here then this would have become green color but since the updation has happened perfectly we are not going to update it and we are not going to update the centroid right so now you can understand that yes now we have actually got the perfect centroid and now this will be considered as one group and this will be basically considered as the another group it will not intersect but right by default here intersection is happening so I hope everybody's understood the steps that you have actually followed in initializing the centroids in updating the centroids and in updating the points is it clear everybody with respect to K means now let's discuss about one point how do we decide this K value okay how do we decide this K value so for deciding the K value there is a concept which is called as elbow method so here I'm going to basically Define my elbow method now elbow method says something very much important because this will actually help us to find out what is the optimized K value whether the K value should be two whether uh the K value is 
going to be three, or whether the K value is going to be four. And always understand: suppose this is my data set, and initially my data points look like this. We cannot go ahead and directly say that K equal to 2 is going to work, so obviously we are going to iterate, say for i equal to 1 to 10. For every iteration we will construct a graph of the K value against something called WCSS. Now what is this WCSS? WCSS basically means within-cluster sum of squares; that is the meaning of WCSS. Now let's say that initially we start with one centroid, and that one centroid is initialized here. If we compute the distance between each and every point and the centroid, will the total be greater or smaller? Tell me: if you calculate this distance from this centroid to every point, which is what within-cluster sum of squares is, it will always be very, very large. Let's say my first point comes somewhere here; it is obviously going to be large. So with K equal to 1 we computed the WCSS and found it is a very huge value, because we compute the distance between each and every point and the single centroid. The next thing I'm going to do is go to the next value, K equal to 2. With K equal to 2, I will initialize two centroids, and then I will do the entire process that I have written on the top. Now tell me: we compute the distance for whichever points are nearer to this green point, and for whichever points are nearer to the red point, like this.
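As an aside, the "entire process written on the top", initialize the centroids, assign each point to its nearest centroid, recompute the averages, can be sketched in plain NumPy. This is a minimal illustration on made-up blob data, not the instructor's code or scikit-learn's implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (shortest distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: compute the average of each group to update its centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: the clustering has converged
        centroids = new_centroids
    # WCSS: within-cluster sum of squared distances, used by the elbow method
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, centroids, wcss

# Made-up data: two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids, wcss = kmeans(X, k=2)
```

With two well-separated blobs the loop converges in a few iterations, and the returned `wcss` is the quantity the elbow method plots against K.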
Now, will this summation of the distances be less than the previous WCSS or not? Obviously it is going to be less than the previous WCSS. So with K equal to 2 your value may come somewhere here, then with K equal to 3 your value may come somewhere here, then K equal to 4 will come here, then 5, 6, and so on. If I join these points with a line, you'll be able to see that there is an abrupt change in the WCSS value, and this is basically called the elbow curve. Why do we call it the elbow curve? Because it is in the shape of an elbow: at one specific point there is an abrupt change, and after that it becomes almost flat. That is the reason we call this the elbow. This is a very important thing: for finding the K value we use the elbow method, but for validation purposes, to check that the model is performing well, we use the silhouette score, which I'll show you in some time. But understand that in K-means clustering we need to keep updating the centroids, and based on that we calculate the distances, and as the K value keeps increasing you'll see that the WCSS value levels off. So we really need to find the K value where the abrupt change happens: see, over here, suppose the abrupt change is here and after that it is flat; then I will take this as my K value. Obviously the model complexity will be higher, because we are going to check different K values and their WCSS values. So first we construct this elbow curve, then we see where the change is happening, we find the abrupt change, and once we get it we say that this may be the K value, K equal to 4 as an example. So, to summarize, if you really want to find the
clusters, it is very simple: we take a K value, we initialize K number of centroids, we compute the averages to update the centroids, then again we find the distances, check whether any point has changed its group, and continue that process until we get separate, stable groups. So this is the entire funda of K-means clustering. Finally, you'll see that with respect to the K value we will get that many groups: if my K value is four, that basically means I will get four different groups, like this, one, two, three, four, with K equal to 4, that is, four clusters, and every group will have its own centroid. Centroids are very important; yes, I'll show you this in the coding also. Guys, let's go toward the second algorithm. The second algorithm that we will be discussing is called hierarchical clustering. Now hierarchical clustering is very simple, guys. Let's say these are your data points on axes X and Y, and I name them like this: this is my P1 point, this is my P2 point, this is my P3 point, this is my P4 point, P5 point, P6 point, P7 point. These are the points I have named over here. Let's say these two may be the nearest points to each other; so hierarchical clustering will combine them together into one cluster, since we have computed the distance, and it will create one cluster. Now, on the right-hand side there will be another notation that we use for connecting all the points: suppose this is my P1, this is my P2, this is my P3, P4, and so on, and I will also draw P7
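The step-by-step merging being walked through here, repeatedly joining the two nearest points or groups, is what SciPy's agglomerative clustering automates. A minimal sketch with made-up 2-D points; the `single` linkage choice (nearest-point distance) is my assumption for illustration, not something from the lecture:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Each row of Z records one merge: which two groups joined, and at what distance
Z = linkage(X, method="single")

# Cutting the tree at the biggest vertical gap gives the flat clusters;
# here we ask for 2 clusters directly
labels = fcluster(Z, t=2, criterion="maxclust")
# dendrogram(Z)  # with matplotlib, this draws the bottom-to-top merge tree
```

For n points, `linkage` performs n-1 merges, which is why building the full tree gets expensive on large data sets.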
So these are my points, up to P7. Now, between these points we have pairwise distances: distance 1, 2, 3, 4, 5, 6, and so on; we have a lot of distances. Hierarchical clustering will first find the nearest pair of points, compute the distance between them, and combine them together into one group. So let's say P1 and P2 have been combined. Then it will find the next nearest pair: let's say P6 and P7 are near, so they are also combined into one group, and since their distance is obviously greater than the previous one, we get this kind of structure, and another cluster gets formed over here. Then we see that P3 and P5 are nearer to each other, so we combine P3 and P5, and let's say this distance is greater than the previous one, because we basically start with the shortest distance and work toward the longest distance. Now, the next point that is nearest to this particular group is P4, so we combine them into one group, and this P4 gets connected like this. Then, which is the nearest group: the P6-P7 group or P1-P2? Here you can see it is P1-P2, so I am going to combine these groups together; since a circle would overlap here, I will make a dot. That basically means P1-P2 gets combined with the P3-P5-P4 group, so I get another line like this. And then, finally, you'll see that P6-P7 is the
nearest group to this, so everything gets combined and it may look something like this: all the groups are combined, and finally there is one more line joining everything. This structure is called a dendrogram, which goes from the bottom leaves up to the top root. Now the question arises: how do you find how many groups there should be? The funda is very clear, guys: you need to find the longest vertical line that has no horizontal line passing through it. This is very important: no horizontal line should pass through it. What this means is that I will find the longest vertical line such that none of the horizontal lines crosses it. What is a horizontal line here? Suppose I consider this vertical line over here: if I extend this green line, it passes through this one; if I extend this line, it passes through this one, and so on. So out of these, the longest vertical line with no horizontal line passing through it is probably this line that I can see. What you do is draw a horizontal cut across at that level, and then you find how many clusters there will be by counting how many vertical lines the cut passes through. If it passes through one line, two lines, three lines, four lines, that basically means you will have four clusters. This is how we do the calculation in hierarchical clustering. Again, this may not be the perfect line; I've just drawn it with some assumptions, but if you are doing this, you have to do it in this specific way. I've already uploaded a lot of practical videos with respect
to hierarchical clustering and all. Now tell me: is maximum effort, or maximum time, taken by K-means or by hierarchical clustering? This is a question for you. Yes guys, the number of clusters may be three; here I'm just showing you how many lines the cut may pass through. So how do you determine whether maximum time will be taken by K-means or hierarchical clustering? This is an interview question. The maximum time is taken by hierarchical clustering. Why? Because, let's say I have very many data points; at that point hierarchical clustering will keep on constructing these kinds of dendrograms, and it will take a lot of time. So hierarchical clustering will take more time. It is very important that you understand which one takes more time: if your data set is small, you may go ahead with hierarchical clustering; if your data set is large, go with K-means clustering. In short, both can take a long time, but K-means will perform better than hierarchical clustering. See guys, you will be forming these kinds of dendrograms, and just imagine if you have 10 features and many data points; how are you going to do it? It will be a cumbersome process: you won't even be able to see the dendrogram properly, and manually you obviously cannot do it. So this was with respect to K-means clustering and hierarchical clustering; I hope everybody has understood. Now the next topic we'll focus on is validation. See, to validate a classification problem we use performance metrics like the confusion matrix, accuracy, true positive rate, precision, and recall. But how do we validate a clustering model? We are going to use something called the silhouette score. I'll show you what the silhouette score is; I'm going to just open Wikipedia, and this is how the silhouette score looks, a very amazing topic. How do we validate
whether my model has the right three or four clusters? Suppose I find that my K value is three; how do we confirm it? Now see, one more issue with K-means which I forgot to tell you. Let's say I have data points that look like this, and suppose I have some more data points like this. One issue is that if I make clusters over here, you'll obviously say my K value should be two: this is one cluster and this is another cluster. But because of wrong initialization of the centroids, understand, if I just randomly initialize some centroids like this, there is a possibility that we may end up with three clusters: one cluster here, one here, one here. So for the initialization of the centroids, one condition is that they should be very, very far apart. If we initialize our centroids very far apart, we will be able to find the centroids ending up exactly in the centers, because they will keep on updating and moving ahead. But if we don't initialize them far apart, there may be a situation where I really wanted only two centroids but I was getting three. This is a problem, and for this there is an algorithm called K-means++, which I will show you in the practical. K-means++ makes sure that all the centroids that are initialized are very, very far apart. We'll see in the practical application where those centroids are used. Now let me go ahead and show you the silhouette score. What the silhouette score is, I'm going to explain
in an amazing way. This is important: if someone asks you how we validate a clustering model, at that point we basically use the silhouette score. It can be used with K-means, and it can be used with hierarchical clustering too, whenever you want to validate. That is what we are going to see over here. Now, in silhouette scoring, what are the most important things? The first and most important thing is that we will compute a(i). What is this a(i)? See, three major steps happen in order to validate a clustering model with the help of the silhouette score. The first is: I take one cluster, pick one point i inside it, and then, for all the other points inside this same cluster, I compute the distance from i to each of them, take the summation, and then take the average of all these distances. So when I write distance(i, j), i is the point we are scoring and j ranges over all the other points in the same cluster; note that i here is a data point, not the centroid. And the value I divide by, C(i) minus 1, is the number of other points, so in short I am calculating the average intra-cluster distance. This is the first step, where I compute a(i). Now, similarly, the next thing we need to compute is b(i). What is b(i)? There will be multiple clusters in a K-means problem statement, and we will find the nearest cluster.
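To make a(i) concrete, along with the b(i) and s(i) that the lecture defines next, here is a small hand-computed sketch checked against scikit-learn; the tiny two-cluster data set is made up for illustration:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight, well-separated clusters; point 0 belongs to cluster 0
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0
same = X[labels == labels[i]]    # points in i's own cluster (i is row 0 of this)
other = X[labels != labels[i]]   # points of the nearest other cluster
# a(i): average distance from point i to the other points in its own cluster
a_i = np.linalg.norm(same[1:] - X[i], axis=1).mean()
# b(i): average distance from point i to the points of the nearest other cluster
b_i = np.linalg.norm(other - X[i], axis=1).mean()
# s(i) = (b(i) - a(i)) / max(a(i), b(i)), always between -1 and +1
s_i = (b_i - a_i) / max(a_i, b_i)

# scikit-learn computes the same quantity for every point
assert np.isclose(s_i, silhouette_samples(X, labels)[0])
```

With two tight, far-apart clusters, s(i) comes out close to +1, matching the intuition that b(i) being much larger than a(i) signals a good clustering.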
And in this nearest cluster I have a variety of points. So b(i) basically says that I will compute the distance between my point and every point in that other cluster. This is my cluster one and this is my cluster two, so I compute the distance from this point to this point, then to this point, then to this point, and so on; I compute the distance to every point. Once we have all these distances, we take their average. Now tell me: if I look at the relationship between a(i) and b(i), and my clustering model is good, will a(i) be greater than b(i), or will b(i) be greater than a(i)? If we have a really good model, obviously b(i) will be greater than a(i), because the distance to the other cluster is larger. So if I talk about the silhouette score, the values will be between -1 and +1: the more the value is toward +1, the better the clustering model is, and the more the value is toward -1, the more the opposite condition applies, meaning the intra-cluster distance is greater than the distance to the other cluster. That is the information being portrayed, and this is the importance of the silhouette score. Finally, when we apply the silhouette formula, you'll be able to see, let me rub all this out, guys, and show you what it is. The silhouette formula will be something like
this: s(i) = (b(i) - a(i)) / max(a(i), b(i)), defined when |C(i)| is greater than one. With this you get a value between -1 and +1: the closer the value is to +1, the better your model is, and the closer it is to -1, the worse your model is, because a value toward -1 basically means a(i) is greater than b(i). That is the outcome with respect to the silhouette score. If s is around zero, that basically means the clustering still needs to be improved. What is i over here? i is nothing but one data point; you can read it as the data point i in the cluster C(i). I hope everybody has understood this. Now let's go ahead and discuss the next topic; we have obviously finished the silhouette score over here. Let's discuss something called DBSCAN. DBSCAN is an amazing clustering algorithm; we'll try to understand how DBSCAN clustering actually works, and you'll be able to understand a lot of things from it. Now, in DBSCAN clustering, what are the important things? Let's start with DBSCAN clustering and understand some of the important points over here. The first term you really need to remember is called core points; I'll also talk about when we say core points versus the other kinds of points. So the first term I will discuss is min points, the second is core points, the third is border points, and the fourth is noise points. Okay guys, now tell me: in K-means clustering, if I have this kind of grouping, don't you think that, with the help of two
different clusters, I may combine these two groups like this? But understand what problem is happening with the second cluster: it actually contains an outlier. Let me put it very clearly: let's say I have one point over here. If I do K-means clustering, I will probably get one cluster here, and I may get another cluster somewhere here. Now understand one thing: this point is definitely an outlier, yet with K-means I am still grouping it into one of the groups. So can we have a clustering algorithm where we can leave the outlier out separately? With DBSCAN we will be able to leave the outlier out, and this point will be called a noise point, or I can also call it an outlier. So for this kind of scenario, where you want to skip the outliers, we can definitely use DBSCAN, that is, density-based spatial clustering of applications with noise, a very amazing algorithm; I have used it a lot, and nowadays I often use this kind of algorithm instead of K-means or hierarchical clustering. Now see, what are the important terms over here? First of all, you have min points. Min points is a kind of hyperparameter, and there is also a value called epsilon, which I forgot; I will write it down over here. What does epsilon mean? If I have a point like this, epsilon is nothing but the radius of a specific circle drawn around it. So epsilon is the radius over here. And what does min points equal to 4 mean? Let's say that I have taken
a point over here, and I have drawn a circle around it that looks like this, and this is my epsilon value, the radius. If I say my min points equal 4, which again is a hyperparameter, that basically means that if I have at least four points within this circle, within this epsilon radius, then this red point will actually become a core point, which is what is written over here. Someone asked: is there a particular unit of epsilon, or do we simply take the unit of distance? No, the epsilon value will also get selected through a particular procedure; I'll show you in the practical application, don't worry. Now, the next thing: let's say I have another point over here, and this is its circle with respect to epsilon. Let's say that inside this circle I have fewer than min points, but at least one core point; at that point, this point becomes something called a border point, which we have also listed over here. So: if the neighbourhood contains at least min points, the point becomes a core point, like the red one; if it only contains at least one core point, it becomes a border point. And there is one more scenario: suppose I have a point, this is my epsilon circle, and I don't have any points near it; then this will definitely become my noise point. So here I have discussed the noise point as well. I hope everybody is able to understand the key terms now.
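These terms map directly onto scikit-learn's DBSCAN, where `eps` is the epsilon radius and `min_samples` is the min points hyperparameter. A minimal sketch on made-up blob data; the parameter values here are illustrative, not a recommendation:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense blobs plus one far-away outlier point appended at the end
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [6, 6]],
                  cluster_std=0.5, random_state=0)
X = np.vstack([X, [[50.0, 50.0]]])

# eps = radius of the epsilon circle, min_samples = "min points"
db = DBSCAN(eps=0.8, min_samples=4).fit(X)

# Label -1 marks noise points (outliers); the other labels are the clusters
print(sorted(set(db.labels_.tolist())))   # e.g. [-1, 0, 1]
n_core = len(db.core_sample_indices_)     # points that qualified as core points
```

The appended outlier has nothing within its epsilon circle, so it gets label -1 and is never pulled into a group, which is exactly the behaviour K-means cannot give you.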
What is basically happening is that whenever we have a noise point, like in this particular scenario, and we don't find any core point or border point relationship for it, it just gets neglected; that basically means it is treated as an outlier. I hope everybody is able to understand: this point will be treated as an outlier, or a noise point, and it will never be taken inside a group. Now suppose I have this set of points that you see over here, red core points and all, and there is also a border point; by drawing multiple circles over here, you can see how we define the core points and the border points, and all of this can be combined into a single group. Why can it be combined? Because of how the connections work: see, this yellow point is covered by one epsilon circle, and we have one core point inside it. Remember, it should be at least one core point, not just any one point; if its circle contains at least one core point, then it becomes a border point, and that means, yes, it can be part of this specific group. So what are we doing? Whenever there is noise, we neglect it; wherever there are border and core points, we combine them. I'll show you one more diagram, an amazing diagram, which will help you understand this better compared to K-means and hierarchical clustering. Now see this, everybody: the right-hand side of the diagram is based on DBSCAN clustering, and the left-hand side is the traditional clustering method; let's say it is K-means. Which one do you think is better over here? You see, all these outliers are not combined inside a group, but for whichever points are near, as core points and
border points, separate groups are actually created. So this is how amazing DBSCAN clustering is; that is basically its outcome. Here, in K-means clustering, you can see that all these points have also been taken into the blue group as one cluster, but with DBSCAN we are able to separate them into sensible groups. So I'm telling you guys, you can directly use DBSCAN without worrying too much. Now let's focus on the practical part. I'm just going to give you a GitHub link; everybody download the code, guys. I've given you the GitHub link; quickly download it and keep your file ready. I'm going to open my Anaconda prompt and open my Jupyter notebook, and we'll do one practical problem. I've given you the link, guys; please open it. This is what we are going to do today, and it will be amazing; here you'll be able to see amazing things. How do you come to know whether overfitting or underfitting is happening? You don't know the true labels, right? So in clustering there will not be any underfitting or overfitting in that sense. Now, what will we be importing? First we'll do K-means clustering, we'll do silhouette scoring, then we'll see the output, and we'll do DBSCAN also. So what have we imported? One is KMeans for clustering, and one is silhouette_samples and silhouette_score; these are present in scikit-learn, inside metrics, which basically means we use them to validate clustering models. Now we'll execute this, and apart from that we are importing matplotlib and NumPy, and it all executes perfectly. The next step is generating the sample data from make_blobs: we are just generating some samples with two features, and we are saying that it should have four
centroids. I'm trying to generate some X and y data randomly, and this particular data set will be used for performing the clustering algorithms. Forget about range_n_clusters for a moment; we need it because we will try different cluster counts and find the silhouette score, so right now I have just initialized it with the values 2, 3, 4, 5, 6. It is very simple. If I go and look at my X data, it looks something like this: X has two features, and y is one output feature saying which class each point belongs to; that is what you can get with make_blobs. Now, how do we apply the K-means clustering algorithm? As I said, I will be using WCSS, within-cluster sum of squares. So I import KMeans, and then, for i in range(1, 11), I use different K values, or numbers of centroids, to see which has the minimal WCSS value, and I draw the graph I showed you for the elbow method. Here I use KMeans with n_clusters equal to i and the initialization technique k-means++, so that the initialized centroids are very far apart, along with random_state equal to 0; then we fit, and finally we do wcss.append(kmeans.inertia_). This inertia_ gives you the sum of squared distances between the points and their centroids, and this is what I append to the wcss list, and finally I just plot it. Now here you can see that, obviously, this graph looks like an elbow. So the point I'm going to consider is the last abrupt change: if I look at the last abrupt change, I have a specific value with
respect to this; this is my abrupt change, and from here the changes are gradual, so I'm going to select K equal to 4. Now, with the help of the silhouette score, we are going to check whether K equal to 4 is valid or not; that is what we are going to do. Let's go ahead and see how. Here you can see n_clusters equal to 4, then I get the predictions, and this is my output; this is done. Now see this code: it is a big chunk of code, and I have actually taken it directly from the scikit-learn silhouette example page; if you go and look, this code is given over there. I'm just going to talk about the important things we need to see here with respect to the different clusters: see, for clusters 2, 3, 4, 5, 6, I'm going to check whether the K value should be four or not with the help of the silhouette score. So here you can see I first go with a for loop, for n_clusters in range_n_clusters, over the different cluster values, starting with two. You can see: initialize the clusterer with the n_clusters value and a random generator seed of 10 for reproducibility. So n_clusters is first taken as two, then I do fit_predict on X, and after that I use silhouette_score on X and cluster_labels. What is this going to do? Understand, in the silhouette method, what did we discuss: it will go through all the clusters, calculate the intra-cluster distances, which is the a(i), then compute the b(i), then finally compute the score, and the value is between -1 and +1, with values toward +1 being better. These things we have already discussed, and that is what
this specific function will do, and it will give me the average silhouette value over here. That is done, and then it continues for the other cluster counts, which you can find over here. And this other code that you see is nothing so complex; it is just to display the data properly in the form of graphs. Again, I'm telling you, I did not write this code; I've taken it directly from the scikit-learn silhouette page. So just look at the plotting part; you can definitely figure it out. Let's execute it and look at the output. Now see: for n_clusters equal to 2, the average silhouette score is 0.704. I told you the value will be between -1 and +1, and I'm getting 0.704, which is very, very good. Then for n_clusters equal to 3 it is 0.588, for n_clusters equal to 4 I'm getting 0.65, which is pretty amazing, for n_clusters equal to 5 the average score is 0.563, and for n_clusters equal to 6 it is 0.45. So directly you could say that for n_clusters equal to 2 I'm getting the highest score of 0.704; so should we select n_clusters equal to 2? We should not directly conclude from this, because we also need to check whether any cluster is getting negative silhouette values. So we go down and look at the first plot: here you can see the values go from 0 to 1 and nothing dips to -0.1 or below, so two clusters were definitely able to solve the problem, and I'll keep that option with me; K equal to 2 may perform well. Now let's go to the next one. Over here you can see that for
one of the clusters the value is negative. If the value is negative, that basically means a(i) is greater than b(i): the point is, on average, nearer to a neighbouring cluster than to its own. So I am not going to prefer this, even though the clusters otherwise look fine. Understand the problem here: if I compute the distance from this point to the points in its own cluster and compare it with the distance to the other cluster, this point is obviously nearer to the other cluster — that is why I get a negative value. These dotted points are the score (around 0.58 here), and the negative ones indicate points sitting nearer to the other cluster. This you really need to understand. Similarly, for n_clusters = 4, this looks good because there are no negative values, and you can see how cleanly it has divided the points with k = 4. With five you can see some negative values, and with six there are also negative values, so I will definitely not go with six. I may go with either four or two — and whenever you have this option, always take the bigger number: take four instead of two, because it will be able to create a more generalized model. So from this I am going to take k = 4. Now, should we compare this with the elbow method? There also I got four, so both are matching. This indicates that with the help of this silhouette score we can definitely come to a conclusion and validate our clustering model in a reliable way.
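The silhouette check walked through above can be sketched with scikit-learn. This is a minimal sketch on toy data — the blob data set and the cluster range are stand-in assumptions, not the exact notebook from the session:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data standing in for the session's data set.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=42).fit_predict(X)

    # Average silhouette over all points: higher is better, range [-1, +1].
    avg = silhouette_score(X, labels)

    # Per-point values: a negative s(i) means a(i) > b(i), i.e. the point
    # is on average closer to a neighbouring cluster than to its own.
    per_point = silhouette_samples(X, labels)
    n_negative = int((per_point < 0).sum())

    print(f"k={n_clusters}: average={avg:.3f}, negative points={n_negative}")
```

As in the lecture, the right k combines a high average score with no (or few) negative per-point values — not simply the highest average.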
So I hope everybody is able to understand — this is how you validate a clustering model, and you can definitely try it out and work through the code. Till here you have understood that we compute the average silhouette value, and then for each n_clusters the per-cluster values are mapped and plotted. So that was the session. In today's session we covered many topics: K-means, hierarchical clustering, the silhouette score, and DBSCAN clustering. In tomorrow's session the pending topics are: first SVM and SVR, second XGBoost, and third PCA — let's see whether I'll be able to complete all of that. One more thing I want to teach you, because many people ask me for the definition of bias and variance. People get confused here. Let's say I have a model that gives around 90% accuracy on the training data set, but on the test data I am getting only around 70% accuracy. Tell me, which scenario is this? Most people will say it is overfitting, and when I say overfitting I describe it as low bias and high variance. So many people ask: Krish, tell me the exact definition of bias and variance. You say low bias because the model performs well on the training data, and high variance because it does not perform well on the test data — but why do we always attach bias to the training data and variance to the test data? For this you need the definition of bias. Let me write it down: bias is a phenomenon that skews the
result of an algorithm in favor of or against an idea. I'll make you understand this definition, but first look at what I have written: a phenomenon that skews the result of an algorithm in favor of or against an idea. For now, treat this 'idea' as the training data set. When we train a model with a specific training data set, the result may be in favor of that data set or against it — that is, the model may perform well on it or may not. If it is not performing well, the training accuracy is down; if it is performing well, the accuracy is good. So there are two scenarios of bias: if the result is in favor — the model is performing well on the training data set — I say it has low bias; if it is not able to perform well on the training data set, then I say it has high bias. I hope everybody is able to understand this, because many, many people have exactly this confusion. Now similarly, let's talk about variance, because the definition is very important here too. The definition of variance I will write like this: variance refers to the changes in the model when using different portions of the training or test data. Now let's understand this
particular definition: variance refers to the changes in the model when using different portions of the training or test data. We know that whenever we work with a data set we divide it into two parts, train data and test data — the test data is a part of that same data set. Initially I train the model with the training data; if it gets trained and performs well there, I am talking about bias. But when we come to the prediction side of the model, we use different portions — other training data, which may not be similar, or the test data. On the test data we make predictions, and here again I may get two scenarios, and these are what variance describes: the changes in the model when using different portions of the training or test data — in other words, whether it gives good predictions or wrong predictions. If it gives good predictions, I say it has low variance: the accuracy with respect to the test data is also very good. If I get bad accuracy on the test data, I say it has high variance. Now let's take three scenarios: model one, model two, and model three. Model one has a training accuracy of 90% and a test accuracy of 75%. Model two has a training accuracy of 60% and a test accuracy of 55%. Model three has a training accuracy of 90% and a test accuracy of 92%. Now
tell me what you will get in each case. For model one you can directly say the training accuracy is good, so with respect to bias this indicates low bias; and since the test accuracy is bad compared to the training accuracy, you say high variance — understand it through the definition. For model two you say high bias, high variance, because it is not performing well anywhere. The last scenario is the one we want: low bias and low variance — this gives a generalized model, and that is our aim when working as data scientists. Many people have asked me for these definitions, and I hope you now have an understanding of the four terms we talk about: high bias, low bias, high variance, low variance. So that was it on this topic. Okay, now let's consider a data set with salary, credit, and approval columns — we will take this sample data set and understand how XGBoost works. If salary is less than or equal to 50K and the credit is bad, the loan approval is 0, meaning he or she will not get the loan. If salary is ≤50K and the credit score is good, approval is 1; another ≤50K record with good credit again gets 1. If salary is greater than 50K and credit is bad, approval is 0; greater than 50K with good credit gets 1; and greater than 50K with normal credit also gets approval 1. So this is my data set.
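Before moving on, the bias/variance verdicts above can be captured in a tiny helper. This is a sketch, not a standard API — the `good` threshold separating "performing well" from "performing badly" is an illustrative assumption; in practice it depends on the problem:

```python
def diagnose(train_acc, test_acc, good=0.85):
    """Label the bias/variance regime from train and test accuracy.

    Low bias     = the model performs well on the training data.
    Low variance = it also performs well on unseen (test) data.
    """
    bias = "low bias" if train_acc >= good else "high bias"
    variance = "low variance" if test_acc >= good else "high variance"
    return f"{bias}, {variance}"

print(diagnose(0.90, 0.75))  # model one: low bias, high variance (overfitting)
print(diagnose(0.60, 0.55))  # model two: high bias, high variance
print(diagnose(0.90, 0.92))  # model three: low bias, low variance (generalized)
```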
also we are going to get it so this is this is my data set so how does XG boost classifier work understand the full form of XG boost is Extreme gradient boosting extreme gradient boosting so we will basically understand about extreme gradient boosting now extreme gradient boosting uh will be actually used to solve both classification and the regression problem statement so first of all let's understand how it is basically exib basically how it actually if you if you just talk about XG boost you understand that it is a boosting technique and internally it tries to use decision tree so how does this decision Tre is basically getting constructed in the case of XV boost and how it is basically solved we are going to discuss about it so whenever we start exib boost classifier understand that first of all we create a specific base model suppose if I say this is my base model and this base model will be a weak learner okay and this base model will always give an output of probability of 0.5 in the case of classification problem so suppose if I say this is probability 0.5 then I will try to create a field over here this field is called as residual field so first base model what I'm going to do any data set that you give from here to train it will always give you the output as 0.5 so this is just a dummy base model now tell me if my probability output is is 0.5 if I want to calculate the residual that basically means I need to subtract approval minus this particular value so what will be the value over here 0 -.5 will be -.5 1 -.5 will be5 1 -.5 will be5 and 0 -.5 will be -.5 and this 1 -.5 will be uh 0.5 and this will also be 0.5 let's consider that I have one more record uh and this specific record can be anything uh because I want to keep some more records over here so let's consider that I have one more record which is less than or equal to 50K and if the credit scod is normal you're going to get zero so here also if I try to find out the residual it will be minus5 now 
So I hope everybody has understood the first step: we create a base model. This base model is very important because we build all the decision trees sequentially on top of it. This first model in the sequence is something like a tree too, but it is just a base model that takes any input and gives the probability 0.5 by default. Now let's understand the steps for constructing the decision trees after creating the base model — please make sure you note them down. Step one: create a binary decision tree using the features. Step two: calculate the similarity weight. I'll talk about what exactly it is; the formula is the square of the sum of the residuals, divided by the sum of p(1 − p) plus λ. This λ is a hyperparameter, again there so that the model does not overfit. Step three: calculate the information gain. These are the steps we use in constructing an XGBoost classifier: create a binary decision tree using the features, calculate the similarity weight, and finally calculate the information gain. So let's work through it. Let's construct the decision tree: say I am considering the salary feature, so I take salary as my node and split on it — and remember, whenever we create decision trees in this particular case, they will be binary decision trees.
Let's say salary splits into 'less than or equal to 50K' and 'greater than 50K' — those two branches you obviously have in a binary split. For credit, where there are three categories, I'll also show you how that further split happens and gets converted into a binary tree. So here you have ≤50K and >50K. Now let's see which residuals fall where. Before the split, the residuals I am training this tree on are: −0.5, 0.5, 0.5, −0.5, 0.5, 0.5, and −0.5. If I make the ≤50K split, which residuals land there? −0.5, then 0.5, then one more 0.5, and finally the last record's −0.5 — so four residuals on that side. The remaining records are >50K: −0.5, 0.5, and 0.5. And where do the residuals come from? From the base model, which by default gives probability 0.5: my data goes through it, and the residual is calculated from this probability and the approval — approval minus probability. So 0 − 0.5 gives −0.5, 1 − 0.5 gives 0.5, 1 − 0.5 gives 0.5. I hope everybody is very clear on this. So
this was the first step — we constructed a binary tree. The second step says: calculate the similarity weight. How? The numerator of the formula is the sum of the residuals, squared — the whole sum, then squared. Let's calculate it for the ≤50K node: I take all its residual values, −0.5 + 0.5 + 0.5 − 0.5, and square that sum. That is divided by — understand what the denominator is — the sum of p(1 − p). And where do we get this probability value? From the base model. For each and every point in the node I compute two things, the probability p and 1 − p, multiply them, and sum over the points. So here I do it four times: 0.5 × (1 − 0.5), plus 0.5 × (1 − 0.5), and so on, once per record in the node. So I hope you have understood till here: the numerator is the squared sum of the residuals, and the denominator is the sum of p multiplied by (1 − p). Now tell me, what are you able to find
out from this calculation? The −0.5 and +0.5 terms cancel each other, so the sum is zero, and since 0 divided by anything is 0, the similarity weight of this node is 0. You may be wondering where the λ value is. We would normally initialize it — it is a hyperparameter I'll talk about — but for now let's take λ = 0 to keep the arithmetic simple. So −0.5 + 0.5 + 0.5 − 0.5 sums to 0 and the similarity weight is 0. (And note: it is not each residual squared first — it is the whole sum, squared.) Now let's calculate the similarity weight of the >50K node: the numerator is (−0.5 + 0.5 + 0.5)², and since there are three points the denominator is p(1 − p) + p(1 − p) + p(1 − p), with λ = 0 so I write nothing extra. Doing the calculation: −0.5 + 0.5 cancels, leaving 0.5, and 0.5² = 0.25; the denominator is 3 × 0.25 = 0.75; so the value is 0.25 / 0.75 = 1/3 ≈ 0.33. So the similarity weight for this node is 0.33. The next step is to calculate the information gain — but before we do, let's do this computation for one more node.
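Step two, as computed above, can be written as a small helper — a sketch of the formula exactly as stated in the session, with λ exposed as a parameter and the base-model probability fixed at 0.5 for every point:

```python
def similarity_weight(residuals, prob=0.5, lam=0.0):
    """Classification similarity weight of a node:
    (sum of residuals)^2 / (sum of p*(1-p) over the node's points + lambda)."""
    numerator = sum(residuals) ** 2
    denominator = len(residuals) * prob * (1 - prob) + lam
    return numerator / denominator

left = [-0.5, 0.5, 0.5, -0.5]   # salary <= 50K
right = [-0.5, 0.5, 0.5]        # salary > 50K

print(similarity_weight(left))   # 0.0  (the residuals cancel)
print(similarity_weight(right))  # 0.25 / 0.75 = 1/3 ≈ 0.333
```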
For this root node also, go ahead and calculate the similarity weight. (Why is the base model probability 0.5? Understand that it is just a dummy model — effectively an if-condition that always returns 0.5.) For the root, all seven residuals are in play: the −0.5 and +0.5 pairs cancel until a single 0.5 is left, so the numerator is 0.5² = 0.25, and the denominator is 7 × 0.25 = 1.75. So the root similarity weight is 0.25 / 1.75 = 1/7, and 1/7 ≈ 0.142, call it 0.14. So I have 0.14 for the root, 0 for the ≤50K node, and 0.33 for the >50K node. Now the third step: we calculate the information gain. The information gain is the sum of the child similarity weights minus the root's: (0 + 0.33) − 0.14. Open your calculator: 0.33 − 0.14 = 0.19. So I get 0.19 as the information gain of this split. You know that features get selected based on information gain; let's say the highest information gain is given by salary. Now we will go further down and do the next split, with the next feature — that is
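Step three is then a single subtraction. A minimal sketch using the similarity weights from the salary split worked out above:

```python
def information_gain(parent_sim, child_sims):
    """Gain of a split: sum of child similarity weights minus the parent's."""
    return sum(child_sims) - parent_sim

# Root = 1/7 ≈ 0.14; children = 0 (<=50K) and 1/3 ≈ 0.33 (>50K).
gain = information_gain(parent_sim=1 / 7, child_sims=[0.0, 1 / 3])
print(round(gain, 2))  # 0.19
```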
which one? Credit. So I take credit here, and again I have to do a binary split. But you may be thinking: Krish, there are three categories here — how are we going to do this split? In this case, what we can do is put two categories, good and normal, on one side, and bad on the other side — then it becomes a binary split again. Now let's see how many data points fall on each side. Follow the path: if salary is ≤50K we come down this branch; if credit is bad, we get one residual, −0.5 — that's the first one. Then, ≤50K with good credit: a 0.5 comes over here. The next ≤50K good record: one more 0.5. The records with salary >50K go down the other branch, so we won't worry about them right now. Then ≤50K with normal credit: that is a −0.5. So those records come to the good/normal side, and only one record goes to the bad side. Then we start the same process again: calculate the similarity weight. For the bad node the numerator is (−0.5)² = 0.25 — there is only one residual, so the squared sum is just its square — and the denominator is 0.5 × (1 − 0.5) = 0.25 for that single data point, so the similarity weight is 0.25 / 0.25 = 1. What about the good/normal node? If you want to compute it, it is again very simple: the 0.5 and −0.5 cancel, leaving 0.5, so the numerator is 0.25; the denominator is 3 × 0.25 = 0.75; so the similarity weight is 1/3 ≈ 0.33. Then I calculate the information gain of this split: I add the children, 1 + 0.33, and subtract the parent — why zero? Because the similarity weight of the node we are splitting, the ≤50K node, is 0. So 1 + 0.33 − 0 = 1.33. Like this, further splits keep happening with different nodes, and we will only be
getting binary splits, but we will be comparing based on information gain which one comes out better. Now let's say I have designed and developed my entire binary decision tree — which is a speciality of XGBoost. Now consider the inferencing part: suppose this record comes in — how do we calculate the output? First of all the record goes to the base model, and the base model gives the probability 0.5. Now, based on this 0.5, how do we calculate the real output? We apply the log odds: log(p / (1 − p)). This formula is applied only in the case of the base model. So here it is log(0.5 / 0.5) = log(1) = 0 — whenever any record goes in, the first term I get is 0. Then plus — why plus? Because the record now goes to the binary decision tree, and whatever value I get from it gets added on. When it goes into the tree, let's see which branch it follows: salary ≤50K first, then credit bad — and there the similarity weight is 1. What we do with that leaf value is pass it through a learning rate parameter: the contribution is the learning rate multiplied by 1, because the similarity weight there is 1. So that is my first tree's contribution, and the α here
is my learning rate — it can be a very small value, like the learning rate parameters we have defined elsewhere. On top of this sum we apply an activation function called sigmoid, since this is a classification problem — and I hope you know what sigmoid is for: based on this sum, the output will be squashed to between 0 and 1. This is how the entire inferencing happens. Similarly, I will keep constructing this kind of decision tree sequentially, so the whole function looks like: α₀ + α₁ × (decision tree 1 output) + α₂ × (decision tree 2 output) + α₃ × (decision tree 3 output) + … + αₙ × (decision tree n output), and that is your final output when you infer on any new record. The reason we call this boosting is that we add each decision tree's output step by step to finally get our output. This is how XGBoost actually works. 'Doesn't credit need to be split further?' Yes — similarly we can split credit with other groupings, say good on one side and bad and normal on the other; whichever grouping gives more information gain will be taken into consideration. And that is how the entire XGBoost classifier works. It is very, very difficult to calculate all of these things by hand across many trees, which is one reason we say XGBoost is a black-box model. 'Is it prone to overfitting?' See, at some stage we also need to perform hyperparameter tuning, which we specifically call pre-pruning.
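The inference just described — log odds of the base probability, plus the learning-rate-scaled leaf value of each tree, squashed through a sigmoid — can be sketched as follows. The learning rate of 0.3 and the single leaf value of 1 are illustrative numbers from the worked example, not a trained model:

```python
import math

def predict_proba(leaf_values, base_prob=0.5, learning_rate=0.3):
    """Boosted prediction: log odds of the base model plus the
    learning-rate-scaled contribution of each tree, then sigmoid."""
    log_odds = math.log(base_prob / (1 - base_prob))  # log(1) = 0 for p = 0.5
    raw = log_odds + sum(learning_rate * v for v in leaf_values)
    return 1 / (1 + math.exp(-raw))  # sigmoid -> value between 0 and 1

# One tree whose leaf value for this record is 1, as in the worked example.
print(round(predict_proba([1.0]), 3))
```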
'And are we combining the same decision tree again and again?' No, no — each decision tree is an independent tree which I create in sequence. After this one I'll create one more decision tree, so finally it looks like this: this is my base model; my data goes to the first decision tree, which I built with binary splits on the records; then we make another decision tree, which will again be a binary tree with its own splits. So the base model gives the value 0, then α₁ multiplied by decision tree 1, then α₂ multiplied by decision tree 2, and we keep on adding more decision trees until this whole thing becomes a very strong learner. That is how we combine all of them, and I hope everybody now understands the XGBoost classifier. Now you may be thinking: how does the regressor work? In the regression problem statement too, the decision tree gets constructed based on the independent features, and again the λ value is a hyperparameter we set with the help of cross-validation. So let's go ahead and discuss the XGBoost regressor — the second algorithm — and how it actually works. 'Is the same fundamental followed in random forest?' No, in random forest it is completely different: there, bagging happens. So, on to the regressor. I'll take an example: I have experience and gap features, and based on those we need to determine the salary — salary is my output feature. The experience values are 2, 2.5, 3, 4, 4.5; the gap column is yes, yes, no, no, yes; and the salaries are around 40K, 41K, 52K, plus some more values
over here: 60K and 62K. Now, the first step: just as in the classifier we created a base model, here also we create a base model first. What output does it give? The average of all the target values. The average of 40, 41, 52, 60, 62 is about 51K, so by default I create a base model that takes any input and just gives the output 51K. That is the first step. Based on this, I calculate my residuals. How? I subtract: 40K − 51K = −11K. And let me change the 41K to 42K, just to make my calculation a little bit easier — so that residual is −9K, and the others are 1, 9, and 11. So my residuals are −11, −9, 1, 9, 11. Then, again, the first tree step: construct the decision tree. Say I use the experience feature, so experience is my node, and I bring all the residuals to it: −11, −9, 1, 9, 11 in the root. How do I split on experience? It is a continuous feature, so I have to do the split the way we handle continuous features, which I have already shown you in the decision tree session. So I do a binary split on experience, and two types of records I may get: one branch for experience ≤ 2 and one for experience > 2.
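The regression base model and the residual column can be sketched in a few lines, using 42K in place of 41K as adjusted in the lecture:

```python
salaries = [40, 42, 52, 60, 62]  # targets, in thousands (K)

# Base model for regression: predict the (rounded) mean of the targets.
base_prediction = round(sum(salaries) / len(salaries))  # 51, as in the session

# Residual = actual salary minus the base prediction.
residuals = [y - base_prediction for y in salaries]
print(base_prediction, residuals)  # 51 [-11, -9, 1, 9, 11]
```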
Now, when I do the split at experience less than or equal to 2, let's see how many values land on each side. The left branch gets only one value, the residual -11, and the right branch gets all the other values: -9, 1, 9, 11. What we do after this is calculate the similarity weight. For regression the formula changes a little: similarity weight = (sum of the residuals)^2 / (number of residuals + lambda). Lambda is again a hyperparameter: the larger its value, the more we penalize the residuals. So this is the formula we are going to apply; previously, with the classifier, we were using probabilities instead. Let's apply it to the first node. With lambda = 0, the left node's similarity weight would be (-11)^2 / (1 + 0) = 121. But just think what happens if we take lambda = 1: we directly penalize the similarity weight by adding one to the denominator. So let's do that. With lambda = 1, the left node gives 121 / (1 + 1) = 60.5. So 60.5 is my similarity weight for that node. Similarly, I will now go ahead and compute the similarity weight for the next
one: for the right node it becomes (-9 + 1 + 9 + 11)^2 / (4 + 1). The -9 and the 9 cancel, leaving 12, and 12 squared is 144, so 144 / 5 = 28.8. The similarity weight for the right node is 28.8. Similarly I can calculate the similarity weight for the root node at the top: (-11 - 9 + 1 + 9 + 11)^2 / (5 + 1). Almost everything cancels, leaving 1^2 / 6 = 1/6; and since the numerator gets squared anyway, the sign would not matter. So 1/6 is the similarity weight for the root. Now, finally, the information gain we need to compute is very simple: 60.5 + 28.8 - 1/6, which comes out to about 89.13. You don't have to worry about the calculation; the library does all of this automatically. Now see, the decision tree can be split further in any number of ways. The next split we can try is experience less than or equal to 2.5 versus greater than 2.5. If this gives a better information gain, the split will happen this way; whichever candidate gives the better gain is the one that is used. With the 2.5 split, -11 and -9 go to one side and 1, 9, 11 go to the other, because those two records definitely fall at or below 2.5 and the other three definitely fall above it. Now if I try to calculate the similarity weight for the left node, it is nothing but (-11 - 9)^2 / (2 + 1), right?
In this particular case it is (-20)^2 / 3: 20 into 20 is 400, so 400 / 3, and if I use a calculator, 400 / 3 is about 133.33. So the similarity weight for that node is 133.33. Similarly I can compute the other node: (1 + 9 + 11)^2 / (3 + 1). 1 + 9 + 11 is 21, and 21 squared is 441, so 441 / 4 = 110.25. And the root node is the same as before, 1/6. So, finally, the information gain is 133.33 + 110.25 - 1/6, and obviously this value is greater than the 89.13 we got for the previous split, so we definitely use this split, which is better than the previous one. Let's say this split is the one finally chosen; now how do we produce the output? I hope everybody is able to follow. Suppose I want to do the inferencing and a record comes in. First of all, any record goes to the base model, whose output is 51, and to that we add alpha-1 (the learning rate, which is 1 here) times the output of the tree. If the record goes down the left route, which has -11 and -9, the average of both those numbers is taken: (-11 - 9) / 2 = -10, so -10 gets multiplied by alpha-1 and added. If it goes down the right route, then the average (1 + 9 + 11) / 3 = 21 / 3 = 7 is taken, so 7 gets added instead.
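The whole worked example, both candidate splits and the inference step, can be sketched in a few lines (assuming lambda = 1 and learning rate alpha = 1, as above; the helper names here are mine for illustration, not XGBoost's actual API):

```python
def similarity_weight(residuals, lam=1.0):
    """XGBoost similarity score for a regression node:
    (sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

def gain(left, right):
    """Information gain of a split = left + right - root similarity."""
    return (similarity_weight(left) + similarity_weight(right)
            - similarity_weight(left + right))

# Candidate 1: experience <= 2    ->  60.5  +  28.8  - 1/6
# Candidate 2: experience <= 2.5  -> 133.33 + 110.25 - 1/6
g1 = gain([-11], [-9, 1, 9, 11])
g2 = gain([-11, -9], [1, 9, 11])
print(round(g1, 2), round(g2, 2))     # 89.13 243.42

# Candidate 2 wins, so the tree splits at 2.5. Inference adds the base
# prediction and alpha times the leaf's average residual.
def predict(experience, base=51.0, alpha=1.0):
    leaf = (-11 - 9) / 2 if experience <= 2.5 else (1 + 9 + 11) / 3
    return base + alpha * leaf

print(predict(2), predict(4))         # 41.0 58.0
```

So a record with 2 years of experience moves from the base prediction of 51 down to 41, and one with 4 years moves up to 58, exactly the leaf averages applied on top of the base model.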
Similarly, everything we have done so far is with respect to decision tree 1. In the same way we construct decision tree 2 separately with its alpha-2, then alpha-3 with decision tree 3, and so on until alpha-n and decision tree n; once you sum all of these, that is your final output in a regression tree. So in this particular case you are just playing with the parameters and using the same machinery in a slightly different way to compute everything. Is everybody clear? But again, it is a black-box model: you cannot really visualize all of this. Now let's go to the third algorithm, which is called SVM. SVM is in some ways like logistic regression. Suppose I have data points like this; with logistic regression we try to create a best-fit line and divide the points based on that line. In SVM we not only create the best-fit line but also two additional lines called marginal planes: the best-fit line is the hyperplane, these are the marginal planes, and whichever hyperplane has the maximum distance between its marginal planes will divide the points most efficiently. But usually, in a normal scenario, whenever we talk about the hyperplane or the marginal planes there will be a lot of overlapping points: I may have one point here and another point of the other class overlapping it, so it is very difficult to get an exact straight marginal plane that splits the points perfectly. Still, this margin should be as large as possible, because we can create any number of best-fit lines, and it is the marginal planes that tell us which one to prefer. Now, when there is no overlap at all, what do we call this kind of plane? This kind of
plane is called a hard marginal plane. Similarly, if some points overlap (suppose some yellow points also get overlapped over here, so there are some kinds of errors) then for that particular case we say soft marginal plane, because there we will be able to see that errors are present. In SVM what we focus on is creating these marginal planes with the maximum distance, and even though there are some errors, we handle them by providing a hyperparameter. So how do we go ahead and create all these marginal planes? It's quite simple. Just imagine it this way: initially, consider that this is my best-fit line. How do we give this best-fit line as an equation? We say y = mx + c. (A hard margin is basically impossible; on a normal dataset you will obviously not be able to get it, so we definitely go ahead with creating a soft marginal plane.) Now, in y = mx + c, what do m and c indicate? m is nothing but the slope and c indicates the intercept. Can I say that ax + by + c = 0 is also the equation of a straight line, and that both equations are the same? I will say they are equal; let me prove it. If I take ax + by + c = 0 and try to find y, it is nothing but y = (-a/b)x - c/b, so in this particular case my m value is -a/b and my intercept is -c/b. Both equations are essentially the same. So let's consider that this is my equation, and whenever I say y = mx + c, can
I also write it as y = w1·x1 + w2·x2 + ... + b? It is the same thing: we can write y = w^T x + b, the same equation in a different notation. Now let's say the slope of this line is in the negative direction; consider the slope to be -1, so the line runs downward through the origin. I am just trying to demonstrate what happens on either side of a line with a negative slope. Suppose (-4, 0) is one of my points, with axes x1 and x2, and this particular line is given by this equation. If I really want to compute the value of w^T x + b for this point based on this line, how will I compute it? First, the intercept: the line is passing through the origin, so can I say my b will be 0? Obviously I can assume b is 0. Next, w: in this particular case w is the -1 that I initialized as the slope. So the matrix multiplication is w^T written as a row times x = (-4, 0) written as a column, and if I do this multiplication, the value I get is +4. This is a positive value. Now understand: since this is positive, any point below this line, if I try to calculate its value the same way, will always come out positive. Yes or no? Similarly, if I
consider one point above the line, say (4, 4): if I calculate the value for (4, 4), will I get a positive value or a negative value? Just try to calculate, using the same equation: the slope is -1, the intercept is 0, and the point is (4, 4), so I get -4 + 0 = -4. This is a negative value. So any point above this plane, if I try to calculate its value, will always be negative. So what two things are you able to get? Positive on one side and negative on the other. You can consider everything on one side as one category and everything on the other side as another category; at least these two things you can conclude. I hope everybody is able to understand this. So this is my one category and this is my other category, which means I can definitely use a plane to split these points. Now let's go ahead and see how the marginal planes get created, and what cost function we use to make sure that the marginal plane will definitely work, because that part is where it becomes difficult. So let's consider an example. Suppose I have two varieties of points: one set of points like this, and the other set somewhere here. Let's say I am directly using a good number of well-separated points so that I can split them cleanly; I will explain what I am actually trying to prove. Obviously this is my best-fit line that splits them, and apart from that, what I will do is I'll also
create the marginal planes. So in order to create a marginal plane, let me use a different color: this is my one marginal plane, and remember it passes through the nearest point on its side; we construct one like this on each side of the hyperplane. I have already told you, guys, this line can be written as w^T x + b = 0; I can definitely say this because ax + by + c = 0 is the same equation, so this I don't have to prove. I hope everybody is clear with this. Now let's represent the marginal lines with equations as well. For this lower line, what value will come on the right-hand side, positive or negative? From this line, any point on that side gives a negative value, so let's label it w^T x + b = -1, just to read it as the negative side. And this upper line will be w^T x + b = +1, because we have already discussed that from that side the computed value is always going to be positive. Here I should really say k: strictly it is -k and +k, but in many articles and research papers you will see it written as -1 and +1, so let's go ahead and write -1 and +1 here too. Now my aim is to increase this distance between the two marginal planes: if I can increase this distance, that basically means my model is performing well. So let's say I want to find this distance first of all. I write w^T x + b = 1 for one plane and w^T x + b = -1 for the other, and what I am going to do is do the computation and subtract one from the other. So here, obviously,
this x will be my x1 on one plane and my x2 on the other, because those are the points the two planes pass through. Subtracting, the b's cancel and I can write w^T (x1 - x2) = 2. So from here we can definitely read off two different things. This w^T (x1 - x2) is nothing but the difference between this plane and that plane. Now, always understand that whenever we consider any vector, it also has something called a magnitude; so to keep only the direction I divide both sides by the magnitude ||w||. I am dividing both sides by this magnitude of w because here we just care about the vectors, not the scale. When I write it like this, the distance between the planes comes out as 2 / ||w||. Now, what is our aim? Can I say our aim is to maximize 2 / ||w||, guys, yes or no? Yes: by updating the values of w and b, I need to maximize this. If I maximize it, that basically means my marginal plane will become bigger. Is everybody clear with this? Now, along with this, I can write a constraint: my output y_i depends on two conditions. y_i is +1 when w^T x + b is greater than or equal to 1, and y_i is -1 when w^T x + b is less than or equal to -1. Now what does this basically mean? See: whenever I compute w^T x + b and it is greater than or equal to 1, I am obviously going to get +1, and when w^T x + b is less than or
equal to -1, I am always going to get the output as -1. I hope that is clear; that is the reason I have written it like this. So these two conditions we have already discussed: we want to maximize the marginal plane, which is 2 / ||w||, subject to y_i = +1 when w^T x + b >= 1 and y_i = -1 when w^T x + b <= -1. Everybody clear with this? Now, on top of it, we can add one more very important point. Instead of writing "such that" with two separate conditions, we can also say that our major aim is: if I multiply y_i by (w^T x_i + b), this product will always be greater than or equal to 1 for correctly classified points. Why? Understand: if y_i is -1 and the point is correct, then w^T x_i + b <= -1, and minus multiplied by minus is obviously going to be greater than or equal to +1; similarly for the +1 case it is again greater than or equal to 1. So I can definitely say that multiplying y_i by this term always gives a positive value >= 1; it is just a more compact representation of the same constraint. Now, what is the cost function? I can write it as: maximize over (w, b) the quantity 2 / ||w||. I can also write it as: minimize over (w, b) the inverse, ||w|| / 2. Are these both the same or not? They are equivalent. And why do we specifically write a minimization? Because in machine learning algorithms we are always trying to minimize something: during optimization we are continuously updating the weights w and b, so we can
definitely write it like this. So here my main target is to minimize ||w|| / 2 by changing w and b, subject to y_i (w^T x_i + b) >= 1. This is fine till here; I think everybody has got it. This is our aim, and now I am going to add two more parameters to this optimizer: one is C, and the other is a summation from i = 1 to n of xi_i, the slack terms. First I'll tell you what C is. See, if I have this specific dataset and some of my points land over here, on the wrong side, is that a right prediction or a wrong prediction? Obviously a wrong prediction. If some of my points are somewhere there, wrong again, incorrect. So the C value basically says how many errors we can have: if it says fine, we can have six or seven errors, then even while using the marginal plane, that many errors are allowed. That is what C specifies. The xi_i term says: take the summation of the distances of the wrong points. And how do we calculate that distance? Suppose this is a wrong point: I will try to calculate its distance from the marginal plane it violated, from here to here, and add it to the sum; similarly for the green point another distance gets added, from here to here; and we do that summation for every violating point. So we are telling the optimizer: if you are not able to fit the data properly, apply these two hyperparameters and accept that this many errors are also there; it is well and good, no problem, we will go ahead with that. Try to do the summation
of those slack distances and, based on that, construct the best-fit line along with the marginal planes, even though there are some errors here or there; with that we are good to go. One more thing exists, which is called SVR (support vector regression). In SVR only one thing gets changed in this formulation: just this constraint value. Everything else remains the same. I want you all to explore which value changes and let me know; this will be one assignment for you. If you change that particular value, the formulation becomes SVR, so just try to explore it, find out, and let me know. So overall, did you like the entire session, everyone? There is one more thing, called the SVM kernel; we say it as SVM kernel. Now, in an SVM kernel, what happens? Suppose I have data points arranged like this, one class forming an inner cluster and the other class surrounding it; we obviously cannot use a straight line to divide them. So what we do is convert the two dimensions into three dimensions and push the points apart: one set of points will go up and the white points will go down, and then we can basically use a plane to split them. I have uploaded a video about exactly that, and I have also shown you practically how to do it; that is the reason I created that specific video, so you can definitely have a look at it. So great, this was it from my side. I hope you liked this session. Thank you everyone, have a great day, keep on rocking, keep on learning, and never give up.
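The kernel idea mentioned at the end can be sketched with the classic 2-D to 3-D lift; the specific points and the separating plane z = 4 below are made-up illustrations, not values from the video:

```python
import numpy as np

# An inner cluster (one class) surrounded by an outer ring (the other class):
# no straight line separates them in 2-D.
inner = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, -0.5], [-0.5, 0.5]])
outer = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, 0.0], [0.0, -3.0], [2.1, 2.1]])

def lift(X):
    """Map 2-D points to 3-D by adding x1^2 + x2^2 as a third feature."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

# After the lift, a horizontal plane such as z = 4 cleanly separates the
# classes: the inner cluster stays low, the outer ring is pushed up.
print(lift(inner)[:, 2].max() < 4 < lift(outer)[:, 2].min())   # True
```

This is exactly the trick an RBF or polynomial kernel performs implicitly: it makes the data linearly separable in a higher-dimensional space without ever computing the lifted coordinates explicitly.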
These subtitles were extracted using the Free YouTube Subtitle Downloader by LunaNotes.