Comprehensive Guide to Pandas for Data Analysis in Python

Introduction to Pandas

Explanation of why pandas is essential beyond numpy for complex datasets
Illustration using house price dataset with multiple feature columns

Pandas vs NumPy

Numpy arrays lack labeled columns, making data interpretation difficult with many features
Pandas provides a tabular, Excel-like data structure with labeled rows and columns

Key Features of Pandas

Easy importing of various data sources (CSV, Excel, SQL databases)
Powerful data cleaning capabilities (handling missing and invalid values)
Size mutability for adding/removing rows and columns
Data reshaping, pivoting, and efficient extraction
Built-in statistical analysis functions

Prerequisites

Basic programming knowledge in Python or any other language
Understanding of fundamental statistical concepts (mean, median, mode, variance, standard deviation)

Core Data Structures

Series

One-dimensional labeled array
Holds homogeneous data types
Size immutable: operations return new Series objects
Supports index customization and various indexing methods (positional and label-based)
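
The Series properties listed above can be checked with a minimal sketch like the one below. The values are made up for illustration. Note that in this sketch the integers themselves are not rewritten as text when a string is mixed in; what changes is the dtype of the Series, which falls back to the generic object type.

```python
import pandas as pd

# All integers: pandas picks a single integer dtype for the whole Series.
s = pd.Series([10, 20, 30, 40, 50])
print(s.dtype)        # int64

# One string among the integers: the dtype falls back to the generic object type.
mixed = pd.Series([10, 20, 30, "ABC"])
print(mixed.dtype)    # object

# drop() returns a new Series; the original still holds all five values.
shorter = s.drop(2)   # remove the row labeled 2
print(len(s), len(shorter))  # 5 4
```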

DataFrame

Two-dimensional, size mutable, heterogeneous data structure
Can represent entire datasets with multiple columns
Supports sophisticated selection via .iloc (positional) and .loc (label-based)

For a deeper understanding, you can refer to Understanding Pandas Series and Data Structures in Python.

Importing Pandas and Creating Data Structures

Installation via pip install pandas
Import with import pandas as pd
Creating Series from lists and dictionaries with examples
Modifying series name and indexes
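
As a rough sketch of these steps, the snippet below builds one Series from a list and one from a dictionary, then sets a name and a custom index. The fruit labels and numbers are invented placeholders, not the values used in the video.

```python
import pandas as pd

# From a list: pandas assigns the default index 0, 1, 2, ...
s = pd.Series([10, 20, 30, 40, 50])
s.name = "calories"                                          # name the Series
s.index = ["apple", "banana", "grapes", "orange", "mango"]   # made-up labels

# From a dictionary: keys become the index, values become the data.
fruit_protein = {"apple": 0.3, "banana": 1.1, "grapes": 0.7}  # made-up values
s_2 = pd.Series(fruit_protein, name="protein")

print(s)
print(s_2)
```

Replacing the index requires a list with exactly as many labels as the Series has values.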

Indexing and Selection in Series

Basic slicing and indexing syntax
.iloc for integer-based positional indexing
.loc for label-based indexing
Difference in slice inclusivity between .iloc (exclusive end) and .loc (inclusive end)
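
The difference in slice inclusivity is easiest to see side by side. This is a small sketch with invented labels and values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50],
              index=["apple", "banana", "grapes", "orange", "mango"])

# .iloc slices by position and excludes the end position.
by_position = s.iloc[1:3]
print(by_position.tolist())   # [20, 30]

# .loc slices by label and includes the end label.
by_label = s.loc["banana":"orange"]
print(by_label.tolist())      # [20, 30, 40]

# A list inside the brackets picks several positions at once.
picked = s.iloc[[1, 3, 4]]
print(picked.tolist())        # [20, 40, 50]
```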

Conditional Selection and Logical Operations

Filtering Series based on conditions
Combining conditions using and, or, not logic (written in pandas as the element-wise operators &, |, ~)
Practical filtering examples
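
A minimal sketch of conditional selection on a Series follows. The protein values are invented for illustration. On a Series the conditions are combined with the symbols &, | and ~, and each condition needs its own parentheses.

```python
import pandas as pd

# Made-up grams of protein per fruit.
s_2 = pd.Series({"apple": 0.3, "banana": 1.1, "mango": 0.8,
                 "kiwi": 2.0, "guava": 2.6}, name="protein")

mask = s_2 > 1                 # boolean Series of True/False values
high = s_2[mask]               # keep only the rows where the mask is True

in_range = s_2[(s_2 > 0.5) & (s_2 <= 2)]   # and: both conditions must hold
either = s_2[(s_2 > 0.5) | (s_2 < 2)]      # or: one condition is enough
not_high = s_2[~(s_2 > 1)]                 # not: flips True and False

print(high.index.tolist())      # ['banana', 'kiwi', 'guava']
print(in_range.index.tolist())  # ['banana', 'mango', 'kiwi']
print(len(either))              # 5
print(not_high.index.tolist())  # ['apple', 'mango']
```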

DataFrame Operations

Creating DataFrames from dictionaries
Viewing data subsets: .head(), .tail()
Selecting rows and columns using .iloc and .loc
Adding, dropping columns with inplace parameter
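
The operations above might look like this in practice. The employee records, the column capitalization and the Bonus column are all illustrative assumptions, not the data from the video.

```python
import pandas as pd

data = {  # made-up records
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Fay"],
    "Age": [25, 32, 41, 29, 38, 45],
    "Department": ["HR", "IT", "IT", "Finance", "HR", "Finance"],
    "Salary": [40000, 55000, 62000, 48000, 51000, 70000],
}
df = pd.DataFrame(data)

first_two = df.head(2)     # first two rows
last_three = df.tail(3)    # last three rows

# .iloc works on positions: rows 1 and 2 (the end position 3 is excluded).
rows_by_position = df.iloc[1:3]
# .loc works on labels: rows 1, 2 and 3 (the end label is included).
rows_by_label = df.loc[1:3, ["Age", "Department"]]
# Rows and columns are separated by a comma; here the first two columns.
first_two_columns = df.iloc[1:3, 0:2]

# Adding a column is a plain assignment (Bonus is a hypothetical column).
df["Bonus"] = df["Salary"] * 0.10
print(df.head())
```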

Data Exploration Methods

Checking data shape (.shape), info (.info()), and description (.describe())
Viewing unique values and value counts for categorical columns
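
A short sketch of these exploration calls, again on invented employee data:

```python
import pandas as pd

df = pd.DataFrame({  # made-up records
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Fay"],
    "Age": [25, 32, 41, 29, 38, 45],
    "Department": ["HR", "IT", "IT", "Finance", "HR", "Finance"],
    "Salary": [40000, 55000, 62000, 48000, 51000, 70000],
})

print(df.shape)            # (rows, columns); the index is not counted as a column
df.info()                  # column names, non-null counts and dtypes
summary = df.describe()    # count, mean, std, min, quartiles, max for numeric columns
print(summary)

departments = df["Department"].unique()     # distinct values in a column
counts = df["Department"].value_counts()    # how often each value occurs
print(counts)
```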

Broadcasting with Pandas

Performing arithmetic operations on entire columns with scalars
Example: Increasing all salaries by a fixed amount
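
Broadcasting means the scalar is applied to every row without an explicit loop. A minimal sketch, assuming a raise of 5000 and made-up salaries:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ben", "Chen"],
                   "Salary": [40000, 55000, 62000]})  # made-up values

# The scalar 5000 is added to every element of the Salary column.
df["Salary"] = df["Salary"] + 5000
print(df["Salary"].tolist())   # [45000, 60000, 67000]
```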

Data Cleaning Techniques

Handling Missing Values

Detecting missing data with .isnull() and .sum()
Removing missing values with .dropna() and parameters (how='any' or 'all')
Filling missing values with .fillna(), using constants, mean, median, forward fill (method='ffill'), backward fill (method='bfill')
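
The sketch below walks through detection, removal and filling on a tiny made-up table. One assumption to flag: recent pandas releases deprecate the method argument of .fillna(), so this example uses the equivalent .ffill() and .bfill() methods instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({          # made-up values with two gaps
    "Age": [25, np.nan, 41, 29],
    "Salary": [40000, 55000, np.nan, 48000],
})

missing_per_column = df.isnull().sum()   # number of missing values per column

dropped_any = df.dropna(how="any")   # drop rows with at least one missing value
dropped_all = df.dropna(how="all")   # drop rows only if every value is missing

filled_const = df.fillna(0)              # fill with a constant
filled_mean = df.fillna(df.mean())       # fill each column with its own mean
filled_median = df.fillna(df.median())   # or with its median
filled_forward = df.ffill()              # copy the previous row's value down
filled_backward = df.bfill()             # copy the next row's value up

print(missing_per_column)
print(filled_forward)
```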

Handling Duplicate Data

Finding duplicates with .duplicated() and the keep parameter
Removing duplicates with .drop_duplicates()
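
As an illustration of the keep parameter, here is a small sketch on invented rows where each record appears twice:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ben", "Asha", "Ben"],
                   "Department": ["HR", "IT", "HR", "IT"]})  # made-up rows

dup_first = df.duplicated()             # keep="first": later copies are flagged
dup_last = df.duplicated(keep="last")   # earlier copies are flagged instead
dup_all = df.duplicated(keep=False)     # every copy is flagged

deduped = df.drop_duplicates()          # keeps the first occurrence by default

print(dup_first.tolist())   # [False, False, True, True]
print(dup_last.tolist())    # [True, True, False, False]
print(dup_all.tolist())     # [True, True, True, True]
print(len(deduped))         # 2
```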

Handling Invalid Data

Using .apply(lambda x: ...) for conditional transformations
Example of adjusting salary values exceeding a threshold
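
One possible shape for such a transformation is sketched below. The threshold of 100000 and the divide-by-10 rule are assumptions chosen for illustration (mirroring the kind of typing error described for the age column); the right fix for real data depends on how the invalid values arose.

```python
import pandas as pd

df = pd.DataFrame({"Salary": [40000, 550000, 62000]})  # 550000 looks like an extra zero

limit = 100000  # hypothetical upper bound for a plausible salary

# Values above the limit are divided by 10; everything else is left as is.
df["Salary"] = df["Salary"].apply(lambda x: x // 10 if x > limit else x)
print(df["Salary"].tolist())   # [40000, 55000, 62000]
```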

For more on data inspection, cleaning, and transformation, see Comprehensive Guide to Python Pandas: Data Inspection, Cleaning, and Transformation.

String Operations

Using .str.split() to split columns with string data
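
A minimal sketch of splitting one string column into two. The full names and the new column names First and Last are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha Rao", "Ben Lee"]})  # made-up names

# expand=True returns the pieces as separate columns instead of lists.
parts = df["Name"].str.split(" ", expand=True)
df["First"] = parts[0]
df["Last"] = parts[1]
print(df)
```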

Advanced Lambda and Apply Usage

Applying user-defined functions and lambda expressions to columns for transformations

Joining and Merging DataFrames

Concepts of left join, right join, inner join, outer join
Concatenating DataFrames using pd.concat() along rows or columns
Merging DataFrames with pd.merge() on common columns
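
The four join types differ only in which keys survive the merge. The sketch below uses two small invented tables that share an emp_id column (a hypothetical name):

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chen"]})     # made-up
salaries = pd.DataFrame({"emp_id": [2, 3, 4],
                         "salary": [55000, 62000, 48000]})      # made-up

inner = pd.merge(employees, salaries, on="emp_id", how="inner")  # keys in both
left = pd.merge(employees, salaries, on="emp_id", how="left")    # all left keys
right = pd.merge(employees, salaries, on="emp_id", how="right")  # all right keys
outer = pd.merge(employees, salaries, on="emp_id", how="outer")  # keys from either

# concat stacks frames along rows (axis=0) or places them side by side (axis=1).
stacked = pd.concat([employees, employees], axis=0, ignore_index=True)
side_by_side = pd.concat([employees, salaries], axis=1)

print(inner)
print(outer)
```

Keys that exist on only one side get NaN in the columns that come from the other table.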

To master these techniques, consider Mastering Pandas DataFrames: A Comprehensive Guide.

Importing Real Datasets

Reading CSV or Excel files with pd.read_csv() or pd.read_excel()
Adapting to environment limitations (e.g., Google Colab file uploads)
Converting string date columns to datetime objects with pd.to_datetime()
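
To keep the example self-contained, the sketch below reads CSV text from memory with io.StringIO as a stand-in for a real file; with an actual file you would pass its path to pd.read_csv() instead. The column names and values are invented.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (made-up contents).
csv_text = "date,price\n2014-05-02,313000\n2014-05-03,2384000\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)                          # date is read as text at first

df["date"] = pd.to_datetime(df["date"])   # convert the text to datetime values
print(df.dtypes)
print(df["date"].dt.year.tolist())        # [2014, 2014]
```

Once the column holds datetime values, the .dt accessor exposes parts such as year, month and day.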

Best Practices and Final Notes

Emphasis on hands-on practice with the shared notebook
Encouragement to explore datasets from Kaggle for further learning
Summary of pandas as an essential tool for data analysis and preprocessing

For a thorough foundational overview, see Python Pandas Basics: A Comprehensive Guide for Data Analysis.

This tutorial equips learners with both conceptual understanding and practical skills to efficiently manipulate and analyze data using pandas in Python, building a solid foundation for data science projects.

So hello guys, welcome to this full course video on pandas. In the last video, we covered NumPy. If you

guys haven't checked it out, you can go ahead and watch that video. I have provided the link in the description as

well as in the i button. So before we move ahead with pandas, I want you guys to understand the basic difference or

the basic requirement why we need pandas. All right. So basically we saw in numpy how we were able to perform

various numerical as well as arithmetic operations on our arrays, that is, basically data given to us in the form

of arrays right so why exactly do we need pandas or what is pandas all right so in order to understand that I am

going to go ahead and show you guys an excel sheet so as you can see over here I have some data so once you guys get

familiar with data analysis you guys will be able to tell me that this is something known as house price

prediction data set but we need not get into the details of that right now. So for now all you have to understand is

that there are different features, or different columns, representing the features of a house, right? So we have

the square feet that is the area of the house we have the city we have the state we have the condition we have the

waterfront which basically is telling us whether there is a waterfront. So basically 0 and one is how our data is

being represented. Okay. So the number of floors, the price. All right. So we have all these features in our data. All

right. Now if I were to represent the same thing in a numpy array, I would have my array something like this. Okay.

So say I'm just representing two houses: the first house has two bedrooms and two bathrooms, okay, and the other house

has just one bedroom and one bathroom. Okay, this is what my numpy array would look like. Okay, this is

basically my house one, my house two, and this is my bedrooms and this is my bathrooms. Okay, so do you guys agree

with me or not? Would our numpy array look something like this or not? Except we wouldn't have this information with

us. You would just have an array that would look like this. So this is the major issue that comes in numpy when we

are dealing with data. Okay. So since we only have two columns over here, maybe we can try and remember that the first

one is bedrooms, the second one is bathrooms. But in case we have 12 or say 50 columns, how are we going to remember

what each and every column is representing? So this is where pandas steps in. Pandas is nothing but a

tabular format of data. Just how we saw in Excel. It gives us a similar representation of data in our notebook.
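
To make the contrast concrete, here is a minimal sketch of the same two houses as a NumPy array and as a pandas DataFrame. The column names bedrooms and bathrooms and the row labels house_1 and house_2 are illustrative choices based on the example described above.

```python
import numpy as np
import pandas as pd

# Two houses as rows of [bedrooms, bathrooms]; the array carries no labels.
houses = np.array([[2, 2],
                   [1, 1]])

# The same values with labeled columns and rows (labels are illustrative).
houses_df = pd.DataFrame(houses,
                         columns=["bedrooms", "bathrooms"],
                         index=["house_1", "house_2"])

print(houses)       # bare numbers only
print(houses_df)    # same numbers with column and row labels
print(houses_df.loc["house_2", "bedrooms"])   # look a value up by its labels
```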

All right. So before we go ahead and see the demonstration with pandas, first we'll understand why pandas. So here are

some of the important features that pandas provides us. So it gets very easy to import our data sets. It could be a

CSV file. It could be an excel sheet. Okay? It could even be your SQL database. So you can import any sort of

data set in your pandas environment. I'll be using Google Colab in this video, but you can use Anaconda or even

your VS code. All right. Then comes your data cleaning. So this is a very important step. Now what exactly does

data cleaning mean? So when we say data cleaning, we are talking about missing values and majorly the invalid values.

Okay. So when we talk about missing values, okay, let me go back to my Excel sheet. Okay. So this is the data that I

have, right? So say, in case of this particular value over here, 10500: what if I had a blank cell over here? That

would represent my missing value. Okay. So that is basically what a missing value means. So if we have a particular data

set and some of the values in our data set are missing, then whenever we're dealing with that kind of scenario we will be

dealing with those missing values. Okay. We will be seeing how we can do that with pandas. And the other case is

invalid values. So say I have an age column. Okay. Now we know our age can normally range from 0 to 100. Taking the

worst case scenario over here we have 100 or say 120. Okay, it cannot go above that and it cannot go below zero. All

right, so this is basically a standard age range. But what if I have a value say 200 in my age column. Now this is

basically an invalid value, right? I cannot have an age that is 200 if it's for humans especially. Okay. So I would

understand that there is some typing error or some mathematical error, and one common way of dealing with

this invalid value would be dividing by 10, right? So 20 is a valid age but 200 is not. Okay. So that is basically

what invalid value means. Size mutability. So we can easily add and delete columns or rows. Okay. Then we

can reshape our data set. We can pivot our data set. We can efficiently manipulate and extract from our data

set. And most important step is statistical analysis. Okay. So we can easily have all the statistical analysis

of our data with just a few lines of code. Okay. Now moving on to the prerequisites to learn pandas. So you

need to know a programming language, preferably Python, but if you know any other programming language instead, it's

not an issue. Okay. And the next one is maths. That is, basically you need to know some stats or inferential

statistics. You need to have a basic idea of what standard deviation is, what variance is, and what mean, median and

mode are. So basically you need to have a basic idea of all those terms. So moving on to pandas, we have two primary data

structures in pandas. First is Series, second is DataFrame. Okay. Now what is a data structure? A data structure is

basically a collection of data types that provide the best way of organizing the items or the values in terms of

memory usage. Okay. So series is going to be our one-dimensional uh data structure whereas data frame is a

two-dimensional data structure. Okay. Now series basically represents only a single column from the entire data set

whereas a data frame is able to represent the entire data set. Okay, that's the basic difference. So let's go

ahead. Okay. So series, as we already discussed, is one-dimensional. Why is it one-dimensional? Because a series would

look something like this. So say I have a price column and these values would be represented by an index value by

default. Okay. And remember, our default index is always and always going to start from zero. So this is our series. Okay. So

over here you cannot say this is a two-dimensional or it has two columns. This is the index and this is our one

and only column. Okay. So this is why series is a one-dimensional data structure because it can accommodate

only one column and you can extend it to as many rows as you want. Okay. So it is a one-dimensional labeled homogeneous

array. A homogeneous array means it can accommodate values of only one data type, for example only integers or only strings. Now it's not that if I

were to add, say, a string ABC, I would get an error. Rather, if I were to add a value like this, my whole series would

switch to one common data type. Okay. So this is why series is called a homogeneous array: if I add a single string value

into my series, my values are no longer stored as an integer column, and my data type of this series would be

object. Okay. If I were to have only integers, my data type would be integer. Okay. So this is basically what it means

and a very important property is that it is size immutable. Now what this means is I cannot delete a particular row from

here in place. When I perform the remove operation, basically I am going to be returned a new series,

that is, my original series is not going to get manipulated; rather, I'm going to be returned a new series where

my values would look something like this. Okay, so this is basically what size immutability means. And even if we

try to append a new element, we would again be returned a new series rather than manipulating or changing the

original series itself. Okay. Now moving on to data frames. It is a two-dimensional data structure because we

can have multiple columns. So we can have a price column, we can have number of bedrooms, number of bathrooms and so

on. Okay, we can have unlimited or you can say n number of columns and n number of rows. So basically it can be extended

in this direction as well as this direction. Whereas series was able to extend itself only in this direction.

Okay. And this is going to be a size-mutable tabular structure, and it can be of heterogeneous type. Heterogeneous means

we can have, say, integers in this column and all string values in this column. Okay. So we can have

multiple data types supported in data frames, and size mutable means you're easily able to add and delete the

rows and columns. Okay. So I hope the difference between series and data frame is clear. So let's go ahead and look at

the implementation in our notebook. But before we do that, I hope you guys remember you are supposed to install

pandas using pip install command. Let me just show you guys quickly. So you can go to your terminal or any command

prompt and all you have to do is type pip install pandas. When you run it, you're

going to have a few lines of output printed for you. And then finally, you're going to get a message that your requirement is satisfied.

And then after that, you can proceed with import pandas as pd. Okay. So again

this is an alias; just like we used an alias for numpy, we are going to use one for pandas, which is pd. Okay. So I'm

going to run it. And if you guys remember I told you guys in the numpy video that numpy is going to be used

throughout your data analysis process. So right now we are going to focus only on the pandas features. But when we

come to real data sets which I'll be showing you I'll be importing a real data set also using a CSV or an Excel

file and then I'm going to show you how numpy and pandas both are going to be used to perform various operations on

that data set. All right but right now we're going to be only focusing on pandas understand the features and the

implementation. Okay. So as you can see, I have successfully imported pandas, and the green tick over here basically

represents that the operation was successful. Okay. So now what I'm going to do is I'm going to create a basic

series using pd.Series, and I'm going to do so using a list. Okay. Now if I go ahead and print it, you can see my

data type is int64 because all of these are integers. And by default I got an index value of 0, 1, 2, 3 and 4. Okay. So

as I told you, my default index is always and always going to start from zero. And now if I were to add another element, say

a string value ABC, you can see now my data type is object, and all of these that were integers earlier are now

stored as generic objects alongside the string. Okay. So I'm going to remove this for now because I want to show you guys some operations

that we can perform on series. So I can go ahead and check the data type of my series. It's int64. Then I can go ahead

and check the values of my series. Okay, so you do not need any parentheses. So these are the values of my series

data structure. So you can use s.index to check the index of your series. So you can see over here my starting value

is zero and my stop value is five and my step value is 1. All right. So basically I have my index from 0 to four. So here

you can get an idea about your index as well. Okay. And lastly you have s.name. So it's not printing anything

because it's None. So if I print s.name you can see it's None because we have not given any sort of name or

any sort of value to our column. Okay. In order to assign a name to our column we can just say s.name and we can

assign a value to it; say those are just numbers, so I'm going to write numbers. And if I print my series you

can see this is what my series is going to look like. All right. So that was it about these basic attributes. Now we're

going to look at an important topic called indexing. Okay. So you can do indexing using square brackets like we

normally do for lists and as we have also done for numpy arrays. So similarly you can do it for your series in pandas.

So if I want the first element I'm going to say s[0] and I get the first element, that is 10. Okay. So if I want multiple

elements, all I'm going to do is s[0:2]. And you can see my ending index, or my ending value, is not included in

my output. So let me just comment it over here. And then we have the step value. So we have done this indexing in

lists as well. And you guys are well aware that your starting value is always included whereas your ending value is

never included. All right. So whichever value you want at the end, you give your stop value as one more than the index of the

value you want at the end. All right. So say I want values at index 2 and three. My syntax for that would be s[2:4].

Okay. So as you can see over here I got my values at index 2 and three. All right. Now another way to perform

indexing is by using a function called iloc, which basically does location-based indexing. So what this means is it is

going to use the positions, that is the 0 1 2 3 4 that you have as indexes, and it is going to fetch the values at those

particular positions for you. So all you have to do is say s.iloc[3], and I print it and you can see I

got 40. So at index three I have value 40 and it's giving me the correct value. Okay. Now if I want multiple values all

I have to do is say s.iloc and use double brackets and pass all the indexes that I want my values at. Okay. So I

want value at index 1 3 and four. So as you can see I got my output just as I intended. So this was all about location

based indexing. Now another important feature in series is that you can change the index to suit your needs. Okay. So

instead of 0 1 2 3 4 as the index: these numbers actually represent the calories. Okay, these are the different calories. So you

can see I have changed the column name to calories. And now I want these calories to be labeled by fruit. Okay,

so I'm going to create a list of all the fruits that are present in my series. So you can see I've created a random

list of fruits and I've stored it in the variable index. Now what I'm going to do is I'm going to say s.index and

I'm going to assign my index variable to it, and then I'm going to print my series. So you can see over here I am basically

trying to say my apple has 10 calories, banana has 20 calories, grapes has 30 calories and so on. Okay. So this is how

you can change your indexes according to the information you are trying to represent. Now do not comment that I

have put wrong calories for the wrong fruits. This is just random data that I have created just to show you

guys how indexing works. Okay. So now if I try to access any value using my indexes. So if I want to find the

calorie of grapes, I can just say s['grapes']. Okay. In square brackets. Now can I use the iloc function with this label? Let's

try it. Okay. So I get an error because this is not a location anymore. Right? So when I had my numerical indexes,

iloc was working absolutely fine. So I can still use iloc, but only with numerical positions, right? The original indexes,

that is, just as we have indexes in lists and indexes in numpy, we similarly have positions in series as well. Okay. So we

can do that with iloc. But again, how will we remember what is at index 3, right? So that is why we have another function

called loc. This is label-based indexing. Okay. So using loc we can go ahead and pass the label, and you can see I get my

value. Okay. Now another important thing you need to remember is in label based indexing your start as well as stop

value both are included in the output. Okay. Now this is a very important thing that you need to remember because unlike

normal indexing, or the indexing that we do in lists, numpy and all, where our stop value is

never included. All right. Now say I want my calories of banana, grapes and orange. Okay. So what I'm going to do

over here is slice with loc from banana to orange, and you can see my banana, grapes and orange. So all three of them, that is my starting value as well as my

ending value, have been included. I hope the difference between label-based indexing and location-based indexing, that is your

normal indexing, is clear: the difference between loc and iloc. Okay. So let me

give you guys a quick uh recap. We learned how to create a series using a list. Okay. We saw the default indexes.

We saw the default name which is none of the column. Then we saw how to check the data type of the series. We saw how to

check the values and the indexes of the series. Then we saw how to rename our column or the series. Okay. And then we

went ahead and looked at indexing. We saw normal indexing. We saw indexing using iloc. Okay. Then we went ahead

and changed the index of our series. We gave a different set of indexes to our series using s.index equal to the

new index. And then we went ahead and performed indexing on our labels, that is, we used the new indexes to index the

values. And then we also looked at label-based indexing, that is, using loc. Okay. Now if I want to access multiple

elements over here, multiple values, again I have to give double brackets and I can say grapes and then I want apple

and I can print it. So you can see I got my desired output. Okay. So that is all about indexing. Now let's also see how

we can create a series using a dictionary. Okay. So I have already created a dictionary for you guys which

is basically fruit protein. So our keys are the fruits and our values are the proteins or the grams of protein stored

in that particular fruit. Okay. So this is our dictionary. Now, if I want to create a series out of this dictionary,

all I have to do is, well, I'm going to name it s_2 because this is my second series, and I'm going to say pd.Series,

pass in the fruit protein dictionary, and I'm going to print it. Okay, so you can see, by default the name of our column over here

that is representing the protein is zero. So I can go ahead and say name equal to

protein. Let's run it. Now my series is giving us a complete information of what our data is representing. Okay. And you

can see the data type over here is float. Okay. Again you can go ahead and perform similar kinds of operations

on this particular series. I will be sharing this notebook with you guys in the description below. So you can

download it and do the necessary operations that we saw in our earlier series that is our series S. So all

those operations can be performed on this particular series as well. So I want you guys to pause the video,

download the notebook and perform these operations and if you guys face any doubt comment it in the chat box below.

I will respond to you guys at the earliest possible. Okay. All right. So I'm assuming you guys are done with the

earlier operations. Now we'll be looking at something important and something new which is called conditional selection.

Okay. Now what conditional selection basically means is now I want to fetch all the fruits over here that have

protein greater than one. Okay, I want to find out all the proteins that are greater than one. So what I'm going to

do is I'm going to say s_2 > 1. Now you can see over here I've got boolean values that are representing

true and false. So wherever I have protein greater than one I got a true and wherever it is one or less I got a

false. Okay, but what if I want only those rows in my output that have the value greater than one? So what I'm

going to do is I'm going to put this conditional selection statement inside my series. So I'm going to pass

it inside my original series and you can see I got the result. Okay, so all these fruits have protein greater than one.

Okay, so this is basically how conditional selection works. It is first going to provide you with a mask

series, a mask series being basically a series of boolean values, true and false, and if you pass that mask

series into your original series using square brackets as we just did, you get your desired output. So our next topic

is logical operators. Okay, for logical operators we have and, or, and not. So I hope you guys are clear with what

and, or, and not are. Okay, if you guys are not, let me just give you guys a quick example. Okay, so say I have two values,

one is true, the other is also true. Now, if I perform an and operation between these,

my result is also going to be true. Okay? Now, if I have a true and a false and I perform an and operation between

these two values, my result is going to be false. Okay? Similarly, if I have a false first and a true later on and I do

and operation, my result is going to be false. And if both my values are false, my result is again going to be false.

Okay? So, we need both our values to be true in order to have a true result. Okay? But in case of or, only one of our

values has to be true: true or true is going to be true, and if I have true or false, it is again going to be true. Okay? If I

have false or false then it is going to be false. Okay. So this is basically and and or. Okay. Now let's see how it

is used in a series. So say I want to give a range. Okay. I want to see all those

fruits whose protein is greater than 0.5 but less than two. Okay. So all I'm going to do is write the series greater than

0.5, then the & symbol for and, then the series less than 2, with each condition in its own parentheses. So this is the syntax that you have to follow. Okay. So you can see I got the true and false

values. Now if I want to uh find only those rows which satisfy this condition. So you can see all these values are

going to be between 0.5 and 2. Okay. I can also say less than or equal to two, and now I have my

fruits which have the protein value equal to two as well. Okay. So this is how and is performed. So basically what

happened there is wherever we had a value greater than 0.5 but less than two, this range was passed to us when we used the and

operator. Okay. Now if I were to use the or operator in this particular scenario where I'm trying to find this

range, would I get this desired output? No. For the or operator, my first condition would be greater than 0.5. So

I would get all my values that are greater than 0.5 and then my second condition would be less than 2. So I

would get all the values which are less than two. That means below 0.5 as well. All right. So as you can see over here I

have 0.3 and I have 2.6 as well in my output. So this is basically telling me that I have used an or operation over

here. Okay. So similarly you can use and and or, and you can also use your not operation. So I can say I want all my

values which are not greater than one. Okay. So you can see all these values are one or smaller. Okay. So this is

how the logical operators like and, or, and not are performed. And then finally we have the topic of modifying the series.

Okay. So say instead of my protein of mango being 0.8, I want it to be 2.8. So what I'm going to do is, just like a

normal list, I'm going to go ahead and say s['mango'] and I'm going to assign 2.8. Okay. So it's run successfully. Now if I

print my series, you can see my mango protein has been changed to 2.8. Okay. So this is how you can modify your

series. Okay. At any particular index you can change the value. You can also change the index, but you will have

to pass your entire index again. Okay. So this was all about series. Now we're going to move ahead with data

frames. So I hope series was clear. If you guys have any doubt, comment your doubts in the chat box below. So I will

respond to you guys as quickly as possible and do make sure that you guys have this particular notebook downloaded

in your systems and you guys are working with me side by side. So that's going to make the entire process more fun. Okay.

So so far you guys have gotten a basic understanding of series and we have already covered numpy. So I have a

question for you guys. So basically I have a series and I want you to find the answer of this particular query. Okay.

So if I say s.notnull() and then ask for the sum over that, I want you guys to quickly

comment the answer in the chat box and tell me what's going to be the answer if we run this particular query. Okay.

So moving on let's look at our very next topic that is data frame. So it is important and very interesting if you

guys understand it carefully. So basically I have some data in the form of dictionary as you can see over here.

So I have some names, I have some ages and I have some departments and salaries. Okay. And now I want to

convert this particular data. So if I print this data, it's going to look like this. Okay. It's not very neat and it's not

very easy to understand. So what we're going to do is we're going to go ahead and convert it into a data frame and let

me just print the data frame for you. Okay. All right. So I have this particular data set with me. So if I

want to see just the starting values, I can just say df.head and say I want to see the first two rows. Okay. So I

can do it using the head function. And if I want to see the last values, I can just say

df.tail. Okay. So by default the parameter over here is five. But if you want, we can tailor it to our own needs.

So I want to see the last three rows. So you can see five, four and three. Okay. So moving on, we can again use our

loc and iloc functions just like we did with our series. So say I want to fetch the first and second row of my data set. So

what I'm going to do is I'm going to say df.iloc and I'm going to say 1 to 3. Okay. So you can see I have my

first and second row and all the columns. Okay. So now, out of all these columns, what if I want only

age and department in my output? So what I'm going to do is I'm going to say df.loc, so instead of

iloc, with 1 to 3 for the rows, and I'm going to say I want columns age and department. So I'm going to have a comma over here to separate

the columns. So you can see over here I have age and department and my three rows. Okay. So we discussed earlier

whenever we use the loc function our starting as well as ending indexes are going to be included. So I have 1, 2

and three rows instead of just one and two. So if I want just the two rows I have to say one to two. All right. So I

hope this is clear. Okay. So we can use iloc and loc in this manner. But what if I wanted to include my first and

second column in my iloc call itself? So you can see I have my first two columns in my output. Okay. So

basically when we're using these for data frames we have to pass our rows and our columns, and the ranges for these are going to be

separated by a comma. Okay. So we can do that using iloc and loc. So we'll be using this a lot. Okay. Direct indexing

in a data frame gets a little complicated. So that's why we'll be only using the iloc and loc functions. Okay. So

you can go ahead and play around a little with iloc on different rows and different

columns. So now we'll be moving to accessing only an individual column. Okay. Now if I want to access just one

particular column, all I can do is say DFH. So this is how I get only the H column. Okay. Now if I want multiple

columns I have to add another bracket and I can simply separate my uh columns and say I want age and department. So

you can see this is how I get my output. Okay. Now what if I want to drop this age column. Okay. So you can see I have

a null value also over here. Now I want to drop this age column because I feel it's not very relevant to my data

set. So how I can drop it is I will just say df.drop and I will pass age, okay, and I will pass axis equal to 1 because

I want this entire column to be gone. Okay if I want a particular row to be gone my axis would be zero. Okay so it

is a particular row wise operation but if I want a column gone I want my axis to be one. Okay. So if I run it now, you

can see my age has disappeared from this particular output. But if I print my data frame, you can see the age is

still there. Okay. That stays the case until and unless we use the parameter inplace equal to True. By default, inplace is equal to

False. Okay. So inplace is basically telling pandas to perform the operation on the original data frame. Okay. So if we

want the changes to be displayed in our original data frame we are going to use in place equal to true otherwise by

default it is going to be false and it is only going to return a new data frame and our original data frame is going to

remain unchanged. Okay. So as you can see it has been unchanged. Now this is a missing value over here. This and this.
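
As a rough sketch of the selection and drop steps described above: the table values here are made up (the video's data frame is not fully shown), and the lowercase column names are an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the employee table used in the video.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eva", "Alice"],
    "age": [25.0, 30.0, np.nan, 35.0, 29.0, 25.0],
    "department": ["HR", "IT", "HR", "IT", "Finance", "HR"],
    "salary": [50000.0, 60000.0, 55000.0, np.nan, 65000.0, 50000.0],
})

# iloc is positional and excludes the end of the slice: positions 1 and 2.
by_position = df.iloc[1:3]

# loc is label-based and includes the end of the slice: labels 1, 2 and 3.
by_label = df.loc[1:3, ["age", "department"]]

# Rows and columns are separated by a comma; here the first two columns.
first_two_cols = df.iloc[1:3, 0:2]

# One column gives a Series; a list of columns gives a DataFrame.
ages = df["age"]
subset = df[["age", "department"]]

# drop returns a new DataFrame by default, so df still contains "age".
without_age = df.drop("age", axis=1)

print(by_position.shape, by_label.shape)
print(list(without_age.columns))
print(list(df.columns))
```

Assigning the result back (df = df.drop("age", axis=1)) is a common alternative to inplace=True when the change should stick.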

So we'll be uh dealing with these missing values later on. But first let's understand a few more functions. So we

can go ahead and check the shape of our data set using df.shape. So this is telling us that we have six rows and

four columns. Okay. 1 2 3 and four. Okay. So again the indexes are not counted as a part of a column just like

we did in series. Okay. So these are by default values over here. And we have six rows. Okay. So we basically have six

samples. Okay. These are known as samples. This is the sample sets. Okay. So we can also go ahead and check the

data type of our data frame. So we can also go ahead and check other information about our data set. So using

info. Okay. So this is telling us that we have these particular columns name, age, department and salary. Okay. And

these are the data types in our column. So we have two columns having float data type and two columns having object data

type. Okay. So name has object data type, age has float data type, department has again object data type

and salary has float data type. Okay. Now over here we have a non-null count. Non-null count basically means that we

have six non-null values. That is we have total six samples and all of these six samples are non-null. Okay. So

basically in the name column we have zero null values. Okay. And we move to the age column and over here we have

five non-null values. That means we have one null value over here and over here also we have one null value whereas over

here we have zero null values. Okay, it is also displaying the memory usage to store this particular data set. Okay,

and another operation that you can perform on your data set is describe your data set. So over here you get uh

statistical information on your data. So over here we only have two columns that are of type float or integer. Okay.
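
A minimal sketch of these inspection calls, again on made-up data with assumed column names. Note that the 25%, 50% and 75% rows that describe prints are the quartiles of each numeric column, with 50% being the median.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the employee table used in the video.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eva", "Alice"],
    "age": [25.0, 30.0, np.nan, 35.0, 29.0, 25.0],
    "department": ["HR", "IT", "HR", "IT", "Finance", "HR"],
    "salary": [50000.0, 60000.0, 55000.0, np.nan, 65000.0, 50000.0],
})

# (rows, columns); the index is not counted as a column.
print(df.shape)

# Column names, non-null counts, dtypes and memory usage.
df.info()

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)
```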

So we get these particular statistics, that is the count, mean, standard deviation, minimum value, and

these 25%, 50% and 75% are basically the quartiles of our data, with 50% being the median. Okay. And the maximum value. Okay. So using the

describe function you get a basic information about your data like you have five values in your age column and

five values in your salary column that are present. Then the mean value of your age column is 28 whereas mean uh value

of your salary column is 5840. Standard deviation is this. The minimum value of age is 25 and minimum

salary is 50,000. Then you have other information about your data set. Okay. So in numpy we saw an important concept

called broadcasting. Okay. We also looked at the rules for broadcasting. Now we'll see how broadcasting is

performed in pandas. Okay. So I have a salary feature over here. So our columns are also known as

features. Okay. So these are our columns or our features. Okay. So if I want to say increase the salary of all the

people over here. Okay. Of all these people I want to increase the salary by say 5,000. All I have to do is say DF

salary and I will say DF salary plus 5,000. Okay. Now you guys must be wondering this is a scalar uh digit or

an integer whereas this is a one-dimensional array right if I just print df dot salary let me go ahead and

print df salary if I print this you can see this is a one-dimensional array right this is basically an array having

five rows and one column right and this over here is just a scalar value now I want to increase all these values by

5,000. So do you think if I perform this particular operation, I will get an error or will it successfully perform?
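
The question can be checked directly with a small sketch (the salaries and column names are made up for illustration): adding a scalar to a column applies it to every row.

```python
import pandas as pd

# Hypothetical salaries; only the broadcasting behaviour matters here.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "salary": [50000, 60000, 55000],
})

# The scalar 5000 is broadcast across the whole column.
df["salary"] = df["salary"] + 5000

print(df["salary"].tolist())
```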

So if you guys said it will successfully perform, you guys have understood broadcasting really well. So if I go

ahead and print my salary now, so you can see it has been increased by 5,000. Okay. So what's happening over here is

instead of uh treating this as a scalar value, it has been broadcasted to match the shape and size of this particular

column. Okay. And then the operation has been performed over here. Cool. So moving on to the next concept. We have

renaming columns. That is we can rename the columns in our data. Okay. So if I want to rename a column, all I have to

do is call df.rename and pass my columns. Okay. So I want my department column to be

displayed as dpt. Okay. And I'm going to say in place equal to true because I want this operation to be performed in

my original data set itself. Okay. So if I go ahead and print my data. So you can see my department has been changed to

dpt. Okay. So this is how you can go ahead and rename your columns using dot rename function. So you can also go

ahead and check the unique values in your data for a particular column. So I want to see all the unique salaries in

my data set. So you can see over here these are the unique salaries in my data set. Okay. I can check it for another

column. So I can check the unique departments. So again these are case sensitive. So if I check the unique

departments in my data set. So I have three unique departments that is HR, IT and finance. Okay. Now if I want to see

how many employees are in which department, I can go ahead and do the value counts of these departments. So I

will say df, the dpt column, dot value_counts. Now as you can see over here HR department has three counts, IT has two counts and

finance has one count. Okay. So basically it's telling us that there are three people in HR team, two people in

IT team and one in finance team. Okay. So this is how you can uh check the distribution of one particular column

across your data set using value counts. Now what if I want to create a new column called promoted salary that is I

want to check the salary of all the employees after their promotion. Okay. So say some of my employees have been

promoted and I want to create a promoted salary column in my data set itself. So if I want to do that all I have to do is

say df with promoted salary in brackets equals the original salary column, and after the promotion the salary is going to be, say, multiplied by 10. Okay. So if I run this

particular query and then print my data set sorry my data frame. So you can see the salaries have been increased in my

promoted salary column. Okay. So they have been multiplied by 10. Again this is a broadcasting rule. again. Okay, so

this is how you can go ahead and create a new column in your data set or your data frame. All right, so now that we

have learned how to create a new column, I think it's time to dive into data cleaning. So as you can see, we have a

few null values in our data set. So I want to get rid of them because till I have those, I will not be able to

perform some operations. Plus, it doesn't look very good. Okay. So, first I'm going to check how many null values

exactly I have. So, all I did was I check for my null values inside the entire data set and then I did a sum of

all those values. So, as you can see over here, my name column has zero null values. My age column has one null value

and my salary column also has one null value and so does my promoted salary column. Okay, so I have a few options

over here. I can either just get rid of all the rows that have these null values. Okay, so I can do that by using

df.dropna. NA stands for not available, that is, missing values. So I'm basically dropping all the rows with null values. And if I just run this uh

prompt, if I just run this uh without any sort of argument or without any sort of parameter, as you can see, I'm just

left with four rows. Okay. So basically any row that had null value has been removed and again I have not used in

place equal to true. So my original data set is still untouched. Okay. So this is basically just a new sort of data set

that has been altered and returned so that I can see how it's going to look. All right. So by default over here you

can see my argument of how I want my null values to be deleted or dropped is any. So basically any row that has any

sort of null value has been dropped. Okay. So if I run this using how equal to any. So again my result is the same.
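
A small sketch of counting and dropping nulls, on made-up data with assumed column names. The how argument also accepts "all", which only drops rows where every value is null.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing age and one missing salary.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eva", "Alice"],
    "age": [25.0, 30.0, np.nan, 35.0, 29.0, 25.0],
    "salary": [50000.0, 60000.0, 55000.0, np.nan, 65000.0, 50000.0],
})

# Number of null values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Default behaviour, same as how="any": drop rows containing any null.
dropped_any = df.dropna(how="any")

# how="all": drop a row only if every value in it is null.
dropped_all = df.dropna(how="all")

# Neither call modifies df, because inplace defaults to False.
print(len(df), len(dropped_any), len(dropped_all))
```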

So we saw one option for the how parameter in dropna, that is how equal to any. Now the other option is all.

Okay. Now what this is referring to over here is a particular

row. Okay. when we said how equal to any, we were just saying any row that had any null value. Okay, so whichever

row had any null value was dropped. Now over here we are saying if all the values in a row are null, we are going

to drop that row otherwise we are not going to touch it. Okay, so we do not have any row that has all the null

values. We just have uh this row that has two null values but again we have three uh non-null values as well in this

particular row. So again none of my rows were altered and the data remains as it is. Okay. So this is one option when we

are dealing with missing values. We can just drop we can just get rid of those null values. Okay. The other way is to

fill those missing values. Now how how I can do that? So let me just find okay here here is my null value right. So I

can either fill this null value with the most repeated value in this particular column. Okay that is one option. Let me

just write it down for you guys. Okay. So dealing with missing values is one of the most important aspects when it comes

to a data set because you will be dealing with a lot of unclean and messy data sets and missing values is going to

be one of them. Okay. So let me just write missing values. So one way of dealing with missing values is getting

rid of them which we just saw by using drop NA. Okay. And the next one is by filling them up. And for filling those

missing values we use the function fillna. Okay. So this is the second method. Let us see how it works. Okay. So all I

have to do is say df.fillna. Okay. And let me just run it. So it tells me I must specify a fill value or method. Now I

want to fill all the null values with say zero. So you can see this is how this is what my data set is going to

look like finally. Okay. But this is again a very um noob method to perform on a data set because I do not want my

age to be zero or my salary to be zero just because it's missing. Right? So instead what I can do is for my age

column I am going to fill all my missing values with the mean. Okay. And I'm going to say okay let's not keep it in

place equal to true because I do not want to modify my original data but if you guys want to you you guys are free

to do it. Okay. So I'm just going to replace all my missing values in the age column with the mean of the age. Okay.
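
A minimal sketch of filling with a constant or with a column statistic such as the mean or median. The data and column names are assumptions, and the results are kept in new variables so the original frame stays untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing age and one missing salary.
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 29.0, 25.0],
    "salary": [50000.0, 60000.0, 55000.0, np.nan, 65000.0, 50000.0],
})

# Naive option: every null becomes 0, which distorts ages and salaries.
filled_zero = df.fillna(0)

# Fill the missing age with the mean of the non-null ages.
age_filled = df["age"].fillna(df["age"].mean())

# Fill the missing salary with the median of the non-null salaries.
salary_filled = df["salary"].fillna(df["salary"].median())

print(age_filled.tolist())
print(salary_filled.tolist())
```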

So initially my third row had a missing value and now it has been replaced with the mean. All right. Similarly for the

salary column what I'm going to do is if I replace it with the mean it's it's not going to be very technically or

mathematically correct. So instead what I can do is I can replace it the median. Right? I can go ahead and uh replace my

salary in my DF column and I'm going to say I want to I want it to be filled with the

median of my salary. So you can see over here my fourth uh row was missing in my salary

column and now had it has been filled with the median value. Okay. So this is another way that you can fill your

missing value. Okay. Now this is something that we can do mathematically, right? But what if I want to just

randomly fill my value? So this was the missing value, right? So there is another method in fill NA. Instead of

providing a default value, what we can do is something known as forward fill. So basically when we use fill NA, we

have two methods forward fill and backward fill. So in forward fill, what happens is we move from top to bottom.

Okay? and say this was our missing value. So when we use forward fill, 35, the value just above it, is going to be filled into our missing

value. Okay. And similarly when we do backward fill, 29, the value just below it, is going to be filled into our missing value. Okay. So this is

basically what forward fill and backward fill means. So let us quickly go and perform forward fill

first: dot fillna, and in our method parameter we're just going to pass forward fill (ffill). Let's run

it. So you can see 35 has been uh replaced in our missing value column. Okay. So it has been uh done using

forward fill. Similarly I can perform a backward fill. So you can see 29 has been replaced in my missing column.
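
A small sketch of forward and backward fill on a made-up age column. The video passes a method to fillna; this sketch uses the equivalent ffill and bfill methods instead, since recent pandas versions warn about the method argument.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with a gap between 35 and 29.
ages = pd.Series([25.0, 30.0, 35.0, np.nan, 29.0])

# Forward fill copies the previous valid value downwards.
forward = ages.ffill()

# Backward fill copies the next valid value upwards.
backward = ages.bfill()

# A leading null has nothing before it, so forward fill leaves it null.
leading_gap = pd.Series([np.nan, 1.0, 2.0]).ffill()

print(forward.tolist())
print(backward.tolist())
print(leading_gap.tolist())
```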

Okay. Now using these methods backward fill and forward fill can be a little tricky and they can also leave gaps

in cases where your first or last value is null. Okay, so in case your first value is null and you try to do a

forward fill, your first value will still remain null. All right, and same goes for backward fill. If your last

value is null and you do a backward fill, your last value will still remain null because there won't be any value

after that to replace it with. All right. So in cases like that, we go for uh statistical operations like mean,

median, mode. All right. According to what's more technically sound. Okay. So these are some of the methods to uh deal

with your missing values. Okay. So now we saw how we can deal with missing values. Now we'll look at another

operation where we can uh say replace a particular value randomly. Okay. So say instead of Charlie over here, I want to

replace this name with another name. So now just because we did not use in place equal to true throughout our queries

when we were dealing with our missing data does not mean we have uh not already dealt with them. we have already

dealt with our missing data and currently we shouldn't have any sort of missing values in our data set. Okay. So

moving on I want to just replace this particular name, say with Rose. Okay. So I want to I want uh

instead of Charlie I want Rose in my name column. I just have to say df.replace. Okay, df with the name column.

Okay, I have to specify the name. Sorry, I have to specify the column and I have to say replace Charlie with Rose. Okay,

so as you can see my operation has been performed successfully. Okay, and again if I wanted this change to be reflected

in my original data set, I would have to assign the result back to df name, and I would basically be storing the change in my original data

set. And now I can go ahead and run it. So as you can see I have changed the name from Charlie to Rose in my original

data set as well. Okay. So this is how you can make your changes permanent. Okay. All right. So let's look at the

next concept when we're uh cleaning our data that is dealing with duplicate data. Okay. So we already saw how we

deal with missing values. Now we will see how we deal with duplicate values. Okay. So as you can see in this

particular data set that we have over here we have the entry Alice twice. Okay. And as you can see the age

department salary and the promoted salary they match for both these records. Correct. So how do we deal with

duplicate values? So we already have an inbuilt duplicated function. So in order to check for all the duplicate values

all you have to do is call df.duplicated on the data frame variable, and when you run it, your

output will flag the second, repeated record in your data set, right? So we are not going to consider the first occurrence as the duplicate;

rather we're going to consider the second one as the duplicate. So let's see how duplicated works. Okay. So say I have a record over

here of different countries okay so I have Italy France Greece and then Italy again okay so if I go for duplicated

method what uh what happens is we work our way from the top itself to the bottom. Okay. So, first I'm going to

check for Italy. So, Italy is not present. No problem. I'm going to simply add it to my list or uh a list where

basically I'm keeping a track whether an element is duplicate or not. Right? Then comes France. So, it's not there. No

problem. I add it to my list. Then we come across Greece. No problem. I add it to my list. Then I come across Italy

which is already present in my list. So therefore this element is going to be marked as duplicate. All right. So this

is basically how the duplicated method works by default. All right. So basically we can pass a parameter called

keep and we're going to say keep first. Okay. So let's see what happens now. So you can see our first element is not

going to be marked as duplicate. So this is basically the default value that duplicated takes for keep. Okay. If I say

last instead of first. You can see my first element is marked as duplicate. Okay. So when we say keep last basically

we are going to work our way from bottom to up. Okay. So that's why this element is going to be marked as duplicate. All

right. So I hope this is clear. So if I want to drop this duplicate value, all I have to do is call drop_duplicates. Okay. So I'm

not going to pass anything because I would prefer my second record to be deleted. So I'm just going to run it. So

if I print my data set, you can see my duplicated record that was the second record has been removed or deleted from

my data set. All right. Okay. So I hope the data cleaning process is clear. we have uh learned how to deal with missing

values as well as duplicate values. Okay. All right. So let's see how we can deal with invalid values. Okay. So since

I want to make the changes in my promoted salary column, I have simply written DF uh and promoted salary in the

brackets. And now in order to do these uh changes in the column of a particular data set, I will be introducing you guys

to a very important topic or an important function called lambda. So you guys must have studied this in Python as

well, right? So all we have to do is take promoted salary and I will say apply, lambda x, and I'm going to say divide

that x by 10. x is nothing but each record of my column promoted salary. So this is my input variable x and this is my output,

x / 10. But I have a condition: if x is greater than 6 lakh 50,000 (650,000). Right? Else if it is not greater than 6

lakh 50,000 I'm simply going to return X. All right. So let me run it. Okay. So it has run successfully. I'm getting

some sort of documentation but I'm going to ignore it for now. Okay, so as you can see the two records

that I had that were greater than 6 lakh 50,000 have been divided by 10. All right, so this is how you can deal with

your invalid values using the lambda and apply function. All right, so this is the syntax for that. So moving on, we'll

also look at some string functions. Uh so in this particular sample data set that I have taken, we do not have any

sort of explicit uh need to do these string functions. Okay. So maximum times whenever you have a string input. Okay.

So say the names over here I just have the first name but in case I had a name like

Alice_Fernandez, say. Okay. So in order to deal with inputs like this all I would have to do was df and I would

create two uh separate columns saying first name and last name. Okay. And then all I would do was DF and

I would split my name column. So basically I'm right now I'm using name as this name instead of my column name

because I do not have any sort of input like this which I want to demonstrate to you guys. Okay. So even though this is

not a column per se, this is just a variable that I've created and I'm pretty sure I'm going to get an error.

Okay. So I am going to split it using the .str.split method. Okay. And I want to split it based on underscore. Okay.

So wherever I have an underscore, I want to split my string. Okay. So in case instead of underscore, all I had was a

space, I could pass a space over here instead. Okay. But since I have an underscore between my uh first string

and my second string, I'm going to split it based on my underscore. Okay. And let me see if I run it what's going to

happen. Okay. So I do not have any column name. But again this is how you would split your uh data set. So

obviously this code is not going to run for me because I do not have any column called name. And wherever I have name I

do not have any uh sort of connectors between two strings. These are just single names. All right. So I cannot I

will not be able to split these uh particular uh strings over here that I have. So this is just a demo for you

guys that in case you have a column like this, this is how you're supposed to deal with it using the .str.split method.
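
Since the demo could not run on the video's data, here is a small sketch on a hypothetical column of underscore-joined names. The column names first_name and last_name are assumptions.

```python
import pandas as pd

# Hypothetical names joined by an underscore.
people = pd.DataFrame({"name": ["Alice_Fernandez", "Bob_Smith"]})

# expand=True returns one column per piece; n=1 splits at the first
# underscore only, so any later underscores stay in the last name.
parts = people["name"].str.split("_", n=1, expand=True)

people["first_name"] = parts[0]
people["last_name"] = parts[1]

print(people)
```

For names separated by a space, passing " " as the separator (or no separator at all, which splits on whitespace) should work the same way.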

All right. So now that we are clear with data cleaning, I want to show you guys a few more examples of apply and uh

lambda. Okay, because there will be instances where these methods are going to be very useful because they let you

change the single entries of a column or a row, right? Based on particular conditions. All right, so let me create

another scenario where I want to multiply the age. Okay, I want to multiply all the ages by two. Okay. So

all I'm going to do is say df age, and then I'm going to call my age column over here. And then I'm going to say I

want to apply. Now instead of passing lambda, what I can do is I can go ahead and create a function where I am

multiplying my age. Okay, I'm going to have my input as x and then I'm going to simply return x into 2. Okay. So instead

of passing lambda over here, what I can do is I can say multiplying age and pass x. Okay. So it will not take any input.

I'm sorry about that. So let me run it. Okay. So now if I print my uh data frame, you can see my values of age have

been multiplied by two. Okay. So let me bring back my ages to the original value. Now this time I will use lambda

instead of creating a function by default. Okay. And I'm going to say apply and I'm going to pass lambda and I

will say x and I will say x / 2. Okay. Now there are no conditions. There are no if else conditions. I want all of

these entries in the age column to be divided by two. So I'm not going to pass any if else like we did previously.
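
The three uses of apply described so far can be sketched as follows. The values, the column name promoted_salary and the function name multiply_age are made up for illustration.

```python
import pandas as pd

# Hypothetical data; one promoted salary is ten times too large.
df = pd.DataFrame({
    "age": [25, 30, 35],
    "promoted_salary": [500000, 6600000, 550000],
})

# Conditional lambda: divide by 10 only when the value exceeds 650,000.
df["promoted_salary"] = df["promoted_salary"].apply(
    lambda x: x / 10 if x > 650000 else x
)

def multiply_age(x):
    """Return the given age doubled."""
    return x * 2

# Pass the function object itself; apply calls it once per value.
doubled = df["age"].apply(multiply_age)

# Unconditional lambda: halve every value to get the original ages back.
restored = doubled.apply(lambda x: x / 2)

print(df["promoted_salary"].tolist())
print(doubled.tolist(), restored.tolist())
```

For simple arithmetic like this, the broadcast form (df["age"] * 2) does the same job; apply is mainly useful when the logic per value is more involved.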

Right? So I'm going to run it. And if I print my data frame, my ages are back to the original value. Okay. So this is how

lambda and apply functions are used. And you will be using them quite a lot. Okay. So moving on, I want to tell you

guys about another very important concept that is joins and merges. Okay. Now in order to understand joins, let us

go through a quick overview of what are joins. Okay. So consider this scenario where I have two data sets A and B.

Okay. Now when you're dealing with large data sets, it's not necessarily that you're going to have all the columns in

one particular data set. Okay. So if you guys remember, I showed you guys an Excel sheet of uh various columns or

various features of a of a house, right? number of balconies, number of um bedrooms, the location, the state and so

on. Right? Now I had all those columns in one particular sheet. But what if I had two sheets A and B where in one

sheet I only had say the features of the house that is the number of bedrooms the number of bathrooms the balconies and in

another sheet I had the different features say when the house was built where it's located what's there nearby

and stuff like that okay so I had two separate sheets now I want to combine these sheets all right so there are

different methods. So if we talk about joins, there is left join, right join, outer join and inner join. Okay,

these are the joins that exist. And another method is to merge. Okay, merge by default behaves like an inner join. All right, so

let's quickly see what left join would look like. So if we were to combine these two data sets on say a particular

property and we perform a left join, we keep our left side, data set A. Okay, whatever is common between the two

data sets is going to be matched up. Whatever exists only in data set B is going to be ignored. So we're going

to have every row of data set A, along with the matching information from B. And this happens in left join. Okay.

Similarly, in right join, it happens the opposite way. We're going to have every row of B, with the matching information from A.
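
A minimal sketch of how the join types behave in pandas, using pd.merge with the how parameter on two made-up tables. In pandas a left join keeps every row of the left table and fills in NaN where the right table has no match; a right join does the same for the right table.

```python
import pandas as pd

# Hypothetical tables sharing a "department" column.
a = pd.DataFrame({
    "department": ["HR", "IT", "Finance"],
    "employees": [3, 2, 1],
})
b = pd.DataFrame({
    "department": ["HR", "IT", "Sales"],
    "location": ["New York", "Boston", "Chicago"],
})

# Left join: all rows of a; Finance has no match, so location is NaN.
left = pd.merge(a, b, on="department", how="left")

# Right join: all rows of b; Sales has no match, so employees is NaN.
right = pd.merge(a, b, on="department", how="right")

# Outer join: every department that appears in either table.
outer = pd.merge(a, b, on="department", how="outer")

# Inner join (the default): only departments present in both tables.
inner = pd.merge(a, b, on="department", how="inner")

print(left)
print(len(right), len(outer), len(inner))
```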

Okay. And when we perform a full outer join, we're going to have everything of A and B. Okay. So each and everything

that we have in A and B is going to be included in our outer data set. Sorry, outer join. And when we perform an inner

join, it basically happens over here. Okay, only the common properties between A and B. And when we merge two data

sets, we merge them based on this particular common column. and whatever properties A and B have both are visible

in our merge data set. Okay. So let me say this is another data set. So let me just convert it into a data frame. So

I'm going to call it df2 and I will say pd.DataFrame and I will pass my department data. Okay. So let me just print

my other data frame. So as you can see this is my second data set. So now if I want to combine the information from

this data set and the data set that I had earlier, I have something known as concat. Okay. So concat is going to let

me merge the information from both these data sets. So if I use pd.concat and pass the two data sets that I have, this is

the output that I'm going to get. Okay. So since in our second data set we have a location and manager, but we do not

know uh which employees have this location and this manager, we have a null over here. Okay. So by default both

my data sets are being combined row-wise. That is, they are being stacked vertically, one below the other. Right? So

wherever I don't have the names I am getting null and wherever I don't have the location and manager information I'm

getting a null. Okay. But if I were to do the same operation column-wise that is if I were to combine these two data

sets in this particular direction I would have to say pd.concat, pass my two data sets df and df2 and I would

say I want it to be joined on axis 1. So as you can see over here wherever I have department HR my

location and manager is for the department HR right so this is how you can combine or merge two data sets using

pd.concat. Okay, there is another function called pd.merge. Okay, so you can pass your data sets df and df2, and I want it to be

merged on my column department because that is the common column. So as you can see over here, with concat my department is

being repeated twice. But if I do not want that to happen, I can use merge. And as you can see over here, my

department column appears only once. And my data is managed accordingly. So wherever I have HR department, I'm going

to have my location as New York and manager is going to be Laura. Okay, it is you can see it over here as well. New

York and Laura. All right. So this is also how you can merge or combine your uh two data sets into one. Okay, so

these were basically all the operations that were there to know on data sets. The basic operations obviously as you go

down the line you deal with bigger data sets your operations get more complex but this is the standard basic base that

you require to get started with pandas. All right. So now we're going to look at how we can import uh a complete data set

file that is an Excel or a CSV file in our notebook. Uh so since I'm using a Google Colab notebook I cannot uh

directly pass the entire path of my file or my CSV or Excel sheet. If you're using Anaconda or VS code you can do

that. So if you're using those platforms, all you have to do is right click on your uh data set and click on

copy as path and create a variable and just say pd.read_csv because this is a CSV file and

all you have to do is pass the path inside the parenthesis as an argument. All right. But since this is a virtual
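
A small sketch of read_csv. To keep it runnable without the video's file, the CSV content here is a made-up two-row sample read from an in-memory buffer; with a real file you would pass its path (for example "data.csv") instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the house price file.
csv_text = (
    "date,price,bedrooms,bathrooms\n"
    "2014-05-02,313000.0,3,1.5\n"
    "2014-05-03,2384000.0,5,2.5\n"
)

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first five rows (here only two exist)
print(data.shape)    # (rows, columns)
```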

environment and it is not going to be able to access my local environment, I what I have to do is I have to upload my

data on Google Colab. Okay. So I'm simply going to select it. So as you can see it's uploaded over here. Okay. So

now all I have to do is in quotes I have to say data.csv. So you can see it's run

successfully and now I can do data.head, that is I'm trying to get the first five rows of my data set. So you

can see this is the date, price, bedrooms, bathrooms. So I can easily access all the columns and I can go

ahead and perform the basic operations like I can check the shape of my data set using data.shape. So I have 4,600

rows and 18 columns. All right. So accessing such information in an excel sheet would have been a much more

difficult task. So this is why we used pandas which helps us gather information like this in a much more efficient and

easy manner. All right. So I can also go ahead and check the information of my data using

data.info. So you can see I have 4,600 non-null values. That means I have zero null values in my entire data set. I

have zero null values. Okay. So you can also check the data type of each and every column. Now you can see over here

this is an object data type but date is supposed to be in a time format. Right? You're supposed to be able to access the

day, time uh and year of a particular date in an efficient manner. But this is stored in an object manner. All right.
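
Converting such a column is typically done with pd.to_datetime. A minimal sketch on made-up dates (the column names are assumptions):

```python
import pandas as pd

# Hypothetical rows; the dates start out as plain strings.
data = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-03"],
    "price": [313000.0, 2384000.0],
})

# Parse the strings into a datetime column.
data["date"] = pd.to_datetime(data["date"])

# The .dt accessor then exposes the parts of each date.
years = data["date"].dt.year.tolist()
days = data["date"].dt.day.tolist()

print(data.dtypes)
print(years, days)
```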

So I'm just going to quickly show you guys how you can convert this date from an object to a datetime. Okay. So this

is the uh syntax that you have to follow. So I'm going to pass my date and then I'm going to uh I'm going to use pd

.to_datetime to convert my date from object to datetime. Okay. So it is

data. Okay. So as you can see it has been executed successfully. Now if I go ahead and check my information

uh of of the data you can see it is converted to datetime format. All right. So this is how you can convert your uh

datetime features into the valid format using pandas. All right. So this is again only a oneline code but you should

be aware that it can be done and this is how you bring your data to the correct format. All right. So this was it for

this video. Um I think we have covered almost everything there is to know about pandas and I'm sure you guys got a basic

understanding of how you guys can implement pandas on even a large data set. And I am going to give you guys

this uh notebook link in the description. So you can go ahead and download it and practice on this data

set or you can download data sets from Kaggle. Let me show you guys the website. So you can go ahead and type

Kaggle and click on the first link that you get. Register on this website and you get hundreds and thousands of data

sets for absolutely free. You can download them and you can also go ahead and check out the discussions uh if you

face any issues or doubt. Right? So get started with your pandas journey right away. If you found this video helpful,

make sure you hit the like button and thank you and see you in the next video. [Music]

