Comprehensive Guide to Python Pandas: Data Inspection, Cleaning, and Transformation
Introduction to Pandas
Pandas is a powerful Python library used for data manipulation and analysis. It primarily works with two data structures:
- DataFrame: A two-dimensional labeled data structure similar to a spreadsheet or SQL table.
- Series: A one-dimensional labeled array, similar to a list or a single column in a table.
Installing and Importing Pandas
- Pandas is not installed by default in Python but is pre-installed in Google Colab.
- To install on your machine:
pip install pandas
- Import using:
import pandas as pd
Data Inspection
- Reading Data: Use pd.read_csv() for CSV files or pd.read_excel() for Excel files.
- Preview Data: df.head() shows the first 5 rows; df.head(10) shows the first 10 rows.
- Tail Data: df.tail() shows the last 5 rows.
- Data Info: df.info() displays data types, non-null counts, and memory usage.
- Summary Statistics: df.describe() provides statistics for numeric columns; use df.describe(include='all') to include non-numeric data.
- Data Types: df.dtypes lists the data type of each column.
- Index and Columns: df.index shows the index range; df.columns lists column names.
- Shape: df.shape returns the number of rows and columns.
- Null Values: df.isnull().sum() counts null values per column.
- Unique Values: df['column'].unique() lists unique values; df['column'].nunique() counts them.
- Value Counts: df['column'].value_counts() counts occurrences of each unique value.
- Random Sample: df.sample(n=8) returns 8 random rows.
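The inspection calls above can be sketched on a small, made-up DataFrame. The movies dataset from the original walkthrough is not included here, so a stand-in with hypothetical columns is built inline:

```python
import pandas as pd

# A tiny stand-in for the movies dataset used in the guide.
df = pd.DataFrame({
    "title": ["Inception", "Avatar", "Up", "Gravity"],
    "year": [2010, 2009, 2009, 2013],
    "rating": [8.8, 7.9, 8.3, 7.7],
})

print(df.head(2))                 # first 2 rows
print(df.shape)                   # (4, 3): 4 rows, 3 columns
print(df.dtypes)                  # object, int64, float64
print(df.isnull().sum())          # null count per column (all 0 here)
print(df["year"].unique())        # the distinct years
print(df["year"].nunique())       # 3 distinct years
print(df["year"].value_counts())  # 2009 appears twice
```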
Data Selection and Indexing
- Selecting Columns: Use df['column'] for a single column or df[['col1', 'col2']] for multiple columns.
- Implicit Indexing (iloc): Zero-based positional indexing; supports negative indexing.
- Explicit Indexing (loc): Uses the DataFrame's index labels, which can be customized.
- Changing the Explicit Index: Reset or set a custom index with df.index = range(1, len(df)+1) or df.set_index('column').
- Row Selection: Use df.iloc[1:10] for rows by position (end excluded) or df.loc[1:10] for rows by index label (end included).
- Conditional Selection: Filter rows with boolean masks, e.g., df[df['vote_average'] > 6].
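A minimal sketch of the iloc/loc distinction and boolean masking, using a hypothetical four-row DataFrame (column names are illustrative, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [5.5, 6.4, 7.1, 4.9],
})

# iloc: zero-based positions; negative indexing works.
second = df.iloc[1]            # positional row 1 -> title "B"
last = df.iloc[-1]             # last row -> title "D"

# loc: index labels; re-label the index to start at 1.
df.index = range(1, len(df) + 1)
first = df.loc[1]              # label 1 -> title "A"

# Slicing: iloc excludes the end, loc includes it.
by_pos = df.iloc[1:3]          # positions 1 and 2 -> 2 rows
by_label = df.loc[1:3]         # labels 1, 2 AND 3 -> 3 rows

# Boolean mask: keep rows rated above 6.
good = df[df["vote_average"] > 6]
```

Note the asymmetry: once the index has been relabelled, df.loc[-1] would raise a KeyError (no such label), while df.iloc[-1] still returns the last row by position.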
Data Cleaning
- Identifying Nulls: df.isnull() returns a boolean DataFrame.
- Filling Nulls: Use df['column'].fillna(value) with the mean, median, or mode.
- Dropping Nulls: df.dropna() removes rows with nulls; df.dropna(axis=1) removes columns.
- Dropping Columns/Rows: Use df.drop('column', axis=1) or df.drop(index).
- Handling Duplicates: Use df.duplicated() to find duplicates and df.drop_duplicates(keep='first'/'last'/False) to remove them.
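The cleaning steps above can be combined on a small hypothetical DataFrame with missing values and a duplicated row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing year, two missing ratings,
# and rows "B" duplicated.
df = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "year": [2001.0, 2005.0, 2005.0, np.nan],
    "rating": [7.0, np.nan, np.nan, 9.0],
})

print(df.isnull().sum())  # year: 1, rating: 2

# Fill numeric gaps with the column median / mean.
df["year"] = df["year"].fillna(df["year"].median())      # median is 2005
df["rating"] = df["rating"].fillna(df["rating"].mean())  # mean of 7 and 9 is 8

# Drop exact duplicates, keeping the first occurrence.
df = df.drop_duplicates(keep="first")
```

Whether to fill or drop depends on the data: dropping is reasonable when only a handful of rows out of thousands are affected; otherwise filling with a representative statistic (or a placeholder string for text columns) preserves more of the data.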
Data Transformation
- Changing Data Types: df['column'] = df['column'].astype(float).
- Renaming Columns: df.rename(columns={'old_name': 'new_name'}, inplace=True).
- Adding Columns: Create new columns from calculations, e.g., df['profit'] = df['revenue'] - df['budget'].
- Modifying Columns: Round values with df['column'] = df['column'].round(1).
- Adding Rows: Convert a dictionary to a DataFrame and concatenate with pd.concat([df, new_row_df], ignore_index=True).
- Modifying Rows: Access rows by index and assign new values.
- Setting and Resetting the Index: Use df.set_index('column') and df.reset_index().
- Grouping and Aggregation: Use df.groupby('column').agg({'col1': 'mean', 'col2': 'sum'}).
- Applying Functions: Define a function and apply it row-wise with df.apply(func, axis=1).
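The transformation operations can be sketched end to end on invented revenue/budget figures (the column names mirror the guide's profit example; the numbers are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Action", "Action", "Drama"],
    "revenue": [300.0, 500.0, 120.0],
    "budget": [100.0, 200.0, 40.0],
})

# Add a derived column.
df["profit"] = df["revenue"] - df["budget"]

# Rename a column (returns a new DataFrame unless inplace=True).
df = df.rename(columns={"genre": "category"})

# Group and aggregate: mean revenue and total profit per category.
summary = df.groupby("category").agg({"revenue": "mean", "profit": "sum"})

# Row-wise function with apply.
df["margin"] = df.apply(lambda row: row["profit"] / row["revenue"], axis=1)
```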
Data Reshaping
- Merging DataFrames: Use pd.merge(left, right, on='key', how='inner'/'left'/'right'/'outer') for SQL-like joins.
- Wide vs. Long Format: Wide format has a separate column for each variable; long format compacts variables into fewer columns.
- Pivot Table: Use pd.pivot_table(df, index='product', columns='month', values='sales', aggfunc='sum') to convert long to wide format.
- Melt: Use pd.melt(df, id_vars=['product'], var_name='month', value_name='sales') to convert wide to long format.
- Stack and Unstack: stack() compresses columns into a hierarchical index; unstack() reverses the operation.
- Multi-level Indexing: Set multiple columns as the index to create hierarchical indexing for complex analysis.
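A round trip between long and wide format, using the product/month/sales names from the pivot and melt calls above with made-up sales figures:

```python
import pandas as pd

# Long format: one row per (product, month) pair.
long_df = pd.DataFrame({
    "product": ["pen", "pen", "book", "book"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [10, 20, 5, 15],
})

# Long -> wide: one column per month.
wide = pd.pivot_table(long_df, index="product", columns="month",
                      values="sales", aggfunc="sum")

# Wide -> long again with melt.
back = pd.melt(wide.reset_index(), id_vars=["product"],
               var_name="month", value_name="sales")
```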
Conclusion
This tutorial provides a comprehensive overview of Pandas for beginners, covering essential operations from data inspection to advanced reshaping techniques. Mastering these concepts enables efficient data analysis and manipulation in Python.
For further reading, check out Python Pandas Basics: A Comprehensive Guide for Data Analysis, A Comprehensive Guide to Pandas DataFrames in Python, and Mastering Pandas DataFrames: A Comprehensive Guide. These resources will deepen your understanding of data manipulation with Pandas.
hello folks welcome to all next in this video we would be exploring Panda's library in Python so if you are someone
who is looking to learn pandas from scratch or struggling with the basic concepts in pandas library then this
video is for you in this video we will cover data inspection using pandas then data selection and indexing using pandas
how is data cleaned in pandas we will also have a look at how data is transformed and it is reshaped using
pandas Library so let's jump in we would explore the functions and the attributes of pandas that are required for data
inspection along with the introduction to pandas so let's start what is pandas pandas is nothing but a powerful python
library that is used for data manipulation and Analysis it contains um two main terminologies that is data
frame and a series data frame could be thought of as a spreadsheet or as a or similar to SQL table which contains um
two dimensional label data structure okay um along with some columns it may be of different types then there is
something called as a series which can be thought of as similar to a list or column in a particular table so let's
jump in into the collab notebook so pandas doesn't come up installed by default along with your basic python
okay however in collab it's already installed but if you are uh doing something on your machine
or U your machine won't have pandas install so you will have to install pandas and then import it after to
import we have this command import pandas as PD PD is nothing but an area's name so to demonstrate attributes and
functions of pandas which are required for data inspection I have taken this data set called as movies all right so
how do we read the data we use pd. read CSV if it is it's Excel file it will be different function it will be read excl
or something so it depends what kind of data are you looking for then I have a function called as DF do head let's see
what it returns me okay if you could see here it returns first five rows of the table so that you can have a look at it
um like in this particular case we have movies so there is a movie ID that is mentioned and there's a budget
popularity Revenue the title of a movie vot average it just may correspond to the rating vot count then director ID
year month and day so you can have a glimpse um of what what data is there in the data set with this particular uh um
DF do head using DF do head function okay so if I want say 10 rows then I will just mention DF do
head whatever number of rows I want I can mention it here so you would see this time around you will get 10 rows
okay first 10 rows by default it returns you first five rows however if you want to get
more or any other more more number of rows or maybe less number of rows you can mention that number inside the
parenthesis okay then similar you can also get last VI rows that is if I do uh ef ef 10 this will give you last P by
default okay you would see here these are last P then
um there is something called as DF do info right you would see what this will give
you is the information about the um data that is there with you that is it will give you what kind of data type it it is
that is it's a data frame because whatever table like structure in Paras we call it as data frame and for a
particular column we call it as Series right so it gives you the information of each and every column these are the
column names then um it gives the non-null count so um there are total of 1465 values and there is no null value
inside it and there are three types of um data types in pandas that is in 64 or integer float and object okay and no
other so if if it is the data set type should be either integer or float if it is not that then it will be considered
as object all right and moving ahead um let's see there is one more function DF do
describe what DF describe gives us is the statistics for numerical column so you would see here for all the numerical
columns it has given me the statistics like count main standard deviation minimum maximum then the intercal ranges
and so on so um if I want you would notice here though that it's only doing all this stuff for numeric columns
however if I want to include the object data type as well I will have to do a include all so DF do
describe okay inside the parenthesis I will mention include is equal to [Music]
all I run this you would see there are few more rows added over here like unique top because
those are of object data type right you cannot calculate um you cannot calculate mean for
that see here look at this for title you cannot calculate mean right for month you
cannot calculate mean so for all those non-numeric columns you can cannot calculate mean that's the reason you get
the output of you you get this kind of out that is look at this mean standard deviation
minimum all these ranges it's not possible to calculate for what for object data type so I hope you
understand what I mean by uh object data type it's a data type that doesn't have integer and Float okay so moving on to
our next um next attribute that we will be checking is we remember describe when
I'm using this parenthesis these are all the functions all right and I would also be showing you some attributes now like
if I'm if I type in DF do D types here it's an attribute it will show all
the data types uh like data types of all the columns present in the data okay so columns or in terms of fers all the data
of the series that are there in the data frame okay then DF do index let's see what DF do index gives
you so it will provide you with the range of where the index starts and where does it stop okay
after that if I do DF do columns it should return all the column names then there is something called as
DF do shape it will give me the shape of the data set that is 1465 rows and 11
columns then moving ahead this is what are we doing by checking all this this is something called as data inspection
so we are inspecting the data we are checking all the um uh I mean all the parameters of data with respect to
inspection so next one is I would like to check if there is any null value and also doing a sum of that null value what
will this return return me oh is function
yeah you would see there is no n value otherwise it would have written me column wise some okay however there is
no n value here so I don't get any output for this so that is fine now moving
ahead let's say I want unique uh unique values in a particular column um so I would take this in our data I would just
take DF again so I want unique values I will check it
on ear column okay because I want unique values now in the ear column how to get the unique values there is a function
called as unique if then that column name that column name ear
ear function function is unique look at this these are all the ears that are present in the data okay these are
all the I mean they are repeated but I have got the unique values so in our data these are all the ears that are
present then mov forward if I want the count of unique values what is the count of these values then there is one simple
change here instead of unique you just mentioned n unique it will return return you the count there are total of 41
unique values in the ear column then if you want to do a value count then you can do that as
well DF do ear the same thing and I will do a value count value underscore counts
okay oh my bad it's a function at this so it will straight away give you which year appear how many times all
right so it Returns the count of unique values in a particular
[Music] column at this so 2006 appeared maximum types in our data then 2005 then 2002
accordingly let's move ahead with the next one if you want some random rows from the data frame then you can mention
DF do sample random rows how many rows do you want say you want some eight random
rules this will give you any eight random rules you would see you would identify that with the index of the data
indexes are very much random so they have picked any random row so that's about data inspection part for pandas
now let's move ahead with next topic we will discuss how can we select columns from a particular data set or maybe a s
it may be a single column it may be multiple columns then we will have a look at what is implicit index what is
explicit index and using that how we can select specific rows from a particular data set and then we will also see how
we can um select data based on conditions so let's jump in you would see here the first
statement is input pandas and input number so this is quite familiar to you now it's been like we are four to five
videos um old in exploring python libraries so this is something that is familiar to you then I
have imported the data set movies with the read uncore CSV function uh or you may use some other function if
it is a XLS file it will be read uncore Excel function then I would just uh like to have a look at the existing data
that's the reason I using DF doad oh my bad here I have to run this sales.ad now you would see this data
consist of movie ID budget and popularity Revenue title vot average vot count and all these things uh like
director ID year month and day okay so we had a glimpse of what is there in data suppose I want to have a look at
only one column or I want to select only one column from this particular data what would I do I would simply enter the
column name inside um inside the existing data and I would
select it with this syntax I would simply write DF then Square Braes inside that I will write the column name and if
I run this you would see only one column will be displayed okay this is this is the output that you get if I want to
select multiple columns same thing I will do but this time I will have to specify whichever columns I
need I would have to specify their name all right so I needed title and vot average so I mention their name in a
list the syntax is the same instead of single column name this time what we have done we have entered a list of
number of columns like we want to select two you specify their names and the data will be displayed all right now there is
something called as implicit and explicit in um uh in Bas all right so let's see
what is implicit and what is explicit now implicit index is represented by ioc and explicit index is represented by Lo
explicit index is an is a index which you can see here when uh you get the output um if you have noticed this
number here this is not a part of table all right this is something called as explicit index because as I mentioned
you can see this right it starts from zero and it goes all the way till 1464 okay you would see here as well
so whenever you try to see a particular data set or data in a particular data set then you would see there are there
is one additional um column you may say it doesn't have name here and that is nothing but an index okay this is not a
part of our existing data okay this is not a serial number or I have not forgotten any column name over here it
comes it gets generated and we can can see that and that is nothing but called as an index but it is a explicit index
because I I just mentioned that explicit indexes are those indexes which you can see over here and what is implicit
indexes which you cannot see but they are there they are getting generated in the backround you will see what it is um
when we have a look at the example all right so if I do DF of one let's see what is
the output you get an output of the uh you get you get output as
a second row because you mention DF of one one is at where here the one means the index right so the data
corresponding to the index one is displayed now this is iloc and when I do DF of Lo
DF do looc one which is explicit this again shows the same data all right then you would wonder hey um these both are
showing us same data then then what's the difference yes of course there is a difference that's the reason we are
studying this all right um suppose I want index to start from range one and not from zero by default it started from
zero explicit index starts with zero and um even implicit index starts with zero that's why we got the similar outputs
however in we can change explicit index range and we can start it from wherever we want it to now suppose I want the
index to be started from one so I have used this range function and um using it inside list function I
have modified the index I want this start uh to start it from one so this is what I did U I gave one and range till
where DF DF do shape of 0 + 1 you would say what is this DF do shape 0 + 1 so it is nothing let's explore DF do shape
first so DF do shape gives you what shape is 1465 and 11 and what I have done shape of zero shape of zero in the
sense this number 1465 so if I do oh hold on this yeah DF of shape of zero if I do
over here okay you would get 1465 all right and plus one plus one is
1466 okay so my range should go to from 1 to 1466 only then I I would able I would be able to get 1465 rows you
remember the range function from the for Loop that we discussed we had discussed range function when we studed for Loop
all right so that's how it works the range will be from 1 to 1466 wherein 1466 will get
ignored and you will get all the 1465 rows all right so to proceed ahead now if you look at this the index started
from one because we wanted the index to start from one and not from zero so that's that's an advantage of expli
index it it will it will take the range from the one you want it to be okay so
now if I do DF Lo of one look at this it will give me first row not the second one as you as we used to get earlier
because the index has changed and at the first row this is these are the details and those details will be shown to me
however if I'm still using iloc all right let's see what I'm getting this is
implicit index it still returns me to because we cannot change implicit is this index we cannot change all
right it it behaves the um way you can related with the python indexing similar way whatever we have
studied earlier while we did string slicing why weed list liing so at that time we had studied positive indexing
and negative indexing it is that index all right so which whatever is the default behavior um by default which
index operate that is that that is the same index referred by iloc over here all right you would see it did not
change now if I do it dfoc of minus one okay now you would quickly very quickly answer V it will give us the last row of
data because you are already aware what is negative indexing okay after that DF do L of minus
one let's see what this gives me as an output and yeah this gives me an error key error minus one why do you think
would this have happened if you if you look at this carefully DF do Lo of minus one right
now in the Lo or in explicit indexes what I have is a range from 1 to 1465 all right and these are explicit indexes
so whatever range I I had already mentioned whatever range I want I would be specifying there right so I have a
range from 1 to 1465 and I'm referencing I'm trying to give minus one in Lo so there is no minus one
in um in this explicit index range it is from 1 to 1465 had I mentioned minus one then this would not give me an error
look at this it is giving me a key error it's saying key error minus one um we we don't have this key u we don't
know what this key you are talking about so negative indexing and positive indexing will apply only in terms of um
ioc I mean both will apply only in terms of ioc explicit index that is looc it will not apply because that is something
that we have given the range whatever um item or whatever row you are trying to reference should be from that same range
that's the reason we get an error here when we give something that is not there in one two 1465 all right now if you if
you want to see a particular index then you can see that index as well all right now let's have a look at
row selection using implicit and explicit indexing here you would see DF do Lo one colon all right
so ioc1 means which row um of using explicit index it will be second row second row why second row because uh
I'm I'm talking in terms of explicit index why I'm trying to explain this is because see the data this indexes are
explicit indexes what I'm saying iloc of one is equivalent to explicit index two whatever is there in explicit index 2
iloc of zero will be equivalent to explicit index one because we have changed the order of explicit index and
it now starts from one right and that's why it started here from Two and it went all the way to 10 okay now if I would
have written zero over here look at this now it will return return me the first row look at this it return me from the
first to the 10th all right so this is what we should be able to um actually notice and if you see 0 to 10 is um 11
right if if we count but it follows the range concept here as well and it returns you how many rows it returns you
rows from 1 to 10 that is number of rows written here at 10 not 11 because the 10th one is excluded this is very much
the same um when we did slicing um uh in strings or list all right so I I always keep on referring to the to our older
Concepts because you are you will be able to correlate very easily all right then df.loc of 1 to 10 now if the same
thing I want to do it for LC from 1 to 10 then what's the result um look at this it starts with one now we are we
are telling the interpretor that okay use Lo right so lo I said will follow whatever range we have specified we have
specified 1 to 10 in this case it will show all 10 rows now it doesn't uh ignore this
10 and it doesn't show only nine rows in this case in case of Lo it shows all the rows whatever you mention over here so
these are few key things that you should remember um while you explore explicit index and implicit index the way ioc and
looc works is totally different in looc you can change um the range of the index that is it did not necessarily start
from zero you can mention whatever range you want and the way it works is this whatever you specify in the range it
will take all the rows it will not ignore the um upper limit upper limit range and in iloc it
will always give one one I mean always retri one row less than what you have specified because it is similar to the
um to our range function all right now if the same thing if you want some specific rows you have to mention those
row numbers or not the row numbers you have to mention the row indexes and this is what we have done here DF is equal to
ioc 11 12 and 18 you would see one that is um the output that will be replied by this um first element in the list is
this row that is the second explicit index row then the 13 explicit index row because you have each 12 it will always
be one more in terms of explicit index guys again I'm saying don't get confused explicit index you can see that's the
reason you see this 213 and 19 implicit index you cannot see here on the screen but we can um definitely um correlate
how these things are working it doesn't show you the exact first row details because you haven't put zero over here
implicit index will always start from zero it will have negative indexing as well where in explicit index doesn't
follow any thing kind of negative indexing as such by default if and only if you mention some negative range in um
explicit index only then it will follow all right then um DF do Lo um 1A 12 18 then exact same rules will be red see 1
12 and 18 all right now let's move on to the next section that is condition selection or and filtering all right so
based on condition also we can select particular number of rows or specific set of data here I wanted to have a look
at movies which have vote average or you can say which have rating more than six all right so what I have used here I
have used the comparison operator great than and try to do the same however if you notice we get output as a Boolean
array the output is true and false it doesn't give me the actual data now if I want actual data what I will do I will
apply this Boolean array on uh as a mask on my original data so my original data is DF and I simply put this Boolean
array on the original data and this will actually retry the values so wherever it is True Values
those values or those rows will be red and wherever it is false it will not show up here that's why you see there
are 977 such rows which have wrting greater than six so if someone ask me that which are the movies that are
greater than that have greater rating than six then I would simply use this but no kindly note here
that the output that you get by comparing the comparing it with whatever number you decide whatever rating you
decide was Boolean it gave you true and false output now we are not interested to see what is
what all data is true and false we just want to filter out or we want to select whichever movies have rating greater
than six and that's the reason we applied that mask on our data set this is also called as a fancy indexing
sometimes also called as U masking so these are various different terms that are used for the very same purpose now
if you want to do a mean of particular thing I mean let's take the same
rating column or you can say vot average column um and I'm saying it as rating because vot average here shows the
rating of a particular movie so if I want to take an average what I will do I will first select that vot average
column I will take it out and store it in a different variable and then I would take a take a mean of that so here I
just I've done the same look at this I get the vot average I just selected the column you know how to select columns if
I want to take a meme I can always take a mean of that now let's move ahead with next topic we would see how to um
identify the null values in a particular data set um how to fill them up and how to drop a particular column with null
values or a row with null values and also how to explicitly drop a particular row or a column from a data set and we
will also identify the duplicate rows that are there in the data sets and drop the duplicate um
rows using different variations now let's jump in you would see first line of code is I have imported pandas and
with and specified the alas name this is something that you're already aware from the previous videos then moving ahead I
have data which is there in this file XLS X and I have loaded the data in the data variable now U let's have a look at
what is there in data so we have movie ID title Jer then release year rating director and
run typ all right now I need to identify which all n values are there so I have a function
called as data do is null so if I use this function it would return true wherever there is a
null value it will actually return the table containing true and false values and wherever there is a n or empty value
you would get true at this point of time our data is short so we can easily have a look at
this you have got true in this release year and the column rating all right however if data is larger you may not be
able to see it see this very easily that's the reason um we would we would do a
sum uh of True Values in a particular column for that we have just a some function to be attached
to data do e null all right so this is the way we write I have a variable n values is equal to data do e null do sum
that would give me column wise count of how many null values are there in particular column from this output I can
clearly understand that release year has got one n value and rating has got one n value and I need to do some treatment
with that um please understand we cannot just leave those null values or we cannot just replace it
by anything we want we need to think about how the data is um what number of null values are there
um for our data is it some substantial number of n values or it's a ignorable one and accordingly we will replace
those null values now here you would see I have replaced it with median and mean in the release year I have replaced um
the null value by median and in the column rating I have replaced the null value by mean now first let's see what
does um the output of data release Here median gives it to me okay so I'll just um take this part of the code
see what is the output that I get on doing a medion of ReliOn so I get 2,11 so have a look at data reading
mean a do 1 0 0 all right let's have a look at data one more time so that you can easily
compare after we replace the N values now here you would see rating has got one null value and uh release year has
got one n value as we have already seen all right now look at this the release year not median turns out to be 2011 and
rating turns out to be 8.1 so these values should be replaced according uh accordingly like for release year
instead of this nine we should get 2011 and rating in in the rating column instead of this n we should get the
value 8.10 that is the mean and if you want to replace this we have a function called as fillna okay that's that's what
we have used and we are trying to replace it so we just need to mention a column name and um by what you want to
replace do you want to replace with medium do you want to replace with mean or even then even it is more to it we
can also do a replacement with something called as mode that is um the Val that is the particular n value or empty value
will be replaced by frequently used value in that particular column okay so we are not using that one here but we
are using median and mean all right so let's see what what's the output that we get
here you would notice that wherever there was null value it got filled with 2011 as expected
and um in in the real here and in the rating um we got it we got the N value replaced by 8.1 so it depends on which
one which option is suitable for you is it median is it mean according to the data that you are
working okay you cannot simply ignore and go on deleting things you will have to replace at times using mean median
and mode all right and if there is a a text column you would say hey it that doesn't work uh the mean and median
doesn't work so you can simply write some text like not available not applicable na whatever you wish to I
understand because those will be um like strings you cannot take any mean or median of that all right um moving
ahead if you want to drop particular rows which have um n values then you can simply
drop it so in this case I have I had null values on Fifth fifth row and sixth row let's see if they are getting
dropped yep see fifth and sixth row are dropped dropping is a good option when uh the number of rows with null values
are very less compared to the um data the number of rows in the entire data say you have 10,000 or one lakh rows and
only um three four rows have some null values you might uh prefer dropping those then we can also drop columns
which have null values so we have the same function here we should specify the axis so axis is equal to one states that
you are looking for dropping columns when you don't specify anything by default it takes us rows all
right so here if you would see those columns have been dropped which had n values
okay so there may be situations where you don't need those columns then you can use this option else you would
prefer doing a fillna filling those values right then um irrespective of null value will you be able to drop a
Irrespective of null values, can you drop a particular column? Yes, that is very much possible. If you want to drop a single column, just mention the column name and axis=1, and you can drop it. I dropped Runtime from the data — look at this, I don't have a Runtime column anymore. Why are we looking at this? Because when you receive data, you may decide a particular column is not required; in that case you simply drop that column and keep only the columns that are needed, and for that you use this. You may also want to drop some rows: pass their indices and axis=0, and you will no longer see rows six and seven, the ones mentioned in the drop statement, in your data. So if you want to delete any specific rows, that is very much possible.
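The drop() calls just described can be sketched like this (column names and indices are illustrative; Runtime matches the column dropped in the video):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Runtime": [120, 95, 110, 130, 100],
    "Rating": [8.0, 7.2, 6.9, 8.5, 7.8],
})

no_runtime = df.drop("Runtime", axis=1)  # drop a column by name
fewer_rows = df.drop([3, 4], axis=0)     # drop rows by their index labels

print(list(no_runtime.columns))  # ['Title', 'Rating']
print(fewer_rows.shape)          # (3, 3)
```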
Now, on to dropping duplicate rows. To drop duplicate rows, we first need to identify whether there are any duplicate rows in the data we are looking at. Let me check — look at this, there is a duplicate row at index 8. Let's look at the data again and see which row it duplicates: rows 7 and 8, the last two rows, contain the same data. So we will try to remove the duplicates. If we want to remove both rows, we have to pass the parameter keep=False; in that case both rows 7 and 8 (those being their index numbers) are deleted, and you see only six rows. If you want to keep the last one, the eighth, just pass keep='last'; then the row at index 8 is kept and the other is deleted — and had there been three or four duplicate rows, the others would have been deleted and only the last one retained. If you pass keep='first' in the keep parameter of drop_duplicates(), the first row is kept and the others are deleted. This is how you perform data cleaning in pandas.
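The duplicate-handling options above can be sketched as follows (illustrative data; the duplicate pair mirrors the rows-7-and-8 situation from the video):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C", "C"],
    "Year": [2010, 2012, 2015, 2015],
})

print(df.duplicated().sum())  # 1 -- the last row repeats the one before it

both_removed = df.drop_duplicates(keep=False)  # delete every copy of a duplicated row
keep_last = df.drop_duplicates(keep="last")    # keep only the last occurrence
keep_first = df.drop_duplicates()              # keep='first' is the default

print(len(both_removed), len(keep_last), len(keep_first))  # 2 3 3
```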
Now let's move on to the next topic: transforming data. We will look at how to change the data type of a particular column, how to rename a column, how to add new columns and modify a column, how to add and modify rows, how to set and reset the index, how to group and aggregate the data in a dataset, and how to apply functions to a DataFrame. Let's jump in. Our first statement, as usual, is import pandas as pd,
pd being the alias name. Then I have a dataset called transform_movies (well, I forgot the 's' in the name, but it doesn't matter). When I load this dataset and look at what is in it, you see some 10 rows; I kept it as simple as possible and am not taking a lot of rows here. The data consists of movie ID, title, release year, genre, IMDb rating, budget, and revenue. Let's check the data types of all these columns. For that we have df.dtypes, which we are already aware of, and we see the three data types that appear by default: integer, float, and, if it is neither of those, object. There are also specialized data types we can convert to, as we already know from our other videos. Say I want to change the data type of budget and revenue from integer to float; for that we have the astype() function, with which I can change them to float. If I run this and look at df.dtypes again, you see the data types have changed from integer to float.
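The astype() conversion can be sketched like this (illustrative column names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "Budget": [160000000, 58000000],
    "Revenue": [825000000, 353000000],
})
print(df.dtypes["Budget"])  # int64

# Convert both columns from integer to float in one call.
df = df.astype({"Budget": "float", "Revenue": "float"})
print(df.dtypes["Budget"])  # float64
```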
Now let's see how to rename a column in a dataset. Before running this, I'll explain the syntax. We have the rename() function, inside which there is a parameter called columns, where you list each column name along with the name you want to replace it with, in the form of a dictionary. (If you are not well versed in how dictionaries work, we have a separate video in our Python for Beginners series that you may refer to.) I have also used inplace=True, which indicates that these changes are saved permanently. The two columns I want to rename are 'Release Year', which I replace with 'Release_Year', and 'IMDb Rating', which I replace with 'IMDb_Rating'. The column names earlier contained spaces; with this command the spaces are replaced by underscores. Let's see if we can do it successfully — yes, look at the output: we have successfully renamed the columns.
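A minimal sketch of the rename() call (the old/new names follow the video's example):

```python
import pandas as pd

df = pd.DataFrame({"Release Year": [2010], "IMDb Rating": [8.8]})

# Map old names to new names in a dictionary; inplace=True saves the change.
df.rename(columns={"Release Year": "Release_Year",
                   "IMDb Rating": "IMDb_Rating"}, inplace=True)

print(list(df.columns))  # ['Release_Year', 'IMDb_Rating']
```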
Now moving on to the next section: how to add a new column. There may be a need to do some calculation based on existing columns and add the result as a new column to your dataset. We have revenue and we have budget, and we want to calculate profit. This calculation happens on each row, which is why we want a new column that shows the profit for each row; the formula is simply revenue minus budget. By assigning to df['Profit'] in square brackets, we create the new column. Let's see if it gets created — yes, we have successfully created the new column called
Profit. Now suppose you want to modify an existing column — say you want to explicitly tell pandas that the rating should be rounded to one decimal place. I can simply write df['IMDb_Rating'].round(1). round(1) means there will be only one digit after the decimal point; if you want two digits, or some other precision, just change the argument to the round() function. I am explicitly saying that this column will have only one digit after the decimal point — something called precision. And indeed, you see the IMDb rating displayed as 9.0, 7.8, and 8.7, all with one digit after the decimal.
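Both operations — deriving a Profit column and rounding the rating in place — can be sketched together (illustrative values, chosen so the rounded ratings match the 9.0, 7.8, 8.7 mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Budget": [100.0, 60.0, 11.0],
    "Revenue": [825.0, 353.0, 100.0],
    "IMDb_Rating": [8.97, 7.81, 8.74],
})

# Add a new column computed row-by-row from existing columns.
df["Profit"] = df["Revenue"] - df["Budget"]

# Modify an existing column: keep one digit after the decimal point.
df["IMDb_Rating"] = df["IMDb_Rating"].round(1)

print(df["Profit"].tolist())  # [725.0, 293.0, 89.0]
```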
Now let's see how to add a new row. You define the new row in the form of a dictionary, so we need to be clear about how dictionaries are created. Then you have to convert that new row into a DataFrame first: if you try to append it directly with the concat() method, it is not going to work. That's why I used the pd.DataFrame() function, and then concatenated the new row, passing ignore_index=True. The ignore_index argument matters because an appended row otherwise keeps its own index, starting again from zero. With it, you see the new row is added easily at index 10. If I don't write ignore_index, the appended row's index simply starts from zero — look at this. We don't want it to start from zero; we want it to continue on from the existing index, and that's why we pass ignore_index=True.
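The dict-to-DataFrame-to-concat() sequence can be sketched like this (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B"], "Year": [2010, 2012]})

new_row = {"Title": "C", "Year": 2015}

# The dict must become a one-row DataFrame before it can be concatenated;
# ignore_index=True renumbers the result continuously instead of keeping 0.
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df.index.tolist())  # [0, 1, 2]
```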
If you want to modify a particular row, you can do that too. Look at this: I have mentioned the location using the explicit index, and at that explicit index I modify whichever value is there. Let me show you the data first — this is what we have in df — and I want to change the year: instead of 2014 I want it to say 2015. So I simply come here, mention the index (5) and the column, and change the value. Look at this: the year has changed. So even if you want to change just a single value in a particular row, you can definitely do it this way, accessing it via the explicit index (or the implicit index).
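That single-value edit via .loc can be sketched as (illustrative data; the index label and column name are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B"], "Year": [2014, 2016]})

# Change a single value by (explicit index label, column name).
df.loc[0, "Year"] = 2015

print(df.loc[0, "Year"])  # 2015
```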
Then, if at any point you want to set a column as the index — here I am setting movie ID as the index — you can do that very easily with the set_index() function: set_index('Movie_ID'). You see the default explicit index has vanished and movie ID has become the index. And if you want to come back to what we had earlier, just do df.reset_index(), which brings the default explicit index back into play. So that's how you set and reset the index for a dataset.
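A minimal sketch of set_index() and reset_index() (the Movie_ID column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Movie_ID": [101, 102], "Title": ["A", "B"]})

df = df.set_index("Movie_ID")  # the column becomes the index
print(df.index.tolist())       # [101, 102]

df = df.reset_index()          # back to the default 0..n-1 index
print(df.index.tolist())       # [0, 1]
```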
Moving ahead: how to group and aggregate data. If at any point you want to group data — here I group the dataset by genre — you can perform aggregations like sum and mean (or whatever aggregation you want; in this case I perform only sum and mean) using the df.groupby() function. Along with groupby you can use the agg() function, where you mention on what basis you want to aggregate each column: is it a mean, is it a sum? I used three columns for the aggregation: for IMDb rating I take the mean, that is, the average IMDb rating, and I look at the total budget and total revenue generated per genre. That was my goal, and that's why I wrote it this way. Let me check whether I get the expected outputs — yes, look at this: for the action genre, this was the budget, this was the revenue, and this was the average rating; likewise for drama and for sci-fi. So whatever you group by, the data is aggregated accordingly.
So, is it always necessary to perform an aggregation whenever you use groupby? No, absolutely not. In some cases you may not want to perform any aggregation; you just want to group, and that is also very much possible. Here I group by genre again; this time I don't do any aggregation, but to see a particular group I need to use the get_group() function, and then I can easily see what is in that group. If I run this cell, all the rows that belong to the 'Sci-Fi' group are displayed, because I did get_group('Sci-Fi').
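Both groupby patterns — aggregating per genre and pulling out a single group — can be sketched together (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame({
    "Genre": ["Action", "Action", "Sci-Fi", "Drama"],
    "IMDb_Rating": [8.0, 7.0, 8.8, 7.5],
    "Budget": [100, 150, 160, 40],
    "Revenue": [300, 250, 800, 120],
})

# Aggregate per genre: mean rating, total budget and revenue.
summary = df.groupby("Genre").agg({
    "IMDb_Rating": "mean",
    "Budget": "sum",
    "Revenue": "sum",
})
print(summary.loc["Action", "Budget"])  # 250

# Group without aggregating, then pull out one group's rows.
sci_fi = df.groupby("Genre").get_group("Sci-Fi")
print(len(sci_fi))  # 1
```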
Okay, so that is very much possible. What's next? Let me check — yes: I run df.reset_index() again, and then we apply a function to a DataFrame. How do you apply a particular function? You know how to write functions; we studied this in our Python for Beginners series. Here is a function definition (def is the keyword that defines it) named profit_margin, and it takes one parameter named row. What does it do? It returns the row's profit divided by its revenue, times 100 — which is nothing but the formula to calculate the profit margin. You want to calculate this for each and every row, so rather than writing this formula over and over again, you can very well use the apply() function on the DataFrame and get the expected result. The profit margin is calculated by applying the profit_margin function, which you see in the apply() call: the first argument is the function name and the next is axis=1, meaning the function is applied to every row. I selected the title alongside it, which is why I can easily see the profit margin that was achieved per movie in our dataset. So that's all about data transformation.
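The row-wise apply() described above can be sketched like this (illustrative data; the function mirrors the video's profit_margin):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B"],
    "Revenue": [400.0, 250.0],
    "Profit": [300.0, 50.0],
})

def profit_margin(row):
    # Profit margin as a percentage of revenue, computed per row.
    return row["Profit"] / row["Revenue"] * 100

# axis=1 passes each row to the function.
df["Profit_Margin"] = df.apply(profit_margin, axis=1)

print(df["Profit_Margin"].tolist())  # [75.0, 20.0]
```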
Now let's move ahead to the next topic: how to reshape data. We will look at how to merge two different datasets, then at what wide-format and long-format data are, then at the pivot and melt functions, and finally at how to stack and unstack data. Let's jump in. Our first line of code is import pandas as pd, as we have already done on several occasions. I'd also like to show you that we have some sample datasets — movies, ratings, nba.csv, and sales data — which we will be working on today, and we will see the various outputs we get when we perform the different joins while merging data. First, I read the movies and ratings datasets.
If you have a look at the movies dataset and the ratings dataset, you will see there is one thing in common: the movie ID. On the basis of the movie ID we can definitely perform a join. What a join is — inner join, right join, and so on — comes under SQL concepts that you should be aware of to follow this part. In the first case I join on the basis of an inner join. To join two tables in pandas we have a function called merge(): you pass the names of the two tables, that is, the left table and the right table; you mention which column you are joining on (movie_id in this case); and you specify what kind of join you are performing (inner here). An inner join, as we know, displays the intersection — the common part — of the datasets. So in the output you see everything except movie ID 5, because movie ID 5 is there in the movies dataset, look at this, but it isn't there in the ratings dataset, so it doesn't appear. However, if you perform a left join, all rows of the left table are displayed; here the left table is the movies table, so all of its rows are displayed along with the intersection. That's why in the left-join output you see the common part as well as all rows of the movies dataset, including that movie ID; however, there is no data corresponding to its user ID and rating, which is why NaN is shown there. Moving ahead along similar lines, we can perform a right join as well, where all the rows from the right table are considered and the output is displayed accordingly, which is why you get that kind of output. And if you are aware of the outer join: all the rows that are there in either dataset are displayed, with duplicates eliminated — that is the output for the outer join. In all of these I have used the same merge() function with the same syntax and only modified the way I am joining, so whichever join you want to perform, you mention it in the how parameter. So that's how to join data.
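The four joins described above can be sketched as follows. The data is illustrative, but the situation mirrors the video: movie ID 5 exists only in movies, so it drops out of the inner join and gets NaN in the left join:

```python
import pandas as pd

movies = pd.DataFrame({"movie_id": [1, 2, 5], "title": ["A", "B", "E"]})
ratings = pd.DataFrame({"movie_id": [1, 2, 3], "rating": [8.0, 7.5, 6.9]})

inner = pd.merge(movies, ratings, on="movie_id", how="inner")
left = pd.merge(movies, ratings, on="movie_id", how="left")
right = pd.merge(movies, ratings, on="movie_id", how="right")
outer = pd.merge(movies, ratings, on="movie_id", how="outer")

print(len(inner), len(left), len(right), len(outer))  # 2 3 3 4

# In the left join, movie_id 5 keeps its title but has NaN for rating.
print(left["rating"].isna().sum())  # 1
```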
Now, there is another dataset, called sales data, and it looks like this: it has product, region, month, and sales. Say I want to convert this into wide-format data. What are wide format and long format? Wide-format data has a column for each variable: in this case, say, January sales and February sales would each be individual columns. In long-format data we compact things as much as possible, and what we see here on the screen can be called long-format data. In most real-world scenarios, long format is preferred over wide format; however, for some analyses you may find that wide format is more suitable, and then you may very well go with the wide format.
So here we look at how to pivot a dataset. We have two functions in pandas for doing that: pivot_table() and pivot(). If you want to use aggregation functions along with pivoting, you are better off with pivot_table(); otherwise you can go with the pivot() function as well. What I am doing in this pivot_table() call is taking whatever values were there in the month column — February, January, or whichever months you have; here there are only two — and putting each into a separate column: one column for Jan and one for Feb, while calculating an aggregation, in this case a sum. That's why I get this kind of output. First you define the index you want to keep: I have set the product column as the index. Then columns='Month': whatever is in the month column — here January and February — becomes individual columns, which is why they appear as individual columns in the output. Then values='Sales': whatever was in the sales column goes in as the values. And finally you do an aggregation — sum, average, whatever you want to perform — and you get an output where all the months in our data (only Feb and Jan here) are displayed as individual columns holding the sum of sales for product A and the sum of sales for product B, shown individually. So if you are really looking for this kind of output, you can perform the pivot using the pivot_table() function.
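The pivot_table() call described above can be sketched like this (illustrative sales figures; the Product/Month/Sales column names follow the video):

```python
import pandas as pd

sales = pd.DataFrame({
    "Product": ["A", "A", "A", "B", "B"],
    "Month": ["Jan", "Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 50, 200, 300, 400],
})

# Long -> wide: one column per month, summing sales per product.
wide = sales.pivot_table(index="Product", columns="Month",
                         values="Sales", aggfunc="sum")
print(wide.loc["A", "Jan"])  # 150
print(wide.loc["B", "Feb"])  # 400
```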
To convert it back to what it was originally, we have the melt() function, and I have also used reset_index() there. When you convert it back, you need to mention a few parameters that say what you are up to. Whichever column you want to keep as an identifier goes in id_vars; that is the parameter where you mention the product column name. Then value_vars is where you list what values you want, that is, Jan and Feb — earlier the values were Jan and Feb, and if it were all 12 months, I would have to list all of those, because those are the values that will appear repeatedly in the month column. Then var_name is the name of that variable, that is 'Month', and value_name is 'Sales', because ultimately you want to look at the sales for a given month and product. Here you are not doing any kind of aggregation; you are shrinking the data, so to speak — compacting it back into long format — and that's why you get this kind of output. The aggregation still remains the same, because you are melting the pivoted dataset, not the original one: it holds the same summed sales, but you want to see them in this format, which is why you melted it back to the format you preferred, long format. This data has been melted back; it is not the original data — please note this and don't get confused. It's pretty simple: this was the original data, we did a pivot and converted it into the wide form; however, we felt the preferred format should be the long one, because it reads better for analysis. In some cases it might happen that you feel the wide one is better, and then you may very well go with that; there is no issue as such.
Moving ahead, we have another small dataset, nba.csv, with columns such as name, team, number, position, age, height, weight, college, and salary, and we want to do a stack. Stacking, again, converts the data into a compact form: if you look here, whatever was displayed in a row — say this entire first row — got compacted under one single index into one column. That is the important part: instead of being shown as one row, it got converted into one column under a single index. Similarly, you can unstack things to get back to the original format; again, it matters which format you prefer, wide or long. As a formal definition: stacking compresses columns into a hierarchical index and turns them into rows — that is what stack() does. That's what happened here: all the columns were compressed into a single column and turned into a hierarchical index — for the first record, the name (Avery Bradley), then the team, with the same hierarchy for the second row, the third row, and so on through the entire data. unstack() is the reverse operation: if you want the data back in the format it was in — its original format — you can achieve that as well, simply by using unstack().
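The stack/unstack round trip can be sketched like this (a tiny stand-in for nba.csv; the names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley", "Jae Crowder"],
                   "Team": ["Boston Celtics", "Boston Celtics"],
                   "Age": [25, 25]})

stacked = df.stack()  # columns are compressed into an inner index level
print(stacked.loc[(0, "Name")])  # Avery Bradley

unstacked = stacked.unstack()    # reverse: back to the original wide shape
print(unstacked.shape)  # (2, 3)
```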
Now let's see the concept of a multi-level index. For that, we have prepared a small DataFrame with the columns city, year, population, and area, and we have set the city and year columns as the index; this is how the data looks once that index is set. Now try stacking that data — if you stack it, this is the output you get. You use the stack() function to stack the data, as we saw earlier, and this is how it looks after stacking. You can see why it's described as a hierarchy, or a multi-level index: there is city and year; the year level is divided into two, 2020 and 2021; and under 2020 we again have the two things that used to be columns — the population and the area — and likewise under 2021. This looks like a hierarchy, right? Los Angeles has its own hierarchy, Chicago will have its own; and where the data for a city-and-year combination isn't there, it simply isn't shown. But that's the point: decide what kind of data, and shaped in what way, you want to look at, so that it helps when you analyze it further, draw plots, or visualize it. It all comes down to the questions you want to answer or the exploratory analysis you want to perform; depending on that, you have to shape your data first, and only after shaping it can you get your desired outputs. Then I unstack it, and it goes back again to the original format, the one displayed here.
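The multi-level-index example can be sketched as follows (illustrative city/year data; note the deliberately missing LA 2021 combination, mirroring the gap mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["NY", "NY", "LA"],
    "Year": [2020, 2021, 2020],
    "Population": [8.3, 8.4, 3.9],
    "Area": [778, 778, 1302],
})

# City and Year become a two-level (hierarchical) index.
indexed = df.set_index(["City", "Year"])

# Stacking pushes Population/Area into a third index level.
stacked = indexed.stack()
print(stacked.loc[("NY", 2020, "Population")])  # 8.3

# Unstack reverses it back to the indexed table.
restored = stacked.unstack()
print(restored.shape)  # (3, 2)
```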
So that's how stacking and unstacking work. Concepts like melt, stack, unstack, wide format, and long format are very important to know, because ultimately our end goal is to perform some kind of exploratory analysis on the data and generate insights out of it. So that's all about pandas. I hope you were able to learn the concepts discussed in this pandas tutorial, and I'll see you in the next video. Until then, goodbye!
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.