Comprehensive Guide to Python Pandas: Data Inspection, Cleaning, and Transformation
Introduction to Pandas
Pandas is a powerful Python library used for data manipulation and analysis. It primarily works with two data structures:
- DataFrame: A two-dimensional labeled data structure similar to a spreadsheet or SQL table.
- Series: A one-dimensional labeled array, similar to a list or a single column in a table.
Installing and Importing Pandas
- Pandas is not installed by default in Python but is pre-installed in Google Colab.
- To install on your machine:
pip install pandas
- Import using:
import pandas as pd
Data Inspection
- Reading Data: Use pd.read_csv() for CSV files or pd.read_excel() for Excel files.
- Preview Data: df.head() shows the first 5 rows; df.head(10) shows the first 10 rows.
- Tail Data: df.tail() shows the last 5 rows.
- Data Info: df.info() displays data types, non-null counts, and memory usage.
- Summary Statistics: df.describe() provides statistics for numeric columns; use df.describe(include='all') to include non-numeric data.
- Data Types: df.dtypes lists the data type of each column.
- Index and Columns: df.index shows the index range; df.columns lists column names.
- Shape: df.shape returns the number of rows and columns.
- Null Values: df.isnull().sum() counts null values per column.
- Unique Values: df['column'].unique() lists unique values; df['column'].nunique() counts them.
- Value Counts: df['column'].value_counts() counts occurrences of each unique value.
- Random Sample: df.sample(n=8) returns 8 random rows.
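The inspection calls above can be sketched on a small, made-up DataFrame. The movies dataset from the original walkthrough is not included here, so a stand-in with hypothetical columns is built inline:

```python
import pandas as pd

# A tiny stand-in for the movies dataset used in the guide.
df = pd.DataFrame({
    "title": ["Inception", "Avatar", "Up", "Gravity"],
    "year": [2010, 2009, 2009, 2013],
    "rating": [8.8, 7.9, 8.3, 7.7],
})

print(df.head(2))                 # first 2 rows
print(df.shape)                   # (4, 3): 4 rows, 3 columns
print(df.dtypes)                  # object, int64, float64
print(df.isnull().sum())          # null count per column (all 0 here)
print(df["year"].unique())        # the distinct years
print(df["year"].nunique())       # 3 distinct years
print(df["year"].value_counts())  # 2009 appears twice
```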
Data Selection and Indexing
- Selecting Columns: Use df['column'] for a single column or df[['col1', 'col2']] for multiple columns.
- Implicit Indexing (iloc): Zero-based positional indexing; supports negative indexing.
- Explicit Indexing (loc): Uses the DataFrame's index labels, which can be customized.
- Changing the Explicit Index: Reset or set a custom index with df.index = range(1, len(df)+1) or df.set_index('column').
- Row Selection: Use df.iloc[1:10] for rows by position (end excluded) or df.loc[1:10] for rows by index label (end included).
- Conditional Selection: Filter rows with boolean masks, e.g., df[df['vote_average'] > 6].
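A minimal sketch of the iloc/loc distinction and boolean masking, using a hypothetical four-row DataFrame (column names are illustrative, not from the original dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "vote_average": [5.5, 6.4, 7.1, 4.9],
})

# iloc: zero-based positions; negative indexing works.
second = df.iloc[1]            # positional row 1 -> title "B"
last = df.iloc[-1]             # last row -> title "D"

# loc: index labels; re-label the index to start at 1.
df.index = range(1, len(df) + 1)
first = df.loc[1]              # label 1 -> title "A"

# Slicing: iloc excludes the end, loc includes it.
by_pos = df.iloc[1:3]          # positions 1 and 2 -> 2 rows
by_label = df.loc[1:3]         # labels 1, 2 AND 3 -> 3 rows

# Boolean mask: keep rows rated above 6.
good = df[df["vote_average"] > 6]
```

Note the asymmetry: once the index has been relabelled, df.loc[-1] would raise a KeyError (no such label), while df.iloc[-1] still returns the last row by position.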
Data Cleaning
- Identifying Nulls: df.isnull() returns a boolean DataFrame.
- Filling Nulls: Use df['column'].fillna(value) with the mean, median, or mode.
- Dropping Nulls: df.dropna() removes rows with nulls; df.dropna(axis=1) removes columns.
- Dropping Columns/Rows: Use df.drop('column', axis=1) or df.drop(index).
- Handling Duplicates: Use df.duplicated() to find duplicates and df.drop_duplicates(keep='first'/'last'/False) to remove them.
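The cleaning steps above can be combined on a small hypothetical DataFrame with missing values and a duplicated row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing year, two missing ratings,
# and rows "B" duplicated.
df = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "year": [2001.0, 2005.0, 2005.0, np.nan],
    "rating": [7.0, np.nan, np.nan, 9.0],
})

print(df.isnull().sum())  # year: 1, rating: 2

# Fill numeric gaps with the column median / mean.
df["year"] = df["year"].fillna(df["year"].median())      # median is 2005
df["rating"] = df["rating"].fillna(df["rating"].mean())  # mean of 7 and 9 is 8

# Drop exact duplicates, keeping the first occurrence.
df = df.drop_duplicates(keep="first")
```

Whether to fill or drop depends on the data: dropping is reasonable when only a handful of rows out of thousands are affected; otherwise filling with a representative statistic (or a placeholder string for text columns) preserves more of the data.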
Data Transformation
- Changing Data Types: df['column'] = df['column'].astype(float).
- Renaming Columns: df.rename(columns={'old_name': 'new_name'}, inplace=True).
- Adding Columns: Create new columns from calculations, e.g., df['profit'] = df['revenue'] - df['budget'].
- Modifying Columns: Round values with df['column'] = df['column'].round(1).
- Adding Rows: Convert a dictionary to a DataFrame and concatenate with pd.concat([df, new_row_df], ignore_index=True).
- Modifying Rows: Access rows by index and assign new values.
- Setting and Resetting the Index: Use df.set_index('column') and df.reset_index().
- Grouping and Aggregation: Use df.groupby('column').agg({'col1': 'mean', 'col2': 'sum'}).
- Applying Functions: Define a function and apply it row-wise with df.apply(func, axis=1).
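The transformation operations can be sketched end to end on invented revenue/budget figures (the column names mirror the guide's profit example; the numbers are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Action", "Action", "Drama"],
    "revenue": [300.0, 500.0, 120.0],
    "budget": [100.0, 200.0, 40.0],
})

# Add a derived column.
df["profit"] = df["revenue"] - df["budget"]

# Rename a column (returns a new DataFrame unless inplace=True).
df = df.rename(columns={"genre": "category"})

# Group and aggregate: mean revenue and total profit per category.
summary = df.groupby("category").agg({"revenue": "mean", "profit": "sum"})

# Row-wise function with apply.
df["margin"] = df.apply(lambda row: row["profit"] / row["revenue"], axis=1)
```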
Data Reshaping
- Merging DataFrames: Use pd.merge(left, right, on='key', how='inner'/'left'/'right'/'outer') for SQL-like joins.
- Wide vs. Long Format: Wide format has a separate column for each variable; long format compacts variables into fewer columns.
- Pivot Table: Use pd.pivot_table(df, index='product', columns='month', values='sales', aggfunc='sum') to convert long to wide format.
- Melt: Use pd.melt(df, id_vars=['product'], var_name='month', value_name='sales') to convert wide to long format.
- Stack and Unstack: stack() compresses columns into a hierarchical index; unstack() reverses the operation.
- Multi-level Indexing: Set multiple columns as the index to create hierarchical indexing for complex analysis.
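A round trip between long and wide format, using the product/month/sales names from the pivot and melt calls above with made-up sales figures:

```python
import pandas as pd

# Long format: one row per (product, month) pair.
long_df = pd.DataFrame({
    "product": ["pen", "pen", "book", "book"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [10, 20, 5, 15],
})

# Long -> wide: one column per month.
wide = pd.pivot_table(long_df, index="product", columns="month",
                      values="sales", aggfunc="sum")

# Wide -> long again with melt.
back = pd.melt(wide.reset_index(), id_vars=["product"],
               var_name="month", value_name="sales")
```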
Conclusion
This tutorial provides a comprehensive overview of Pandas for beginners, covering essential operations from data inspection to advanced reshaping techniques. Mastering these concepts enables efficient data analysis and manipulation in Python.
For further reading, check out Python Pandas Basics: A Comprehensive Guide for Data Analysis, A Comprehensive Guide to Pandas DataFrames in Python, and Mastering Pandas DataFrames: A Comprehensive Guide. These resources will deepen your understanding of data manipulation with Pandas.
hello folks welcome to all next in this video we would be exploring Panda's library in Python so if you are someone
who is looking to learn pandas from scratch or struggling with the basic concepts in pandas library then this
video is for you in this video we will cover data inspection using pandas then data selection and indexing using pandas
how is data cleaned in pandas we will also have a look at how data is transformed and it is reshaped using
pandas Library so let's jump in we would explore the functions and the attributes of pandas that are required for data
inspection along with the introduction to pandas so let's start what is pandas pandas is nothing but a powerful python
library that is used for data manipulation and Analysis it contains um two main terminologies that is data
frame and a series data frame could be thought of as a spreadsheet or as a or similar to SQL table which contains um
two dimensional label data structure okay um along with some columns it may be of different types then there is
something called as a series which can be thought of as similar to a list or column in a particular table so let's
jump in into the collab notebook so pandas doesn't come up installed by default along with your basic python
okay however in collab it's already installed but if you are uh doing something on your machine
or U your machine won't have pandas install so you will have to install pandas and then import it after to
import we have this command import pandas as PD PD is nothing but an area's name so to demonstrate attributes and
functions of pandas which are required for data inspection I have taken this data set called as movies all right so
how do we read the data we use pd. read CSV if it is it's Excel file it will be different function it will be read excl
or something so it depends what kind of data are you looking for then I have a function called as DF do head let's see
what it returns me okay if you could see here it returns first five rows of the table so that you can have a look at it
um like in this particular case we have movies so there is a movie ID that is mentioned and there's a budget
popularity Revenue the title of a movie vot average it just may correspond to the rating vot count then director ID
year month and day so you can have a glimpse um of what what data is there in the data set with this particular uh um
DF do head using DF do head function okay so if I want say 10 rows then I will just mention DF do
head whatever number of rows I want I can mention it here so you would see this time around you will get 10 rows
okay first 10 rows by default it returns you first five rows however if you want to get
more or any other more more number of rows or maybe less number of rows you can mention that number inside the
parenthesis okay then similar you can also get last VI rows that is if I do uh ef ef 10 this will give you last P by
default okay you would see here these are last P then
um there is something called as DF do info right you would see what this will give
you is the information about the um data that is there with you that is it will give you what kind of data type it it is
that is it's a data frame because whatever table like structure in Paras we call it as data frame and for a
particular column we call it as Series right so it gives you the information of each and every column these are the
column names then um it gives the non-null count so um there are total of 1465 values and there is no null value
inside it and there are three types of um data types in pandas that is in 64 or integer float and object okay and no
other so if if it is the data set type should be either integer or float if it is not that then it will be considered
as object all right and moving ahead um let's see there is one more function DF do
describe what DF describe gives us is the statistics for numerical column so you would see here for all the numerical
columns it has given me the statistics like count main standard deviation minimum maximum then the intercal ranges
and so on so um if I want you would notice here though that it's only doing all this stuff for numeric columns
however if I want to include the object data type as well I will have to do a include all so DF do
describe okay inside the parenthesis I will mention include is equal to [Music]
all I run this you would see there are few more rows added over here like unique top because
those are of object data type right you cannot calculate um you cannot calculate mean for
that see here look at this for title you cannot calculate mean right for month you
cannot calculate mean so for all those non-numeric columns you can cannot calculate mean that's the reason you get
the output of you you get this kind of out that is look at this mean standard deviation
minimum all these ranges it's not possible to calculate for what for object data type so I hope you
understand what I mean by uh object data type it's a data type that doesn't have integer and Float okay so moving on to
our next um next attribute that we will be checking is we remember describe when
I'm using this parenthesis these are all the functions all right and I would also be showing you some attributes now like
if I'm if I type in DF do D types here it's an attribute it will show all
the data types uh like data types of all the columns present in the data okay so columns or in terms of fers all the data
of the series that are there in the data frame okay then DF do index let's see what DF do index gives
you so it will provide you with the range of where the index starts and where does it stop okay
after that if I do DF do columns it should return all the column names then there is something called as
DF do shape it will give me the shape of the data set that is 1465 rows and 11
columns then moving ahead this is what are we doing by checking all this this is something called as data inspection
so we are inspecting the data we are checking all the um uh I mean all the parameters of data with respect to
inspection so next one is I would like to check if there is any null value and also doing a sum of that null value what
will this return return me oh is function
yeah you would see there is no n value otherwise it would have written me column wise some okay however there is
no n value here so I don't get any output for this so that is fine now moving
ahead let's say I want unique uh unique values in a particular column um so I would take this in our data I would just
take DF again so I want unique values I will check it
on ear column okay because I want unique values now in the ear column how to get the unique values there is a function
called as unique if then that column name that column name ear
ear function function is unique look at this these are all the ears that are present in the data okay these are
all the I mean they are repeated but I have got the unique values so in our data these are all the ears that are
present then mov forward if I want the count of unique values what is the count of these values then there is one simple
change here instead of unique you just mentioned n unique it will return return you the count there are total of 41
unique values in the ear column then if you want to do a value count then you can do that as
well DF do ear the same thing and I will do a value count value underscore counts
okay oh my bad it's a function at this so it will straight away give you which year appear how many times all
right so it Returns the count of unique values in a particular
[Music] column at this so 2006 appeared maximum types in our data then 2005 then 2002
accordingly let's move ahead with the next one if you want some random rows from the data frame then you can mention
DF do sample random rows how many rows do you want say you want some eight random
rules this will give you any eight random rules you would see you would identify that with the index of the data
indexes are very much random so they have picked any random row so that's about data inspection part for pandas
now let's move ahead with next topic we will discuss how can we select columns from a particular data set or maybe a s
it may be a single column it may be multiple columns then we will have a look at what is implicit index what is
explicit index and using that how we can select specific rows from a particular data set and then we will also see how
we can um select data based on conditions so let's jump in you would see here the first
statement is input pandas and input number so this is quite familiar to you now it's been like we are four to five
videos um old in exploring python libraries so this is something that is familiar to you then I
have imported the data set movies with the read uncore CSV function uh or you may use some other function if
it is a XLS file it will be read uncore Excel function then I would just uh like to have a look at the existing data
that's the reason I using DF doad oh my bad here I have to run this sales.ad now you would see this data
consist of movie ID budget and popularity Revenue title vot average vot count and all these things uh like
director ID year month and day okay so we had a glimpse of what is there in data suppose I want to have a look at
only one column or I want to select only one column from this particular data what would I do I would simply enter the
column name inside um inside the existing data and I would
select it with this syntax I would simply write DF then Square Braes inside that I will write the column name and if
I run this you would see only one column will be displayed okay this is this is the output that you get if I want to
select multiple columns same thing I will do but this time I will have to specify whichever columns I
need I would have to specify their name all right so I needed title and vot average so I mention their name in a
list the syntax is the same instead of single column name this time what we have done we have entered a list of
number of columns like we want to select two you specify their names and the data will be displayed all right now there is
something called as implicit and explicit in um uh in Bas all right so let's see
what is implicit and what is explicit now implicit index is represented by ioc and explicit index is represented by Lo
explicit index is an is a index which you can see here when uh you get the output um if you have noticed this
number here this is not a part of table all right this is something called as explicit index because as I mentioned
you can see this right it starts from zero and it goes all the way till 1464 okay you would see here as well
so whenever you try to see a particular data set or data in a particular data set then you would see there are there
is one additional um column you may say it doesn't have name here and that is nothing but an index okay this is not a
part of our existing data okay this is not a serial number or I have not forgotten any column name over here it
comes it gets generated and we can can see that and that is nothing but called as an index but it is a explicit index
because I I just mentioned that explicit indexes are those indexes which you can see over here and what is implicit
indexes which you cannot see but they are there they are getting generated in the backround you will see what it is um
when we have a look at the example all right so if I do DF of one let's see what is
the output you get an output of the uh you get you get output as
a second row because you mention DF of one one is at where here the one means the index right so the data
corresponding to the index one is displayed now this is iloc and when I do DF of Lo
DF do looc one which is explicit this again shows the same data all right then you would wonder hey um these both are
showing us same data then then what's the difference yes of course there is a difference that's the reason we are
studying this all right um suppose I want index to start from range one and not from zero by default it started from
zero explicit index starts with zero and um even implicit index starts with zero that's why we got the similar outputs
however in we can change explicit index range and we can start it from wherever we want it to now suppose I want the
index to be started from one so I have used this range function and um using it inside list function I
have modified the index I want this start uh to start it from one so this is what I did U I gave one and range till
where DF DF do shape of 0 + 1 you would say what is this DF do shape 0 + 1 so it is nothing let's explore DF do shape
first so DF do shape gives you what shape is 1465 and 11 and what I have done shape of zero shape of zero in the
sense this number 1465 so if I do oh hold on this yeah DF of shape of zero if I do
over here okay you would get 1465 all right and plus one plus one is
1466 okay so my range should go to from 1 to 1466 only then I I would able I would be able to get 1465 rows you
remember the range function from the for Loop that we discussed we had discussed range function when we studed for Loop
all right so that's how it works the range will be from 1 to 1466 wherein 1466 will get
ignored and you will get all the 1465 rows all right so to proceed ahead now if you look at this the index started
from one because we wanted the index to start from one and not from zero so that's that's an advantage of expli
index it it will it will take the range from the one you want it to be okay so
now if I do DF Lo of one look at this it will give me first row not the second one as you as we used to get earlier
because the index has changed and at the first row this is these are the details and those details will be shown to me
however if I'm still using iloc all right let's see what I'm getting this is
implicit index it still returns me to because we cannot change implicit is this index we cannot change all
right it it behaves the um way you can related with the python indexing similar way whatever we have
studied earlier while we did string slicing why weed list liing so at that time we had studied positive indexing
and negative indexing it is that index all right so which whatever is the default behavior um by default which
index operate that is that that is the same index referred by iloc over here all right you would see it did not
change now if I do it dfoc of minus one okay now you would quickly very quickly answer V it will give us the last row of
data because you are already aware what is negative indexing okay after that DF do L of minus
one let's see what this gives me as an output and yeah this gives me an error key error minus one why do you think
would this have happened if you if you look at this carefully DF do Lo of minus one right
now in the Lo or in explicit indexes what I have is a range from 1 to 1465 all right and these are explicit indexes
so whatever range I I had already mentioned whatever range I want I would be specifying there right so I have a
range from 1 to 1465 and I'm referencing I'm trying to give minus one in Lo so there is no minus one
in um in this explicit index range it is from 1 to 1465 had I mentioned minus one then this would not give me an error
look at this it is giving me a key error it's saying key error minus one um we we don't have this key u we don't
know what this key you are talking about so negative indexing and positive indexing will apply only in terms of um
ioc I mean both will apply only in terms of ioc explicit index that is looc it will not apply because that is something
that we have given the range whatever um item or whatever row you are trying to reference should be from that same range
that's the reason we get an error here when we give something that is not there in one two 1465 all right now if you if
you want to see a particular index then you can see that index as well all right now let's have a look at
row selection using implicit and explicit indexing here you would see DF do Lo one colon all right
so ioc1 means which row um of using explicit index it will be second row second row why second row because uh
I'm I'm talking in terms of explicit index why I'm trying to explain this is because see the data this indexes are
explicit indexes what I'm saying iloc of one is equivalent to explicit index two whatever is there in explicit index 2
iloc of zero will be equivalent to explicit index one because we have changed the order of explicit index and
it now starts from one right and that's why it started here from Two and it went all the way to 10 okay now if I would
have written zero over here look at this now it will return return me the first row look at this it return me from the
first to the 10th all right so this is what we should be able to um actually notice and if you see 0 to 10 is um 11
right if if we count but it follows the range concept here as well and it returns you how many rows it returns you
rows from 1 to 10 that is number of rows written here at 10 not 11 because the 10th one is excluded this is very much
the same um when we did slicing um uh in strings or list all right so I I always keep on referring to the to our older
Concepts because you are you will be able to correlate very easily all right then df.loc of 1 to 10 now if the same
thing I want to do it for LC from 1 to 10 then what's the result um look at this it starts with one now we are we
are telling the interpretor that okay use Lo right so lo I said will follow whatever range we have specified we have
specified 1 to 10 in this case it will show all 10 rows now it doesn't uh ignore this
10 and it doesn't show only nine rows in this case in case of Lo it shows all the rows whatever you mention over here so
these are few key things that you should remember um while you explore explicit index and implicit index the way ioc and
looc works is totally different in looc you can change um the range of the index that is it did not necessarily start
from zero you can mention whatever range you want and the way it works is this whatever you specify in the range it
will take all the rows it will not ignore the um upper limit upper limit range and in iloc it
will always give one one I mean always retri one row less than what you have specified because it is similar to the
um to our range function all right now if the same thing if you want some specific rows you have to mention those
row numbers or not the row numbers you have to mention the row indexes and this is what we have done here DF is equal to
ioc 11 12 and 18 you would see one that is um the output that will be replied by this um first element in the list is
this row that is the second explicit index row then the 13 explicit index row because you have each 12 it will always
be one more in terms of explicit index guys again I'm saying don't get confused explicit index you can see that's the
reason you see this 213 and 19 implicit index you cannot see here on the screen but we can um definitely um correlate
how these things are working it doesn't show you the exact first row details because you haven't put zero over here
implicit index will always start from zero it will have negative indexing as well where in explicit index doesn't
follow any thing kind of negative indexing as such by default if and only if you mention some negative range in um
explicit index only then it will follow all right then um DF do Lo um 1A 12 18 then exact same rules will be red see 1
12 and 18 all right now let's move on to the next section that is condition selection or and filtering all right so
based on condition also we can select particular number of rows or specific set of data here I wanted to have a look
at movies which have vote average or you can say which have rating more than six all right so what I have used here I
have used the comparison operator great than and try to do the same however if you notice we get output as a Boolean
array the output is true and false it doesn't give me the actual data now if I want actual data what I will do I will
apply this Boolean array on uh as a mask on my original data so my original data is DF and I simply put this Boolean
array on the original data and this will actually retry the values so wherever it is True Values
those values or those rows will be red and wherever it is false it will not show up here that's why you see there
are 977 such rows which have wrting greater than six so if someone ask me that which are the movies that are
greater than that have greater rating than six then I would simply use this but no kindly note here
that the output that you get by comparing the comparing it with whatever number you decide whatever rating you
decide was Boolean it gave you true and false output now we are not interested to see what is
what all data is true and false we just want to filter out or we want to select whichever movies have rating greater
than six and that's the reason we applied that mask on our data set this is also called as a fancy indexing
sometimes also called as U masking so these are various different terms that are used for the very same purpose now
if you want to do a mean of particular thing I mean let's take the same
rating column or you can say vot average column um and I'm saying it as rating because vot average here shows the
rating of a particular movie so if I want to take an average what I will do I will first select that vot average
column I will take it out and store it in a different variable and then I would take a take a mean of that so here I
just I've done the same look at this I get the vot average I just selected the column you know how to select columns if
I want to take a meme I can always take a mean of that now let's move ahead with next topic we would see how to um
identify the null values in a particular data set um how to fill them up and how to drop a particular column with null
values or a row with null values and also how to explicitly drop a particular row or a column from a data set and we
will also identify the duplicate rows that are there in the data sets and drop the duplicate um
rows using different variations now let's jump in you would see first line of code is I have imported pandas and
with and specified the alas name this is something that you're already aware from the previous videos then moving ahead I
have data which is there in this file XLS X and I have loaded the data in the data variable now U let's have a look at
what is there in data so we have movie ID title Jer then release year rating director and
run typ all right now I need to identify which all n values are there so I have a function
called as data do is null so if I use this function it would return true wherever there is a
null value it will actually return the table containing true and false values and wherever there is a n or empty value
you would get true at this point of time our data is short so we can easily have a look at
this you have got true in this release year and the column rating all right however if data is larger you may not be
able to see it see this very easily that's the reason um we would we would do a
sum uh of True Values in a particular column for that we have just a some function to be attached
to data do e null all right so this is the way we write I have a variable n values is equal to data do e null do sum
that would give me column wise count of how many null values are there in particular column from this output I can
clearly understand that release year has got one n value and rating has got one n value and I need to do some treatment
with that um please understand we cannot just leave those null values or we cannot just replace it
by anything we want we need to think about how the data is um what number of null values are there
um for our data is it some substantial number of n values or it's a ignorable one and accordingly we will replace
those null values now here you would see I have replaced it with median and mean in the release year I have replaced um
the null value by median and in the column rating I have replaced the null value by mean now first let's see what
does um the output of data release Here median gives it to me okay so I'll just um take this part of the code
see what is the output that I get on doing a medion of ReliOn so I get 2,11 so have a look at data reading
mean a do 1 0 0 all right let's have a look at data one more time so that you can easily
compare after we replace the N values now here you would see rating has got one null value and uh release year has
got one n value as we have already seen all right now look at this the release year not median turns out to be 2011 and
rating turns out to be 8.1 so these values should be replaced according uh accordingly like for release year
instead of this nine we should get 2011 and rating in in the rating column instead of this n we should get the
value 8.10 that is the mean and if you want to replace this we have a function called as fillna okay that's that's what
we have used and we are trying to replace it so we just need to mention a column name and um by what you want to
replace do you want to replace with medium do you want to replace with mean or even then even it is more to it we
can also do a replacement with something called as mode that is um the Val that is the particular n value or empty value
will be replaced by frequently used value in that particular column okay so we are not using that one here but we
are using median and mean all right so let's see what what's the output that we get
here you would notice that wherever there was null value it got filled with 2011 as expected
and um in in the real here and in the rating um we got it we got the N value replaced by 8.1 so it depends on which
one which option is suitable for you is it median is it mean according to the data that you are
working okay you cannot simply ignore and go on deleting things you will have to replace at times using mean median
and mode all right and if there is a a text column you would say hey it that doesn't work uh the mean and median
doesn't work so you can simply write some text like not available not applicable na whatever you wish to I
understand because those will be um like strings you cannot take any mean or median of that all right um moving
ahead if you want to drop particular rows which have um n values then you can simply
drop it so in this case I have I had null values on Fifth fifth row and sixth row let's see if they are getting
dropped yep see fifth and sixth row are dropped dropping is a good option when uh the number of rows with null values
are very less compared to the um data the number of rows in the entire data say you have 10,000 or one lakh rows and
only um three four rows have some null values you might uh prefer dropping those then we can also drop columns
which have null values so we have the same function here we should specify the axis so axis is equal to one states that
you are looking for dropping columns when you don't specify anything by default it takes us rows all
right so here if you would see those columns have been dropped which had n values
okay so there may be situations where you don't need those columns then you can use this option else you would
prefer doing a fillna filling those values right then um irrespective of null value will you be able to drop a
Irrespective of null values, can you drop a particular column? Yes, that is very much possible. If you want to drop a single column, just mention the column name and axis=1, and you can drop it. I dropped Runtime from the data — look at this, I don't have a Runtime column anymore. Why are we looking at this? Because when you receive data, you may decide a particular column is not required; in that case you simply drop that column and keep only the columns that are needed, and for that you use this. You may also want to drop some rows: pass their indices and axis=0, and you will no longer see rows six and seven, the ones mentioned in the drop statement, in your data. So if you want to delete any specific rows, that is very much possible.
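The drop() calls just described can be sketched like this (column names and indices are illustrative; Runtime matches the column dropped in the video):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Runtime": [120, 95, 110, 130, 100],
    "Rating": [8.0, 7.2, 6.9, 8.5, 7.8],
})

no_runtime = df.drop("Runtime", axis=1)  # drop a column by name
fewer_rows = df.drop([3, 4], axis=0)     # drop rows by their index labels

print(list(no_runtime.columns))  # ['Title', 'Rating']
print(fewer_rows.shape)          # (3, 3)
```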
Now, on to dropping duplicate rows. To drop duplicate rows, we first need to identify whether there are any duplicate rows in the data we are looking at. Let me check — look at this, there is a duplicate row at index 8. Let's look at the data again and see which row it duplicates: rows 7 and 8, the last two rows, contain the same data. So we will try to remove the duplicates. If we want to remove both rows, we have to pass the parameter keep=False; in that case both rows 7 and 8 (those being their index numbers) are deleted, and you see only six rows. If you want to keep the last one, the eighth, just pass keep='last'; then the row at index 8 is kept and the other is deleted — and had there been three or four duplicate rows, the others would have been deleted and only the last one retained. If you pass keep='first' in the keep parameter of drop_duplicates(), the first row is kept and the others are deleted. This is how you perform data cleaning in pandas.
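The duplicate-handling options above can be sketched as follows (illustrative data; the duplicate pair mirrors the rows-7-and-8 situation from the video):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C", "C"],
    "Year": [2010, 2012, 2015, 2015],
})

print(df.duplicated().sum())  # 1 -- the last row repeats the one before it

both_removed = df.drop_duplicates(keep=False)  # delete every copy of a duplicated row
keep_last = df.drop_duplicates(keep="last")    # keep only the last occurrence
keep_first = df.drop_duplicates()              # keep='first' is the default

print(len(both_removed), len(keep_last), len(keep_first))  # 2 3 3
```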
Now let's move on to the next topic: transforming data. We will look at how to change the data type of a particular column, how to rename a column, how to add new columns and modify a column, how to add and modify rows, how to set and reset the index, how to group and aggregate the data in a dataset, and how to apply functions to a DataFrame. Let's jump in. Our first statement, as usual, is import pandas as pd,
pd being the alias name. Then I have a dataset called transform_movies (well, I forgot the 's' in the name, but it doesn't matter). When I load this dataset and look at what is in it, you see some 10 rows; I kept it as simple as possible and am not taking a lot of rows here. The data consists of movie ID, title, release year, genre, IMDb rating, budget, and revenue. Let's check the data types of all these columns. For that we have df.dtypes, which we are already aware of, and we see the three data types that appear by default: integer, float, and, if it is neither of those, object. There are also specialized data types we can convert to, as we already know from our other videos. Say I want to change the data type of budget and revenue from integer to float; for that we have the astype() function, with which I can change them to float. If I run this and look at df.dtypes again, you see the data types have changed from integer to float.
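The astype() conversion can be sketched like this (illustrative column names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "Budget": [160000000, 58000000],
    "Revenue": [825000000, 353000000],
})
print(df.dtypes["Budget"])  # int64

# Convert both columns from integer to float in one call.
df = df.astype({"Budget": "float", "Revenue": "float"})
print(df.dtypes["Budget"])  # float64
```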
Now let's see how to rename a column in a dataset. Before running this, I'll explain the syntax. We have the rename() function, inside which there is a parameter called columns, where you list each column name along with the name you want to replace it with, in the form of a dictionary. (If you are not well versed in how dictionaries work, we have a separate video in our Python for Beginners series that you may refer to.) I have also used inplace=True, which indicates that these changes are saved permanently. The two columns I want to rename are 'Release Year', which I replace with 'Release_Year', and 'IMDb Rating', which I replace with 'IMDb_Rating'. The column names earlier contained spaces; with this command the spaces are replaced by underscores. Let's see if we can do it successfully — yes, look at the output: we have successfully renamed the columns.
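A minimal sketch of the rename() call (the old/new names follow the video's example):

```python
import pandas as pd

df = pd.DataFrame({"Release Year": [2010], "IMDb Rating": [8.8]})

# Map old names to new names in a dictionary; inplace=True saves the change.
df.rename(columns={"Release Year": "Release_Year",
                   "IMDb Rating": "IMDb_Rating"}, inplace=True)

print(list(df.columns))  # ['Release_Year', 'IMDb_Rating']
```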
Now moving on to the next section: how to add a new column. There may be a need to do some calculation based on existing columns and add the result as a new column to your dataset. We have revenue and we have budget, and we want to calculate profit. This calculation happens on each row, which is why we want a new column that shows the profit for each row; the formula is simply revenue minus budget. By assigning to df['Profit'] in square brackets, we create the new column. Let's see if it gets created — yes, we have successfully created the new column called
Profit. Now suppose you want to modify an existing column — say you want to explicitly tell pandas that the rating should be rounded to one decimal place. I can simply write df['IMDb_Rating'].round(1). round(1) means there will be only one digit after the decimal point; if you want two digits, or some other precision, just change the argument to the round() function. I am explicitly saying that this column will have only one digit after the decimal point — something called precision. And indeed, you see the IMDb rating displayed as 9.0, 7.8, and 8.7, all with one digit after the decimal.
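Both operations — deriving a Profit column and rounding the rating in place — can be sketched together (illustrative values, chosen so the rounded ratings match the 9.0, 7.8, 8.7 mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Budget": [100.0, 60.0, 11.0],
    "Revenue": [825.0, 353.0, 100.0],
    "IMDb_Rating": [8.97, 7.81, 8.74],
})

# Add a new column computed row-by-row from existing columns.
df["Profit"] = df["Revenue"] - df["Budget"]

# Modify an existing column: keep one digit after the decimal point.
df["IMDb_Rating"] = df["IMDb_Rating"].round(1)

print(df["Profit"].tolist())  # [725.0, 293.0, 89.0]
```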
Now let's see how to add a new row. You define the new row in the form of a dictionary, so we need to be clear about how dictionaries are created. Then you have to convert that new row into a DataFrame first: if you try to append it directly with the concat() method, it is not going to work. That's why I used the pd.DataFrame() function, and then concatenated the new row, passing ignore_index=True. The ignore_index argument matters because an appended row otherwise keeps its own index, starting again from zero. With it, you see the new row is added easily at index 10. If I don't write ignore_index, the appended row's index simply starts from zero — look at this. We don't want it to start from zero; we want it to continue on from the existing index, and that's why we pass ignore_index=True.
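The dict-to-DataFrame-to-concat() sequence can be sketched like this (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B"], "Year": [2010, 2012]})

new_row = {"Title": "C", "Year": 2015}

# The dict must become a one-row DataFrame before it can be concatenated;
# ignore_index=True renumbers the result continuously instead of keeping 0.
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df.index.tolist())  # [0, 1, 2]
```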
If you want to modify a particular row, you can do that too. Look at this: I have mentioned the location using the explicit index, and at that explicit index I modify whichever value is there. Let me show you the data first — this is what we have in df — and I want to change the year: instead of 2014 I want it to say 2015. So I simply come here, mention the index (5) and the column, and change the value. Look at this: the year has changed. So even if you want to change just a single value in a particular row, you can definitely do it this way, accessing it via the explicit index (or the implicit index).
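That single-value edit via .loc can be sketched as (illustrative data; the index label and column name are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B"], "Year": [2014, 2016]})

# Change a single value by (explicit index label, column name).
df.loc[0, "Year"] = 2015

print(df.loc[0, "Year"])  # 2015
```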
Then, if at any point you want to set a column as the index — here I am setting movie ID as the index — you can do that very easily with the set_index() function: set_index('Movie_ID'). You see the default explicit index has vanished and movie ID has become the index. And if you want to come back to what we had earlier, just do df.reset_index(), which brings the default explicit index back into play. So that's how you set and reset the index for a dataset.
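A minimal sketch of set_index() and reset_index() (the Movie_ID column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Movie_ID": [101, 102], "Title": ["A", "B"]})

df = df.set_index("Movie_ID")  # the column becomes the index
print(df.index.tolist())       # [101, 102]

df = df.reset_index()          # back to the default 0..n-1 index
print(df.index.tolist())       # [0, 1]
```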
Moving ahead: how to group and aggregate data. If at any point you want to group data — here I group the dataset by genre — you can perform aggregations like sum and mean (or whatever aggregation you want; in this case I perform only sum and mean) using the df.groupby() function. Along with groupby you can use the agg() function, where you mention on what basis you want to aggregate each column: is it a mean, is it a sum? I used three columns for the aggregation: for IMDb rating I take the mean, that is, the average IMDb rating, and I look at the total budget and total revenue generated per genre. That was my goal, and that's why I wrote it this way. Let me check whether I get the expected outputs — yes, look at this: for the action genre, this was the budget, this was the revenue, and this was the average rating; likewise for drama and for sci-fi. So whatever you group by, the data is aggregated accordingly.
So, is it always necessary to perform an aggregation whenever you use groupby? No, absolutely not. In some cases you may not want to perform any aggregation; you just want to group, and that is also very much possible. Here I group by genre again; this time I don't do any aggregation, but to see a particular group I need to use the get_group() function, and then I can easily see what is in that group. If I run this cell, all the rows that belong to the 'Sci-Fi' group are displayed, because I did get_group('Sci-Fi').
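Both groupby patterns — aggregating per genre and pulling out a single group — can be sketched together (illustrative data and column names):

```python
import pandas as pd

df = pd.DataFrame({
    "Genre": ["Action", "Action", "Sci-Fi", "Drama"],
    "IMDb_Rating": [8.0, 7.0, 8.8, 7.5],
    "Budget": [100, 150, 160, 40],
    "Revenue": [300, 250, 800, 120],
})

# Aggregate per genre: mean rating, total budget and revenue.
summary = df.groupby("Genre").agg({
    "IMDb_Rating": "mean",
    "Budget": "sum",
    "Revenue": "sum",
})
print(summary.loc["Action", "Budget"])  # 250

# Group without aggregating, then pull out one group's rows.
sci_fi = df.groupby("Genre").get_group("Sci-Fi")
print(len(sci_fi))  # 1
```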
Okay, so that is very much possible. What's next? Let me check — yes: I run df.reset_index() again, and then we apply a function to a DataFrame. How do you apply a particular function? You know how to write functions; we studied this in our Python for Beginners series. Here is a function definition (def is the keyword that defines it) named profit_margin, and it takes one parameter named row. What does it do? It returns the row's profit divided by its revenue, times 100 — which is nothing but the formula to calculate the profit margin. You want to calculate this for each and every row, so rather than writing this formula over and over again, you can very well use the apply() function on the DataFrame and get the expected result. The profit margin is calculated by applying the profit_margin function, which you see in the apply() call: the first argument is the function name and the next is axis=1, meaning the function is applied to every row. I selected the title alongside it, which is why I can easily see the profit margin that was achieved per movie in our dataset. So that's all about data transformation.
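The row-wise apply() described above can be sketched like this (illustrative data; the function mirrors the video's profit_margin):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["A", "B"],
    "Revenue": [400.0, 250.0],
    "Profit": [300.0, 50.0],
})

def profit_margin(row):
    # Profit margin as a percentage of revenue, computed per row.
    return row["Profit"] / row["Revenue"] * 100

# axis=1 passes each row to the function.
df["Profit_Margin"] = df.apply(profit_margin, axis=1)

print(df["Profit_Margin"].tolist())  # [75.0, 20.0]
```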
Now let's move ahead to the next topic: how to reshape data. We will look at how to merge two different datasets, then at what wide-format and long-format data are, then at the pivot and melt functions, and finally at how to stack and unstack data. Let's jump in. Our first line of code is import pandas as pd, as we have already done on several occasions. I'd also like to show you that we have some sample datasets — movies, ratings, nba.csv, and sales data — which we will be working on today, and we will see the various outputs we get when we perform the different joins while merging data. First, I read the movies and ratings datasets.
If you have a look at the movies dataset and the ratings dataset, you will see there is one thing in common: the movie ID. On the basis of the movie ID we can definitely perform a join. What a join is — inner join, right join, and so on — comes under SQL concepts that you should be aware of to follow this part. In the first case I join on the basis of an inner join. To join two tables in pandas we have a function called merge(): you pass the names of the two tables, that is, the left table and the right table; you mention which column you are joining on (movie_id in this case); and you specify what kind of join you are performing (inner here). An inner join, as we know, displays the intersection — the common part — of the datasets. So in the output you see everything except movie ID 5, because movie ID 5 is there in the movies dataset, look at this, but it isn't there in the ratings dataset, so it doesn't appear. However, if you perform a left join, all rows of the left table are displayed; here the left table is the movies table, so all of its rows are displayed along with the intersection. That's why in the left-join output you see the common part as well as all rows of the movies dataset, including that movie ID; however, there is no data corresponding to its user ID and rating, which is why NaN is shown there. Moving ahead along similar lines, we can perform a right join as well, where all the rows from the right table are considered and the output is displayed accordingly, which is why you get that kind of output. And if you are aware of the outer join: all the rows that are there in either dataset are displayed, with duplicates eliminated — that is the output for the outer join. In all of these I have used the same merge() function with the same syntax and only modified the way I am joining, so whichever join you want to perform, you mention it in the how parameter. So that's how to join data.
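The four joins described above can be sketched as follows. The data is illustrative, but the situation mirrors the video: movie ID 5 exists only in movies, so it drops out of the inner join and gets NaN in the left join:

```python
import pandas as pd

movies = pd.DataFrame({"movie_id": [1, 2, 5], "title": ["A", "B", "E"]})
ratings = pd.DataFrame({"movie_id": [1, 2, 3], "rating": [8.0, 7.5, 6.9]})

inner = pd.merge(movies, ratings, on="movie_id", how="inner")
left = pd.merge(movies, ratings, on="movie_id", how="left")
right = pd.merge(movies, ratings, on="movie_id", how="right")
outer = pd.merge(movies, ratings, on="movie_id", how="outer")

print(len(inner), len(left), len(right), len(outer))  # 2 3 3 4

# In the left join, movie_id 5 keeps its title but has NaN for rating.
print(left["rating"].isna().sum())  # 1
```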
Now, there is another dataset, called sales data, and it looks like this: it has product, region, month, and sales. Say I want to convert this into wide-format data. What are wide format and long format? Wide-format data has a column for each variable: in this case, say, January sales and February sales would each be individual columns. In long-format data we compact things as much as possible, and what we see here on the screen can be called long-format data. In most real-world scenarios, long format is preferred over wide format; however, for some analyses you may find that wide format is more suitable, and then you may very well go with the wide format.
So here we look at how to pivot a dataset. We have two functions in pandas for doing that: pivot_table() and pivot(). If you want to use aggregation functions along with pivoting, you are better off with pivot_table(); otherwise you can go with the pivot() function as well. What I am doing in this pivot_table() call is taking whatever values were there in the month column — February, January, or whichever months you have; here there are only two — and putting each into a separate column: one column for Jan and one for Feb, while calculating an aggregation, in this case a sum. That's why I get this kind of output. First you define the index you want to keep: I have set the product column as the index. Then columns='Month': whatever is in the month column — here January and February — becomes individual columns, which is why they appear as individual columns in the output. Then values='Sales': whatever was in the sales column goes in as the values. And finally you do an aggregation — sum, average, whatever you want to perform — and you get an output where all the months in our data (only Feb and Jan here) are displayed as individual columns holding the sum of sales for product A and the sum of sales for product B, shown individually. So if you are really looking for this kind of output, you can perform the pivot using the pivot_table() function.
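The pivot_table() call described above can be sketched like this (illustrative sales figures; the Product/Month/Sales column names follow the video):

```python
import pandas as pd

sales = pd.DataFrame({
    "Product": ["A", "A", "A", "B", "B"],
    "Month": ["Jan", "Jan", "Feb", "Jan", "Feb"],
    "Sales": [100, 50, 200, 300, 400],
})

# Long -> wide: one column per month, summing sales per product.
wide = sales.pivot_table(index="Product", columns="Month",
                         values="Sales", aggfunc="sum")
print(wide.loc["A", "Jan"])  # 150
print(wide.loc["B", "Feb"])  # 400
```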
To convert it back to what it was originally, we have the melt() function, and I have also used reset_index() there. When you convert it back, you need to mention a few parameters that say what you are up to. Whichever column you want to keep as an identifier goes in id_vars; that is the parameter where you mention the product column name. Then value_vars is where you list what values you want, that is, Jan and Feb — earlier the values were Jan and Feb, and if it were all 12 months, I would have to list all of those, because those are the values that will appear repeatedly in the month column. Then var_name is the name of that variable, that is 'Month', and value_name is 'Sales', because ultimately you want to look at the sales for a given month and product. Here you are not doing any kind of aggregation; you are shrinking the data, so to speak — compacting it back into long format — and that's why you get this kind of output. The aggregation still remains the same, because you are melting the pivoted dataset, not the original one: it holds the same summed sales, but you want to see them in this format, which is why you melted it back to the format you preferred, long format. This data has been melted back; it is not the original data — please note this and don't get confused. It's pretty simple: this was the original data, we did a pivot and converted it into the wide form; however, we felt the preferred format should be the long one, because it reads better for analysis. In some cases it might happen that you feel the wide one is better, and then you may very well go with that; there is no issue as such.
Moving ahead, we have another small dataset, nba.csv, with columns such as name, team, number, position, age, height, weight, college, and salary, and we want to do a stack. Stacking, again, converts the data into a compact form: if you look here, whatever was displayed in a row — say this entire first row — got compacted under one single index into one column. That is the important part: instead of being shown as one row, it got converted into one column under a single index. Similarly, you can unstack things to get back to the original format; again, it matters which format you prefer, wide or long. As a formal definition: stacking compresses columns into a hierarchical index and turns them into rows — that is what stack() does. That's what happened here: all the columns were compressed into a single column and turned into a hierarchical index — for the first record, the name (Avery Bradley), then the team, with the same hierarchy for the second row, the third row, and so on through the entire data. unstack() is the reverse operation: if you want the data back in the format it was in — its original format — you can achieve that as well, simply by using unstack().
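The stack/unstack round trip can be sketched like this (a tiny stand-in for nba.csv; the names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley", "Jae Crowder"],
                   "Team": ["Boston Celtics", "Boston Celtics"],
                   "Age": [25, 25]})

stacked = df.stack()  # columns are compressed into an inner index level
print(stacked.loc[(0, "Name")])  # Avery Bradley

unstacked = stacked.unstack()    # reverse: back to the original wide shape
print(unstacked.shape)  # (2, 3)
```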
Now let's see the concept of a multi-level index. For that, we have prepared a small DataFrame with the columns city, year, population, and area, and we have set the city and year columns as the index; this is how the data looks once that index is set. Now try stacking that data — if you stack it, this is the output you get. You use the stack() function to stack the data, as we saw earlier, and this is how it looks after stacking. You can see why it's described as a hierarchy, or a multi-level index: there is city and year; the year level is divided into two, 2020 and 2021; and under 2020 we again have the two things that used to be columns — the population and the area — and likewise under 2021. This looks like a hierarchy, right? Los Angeles has its own hierarchy, Chicago will have its own; and where the data for a city-and-year combination isn't there, it simply isn't shown. But that's the point: decide what kind of data, and shaped in what way, you want to look at, so that it helps when you analyze it further, draw plots, or visualize it. It all comes down to the questions you want to answer or the exploratory analysis you want to perform; depending on that, you have to shape your data first, and only after shaping it can you get your desired outputs. Then I unstack it, and it goes back again to the original format, the one displayed here.
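The multi-level-index example can be sketched as follows (illustrative city/year data; note the deliberately missing LA 2021 combination, mirroring the gap mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["NY", "NY", "LA"],
    "Year": [2020, 2021, 2020],
    "Population": [8.3, 8.4, 3.9],
    "Area": [778, 778, 1302],
})

# City and Year become a two-level (hierarchical) index.
indexed = df.set_index(["City", "Year"])

# Stacking pushes Population/Area into a third index level.
stacked = indexed.stack()
print(stacked.loc[("NY", 2020, "Population")])  # 8.3

# Unstack reverses it back to the indexed table.
restored = stacked.unstack()
print(restored.shape)  # (3, 2)
```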
So that's how stacking and unstacking work. Concepts like melt, stack, unstack, wide format, and long format are very important to know, because ultimately our end goal is to perform some kind of exploratory analysis on the data and generate insights out of it. So that's all about pandas. I hope you were able to learn the concepts discussed in this pandas tutorial, and I'll see you in the next video. Until then, goodbye!
Heads up!
This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.