Comprehensive Guide to Python Pandas: Data Inspection, Cleaning, and Transformation

Introduction to Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It primarily works with two data structures:

  • DataFrame: A two-dimensional labeled data structure similar to a spreadsheet or SQL table.
  • Series: A one-dimensional labeled array, similar to a list or a single column in a table.
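As a minimal sketch of the two structures described above, the snippet below builds one Series and one DataFrame by hand. The column names and values (title, budget, revenue, vote_average) are made up for illustration and do not come from any particular dataset.

```python
import pandas as pd

# A Series: one-dimensional, each value carries an index label.
ratings = pd.Series([7.2, 6.5, 8.1], index=["a", "b", "c"], name="vote_average")

# A DataFrame: two-dimensional, its columns share one row index.
# The data here is hypothetical sample data.
movies = pd.DataFrame(
    {
        "title": ["Alpha", "Beta", "Gamma"],
        "budget": [100, 250, 80],
        "revenue": [300, 200, 160],
    }
)

# Selecting a single column of a DataFrame gives back a Series.
budget = movies["budget"]

print(ratings["b"])            # 6.5
print(type(budget).__name__)   # Series
print(movies.shape)            # (3, 3)
```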

Installing and Importing Pandas

  • Pandas is not part of the Python standard library, so it must be installed separately; it comes pre-installed in Google Colab.
  • To install on your machine: pip install pandas
  • Import using: import pandas as pd

Data Inspection

  • Reading Data: Use pd.read_csv() for CSV files or pd.read_excel() for Excel files.
  • Preview Data: df.head() shows the first 5 rows; df.head(10) shows the first 10 rows.
  • Tail Data: df.tail() shows the last 5 rows.
  • Data Info: df.info() displays data types, non-null counts, and memory usage.
  • Summary Statistics: df.describe() provides statistics for numeric columns; use df.describe(include='all') to include non-numeric data.
  • Data Types: df.dtypes lists data types of each column.
  • Index and Columns: df.index shows index range; df.columns lists column names.
  • Shape: df.shape returns the number of rows and columns.
  • Null Values: df.isnull().sum() counts null values per column.
  • Unique Values: df['column'].unique() lists unique values; df['column'].nunique() counts unique values.
  • Value Counts: df['column'].value_counts() counts occurrences of each unique value.
  • Random Sample: df.sample(n=8) returns 8 random rows.
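The inspection calls listed above can be tried on a small in-memory table. This is only a sketch: the CSV text is invented sample data, and it is read through io.StringIO so that no file on disk is assumed. With a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; empty fields are read as missing values.
csv_text = """title,genre,vote_average
Alpha,Drama,7.2
Beta,Comedy,6.5
Gamma,Drama,
Delta,,8.1
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))        # first 2 rows
print(df.tail(2))        # last 2 rows
df.info()                # dtypes, non-null counts, memory usage
print(df.describe())     # statistics for numeric columns only
print(df.dtypes)

n_rows, n_cols = df.shape                  # 4 rows, 3 columns
null_counts = df.isnull().sum()            # missing values per column
genre_counts = df["genre"].value_counts()  # missing values are skipped by default
n_genres = df["genre"].nunique()           # 2 (Drama, Comedy)

# random_state makes the random sample repeatable.
sample = df.sample(n=2, random_state=0)
```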

Data Selection and Indexing

  • Selecting Columns: Use df['column'] for single or df[['col1', 'col2']] for multiple columns.
  • Implicit Indexing (iloc): Zero-based positional indexing; supports negative indexing.
  • Explicit Indexing (loc): Uses the DataFrame’s index labels; can be customized.
  • Changing Explicit Index: Assign a new index directly with df.index = range(1, len(df)+1), or use df.set_index('column'), which returns a new DataFrame indexed by that column.
  • Row Selection: Use df.iloc[1:10] for rows by position (the end position is excluded) or df.loc[1:10] for rows by index label (the end label is included).
  • Conditional Selection: Filter rows using boolean masks, e.g., df[df['vote_average'] > 6].
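The difference between positional and label-based selection is easiest to see when the index does not start at 0. The sketch below uses invented data and assumes a column named vote_average; it relabels the rows 1 to 5 and then slices both ways.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "title": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"],
        "vote_average": [7.2, 5.9, 6.4, 8.1, 4.8],
    }
)

# Replace the default labels 0..4 with 1..5.
df.index = range(1, len(df) + 1)

# iloc is positional and excludes the end: positions 1 and 2.
by_position = df.iloc[1:3]      # Beta, Gamma

# loc is label-based and includes the end: labels 1, 2 and 3.
by_label = df.loc[1:3]          # Alpha, Beta, Gamma

# Negative positions count from the end.
last_row = df.iloc[-1]          # Epsilon

# A boolean mask keeps only the rows where the condition is True.
highly_rated = df[df["vote_average"] > 6]               # Alpha, Gamma, Delta

# loc accepts a mask and a column label together.
top_titles = df.loc[df["vote_average"] > 6, "title"]
```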

Data Cleaning

  • Identifying Nulls: df.isnull() returns a boolean DataFrame.
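  • Filling Nulls: Use df['column'].fillna(value), where value is typically the column's mean, median, or mode. fillna() returns a new Series, so assign the result back to the column.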
  • Dropping Nulls: df.dropna() removes rows with nulls; df.dropna(axis=1) removes columns.
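  • Dropping Columns/Rows: Use df.drop('column', axis=1) to drop a column or df.drop(index_label) to drop a row by its index label.
  • Handling Duplicates: Use df.duplicated() to find duplicates and df.drop_duplicates() to remove them. The keep parameter accepts 'first', 'last', or False (which drops every copy of a duplicated row).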
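The cleaning steps above might be combined roughly as follows. This is a sketch on invented data, and using the median as the fill value is just one of the options mentioned; the right choice depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one duplicated row and one row with nulls.
df = pd.DataFrame(
    {
        "title": ["Alpha", "Beta", "Beta", "Gamma"],
        "budget": [100.0, 250.0, 250.0, np.nan],
        "genre": ["Drama", "Comedy", "Comedy", None],
    }
)

# True for every row that repeats an earlier row.
dup_mask = df.duplicated()

# Keep the first copy of each duplicated row (original labels 0, 1, 3 remain).
deduped = df.drop_duplicates(keep="first")

# Fill missing budgets with the column median; assign the result back.
filled = deduped.copy()
filled["budget"] = filled["budget"].fillna(filled["budget"].median())

# Drop rows that contain any null, or columns that contain any null.
no_null_rows = deduped.dropna()
no_null_cols = deduped.dropna(axis=1)

# Drop a column by name, or a row by its index label.
without_genre = deduped.drop("genre", axis=1)
without_first = deduped.drop(0)
```

None of these calls modify df in place; each returns a new object, which is why the results are stored under new names.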

Data Transformation

  • Changing Data Types: df['column'] = df['column'].astype(float).
  • Renaming Columns: df.rename(columns={'old_name': 'new_name'}, inplace=True).
  • Adding Columns: Create new columns by calculations, e.g., df['profit'] = df['revenue'] - df['budget'].
  • Modifying Columns: Round values using df['column'] = df['column'].round(1).
  • Adding Rows: Convert a dictionary to DataFrame and concatenate with pd.concat([df, new_row_df], ignore_index=True).
  • Modifying Rows: Access a row by its index label and assign new values, e.g., df.loc[index_label, 'column'] = value.
  • Setting and Resetting Index: Use df.set_index('column') and df.reset_index().
  • Grouping and Aggregation: Use df.groupby('column').agg({'col1':'mean', 'col2':'sum'}).
  • Applying Functions: Define a function and apply it row-wise with df.apply(func, axis=1).
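The transformations listed above can be chained on one small table. The sketch below uses invented data; the column names, the vote_avg starting name, and the label_outcome helper are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical sample data; budget is stored as text on purpose.
df = pd.DataFrame(
    {
        "title": ["Alpha", "Beta", "Gamma"],
        "genre": ["Drama", "Comedy", "Drama"],
        "budget": ["100", "250", "80"],
        "revenue": [300.0, 200.0, 160.0],
        "vote_avg": [7.24, 6.51, 8.08],
    }
)

# Change a data type, rename a column, derive a column, round a column.
df["budget"] = df["budget"].astype(float)
df = df.rename(columns={"vote_avg": "vote_average"})
df["profit"] = df["revenue"] - df["budget"]
df["vote_average"] = df["vote_average"].round(1)

# Add a row: wrap a dictionary in a one-row DataFrame and concatenate.
new_row_df = pd.DataFrame(
    [{"title": "Delta", "genre": "Comedy", "budget": 50.0,
      "revenue": 40.0, "vote_average": 5.0, "profit": -10.0}]
)
df = pd.concat([df, new_row_df], ignore_index=True)

# Modify a row: address the cell by index label and column name.
df.loc[3, "revenue"] = 45.0
df.loc[3, "profit"] = df.loc[3, "revenue"] - df.loc[3, "budget"]

# Group by genre and aggregate each column with its own function.
by_genre = df.groupby("genre").agg({"vote_average": "mean", "profit": "sum"})


def label_outcome(row):
    """Hypothetical helper: classify a movie by the sign of its profit."""
    return "hit" if row["profit"] > 0 else "flop"


# axis=1 passes one row at a time to the function.
df["outcome"] = df.apply(label_outcome, axis=1)
print(df)
print(by_genre)
```

For simple arithmetic like the profit column, operating on whole columns is usually faster than a row-wise apply; apply is shown here because the logic involves a Python conditional.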

Data Reshaping

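  • Merging DataFrames: Use pd.merge(left, right, on='key', how='inner') for SQL-like joins; how can be 'inner', 'left', 'right', or 'outer'.
  • Wide vs Long Format: Wide format has a separate column for each variable (for example, one column per month); long format stores one observation per row, with one column naming the variable and another holding its value.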
  • Pivot Table: Use pd.pivot_table(df, index='product', columns='month', values='sales', aggfunc='sum') to convert long to wide format.
  • Melt: Use pd.melt(df, id_vars=['product'], var_name='month', value_name='sales') to convert wide to long format.
  • Stack and Unstack: stack() moves the column labels into the innermost level of the row index, producing a hierarchical index; unstack() reverses this by moving an index level back into the columns.
  • Multi-level Indexing: Set multiple columns as index to create hierarchical indexing for complex data analysis.
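The reshaping operations above can be seen as a round trip between long and wide layouts. The sketch below uses invented sales data; the prices table and all values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical long-format data: one row per (product, month) observation.
long_df = pd.DataFrame(
    {
        "product": ["A", "A", "B", "B"],
        "month": ["Jan", "Feb", "Jan", "Feb"],
        "sales": [10, 20, 30, 40],
    }
)
prices = pd.DataFrame({"product": ["A", "B"], "price": [1.5, 2.0]})

# Left join: keep every row of long_df and attach the matching price.
merged = pd.merge(long_df, prices, on="product", how="left")

# Long to wide: one row per product, one column per month.
wide = pd.pivot_table(
    long_df, index="product", columns="month", values="sales", aggfunc="sum"
)

# Wide to long: reset_index() turns "product" back into a regular column first.
back_to_long = pd.melt(
    wide.reset_index(), id_vars=["product"], var_name="month", value_name="sales"
)

# stack() moves the month labels into the row index (a two-level index);
# unstack() moves them back into the columns.
stacked = wide.stack()
unstacked = stacked.unstack()

# Multi-level indexing: use two columns as the index and look up by tuple.
indexed = long_df.set_index(["product", "month"])
jan_a = indexed.loc[("A", "Jan"), "sales"]   # 10
```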

Conclusion

This tutorial provides a comprehensive overview of Pandas for beginners, covering essential operations from data inspection to advanced reshaping techniques. Mastering these concepts enables efficient data analysis and manipulation in Python.

Heads up!

This summary and transcript were automatically generated using AI with the Free YouTube Transcript Summary Tool by LunaNotes.
