✅ Step-by-Step Data Cleaning in Python with Live Output
If you’re working with messy data, this guide walks you through a clear and concise data cleaning workflow using Pandas. Each step includes a before/after result and a live print statement so you can see exactly what's happening in real time.
This format is ideal for beginners and data science learners who want to understand not just what to do, but why, and what effect each step has on the dataset.
Dataset: extended_sample_data.csv
Let's start cleaning!
🔹 Step 1: Load the Dataset
import pandas as pd
# Step 1: Load the dataset
df = pd.read_csv('extended_sample_data.csv')
print("Original Data:\n", df.head(10))
Explanation: We load the CSV file using pandas.read_csv() and immediately display the first 10 rows to inspect the structure and values.
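If you want a fuller picture of the structure, df.info() and df.shape complement head(); this is an optional check using the same df loaded above:
# Optional: quick structural overview of the loaded data
print("\nShape (rows, columns):", df.shape)
print("\nColumn types and non-null counts:")
df.info()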
🔹 Step 2: Check for Missing Values
# Step 2: Check for missing values
print("\nMissing Values:\n", df.isnull().sum())
Explanation: Identifying null values helps you decide whether to drop or fill them based on context.
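If the raw counts are hard to interpret, you can also express missing values as a percentage of all rows; a small optional sketch building on the same df:
# Optional: missing values as a percentage of all rows
missing_pct = df.isnull().mean() * 100
print("\nMissing Values (%):\n", missing_pct.round(1))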
🔹 Step 3: Handle Missing Data
Option A: Drop Missing Values
# Drop rows with missing values
df_dropped = df.dropna()
print("\nData after dropping missing values:\n", df_dropped.head(10))
Option B: Fill Missing Values
# Fill missing values with 0
df_filled = df.fillna(0)
print("\nData after filling missing values with 0:\n", df_filled.head(10))
Explanation: You can either drop rows with any NaN values or fill them with a placeholder like 0, depending on your analysis goals.
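Filling every gap with 0 is the simplest choice, but it can distort numeric columns. A common middle ground is to fill numeric columns with their median instead; here is a hedged sketch that works on whatever numeric columns your CSV happens to contain:
# Option C (sketch): fill numeric columns with their median instead of 0
numeric_cols = df.select_dtypes(include='number').columns
df_median_filled = df.copy()
df_median_filled[numeric_cols] = df_median_filled[numeric_cols].fillna(df_median_filled[numeric_cols].median())
print("\nData after filling numeric NaNs with the median:\n", df_median_filled.head(10))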
🔹 Step 4: Rename Columns
# Rename columns: strip, lowercase, replace spaces
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("\nRenamed Columns:\n", df.columns.tolist())
Explanation: Clean, consistent column names are essential for reliable processing, especially when programmatically referencing columns later on.
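If a few columns also need clearer names (not just consistent formatting), df.rename() takes an explicit mapping. The column names below are hypothetical placeholders, not columns known to exist in extended_sample_data.csv:
# Rename specific columns with an explicit mapping (hypothetical names)
df_renamed = df.rename(columns={'emp_name': 'employee_name', 'dept': 'department'})
print("\nColumns after explicit rename:\n", df_renamed.columns.tolist())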
🔹 Step 5: Remove Duplicate Rows
# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("\nData after removing duplicates:\n", df_no_duplicates.head(10))
Explanation: Duplicates can skew your analysis, for example by inflating counts and averages. This step ensures each row appears only once.
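Before dropping anything, it can be useful to count how many duplicate rows exist; pandas can also treat rows as duplicates based on a subset of columns, if only certain fields define a duplicate (the subset column below is a hypothetical example):
# Count fully duplicated rows before removing them
print("\nNumber of duplicate rows:", df.duplicated().sum())
# Optional: count duplicates based on selected columns only (hypothetical column name)
# print("Duplicates by 'name':", df.duplicated(subset=['name']).sum())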
🔹 Step 6: Convert Date Columns
# Convert 'joining_date' to datetime
df_no_duplicates['joining_date'] = pd.to_datetime(df_no_duplicates['joining_date'], errors='coerce')
print("\nData with converted date column:\n", df_no_duplicates[['joining_date']].head(10))
Explanation: Always convert date columns to datetime objects for easier filtering, sorting, and time-based operations.
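Once the column is a real datetime, time-based operations become one-liners. A small sketch assuming joining_date was converted as above; the cutoff date is arbitrary:
# Example time-based operations on the converted column
df_no_duplicates['joining_year'] = df_no_duplicates['joining_date'].dt.year
recent = df_no_duplicates[df_no_duplicates['joining_date'] >= '2020-01-01']
print("\nRows joining on or after 2020-01-01:\n", recent[['joining_date', 'joining_year']].head(10))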
🎯 Final Thoughts
With just a few lines of Python, you've learned how to:
- Load and inspect data
- Handle missing values
- Clean column names
- Remove duplicates
- Convert date columns
This type of transparent, step-by-step cleaning not only improves your data, it also builds your confidence and understanding.
💡 Pro Tip:
Turn this script into a template for future projects. Data cleaning is often said to be 80% of the work in data science, so a reusable workflow pays off quickly.
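One way to do that is to wrap the steps above in a single reusable function. This is a minimal sketch under this post's assumptions (fill missing values with 0, parse the date columns you name); adapt it per project:
def clean_data(path, date_cols=None):
    """Reusable cleaning template based on the steps in this post."""
    df = pd.read_csv(path)
    # Standardise column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Parse the requested date columns
    date_cols = date_cols or []
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    # Fill remaining missing values with 0, leaving parsed dates untouched
    other_cols = df.columns.difference(date_cols)
    df[other_cols] = df[other_cols].fillna(0)
    return df

cleaned = clean_data('extended_sample_data.csv', date_cols=['joining_date'])
print("\nCleaned data preview:\n", cleaned.head(10))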
Next Steps
- Visualize cleaned data with Plotly
- Export cleaned data to CSV with df.to_csv() (see the snippet below)
- Integrate into a Dash dashboard
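For the CSV export mentioned above, a one-liner is enough; the output filename is arbitrary, and index=False keeps the row index out of the file:
# Export the cleaned DataFrame to a new CSV file
df_no_duplicates.to_csv('cleaned_data.csv', index=False)
print("\nSaved cleaned data to cleaned_data.csv")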
If you found this helpful, share it and let me know in the comments! 🔥 Want a downloadable script version or video walkthrough? Let me know!