✅ Step-by-Step Data Cleaning in Python with Live Output
If you’re working with messy data, this guide walks you through a clear and concise data cleaning workflow using Pandas. Each step includes a before/after result and a live print statement so you can see exactly what's happening in real time.
This format is ideal for beginners and data science learners who want to understand not just what to do, but why, and what effect each step has on the dataset.
Dataset: extended_sample_data.csv
Let's start cleaning!
🔹 Step 1: Load the Dataset
import pandas as pd
# Step 1: Load the dataset
df = pd.read_csv('extended_sample_data.csv')
print("Original Data:\n", df.head(10))
Explanation: We load the CSV file using pandas.read_csv() and immediately display the first 10 rows to inspect the structure and values.
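If you want a fuller picture of the structure, df.info() and df.shape complement head(); this is an optional check using the same df loaded above:
# Optional: quick structural overview of the loaded data
print("\nShape (rows, columns):", df.shape)
print("\nColumn types and non-null counts:")
df.info()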
🔹 Step 2: Check for Missing Values
# Step 2: Check for missing values
print("\nMissing Values:\n", df.isnull().sum())
Explanation: Identifying null values helps you decide whether to drop or fill them based on context.
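If the raw counts are hard to interpret, you can also express missing values as a percentage of all rows; a small optional sketch building on the same df:
# Optional: missing values as a percentage of all rows
missing_pct = df.isnull().mean() * 100
print("\nMissing Values (%):\n", missing_pct.round(1))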
🔹 Step 3: Handle Missing Data
Option A: Drop Missing Values
# Drop rows with missing values
df_dropped = df.dropna()
print("\nData after dropping missing values:\n", df_dropped.head(10))
Option B: Fill Missing Values
# Fill missing values with 0
df_filled = df.fillna(0)
print("\nData after filling missing values with 0:\n", df_filled.head(10))
Explanation: You can either drop rows with any NaN values or fill them with a placeholder like 0, depending on your analysis goals.
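Filling every gap with 0 is the simplest choice, but it can distort numeric columns. A common middle ground is to fill numeric columns with their median instead; here is a hedged sketch that works on whatever numeric columns your CSV happens to contain:
# Option C (sketch): fill numeric columns with their median instead of 0
numeric_cols = df.select_dtypes(include='number').columns
df_median_filled = df.copy()
df_median_filled[numeric_cols] = df_median_filled[numeric_cols].fillna(df_median_filled[numeric_cols].median())
print("\nData after filling numeric NaNs with the median:\n", df_median_filled.head(10))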
🔹 Step 4: Rename Columns
# Rename columns: strip, lowercase, replace spaces
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("\nRenamed Columns:\n", df.columns.tolist())
Explanation: Clean, consistent column names are essential for reliable processing, especially when programmatically referencing columns later on.
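If a few columns also need clearer names (not just consistent formatting), df.rename() takes an explicit mapping. The column names below are hypothetical placeholders, not columns known to exist in extended_sample_data.csv:
# Rename specific columns with an explicit mapping (hypothetical names)
df_renamed = df.rename(columns={'emp_name': 'employee_name', 'dept': 'department'})
print("\nColumns after explicit rename:\n", df_renamed.columns.tolist())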
🔹 Step 5: Remove Duplicate Rows
# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("\nData after removing duplicates:\n", df_no_duplicates.head(10))
Explanation: Duplicates can skew your analysis, for example by inflating counts and averages. This step ensures each row appears only once.
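Before dropping anything, it can be useful to count how many duplicate rows exist; pandas can also treat rows as duplicates based on a subset of columns, if only certain fields define a duplicate (the subset column below is a hypothetical example):
# Count fully duplicated rows before removing them
print("\nNumber of duplicate rows:", df.duplicated().sum())
# Optional: count duplicates based on selected columns only (hypothetical column name)
# print("Duplicates by 'name':", df.duplicated(subset=['name']).sum())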
🔹 Step 6: Convert Date Columns
# Convert 'joining_date' to datetime
df_no_duplicates['joining_date'] = pd.to_datetime(df_no_duplicates['joining_date'], errors='coerce')
print("\nData with converted date column:\n", df_no_duplicates[['joining_date']].head(10))
Explanation: Always convert date columns to datetime objects for easier filtering, sorting, and time-based operations.
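Once the column is a real datetime, time-based operations become one-liners. A small sketch assuming joining_date was converted as above; the cutoff date is arbitrary:
# Example time-based operations on the converted column
df_no_duplicates['joining_year'] = df_no_duplicates['joining_date'].dt.year
recent = df_no_duplicates[df_no_duplicates['joining_date'] >= '2020-01-01']
print("\nRows joining on or after 2020-01-01:\n", recent[['joining_date', 'joining_year']].head(10))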
🎯 Final Thoughts
With just a few lines of Python, you've learned how to:
- Load and inspect data
- Handle missing values
- Clean column names
- Remove duplicates
- Convert date columns
This type of transparent, step-by-step cleaning not only improves your data, it also builds your confidence and understanding.
💡 Pro Tip:
Turn this script into a template for future projects. Data cleaning is often said to be 80% of the work in data science, so a reusable workflow pays off quickly.
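One way to do that is to wrap the steps above in a single reusable function. This is a minimal sketch under this post's assumptions (fill missing values with 0, parse the date columns you name); adapt it per project:
def clean_data(path, date_cols=None):
    """Reusable cleaning template based on the steps in this post."""
    df = pd.read_csv(path)
    # Standardise column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Parse the requested date columns
    date_cols = date_cols or []
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors='coerce')
    # Fill remaining missing values with 0, leaving parsed dates untouched
    other_cols = df.columns.difference(date_cols)
    df[other_cols] = df[other_cols].fillna(0)
    return df

cleaned = clean_data('extended_sample_data.csv', date_cols=['joining_date'])
print("\nCleaned data preview:\n", cleaned.head(10))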
Next Steps
- Visualize cleaned data with Plotly
- Export cleaned data to CSV with df.to_csv() (see the snippet below)
- Integrate into a Dash dashboard
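For the CSV export mentioned above, a one-liner is enough; the output filename is arbitrary, and index=False keeps the row index out of the file:
# Export the cleaned DataFrame to a new CSV file
df_no_duplicates.to_csv('cleaned_data.csv', index=False)
print("\nSaved cleaned data to cleaned_data.csv")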
If you found this helpful, share it and let me know in the comments! 🔥 Want a downloadable script version or video walkthrough? Let me know!