Step-by-Step Data Cleaning in Python with Live Output

If you're working with messy data, this guide will walk you through a clear and concise data cleaning workflow using Pandas. Each step includes a before/after result and a live print statement to help you see exactly what's happening in real time.

This format is ideal for beginners and data science learners who want to understand not just what to do, but why and what effect it has on the dataset.

๐Ÿ“ Dataset: extended_sample_data.csv

Let's start cleaning!


🔹 Step 1: Load the Dataset

import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('extended_sample_data.csv')
print("Original Data:\n", df.head(10))

Explanation: We load the CSV file using pandas.read_csv() and immediately display the first 10 rows to inspect the structure and values.
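
Beyond the first rows, a quick structural check often helps before any cleaning decisions. The sketch below uses only standard Pandas inspection methods; the shapes and types it prints will of course depend on your actual file.

# Optional: inspect shape, column types, and a statistical summary
print("\nShape (rows, columns):", df.shape)
print("\nColumn data types:\n", df.dtypes)
print("\nSummary statistics:\n", df.describe(include='all'))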


🔹 Step 2: Check for Missing Values

# Step 2: Check for missing values
print("\nMissing Values:\n", df.isnull().sum())

Explanation: Identifying null values helps you decide whether to drop or fill them based on context.
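
A small follow-up sketch: expressing missing values as a percentage per column often makes the drop-or-fill decision easier. This uses only standard Pandas calls and makes no assumptions about specific column names.

# Optional: show the percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print("\nMissing values (% per column):\n", missing_pct.round(2))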


🔹 Step 3: Handle Missing Data

Option A: Drop Missing Values

# Drop rows with missing values
df_dropped = df.dropna()
print("\nData after dropping missing values:\n", df_dropped.head(10))

Option B: Fill Missing Values

# Fill missing values with 0
df_filled = df.fillna(0)
print("\nData after filling missing values with 0:\n", df_filled.head(10))

Explanation: You can either drop rows with any NaN values or fill them with a placeholder like 0, depending on your analysis goals.
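
Filling every column with 0 is rarely ideal for numeric data. As a middle-ground sketch, assuming your dataset has at least some numeric columns, you can fill only the numeric columns with their median and leave the rest untouched:

# Option C (sketch): fill numeric columns with their median instead of 0
numeric_cols = df.select_dtypes(include='number').columns
df_filled_median = df.fillna({col: df[col].median() for col in numeric_cols})
print("\nData after median-filling numeric columns:\n", df_filled_median.head(10))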


🔹 Step 4: Rename Columns

# Rename columns: strip, lowercase, replace spaces
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("\nRenamed Columns:\n", df.columns.tolist())

Explanation: Clean, consistent column names are essential for reliable processing, especially when programmatically referencing columns later on.
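
If you only need to rename one or two columns rather than normalizing everything, pandas.DataFrame.rename works too. The column names below are purely illustrative placeholders and may not exist in extended_sample_data.csv; by default rename silently ignores names it cannot find.

# Optional sketch: rename individual columns explicitly
# ('old_column_name' / 'new_column_name' are hypothetical placeholders)
df_renamed = df.rename(columns={'old_column_name': 'new_column_name'})
print("\nColumns after explicit rename:\n", df_renamed.columns.tolist())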


🔹 Step 5: Remove Duplicate Rows

# Remove duplicates
df_no_duplicates = df.drop_duplicates()
print("\nData after removing duplicates:\n", df_no_duplicates.head(10))

Explanation: Duplicates can skew your analysis. This step ensures your dataset is unique and accurate.
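
Before dropping anything, it can help to count how many duplicates exist, and, if rows only count as duplicates when a particular key repeats, to deduplicate on a subset. The 'id' column in the commented line is hypothetical; replace it with a real key column from your data before running it.

# Count fully duplicated rows before removing them
print("\nNumber of fully duplicated rows:", df.duplicated().sum())

# Sketch: deduplicate on a key column only ('id' is a hypothetical column name)
# df_unique_by_id = df.drop_duplicates(subset=['id'], keep='first')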


🔹 Step 6: Convert Date Columns

# Convert 'joining_date' to datetime
df_no_duplicates['joining_date'] = pd.to_datetime(df_no_duplicates['joining_date'], errors='coerce')
print("\nData with converted date column:\n", df_no_duplicates[['joining_date']].head(10))

Explanation: Always convert date columns to datetime objects for easier filtering, sorting, and time-based operations.
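
Once the column is a true datetime, the .dt accessor unlocks time-based features. A small sketch, assuming the conversion in Step 6 has already run on the same joining_date column:

# Optional: derive time-based features from the converted column
df_no_duplicates['joining_year'] = df_no_duplicates['joining_date'].dt.year
df_no_duplicates['joining_month'] = df_no_duplicates['joining_date'].dt.month
print("\nDerived date features:\n", df_no_duplicates[['joining_date', 'joining_year', 'joining_month']].head(10))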


🎯 Final Thoughts

With just a few lines of Python, you've learned how to:

  • Load and inspect data
  • Handle missing values
  • Clean column names
  • Remove duplicates
  • Convert date columns

This type of transparent, step-by-step cleaning not only improves your data; it also builds your confidence and understanding.

💡 Pro Tip:

Turn this script into a template for future projects. Data cleaning is often said to be around 80% of the work in data science.
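
One way to build that template, sketched below from the same steps in this guide, is to wrap the cleaning logic in a single reusable function. The fill strategy and file path used here are just example defaults, not fixed requirements.

import pandas as pd

def clean_data(path, fill_value=0):
    # Reusable cleaning template based on the steps above (defaults are examples)
    df = pd.read_csv(path)
    # Normalize column names: strip, lowercase, replace spaces
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
    # Remove duplicate rows and fill missing values
    df = df.drop_duplicates()
    df = df.fillna(fill_value)
    # Convert the date column if it is present
    if 'joining_date' in df.columns:
        df['joining_date'] = pd.to_datetime(df['joining_date'], errors='coerce')
    return df

cleaned = clean_data('extended_sample_data.csv')
print(cleaned.head(10))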

📌 Next Steps

  • Visualize cleaned data with Plotly
  • Export cleaned data to CSV with df.to_csv()
  • Integrate into a Dash dashboard

If you found this helpful, share it and let me know in the comments! 🎥 Want a downloadable script version or video walkthrough? Let me know!
