Data cleaning = making raw data ready for analysis.
Let’s take an example dataset of daily sales:
Date | Product | Sales |
---|---|---|
01-08-25 | Prodct A | 100 |
02-08-25 | Product A | -50 |
03-08-25 | Product B | 200 |
04-08-25 | Product A | 100000 |
05-08-25 | Product C |
👉 Step 1: Check for spelling mistakes
-
“Prodct A” vs “Product A” → same product but typed wrong.
-
Excel fix: use Find & Replace or create a clean “Product List” and use Data Validation.
-
👉 Step 2: Handle missing values
-
05-08-25 has no sales value.
-
Option 1: Replace with 0 if it means “no sales.”
-
Option 2: Fill with average if it’s truly missing data.
-
👉 Step 3: Handle wrong/illogical values
-
Sales = -50 (negative sales) → maybe a refund? Need to confirm.
-
Sales = 100000 → looks like a typo (maybe it was 1000).
-
Excel fix: use Filters or Conditional Formatting to highlight outliers.
-
👉 Step 4: Consistent formatting
-
Dates should all be in one format (dd-mm-yy).
-
Product names consistent (no extra spaces).
✅ After cleaning, the table looks like this:
Date | Product | Sales |
---|---|---|
01-08-25 | Product A | 100 |
02-08-25 | Product A | 0 |
03-08-25 | Product B | 200 |
04-08-25 | Product A | 1000 |
05-08-25 | Product C | 150 |
Now it’s ready for analysis.
👉 Question for you:
If you see a row with negative sales, what would you do first — delete it, or try to understand why it’s negative?