Data Cleaning and Preprocessing
Activities & Teaching Strategies
Active learning works for data cleaning because students need to wrestle with real messy data to see how decisions affect outcomes. Year 10 students remember techniques better when they debate trade-offs between deletion and imputation, plot outliers to test their assumptions, and build pipelines they can explain. This hands-on approach builds both technical skill and critical judgment they will use in later data science tasks.
Learning Objectives
1. Identify and classify different types of data inconsistencies and missing value patterns within a given dataset.
2. Apply imputation techniques, such as mean or median substitution, to handle missing data points in a spreadsheet or data table.
3. Evaluate the effect of data outliers on summary statistics like the mean and median, and on correlation coefficients.
4. Design a systematic strategy for cleaning a messy dataset, outlining the steps for handling missing values, outliers, and inconsistencies.
5. Justify the necessity of data cleaning and preprocessing for ensuring the accuracy and reliability of data analysis results.
Pairs Challenge: Missing Data Strategy
Provide pairs with a sales dataset in which 20% of the values are missing. Students discuss and apply two strategies, deletion and imputation, then compare the effect of each on summary statistics. Pairs share one key insight with the class.
Design a strategy to handle missing data in a large dataset.
Facilitation Tip: During the Pairs Challenge, circulate and ask each pair to explain why they plan to use imputation or deletion before they touch any data.
Setup: Groups at tables with problem materials
Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric
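For classes working in Python rather than a spreadsheet, the comparison in this challenge might look like the minimal sketch below. The `sales` list, the helper names `drop_missing` and `impute`, and the single extreme value of 900 are all hypothetical and chosen only to make the differences visible.

```python
from statistics import mean, median

# Hypothetical weekly sales totals; None marks a missing entry (2 of 10 = 20%).
sales = [120, 135, None, 150, 128, None, 142, 900, 131, 138]


def drop_missing(values):
    """Deletion: keep only the rows that have a recorded value."""
    return [v for v in values if v is not None]


def impute(values, fill):
    """Imputation: replace every missing entry with one fill value."""
    return [fill if v is None else v for v in values]


observed = drop_missing(sales)
strategies = {
    "deletion": observed,
    "mean imputation": impute(sales, mean(observed)),
    "median imputation": impute(sales, median(observed)),
}

# Print the row count and summary statistics side by side for comparison.
for name, cleaned in strategies.items():
    print(f"{name:<18} rows={len(cleaned):>2}  "
          f"mean={mean(cleaned):.1f}  median={median(cleaned):.1f}")
```

Deletion shrinks the dataset from 10 rows to 8, while both imputation strategies keep all 10. Mean imputation leaves the mean unchanged by construction, which is a useful prompt for asking pairs what the filled-in values actually add.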
Small Groups: Outlier Detection Lab
Groups receive a housing price dataset with planted outliers. They use box plots and z-scores to identify anomalies, decide whether to remove or keep each one, and recalculate averages. Groups present their choices and rationale.
Evaluate the impact of data outliers on statistical analysis.
Facilitation Tip: In the Outlier Detection Lab, require students to sketch a quick boxplot by hand first, then compare it to the digital version to spot discrepancies.
Setup: Groups at tables with problem materials
Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric
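The two detection methods named in this lab can be sketched in a few lines of Python. The `prices` list, the function names, and the z-score cut-off of 2.5 are assumptions made for illustration; the IQR rule uses the usual box-plot fences of 1.5 times the interquartile range.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical house prices in thousands; 2500 is the planted outlier.
prices = [310, 295, 330, 305, 320, 315, 300, 325, 290, 2500]


def zscore_outliers(values, threshold=2.5):
    """Flag values more than `threshold` sample standard deviations from the mean.

    The cut-off of 2.5 is an assumption: in a sample of 10 no z-score can
    exceed about 2.85, so the common cut-off of 3 would never flag anything.
    """
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values beyond the box-plot fences, Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


flagged = sorted(set(zscore_outliers(prices)) | set(iqr_outliers(prices)))
kept = [p for p in prices if p not in flagged]

print("flagged:", flagged)
print(f"with outliers:    mean={mean(prices):.1f}  median={median(prices):.1f}")
print(f"without outliers: mean={mean(kept):.1f}  median={median(kept):.1f}")
```

Software packages do not all compute quartiles the same way, so a hand-drawn box plot and a digital one may place the fences slightly differently. That is one likely source of the discrepancies students are asked to spot.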
Whole Class: Inconsistency Cleanup Relay
Project a large dataset with format errors like mixed date styles. Teams take turns correcting one row or column, passing control after each fix. Class votes on the cleanest final version.
Justify the importance of data cleaning before any data analysis.
Facilitation Tip: During the Inconsistency Cleanup Relay, give each group a unique typo so they experience how real-world data entry errors vary from dataset to dataset.
Setup: Groups at tables with problem materials
Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric
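If the relay is run in code, mixed date styles can be normalised by trying a short list of known formats in turn. The sketch below is one possible approach, not a complete parser: `raw_dates`, `KNOWN_FORMATS`, and `normalize_date` are hypothetical names, month names are assumed to be in English, and slash-separated dates are assumed to be day/month/year.

```python
from datetime import datetime

# Hypothetical entries showing the mixed date styles a relay dataset might contain.
raw_dates = ["2024-03-05", "05/03/2024", "March 5, 2024", "5 Mar 2024", "n/a"]

# Accepted input formats, tried in order. Assumption: slash dates are day/month/year.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


cleaned = [normalize_date(d) for d in raw_dates]
unparsed = [d for d, c in zip(raw_dates, cleaned) if c is None]

print(cleaned)
print("needs manual review:", unparsed)
```

The entry `05/03/2024` is a good discussion point: read as month/day/year it would mean 3 May rather than 5 March, so the format assumption is itself a cleaning decision that teams should justify.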
Individual: Preprocessing Pipeline
Students select a public dataset, document steps to clean missing values and outliers, then generate a cleaned version. They reflect on changes in a one-page report for peer review.
Design a strategy to handle missing data in a large dataset.
Facilitation Tip: In the Preprocessing Pipeline, insist students write a one-sentence justification for every transformation before they run the code or calculation.
Setup: Groups at tables with problem materials
Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric
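One way to make the "justify every transformation" habit concrete is to store each justification next to the step it explains. The following is a minimal sketch under stated assumptions: the `raw` readings, the meaning of the `-999` code, the plausibility limit of 1000, and all function names are hypothetical.

```python
from statistics import median

# Hypothetical sensor readings: None is missing, -999 is assumed to be a
# "no reading" code, and 4100 is assumed to be a typing error.
raw = [41, 39, None, 44, -999, 40, 4100, 42]


def sentinel_to_missing(values):
    """Convert the -999 placeholder into an explicit missing value."""
    return [None if v == -999 else v for v in values]


def drop_implausible(values):
    """Remove rows above the assumed plausible maximum of 1000."""
    return [v for v in values if v is None or v <= 1000]


def fill_with_median(values):
    """Replace missing values with the median of the recorded ones."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Each step pairs a one-sentence justification with the transformation it explains.
steps = [
    ("Treat -999 as missing because it is a placeholder, not a measurement.",
     sentinel_to_missing),
    ("Drop values above 1000 because they are outside the plausible range.",
     drop_implausible),
    ("Fill gaps with the median because it resists extreme values.",
     fill_with_median),
]


def run_pipeline(values, steps):
    """Apply each step in order and log (justification, rows before, rows after)."""
    log = []
    for justification, transform in steps:
        rows_before = len(values)
        values = transform(values)
        log.append((justification, rows_before, len(values)))
    return values, log


cleaned, log = run_pipeline(raw, steps)
for justification, rows_before, rows_after in log:
    print(f"{rows_before} -> {rows_after} rows: {justification}")
print(cleaned)
```

The printed log doubles as a draft for the one-page report, and reordering the steps (for example, filling with the median before dropping 4100) gives students a quick way to see that order matters.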
Teaching This Topic
Teachers should avoid presenting cleaning as a mechanical checklist. Instead, frame each technique as a strategic move with consequences. Students tend to grasp outliers better when they plot real data and see how a single point can pull a mean or bend a trend line. Encourage students to document their reasoning in margin notes so they can revisit and revise decisions later.
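A short demonstration can show how much a single point matters before students plot anything. In the sketch below the study-hours data and the mistyped score (78 entered as 7) are invented for illustration, and `pearson` is a hand-written helper for the Pearson correlation coefficient rather than a library call.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread


# Hypothetical data: hours studied and quiz scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 61, 67, 72, 78]

# The same data with one entry error: the last score typed as 7 instead of 78.
scores_with_typo = scores[:-1] + [7]

for label, ys in [("clean", scores), ("one typo", scores_with_typo)]:
    print(f"{label:<9} mean={mean(ys):.1f}  median={median(ys):.1f}  "
          f"r={pearson(hours, ys):.2f}")
```

With these numbers the correlation flips from strongly positive to negative, and the mean moves much further than the median. Both effects come from changing one value out of six.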
What to Expect
By the end of these activities, students will confidently choose appropriate cleaning methods, justify their choices with evidence, and evaluate how each step changes summary statistics and visual trends. They will move from asking 'How do I clean this?' to 'Why is this cleaning decision better than the alternatives?'
Watch Out for These Misconceptions
Common Misconception: During the Pairs Challenge, watch for students defaulting to deletion without considering the impact on dataset size.
What to Teach Instead
Prompt pairs to calculate how many rows they would lose and what summary statistics would shift before they choose a method. Ask them to sketch two histograms side-by-side to visualize the difference.
Common Misconception: During the Outlier Detection Lab, watch for students labeling any extreme value as an error without context.
What to Teach Instead
Have students read the dataset’s metadata aloud before they plot, forcing them to ask whether the extreme reflects a rare event or a data entry mistake. Require them to write a one-sentence justification for removing or keeping each outlier.
Common Misconception: During the Inconsistency Cleanup Relay, watch for students fixing typos without checking whether the error affects downstream analysis.
What to Teach Instead
After each typo fix, ask students to recalculate the mean and standard deviation to see if the change matters. Use a quick peer check so they compare their revised statistics with another group.
Assessment Ideas
After the Pairs Challenge, give each pair one minute to describe to the class one missing-data scenario where they would choose mean imputation instead of deletion, with a one-sentence rationale.
During the Outlier Detection Lab, pause after the first dataset. Ask students to turn to a partner and give two reasons why an outlier might be kept, then share out with the class.
After the Inconsistency Cleanup Relay, collect each group’s revised dataset and one sentence explaining the inconsistency they fixed and why it mattered for analysis.
Extensions & Scaffolding
- Challenge: Ask early finishers to design a new preprocessing step for a dataset with mixed units (e.g., temperatures in Celsius and Fahrenheit).
- Scaffolding: Provide pre-labeled histograms for students who struggle to spot inconsistencies in the relay activity.
- Deeper exploration: Have students research how data cleaning pipelines differ in industry (e.g., finance vs. healthcare) and present a one-slide comparison.
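For the mixed-units challenge above, early finishers working in Python could start from a sketch like this one. It assumes each reading arrives as a hypothetical `(value, unit)` pair and that Celsius is the target unit; readings with an unrecognised unit are set aside for review rather than guessed at.

```python
# Hypothetical temperature readings recorded as (value, unit) pairs.
readings = [(21.5, "C"), (70.7, "F"), (19.0, "c"), (68.0, " F"), (300.0, "K")]


def to_celsius(value, unit):
    """Convert a reading to Celsius; raise ValueError for an unknown unit."""
    unit = unit.strip().upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    raise ValueError(f"unrecognised unit: {unit!r}")


cleaned = []   # readings converted to Celsius, rounded to one decimal place
rejected = []  # readings set aside for manual review

for value, unit in readings:
    try:
        cleaned.append(round(to_celsius(value, unit), 1))
    except ValueError:
        rejected.append((value, unit))

print("celsius:", cleaned)
print("needs review:", rejected)
```

Normalising the unit label first (trimming spaces, ignoring case) is itself a consistency fix, which links this extension back to the relay activity.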
Key Vocabulary
| Term | Definition |
| --- | --- |
| Missing Values | Data points that are absent from a dataset. They can result from errors in data collection or entry, or from information that was never recorded. |
| Outliers | Data points that significantly differ from other observations in a dataset. They can be caused by measurement errors or represent genuine, extreme values. |
| Data Imputation | The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data. |
| Data Consistency | Ensuring that data values within a dataset are uniform and do not contradict each other. This includes checking for correct formats, units, and logical relationships. |
| Z-score | A statistical measurement that describes a value's relationship to the mean of a group of values, measured in standard deviations. It is commonly used to identify outliers. |