Data Cleaning and Preprocessing: Activities & Teaching Strategies

Active learning works for data cleaning because students need to wrestle with real messy data to see how decisions affect outcomes. Year 10 students remember techniques better when they debate trade-offs between deletion and imputation, plot outliers to test their assumptions, and build pipelines they can explain. This hands-on approach builds both technical skill and critical judgment they will use in later data science tasks.

Year 10 · Technologies · 4 activities · 30 min to 50 min

Learning Objectives

  1. Identify and classify different types of data inconsistencies and missing value patterns within a given dataset.
  2. Apply imputation techniques, such as mean or median substitution, to handle missing data points in a spreadsheet or data table.
  3. Evaluate the effect of data outliers on summary statistics like the mean and median, and on correlation coefficients.
  4. Design a systematic strategy for cleaning a messy dataset, outlining the steps for handling missing values, outliers, and inconsistencies.
  5. Justify the necessity of data cleaning and preprocessing for ensuring the accuracy and reliability of data analysis results.

Pairs Challenge: Missing Data Strategy

Provide pairs with a sales-record dataset in which 20% of the values are missing. Students discuss and apply two strategies, such as deletion and imputation, then compare how each one changes the summary statistics. Pairs share one key insight with the class.
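For teachers who want a worked reference before class, the comparison might look something like the sketch below. The sales figures are invented for illustration, and `None` is assumed to mark a missing entry; a spreadsheet would do the same job.

```python
from statistics import mean, median

# Hypothetical daily sales figures; None marks a missing entry (2 of 10 = 20%).
sales = [120, 135, None, 128, 410, 131, None, 125, 138, 122]

def drop_missing(values):
    """Deletion: keep only the recorded values."""
    return [v for v in values if v is not None]

def impute(values, fill):
    """Imputation: replace each missing entry with a single fill value."""
    return [fill if v is None else v for v in values]

observed = drop_missing(sales)
mean_filled = impute(sales, mean(observed))
median_filled = impute(sales, median(observed))

# Print row count, mean, and median for each strategy side by side.
for label, data in [("deletion", observed),
                    ("mean imputation", mean_filled),
                    ("median imputation", median_filled)]:
    print(f"{label:18} n={len(data):2}  mean={mean(data):7.2f}  median={median(data):7.2f}")
```

Deletion shrinks the dataset, mean imputation keeps every row and leaves the mean unchanged, and median imputation pulls the mean toward the typical value. Which trade-off is acceptable is the discussion point for the pairs.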

Prepare & details

Design a strategy to handle missing data in a large dataset.

Facilitation Tip: During the Pairs Challenge, circulate and ask each pair to explain why they picked one strategy over the other before they touch any data.

Setup: Groups at tables with problem materials

Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

45 min · Small Groups

Small Groups: Outlier Detection Lab

Groups receive a housing price dataset with planted outliers. They use box plots and z-scores to identify anomalies, decide removal or retention, and recalculate averages. Groups present their choices and rationale.
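As a rough illustration of the two detection methods, the sketch below flags outliers with z-scores and with the box-plot rule (1.5 × IQR beyond the quartiles). The prices are made up, and the z-score cutoff of 2 is an assumption: with only ten values, no single point can reach a z-score above 3, so the common cutoff of 3 would flag nothing.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical house prices in thousands; 2500 is the planted outlier.
prices = [420, 455, 390, 510, 475, 440, 2500, 465, 430, 495]

def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def iqr_fences(values):
    """Lower and upper fences used to draw box-plot whiskers."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

Z_CUTOFF = 2  # assumed threshold for this small sample
z_flagged = [v for v, z in zip(prices, z_scores(prices)) if abs(z) > Z_CUTOFF]

low, high = iqr_fences(prices)
box_flagged = [v for v in prices if v < low or v > high]

# Recalculate the average with the flagged values removed.
mean_with = mean(prices)
mean_without = mean(v for v in prices if v not in box_flagged)

print("z-score flags:", z_flagged)
print("box-plot flags:", box_flagged)
print(f"mean with outlier: {mean_with:.1f}, without: {mean_without:.1f}")
```

Flagging a value is only the first step. Whether to remove it still depends on whether the context suggests an entry error or a genuine extreme sale.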

Prepare & details

Evaluate the impact of data outliers on statistical analysis.

Facilitation Tip: In the Outlier Detection Lab, require students to sketch a quick box plot by hand first, then compare it to the digital version to spot discrepancies.

Setup: Groups at tables with problem materials

Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

40 min · Whole Class

Whole Class: Inconsistency Cleanup Relay

Project a large dataset with format errors like mixed date styles. Teams take turns correcting one row or column, passing control after each fix. Class votes on the cleanest final version.
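For classes that move from spreadsheets to code, one way to standardise mixed date styles is sketched below. The list of formats is an assumption about what the projected dataset contains, and slashed dates are assumed to be day-first; anything that matches no known format is returned as `None` so it can be reviewed by hand rather than guessed.

```python
from datetime import datetime

# Assumed input formats, tried in order. Slashed and dotted dates are
# treated as day-first, which is itself a cleaning decision to justify.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%d %B %Y"]

def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    candidate = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(candidate, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None

# Hypothetical column with four styles of the same date plus one unusable entry.
raw = ["2024-03-05", "05/03/2024", "5.3.2024", " 5 March 2024 ", "March 5th"]
cleaned = [normalise_date(r) for r in raw]

for before, after in zip(raw, cleaned):
    print(f"{before!r:18} -> {after}")
```

The unmatched entry is a useful prompt for the relay: should the team infer the year, ask the data owner, or drop the row?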

Prepare & details

Justify the importance of data cleaning before any data analysis.

Facilitation Tip: During the Inconsistency Cleanup Relay, give each group a unique typo so they experience how real-world data entry errors vary from dataset to dataset.

Setup: Groups at tables with problem materials

Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

50 min · Individual

Individual: Preprocessing Pipeline

Students select a public dataset, document steps to clean missing values and outliers, then generate a cleaned version. They reflect on changes in a one-page report for peer review.

Prepare & details

Design a strategy to handle missing data in a large dataset.

Facilitation Tip: In the Preprocessing Pipeline, insist students write a one-sentence justification for every transformation before they run the code or calculation.
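For students working in code, pairing each step with its justification can be built into the pipeline itself. The sketch below is one possible shape, not a prescribed format: the step names, the readings, and the rule that negative values are impossible are all assumptions for illustration.

```python
from statistics import median

def mark_negative_as_missing(values):
    """Treat impossible (negative) readings as missing rather than deleting rows."""
    return [None if v is not None and v < 0 else v for v in values]

def fill_missing_with_median(values):
    """Replace every missing entry with the median of the recorded values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Each step is paired with the one-sentence justification written beforehand.
PIPELINE = [
    (mark_negative_as_missing,
     "A negative reading is impossible for this measurement, so it is an entry error."),
    (fill_missing_with_median,
     "The median is less affected by extreme values than the mean."),
]

def run_pipeline(values, steps):
    """Apply each step in order and record what changed and why."""
    log = []
    for step, reason in steps:
        missing_before = values.count(None)
        values = step(values)
        log.append(f"{step.__name__}: missing {missing_before} -> "
                   f"{values.count(None)}. Why: {reason}")
    return values, log

raw = [12, None, 15, -4, 14, None, 13]  # hypothetical readings
cleaned, log = run_pipeline(raw, PIPELINE)

print(cleaned)
for entry in log:
    print(entry)
```

The log doubles as raw material for the one-page report, and the order of the steps is itself worth discussing: filling gaps before removing the impossible value would let that value influence the median.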

Setup: Groups at tables with problem materials

Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

Teaching This Topic

Teachers should avoid presenting cleaning as a mechanical checklist. Instead, frame each technique as a strategic move with consequences. Research shows students grasp outliers better when they plot real data and see how a single point can pull a mean or bend a trend line. Encourage students to document their reasoning in margin notes so they can revisit and revise decisions later.
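A small demonstration of how one point can pull a mean and reverse a trend is sketched below. The study-hours and score values are invented, and the Pearson correlation is computed by hand so the example needs nothing beyond the standard library.

```python
from math import sqrt
from statistics import mean, median

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: hours studied and test score for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

# The same data with one extra, extreme point added.
hours_out = hours + [7]
scores_out = scores + [20]

r_clean = pearson(hours, scores)
r_with_outlier = pearson(hours_out, scores_out)

print(f"correlation without outlier: {r_clean:.3f}")
print(f"correlation with outlier:    {r_with_outlier:.3f}")
print(f"mean score:   {mean(scores):.2f} -> {mean(scores_out):.2f}")
print(f"median score: {median(scores):.2f} -> {median(scores_out):.2f}")
```

In this made-up example the single added point turns a strong positive correlation into a weak negative one and shifts the mean far more than the median, which is the contrast students are meant to notice when they plot their own data.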

What to Expect

By the end of these activities, students will confidently choose appropriate cleaning methods, justify their choices with evidence, and evaluate how each step changes summary statistics and visual trends. They will move from asking 'How do I clean this?' to 'Why is this cleaning decision better than the alternatives?'

Watch Out for These Misconceptions

Common Misconception: During the Pairs Challenge, watch for students defaulting to deletion without considering the impact on dataset size.

What to Teach Instead

Prompt pairs to calculate how many rows they would lose and what summary statistics would shift before they choose a method. Ask them to sketch two histograms side-by-side to visualize the difference.

Common Misconception: During the Outlier Detection Lab, watch for students labeling any extreme value as an error without context.

What to Teach Instead

Have students read the dataset’s metadata aloud before they plot, forcing them to ask whether the extreme reflects a rare event or a data entry mistake. Require them to write a one-sentence justification for removing or keeping each outlier.

Common Misconception: During the Inconsistency Cleanup Relay, watch for students fixing typos without checking whether the error affects downstream analysis.

What to Teach Instead

After each typo fix, ask students to recalculate the mean and standard deviation to see if the change matters. Use a quick peer check so they compare their revised statistics with another group.

Assessment Ideas

Quick Check

After the Pairs Challenge, give each pair one minute to describe to the class one missing-data scenario where they would choose mean imputation instead of deletion, with a one-sentence rationale.

Discussion Prompt

During the Outlier Detection Lab, pause after the first dataset. Ask students to turn to a partner and give two reasons why an outlier might be kept, then share out with the class.

Exit Ticket

After the Inconsistency Cleanup Relay, collect each group’s revised dataset and one sentence explaining the inconsistency they fixed and why it mattered for analysis.

Extensions & Scaffolding

  • Challenge: Ask early finishers to design a new preprocessing step for a dataset with mixed units (e.g., temperatures in Celsius and Fahrenheit).
  • Scaffolding: Provide pre-labeled histograms for students who struggle to spot inconsistencies in the relay activity.
  • Deeper exploration: Have students research how data cleaning pipelines differ in industry (e.g., finance vs. healthcare) and present a one-slide comparison.
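For the mixed-units challenge above, a starting point for early finishers might look like the following sketch. It assumes every reading was recorded with a unit label, which real datasets often lack; the readings themselves are invented.

```python
def to_celsius(value, unit):
    """Convert a temperature to Celsius based on its unit label."""
    unit = unit.strip().upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    # Unknown labels are rejected rather than guessed.
    raise ValueError(f"unknown unit: {unit!r}")

# Hypothetical readings as (value, unit) pairs with inconsistent labels.
readings = [(21.5, "C"), (70.7, "F"), (19.0, "c"), (212.0, "F")]
celsius = [round(to_celsius(value, unit), 1) for value, unit in readings]

print(celsius)
```

A harder version of the challenge removes the unit column and asks students how they would decide which scale a bare number such as 45 belongs to.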

Key Vocabulary

Missing Values: Data points that are absent from a dataset. These can occur due to errors in data collection or entry, or simply be unrecorded information.
Outliers: Data points that significantly differ from other observations in a dataset. They can be caused by measurement errors or represent genuine, extreme values.
Data Imputation: The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data.
Data Consistency: Ensuring that data values within a dataset are uniform and do not contradict each other. This includes checking for correct formats, units, and logical relationships.
Z-score: A statistical measurement that describes a value's relationship to the mean of a group of values, measured in standard deviations. It is commonly used to identify outliers.
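
The z-score definition can be written as a formula, where x is the value, μ is the mean of the group, and σ is its standard deviation. The worked numbers are illustrative only: a value of 75 in a group with mean 50 and standard deviation 10.

```latex
z = \frac{x - \mu}{\sigma}
\qquad\text{for example}\qquad
z = \frac{75 - 50}{10} = 2.5
```

A z-score of 2.5 means the value sits two and a half standard deviations above the mean. The cutoff for calling a value an outlier is a convention that depends on the dataset, often somewhere between 2 and 3.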
