Computer Science · 10th Grade

Active learning ideas

Data Cleaning and Preprocessing

Active learning works well for data cleaning because students need to experience the frustration of messy data to truly understand why cleaning matters. Hands-on activities make abstract concepts like outliers and missing values concrete and memorable, preparing students for real-world data work.

Common Core State Standards · CSTA: 3A-DA-10

15–35 min · Pairs → Whole Class · 4 activities

Activity 01

Gallery Walk · 35 min · Small Groups

Gallery Walk: The Messy Dataset Museum

Print five different messy datasets and post them around the room, each with a different type of data quality problem (duplicates, missing values, format mismatches, outliers, impossible values). Groups rotate through stations with sticky notes to identify the problem type and propose a cleaning strategy before moving on.
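For teachers who want to build the station printouts from one file, the sketch below shows one way each of the five problem types could be detected in plain Python. The dataset, column names, and the 10 to 19 age range are invented for illustration and are not part of the original activity.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical station data: id, name, age, signup_date (all values made up).
RAW = """id,name,age,signup_date
1,Ana,15,2024-01-10
2,Ben,,2024-01-11
3,Cy,16,01/12/2024
3,Cy,16,01/12/2024
4,Di,-4,2024-01-13
5,Eli,95,2024-01-14
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Duplicates: identical rows that appear more than once.
counts = Counter(tuple(r.values()) for r in rows)
duplicates = [row for row, n in counts.items() if n > 1]

# Missing values: empty strings in any column.
missing = [r["id"] for r in rows if any(v == "" for v in r.values())]

# Format mismatches: dates that are not written as YYYY-MM-DD.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")
bad_format = [r["id"] for r in rows if not iso_date.fullmatch(r["signup_date"])]

# Impossible values: an age below zero cannot be real.
ages = {r["id"]: int(r["age"]) for r in rows if r["age"] != ""}
impossible = [i for i, a in ages.items() if a < 0]

# Outliers: possible but suspicious. The 10-19 range is an assumed class rule.
outliers = [i for i, a in ages.items() if a >= 0 and not 10 <= a <= 19]

print(duplicates, missing, bad_format, impossible, outliers)
```

Note that the code only finds candidates. Whether a 95-year-old in the file is an error or a real person is still a judgment call, which is the conversation the stations are meant to start.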

Explain the common types of data inconsistencies and errors.

Facilitation Tip: During the Gallery Walk, position students as curators who must explain their cleaning decisions to peers using the provided rubric.

What to look for: Provide students with a small, messy dataset (e.g., a CSV snippet with errors). Ask them to identify two specific data quality issues present and suggest one cleaning step for each. Collect these as they leave class.

Understand · Apply · Analyze · Create · Relationship Skills · Social Awareness


Activity 02

Think-Pair-Share · 20 min · Pairs

Think-Pair-Share: Should We Delete It?

Give students a dataset with 15% missing age values and ask them individually to decide whether to delete rows, fill with the mean, or flag the records. Pairs compare decisions and discuss trade-offs, then share cases where they disagreed and why.
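The three options can be compared side by side with a short script. This is a minimal sketch on made-up ages (3 of 20 missing, so 15%), not a prescribed solution; the variable names are hypothetical.

```python
from statistics import mean, pstdev

# Made-up ages; None marks a missing value (3 of 20 records, i.e. 15%).
ages = [14, 15, None, 16, 15, 14, 17, None, 15, 16,
        15, 14, 16, 15, None, 17, 15, 16, 14, 15]

known = [a for a in ages if a is not None]

# Option 1: delete rows with a missing age (fewer records remain).
deleted = known

# Option 2: fill each gap with the mean of the known ages.
fill_value = mean(known)
filled = [fill_value if a is None else a for a in ages]

# Option 3: keep every record but flag the ones with no age.
flagged = [{"age": a, "age_missing": a is None} for a in ages]

print("rows kept:", len(deleted), len(filled), len(flagged))
print("mean:", round(mean(deleted), 2), round(mean(filled), 2))
print("spread:", round(pstdev(deleted), 2), round(pstdev(filled), 2))
```

A useful discussion prompt: filling with the mean leaves the average unchanged but makes the spread look smaller than it is in the known data, while deleting keeps the spread but loses three students.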

Analyze the impact of dirty data on analytical results.

Facilitation Tip: For Think-Pair-Share, insist that pairs produce a single list of deletion criteria and a justification before sharing with the class.

What to look for: Present students with a scenario: 'A dataset of student test scores has missing scores for 10% of students and some scores are entered as text (e.g., "ninety").' Ask them to list three potential problems this data could cause for calculating the class average and propose one method to address each problem.

Understand · Apply · Analyze · Self-Awareness · Relationship Skills

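To make the scenario concrete, the sketch below shows what happens when a score column mixes numbers, words, and blanks. The scores, the `WORDS` lookup, and the `parse_score` helper are all invented for illustration.

```python
# Made-up score column as it might arrive from a spreadsheet export.
raw_scores = ["88", "ninety", "", "75", "92", "", "81", " 79 "]

# Hypothetical lookup for scores typed as words; extend as needed.
WORDS = {"ninety": 90}


def parse_score(text):
    """Return a number, or None when the entry is missing or unreadable."""
    text = text.strip().lower()
    if text == "":
        return None
    if text in WORDS:
        return WORDS[text]
    try:
        return float(text)
    except ValueError:
        return None


parsed = [parse_score(s) for s in raw_scores]
valid = [p for p in parsed if p is not None]
average = sum(valid) / len(valid)

print(f"{len(valid)} of {len(raw_scores)} scores used, average {average:.1f}")
```

Students can check their answers against it: the average here silently ignores the two blank entries, so the class should still decide whether that is acceptable.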

Activity 03

Inquiry Circle · 30 min · Small Groups

Inquiry Circle: Before-and-After Analysis

Small groups receive the same raw sales dataset and a pre-cleaned version. They must reverse-engineer which cleaning steps were applied by comparing the two versions, then write a short cleaning log documenting each transformation in order.
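Groups that finish early could check their hand-written log against a script. The following is a rough sketch that lists the differences between a raw and a cleaned version; the records, the `order_id` key, and the `index_by_id` helper are assumptions for illustration, and the script reports what changed, not why or in what order.

```python
# Hypothetical raw and cleaned sales records, keyed by an order_id column.
raw = [
    {"order_id": "A1", "region": "north ", "amount": "120.50"},
    {"order_id": "A2", "region": "South", "amount": ""},
    {"order_id": "A2", "region": "South", "amount": ""},
    {"order_id": "A3", "region": "EAST", "amount": "$80"},
]
cleaned = [
    {"order_id": "A1", "region": "North", "amount": "120.50"},
    {"order_id": "A3", "region": "East", "amount": "80.00"},
]


def index_by_id(rows):
    """Keep the first row seen for each order_id."""
    out = {}
    for row in rows:
        out.setdefault(row["order_id"], row)
    return out


raw_by_id, clean_by_id = index_by_id(raw), index_by_id(cleaned)
log = []

# More raw rows than unique ids means some rows were duplicates.
if len(raw) != len(raw_by_id):
    log.append(f"{len(raw) - len(raw_by_id)} duplicate row(s) in raw data")

for oid, before in raw_by_id.items():
    after = clean_by_id.get(oid)
    if after is None:
        log.append(f"{oid}: row removed")
        continue
    for col in before:
        if before[col] != after[col]:
            log.append(f"{oid}: {col} {before[col]!r} -> {after[col]!r}")

for entry in log:
    print(entry)
```

The output leaves the interesting questions open, for example whether order A2 was dropped because of its missing amount or for some other reason.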

Construct a plan for cleaning a given messy dataset.

Facilitation Tip: In the Inquiry Circle, assign each group a different cleaning technique so the class can compare outcomes and discuss trade-offs.

What to look for: Pose the question: 'Imagine you are cleaning a dataset of product prices, and you find a price of $0.01 for a laptop and $1,000,000 for a pen. How would you decide if these are errors or valid extreme values? What factors would influence your decision?' Facilitate a class discussion on critical thinking in data cleaning.

Analyze · Evaluate · Create · Self-Management · Self-Awareness

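One factor students often raise is context: a price is only strange relative to similar products. The sketch below illustrates that idea by comparing each price with the median for its category. The price list and the factor of 20 are arbitrary assumptions chosen for the example, not a recommended rule.

```python
from statistics import median

# Made-up price list; the categories and values are for illustration only.
prices = [
    ("laptop", 899.00), ("laptop", 1049.00), ("laptop", 0.01),
    ("laptop", 1299.00), ("pen", 1.50), ("pen", 2.25),
    ("pen", 1_000_000.00), ("pen", 0.99),
]
FACTOR = 20  # assumed: flag prices over 20x above or below the category median

# Group prices by category so each price is judged against similar products.
by_category = {}
for category, price in prices:
    by_category.setdefault(category, []).append(price)

medians = {c: median(v) for c, v in by_category.items()}

suspects = [
    (category, price)
    for category, price in prices
    if price > medians[category] * FACTOR or price < medians[category] / FACTOR
]

print(suspects)
```

A flagged price is a prompt to investigate, not proof of an error. Changing `FACTOR` and watching which rows appear or disappear is a quick way to show that the threshold itself is a decision someone has to justify.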

Activity 04

Collaborative Problem-Solving · 15 min · Whole Class

Structured Discussion: The Cost of Dirty Data

Share a real case study (e.g., a hospital billing error or a census miscoding) where uncleaned data led to a costly mistake. The class discusses what preprocessing step could have caught the error, then identifies which step from their cleaning toolkit would apply.

Explain the common types of data inconsistencies and errors.

Facilitation Tip: During the Structured Discussion, provide a list of real-world consequences of dirty data to guide the conversation.

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

A few notes on teaching this unit

Teachers should work through flawed datasets in front of the class and narrate their own thought process when cleaning, making the invisible work visible. Avoid presenting cleaning as a checklist; instead, emphasize context and consequences. Students tend to engage more when data cleaning is framed as a detective story with multiple defensible solutions rather than a single correct answer.

Students will confidently identify data errors, justify their cleaning choices, and explain how clean data supports reliable analysis. They will move beyond simple deletions to use multiple strategies and recognize cleaning as an ongoing process.

Watch Out for These Misconceptions

During the Gallery Walk, watch for students who assume all problematic rows should be deleted without considering context or consequences.
Use the Gallery Walk debrief to push students to explain why they chose deletion over other strategies like imputation or transformation, using the examples they observed.
During the Think-Pair-Share activity, listen for students who say data errors are always easy to spot through visual inspection alone.
In the pair phase, require students to use statistical summaries (min, max, unique counts) to find subtle errors before deciding on a cleaning method.
During the Inquiry Circle, some students may treat preprocessing and analysis as separate phases that don’t overlap.
Use the before-and-after analysis to highlight how new issues often appear during analysis, requiring students to revisit their cleaning steps iteratively.
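The statistical summaries mentioned for the pair phase (min, max, unique counts) can be produced in a few lines. This is a small sketch on invented survey columns, assuming students would substitute values from their own file.

```python
from collections import Counter

# Made-up survey columns; students would load these from their own file.
grades = ["10", "10", "11", "9", "10 ", "Ten", "11", "110", "9"]
states = ["CA", "ca", "CA", "NY", "N.Y.", "CA", "NY", "TX", "CA"]

# Unique counts expose spelling and spacing variants of the same value.
state_counts = Counter(states)

# Split entries into those that look like whole numbers and those that do not.
numeric = [int(g) for g in grades if g.strip().isdigit()]
non_numeric = [g for g in grades if not g.strip().isdigit()]

# Min and max expose values outside the plausible range.
print("unique states:", dict(state_counts))
print("grade min/max:", min(numeric), max(numeric))
print("non-numeric grades:", non_numeric)
```

None of these problems is obvious when skimming a long column, which is the point to draw out: a grade of 110 only stands out once the maximum is printed.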
