Data Cleaning and Preprocessing: Activities & Teaching Strategies
Active learning works for data cleaning and preprocessing because students must engage directly with messy datasets to understand why clean data matters. When students manipulate real-world data, they experience firsthand how poor data quality distorts insights, making abstract concepts like outliers and missing values tangible and memorable.
Learning Objectives
1. Identify types of data errors, including missing values, inconsistencies, and outliers, within a given dataset.
2. Compare and contrast different data cleaning techniques, such as imputation, outlier removal, and data standardization, for suitability to specific error types.
3. Construct a step-by-step plan to preprocess a provided dataset, justifying the chosen cleaning methods.
4. Evaluate the impact of 'dirty data' on the accuracy and reliability of analytical results using a case study.
5. Demonstrate the application of at least two data cleaning techniques using a spreadsheet or data analysis tool.
Gallery Walk: The Good, The Bad, and The Misleading
Display various data visualizations around the room, including some that are intentionally misleading (e.g., truncated y-axes). Students move in groups to identify what each graph is trying to say and any 'tricks' used to distort the data.
Explain the impact of 'dirty data' on the accuracy of analytical results.
Facilitation Tip: During the Gallery Walk, circulate and ask students to explain why they grouped charts as 'good,' 'bad,' or 'misleading' to uncover their reasoning processes.
Setup: Wall space or tables arranged around room perimeter
Materials: Large paper/poster boards, Markers, Sticky notes for feedback
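When preparing the misleading charts for this gallery, it may help to see the arithmetic behind the truncated y-axis trick. The following is a minimal Python sketch, not part of the original activity: the function name `apparent_ratio` and the sample values are hypothetical, chosen only to show how raising the axis baseline exaggerates a small difference between two bars.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the y-axis starts at axis_start.

    A bar's drawn height is proportional to (value - axis_start),
    so raising the baseline exaggerates small differences.
    """
    if value_a <= axis_start or value_b <= axis_start:
        raise ValueError("both values must lie above the axis start")
    return (value_b - axis_start) / (value_a - axis_start)


# Hypothetical survey results: 50 vs 55 responses.
honest = apparent_ratio(50, 55)                    # axis starts at 0
truncated = apparent_ratio(50, 55, axis_start=45)  # axis starts at 45

print(f"Axis from 0:  second bar looks {honest:.1f}x as tall")
print(f"Axis from 45: second bar looks {truncated:.1f}x as tall")
```

With these made-up numbers, a 10% difference in the data is drawn as a bar twice as tall once the axis starts at 45, which is the kind of distortion students are asked to spot.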
Inquiry Circle: Data Makeover
Give groups a boring table of raw data and a poorly chosen chart. They must work together to create three different visualizations of that same data, explaining which one is most effective for a specific target audience, such as a local council.
Differentiate between various data cleaning techniques and their appropriate uses.
Facilitation Tip: For Data Makeover, provide a rubric with clear criteria for clean data and effective visualizations to guide student revisions.
Setup: Groups at tables with access to source materials
Materials: Source material collection, Inquiry cycle worksheet, Question generation protocol, Findings presentation template
Think-Pair-Share: Outlier Detective
Show a scatter plot with several clear outliers. In pairs, students discuss what might have caused these outliers (data error vs. interesting phenomenon) and whether they should be included or removed from the final analysis.
Construct a plan to preprocess a given dataset for analysis.
Facilitation Tip: In Outlier Detective, give students a limited time to analyze outliers before discussing how context determines whether an outlier is meaningful or erroneous.
Setup: Standard classroom seating; students turn to a neighbor
Materials: Discussion prompt (projected or printed), Optional: recording sheet for pairs
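For classes that move from spotting outliers by eye to flagging them with a rule, one common rule of thumb is the 1.5 × IQR rule. The activity itself does not prescribe a method, so treat this Python sketch as an assumption: the function `flag_outliers`, the multiplier `k`, and the quiz scores are all hypothetical, and the example uses a single variable rather than a full scatter plot.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    Flagging is not deleting: the result is a list of points to
    investigate, not a list of points to remove.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical quiz scores; 140 is impossible on a 100-point quiz.
scores = [52, 55, 58, 60, 61, 63, 65, 67, 70, 140]
print(flag_outliers(scores))
```

The rule only says which points are unusual. Whether a flagged point is a data error or an interesting phenomenon is still the judgment call the pairs discuss.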
Teaching This Topic
Experienced teachers approach this topic by focusing on real-world data rather than textbook examples, as students engage more when they see the relevance of their work. Avoid starting with theory; instead, let students encounter data problems organically through activities, then guide them to discover solutions collaboratively. Research shows that when students experience the frustration of working with 'dirty' data, they develop a deeper appreciation for the importance of preprocessing steps like handling missing values or correcting errors.
What to Expect
Successful learning looks like students confidently identifying data issues, justifying their cleaning choices, and selecting appropriate visualizations to communicate findings clearly. They should also articulate why certain chart types or cleaning methods are more effective than others for specific datasets.
Watch Out for These Misconceptions
Common Misconception: During the Gallery Walk, watch for students who assume colorful or visually complex charts are inherently better, even when they obscure the data.
What to Teach Instead
Use the Gallery Walk as a chance to redirect students to the rubric, asking them to compare charts based on clarity, not aesthetics. For example, if a student praises a 3D pie chart, ask them to explain how the extra dimension affects their ability to read the data.
Common Misconception: During Data Makeover, watch for students who believe all outliers must be removed to make the data 'correct.'
What to Teach Instead
During the activity, challenge students to research their outlier before deleting it. Ask them to consider whether the outlier represents a true error or an important trend, using the dataset’s context to guide their decision.
Assessment Ideas
After the Gallery Walk, provide students with a small, intentionally flawed dataset (e.g., a table of student heights with missing entries and unrealistic values). Ask them to identify at least three specific data quality issues and propose one method to address each.
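If students check such a dataset with a data analysis tool rather than by hand, the two kinds of problems named above (missing entries and unrealistic values) can be found with a simple scan. This is only a sketch under stated assumptions: the student names, the heights, the plausible range `MIN_CM` to `MAX_CM`, and the helper `find_issues` are all hypothetical.

```python
# Hypothetical class records: heights in centimetres, None = missing entry.
heights = {
    "Ana": 152.0,
    "Ben": None,
    "Chloe": 1.58,    # likely entered in metres instead of centimetres
    "Dev": 149.5,
    "Ella": 515.0,    # likely a typo for 151.0
    "Finn": None,
}

# Assumed plausible range for this age group; adjust to the class.
MIN_CM, MAX_CM = 100.0, 220.0


def find_issues(records):
    """Return a dict mapping each student to a description of the problem."""
    issues = {}
    for name, value in records.items():
        if value is None:
            issues[name] = "missing value"
        elif not MIN_CM <= value <= MAX_CM:
            issues[name] = f"implausible value: {value}"
    return issues


for name, problem in find_issues(heights).items():
    print(f"{name}: {problem}")
```

The scan only locates the problems. Proposing a fix for each one (re-measuring, converting units, imputing) remains the student's task.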
During Data Makeover, present students with two versions of an analysis report: one based on raw, 'dirty' data and another based on cleaned data. Facilitate a class discussion using these questions: 'What differences do you observe in the conclusions drawn from each report?' and 'How did the data cleaning process influence the final results?'
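To build the two contrasting reports, a teacher might start from a small example in which a single bad entry changes the headline number. The Python sketch below uses made-up heights and assumes one value was mistyped; it compares the mean and median before and after the correction.

```python
import statistics

# Hypothetical heights in cm; 1520.0 is a data-entry error for 152.0.
dirty = [148.0, 150.0, 151.0, 1520.0, 155.0]
cleaned = [148.0, 150.0, 151.0, 152.0, 155.0]

for label, data in (("dirty", dirty), ("cleaned", cleaned)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label:8s} mean = {mean:7.1f} cm   median = {median:5.1f} cm")
```

In this example the mean shifts dramatically while the median does not move, which gives students a concrete difference to point to when answering both discussion questions.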
After Outlier Detective, ask students to define 'outlier' in their own words and describe one scenario where an outlier might be intentionally kept rather than removed. Also, ask them to list one common method for handling missing data.
Extensions & Scaffolding
- Challenge: Ask students to find their own flawed dataset online, clean it, and present a before-and-after comparison with an explanation of their choices.
- Scaffolding: Provide a partially cleaned dataset for students who struggle with the initial steps, so they can focus on identifying remaining issues.
- Deeper exploration: Have students research how data cleaning is used in a specific career field (e.g., healthcare, finance) and present their findings to the class.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Dirty Data | Refers to data that contains errors, inaccuracies, or inconsistencies, making it unreliable for analysis. |
| Missing Values | Data points that are absent in a dataset. These can be handled through imputation or removal. |
| Outliers | Data points that significantly differ from other observations in a dataset. They may indicate errors or unusual events. |
| Data Imputation | The process of replacing missing data values with substituted values, such as the mean, median, or mode of the dataset. |
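| Data Standardization | The process of rescaling data to a common scale so that variables can be compared fairly. Two common approaches are min-max scaling, which maps values to a range such as 0 to 1, and z-score scaling, which gives the data a mean of 0 and a standard deviation of 1. |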
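The last two vocabulary entries can be illustrated with a short Python sketch using only the standard library. The scores are hypothetical, and mean imputation is just one of the substitution choices listed above (the median or mode would work the same way); both rescaling variants mentioned in the definition are shown.

```python
import statistics

# Hypothetical test scores; None marks a missing value.
scores = [70, None, 80, 90, None, 60]

# Mean imputation: replace each missing value with the mean of the known ones.
known = [s for s in scores if s is not None]
fill = statistics.mean(known)  # (70 + 80 + 90 + 60) / 4 = 75
imputed = [fill if s is None else s for s in scores]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(imputed), max(imputed)
min_max = [(s - lo) / (hi - lo) for s in imputed]

# Z-score scaling: subtract the mean, divide by the standard deviation.
mu = statistics.mean(imputed)
sigma = statistics.pstdev(imputed)  # population standard deviation
z_scores = [(s - mu) / sigma for s in imputed]

print(imputed)
print([round(v, 2) for v in min_max])
print([round(v, 2) for v in z_scores])
```

Note that mean imputation leaves the mean unchanged but shrinks the spread of the data, which is one reason the choice of imputation method is worth discussing with students.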
More in Data Analytics and Visualization
Data Collection Methods
Understanding various methods of data collection, including surveys, sensors, and web scraping, and their appropriate uses.
Organising Data in Tables
Students will learn to organise data into tables with rows and columns, understanding primary keys and simple relationships between tables.
Structured Data and Databases
Introduction to relational data modeling and using query languages to extract specific information.
Basic Statistical Concepts
Introduction to basic statistical measures (mean, median, mode, range) and their use in understanding data distributions.
Data Visualization Fundamentals
Transforming raw datasets into basic charts and graphs to communicate findings and trends effectively.