Data Cleaning and Preprocessing
Activities & Teaching Strategies
Active learning works for data cleaning and preprocessing because students need to experience firsthand how messy data can obscure or reveal stories. When they see a poorly designed visualization fail to communicate, they understand why cleaning and preprocessing matter.
Learning Objectives
1. Identify common data errors such as missing values, duplicates, and incorrect data types in a given dataset.
2. Compare and contrast different strategies for handling missing data, including deletion and imputation.
3. Evaluate the impact of outliers on statistical measures and visualization, and select appropriate removal or transformation techniques.
4. Design a step-by-step plan to clean a provided messy dataset, documenting each cleaning action.
5. Critique the quality of a dataset based on its potential biases and inaccuracies before and after cleaning.
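To make the first objective concrete, the sketch below shows one way students might audit a small dataset for missing values, exact duplicates, and incorrect data types. It is a minimal illustration using only the Python standard library. The rows, the field names, and the `audit` helper are all hypothetical and invented for this example.

```python
# Hypothetical survey rows; every value here is made up for illustration.
rows = [
    {"id": "1", "age": "34", "city": "Austin"},
    {"id": "2", "age": "", "city": "Boston"},        # missing age
    {"id": "3", "age": "forty", "city": "Chicago"},  # wrong type: text, not a number
    {"id": "2", "age": "", "city": "Boston"},        # exact duplicate of the second row
    {"id": "4", "age": "29", "city": ""},            # missing city
]


def audit(rows, numeric_fields):
    """Report missing values, duplicate rows, and non-numeric entries by row index."""
    report = {"missing": [], "duplicates": [], "bad_type": []}
    seen = set()
    for index, row in enumerate(rows):
        # A row counts as a duplicate when every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"].append(index)
        seen.add(key)
        for field, value in row.items():
            if value.strip() == "":
                report["missing"].append((index, field))
            elif field in numeric_fields:
                try:
                    float(value)
                except ValueError:
                    report["bad_type"].append((index, field))
    return report


report = audit(rows, numeric_fields={"age"})
for problem, locations in report.items():
    print(problem, locations)
```

Row indexes start at 0. A useful discussion prompt is to ask students which of the flagged problems they would fix first, and why the duplicate row also shows up in the missing-value list.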
Gallery Walk: The Good, The Bad, and The Misleading
Post various charts and infographics around the room. Students use a checklist to identify 'design wins' and 'design sins,' such as truncated Y-axes or confusing color schemes.
Explain how to handle missing or corrupted data in a large dataset.
Facilitation Tip: For the Gallery Walk, circulate with a checklist of common design flaws so you can redirect groups to specific issues like inconsistent scales or misleading axes.
Setup: Wall space or tables arranged around room perimeter
Materials: Large paper/poster boards, Markers, Sticky notes for feedback
Inquiry Circle: Data Makeover
Give groups a poorly designed chart and the raw data it represents. They must work together to create a new, more accurate and persuasive visualization for a specific target audience.
Differentiate between various data cleaning techniques (e.g., imputation, outlier removal).
Facilitation Tip: During the Data Makeover, assign each group one messy dataset and one clear question it should answer, so their makeover has a measurable goal.
Setup: Groups at tables with access to source materials
Materials: Source material collection, Inquiry cycle worksheet, Question generation protocol, Findings presentation template
Think-Pair-Share: Color and Perception
Show two versions of the same map: one using a 'stoplight' (red/green) scale and one using a blue/orange scale. Students discuss which is better for accessibility and how color changes the 'mood' of the data.
Construct a plan for cleaning a given messy dataset.
Facilitation Tip: In the Color and Perception activity, provide identical data to both partners but different color palettes so they can directly compare how perception changes with design choices.
Setup: Standard classroom seating; students turn to a neighbor
Materials: Discussion prompt (projected or printed), Optional: recording sheet for pairs
Teaching This Topic
Start by teaching students the mantra: 'The data is not the visualization, and the visualization is not the truth.' Avoid overwhelming them with rules—instead, let them discover principles through critique and revision. Research shows that students learn data ethics best when they create misleading visualizations themselves, then reflect on why clarity matters.
What to Expect
Successful learning looks like students recognizing design flaws in visualizations, suggesting clear corrections, and explaining why their changes improve clarity. They should also articulate how human choices in color, scale, and chart type influence interpretation.
Watch Out for These Misconceptions
Common Misconception
During Gallery Walk: Watch for students assuming all charts are objective. Redirect them by asking, 'What story does this chart tell, and who benefits from that story?'
What to Teach Instead
Direct students to compare two charts of the same data side by side. Ask them to identify which chart aligns with the data and which alters the narrative, then explain their reasoning using specific design choices.
Common Misconception
During Data Makeover: Watch for students prioritizing aesthetics over clarity. Redirect them by asking, 'Does this change make the data easier or harder to understand?'
What to Teach Instead
Have students present their cleaned visualization to the class and justify each design choice. Peers should vote on whether the chart answers its intended question, forcing students to defend their clarity-focused decisions.
Assessment Ideas
After the Gallery Walk, show students a new chart with deliberate errors (e.g., a truncated y-axis, inconsistent color coding). Ask them to list the errors they notice and suggest one correction for each.
During Data Makeover, circulate and ask groups to explain their cleaning steps and chart choices. Listen for whether they reference specific design principles (e.g., 'We chose a bar chart because the categories are discrete') and whether they address potential biases in their dataset.
After Color and Perception, provide a short scenario where color choices could mislead (e.g., 'A bar chart uses red for positive growth and green for negative'). Ask students to write how they would adjust the colors to improve clarity and explain their reasoning.
Extensions & Scaffolding
- Challenge: Ask students to redesign a chart they critiqued in the Gallery Walk using the same data but a different chart type. They present both versions and explain which communicates better and why.
- Scaffolding: Provide a partially cleaned dataset with guided prompts (e.g., 'Identify the missing values and suggest an imputation method').
- Deeper exploration: Introduce students to real-world datasets with ethical dilemmas (e.g., income inequality) and ask them to clean the data while considering how their choices might influence policy discussions.
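For the scaffolding and deeper-exploration options above, it may help to show students what a documented cleaning plan can look like in practice. The sketch below is one possible approach, not a prescribed method. The records, the `log` list, and the `step` helper are hypothetical. It records each cleaning action and then contrasts two ways of handling a missing income value: deletion and median imputation.

```python
import statistics

# Hypothetical messy records; names and values are invented for illustration.
records = [
    {"name": " Ana ", "income": "52000"},
    {"name": "Ben", "income": ""},          # missing income
    {"name": "Ana", "income": "52000"},     # duplicate once whitespace is trimmed
    {"name": "Cy", "income": "61000"},
]

log = []  # each entry documents one cleaning action


def step(description, before, after):
    """Record a cleaning action and how it changed the row count."""
    log.append(f"{description}: {len(before)} rows -> {len(after)} rows")
    return after


# Step 1: trim stray whitespace so matching values compare as equal.
trimmed = step(
    "Trim whitespace",
    records,
    [{k: v.strip() for k, v in r.items()} for r in records],
)

# Step 2: remove exact duplicates while preserving the original order.
deduplicated = []
for r in trimmed:
    if r not in deduplicated:
        deduplicated.append(r)
unique = step("Remove exact duplicates", trimmed, deduplicated)

# Step 3, option A: deletion. Drop any row with a missing income.
dropped = step(
    "Option A, drop rows with missing income",
    unique,
    [r for r in unique if r["income"] != ""],
)

# Step 3, option B: imputation. Fill missing income with the median of known values.
known = [int(r["income"]) for r in unique if r["income"] != ""]
fill = statistics.median(known)
imputed = step(
    "Option B, impute median income",
    unique,
    [{**r, "income": int(r["income"]) if r["income"] else fill} for r in unique],
)

for entry in log:
    print(entry)
```

Students can compare the two options: deletion removes a person from the dataset entirely, while imputation keeps the person but assigns a value nobody reported. Either choice can shape the conclusions drawn later, which is why the log matters.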
Key Vocabulary
| Term | Definition |
| --- | --- |
| Missing Data | Values that are not recorded or present in a dataset. This can occur due to errors in data collection or entry. |
| Data Imputation | The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data. |
| Outlier | A data point that significantly differs from other observations in a dataset. Outliers can skew results and affect model performance. |
| Data Transformation | The process of changing the format, structure, or values of data. This can include scaling, normalization, or encoding categorical variables. |
| Data Validation | The process of checking data for accuracy and completeness. It ensures that data conforms to predefined rules and constraints. |
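The vocabulary above can be illustrated with a short worked example. The sketch below uses only Python's standard `statistics` module and a hypothetical list of test scores. It assumes median imputation, the 1.5 × IQR rule for flagging outliers, min-max scaling as the transformation, and a 0 to 100 range as the validation rule. These are common classroom choices, not the only valid ones.

```python
import statistics

# Hypothetical test scores; None marks a missing value and 250 is a likely entry error.
scores = [72, 85, None, 90, 78, 250, 88, None, 81]

# Missing data: separate the observed values from the gaps.
observed = [s for s in scores if s is not None]

# Data imputation: fill each gap with the median of the observed values.
median_score = statistics.median(observed)
imputed = [median_score if s is None else s for s in scores]

# Outlier: flag values more than 1.5 * IQR beyond the first or third quartile.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [s for s in observed if s < low or s > high]

# Impact of the outlier on a statistical measure: compare the two means.
mean_with = statistics.mean(observed)
mean_without = statistics.mean([s for s in observed if s not in outliers])

# Data transformation: min-max scale the remaining values to the range 0 to 1.
kept = [s for s in imputed if s not in outliers]
smallest, largest = min(kept), max(kept)
scaled = [(s - smallest) / (largest - smallest) for s in kept]

# Data validation: check every value against a predefined rule (0 to 100 here).
invalid = [s for s in imputed if not 0 <= s <= 100]

print("Imputed:", imputed)
print("Outliers:", outliers)
print("Mean with and without outliers:", round(mean_with, 1), round(mean_without, 1))
print("Values failing validation:", invalid)
```

Asking students to change the imputation from median to mean, and to observe how the single extreme score pulls the filled-in values upward, is one way to connect the vocabulary to the objective on outliers.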