Computer Science · 9th Grade · Data Intelligence and Visualization · Weeks 28-36

Data Cleaning and Preprocessing

Students will learn the necessity of cleaning data to ensure accuracy and handle missing or corrupted data.

TL;DR: Active learning works for data cleaning and preprocessing because students need to experience firsthand how messy data can obscure or reveal stories. When they see a poorly designed visualization fail to communicate, they understand why cleaning and preprocessing matter.

Common Core State Standards · CSTA: 3A-DA-11

About This Topic

Visualizing complex information is about turning raw numbers into a story that humans can understand. For 9th graders, this topic covers the principles of effective design, such as choosing the right chart type and using color and scale responsibly. This aligns with CSTA standards for communicating about data. Students learn that a well-designed visualization can make a complex trend instantly clear, while a poor one can hide the truth.

This topic also has a strong media literacy component. Students analyze how visualizations can be used to mislead or persuade an audience by manipulating the axes or omitting data. This skill is vital for being an informed citizen. Students grasp this concept faster through gallery walks where they critique real-world infographics and suggest improvements.

Key Questions

Explain how to handle missing or corrupted data in a large dataset.
Differentiate between various data cleaning techniques (e.g., imputation, outlier removal).
Construct a plan for cleaning a given messy dataset.

Learning Objectives

Identify common data errors such as missing values, duplicates, and incorrect data types in a given dataset.
Compare and contrast different strategies for handling missing data, including deletion and imputation.
Evaluate the impact of outliers on statistical measures and visualization, and select appropriate removal or transformation techniques.
Design a step-by-step plan to clean a provided messy dataset, documenting each cleaning action.
Critique the quality of a dataset based on its potential biases and inaccuracies before and after cleaning.
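To make the outlier objective concrete, here is a minimal sketch of how a single bad value shifts the mean far more than the median. The scores are made up for illustration, and the rule that valid scores fall between 0 and 100 is an assumption of this example.

```python
import statistics

# Hypothetical quiz scores; 950 is probably a typo for 95.
scores = [82, 88, 90, 85, 950]

mean_with = statistics.mean(scores)
median_with = statistics.median(scores)

# Assumption for this example: valid scores are between 0 and 100.
cleaned = [s for s in scores if 0 <= s <= 100]
mean_without = statistics.mean(cleaned)
median_without = statistics.median(cleaned)

print(f"with outlier:    mean={mean_with}, median={median_with}")
print(f"without outlier: mean={mean_without}, median={median_without}")
```

Students can compare the two printed lines and discuss why the median barely moves while the mean changes dramatically.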

Before You Start

Introduction to Data Types and Structures

Why: Students need to understand basic data types (numerical, categorical, boolean) and simple structures like tables to identify inconsistencies.

Basic Spreadsheet Operations

Why: Familiarity with sorting, filtering, and identifying patterns in spreadsheets provides a foundation for recognizing data issues.

Key Vocabulary

Missing Data: Values that are not recorded or present in a dataset. This can occur due to errors in data collection or entry.
Data Imputation: The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data.
Outlier: A data point that significantly differs from other observations in a dataset. Outliers can skew results and affect model performance.
Data Transformation: The process of changing the format, structure, or values of data. This can include scaling, normalization, or encoding categorical variables.
Data Validation: The process of checking data for accuracy and completeness. It ensures that data conforms to predefined rules and constraints.
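Three of the vocabulary terms above (outlier, data transformation, data validation) can be sketched in a few lines of Python. This is one possible illustration, not a standard implementation. The function names, the 1.5 × IQR rule of thumb, and the 0 to 120 age range are all assumptions chosen for the example.

```python
import statistics

def flag_outliers_iqr(values):
    """Return values more than 1.5 * IQR outside the quartiles (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def min_max_scale(values):
    """Transformation: rescale values to the 0-1 range (one form of normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical; avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

def is_valid_age(value):
    """Validation rule assumed for this example: age is a whole number from 0 to 120."""
    return isinstance(value, int) and 0 <= value <= 120

# Hypothetical student ages; 150 is almost certainly a data-entry error.
ages = [14, 15, 15, 14, 16, 15, 150]
print("outliers:", flag_outliers_iqr(ages))
print("scaled:", min_max_scale([10, 20, 30]))
print("valid ages:", [a for a in ages if is_valid_age(a)])
```

Note that the outlier check and the validation rule catch the same bad value here for different reasons: one compares it to the other data, the other compares it to a predefined constraint.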

Watch Out for These Misconceptions

Common Misconception: Charts are always objective and 'true.'

What to Teach Instead

Charts are interpretations of data made by humans. By creating their own 'misleading' charts, students learn how easy it is to change the narrative of the data without changing the numbers.

Common Misconception: The goal of a chart is to look pretty.

What to Teach Instead

The primary goal is clarity and accuracy; beauty is secondary. Peer critiques help students focus on whether the chart actually answers the question it was designed for.

Active Learning Ideas

Gallery Walk

The Good, The Bad, and The Misleading

Post various charts and infographics around the room. Students use a checklist to identify 'design wins' and 'design sins,' such as truncated Y-axes or confusing color schemes.

35 min · Individual

Inquiry Circle

Data Makeover

Give groups a poorly designed chart and the raw data it represents. They must work together to create a new, more accurate and persuasive visualization for a specific target audience.

45 min · Small Groups

Think-Pair-Share

Color and Perception

Show two versions of the same map: one using a 'stoplight' (red/green) scale and one using a blue/orange scale. Students discuss which is better for accessibility and how color changes the 'mood' of the data.

20 min · Pairs

Real-World Connections

Data scientists at Netflix analyze viewing data to identify missing user ratings or incorrect viewing times. They use imputation techniques to fill these gaps, ensuring accurate recommendations for movies and shows.
Financial analysts at investment firms clean large datasets of stock prices and trading volumes. They identify and handle outliers or missing entries that could distort performance metrics and investment strategies.
Public health researchers preparing data for disease outbreak analysis must clean patient records. They address missing demographic information or incorrect symptom entries to ensure accurate tracking and response to health crises.

Assessment Ideas

Quick Check

Present students with a small table containing 5-7 rows of sample data with clear errors (e.g., a missing age, a text value in a numerical column, a duplicate entry). Ask them to list the specific errors they find and suggest one method for correcting each.
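As an optional extension, students who finish early could check their answers against a short script. The sketch below assumes a made-up four-row table and a hypothetical helper named find_errors. It looks for the three error types mentioned in the Quick Check: a missing value, text in a numerical column, and a duplicate entry.

```python
# Hypothetical sample rows; names and values are invented for illustration.
rows = [
    {"name": "Ana", "age": "15", "score": "88"},
    {"name": "Ben", "age": "", "score": "92"},         # missing age
    {"name": "Cara", "age": "14", "score": "ninety"},  # text in a numeric column
    {"name": "Ana", "age": "15", "score": "88"},       # duplicate of the first row
]
NUMERIC_COLUMNS = ("age", "score")

def find_errors(rows, numeric_columns):
    """Return a list of (row_index, problem) tuples describing each error found."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # A row is a duplicate if every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
        for col in numeric_columns:
            value = row[col].strip()
            if value == "":
                problems.append((i, f"missing {col}"))
                continue
            try:
                float(value)
            except ValueError:
                problems.append((i, f"non-numeric {col}: {value!r}"))
    return problems

errors = find_errors(rows, NUMERIC_COLUMNS)
for index, problem in errors:
    print(f"row {index}: {problem}")
```

The script only finds the errors. Deciding how to correct each one is still the student's job.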

Discussion Prompt

Pose the scenario: 'Imagine you are cleaning a dataset of student test scores, and you find that 10% of the scores are missing. What are at least two different approaches you could take to handle these missing scores, and what are the pros and cons of each approach?' Facilitate a class discussion on their responses.
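To ground the discussion, the two most likely student answers (deletion and mean imputation) can be compared on a small example. The scores below are invented, with 1 of 10 missing to match the 10% in the scenario. This is a sketch for classroom discussion, not a recommendation of either approach.

```python
import statistics

# Hypothetical test scores; None marks the missing score (1 of 10, i.e. 10%).
scores = [78, 85, None, 92, 67, 88, 74, 95, 81, 96]

# Approach 1: deletion. Drop the missing entry and analyze what remains.
kept = [s for s in scores if s is not None]

# Approach 2: mean imputation. Fill the gap with the mean of the known scores.
fill_value = statistics.mean(kept)
imputed = [fill_value if s is None else s for s in scores]

print(f"deletion:   n={len(kept)}, mean={statistics.mean(kept)}, "
      f"spread={statistics.pstdev(kept):.2f}")
print(f"imputation: n={len(imputed)}, mean={statistics.mean(imputed)}, "
      f"spread={statistics.pstdev(imputed):.2f}")
```

Deletion keeps only real measurements but shrinks the dataset. Mean imputation keeps every row and leaves the mean unchanged, but it makes the scores look less spread out than they really are, because the filled-in value sits exactly at the center.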

Exit Ticket

Provide students with a brief description of a messy dataset (e.g., 'customer purchase history with some missing product IDs and inconsistent date formats'). Ask them to write down three specific cleaning steps they would perform on this data and the order in which they would perform them.
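One possible three-step answer to the exit ticket is sketched below, with each cleaning action logged as it happens. The records, field names, and accepted date formats are all assumptions made for this example. In particular, it assumes slash-separated dates are month/day/year, which students should be prompted to question.

```python
from datetime import datetime

# Hypothetical purchase records; field names and formats are assumptions.
purchases = [
    {"product_id": "P100", "date": "2024-03-05"},
    {"product_id": "", "date": "03/07/2024"},
    {"product_id": "P205", "date": "March 9, 2024"},
    {"product_id": "P100", "date": "03/05/2024"},  # same purchase as row 0
]
# Assumed formats: ISO, US-style month/day/year, and a written-out month name.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")

def parse_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

cleaned, log, seen = [], [], set()
for i, row in enumerate(purchases):
    # Step 1: standardize the date format.
    date = parse_date(row["date"])
    if date != row["date"]:
        log.append(f"row {i}: date {row['date']!r} -> {date!r}")
    # Step 2: mark a blank product ID as missing instead of guessing one.
    product_id = row["product_id"].strip() or None
    if product_id is None:
        log.append(f"row {i}: product_id missing")
    # Step 3: drop duplicates. This runs last so reformatted dates can match.
    key = (product_id, date)
    if key in seen:
        log.append(f"row {i}: duplicate removed")
        continue
    seen.add(key)
    cleaned.append({"product_id": product_id, "date": date})

for entry in log:
    print(entry)
```

The order matters here: if duplicates were removed before the dates were standardized, rows 0 and 3 would not look identical and the duplicate would survive.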

Frequently Asked Questions

When should I use a bar chart vs. a line chart?

Use a bar chart to compare different categories (like apple sales vs. orange sales). Use a line chart to show how something changes over time (like temperature throughout the day).

How can a chart be misleading?

A chart can be misleading if the vertical axis doesn't start at zero, if the scale is inconsistent, or if the designer uses 3D effects that distort the size of the data points.

What is multidimensional data?

This is data that has many different variables. For example, a dataset about cars might include price, fuel efficiency, horsepower, and safety rating. Visualizing all of these at once requires creative design choices.

What are the best hands-on strategies for teaching data visualization?

The 'Data Makeover' approach is very effective. When students have to take a 'broken' visualization and fix it, they are forced to apply design principles in a practical way. This active problem-solving helps them internalize the rules of clarity and honesty in data communication.

Edited by Adriana Perusin, Editor-in-Chief, Flip Education