Computer Science · 9th Grade · Data Intelligence and Visualization · Weeks 28-36

Data Cleaning and Preprocessing

Students will learn why data must be cleaned to ensure accuracy and how to handle missing or corrupted values.

Common Core State Standards: CSTA 3A-DA-11

About This Topic

Visualizing complex information is about turning raw numbers into a story that humans can understand. For 9th graders, this topic covers the principles of effective design, such as choosing the right chart type and using color and scale responsibly. This aligns with CSTA standards for communicating about data. Students learn that a well-designed visualization can make a complex trend instantly clear, while a poor one can hide the truth.

This topic also has a strong media literacy component. Students analyze how visualizations can be used to mislead or persuade an audience by manipulating the axes or omitting data. This skill is vital for being an informed citizen. Students grasp this concept faster through gallery walks where they critique real-world infographics and suggest improvements.

Key Questions

  1. Explain how to handle missing or corrupted data in a large dataset.
  2. Differentiate between various data cleaning techniques (e.g., imputation, outlier removal).
  3. Construct a plan for cleaning a given messy dataset.

Learning Objectives

  • Identify common data errors such as missing values, duplicates, and incorrect data types in a given dataset.
  • Compare and contrast different strategies for handling missing data, including deletion and imputation.
  • Evaluate the impact of outliers on statistical measures and visualization, and select appropriate removal or transformation techniques.
  • Design a step-by-step plan to clean a provided messy dataset, documenting each cleaning action.
  • Critique the quality of a dataset based on its potential biases and inaccuracies before and after cleaning.
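To make the outlier objective concrete, here is a minimal sketch of how a single bad value shifts the mean far more than the median. The quiz scores are made up for illustration, and the 1.5 × IQR cutoff is one common rule of thumb rather than the only valid choice.

```python
import statistics

# Hypothetical quiz scores; 950 is probably a typo for 95.
scores = [72, 85, 90, 78, 88, 950, 81]

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(scores)
cleaned = [v for v in scores if v not in outliers]

# Compare how each statistic reacts to the outlier.
print("mean before:", round(statistics.mean(scores), 1))
print("median before:", statistics.median(scores))
print("outliers found:", outliers)
print("mean after:", round(statistics.mean(cleaned), 1))
print("median after:", statistics.median(cleaned))
```

Students can discuss whether 950 should be removed or corrected to 95. The code only flags the value, and the decision still belongs to the person cleaning the data.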

Before You Start

Introduction to Data Types and Structures

Why: Students need to understand basic data types (numerical, categorical, boolean) and simple structures like tables to identify inconsistencies.

Basic Spreadsheet Operations

Why: Familiarity with sorting, filtering, and identifying patterns in spreadsheets provides a foundation for recognizing data issues.

Key Vocabulary

Missing Data: Values that are not recorded or present in a dataset. This can occur due to errors in data collection or entry.
Data Imputation: The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data.
Outlier: A data point that significantly differs from other observations in a dataset. Outliers can skew results and affect model performance.
Data Transformation: The process of changing the format, structure, or values of data. This can include scaling, normalization, or encoding categorical variables.
Data Validation: The process of checking data for accuracy and completeness. It ensures that data conforms to predefined rules and constraints.
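The vocabulary above can be sketched in a few lines of Python. This is an illustrative example only: the `ages` list, the function names, and the 13 to 19 age range are assumptions made for the demonstration, not part of any standard.

```python
import statistics

# Hypothetical student ages; None marks a missing value.
ages = [14, 15, None, 14, 16, None, 14]

def impute(values, strategy="median"):
    """Data imputation: fill missing entries with the mean, median, or mode."""
    known = [v for v in values if v is not None]
    summarize = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = summarize(known)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Data transformation: rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def is_valid_age(value, low=13, high=19):
    """Data validation: accept only whole numbers in an assumed plausible range."""
    return isinstance(value, int) and low <= value <= high

filled = impute(ages, "median")
print(filled)
print(min_max_scale(filled))
print([is_valid_age(v) for v in (14, "14", 140)])
```

Switching `strategy` between `"mean"`, `"median"`, and `"mode"` lets students see that each imputation method can produce a different fill value for the same gaps.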

Watch Out for These Misconceptions

Common Misconception: Charts are always objective and 'true.'

What to Teach Instead

Charts are interpretations of data made by humans. By creating their own 'misleading' charts, students learn how easy it is to change the narrative of the data without changing the numbers.

Common Misconception: The goal of a chart is to look pretty.

What to Teach Instead

The primary goal is clarity and accuracy; beauty is secondary. Peer critiques help students focus on whether the chart actually answers the question it was designed for.

Real-World Connections

  • Data scientists at streaming services such as Netflix analyze viewing data to find missing user ratings or incorrect viewing times. Filling these gaps with imputation techniques helps keep movie and show recommendations accurate.
  • Financial analysts at investment firms clean large datasets of stock prices and trading volumes. They identify and handle outliers or missing entries that could distort performance metrics and investment strategies.
  • Public health researchers preparing data for disease outbreak analysis must clean patient records. They address missing demographic information or incorrect symptom entries to ensure accurate tracking and response to health crises.

Assessment Ideas

Quick Check

Present students with a small table containing 5-7 rows of sample data with clear errors (e.g., a missing age, a text value in a numerical column, a duplicate entry). Ask them to list the specific errors they find and suggest one method for correcting each.
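A table like the one described in this Quick Check could be checked automatically. The sketch below uses invented rows and a hypothetical `find_errors` helper; it assumes every cell arrives as text, as it would from a typical CSV export.

```python
# Hypothetical sample table with three planted errors.
rows = [
    {"name": "Ava",  "age": "15",       "grade": "9"},
    {"name": "Ben",  "age": "",         "grade": "9"},  # missing age
    {"name": "Cara", "age": "fourteen", "grade": "9"},  # text in a numeric column
    {"name": "Dev",  "age": "14",       "grade": "9"},
    {"name": "Ava",  "age": "15",       "grade": "9"},  # duplicate of the first row
]

def find_errors(rows, numeric_columns):
    """Return (row index, problem) pairs for duplicates, blanks, and non-numbers."""
    errors = []
    seen = set()
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            errors.append((index, "duplicate row"))
            continue
        seen.add(key)
        for column in numeric_columns:
            value = row[column].strip()
            if value == "":
                errors.append((index, f"missing {column}"))
            elif not value.isdigit():
                errors.append((index, f"non-numeric {column}"))
    return errors

for index, problem in find_errors(rows, ["age", "grade"]):
    print(f"row {index}: {problem}")
```

Students can compare the errors they spotted by eye with the program's list, then propose a fix for each one.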

Discussion Prompt

Pose the scenario: 'Imagine you are cleaning a dataset of student test scores, and you find that 10% of the scores are missing. What are at least two different approaches you could take to handle these missing scores, and what are the pros and cons of each approach?' Facilitate a class discussion on their responses.
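Two of the approaches students are likely to raise, deletion and median imputation, can be compared directly. The scores below are invented, with one of ten missing to match the 10% in the scenario.

```python
import statistics

# Hypothetical class of 10 test scores; one (10%) is missing.
scores = [88, 92, None, 75, 64, 81, 97, 70, 85, 79]

def drop_missing(values):
    """Deletion: keep only the scores that were actually recorded."""
    return [v for v in values if v is not None]

def fill_with_median(values):
    """Imputation: replace each missing score with the median of the rest."""
    median = statistics.median(drop_missing(values))
    return [median if v is None else v for v in values]

deleted = drop_missing(scores)
imputed = fill_with_median(scores)

for label, data in (("deletion", deleted), ("imputation", imputed)):
    print(
        label,
        "count:", len(data),
        "mean:", round(statistics.mean(data), 2),
        "spread:", round(statistics.stdev(data), 2),
    )
```

Deletion shrinks the dataset but uses only real scores. Imputation keeps every student in the table, but the filled-in value sits in the middle of the data, so the spread looks slightly smaller than it really is.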

Exit Ticket

Provide students with a brief description of a messy dataset (e.g., 'customer purchase history with some missing product IDs and inconsistent date formats'). Ask them to write down three specific cleaning steps they would perform on this data and the order in which they would perform them.
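One possible answer to this Exit Ticket, written as code, might look like the sketch below. The purchase rows, the list of accepted date formats, and the `UNKNOWN` placeholder are all assumptions chosen for illustration. A real dataset would need its own rules.

```python
from datetime import datetime

# Assumed input formats. "03/07/2024" is read month-first (March 7),
# which is itself a cleaning decision worth documenting.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")

def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Hypothetical messy purchase history.
purchases = [
    {"product_id": "P100", "date": "2024-03-05"},
    {"product_id": "",     "date": "03/07/2024"},
    {"product_id": "P102", "date": "09.03.2024"},
    {"product_id": "P103", "date": "next Tuesday"},
]

def clean(purchases):
    """Step 1: standardize dates. Step 2: mark missing IDs. Log every action."""
    cleaned, log = [], []
    for index, row in enumerate(purchases):
        date = normalize_date(row["date"])
        if date is None:
            log.append(f"row {index}: unreadable date {row['date']!r}, dropped")
            continue
        product_id = row["product_id"].strip() or "UNKNOWN"
        if product_id == "UNKNOWN":
            log.append(f"row {index}: missing product_id, marked UNKNOWN")
        cleaned.append({"product_id": product_id, "date": date})
    return cleaned, log

cleaned, log = clean(purchases)
print(cleaned)
print(log)
```

The log records what was changed and why, which mirrors the expectation that students document each cleaning action.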

Frequently Asked Questions

When should I use a bar chart vs. a line chart?
Use a bar chart to compare different categories (like apple sales vs. orange sales). Use a line chart to show how something changes over time (like temperature throughout the day).
How can a chart be misleading?
A chart can be misleading if the vertical axis doesn't start at zero, if the scale is inconsistent, or if the designer uses 3D effects that distort the size of the data points.
What is multidimensional data?
This is data that has many different variables. For example, a dataset about cars might include price, fuel efficiency, horsepower, and safety rating. Visualizing all of these at once requires creative design choices.
What are the best hands-on strategies for teaching data visualization?
The 'Data Makeover' approach is very effective. When students have to take a 'broken' visualization and fix it, they are forced to apply design principles in a practical way. This active problem-solving helps them internalize the rules of clarity and honesty in data communication.
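The point above about a vertical axis that doesn't start at zero can be shown with simple arithmetic. This sketch uses a hypothetical `visual_ratio` helper and invented values of 50 and 55 to compare how tall two bars appear relative to each other.

```python
def visual_ratio(small, large, axis_start=0):
    """Ratio of drawn bar heights when the vertical axis begins at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must be below the smaller value")
    return (large - axis_start) / (small - axis_start)

# Honest axis: the taller bar looks 1.1 times as tall, matching the data.
print(visual_ratio(50, 55))

# Truncated axis starting at 48: the same data now looks 3.5 times as tall.
print(visual_ratio(50, 55, axis_start=48))
```

The numbers never change, only the starting point of the axis, yet a 10% difference is drawn as if one value were more than triple the other.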