Data Cleaning and Preprocessing
Students will learn why data must be cleaned to ensure accuracy, and how to handle missing or corrupted data.
About This Topic
Visualizing complex information is about turning raw numbers into a story that humans can understand. For 9th graders, this topic covers the principles of effective design, such as choosing the right chart type and using color and scale responsibly. This aligns with CSTA standards for communicating about data. Students learn that a well-designed visualization can make a complex trend instantly clear, while a poor one can hide the truth.
This topic also has a strong media literacy component. Students analyze how visualizations can be used to mislead or persuade an audience by manipulating the axes or omitting data. This skill is vital for being an informed citizen. Students grasp this concept faster through gallery walks where they critique real-world infographics and suggest improvements.
Key Questions
- Explain how to handle missing or corrupted data in a large dataset.
- Differentiate between various data cleaning techniques (e.g., imputation, outlier removal).
- Construct a plan for cleaning a given messy dataset.
Learning Objectives
- Identify common data errors such as missing values, duplicates, and incorrect data types in a given dataset.
- Compare and contrast different strategies for handling missing data, including deletion and imputation.
- Evaluate the impact of outliers on statistical measures and visualization, and select appropriate removal or transformation techniques.
- Design a step-by-step plan to clean a provided messy dataset, documenting each cleaning action.
- Critique the quality of a dataset based on its potential biases and inaccuracies before and after cleaning.
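To make the objective about outliers concrete, the following is a minimal sketch of how a single extreme value shifts the mean far more than the median. The scores are hypothetical, and the 1.5 × IQR cutoff (often called Tukey's rule) is one common convention for flagging outliers, not the only valid one.

```python
from statistics import mean, median, quantiles

# Hypothetical test scores; 980 is probably a typo for 98.
scores = [72, 85, 90, 68, 77, 88, 980]


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(scores)
cleaned = [v for v in scores if v not in outliers]

# The mean moves a lot when the outlier is removed; the median barely changes.
print("mean before:", round(mean(scores), 1), "| median before:", median(scores))
print("flagged outliers:", outliers)
print("mean after:", mean(cleaned), "| median after:", median(cleaned))
```

Students can swap in their own numbers and discuss whether a flagged value is an error to correct or a genuine observation to keep.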
Before You Start
Why: Students need to understand basic data types (numerical, categorical, boolean) and simple structures like tables to identify inconsistencies.
Why: Familiarity with sorting, filtering, and identifying patterns in spreadsheets provides a foundation for recognizing data issues.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Missing Data | Values that are not recorded or present in a dataset. This can occur due to errors in data collection or entry. |
| Data Imputation | The process of replacing missing data points with substituted values. Common methods include using the mean, median, or mode of the existing data. |
| Outlier | A data point that significantly differs from other observations in a dataset. Outliers can skew results and affect model performance. |
| Data Transformation | The process of changing the format, structure, or values of data. This can include scaling, normalization, or encoding categorical variables. |
| Data Validation | The process of checking data for accuracy and completeness. It ensures that data conforms to predefined rules and constraints. |
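The definition of data imputation can be illustrated with a short sketch. It assumes missing values are stored as `None` and uses a hypothetical helper named `impute`; the ages are made up for the example.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages with two missing entries.
ages = [14, 15, None, 14, 16, None, 14]

print("mean imputation:  ", impute(ages, mean))
print("median imputation:", impute(ages, median))
print("mode imputation:  ", impute(ages, mode))
```

Comparing the three outputs shows that the choice of statistic changes the filled-in value, which is why the method used should be documented.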
Watch Out for These Misconceptions
Common Misconception: Charts are always objective and 'true.'
What to Teach Instead
Charts are interpretations of data made by humans. By creating their own 'misleading' charts, students learn how easy it is to change the narrative of the data without changing the numbers.
Common Misconception: The goal of a chart is to look pretty.
What to Teach Instead
The primary goal is clarity and accuracy; beauty is secondary. Peer critiques help students focus on whether the chart actually answers the question it was designed for.
Active Learning Ideas
Gallery Walk: The Good, The Bad, and The Misleading
Post various charts and infographics around the room. Students use a checklist to identify 'design wins' and 'design sins,' such as truncated Y-axes or confusing color schemes.
Inquiry Circle: Data Makeover
Give groups a poorly designed chart and the raw data it represents. They must work together to create a new, more accurate and persuasive visualization for a specific target audience.
Think-Pair-Share: Color and Perception
Show two versions of the same map: one using a 'stoplight' (red/green) scale and one using a blue/orange scale. Students discuss which is better for accessibility and how color changes the 'mood' of the data.
Real-World Connections
- Data scientists at Netflix analyze viewing data to identify missing user ratings or incorrect viewing times. They use imputation techniques to fill these gaps, ensuring accurate recommendations for movies and shows.
- Financial analysts at investment firms clean large datasets of stock prices and trading volumes. They identify and handle outliers or missing entries that could distort performance metrics and investment strategies.
- Public health researchers preparing data for disease outbreak analysis must clean patient records. They address missing demographic information or incorrect symptom entries to ensure accurate tracking and response to health crises.
Assessment Ideas
Present students with a small table containing 5-7 rows of sample data with clear errors (e.g., a missing age, a text value in a numerical column, a duplicate entry). Ask them to list the specific errors they find and suggest one method for correcting each.
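A table like the one described could be checked programmatically. The sketch below is illustrative only: the rows, the column names, and the `find_errors` helper are all hypothetical, and every value is stored as text, as it would be after reading a CSV file.

```python
# Hypothetical sample table with three planted errors.
rows = [
    {"name": "Ana", "age": "15", "score": "88"},
    {"name": "Ben", "age": "", "score": "92"},        # missing age
    {"name": "Cal", "age": "14", "score": "ninety"},  # text in a numeric column
    {"name": "Ana", "age": "15", "score": "88"},      # duplicate of the first row
    {"name": "Dee", "age": "16", "score": "75"},
]
NUMERIC_COLUMNS = ("age", "score")


def find_errors(table):
    """Return (row_index, description) pairs for each problem found."""
    errors = []
    seen = set()
    for index, row in enumerate(table):
        key = tuple(sorted(row.items()))
        if key in seen:
            errors.append((index, "duplicate row"))
            continue
        seen.add(key)
        for column in NUMERIC_COLUMNS:
            value = row[column].strip()
            if value == "":
                errors.append((index, f"missing {column}"))
            elif not value.isdigit():
                errors.append((index, f"non-numeric {column}"))
    return errors


for index, problem in find_errors(rows):
    print(f"row {index}: {problem}")
```

The script only finds the errors; deciding how to correct each one is still the student's task.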
Pose the scenario: 'Imagine you are cleaning a dataset of student test scores, and you find that 10% of the scores are missing. What are at least two different approaches you could take to handle these missing scores, and what are the pros and cons of each approach?' Facilitate a class discussion on their responses.
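Two of the approaches students are likely to propose, deletion and mean imputation, can be compared side by side. This is a minimal sketch with ten hypothetical scores, one of which (10%) is missing and stored as `None`.

```python
from statistics import mean, pstdev

# Hypothetical test scores; one of ten is missing.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, None]

# Approach 1: deletion. Drop the missing entry and keep only observed scores.
kept = [s for s in scores if s is not None]

# Approach 2: mean imputation. Fill the gap with the mean of the observed scores.
fill = mean(kept)
imputed = [fill if s is None else s for s in scores]

for label, data in (("deletion", kept), ("mean imputation", imputed)):
    print(f"{label}: n={len(data)}, mean={mean(data)}, spread={pstdev(data):.2f}")
```

Deletion shrinks the dataset, while mean imputation keeps every row but makes the scores look less spread out than they really are. Both trade-offs are useful discussion points.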
Provide students with a brief description of a messy dataset (e.g., 'customer purchase history with some missing product IDs and inconsistent date formats'). Ask them to write down three specific cleaning steps they would perform on this data and the order in which they would perform them.
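One possible ordering of cleaning steps for the purchase-history example is sketched below. The records, the three date formats, and the decision to drop rather than repair incomplete rows are all assumptions made for illustration; a date such as 03/06/2024 is treated as month/day/year here, which would need to be confirmed against the real data source.

```python
from datetime import datetime

# Assumed date formats present in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(purchases):
    """Apply the cleaning steps in order and log every action taken."""
    cleaned, log = [], []
    for index, row in enumerate(purchases):
        # Step 1: drop rows with no product ID, since they cannot be matched.
        if not row["product_id"].strip():
            log.append(f"row {index}: dropped, missing product_id")
            continue
        # Step 2: convert every date to one standard format.
        iso = normalize_date(row["date"])
        if iso is None:
            log.append(f"row {index}: dropped, unreadable date {row['date']!r}")
            continue
        if iso != row["date"]:
            log.append(f"row {index}: date {row['date']!r} rewritten as {iso}")
        cleaned.append({"product_id": row["product_id"].strip(), "date": iso})
    return cleaned, log


# Hypothetical messy purchase history.
purchases = [
    {"product_id": "P100", "date": "2024-03-05"},
    {"product_id": "", "date": "03/06/2024"},
    {"product_id": "P200", "date": "07.03.2024"},
    {"product_id": "P300", "date": "sometime in March"},
]

cleaned, log = clean(purchases)
print(cleaned)
print("\n".join(log))
```

The log doubles as the documentation of each cleaning action, so a reader can see what was changed and why.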