Computer Science · Grade 9 · Data and Digital Representation · Term 2

Data Cleaning and Preprocessing

Students will learn about the importance of cleaning and preparing data for analysis.

Ontario Curriculum Expectations: CS.HS.DA.4, CS.HS.S.2

About This Topic

Data collection and visualization are about turning raw numbers into stories. In the Ontario Grade 9 curriculum, students learn to gather data responsibly and use tools to create charts and graphs that reveal hidden trends. This topic is central to the Software Development and Computer Environments strands, as it connects technical skills with critical thinking.

Students also explore the ethics of data, including how historical data collection in Canada has sometimes been used to marginalize groups, such as through the residential school system's record-keeping. By learning to visualize data accurately, students can advocate for social change and better understand the world around them. This topic comes alive when students can collect their own data from the school community and present it in a gallery walk.

Key Questions

  1. Explain why data cleaning is a crucial step before data analysis.
  2. Analyze common types of data errors and inconsistencies.
  3. Design a strategy to address missing or erroneous data in a given dataset.

Learning Objectives

  • Explain the necessity of data cleaning for accurate and reliable data analysis.
  • Identify common data errors such as missing values, outliers, and inconsistent formats.
  • Design a systematic approach to detect and correct errors in a given dataset.
  • Evaluate different strategies for handling missing data, considering potential biases.
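The last objective can be made concrete with a short comparison. The following is a minimal sketch, assuming a hypothetical list of quiz scores in which `None` marks a missing value; it compares two common strategies (dropping the gaps versus filling them with the mean) and is one possible illustration, not a prescribed method.

```python
from statistics import mean, pstdev

# Hypothetical quiz scores; None marks a student who did not submit.
scores = [55, 60, None, 90, None, 65]

# Strategy 1: drop the missing values and analyze only what was recorded.
dropped = [s for s in scores if s is not None]

# Strategy 2: fill each gap with the mean of the known values.
fill_value = mean(dropped)
filled = [fill_value if s is None else s for s in scores]

# Compare the average and the spread (population standard deviation).
print("dropped:", len(dropped), "values, mean", mean(dropped),
      "spread", round(pstdev(dropped), 2))
print("filled: ", len(filled), "values, mean", mean(filled),
      "spread", round(pstdev(filled), 2))
```

Filling with the mean keeps the average the same but makes the data look less spread out than it really is. Both strategies can also be biased if the students who did not submit differ from those who did, which is a useful point for class discussion.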

Before You Start

Introduction to Data Types

Why: Students need to understand different data types (numerical, categorical) to identify relevant errors and apply appropriate cleaning methods.

Basic Spreadsheet Operations

Why: Familiarity with spreadsheets is helpful for practical application of data cleaning techniques, such as sorting and filtering.

Key Vocabulary

Data Cleaning: The process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It ensures data quality for analysis.
Missing Data: Values that are not recorded or present in a dataset. Handling missing data is crucial to avoid skewed results.
Outlier: A data point that differs significantly from other observations. Outliers can be due to measurement error or represent genuine extreme values.
Data Inconsistency: When data values that should be the same are different, such as variations in spelling or formatting for the same category.
Data Validation: The process of ensuring data is accurate, complete, and conforms to defined rules or constraints before analysis.
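
The vocabulary above can be illustrated with a small sketch. The dataset, the function names, the 0 to 24 hour validation rule, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration; other rules may suit a given dataset better.

```python
import statistics

# Hypothetical survey rows: (name, hours of sleep, city). None marks a missing value.
rows = [
    ("Ana", 8.0, "Toronto"),
    ("Ben", None, "toronto"),
    ("Cal", 7.5, "Ottawa"),
    ("Dee", 7.0, "Toronto "),
    ("Eli", 75.0, "Ottawa"),  # possibly a typo for 7.5
    ("Fay", 6.5, "Ottawa"),
    ("Gus", 8.5, "Toronto"),
]

def find_missing(rows):
    """Missing data: names of rows with no recorded hours."""
    return [name for name, hours, _ in rows if hours is None]

def find_inconsistent(rows):
    """Data inconsistency: city differs from its trimmed, title-case form."""
    return [name for name, _, city in rows if city != city.strip().title()]

def fails_validation(rows):
    """Data validation: hours of sleep must fall between 0 and 24."""
    return [name for name, hours, _ in rows
            if hours is not None and not (0 <= hours <= 24)]

def find_outliers(rows):
    """Outliers: values more than 1.5 * IQR outside the middle half of the data."""
    values = [hours for _, hours, _ in rows if hours is not None]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [name for name, hours, _ in rows
            if hours is not None and not (low <= hours <= high)]

print("Missing:", find_missing(rows))
print("Inconsistent:", find_inconsistent(rows))
print("Fails validation:", fails_validation(rows))
print("Outliers:", find_outliers(rows))
```

Each function only flags rows. Deciding whether to correct, remove, or keep a flagged value is still a human judgement, since an outlier may be a genuine extreme value.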

Watch Out for These Misconceptions

Common Misconception: Data is always objective and 'true'.

What to Teach Instead

Data is collected by people and can contain bias. Structured debates about how a survey question is worded help students see how the collection process itself can influence the results.

Common Misconception: Any chart can work for any data.

What to Teach Instead

Different data types require different visualizations (e.g., pie charts for parts of a whole, line graphs for trends over time). Peer review sessions where students justify their choice of chart help reinforce this.

Real-World Connections

  • Public health researchers at Health Canada clean datasets of reported illnesses to accurately track disease outbreaks and inform public health policies. Inaccurate data could lead to misallocation of resources or delayed responses.
  • Financial analysts at major banks meticulously clean transaction data to detect fraudulent activities and ensure the integrity of financial reports. Errors in this data could result in significant financial losses.
  • Urban planners use cleaned demographic data from Statistics Canada to design effective public services and infrastructure. Inconsistent or missing data could lead to services not meeting community needs.

Assessment Ideas

Exit Ticket

Provide students with a small, messy dataset (e.g., a list of student heights with some missing values and inconsistent units). Ask them to list two specific problems they observe and propose one method to address each problem.
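
For teachers who want to extend this exit ticket into code, here is a minimal sketch of one way to handle such a dataset. The sample entries, the `to_cm` helper, and the set of accepted unit formats are hypothetical; entries with no unit are deliberately flagged rather than guessed.

```python
import re

# Hypothetical raw entries, as students might type them.
raw_heights = ["165 cm", "1.70 m", "", "172cm", "5 ft 6 in", None, "158"]

def to_cm(entry):
    """Return the height in centimetres, or None if missing or unreadable."""
    if entry is None or entry.strip() == "":
        return None  # missing value
    text = entry.strip().lower()

    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if match:
        return float(match.group(1))

    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*m", text)
    if match:
        return round(float(match.group(1)) * 100, 1)

    match = re.fullmatch(r"(\d+)\s*ft\s*(\d+)\s*in", text)
    if match:
        feet, inches = int(match.group(1)), int(match.group(2))
        return round(feet * 30.48 + inches * 2.54, 1)

    return None  # no recognizable unit: flag for follow-up instead of guessing

cleaned = [to_cm(h) for h in raw_heights]
needs_review = [h for h, c in zip(raw_heights, cleaned) if c is None]

print("Cleaned (cm):", cleaned)
print("Needs review:", needs_review)
```

Returning `None` for "158" is a design choice: the value is probably centimetres, but recording that assumption explicitly is safer than converting silently.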

Quick Check

Present students with a scenario: "A survey collected responses about favorite colors, but some entries are 'blue', 'Blue', and 'blu'. What type of data error is this, and how would you standardize it?" Use the responses to gauge understanding of inconsistency and standardization.
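
One possible answer to this scenario can be sketched in code. The list of valid colors, the `standardize` helper, and the 0.8 similarity cutoff are assumptions for illustration. The sketch uses `difflib.get_close_matches` from the Python standard library to catch near-misses such as 'blu'.

```python
from difflib import get_close_matches

# Assumed list of allowed answers for the survey question.
VALID_COLORS = ["blue", "red", "green", "yellow"]

def standardize(entry):
    """Return the matching valid color, or None if no confident match exists."""
    text = entry.strip().lower()  # fixes case and stray spaces
    if text in VALID_COLORS:
        return text
    # Look for a close spelling; a cutoff of 0.8 is an arbitrary, fairly strict choice.
    matches = get_close_matches(text, VALID_COLORS, n=1, cutoff=0.8)
    return matches[0] if matches else None

responses = ["blue", "Blue", "blu", " RED ", "purple"]
cleaned = [standardize(r) for r in responses]
print(cleaned)
```

Automatic spelling correction can guess wrong, so entries that were changed or left as `None` are worth reviewing by hand.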

Discussion Prompt

Facilitate a class discussion using the prompt: "Imagine you are cleaning data for a survey on student opinions about school lunches. One question asks for a rating from 1 to 5, but some students wrote 'good' or 'great'. What are the implications of these non-numeric responses for your analysis, and what are your options for handling them?"
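
To support the discussion, the following sketch compares two of the options: treating word answers as missing, or mapping them to numbers. The sample responses and the word-to-number mapping are hypothetical, and the mapping itself is a judgement call that students may reasonably dispute.

```python
VALID_RATINGS = {"1", "2", "3", "4", "5"}
WORD_TO_RATING = {"good": 4, "great": 5}  # hypothetical mapping, open to debate

def parse_rating(entry, map_words=False):
    """Return an integer rating from 1 to 5, or None if the entry is unusable."""
    text = str(entry).strip().lower()
    if text in VALID_RATINGS:
        return int(text)
    if map_words and text in WORD_TO_RATING:
        return WORD_TO_RATING[text]
    return None  # out of range or not recognized

def average(ratings):
    """Average of the usable ratings, ignoring None."""
    usable = [r for r in ratings if r is not None]
    return sum(usable) / len(usable)

# Hypothetical survey responses, including words and an out-of-range value.
responses = ["4", "good", "5", "great", "2", "7", "3"]

strict = [parse_rating(r) for r in responses]
mapped = [parse_rating(r, map_words=True) for r in responses]

print("Treat words as missing:", strict, "average", round(average(strict), 2))
print("Map words to numbers:  ", mapped, "average", round(average(mapped), 2))
```

The two averages differ, which shows that a cleaning decision can change the conclusion drawn from the same survey.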

Frequently Asked Questions

What is data visualization?
Data visualization is the graphical representation of information. By using visual elements like charts, graphs, and maps, it helps people see and understand patterns, outliers, and trends in data.
Why is data ethics part of computer science?
Because data can affect people's lives. In Ontario, we teach students to consider privacy, consent, and bias to ensure that the technology they build or use is fair and responsible.
How can active learning help students understand data visualization?
Active learning allows students to become 'data detectives.' By creating their own visualizations and critiquing others, they move from being passive consumers of information to active analysts who can spot manipulation and bias.
What tools are best for Grade 9 data projects?
Spreadsheet software like Google Sheets or Excel is great for basics. For more creative projects, students can use online infographic tools or even physical materials like string and beads to create 'analog' visualizations.