Technologies · Year 9 · Data Analytics and Visualization · Term 2

Data Cleaning and Preprocessing

Techniques for identifying and handling missing, inconsistent, or erroneous data to ensure data quality for analysis.

TL;DR: Active learning works for data cleaning and preprocessing because students must engage directly with messy datasets to understand why clean data matters. When students manipulate real-world data, they experience firsthand how poor data quality distorts insights, making abstract concepts like outliers and missing values tangible and memorable.

ACARA Content Descriptions: AC9DT10P01

About This Topic

Data visualization is the art and science of making data understandable. For Year 9 students, this topic involves more than just making charts; it is about choosing the right visual representation to communicate a specific message or trend. This aligns with AC9DT10P01, where students are required to interpret and visualize data to create information. They explore how different visualizations can highlight outliers, show correlations, or even manipulate an audience's perception.

Being able to critically evaluate a graph is a vital life skill. Students look at real-world datasets, such as Australian census data or climate trends in the Asia-Pacific region, to practice their skills. This topic is highly effective when students can engage in peer critique and collaborative design. Students grasp this concept faster through structured discussion and peer explanation of their visual choices.

Key Questions

Explain the impact of 'dirty data' on the accuracy of analytical results.
Differentiate between various data cleaning techniques and their appropriate uses.
Construct a plan to preprocess a given dataset for analysis.

Learning Objectives

Identify types of data errors, including missing values, inconsistencies, and outliers, within a given dataset.
Compare and contrast different data cleaning techniques, such as imputation, outlier removal, and data standardization, for suitability to specific error types.
Construct a step-by-step plan to preprocess a provided dataset, justifying the chosen cleaning methods.
Evaluate the impact of 'dirty data' on the accuracy and reliability of analytical results using a case study.
Demonstrate the application of at least two data cleaning techniques using a spreadsheet or data analysis tool.

Before You Start

Introduction to Data Types and Structures

Why: Students need to understand basic data organization (tables, columns, rows) and different data types (numerical, categorical) to identify inconsistencies.

Basic Spreadsheet Operations

Why: Familiarity with sorting, filtering, and basic formulas in spreadsheet software is beneficial for practical application of cleaning techniques.

Key Vocabulary

Dirty Data	Data that contains errors, inaccuracies, or inconsistencies, making it unreliable for analysis.
Missing Values	Data points that are absent in a dataset. These can be handled through imputation or removal.
Outliers	Data points that significantly differ from other observations in a dataset. They may indicate errors or unusual events.
Data Imputation	The process of replacing missing values with substitutes, such as the mean, median, or mode of the known values in the same column.
Data Standardization	The process of rescaling data to a common scale so values can be compared fairly. Two common approaches are scaling to a fixed range such as 0 to 1 (min-max scaling) and scaling to a mean of 0 and a standard deviation of 1 (z-score scaling).
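
For classes that are ready to code, the vocabulary above can be illustrated with a minimal Python sketch. The heights are made-up values, and median imputation is only one of several reasonable choices; the same steps can be done with spreadsheet formulas.

```python
from statistics import mean, median, pstdev

# Hypothetical heights in cm; None marks a missing value.
heights = [150.0, 160.0, None, 170.0, 180.0]

# Median imputation: fill gaps with the median of the known values.
known = [h for h in heights if h is not None]
fill_value = median(known)
imputed = [fill_value if h is None else h for h in heights]

# Scaling to the range 0 to 1: smallest value maps to 0, largest to 1.
lo, hi = min(imputed), max(imputed)
min_max = [(h - lo) / (hi - lo) for h in imputed]

# Scaling to mean 0 and standard deviation 1.
mu, sigma = mean(imputed), pstdev(imputed)
z_scores = [(h - mu) / sigma for h in imputed]

print(imputed)
print(min_max)
print(z_scores)
```

A useful discussion point is that the imputed value is a guess: it keeps the row usable, but it is not a real measurement.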

Watch Out for These Misconceptions

Common Misconception: The goal of a chart is to look 'cool'.

What to Teach Instead

The goal is clarity. Through peer feedback, students learn that excessive 'chart junk' (like 3D effects or unnecessary colors) often obscures the data rather than helping the viewer understand it.

Common Misconception: Graphs always tell the truth.

What to Teach Instead

Graphs are interpretations. By creating their own 'misleading' graphs in a controlled activity, students become much more skeptical and critical consumers of data in the real world.

Active Learning Ideas

Gallery Walk

The Good, The Bad, and The Misleading

Display various data visualizations around the room, including some that are intentionally misleading (e.g., truncated y-axes). Students move in groups to identify what each graph is trying to say and any 'tricks' used to distort the data.

40 min·Small Groups

Inquiry Circle

Data Makeover

Give groups a boring table of raw data and a poorly chosen chart. They must work together to create three different visualizations of that same data, explaining which one is most effective for a specific target audience, such as a local council.

50 min·Small Groups

Think-Pair-Share

Outlier Detective

Show a scatter plot with several clear outliers. In pairs, students discuss what might have caused these outliers (data error vs. interesting phenomenon) and whether they should be included or removed from the final analysis.

20 min·Pairs
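
As an optional extension to Outlier Detective, students could compare their visual judgement with a simple rule. The sketch below uses the common 1.5 × IQR rule, which is one convention among several; the temperature values are hypothetical.

```python
from statistics import quantiles

def flag_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily maximum temperatures in degrees Celsius.
temps = [21, 22, 23, 22, 24, 23, 21, 95, 22, 23]
suspects = flag_outliers(temps)
print(suspects)
```

The rule only flags candidates. Deciding whether a flagged value is a data error or an interesting phenomenon is still the students' job.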

Real-World Connections

Data scientists at financial institutions like the Commonwealth Bank of Australia use data cleaning techniques daily to ensure the accuracy of fraud detection algorithms and customer spending analyses.
Market researchers at Nielsen analyze consumer survey data, meticulously cleaning responses to remove duplicates or illogical entries before reporting on product trends and advertising effectiveness.
Epidemiologists working for the World Health Organization (WHO) clean vast datasets of disease outbreaks, identifying and correcting errors in patient records or geographical locations to accurately track disease spread and inform public health interventions.
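
To make the survey-cleaning example concrete, here is a small sketch of removing duplicates and illogical entries. The rows, field order, and age limits are all assumptions made for illustration, not a description of how any of these organisations work.

```python
# Hypothetical survey rows: (respondent_id, age, favourite_brand).
responses = [
    (101, 15, "A"),
    (102, 14, "B"),
    (101, 15, "A"),   # exact duplicate of the first row
    (103, -3, "A"),   # illogical age
    (104, 16, "C"),
]

seen = set()
cleaned = []
for row in responses:
    respondent_id, age, brand = row
    if row in seen:
        continue              # skip exact duplicates
    if not 0 < age < 120:     # assumed limits for a believable age
        continue              # skip ages that cannot be real
    seen.add(row)
    cleaned.append(row)

print(cleaned)
```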

Assessment Ideas

Quick Check

Provide students with a small, intentionally flawed dataset (e.g., a table of student heights with some missing entries and unrealistic values). Ask them to identify at least three specific data quality issues and propose one method to address each.
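
One way to prepare or check a Quick Check dataset is sketched below. The names, heights, and the plausible range of 120 to 210 cm are hypothetical and should be adjusted to suit the class.

```python
# Hypothetical class data: heights in cm, with deliberate flaws.
records = [
    {"name": "Ava", "height_cm": 158},
    {"name": "Ben", "height_cm": None},    # missing entry
    {"name": "Chloe", "height_cm": 1.62},  # probably entered in metres
    {"name": "Dev", "height_cm": 1650},    # probably an extra digit
    {"name": "Eli", "height_cm": 171},
]

# Assumed plausible range for student heights.
MIN_CM, MAX_CM = 120, 210

def find_issues(rows):
    """Return (name, problem) pairs for missing or implausible heights."""
    issues = []
    for row in rows:
        height = row["height_cm"]
        if height is None:
            issues.append((row["name"], "missing"))
        elif not MIN_CM <= height <= MAX_CM:
            issues.append((row["name"], "out of range"))
    return issues

issues = find_issues(records)
for name, problem in issues:
    print(name, problem)
```

The script finds the problems but does not fix them. Students still need to propose a method for each one, such as converting metres to centimetres or checking the original record.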

Discussion Prompt

Present students with two versions of an analysis report: one based on raw, 'dirty' data and another based on cleaned data. Facilitate a class discussion using these questions: 'What differences do you observe in the conclusions drawn from each report?', 'How did the data cleaning process influence the final results?'
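
If a ready-made pair of reports is not available, a tiny example can seed the discussion. The sketch below uses invented screen-time figures and one possible cleaning rule to show how a single bad entry shifts the mean while barely affecting the median.

```python
from statistics import mean, median

# Hypothetical survey: hours of screen time per day.
# 240 looks like minutes typed into an hours column.
dirty = [2, 3, 4, 3, 240, 2, 4]

# One possible cleaning rule: a day has 24 hours, so drop larger values.
clean = [h for h in dirty if h <= 24]

print("dirty:", round(mean(dirty), 2), median(dirty))
print("clean:", round(mean(clean), 2), median(clean))
```

Students could also debate whether dropping the value is the best choice, or whether converting 240 minutes to 4 hours would preserve more information.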

Exit Ticket

On an exit ticket, ask students to define 'outlier' in their own words and describe one scenario where an outlier might be intentionally kept rather than removed. Also, ask them to list one common method for handling missing data.

Frequently Asked Questions

What makes a data visualization 'effective' for Year 9 students?

An effective visualization is one that accurately represents the data while being easy for the intended audience to interpret. It should have clear labels, an appropriate scale, and use the right type of chart (e.g., a line graph for trends over time).

How does data visualization connect to other subjects?

Data visualization is a cross-curricular skill, heavily linked to Mathematics (statistics) and Science (interpreting experimental results). In Digital Technologies, the focus is on using digital tools to create interactive and dynamic representations.

Which tools should Year 9 students use for visualization?

Students can start with advanced features in Excel or Google Sheets, but can also move to more specialized tools like Tableau Public, Canva for infographics, or even coding libraries like Matplotlib if they are advanced in Python.

How can active learning help students understand data visualization?

Active learning encourages students to justify their design choices. When a student has to explain to a peer why they chose a pie chart over a bar graph, they are forced to think deeply about the data's structure and the message they want to convey.

Edited by Adriana Perusin, Editor-in-Chief, Flip Education