Computer Science · 9th Grade · Data Intelligence and Visualization · Weeks 28-36

Ethical Data Scraping and Privacy

Students will discuss the ethical considerations of scraping data from public websites and the privacy implications of doing so.

CSTA Standards: 3A-DA-11, 3A-IC-24

About This Topic

Statistical analysis and modeling allow students to find meaning in large amounts of data. In 9th grade, the focus is on using computational tools to identify correlations and build predictive models. This aligns with CSTA standards for using data analysis to support a claim. Students learn that while a model can predict a trend, it is rarely 100% certain.

A major theme in this topic is the distinction between correlation and causation. Students explore how two things can appear related without one causing the other. This critical thinking skill is essential for navigating a world filled with algorithmic predictions. Students grasp this concept faster through simulations where they test their models against new data and see where they succeed or fail.

Key Questions

  1. What ethical considerations arise when scraping data from public websites?
  2. Why does data privacy matter in the context of data collection?
  3. What negative impacts could result from unauthorized data collection?

Learning Objectives

  • Critique the ethical considerations and potential harms of scraping data from public websites.
  • Evaluate the importance of data privacy principles when collecting and using personal information.
  • Justify the legal and societal implications of unauthorized data collection.
  • Predict potential negative consequences of data breaches resulting from unethical scraping practices.

Before You Start

Introduction to Data and Information

Why: Students need a foundational understanding of what data is and how it is organized before discussing its collection and privacy.

Basic Internet and Web Concepts

Why: Understanding how websites are structured and accessed is necessary to comprehend data scraping.

Key Vocabulary

Data Scraping: The automated process of extracting large amounts of data from websites. This can be done for various purposes, both legitimate and unethical.
Personally Identifiable Information (PII): Any data that could potentially identify a specific individual. This includes names, addresses, email addresses, social security numbers, and more.
Data Privacy: The practice of protecting sensitive personal data from unauthorized access, use, disclosure, alteration, or destruction.
Terms of Service (ToS): A legal agreement between a service provider and a user that outlines the rules and restrictions for using a website or service.
Ethical Hacking: The practice of using hacking skills to identify vulnerabilities in systems with permission, often to improve security. This is distinct from malicious hacking or unauthorized scraping.
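To make the PII definition concrete, here is a minimal sketch of scanning scraped text for one kind of PII, email addresses, and redacting it. The regex, function name, and sample string are all illustrative assumptions; real PII detection (names, addresses, ID numbers) is far harder than a single pattern.

```python
import re

# Illustrative pattern for one kind of PII: email addresses.
# A simple regex like this is only a starting point, not a
# complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    """Replace anything that looks like an email with a placeholder."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)

sample = "Contact jane.doe@example.com for details."
print(redact_emails(sample))  # → Contact [REDACTED EMAIL] for details.
```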

Watch Out for These Misconceptions

Common Misconception: If two things are correlated, one must cause the other.

What to Teach Instead

Correlation just means they move together; it doesn't explain why. Using 'spurious correlation' examples helps students see that a third factor (like the weather) is often the real cause.
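A small simulation can show this in class: hot weather drives both ice cream sales and sunglasses sales, so the two correlate strongly even though neither causes the other. All numbers here are made up for illustration; the correlation is computed from its definition using only the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# The hidden third factor: daily temperature.
temps = [random.uniform(10, 35) for _ in range(200)]
# Both quantities depend on temperature (plus noise), not on each other.
ice_cream = [2 * t + random.gauss(0, 5) for t in temps]
sunglasses = [3 * t + random.gauss(0, 5) for t in temps]

r = pearson(ice_cream, sunglasses)
print(f"correlation between ice cream and sunglasses sales: {r:.2f}")
```

Students can then remove temperature from the simulation (generate both series as pure noise) and watch the correlation collapse toward zero.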

Common Misconception: A model that works on old data will always work on new data.

What to Teach Instead

Models can 'overfit' to specific data and fail when things change. Testing models against 'unseen' data in class simulations helps students understand the need for generalizable models.
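One way to sketch this in code: a "model" that memorizes its training points is perfect on old data but useless on new data, while a simple line fitted to the trend generalizes. The data and models below are illustrative assumptions, not any particular curriculum activity.

```python
import random

random.seed(0)
# Underlying trend: y ≈ 2x + 1, with noise.
train = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(10)]
test = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(10, 20)]

# Overfit model: memorize every training point exactly.
memorized = dict(train)
def overfit_model(x):
    return memorized.get(x, 0)  # knows nothing about unseen x

# Simple model: least-squares line through the training data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx
def simple_model(x):
    return slope * x + intercept

def mean_sq_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print("overfit model, train error:", mean_sq_error(overfit_model, train))  # exactly 0.0
print("overfit model, test error: ", mean_sq_error(overfit_model, test))   # very large
print("simple model,  test error: ", mean_sq_error(simple_model, test))    # small
```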


Real-World Connections

  • News organizations and researchers sometimes scrape public websites to gather data for investigative journalism or academic studies, such as analyzing public sentiment on social media platforms or tracking economic trends.
  • Companies like Google use web crawlers, a form of data scraping, to index public web pages for search engine results. However, they adhere to robots.txt protocols and respect website terms of service.
  • The Cambridge Analytica scandal involved the improper harvesting of personal data from millions of Facebook users, highlighting the severe privacy violations and ethical breaches that can occur with unauthorized data collection.
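The robots.txt convention mentioned above can be checked programmatically. Python's standard-library `urllib.robotparser` parses the file and answers whether a given path may be fetched; the rules below are hypothetical and parsed directly so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt for illustration.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://example.com/public/data.html"))   # True
print(parser.can_fetch("*", "https://example.com/private/users.html")) # False
```

Checking robots.txt before scraping is a good habit, though it is a courtesy convention, not a substitute for reading a site's terms of service.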

Assessment Ideas

Discussion Prompt

Pose the following scenario: 'A student wants to build a website that aggregates job postings from various company career pages. What ethical questions should they consider before they start scraping these sites? What are the potential privacy risks for job applicants?' Facilitate a class discussion around their responses.

Quick Check

Present students with two hypothetical scenarios: Scenario A involves scraping publicly available, non-personal data like weather patterns. Scenario B involves scraping user profiles from a social media site without explicit consent. Ask students to write one sentence explaining which scenario raises more significant privacy concerns and why.

Exit Ticket

Ask students to define 'Personally Identifiable Information (PII)' in their own words and list two examples. Then, have them write one sentence explaining why protecting PII is crucial when collecting data.

Frequently Asked Questions

What is a computational model?
A computational model is a program that uses mathematical rules and data to simulate a real-world process or predict what might happen in the future, like a weather forecast or a population growth model.
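The population-growth example in that answer can be written in a few lines. The starting size and growth rate below are made-up numbers, and real models would account for many more factors.

```python
def project_population(start, growth_rate, years):
    """Simple exponential-growth model: each year the population
    grows by a fixed percentage."""
    population = start
    history = [population]
    for _ in range(years):
        population *= 1 + growth_rate
        history.append(population)
    return history

# Hypothetical town of 10,000 people growing 2% per year for 5 years.
print([round(p) for p in project_population(10_000, 0.02, 5)])
# → [10000, 10200, 10404, 10612, 10824, 11041]
```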
What is the difference between correlation and causation?
Correlation means two things happen at the same time or follow the same pattern. Causation means one thing actually makes the other happen. For example, ice cream sales and sunburns are correlated because both rise in summer, but neither causes the other; rain and umbrella-carrying are correlated, and here the rain really does cause people to carry umbrellas.
How do we know if a model is 'good'?
A good model is accurate when tested against new data that it hasn't seen before. It should also be as simple as possible while still providing useful predictions.
How can active learning help students understand statistical modeling?
Active learning allows students to 'break' their models. When students build a prediction and then immediately see it fail on a new set of data, they learn the importance of testing and refinement. This iterative process is much more effective than just hearing about model accuracy in a lecture.