Computer Science · 9th Grade · Data Intelligence and Visualization · Weeks 28-36

Ethical Data Scraping and Privacy

Students will discuss the ethical considerations of scraping data from public websites and the privacy implications of doing so.

CSTA Standards: 3A-DA-11, 3A-IC-24

About This Topic

Statistical analysis and modeling allow students to find meaning in large amounts of data. In 9th grade, the focus is on using computational tools to identify correlations and build predictive models. This aligns with CSTA standards for using data analysis to support a claim. Students learn that while a model can predict a trend, it is rarely 100% certain.

A major theme in this topic is the distinction between correlation and causation. Students explore how two things can appear related without one causing the other. This critical thinking skill is essential for navigating a world filled with algorithmic predictions. Students grasp this concept faster through simulations where they test their models against new data and see where they succeed or fail.

Key Questions

  1. What ethical considerations arise when scraping data from public websites?
  2. Why does data privacy matter in the context of data collection?
  3. What negative impacts could result from unauthorized data collection?

Learning Objectives

  • Critique the ethical considerations and potential harms of scraping data from public websites.
  • Evaluate the importance of data privacy principles when collecting and using personal information.
  • Justify the legal and societal implications of unauthorized data collection.
  • Predict potential negative consequences of data breaches resulting from unethical scraping practices.

Before You Start

Introduction to Data and Information

Why: Students need a foundational understanding of what data is and how it is organized before discussing its collection and privacy.

Basic Internet and Web Concepts

Why: Understanding how websites are structured and accessed is necessary to comprehend data scraping.

Key Vocabulary

Data Scraping: The automated process of extracting large amounts of data from websites. This can be done for various purposes, both legitimate and unethical.
Personally Identifiable Information (PII): Any data that could potentially identify a specific individual. This includes names, addresses, email addresses, social security numbers, and more.
Data Privacy: The practice of protecting sensitive personal data from unauthorized access, use, disclosure, alteration, or destruction.
Terms of Service (ToS): A legal agreement between a service provider and a user that outlines the rules and restrictions for using a website or service.
Ethical Hacking: The practice of using hacking skills to identify vulnerabilities in systems with permission, often to improve security. This is distinct from malicious hacking or unauthorized scraping.
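To make the PII definition concrete, here is a minimal sketch of scanning scraped text for one kind of PII, email addresses, and redacting it. The regex, function name, and sample string are all illustrative assumptions; real PII detection (names, addresses, ID numbers) is far harder than a single pattern.

```python
import re

# Illustrative pattern for one kind of PII: email addresses.
# A simple regex like this is only a starting point, not a
# complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    """Replace anything that looks like an email with a placeholder."""
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)

sample = "Contact jane.doe@example.com for details."
print(redact_emails(sample))  # → Contact [REDACTED EMAIL] for details.
```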

Watch Out for These Misconceptions

Common Misconception: If two things are correlated, one must cause the other.

What to Teach Instead

Correlation just means they move together; it doesn't explain why. Using 'spurious correlation' examples helps students see that a third factor (like the weather) is often the real cause.
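A small simulation can show this in class: hot weather drives both ice cream sales and sunglasses sales, so the two correlate strongly even though neither causes the other. All numbers here are made up for illustration; the correlation is computed from its definition using only the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# The hidden third factor: daily temperature.
temps = [random.uniform(10, 35) for _ in range(200)]
# Both quantities depend on temperature (plus noise), not on each other.
ice_cream = [2 * t + random.gauss(0, 5) for t in temps]
sunglasses = [3 * t + random.gauss(0, 5) for t in temps]

r = pearson(ice_cream, sunglasses)
print(f"correlation between ice cream and sunglasses sales: {r:.2f}")
```

Students can then remove temperature from the simulation (generate both series as pure noise) and watch the correlation collapse toward zero.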

Common Misconception: A model that works on old data will always work on new data.

What to Teach Instead

Models can 'overfit' to specific data and fail when things change. Testing models against 'unseen' data in class simulations helps students understand the need for generalizable models.
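One way to sketch this in code: a "model" that memorizes its training points is perfect on old data but useless on new data, while a simple line fitted to the trend generalizes. The data and models below are illustrative assumptions, not any particular curriculum activity.

```python
import random

random.seed(0)
# Underlying trend: y ≈ 2x + 1, with noise.
train = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(10)]
test = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in range(10, 20)]

# Overfit model: memorize every training point exactly.
memorized = dict(train)
def overfit_model(x):
    return memorized.get(x, 0)  # knows nothing about unseen x

# Simple model: least-squares line through the training data.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
intercept = my - slope * mx
def simple_model(x):
    return slope * x + intercept

def mean_sq_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print("overfit model, train error:", mean_sq_error(overfit_model, train))  # exactly 0.0
print("overfit model, test error: ", mean_sq_error(overfit_model, test))   # very large
print("simple model,  test error: ", mean_sq_error(simple_model, test))    # small
```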


Real-World Connections

  • News organizations and researchers sometimes scrape public websites to gather data for investigative journalism or academic studies, such as analyzing public sentiment on social media platforms or tracking economic trends.
  • Companies like Google use web crawlers, a form of data scraping, to index public web pages for search engine results. However, they adhere to robots.txt protocols and respect website terms of service.
  • The Cambridge Analytica scandal involved the improper harvesting of personal data from millions of Facebook users, highlighting the severe privacy violations and ethical breaches that can occur with unauthorized data collection.
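The robots.txt convention mentioned above can be checked programmatically. Python's standard-library `urllib.robotparser` parses the file and answers whether a given path may be fetched; the rules below are hypothetical and parsed directly so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt for illustration.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://example.com/public/data.html"))   # True
print(parser.can_fetch("*", "https://example.com/private/users.html")) # False
```

Checking robots.txt before scraping is a good habit, though it is a courtesy convention, not a substitute for reading a site's terms of service.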

Assessment Ideas

Discussion Prompt

Pose the following scenario: 'A student wants to build a website that aggregates job postings from various company career pages. What ethical questions should they consider before they start scraping these sites? What are the potential privacy risks for job applicants?' Facilitate a class discussion around their responses.

Quick Check

Present students with two hypothetical scenarios: Scenario A involves scraping publicly available, non-personal data like weather patterns. Scenario B involves scraping user profiles from a social media site without explicit consent. Ask students to write one sentence explaining which scenario raises more significant privacy concerns and why.

Exit Ticket

Ask students to define 'Personally Identifiable Information (PII)' in their own words and list two examples. Then, have them write one sentence explaining why protecting PII is crucial when collecting data.

Frequently Asked Questions

What is a computational model?
A computational model is a program that uses mathematical rules and data to simulate a real-world process or predict what might happen in the future, like a weather forecast or a population growth model.
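The population-growth example in that answer can be written in a few lines. The starting size and growth rate below are made-up numbers, and real models would account for many more factors.

```python
def project_population(start, growth_rate, years):
    """Simple exponential-growth model: each year the population
    grows by a fixed percentage."""
    population = start
    history = [population]
    for _ in range(years):
        population *= 1 + growth_rate
        history.append(population)
    return history

# Hypothetical town of 10,000 people growing 2% per year for 5 years.
print([round(p) for p in project_population(10_000, 0.02, 5)])
# → [10000, 10200, 10404, 10612, 10824, 11041]
```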
What is the difference between correlation and causation?
Correlation means two things happen at the same time or follow the same pattern. Causation means one thing actually makes the other happen. For example, ice cream sales and sunburns are correlated because both rise in summer, but neither causes the other; rain and umbrella-carrying are correlated, and here the rain really does cause people to carry umbrellas.
How do we know if a model is 'good'?
A good model is accurate when tested against new data that it hasn't seen before. It should also be as simple as possible while still providing useful predictions.
How can active learning help students understand statistical modeling?
Active learning allows students to 'break' their models. When students build a prediction and then immediately see it fail on a new set of data, they learn the importance of testing and refinement. This iterative process is much more effective than just hearing about model accuracy in a lecture.