Ethical Data Scraping and Privacy
Students will discuss the ethical considerations of scraping data from public websites and the privacy implications of doing so.
About This Topic
Statistical analysis and modeling allow students to find meaning in large amounts of data. In 9th grade, the focus is on using computational tools to identify correlations and build predictive models. This aligns with CSTA standards for using data analysis to support a claim. Students learn that while a model can predict a trend, it is rarely 100% certain.
A major theme in this topic is the distinction between correlation and causation. Students explore how two things can appear related without one causing the other. This critical thinking skill is essential for navigating a world filled with algorithmic predictions. Students grasp this concept faster through simulations where they test their models against new data and see where they succeed or fail.
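The correlation-versus-causation theme can be made concrete with a short classroom simulation. The sketch below (all numbers invented for illustration) generates two series driven by a hidden third factor, temperature, and shows that they correlate strongly even though neither causes the other:

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hidden third factor: daily temperature drives both quantities.
temps = [random.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [3 * t + random.gauss(0, 5) for t in temps]
shark_sightings = [0.5 * t + random.gauss(0, 2) for t in temps]

# Strong positive correlation, yet neither series causes the other.
r = pearson(ice_cream_sales, shark_sightings)
print(f"correlation = {r:.2f}")
```

Changing the seed or the noise levels lets students confirm that the correlation persists as long as the shared driver (temperature) does.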
Key Questions
- What ethical considerations arise when scraping data from public websites?
- Why does data privacy matter in the context of data collection?
- What negative impacts can unauthorized data collection have?
Learning Objectives
- Critique the ethical considerations and potential harms of scraping data from public websites.
- Evaluate the importance of data privacy principles when collecting and using personal information.
- Explain the legal and societal implications of unauthorized data collection.
- Predict potential negative consequences of data breaches resulting from unethical scraping practices.
Before You Start
- Students need a foundational understanding of what data is and how it is organized before discussing its collection and privacy.
- Understanding how websites are structured and accessed is necessary to comprehend data scraping.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Data Scraping | The automated process of extracting large amounts of data from websites. This can be done for various purposes, both legitimate and unethical. |
| Personally Identifiable Information (PII) | Any data that could potentially identify a specific individual. This includes names, addresses, email addresses, social security numbers, and more. |
| Data Privacy | The practice of protecting sensitive personal data from unauthorized access, use, disclosure, alteration, or destruction. |
| Terms of Service (ToS) | A legal agreement between a service provider and a user that outlines the rules and restrictions for using a website or service. |
| Ethical Hacking | The practice of using hacking skills to identify vulnerabilities in systems with permission, often to improve security. This is distinct from malicious hacking or unauthorized scraping. |
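To make the PII definition concrete, here is a minimal redaction sketch. The regex patterns and the `redact_pii` helper are illustrative assumptions, not a complete PII detector; real systems need far broader coverage:

```python
import re

# Illustrative patterns only -- real PII detection needs much more coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

scraped = "Contact Jane at jane.doe@example.com or 555-867-5309."
print(redact_pii(scraped))
# -> Contact Jane at [EMAIL] or [PHONE].
```

Students can extend the pattern list themselves, which quickly reveals how hard it is to catch every form PII can take.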
Watch Out for These Misconceptions
Common Misconception: If two things are correlated, one must cause the other.
What to Teach Instead: Correlation just means they move together; it doesn't explain why. Using 'spurious correlation' examples helps students see that a third factor (like the weather) is often the real cause.
Common Misconception: A model that works on old data will always work on new data.
What to Teach Instead: Models can 'overfit' to specific data and fail when things change. Testing models against 'unseen' data in class simulations helps students understand the need for generalizable models.
Active Learning Ideas
Simulation Game: The Mystery Predictor
Give students a 'training' dataset (e.g., shoe size vs. reading level in elementary students). They build a simple 'model' to predict one from the other, then test it against a 'hidden' dataset to see if their prediction holds up.
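A minimal sketch of how the model-building step of this game might look in code; all shoe sizes and reading levels below are invented for illustration:

```python
# 'Training' dataset handed to students: (shoe_size, reading_level).
# All values are invented for illustration.
training = [(1, 1.2), (2, 2.1), (3, 2.9), (4, 4.2), (5, 5.0)]

# 'Hidden' dataset the teacher keeps back for the reveal.
hidden = [(2, 1.8), (3, 3.3), (5, 4.7)]

# Students' simple 'model': the average reading level per unit of shoe size.
ratio = sum(level / size for size, level in training) / len(training)

def predict(shoe_size):
    return ratio * shoe_size

# The reveal: test the model against the hidden data.
for size, actual in hidden:
    print(f"shoe {size}: predicted {predict(size):.1f}, actual {actual}")
```

The gap between predicted and actual values on the hidden data is the discussion hook: the model captures a pattern, but its predictions carry real uncertainty.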
Formal Debate: Correlation vs. Causation
Present several 'spurious correlations' (e.g., ice cream sales and shark attacks). Groups must argue whether there is a causal link, a hidden third variable, or if it is just a coincidence.
Think-Pair-Share: Model Ethics
Students read a short case study about an algorithm used to predict which students might drop out of school. They discuss the benefits and the potential dangers of relying on such a model.
Real-World Connections
- News organizations and researchers sometimes scrape public websites to gather data for investigative journalism or academic studies, such as analyzing public sentiment on social media platforms or tracking economic trends.
- Companies like Google use web crawlers, a form of data scraping, to index public web pages for search engine results. Responsible crawlers, however, adhere to the robots.txt protocol and respect website terms of service.
- The Cambridge Analytica scandal involved the improper harvesting of personal data from millions of Facebook users, highlighting the severe privacy violations and ethical breaches that can occur with unauthorized data collection.
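The robots.txt convention mentioned above can be checked with Python's standard library. In this sketch the rules, user-agent name, and URLs are placeholders, not any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules directly from text (no network call needed here).
# These rules are an illustrative example, not any real site's policy.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyScraper", "https://example.com/jobs"))       # True
print(parser.can_fetch("MyScraper", "https://example.com/private/x"))  # False
```

A scraper that calls `can_fetch` before each request is following the site's stated wishes, though students should note that robots.txt is a convention, not a legal or ethical free pass.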
Assessment Ideas
Pose the following scenario: 'A student wants to build a website that aggregates job postings from various company career pages. What ethical questions should they consider before they start scraping these sites? What are the potential privacy risks for job applicants?' Facilitate a class discussion around their responses.
Present students with two hypothetical scenarios: Scenario A involves scraping publicly available, non-personal data like weather patterns. Scenario B involves scraping user profiles from a social media site without explicit consent. Ask students to write one sentence explaining which scenario raises more significant privacy concerns and why.
Ask students to define 'Personally Identifiable Information (PII)' in their own words and list two examples. Then, have them write one sentence explaining why protecting PII is crucial when collecting data.
Frequently Asked Questions
What is a computational model?
What is the difference between correlation and causation?
How do we know if a model is 'good'?
How can active learning help students understand statistical modeling?
More in Data Intelligence and Visualization
Data Collection Methods and Bias
Students will explore techniques for gathering data and analyze how bias in data collection can lead to inaccurate conclusions.
Data Cleaning and Preprocessing
Students will learn the necessity of cleaning data to ensure accuracy and handle missing or corrupted data.
Correlation vs. Causation
Students will analyze why correlation does not necessarily imply a causal relationship.
Identifying Trends in Data
Students will use computational tools to identify patterns and trends within datasets.
Evaluating Data-Driven Conclusions
Students will learn to critically evaluate conclusions drawn from data, considering limitations and potential biases.
Ethical Implications of Algorithmic Predictions
Students will discuss the dangers of over-relying on algorithmic predictions for social issues.