Introduction to Data Science Workflow
Students learn the end-to-end process of data science, from data acquisition and cleaning to analysis and communication of results.
About This Topic
Big Data and pattern recognition involve the processing of datasets so massive that traditional methods fail. In 12th grade, students learn to use statistical tools and computational models to find hidden trends in fields like medicine, economics, and social media. This topic covers the 'Four Vs' of Big Data: Volume, Velocity, Variety, and Veracity. Students move from simple data entry to analyzing real-world datasets, such as climate records or public health statistics, to make evidence-based predictions.
A major focus of this unit is identifying bias. Students examine how historical data can reinforce existing inequalities if not handled carefully. This aligns with CSTA standards for using data to highlight relationships and for evaluating how bias in data collection affects the results of a model. This topic particularly benefits from hands-on, student-centered approaches where students can debate the ethics of data usage and collaboratively visualize complex information.
Key Questions
- Explain the iterative nature of the data science workflow and its key stages.
- Analyze the importance of data cleaning and preprocessing in ensuring reliable insights.
- Design a basic data science project plan for a given real-world problem.
Learning Objectives
- Describe the stages of the data science workflow in order (data acquisition, cleaning, analysis, and communication) and explain why teams often loop back to earlier stages.
- Evaluate the impact of data quality issues, such as missing values and outliers, on the reliability of analytical results.
- Design a project plan for a data science initiative, identifying key steps, potential challenges, and necessary resources for a given scenario.
- Critique the ethical implications of data collection and usage in a specific real-world context.
- Synthesize findings from a data analysis into a clear and concise report or presentation suitable for a non-technical audience.
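To make the data-quality objective concrete, the sketch below shows how a single outlier and a missing entry can distort a summary statistic. The readings and the plausibility range are invented for illustration; they are assumptions, not values from any real dataset.

```python
from statistics import mean, median

# Hypothetical daily temperature readings in degrees Celsius.
# None marks a missing entry; 210.0 is a likely typing error for 21.0.
readings = [20.5, 21.0, None, 19.5, 210.0, 22.0, 20.0]

# Step 1: set aside missing values so the statistics functions can run.
present = [r for r in readings if r is not None]

# Step 2: keep only values inside an assumed plausible range.
PLAUSIBLE_RANGE = (-50.0, 60.0)
cleaned = [r for r in present if PLAUSIBLE_RANGE[0] <= r <= PLAUSIBLE_RANGE[1]]

# Compare the summaries before and after cleaning.
print("mean with outlier:    ", round(mean(present), 2))
print("mean after cleaning:  ", round(mean(cleaned), 2))
print("median with outlier:  ", median(present))
print("median after cleaning:", median(cleaned))
```

Students can compare how far the mean moves against how little the median moves, then discuss whether deleting the outlier or correcting it to 21.0 is the better choice.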
Before You Start
Why: Students need foundational programming skills to manipulate data and implement algorithms used in data science.
Why: Understanding concepts like mean, median, mode, and basic probability is essential for data analysis and interpretation.
Why: Students should be familiar with different ways to represent data (tables, charts) to effectively explore and communicate findings.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Data Acquisition | The process of gathering raw data from various sources, such as databases, APIs, or surveys, for analysis. |
| Data Cleaning | The process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in datasets to improve data quality. |
| Exploratory Data Analysis (EDA) | An approach to analyzing datasets to summarize their main characteristics, often with visual methods, to uncover patterns and identify anomalies. |
| Feature Engineering | The process of using domain knowledge to create new input variables (features) from existing raw data to improve the performance of machine learning models. |
| Model Deployment | The process of making a trained machine learning model available for use in a production environment to make predictions on new data. |
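As one possible illustration of feature engineering, the sketch below derives two new features from raw climate-style records. The records, field names, and the `add_features` helper are hypothetical and exist only for this example.

```python
from datetime import date

# Hypothetical raw records (invented for illustration).
raw_records = [
    {"date": "2024-07-01", "high_c": 31.0, "low_c": 19.0},
    {"date": "2024-12-15", "high_c": 4.0, "low_c": -3.0},
]

def add_features(record):
    """Return a copy of the record with two derived features."""
    day = date.fromisoformat(record["date"])
    return {
        **record,
        "range_c": record["high_c"] - record["low_c"],  # daily temperature swing
        "month": day.month,                             # lets a model see seasonality
    }

engineered = [add_features(r) for r in raw_records]
for row in engineered:
    print(row)
```

Neither new column adds information that was not already in the raw data; the point is that domain knowledge (seasons matter, daily swings matter) shapes which inputs a model gets to see.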
Watch Out for These Misconceptions
Common Misconception: More data always leads to more accurate results.
What to Teach Instead
Explain that if the data is low quality or biased, collecting more of it does not remove the error; it only makes the wrong answer look more 'certain.' Use a peer discussion about 'garbage in, garbage out' to show why the quality of data collection matters more than sheer volume.
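A short simulation can support this discussion. The sketch below is a minimal illustration under assumed numbers: a true value of 50, random noise, and a flawed instrument that reads 5 units too high. None of these values come from a real study.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 50.0  # the quantity being estimated (assumed)
NOISE_SD = 10.0    # spread of random measurement noise (assumed)
BIAS = 5.0         # systematic error of the flawed instrument (assumed)

def measure(n, bias=0.0):
    """Simulate n noisy measurements of TRUE_VALUE with an optional bias."""
    return [random.gauss(TRUE_VALUE + bias, NOISE_SD) for _ in range(n)]

# A small sample from a sound instrument vs. a huge sample from a biased one.
small_unbiased = mean(measure(100))
large_biased = mean(measure(10_000, bias=BIAS))

print(f"100 unbiased measurements:  {small_unbiased:.2f}")
print(f"10,000 biased measurements: {large_biased:.2f}")
```

The large sample settles very tightly around the wrong value, which is the sense in which more biased data makes an error look more certain.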
Common Misconception: Data is neutral and objective.
What to Teach Instead
Clarify that data is collected by humans who make choices about what to measure and how to categorize it. A hands-on activity where students try to categorize 'ambiguous' items will show how human judgment is baked into every dataset.
Active Learning Ideas
Inquiry Circle: Bias in the Data
Provide groups with a dataset used for a fictional 'college admissions AI' that contains historical biases (e.g., favoring certain zip codes). Students must find the patterns that lead to unfair outcomes and propose a way to 'clean' or adjust the data to ensure equity.
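One way groups might start the investigation is by comparing outcomes across groups. The sketch below assumes a tiny, invented admissions table in which both zip-code groups have identical GPAs; the field names and the `admission_rates` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical historical admissions records (invented for illustration).
# Both zip-code groups contain the same four GPAs.
applications = [
    {"zip_group": "A", "gpa": 3.6, "admitted": True},
    {"zip_group": "A", "gpa": 3.5, "admitted": True},
    {"zip_group": "A", "gpa": 3.4, "admitted": True},
    {"zip_group": "A", "gpa": 3.2, "admitted": False},
    {"zip_group": "B", "gpa": 3.6, "admitted": False},
    {"zip_group": "B", "gpa": 3.5, "admitted": True},
    {"zip_group": "B", "gpa": 3.4, "admitted": False},
    {"zip_group": "B", "gpa": 3.2, "admitted": False},
]

def admission_rates(records, group_key):
    """Return the fraction of admitted applicants for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [admitted, total]
    for record in records:
        counts[record[group_key]][0] += int(record["admitted"])
        counts[record[group_key]][1] += 1
    return {group: admitted / total for group, (admitted, total) in counts.items()}

rates = admission_rates(applications, "zip_group")
for group, rate in sorted(rates.items()):
    print(f"zip group {group}: {rate:.0%} admitted")
```

Because the GPAs match across groups, the gap in admission rates cannot be explained by grades, which gives students a concrete pattern to question before proposing a fix.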
Gallery Walk: Data Visualizations
Students take a raw dataset and create a visualization (chart, map, or infographic) that tells a specific story. They display their work around the room, and peers use a 'See-Think-Wonder' protocol to evaluate what the data is saying and what might be missing.
Think-Pair-Share: Correlation vs. Causation
Present students with 'spurious correlations' (e.g., ice cream sales and shark attacks). Students work in pairs to explain why these two things are correlated but not causal, and then share their own examples of how Big Data might lead to false conclusions if not interpreted correctly.
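The ice cream and shark attack example can be worked through numerically. The monthly figures below are invented for illustration, and the `pearson_r` helper is a plain implementation of the Pearson correlation coefficient written for this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical monthly data (invented for illustration).
temperature_c = [14, 17, 21, 25, 29, 31, 27]
ice_cream_sales = [120, 150, 200, 260, 310, 330, 290]
shark_attacks = [1, 2, 3, 5, 6, 7, 5]

r_sales_attacks = pearson_r(ice_cream_sales, shark_attacks)
r_temp_sales = pearson_r(temperature_c, ice_cream_sales)
r_temp_attacks = pearson_r(temperature_c, shark_attacks)

print(f"ice cream vs shark attacks:   r = {r_sales_attacks:.2f}")
print(f"temperature vs ice cream:     r = {r_temp_sales:.2f}")
print(f"temperature vs shark attacks: r = {r_temp_attacks:.2f}")
```

All three correlations come out strong, which lets pairs argue that warm weather is a plausible common cause of both trends rather than one causing the other.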
Real-World Connections
- Data scientists at Netflix analyze viewing habits and user ratings to recommend movies and shows, influencing content creation and platform development.
- Epidemiologists use data science workflows to track disease outbreaks, clean public health records, and analyze transmission patterns to inform public health interventions, as seen during global health crises.
- Financial analysts at investment firms employ data science techniques to clean historical market data, identify trading patterns, and build predictive models for stock market forecasting.
Assessment Ideas
Present students with a short, messy dataset (e.g., a CSV with inconsistent formatting, missing entries). Ask them to identify at least three specific cleaning steps needed and explain why each step is important for accurate analysis.
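A possible answer key for this task can be expressed in code. The sketch below uses only the Python standard library; the CSV contents and the `clean_rows` helper are hypothetical, and the cleaning rules (trim whitespace, standardize capitalization, set aside unusable ages, drop duplicates) are one reasonable set of choices, not the only one.

```python
import csv
import io

# Hypothetical messy survey export (invented for illustration).
RAW_CSV = """name,city,age
 Ana ,boston,17
Ben,BOSTON,
ana,Boston,17
Cara, Chicago ,eighteen
Dev,chicago,18
"""

def clean_rows(raw_text):
    """Return (cleaned rows, rows set aside for manual review)."""
    cleaned, needs_review, seen = [], [], set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        # Step 1: trim stray spaces and standardize capitalization.
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        # Step 2: set aside rows whose age is missing or not a number.
        if not age_text.isdigit():
            needs_review.append({"name": name, "city": city, "age": age_text or None})
            continue
        # Step 3: skip exact duplicates revealed by the standardization.
        key = (name, city, int(age_text))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": int(age_text)})
    return cleaned, needs_review

cleaned, needs_review = clean_rows(RAW_CSV)
print("cleaned:", cleaned)
print("needs review:", needs_review)
```

Setting problem rows aside instead of deleting or guessing at them keeps the decision visible, which gives students something to justify in their explanation.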
Pose the scenario: 'A city wants to use data from traffic cameras to optimize traffic light timing.' Ask students to discuss: What types of data would be acquired? What are potential ethical concerns regarding privacy? How would they communicate their findings to city officials?
On an index card, have students list the four main stages of the data science workflow in order. For each stage, ask them to write one sentence describing a key activity or challenge associated with it.
More in Data Science and Intelligent Systems
Big Data Concepts and Pattern Recognition
Students analyze massive datasets to find hidden trends, using statistical libraries to process and visualize complex information sets.
Data Visualization and Interpretation
Students learn to create effective data visualizations to communicate insights and identify patterns in complex datasets.
Fundamentals of Machine Learning: Supervised Learning
Students are introduced to supervised learning, exploring concepts like regression and classification and how models learn from labeled data.
Fundamentals of Machine Learning: Unsupervised Learning
Students explore unsupervised learning techniques like clustering and dimensionality reduction to find hidden structures in unlabeled data.
Neural Networks and Deep Learning (Conceptual)
Students conceptually explore how neural networks are structured, how they learn from experience, and the basics of deep learning.
Evaluating Machine Learning Models
Students learn various metrics and techniques for evaluating the performance and robustness of machine learning models.