Computer Science · 12th Grade · Data Science and Intelligent Systems · Weeks 19-27

Introduction to Data Science Workflow

Students learn the end-to-end process of data science, from data acquisition and cleaning to analysis and communication of results.

Common Core State Standards: CSTA: 3B-DA-05 · CCSS.ELA-LITERACY.RST.11-12.7

About This Topic

Big Data and pattern recognition involve the processing of datasets so massive that traditional methods fail. In 12th grade, students learn to use statistical tools and computational models to find hidden trends in fields like medicine, economics, and social media. This topic covers the 'Four Vs' of Big Data: Volume, Velocity, Variety, and Veracity. Students move from simple data entry to analyzing real-world datasets, such as climate records or public health statistics, to make evidence-based predictions.

A major focus of this unit is identifying bias. Students examine how historical data can reinforce existing inequalities if not handled carefully. This aligns with CSTA standards for using data to highlight relationships and for evaluating how bias in data collection affects the results of a model. This topic particularly benefits from hands-on, student-centered approaches where students can debate the ethics of data usage and collaboratively visualize complex information.

Key Questions

  1. Explain the iterative nature of the data science workflow and its key stages.
  2. Analyze the importance of data cleaning and preprocessing in ensuring reliable insights.
  3. Design a basic data science project plan for a given real-world problem.

Learning Objectives

  • Describe the stages of the data science workflow (data acquisition, cleaning, analysis, and communication) and explain why analysts often revisit earlier stages.
  • Evaluate the impact of data quality issues, such as missing values and outliers, on the reliability of analytical results.
  • Design a project plan for a data science initiative, identifying key steps, potential challenges, and necessary resources for a given scenario.
  • Critique the ethical implications of data collection and usage in a specific real-world context.
  • Synthesize findings from a data analysis into a clear and concise report or presentation suitable for a non-technical audience.
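To make the objective about data quality concrete, the sketch below shows how one missing value and one implausible entry change a summary statistic. The step counts and the 100,000-step plausibility cutoff are made-up assumptions for illustration, not values from any real dataset.

```python
from statistics import mean, median

# Hypothetical daily step counts; None marks a missing reading and
# 250000 is an implausible entry (perhaps a typo or a sensor glitch).
raw_steps = [8200, 7900, None, 8500, 250000, 8100, 7700]

# Step 1: drop missing values, since mean() cannot handle None.
present = [s for s in raw_steps if s is not None]

# Step 2: compute summaries with the outlier still included.
mean_with_outlier = mean(present)
median_with_outlier = median(present)

# Step 3: filter with a simple, assumed plausibility rule (under 100,000 steps).
plausible = [s for s in present if s < 100_000]
mean_cleaned = mean(plausible)

print(f"mean with outlier:   {mean_with_outlier:.0f}")    # 48400
print(f"median with outlier: {median_with_outlier:.0f}")  # 8150
print(f"mean after cleaning: {mean_cleaned:.0f}")         # 8080
```

The mean shifts by a factor of about six because of a single bad entry, while the median barely moves. Students can discuss which summary they would report and whether removing the outlier is justified.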

Before You Start

Introduction to Programming Concepts

Why: Students need foundational programming skills to manipulate data and implement algorithms used in data science.

Basic Statistical Concepts

Why: Understanding concepts like mean, median, mode, and basic probability is essential for data analysis and interpretation.

Data Representation and Visualization

Why: Students should be familiar with different ways to represent data (tables, charts) to effectively explore and communicate findings.

Key Vocabulary

Data Acquisition: The process of gathering raw data from various sources, such as databases, APIs, or surveys, for analysis.
Data Cleaning: The process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in datasets to improve data quality.
Exploratory Data Analysis (EDA): An approach to analyzing datasets to summarize their main characteristics, often with visual methods, to uncover patterns and identify anomalies.
Feature Engineering: The process of using domain knowledge to create new input variables (features) from existing raw data to improve the performance of machine learning models.
Model Deployment: The process of making a trained machine learning model available for use in a production environment to make predictions on new data.
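
Several of these terms can be seen together in a few lines of code. The sketch below is one possible illustration using only Python's standard library; the CSV contents, the column names, and the choice of Celsius as an engineered feature are all assumptions made for the example. Model deployment is not shown.

```python
import csv
import io
from statistics import mean

# Data acquisition: a real project might read a file or call an API;
# here a small made-up CSV stands in for the source.
RAW_CSV = """city,temp_f,humidity
Austin,95,40
Boston,72,55
Chicago,,60
Denver,88,20
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# Data cleaning: drop rows with a missing temperature and convert text to numbers.
clean = [
    {"city": r["city"], "temp_f": float(r["temp_f"]), "humidity": float(r["humidity"])}
    for r in rows
    if r["temp_f"].strip() != ""
]

# Exploratory data analysis: summarize one main characteristic of the data.
avg_temp_f = mean(r["temp_f"] for r in clean)

# Feature engineering: derive a new variable (Celsius) from an existing one.
for r in clean:
    r["temp_c"] = round((r["temp_f"] - 32) * 5 / 9, 1)

print(f"rows kept: {len(clean)} of {len(rows)}")
print(f"average temperature: {avg_temp_f:.1f} F")
```

Dropping the Chicago row is only one way to handle the missing value. Students could compare it with filling in an estimate and discuss how each choice affects the average.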

Watch Out for These Misconceptions

Common Misconception: More data always leads to more accurate results.

What to Teach Instead

Explain that when data is low quality or biased, collecting more of it does not correct the error; it only makes the wrong answer look more certain. Use a peer discussion about 'garbage in, garbage out' to show that the quality of data collection matters more than sheer volume.
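
A short simulation can support this discussion. The sketch below assumes a made-up population of students, half sleeping about 6 hours and half about 8, and a survey that only reaches the second group. The group sizes, averages, and sample sizes are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: half sleep about 6 hours, half about 8 hours.
population = [rng.gauss(6, 0.5) for _ in range(5000)] + \
             [rng.gauss(8, 0.5) for _ in range(5000)]
true_mean = mean(population)

# Biased collection: the survey only reaches the second group
# (for example, it was handed out in a single elective class).
reachable = population[5000:]

# Take larger and larger samples from the biased source.
estimates = {}
for n in (10, 100, 1000):
    estimates[n] = mean(rng.sample(reachable, n))
    print(f"n={n:>4}: estimate={estimates[n]:.2f}, true mean={true_mean:.2f}")
```

As the sample size grows, the estimates settle near 8 hours rather than the population average of about 7. Larger samples reduce random noise, but they do not remove the bias introduced by who was surveyed.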

Common Misconception: Data is neutral and objective.

What to Teach Instead

Clarify that data is collected by humans who make choices about what to measure and how to categorize it. A hands-on activity where students try to categorize 'ambiguous' items will show how human judgment is baked into every dataset.

Real-World Connections

  • Data scientists at Netflix analyze viewing habits and user ratings to recommend movies and shows, influencing content creation and platform development.
  • Epidemiologists use data science workflows to track disease outbreaks, clean public health records, and analyze transmission patterns to inform public health interventions, as seen during global health crises.
  • Financial analysts at investment firms employ data science techniques to clean historical market data, identify trading patterns, and build predictive models for stock market forecasting.

Assessment Ideas

Quick Check

Present students with a short, messy dataset (e.g., a CSV with inconsistent formatting, missing entries). Ask them to identify at least three specific cleaning steps needed and explain why each step is important for accurate analysis.
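
For teachers who want a ready-made example, the sketch below builds a small messy dataset and applies four cleaning steps to it. The data, the `clean_rows` helper, and the `GRADE_WORDS` mapping are hypothetical and written only for this illustration, using Python's standard library.

```python
import csv
import io

# A made-up messy dataset: stray spaces, inconsistent capitalization,
# a missing score, a grade written as a word, and a duplicate row.
MESSY_CSV = """name,grade,score
 Ana ,12,88
ben,12,
Ana,12,88
CARA,Twelve,91
"""

GRADE_WORDS = {"twelve": 12}  # assumed mapping for spelled-out grades


def clean_rows(text):
    """Return a list of (name, grade, score) tuples with basic cleaning applied."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        # Step 1: normalize whitespace and capitalization in names.
        name = row["name"].strip().title()
        # Step 2: convert grades written as words into integers.
        grade_text = row["grade"].strip().lower()
        grade = GRADE_WORDS[grade_text] if grade_text in GRADE_WORDS else int(grade_text)
        # Step 3: represent a missing score explicitly as None.
        score_text = row["score"].strip()
        score = int(score_text) if score_text else None
        # Step 4: skip rows that duplicate an earlier row after normalization.
        record = (name, grade, score)
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned


for record in clean_rows(MESSY_CSV):
    print(record)
```

Students can be asked to find the problems in `MESSY_CSV` before seeing the code, then compare their list of cleaning steps with the four steps in the comments.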

Discussion Prompt

Pose the scenario: 'A city wants to use data from traffic cameras to optimize traffic light timing.' Ask students to discuss: What types of data would be acquired? What are potential ethical concerns regarding privacy? How would they communicate their findings to city officials?

Exit Ticket

On an index card, have students list the four main stages of the data science workflow in order. For each stage, ask them to write one sentence describing a key activity or challenge associated with it.

Frequently Asked Questions

How can active learning help students understand Big Data?
Big Data can feel overwhelmingly abstract. Active learning strategies, like 'data physicalization' (using physical objects to represent data points) or collaborative 'bias hunting' in real datasets, make the concepts tangible. These activities allow students to move from being passive consumers of information to critical analysts who understand how data is shaped and manipulated.
What are the 'Four Vs' of Big Data?
They are Volume (the amount of data), Velocity (the speed at which it's generated), Variety (the different types of data), and Veracity (the accuracy and trustworthiness of the data).
Why is bias such a big concern in data science?
Because computers learn from the data we give them. If that data reflects past human prejudices, the computer will learn to repeat those prejudices, often making them look like 'objective' facts.
What skills do students need for data science?
Students need a mix of math (statistics), programming (to process the data), and critical thinking (to interpret what the results actually mean in a real-world context).