Computer Science · 11th Grade · Artificial Intelligence and Ethics · Weeks 19-27

Training Data and Model Evaluation

Understanding the importance of data quality, feature engineering, and metrics for model performance.

Standards: CSTA 3B-DA-07

About This Topic

A machine learning model is only as good as the data it was trained on and the care taken in evaluating its performance. Training data quality, feature selection, and evaluation metrics are the scaffolding behind every ML application, and understanding them is what separates students who can use ML tools from those who understand what the tools are actually doing.

Key concepts in this topic include the train/test split (holding out data to evaluate generalization), overfitting (memorizing training data rather than learning general patterns), underfitting (a model too simple to capture the true patterns), and standard evaluation metrics like accuracy, precision, recall, and F1 score. Students who understand these concepts can read an AI performance report critically, which is an increasingly important civic skill.
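The train/test gap is easy to demonstrate in a few lines. Here is a minimal sketch using scikit-learn on synthetic data; the dataset and model choices are illustrative assumptions, not part of any specific curriculum materials:

```python
# Minimal train/test split sketch: hold out data to estimate generalization.
# Dataset and model choices are illustrative, not prescribed by the lesson.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data: 500 examples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Training accuracy is usually optimistic; test accuracy estimates
# performance on unseen data.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
```

An unconstrained decision tree will typically score near 100% on the data it memorized while scoring noticeably lower on the held-out set, which is exactly the gap the train/test split exists to reveal.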

Active learning is valuable here because these concepts involve statistical reasoning that benefits from concrete examples and peer discussion. Activities that let students observe overfitting firsthand, by training a model on fewer and fewer examples and watching test accuracy diverge from training accuracy, create direct experience that anchors the concept. Structured critiques of published AI claims help students apply their understanding beyond the classroom.

Key Questions

  1. Why is training data so critical to machine learning model development?
  2. What metrics (e.g., accuracy, precision, recall) are used to evaluate AI model performance, and what does each one capture?
  3. What are the pitfalls of overfitting and underfitting in model training?

Learning Objectives

  • Explain the critical role of training data quality in the development of reliable machine learning models.
  • Analyze and compare common metrics such as accuracy, precision, and recall for evaluating AI model performance.
  • Critique the consequences of overfitting and underfitting on a model's ability to generalize to new data.
  • Design a simple experiment to demonstrate the impact of data quantity on model performance.

Before You Start

Introduction to Machine Learning Concepts

Why: Students need a foundational understanding of what machine learning models are and how they learn from data before evaluating their performance.

Basic Data Analysis and Visualization

Why: Understanding data distributions and patterns is essential for comprehending feature engineering and the impact of data quality.

Key Vocabulary

Training Data: The dataset used to teach a machine learning model patterns and relationships. Its quality directly impacts the model's effectiveness.
Feature Engineering: The process of selecting, transforming, and creating features from raw data to improve model performance and accuracy.
Accuracy: A metric that measures the proportion of correct predictions made by a model out of the total number of predictions.
Precision: A metric that measures the proportion of true positive predictions among all positive predictions made by the model. It answers, 'Of all the times the model predicted X, how often was it correct?'
Recall: A metric that measures the proportion of true positive predictions among all actual positive instances. It answers, 'Of all the actual X cases, how many did the model correctly identify?'
Overfitting: A phenomenon where a machine learning model learns the training data too well, including its noise and outliers, leading to poor performance on unseen data.
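Each of these metrics reduces to a simple ratio of confusion-matrix counts. A plain-Python sketch, with invented example counts:

```python
# Classification metrics as ratios of confusion-matrix counts:
#   tp = true positives, fp = false positives,
#   tn = true negatives, fn = false negatives.

def accuracy(tp, fp, tn, fn):
    # Fraction of all predictions that were correct.
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    # Of all positive predictions, how many were actually positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many did the model catch?
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts: 40 TP, 10 FP, 940 TN, 10 FN.
print(accuracy(40, 10, 940, 10))   # 0.98
print(precision(40, 10))           # 0.8
print(recall(40, 10))              # 0.8
print(f1_score(40, 10, 10))        # 0.8
```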

Watch Out for These Misconceptions

Common Misconception: Accuracy is the best metric for evaluating any classification model.

What to Teach Instead

Accuracy measures the percentage of correct predictions overall, but it's misleading for imbalanced datasets. A model that predicts the majority class for every input can achieve high accuracy while being useless. Precision (what fraction of positive predictions were correct) and recall (what fraction of actual positives were caught) provide a fuller picture, especially for high-stakes applications like medical diagnosis or fraud detection.
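The 1%-disease scenario used in the Think-Pair-Share activity below makes this concrete. A minimal sketch; the 1% prevalence is illustrative, not real patient data:

```python
# A 'classifier' that always predicts healthy (0) on a population where
# only ~1% have the disease (1). Numbers are illustrative, not real data.
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positives
y_pred = np.zeros_like(y_true)                    # always predict 'healthy'

accuracy = (y_pred == y_true).mean()
tp = ((y_pred == 1) & (y_true == 1)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.3f}")  # ~0.99 -- looks great
print(f"recall:   {recall:.3f}")    # 0.0   -- catches zero sick patients
```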

Common Misconception: Overfitting happens because the model is too smart.

What to Teach Instead

Overfitting happens when a model is too complex relative to the amount of training data: it learns the specific examples it saw, including their noise, rather than the underlying pattern. It's not intelligence; it's memorization. The fix is typically more data, a simpler model, or regularization techniques that penalize overly complex solutions.

Common Misconception: More features always improve model performance.

What to Teach Instead

Adding irrelevant or redundant features can hurt model performance by introducing noise and making it harder to identify the features that actually matter. Feature selection and engineering (choosing and transforming inputs thoughtfully) often improve performance more than adding raw data columns. The goal is informative features, not more features.
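One way to see this directly is to pad an informative dataset with noise columns and compare cross-validated accuracy. A sketch using scikit-learn; all dataset settings are assumptions chosen to make the effect visible:

```python
# Demo: irrelevant features can hurt performance.
# All dataset settings are assumptions chosen for the demo.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# 5 genuinely informative features...
X, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)

# ...versus the same data padded with 50 columns of pure noise.
X_noisy = np.hstack([X, rng.normal(size=(300, 50))])

model = KNeighborsClassifier()
print("informative only:", cross_val_score(model, X, y).mean())
print("with noise cols: ", cross_val_score(model, X_noisy, y).mean())
```

Distance-based models like k-nearest neighbors are especially sensitive to irrelevant dimensions, which makes the accuracy drop easy to observe in class.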

Active Learning Ideas


Think-Pair-Share: Accuracy Isn't Everything

Present a scenario: a disease affects 1% of the population, and a diagnostic AI claims 99% accuracy by always predicting 'healthy.' Ask partners to explain why this is misleading and what metric would be better. After sharing, introduce precision and recall as tools for understanding model behavior on imbalanced datasets.

20 min·Pairs

Overfitting Experiment

Students train a simple model (using a provided notebook) on progressively smaller subsets of training data while testing on the same fixed test set. They plot training vs. test accuracy as sample size decreases and observe overfitting emerge (a sketch of the experiment's core loop appears below). Pairs write a paragraph describing what they observed and predicting what would happen with even less data.

45 min·Pairs
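A sketch of what the notebook's core loop might look like; the actual provided notebook may use a different dataset and model:

```python
# Sketch of the experiment's core loop: train on shrinking subsets,
# evaluate on one fixed test set, and watch the accuracies diverge.
# Dataset and model are assumptions; the provided notebook may differ.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for n in [1000, 500, 200, 100, 50, 20]:
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train[:n], y_train[:n])
    print(f"n={n:4d}  train={model.score(X_train[:n], y_train[:n]):.2f}"
          f"  test={model.score(X_test, y_test):.2f}")
```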

Gallery Walk: Critique the AI Claim

Post five printed AI headlines or marketing claims ('Our model achieves 98% accuracy!', 'AI outperforms doctors in diagnosis'). Student groups annotate each with questions they'd need answered before accepting the claim: What is the test set? Is accuracy the right metric? What population was tested? How were edge cases handled? Class discusses which claims hold up to scrutiny.

30 min·Small Groups

Feature Engineering Challenge

Give teams a raw dataset (e.g., raw text strings, timestamps) and ask them to engineer three new features they think would help a model predict a given outcome. Teams present their features and justify why they might be predictive. The class votes on which features they think would most improve the model, then tests those predictions using a provided script (a sketch of possible transformations follows below).

35 min·Small Groups
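As a reference for the kind of transformations teams might propose, here is a sketch using pandas; the column names and data are hypothetical examples for the activity:

```python
# Sketch: engineering features from raw timestamps and text.
# Column names and data are hypothetical examples for the activity.
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2024-03-04 08:15", "2024-03-09 22:40", "2024-03-11 13:05"],
    "message":   ["running late", "WHERE IS MY ORDER???", "thanks!"],
})

ts = pd.to_datetime(df["timestamp"])
df["hour"] = ts.dt.hour                    # time of day
df["day_of_week"] = ts.dt.dayofweek        # 0 = Monday
df["is_weekend"] = ts.dt.dayofweek >= 5    # Saturday/Sunday flag

# Simple text-derived features.
df["msg_length"] = df["message"].str.len()
df["all_caps"] = df["message"].str.isupper()

print(df)
```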

Real-World Connections

  • Medical diagnostic AI systems rely heavily on high-quality, diverse training data to accurately identify diseases from scans. Errors in training data can lead to misdiagnoses, impacting patient care at hospitals like the Mayo Clinic.
  • Financial fraud detection models are continuously trained on transaction data. If the training data is biased or incomplete, the model may fail to identify new types of fraudulent activity, costing companies like Visa or Mastercard significant losses.

Assessment Ideas

Exit Ticket

Provide students with a scenario where an AI model for recommending movies performed poorly. Ask them to identify two potential issues with the training data or model evaluation and suggest one specific step to address each issue.

Quick Check

Present students with a confusion matrix for a binary classification model. Ask them to calculate and explain the precision and recall for the positive class, identifying which metric might be more important given a specific application (e.g., spam detection).
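For reference, a worked example with a hypothetical spam-filter confusion matrix; the counts are invented for illustration:

```python
# Worked example with a hypothetical spam-filter confusion matrix.
#                 predicted spam   predicted not spam
# actual spam           tp=90            fn=30
# actual not spam       fp=10            tn=870

tp, fn, fp, tn = 90, 30, 10, 870

precision = tp / (tp + fp)   # 90 / 100 = 0.90
recall    = tp / (tp + fn)   # 90 / 120 = 0.75

print(f"precision = {precision:.2f}")  # of flagged mail, 90% really is spam
print(f"recall    = {recall:.2f}")     # the filter catches 75% of all spam
```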

Discussion Prompt

Facilitate a class discussion using the prompt: 'Imagine you are building a facial recognition system. What are the ethical implications of using a training dataset that is not representative of all demographic groups, and how might this lead to underfitting or biased performance?'

Frequently Asked Questions

What is overfitting in machine learning and why does it matter?
Overfitting occurs when a model learns the training data so specifically, including its noise and random variations, that it performs poorly on new, unseen data. The model has memorized rather than generalized. It matters because the whole point of a machine learning model is to perform well on real-world inputs it hasn't seen before. A model that overfits will appear excellent during development but underperform in deployment.
What is the difference between precision and recall?
Precision measures how often the model is right when it predicts a positive: of all the times it said 'yes,' what fraction was actually correct? Recall measures how many of the actual positives the model caught: of all the real positives, what fraction did the model identify? Both matter, but which matters more depends on context. In fraud detection, high recall is critical (don't miss real fraud). In content recommendation, high precision matters (don't recommend junk).
How does active learning help students understand model evaluation metrics?
Evaluation metrics like precision and recall are counterintuitive until students encounter a concrete scenario where accuracy misleads. Think-Pair-Share activities using imbalanced class examples, where a model that always predicts the majority class gets 99% accuracy, create the problem before presenting the solution. Students who feel the inadequacy of accuracy first are far more motivated to understand what better metrics are actually measuring.
What does feature engineering mean in machine learning?
Feature engineering is the process of selecting, transforming, or creating input variables (features) to improve a model's ability to find relevant patterns. Raw data often isn't in a useful form: a timestamp might need to be converted into 'day of week' and 'hour,' and a raw text field might need to be converted into word counts. Good feature engineering requires domain knowledge about what information is actually predictive for the problem at hand.