Training Data and Model Evaluation
Understanding the importance of data quality, feature engineering, and metrics for model performance.
About This Topic
A machine learning model is only as good as the data it was trained on and the care taken in evaluating its performance. Training data quality, feature selection, and evaluation metrics are the scaffolding behind every ML application, and understanding them is what separates students who can use ML tools from those who understand what the tools are actually doing.
Key concepts in this topic include the train/test split (holding out data to evaluate generalization), overfitting (memorizing training data rather than learning general patterns), underfitting (a model too simple to capture the true patterns), and standard evaluation metrics like accuracy, precision, recall, and F1 score. Students who understand these concepts can read an AI performance report critically, which is an increasingly important civic skill.
Active learning is valuable here because these concepts involve statistical reasoning that benefits from concrete examples and peer discussion. Letting students observe overfitting firsthand, by training a model on fewer and fewer examples and watching test accuracy diverge from training accuracy, anchors the concept in direct experience. Structured critiques of published AI claims help students apply their understanding beyond the classroom.
Key Questions
- What role does training data play in developing a machine learning model, and why does its quality matter?
- How do metrics such as accuracy, precision, and recall measure the performance of an AI model?
- What can go wrong when a model overfits or underfits its training data?
Learning Objectives
- Explain the critical role of training data quality in the development of reliable machine learning models.
- Analyze and compare common metrics such as accuracy, precision, and recall for evaluating AI model performance.
- Critique the consequences of overfitting and underfitting on a model's ability to generalize to new data.
- Design a simple experiment to demonstrate the impact of data quantity on model performance.
Before You Start
- Students should already understand what machine learning models are and how they learn from data (see Machine Learning Fundamentals); this foundation is needed before evaluating model performance.
- Students should be comfortable reasoning about data distributions and patterns, which is essential for understanding feature engineering and the impact of data quality.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Training Data | The dataset used to teach a machine learning model patterns and relationships. Its quality directly impacts the model's effectiveness. |
| Feature Engineering | The process of selecting, transforming, and creating features from raw data to improve model performance and accuracy. |
| Accuracy | A metric that measures the proportion of correct predictions made by a model out of the total number of predictions. |
| Precision | A metric that measures the proportion of true positive predictions among all positive predictions made by the model. It answers, 'Of all the times the model predicted X, how often was it correct?' |
| Recall | A metric that measures the proportion of true positive predictions among all actual positive instances. It answers, 'Of all the actual X cases, how many did the model correctly identify?' |
| Overfitting | A phenomenon where a machine learning model learns the training data too well, including its noise and outliers, leading to poor performance on unseen data. |
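To make the metric definitions concrete, here is a minimal Python sketch that computes accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix; the counts are hypothetical:

```python
# A minimal sketch: computing evaluation metrics from hypothetical
# confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 20, 930  # true/false positives, false/true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct predictions / all predictions
precision = tp / (tp + fp)                  # of predicted positives, how many were right
recall = tp / (tp + fn)                     # of actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note how the same model can score 0.97 on accuracy while recall is only about 0.67; which number matters depends on the application.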
Watch Out for These Misconceptions
Common Misconception: Accuracy is the best metric for evaluating any classification model.
What to Teach Instead
Accuracy measures the percentage of correct predictions overall, but it's misleading for imbalanced datasets. A model that predicts the majority class for every input can achieve high accuracy while being useless. Precision (what fraction of positive predictions were correct) and recall (what fraction of actual positives were caught) provide a fuller picture, especially for high-stakes applications like medical diagnosis or fraud detection.
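A quick Python sketch makes the trap vivid; the population size and 1% prevalence are illustrative (and match the Think-Pair-Share scenario below):

```python
# A minimal sketch of the imbalanced-data trap: a "model" that always
# predicts healthy on a hypothetical population where 1% have the disease.
population = 10_000
sick = population // 100            # 1% prevalence -> 100 actual positives

tp = 0                              # the model never predicts "sick"
fn = sick                           # so it misses every actual case
tn = population - sick              # and is "right" about everyone healthy

accuracy = (tp + tn) / population   # 9900 / 10000 = 0.99
recall = tp / (tp + fn)             # 0 / 100 = 0.00

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

A 99% accurate diagnostic tool that catches zero patients is the clearest argument for looking past accuracy on imbalanced data.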
Common Misconception: Overfitting happens because the model is too smart.
What to Teach Instead
Overfitting happens when a model is too complex relative to the amount of training data: it learns the specific examples it saw, including their noise, rather than the underlying pattern. It's not intelligence; it's memorization. The fix is typically more data, simpler models, or regularization techniques that penalize overly complex solutions.
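A short classroom demo of the complexity side of this, assuming scikit-learn is available; the synthetic dataset, noise level, and depth limit are illustrative choices, not prescriptions:

```python
# A minimal sketch: an unconstrained decision tree memorizes a small noisy
# dataset, while a depth-limited (i.e., simpler) tree generalizes better.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.15,
                           random_state=0)  # flip_y injects label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=0)

for name, model in [("unbounded tree", DecisionTreeClassifier(random_state=0)),
                    ("depth-3 tree", DecisionTreeClassifier(max_depth=3,
                                                            random_state=0))]:
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.2f} "
          f"test={model.score(X_test, y_test):.2f}")
```

The unbounded tree typically hits near-perfect training accuracy while its test accuracy lags, which is exactly the train/test gap students should learn to spot.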
Common Misconception: More features always improve model performance.
What to Teach Instead
Adding irrelevant or redundant features can hurt model performance by introducing noise and making it harder to identify the features that actually matter. Feature selection and engineering (choosing and transforming inputs thoughtfully) often improve performance more than adding raw data columns. The goal is informative features, not more features.
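One way to demonstrate this, again assuming scikit-learn is available, is to append random noise columns to an informative feature set and compare cross-validated accuracy; every parameter below is illustrative:

```python
# A minimal sketch: cross-validated accuracy typically drops when 100
# irrelevant noise columns are stacked onto 5 informative features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(200, 100))])  # 100 irrelevant columns

for name, data in [("5 informative features", X),
                   ("+100 noise features", X_noisy)]:
    score = cross_val_score(LogisticRegression(max_iter=1000), data, y,
                            cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.2f}")
```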
Active Learning Ideas
Think-Pair-Share: Accuracy Isn't Everything
Present a scenario: a disease affects 1% of the population, and a diagnostic AI claims 99% accuracy by always predicting 'healthy.' Ask partners to explain why this is misleading and what metric would be better. After sharing, introduce precision and recall as tools for understanding model behavior on imbalanced datasets.
Overfitting Experiment
Students train a simple model (using a provided notebook) on progressively smaller subsets of training data while testing on the same fixed test set. They plot training vs. test accuracy as sample size decreases and observe overfitting emerge. Pairs write a paragraph describing what they observed and predicting what would happen with even less data.
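A minimal starter for such a notebook might look like the following, with scikit-learn and matplotlib assumed; the dataset and model choices are illustrative:

```python
# A minimal sketch of the overfitting experiment: train on progressively
# smaller subsets, evaluate on the same fixed test set, plot the gap.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=500,
                                                  random_state=0)

sizes = [1500, 750, 300, 100, 40]   # progressively smaller training sets
train_acc, test_acc = [], []
for n in sizes:
    model = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
    train_acc.append(model.score(X_pool[:n], y_pool[:n]))
    test_acc.append(model.score(X_test, y_test))  # same fixed test set

plt.plot(sizes, train_acc, marker="o", label="train accuracy")
plt.plot(sizes, test_acc, marker="o", label="test accuracy")
plt.gca().invert_xaxis()            # read left-to-right as "less data"
plt.xlabel("training examples"); plt.ylabel("accuracy"); plt.legend()
plt.show()
```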
Gallery Walk: Critique the AI Claim
Post five printed AI headlines or marketing claims ('Our model achieves 98% accuracy!', 'AI outperforms doctors in diagnosis'). Student groups annotate each with questions they'd need answered before accepting the claim: What is the test set? Is accuracy the right metric? What population was tested? How were edge cases handled? Class discusses which claims hold up to scrutiny.
Feature Engineering Challenge
Give teams a raw dataset (e.g., raw text strings, timestamps) and ask them to engineer three new features they think would help a model predict a given outcome. Teams present their features and justify why they might be predictive. The class votes on which features would most improve the model, then tests the predictions using a provided script.
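If the raw data includes timestamps, a starting point might look like this pandas sketch; the column name and values are hypothetical:

```python
# A minimal sketch: deriving candidate features from raw timestamps.
import pandas as pd

df = pd.DataFrame({"timestamp": ["2024-01-05 08:15:00",
                                 "2024-01-06 23:40:00",
                                 "2024-01-08 12:05:00"]})
ts = pd.to_datetime(df["timestamp"])

df["hour"] = ts.dt.hour                    # captures time-of-day effects
df["day_of_week"] = ts.dt.dayofweek        # 0 = Monday ... 6 = Sunday
df["is_weekend"] = df["day_of_week"] >= 5  # simple weekend flag
print(df)
```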
Real-World Connections
- Medical diagnostic AI systems rely heavily on high-quality, diverse training data to accurately identify diseases from scans. Errors in training data can lead to misdiagnoses, impacting patient care at hospitals like the Mayo Clinic.
- Financial fraud detection models are continuously trained on transaction data. If the training data is biased or incomplete, the model may fail to identify new types of fraudulent activity, costing companies like Visa or Mastercard significant losses.
Assessment Ideas
Provide students with a scenario where an AI model for recommending movies performed poorly. Ask them to identify two potential issues with the training data or model evaluation and suggest one specific step to address each issue.
Present students with a confusion matrix for a binary classification model. Ask them to calculate and explain the precision and recall for the positive class, identifying which metric might be more important given a specific application (e.g., spam detection).
Facilitate a class discussion using the prompt: 'Imagine you are building a facial recognition system. What are the ethical implications of using a training dataset that is not representative of all demographic groups, and how might this lead to biased performance or poor generalization for underrepresented groups?'
Frequently Asked Questions
What is overfitting in machine learning and why does it matter?
Overfitting occurs when a model learns its training data too closely, including noise and outliers, so it performs well on examples it has seen but poorly on new data. It matters because a model's entire purpose is to generalize beyond its training set.
What is the difference between precision and recall?
Precision asks, of all the times the model predicted X, how often it was correct; recall asks, of all the actual X cases, how many the model caught. The two often trade off, which is why the F1 score combines them.
How does active learning help students understand model evaluation metrics?
Evaluation metrics involve statistical reasoning that benefits from concrete examples and peer discussion. Activities in which students watch training and test accuracy diverge, or critique a misleading accuracy claim, anchor the concepts in direct experience.
What does feature engineering mean in machine learning?
Feature engineering is the process of selecting, transforming, and creating features from raw data to improve model performance. The goal is informative features, not simply more features.
More in Artificial Intelligence and Ethics
Introduction to Artificial Intelligence
Students will define AI, explore its history, and differentiate between strong and weak AI.
Machine Learning Fundamentals
Introduction to how computers learn from data through supervised and unsupervised learning.
Supervised Learning: Classification and Regression
Exploring algorithms that learn from labeled data to make predictions.
Unsupervised Learning: Clustering
Discovering patterns and structures in unlabeled data using algorithms like K-Means.
AI Applications: Image and Speech Recognition
Exploring how AI is used in practical applications like recognizing images and understanding speech.
Algorithmic Bias and Fairness
Investigating how human prejudices can be encoded into automated decision-making tools.