Evaluating Machine Learning Models
Students learn various metrics and techniques for evaluating the performance and robustness of machine learning models.
About This Topic
Building a machine learning model is only half the work; understanding whether it actually solves the problem is equally important. In US 12th-grade CS, students learn that accuracy alone is an incomplete measure of model performance, particularly when classes are imbalanced. A model that labels every patient as 'disease-free' in a dataset where 95% are healthy achieves 95% accuracy while being completely useless for detecting the disease it was built to find.
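The 95%-accuracy trap can be verified in a few lines. This is a minimal sketch with invented counts: a degenerate "classifier" that predicts 'healthy' for everyone still scores 95% accuracy while catching zero cases.

```python
# A degenerate "model" that predicts 'healthy' (0) for every patient,
# evaluated on a dataset where 95% of patients are actually healthy.
labels = [0] * 95 + [1] * 5        # 1 = has the disease
predictions = [0] * 100            # the model never predicts the disease

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
recall = tp / (tp + fn)            # fraction of actual cases the model found

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.95 recall=0.00
```

The recall of 0 exposes what the 95% accuracy hides: the model finds none of the patients it was built to find.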
Students work with the confusion matrix as a foundational tool: the four cells (true positives, false positives, true negatives, false negatives) give rise to precision (of positive predictions, how many were correct?), recall (of actual positives, how many did the model catch?), and F1-score (their harmonic mean). The choice of metric is not merely technical; it reflects which type of error is more costly in the specific application.
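As a minimal sketch, the four cells map to the metrics like this (the counts below are invented for illustration):

```python
# Cells of a binary confusion matrix (illustrative counts)
tp, fp, fn, tn = 40, 10, 20, 930

precision = tp / (tp + fp)   # of positive predictions, how many were correct?
recall = tp / (tp + fn)      # of actual positives, how many did the model catch?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Note how the 0.97 accuracy on this imbalanced example sits well above both precision (0.80) and recall (0.67), previewing why accuracy alone misleads.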
Two failure modes also receive attention: overfitting, where a model performs well on training data but poorly on new examples because it memorized rather than generalized, and underfitting, where the model is too simple to capture the pattern. Active learning through deliberate model-breaking exercises, such as testing classifiers on adversarial or out-of-distribution examples, helps students develop evaluative intuition that is hard to build from metrics alone.
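The "memorized rather than generalized" failure can be made concrete with a toy model that is nothing but a lookup table of its training examples. Everything here (the data, the labeling rule, the default prediction) is invented for illustration:

```python
import random

random.seed(1)

def sample(n):
    # Toy data: 8 binary features; the true label is 1 when most bits are set
    data = []
    for _ in range(n):
        x = tuple(random.randint(0, 1) for _ in range(8))
        data.append((x, int(sum(x) >= 4)))
    return data

train, val = sample(50), sample(500)

# "Model" = pure memorization: store every training example verbatim;
# any input it has never seen gets a fixed default prediction of 0.
table = {x: y for x, y in train}
predict = lambda x: table.get(x, 0)

def accuracy(data):
    return sum(predict(x) == y for x, y in data) / len(data)

train_acc, val_acc = accuracy(train), accuracy(val)
print(f"train={train_acc:.2f} val={val_acc:.2f}")  # perfect on train, poor on new data
```

The lookup table is overfitting taken to its extreme: training accuracy is perfect, but validation accuracy collapses because memorization carries no pattern over to unseen inputs.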
Key Questions
- How do we measure the success or failure of an intelligent system using appropriate metrics?
- How can we tell whether a model is overfitting or underfitting?
- Why should the choice of evaluation metric depend on the problem context?
Learning Objectives
- Calculate and interpret precision, recall, and F1-score from a given confusion matrix for a binary classification problem.
- Differentiate between overfitting and underfitting by comparing model performance on training and validation datasets.
- Justify the selection of an appropriate evaluation metric (e.g., precision, recall) for a given machine learning application scenario.
- Critique the performance of a machine learning model by analyzing its performance across various evaluation metrics and identifying potential biases or failure modes.
Before You Start
- What a machine learning model does, and the concept of training data. Why: Students need this background before evaluating a model's performance.
- The classification task. Why: Understanding the fundamental task of assigning data points to predefined categories is essential for interpreting classification metrics.
Key Vocabulary
| Term | Definition |
|---|---|
| Confusion Matrix | A table that summarizes the performance of a classification model, showing true positives, true negatives, false positives, and false negatives. |
| Precision | The proportion of true positive predictions among all positive predictions made by the model; it answers, 'Of all the instances predicted as positive, how many were actually positive?' |
| Recall (Sensitivity) | The proportion of actual positive instances that were correctly identified by the model; it answers, 'Of all the actual positive instances, how many did the model find?' |
| F1-Score | The harmonic mean of precision and recall, providing a single score that balances both metrics, useful when class distribution is uneven. |
| Overfitting | A phenomenon where a machine learning model learns the training data too well, including noise and outliers, leading to poor generalization on unseen data. |
| Underfitting | A phenomenon where a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and unseen data. |
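In symbols, the table's definitions correspond to the standard formulas, where TP, FP, and FN denote true positives, false positives, and false negatives:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```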
Watch Out for These Misconceptions
Common Misconception: High accuracy means a machine learning model is good.
What to Teach Instead
Accuracy is misleading when classes are imbalanced. A model that always predicts the majority class can achieve high accuracy without learning anything useful. Students who calculate accuracy on an imbalanced dataset and then compute precision and recall discover immediately that these metrics tell a very different story.
Common Misconception: Overfitting is only a problem for very large, complex models.
What to Teach Instead
Overfitting can occur even with relatively simple models when training data is limited or when the model is tuned extensively using the test set. The true guard against overfitting is held-out evaluation data that the model never sees during training or tuning. Learning curve labs let students observe overfitting empirically across model sizes.
Common Misconception: Once a model performs well on the test set, it is ready for deployment.
What to Teach Instead
Test set performance estimates generalization to data from the same distribution as the training data. Deployed models often encounter distribution shift: inputs that look different from the training data due to population changes, seasonal effects, or adversarial users. Evaluation in deployment requires ongoing monitoring, not a one-time test set result.
Active Learning Ideas
Think-Pair-Share: Which Metric Matters Here?
Present three scenarios: a spam filter, a cancer screening test, and a fraud detection system. For each, pairs must decide whether precision or recall is the higher-priority metric and justify the choice by describing the real-world cost of each type of error. The debrief shows that metric selection is a domain judgment, not a mathematical one.
Collaborative Problem-Solving: Breaking a Classifier
Students train a simple text classifier on a balanced dataset, record its metrics, then feed it deliberately tricky inputs: unusual phrasings, out-of-domain examples, and adversarial examples designed to flip predictions. They document which inputs fail and hypothesize why the model fails on each. This activity reframes evaluation as an adversarial process, not just a number-reporting exercise.
Gallery Walk: Confusion Matrix Interpretation
Post four confusion matrices from different real-world models (medical diagnosis, spam filter, image classifier, loan approval). For each, groups calculate precision and recall, identify which type of error is more frequent, and write a recommendation about whether the model is ready to deploy given the stated use case. Groups compare recommendations during debrief.
Simulation Activity: Learning Curve Analysis
Students train the same model on increasing amounts of data (10%, 25%, 50%, 75%, and 100% of the training set) and plot training accuracy versus test accuracy at each point. They identify where the model overfits (training and test accuracy diverge) and discuss which strategies (more data, regularization, a simpler model) would help close the gap.
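A minimal pure-Python sketch of this lab, assuming a 1-nearest-neighbor classifier on synthetic noisy 1-D data; both choices are illustrative, not prescribed by the activity. Because 1-NN memorizes its training set, training accuracy stays at 100% while test accuracy lags behind, so the gap between the curves is easy to spot.

```python
import random

random.seed(0)

def sample(n):
    # Synthetic 1-D data: the true class is 1 when x > 0.5, with 20% label noise
    data = []
    for _ in range(n):
        x = random.random()
        noisy = random.random() < 0.2
        data.append((x, int((x > 0.5) != noisy)))
    return data

def nn_predict(train, x):
    # 1-nearest-neighbor: copy the label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    return sum(nn_predict(train, x) == y for x, y in data) / len(data)

train_full, test = sample(200), sample(200)
for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
    train = train_full[: int(len(train_full) * frac)]
    train_acc, test_acc = accuracy(train, train), accuracy(train, test)
    print(f"{frac:.0%} of data: train={train_acc:.2f} test={test_acc:.2f}")
```

The persistent train/test gap at every data fraction is the overfitting signal students are asked to identify on their plots.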
Real-World Connections
- In medical diagnostics, a model predicting disease presence must prioritize high recall to ensure few actual cases are missed, even if it means more false positives requiring further testing. Doctors at Mayo Clinic use such models to flag potential conditions for review.
- Financial institutions like Visa use machine learning to detect fraudulent transactions. They often prioritize precision to minimize the number of legitimate transactions incorrectly flagged as fraud, which can frustrate customers.
Assessment Ideas
Present students with a scenario: 'A spam detection model has 90% precision and 70% recall. Explain what each of these numbers means in the context of identifying spam emails. Which metric might be more important if the cost of a false positive (a legitimate email in spam) is high?'
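For reference, the F1-score implied by the stated numbers follows directly from the harmonic-mean formula:

```python
# Hypothetical metrics from the assessment scenario above
precision, recall = 0.90, 0.70
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # the harmonic mean sits closer to the lower of the two
```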
Provide students with a small, pre-calculated confusion matrix for a binary classifier. Ask them to calculate precision, recall, and the F1-score. Then, ask them to identify whether the model is likely overfitting or underfitting based on hypothetical training vs. validation scores (e.g., Training Accuracy: 98%, Validation Accuracy: 65%).
Facilitate a class discussion using the prompt: 'Imagine you are building a model to identify endangered species from camera trap images. Discuss which evaluation metric (precision, recall, or F1-score) would be most critical and why. Consider the consequences of both false positives and false negatives in this specific context.'
Frequently Asked Questions
What is a confusion matrix and what does it tell you about a classifier?
What is the difference between precision and recall in machine learning?
What is the difference between overfitting and underfitting?
How does active learning help students understand machine learning evaluation?
More in Data Science and Intelligent Systems
Introduction to Data Science Workflow
Students learn the end-to-end process of data science, from data acquisition and cleaning to analysis and communication of results.
Big Data Concepts and Pattern Recognition
Students analyze massive datasets to find hidden trends, using statistical libraries to process and visualize complex information sets.
Data Visualization and Interpretation
Students learn to create effective data visualizations to communicate insights and identify patterns in complex datasets.
Fundamentals of Machine Learning: Supervised Learning
Students are introduced to supervised learning, exploring concepts like regression and classification and how models learn from labeled data.
Fundamentals of Machine Learning: Unsupervised Learning
Students explore unsupervised learning techniques like clustering and dimensionality reduction to find hidden structures in unlabeled data.
Neural Networks and Deep Learning (Conceptual)
Students conceptually explore how neural networks are structured, how they learn from experience, and the basics of deep learning.