Computer Science · 12th Grade · Data Science and Intelligent Systems · Weeks 19-27

Evaluating Machine Learning Models

Students learn various metrics and techniques for evaluating the performance and robustness of machine learning models.

Common Core State Standards · CSTA: 3B-AP-09 · CSTA: 3B-DA-06

About This Topic

Building a machine learning model is only half the work; understanding whether it actually solves the problem is equally important. In US 12th-grade CS, students learn that accuracy alone is an incomplete measure of model performance, particularly when classes are imbalanced. A model that labels every patient as 'disease-free' in a dataset where 95% are healthy achieves 95% accuracy while being completely useless for detecting the disease it was built to find.

Students work with the confusion matrix as a foundational tool: the four cells (true positives, false positives, true negatives, false negatives) give rise to precision (of positive predictions, how many were correct?), recall (of actual positives, how many did the model catch?), and F1-score (their harmonic mean). The choice of metric is not purely technical; it reflects which type of error is more costly in the specific application.
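
As a concrete reference, here is a minimal Python sketch that computes these metrics from hypothetical confusion-matrix counts (the numbers are illustrative, not taken from any particular model):

    # Hypothetical confusion-matrix counts for a binary classifier (illustrative only)
    tp, fp, tn, fn = 40, 10, 930, 20

    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many were correct?
    recall    = tp / (tp + fn)   # of actual positives, how many did the model catch?
    f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

    print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
          f"recall={recall:.2f}  f1={f1:.2f}")

With these counts, accuracy is about 0.97 even though recall is only about 0.67, previewing why accuracy alone can flatter a model.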

Two failure modes also receive attention: overfitting, where a model performs well on training data but poorly on new examples because it memorized rather than generalized, and underfitting, where the model is too simple to capture the pattern. Active learning through deliberate model-breaking exercises, such as testing classifiers on adversarial or out-of-distribution examples, helps students develop evaluative intuition that is hard to build from metrics alone.

Key Questions

  1. How do we measure the success or failure of an intelligent system using appropriate metrics?
  2. What distinguishes overfitting from underfitting in machine learning models?
  3. How should the problem context guide the selection of specific evaluation metrics?

Learning Objectives

  • Calculate and interpret precision, recall, and F1-score from a given confusion matrix for a binary classification problem.
  • Differentiate between overfitting and underfitting by comparing model performance on training and validation datasets.
  • Justify the selection of an appropriate evaluation metric (e.g., precision, recall) for a given machine learning application scenario.
  • Critique the performance of a machine learning model by analyzing its performance across various evaluation metrics and identifying potential biases or failure modes.

Before You Start

Introduction to Machine Learning Concepts

Why: Students need a basic understanding of what a machine learning model does and the concept of training data before evaluating its performance.

Data Classification

Why: Understanding the fundamental task of assigning data points to predefined categories is essential for interpreting classification metrics.

Key Vocabulary

Confusion Matrix: A table that summarizes the performance of a classification model, showing true positives, true negatives, false positives, and false negatives.
Precision: The proportion of true positive predictions among all positive predictions made by the model; it answers, 'Of all the instances predicted as positive, how many were actually positive?'
Recall (Sensitivity): The proportion of actual positive instances that were correctly identified by the model; it answers, 'Of all the actual positive instances, how many did the model find?'
F1-Score: The harmonic mean of precision and recall, providing a single score that balances both metrics; useful when class distribution is uneven.
Overfitting: A phenomenon where a machine learning model learns the training data too well, including noise and outliers, leading to poor generalization on unseen data.
Underfitting: A phenomenon where a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and unseen data.

Watch Out for These Misconceptions

Common Misconception: High accuracy means a machine learning model is good.

What to Teach Instead

Accuracy is misleading when classes are imbalanced. A model that always predicts the majority class can achieve high accuracy without learning anything useful. Students who calculate accuracy on an imbalanced dataset and then compute precision and recall discover immediately that these metrics tell a very different story.
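
A minimal sketch of this effect, using scikit-learn's DummyClassifier as a stand-in for a model that always predicts the majority class (the 95/5 split and all numbers are illustrative assumptions):

    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # Synthetic labels: roughly 95% healthy (0), 5% diseased (1)
    rng = np.random.default_rng(0)
    y = (rng.random(1000) < 0.05).astype(int)
    X = rng.random((1000, 3))  # features are irrelevant to this baseline

    # A "model" that always predicts the majority class (healthy)
    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    pred = baseline.predict(X)

    print("accuracy :", accuracy_score(y, pred))                    # roughly 0.95
    print("precision:", precision_score(y, pred, zero_division=0))  # 0.0 -- no positives predicted
    print("recall   :", recall_score(y, pred, zero_division=0))     # 0.0 -- every actual case missed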

Common Misconception: Overfitting is only a problem for very large, complex models.

What to Teach Instead

Overfitting can occur even with relatively simple models when training data is limited or when the model is tuned extensively using the test set. The true guard against overfitting is held-out evaluation data that the model never sees during training or tuning. Learning curve labs let students observe overfitting empirically across model sizes.
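
One way to show this empirically is to train an unconstrained decision tree (a conceptually simple model) on a small training set and compare training accuracy against held-out accuracy; the gap is the signature of overfitting. A minimal scikit-learn sketch, with the dataset and split sizes chosen only for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Small, noisy synthetic dataset that is easy for a deep tree to memorize
    X, y = make_classification(n_samples=200, n_features=20, flip_y=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit

    print("train accuracy:", tree.score(X_train, y_train))  # typically near 1.0 (memorized)
    print("test accuracy :", tree.score(X_test, y_test))    # noticeably lower (overfit)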

Common Misconception: Once a model performs well on the test set, it is ready for deployment.

What to Teach Instead

Test set performance estimates generalization to data from the same distribution as the training data. Deployed models often encounter distribution shift, inputs that look different from training data due to population changes, seasonal effects, or adversarial users. Evaluation in deployment requires ongoing monitoring, not a one-time test set result.
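
To make distribution shift tangible, the same trained model can be scored on an in-distribution test set and on a crudely "drifted" copy of that data; the drop in score is the point, not the exact numbers. A hedged sketch on synthetic data (the noise level is an arbitrary assumption standing in for real-world drift):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Simulated distribution shift: same labels, but corrupted/drifted inputs
    rng = np.random.default_rng(1)
    X_shifted = X_test + rng.normal(0, 2.0, X_test.shape)

    print("in-distribution test accuracy:", model.score(X_test, y_test))
    print("after simulated shift        :", model.score(X_shifted, y_test))  # usually much lower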

Active Learning Ideas


Think-Pair-Share: Which Metric Matters Here?

Present three scenarios: a spam filter, a cancer screening test, and a fraud detection system. For each, pairs must decide whether precision or recall is the higher-priority metric and justify the choice by describing the real-world cost of each type of error. The debrief shows that metric selection is a domain judgment, not a mathematical one.

15 min·Pairs

Collaborative Problem-Solving: Breaking a Classifier

Students train a simple text classifier on a balanced dataset, record its metrics, then feed it deliberately tricky inputs: unusual phrasings, out-of-domain examples, and adversarial examples designed to flip predictions. They document which inputs the model gets wrong and hypothesize why it fails on each. This activity reframes evaluation as an adversarial process, not just a number-reporting exercise; a starter code sketch follows below.

40 min·Pairs
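
A starting-point sketch for this activity, using a tiny handcrafted spam/not-spam dataset with scikit-learn; the training sentences and the 'tricky' probes are invented for illustration, and a real lab would use a larger balanced dataset:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny balanced training set (illustrative only)
    texts = [
        "win a free prize now", "claim your cash reward today",
        "limited offer click here", "free entry in our contest",
        "meeting moved to 3pm", "can you review my draft",
        "lunch tomorrow with the team", "notes from today's class",
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = not spam

    model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)

    # Deliberately tricky probes: unusual phrasing, out-of-domain, adversarial wording
    probes = [
        "congratulations, you have been selected",          # spam-like intent, unseen vocabulary
        "the prize committee confirmed your travel grant",  # legitimate email that reuses 'prize'
        "fr33 c4sh n0w",                                     # obfuscated spelling evades learned tokens
    ]
    for p in probes:
        print(p, "->", "spam" if model.predict([p])[0] == 1 else "not spam")

Students then record which predictions look wrong and reason about why the bag-of-words features fail on each probe.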

Gallery Walk: Confusion Matrix Interpretation

Post four confusion matrices from different real-world models (medical diagnosis, spam filter, image classifier, loan approval). For each, groups calculate precision and recall, identify which type of error is more frequent, and write a recommendation about whether the model is ready to deploy given the stated use case. Groups compare recommendations during debrief.

25 min·Small Groups

Simulation Activity: Learning Curve Analysis

Students train the same model on increasing amounts of data (10%, 25%, 50%, 75%, and 100% of the training set) and plot training accuracy versus test accuracy at each point. They identify where the model overfits (training and test accuracy diverge) and discuss what strategies (more data, regularization, a simpler model) would help in each case; a minimal code sketch follows below.

35 min·Pairs
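
A minimal sketch of the lab, printing training versus test accuracy at each data fraction (swap the print loop for a plot if a charting library is available); the model, dataset, and depth limit are assumptions chosen for illustration:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.05, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # A large train/test gap at small fractions signals overfitting; it usually narrows as data grows
    for frac in [0.10, 0.25, 0.50, 0.75, 1.00]:
        n = int(len(X_train) * frac)
        model = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train[:n], y_train[:n])
        print(f"{int(frac * 100):3d}% of data: "
              f"train={model.score(X_train[:n], y_train[:n]):.2f}  "
              f"test={model.score(X_test, y_test):.2f}")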

Real-World Connections

  • In medical diagnostics, a model predicting disease presence must prioritize high recall to ensure few actual cases are missed, even if it means more false positives requiring further testing. Doctors at Mayo Clinic use such models to flag potential conditions for review.
  • Financial institutions like Visa use machine learning to detect fraudulent transactions. They often prioritize precision to minimize the number of legitimate transactions incorrectly flagged as fraud, which can frustrate customers.

Assessment Ideas

Exit Ticket

Present students with a scenario: 'A spam detection model has 90% precision and 70% recall. Explain what each of these numbers means in the context of identifying spam emails. Which metric might be more important if the cost of a false positive (a legitimate email sent to the spam folder) is high?'

Quick Check

Provide students with a small, pre-calculated confusion matrix for a binary classifier. Ask them to calculate precision, recall, and the F1-score. Then, ask them to identify whether the model is likely overfitting or underfitting based on hypothetical training vs. validation scores (e.g., Training Accuracy: 98%, Validation Accuracy: 65%).

Discussion Prompt

Facilitate a class discussion using the prompt: 'Imagine you are building a model to identify endangered species from camera trap images. Discuss which evaluation metric (precision, recall, or F1-score) would be most critical and why. Consider the consequences of both false positives and false negatives in this specific context.'

Frequently Asked Questions

What is a confusion matrix and what does it tell you about a classifier?
A confusion matrix is a table showing how many predictions fell into each of four categories: true positives (correctly predicted positive), false positives (predicted positive but actually negative), true negatives (correctly predicted negative), and false negatives (predicted negative but actually positive). It provides a complete picture of where the model succeeds and fails for each class, which raw accuracy does not.
What is the difference between precision and recall in machine learning?
Precision measures how accurate the model's positive predictions are: of all the items the model flagged as positive, what fraction actually were? Recall measures coverage: of all the truly positive items, what fraction did the model find? These metrics trade off against each other: a model can increase recall by predicting positive more aggressively, but this typically lowers precision.
What is the difference between overfitting and underfitting?
Overfitting occurs when a model performs well on training data but poorly on new data, because it memorized specific training examples rather than learning generalizable patterns. Underfitting occurs when the model is too simple to capture the underlying patterns, performing poorly on both training and test data. The ideal model learns enough to generalize without memorizing.
How does active learning help students understand machine learning evaluation?
Deliberately breaking classifiers by feeding them adversarial or edge-case inputs and watching them fail builds evaluative intuition that metric calculations alone cannot provide. When students choose which metric matters for a specific problem domain by reasoning through the cost of different error types, they experience evaluation as a judgment process. This active reasoning is what develops real-world model assessment skills.