Computer Science · 12th Grade

Active learning ideas

Evaluating Machine Learning Models

Machine learning model evaluation is abstract until students confront real data where accuracy misleads. Active learning makes these concepts concrete by having students manipulate models, inspect outputs, and argue about trade-offs using the same tools professionals use.

Standards: CSTA 3B-AP-09 · CSTA 3B-DA-06
15–40 min · Pairs → Whole Class · 4 activities

Activity 01

Think-Pair-Share · 15 min · Pairs

Think-Pair-Share: Which Metric Matters Here?

Present three scenarios: a spam filter, a cancer screening test, and a fraud detection system. For each, pairs must decide whether precision or recall is the higher-priority metric and justify the choice by describing the real-world cost of each type of error. The debrief shows that metric selection is a domain judgment, not a mathematical one.

How do we measure the success or failure of an intelligent system using appropriate metrics?

Facilitation Tip: During Think-Pair-Share: Which Metric Matters Here?, ask students to defend their metric choice with evidence from their partner discussion before sharing with the class.

What to look for: Present students with a scenario: 'A spam detection model has 90% precision and 70% recall. Explain what each of these numbers means in the context of identifying spam emails. Which metric might be more important if the cost of a false positive (a legitimate email in spam) is high?'
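
To ground those numbers before the debrief, it can help to translate the percentages into counts of each error type. A minimal sketch in Python, assuming a hypothetical batch of 1,000 emails containing 180 real spam messages (counts chosen so the metrics work out exactly):

    # Hypothetical spam-filter tallies chosen so the metrics come out to
    # exactly 90% precision and 70% recall; real counts would differ.
    tp = 126   # spam correctly sent to the spam folder
    fn = 54    # spam that slipped into the inbox (the recall cost)
    fp = 14    # legitimate email wrongly sent to spam (the precision cost)
    tn = 806   # legitimate email correctly delivered

    precision = tp / (tp + fp)   # 126 / 140 = 0.90
    recall = tp / (tp + fn)      # 126 / 180 = 0.70

    print(f"precision = {precision:.2f} (how trustworthy a 'spam' flag is)")
    print(f"recall    = {recall:.2f} (how much real spam gets caught)")

Seeing that 90% precision still means 14 legitimate emails lost to the spam folder makes the 'cost of a false positive' question tangible.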

Understand · Apply · Analyze · Self-Awareness · Relationship Skills

Activity 02

Collaborative Problem-Solving: Breaking a Classifier

Students train a simple text classifier on a balanced dataset, record its metrics, then feed it deliberately tricky inputs: unusual phrasings, out-of-domain examples, and adversarial examples designed to flip predictions. They document which inputs fail and hypothesize why the model fails on each. This activity reframes evaluation as an adversarial process, not just a number-reporting exercise.
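
For teachers setting up the lab, one possible shape for the classifier is sketched below with scikit-learn; the toy training sentences and probe inputs are illustrative assumptions, not part of any provided materials, and a real class would use a larger balanced corpus.

    # Minimal lab scaffold: a bag-of-words spam classifier plus probes.
    # The tiny dataset and probe sentences are hypothetical placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "win a free prize now", "claim your free reward", "urgent offer click now",
        "meeting moved to friday", "see attached project notes", "lunch at noon?",
    ]
    train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    # Deliberately tricky probes: obfuscated spelling, out-of-domain text,
    # and spam vocabulary reordered to resemble a legitimate message.
    probes = [
        "you have w0n a fr3e prize",
        "the quarterly budget is attached",
        "now click: urgent, a free offer",
    ]
    for text in probes:
        print(text, "->", "spam" if model.predict([text])[0] == 1 else "not spam")

Students can record which probes flip the prediction and connect each failure to a property of the bag-of-words representation (for example, misspelled tokens never seen in training).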

Differentiate between overfitting and underfitting in machine learning models.

Facilitation Tip: During Collaborative Problem-Solving: Breaking a Classifier, circulate and ask groups to articulate why a simple baseline model fails before they iterate toward better solutions.

What to look for: Provide students with a small, pre-calculated confusion matrix for a binary classifier. Ask them to calculate precision, recall, and the F1-score. Then, ask them to identify whether the model is likely overfitting or underfitting based on hypothetical training vs. validation scores (e.g., Training Accuracy: 98%, Validation Accuracy: 65%).
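
A sketch of the expected calculations, using an assumed confusion matrix rather than the one you distribute (the counts below are hypothetical):

    # Hypothetical binary confusion matrix (counts are illustrative):
    #                    predicted positive   predicted negative
    # actual positive         tp = 40              fn = 10
    # actual negative         fp = 20              tn = 30
    tp, fn, fp, tn = 40, 10, 20, 30

    precision = tp / (tp + fp)                          # 40/60 = 0.667
    recall = tp / (tp + fn)                             # 40/50 = 0.800
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.727
    print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

    # The overfitting diagnosis rests on the gap between the two scores:
    train_acc, val_acc = 0.98, 0.65
    if train_acc - val_acc > 0.10:  # a large gap suggests memorization
        print("Large train/validation gap -> likely overfitting")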

Apply · Analyze · Evaluate · Create · Relationship Skills · Decision-Making · Self-Management

Activity 03

Gallery Walk · 25 min · Small Groups

Gallery Walk: Confusion Matrix Interpretation

Post four confusion matrices from different real-world models (medical diagnosis, spam filter, image classifier, loan approval). For each, groups calculate precision and recall, identify which type of error is more frequent, and write a recommendation about whether the model is ready to deploy given the stated use case. Groups compare recommendations during debrief.
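
Teachers who want to generate or sanity-check the posted matrices can adapt a sketch like this; the four matrices below are made-up stand-ins for the real handouts, each stored as (tp, fn, fp, tn):

    # Made-up confusion matrices standing in for the four posted scenarios.
    scenarios = {
        "medical diagnosis": (45, 15, 5, 935),
        "spam filter": (400, 100, 20, 480),
        "image classifier": (300, 60, 90, 550),
        "loan approval": (120, 30, 80, 770),
    }

    for name, (tp, fn, fp, tn) in scenarios.items():
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        dominant = "false negatives" if fn > fp else "false positives"
        print(f"{name:17s} precision={precision:.2f} recall={recall:.2f} "
              f"dominant error: {dominant}")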

Justify the selection of specific evaluation metrics based on the problem context.

Facilitation Tip: During Gallery Walk: Confusion Matrix Interpretation, assign each student a different matrix to present so that the class collectively covers precision, recall, and F1 across multiple scenarios.

What to look for: Facilitate a class discussion using the prompt: 'Imagine you are building a model to identify endangered species from camera trap images. Discuss which evaluation metric (precision, recall, or F1-score) would be most critical and why. Consider the consequences of both false positives and false negatives in this specific context.'

Understand · Apply · Analyze · Create · Relationship Skills · Social Awareness

Activity 04

Simulation Activity · 35 min · Pairs

Simulation Activity: Learning Curve Analysis

Students train the same model on increasing amounts of data (10%, 25%, 50%, 75%, and 100% of the training set) and plot training accuracy versus test accuracy at each point. They identify where the model overfits (training and test accuracy diverge) and discuss which strategies (more data, regularization, a simpler model) would help in each case.
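
A minimal way to run the simulation with scikit-learn, assuming a synthetic dataset and logistic regression as the shared model (any dataset and model the class already uses would substitute):

    # Sketch of the learning-curve simulation; make_classification stands in
    # for whatever dataset the class actually uses.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
        n = int(frac * len(X_train))
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        print(f"{frac:>4.0%} of data: "
              f"train acc={model.score(X_train[:n], y_train[:n]):.3f} "
              f"test acc={model.score(X_test, y_test):.3f}")

Plotting the two printed columns against the data fraction gives the learning curve students annotate.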

How do we measure the success or failure of an intelligent system using appropriate metrics?

Facilitation Tip: During Simulation Activity: Learning Curve Analysis, have students sketch expected curves for underfitting and overfitting on the board before running the simulation to anchor their predictions.

What to look for: Present students with a scenario: 'A spam detection model has 90% precision and 70% recall. Explain what each of these numbers means in the context of identifying spam emails. Which metric might be more important if the cost of a false positive (a legitimate email in spam) is high?'

Analyze · Evaluate · Create · Decision-Making · Self-Management

A few notes on teaching this unit

Students often accept high accuracy as proof of a good model without questioning the data distribution. To counter this, use side-by-side comparisons: show a model with 95% accuracy on imbalanced data next to one with 80% accuracy that correctly flags the minority class. This contrast reveals why metrics must align with problem goals. Avoid rushing to formulas; instead, anchor discussions in tangible consequences such as missed diagnoses or wasted resources.
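
One way to produce that side-by-side contrast is sketched below, assuming a synthetic 95/5 imbalanced dataset; the exact scores will vary with the data, but the pattern (a high-accuracy baseline with zero minority-class recall) is the point of the demonstration.

    # Side-by-side contrast on an imbalanced dataset. DummyClassifier always
    # predicts the majority class, so its high accuracy comes entirely from
    # ignoring the minority class.
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = [
        ("majority-class baseline", DummyClassifier()),
        ("balanced logistic regression",
         LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
    for name, model in models:
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(f"{name}: accuracy={accuracy_score(y_te, pred):.2f}, "
              f"minority recall={recall_score(y_te, pred):.2f}")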

Students will explain why accuracy alone is insufficient, interpret precision and recall in context, and identify overfitting from learning curves. They will justify metric choices by connecting consequences to real-world decisions.


Watch Out for These Misconceptions

  • During Think-Pair-Share: Which Metric Matters Here?, watch for students who default to accuracy without considering class imbalance.

    Use the activity’s provided imbalanced datasets to prompt students to calculate precision and recall, then ask them to explain why accuracy is misleading in each case.

  • During Collaborative Problem-Solving: Breaking a Classifier, students may assume overfitting only happens with large models.

    Have students compare training and validation performance across different model sizes, then ask them to explain why even a simple model can overfit when trained on limited data; a runnable sketch of this comparison follows this list.

  • During Simulation Activity: Learning Curve Analysis, students might believe a single high test score guarantees deployment readiness.

    Use the activity’s distribution shift scenario to guide students to recognize that test set performance does not account for future data changes, then discuss ongoing monitoring needs.
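
The model-size comparison in the second misconception can be run as a short sketch; the decision-tree depths, the 30-example training set, and the label noise below are illustrative choices, not prescribed parameters.

    # Even a small model can memorize a tiny, noisy training set.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=20, n_informative=4,
                               flip_y=0.2, random_state=0)  # noisy labels
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=30,
                                                random_state=0)

    for depth in (2, 5, None):  # None lets the tree grow until leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_tr, y_tr)
        print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
              f"val={tree.score(X_val, y_val):.2f}")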


Methods used in this brief