Evaluating Machine Learning Models: Activities & Teaching Strategies
Machine learning model evaluation is abstract until students confront real data where accuracy misleads. Active learning makes these concepts concrete by having students manipulate models, inspect outputs, and argue about trade-offs using the same tools professionals use.
Learning Objectives
1. Calculate and interpret precision, recall, and F1-score from a given confusion matrix for a binary classification problem.
2. Differentiate between overfitting and underfitting by comparing model performance on training and validation datasets.
3. Justify the selection of an appropriate evaluation metric (e.g., precision, recall) for a given machine learning application scenario.
4. Critique the performance of a machine learning model by analyzing its performance across various evaluation metrics and identifying potential biases or failure modes.
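For the first objective, the calculations can be sketched directly from the four cells of a confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow:

```python
# Illustrative metric calculations from a binary confusion matrix.
# tp/fp/fn/tn counts are made up for the example.
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)   # of everything flagged positive, how much was right?
recall = tp / (tp + fn)      # of all actual positives, how many did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note that accuracy (0.85 here) can look comfortable even when recall is weak, which is exactly the tension the activities below explore.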
Think-Pair-Share: Which Metric Matters Here?
Present three scenarios: a spam filter, a cancer screening test, and a fraud detection system. For each, pairs must decide whether precision or recall is the higher-priority metric and justify the choice by describing the real-world cost of each type of error. The debrief shows that metric selection is a domain judgment, not a mathematical one.
Prepare & details
How do we measure the success or failure of an intelligent system using appropriate metrics?
Facilitation Tip: During Think-Pair-Share: Which Metric Matters Here?, ask students to defend their metric choice with evidence from their partner discussion before sharing with the class.
Setup: Standard classroom seating; students turn to a neighbor
Materials: Discussion prompt (projected or printed), Optional: recording sheet for pairs
Collaborative Problem-Solving: Breaking a Classifier
Students train a simple text classifier on a balanced dataset, record its metrics, then feed it deliberately tricky inputs: unusual phrasings, out-of-domain examples, and adversarial examples designed to flip predictions. They document which inputs fail and hypothesize why the model fails on each. This activity reframes evaluation as an adversarial process, not just a number-reporting exercise.
Prepare & details
Differentiate between overfitting and underfitting in machine learning models.
Facilitation Tip: During Collaborative Problem-Solving: Breaking a Classifier, circulate and ask groups to articulate why a simple baseline model fails before they iterate toward better solutions.
Setup: Groups at tables with problem materials
Materials: Problem packet, Role cards (facilitator, recorder, timekeeper, reporter), Problem-solving protocol sheet, Solution evaluation rubric
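The "breaking" workflow can be sketched with a toy model. Everything below is a hypothetical stand-in, assuming a tiny spam/ham task and a deliberately naive keyword-count classifier, not whatever model the class actually trains:

```python
# A minimal sketch of probing a classifier with tricky inputs.
# Training data, model, and probes are all illustrative placeholders.
from collections import Counter

train = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch plans for tomorrow", "ham"),
]

# "Train": count how often each word appears per class.
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    word_counts[label].update(text.split())

def classify(text):
    # Score each class by how often its training words appear in the input.
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    if max(scores.values()) == 0:
        return "ham"  # unknown vocabulary defaults to the benign class
    return max(scores, key=scores.get)

# Probe with deliberately tricky inputs and record failures.
probes = [
    ("free money now", "spam"),             # in-domain: easy
    ("fr33 m0ney n0w", "spam"),             # adversarial obfuscation
    ("notes about the free lunch", "ham"),  # trigger word in a benign context
]
for text, expected in probes:
    got = classify(text)
    print(f"{text!r}: predicted {got}, expected {expected}")
```

The obfuscated spam slips through because the model never saw those character substitutions, and the benign sentence gets flagged because "free" dominates the score: two concrete hypotheses of the kind students should be writing down.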
Gallery Walk: Confusion Matrix Interpretation
Post four confusion matrices from different real-world models (medical diagnosis, spam filter, image classifier, loan approval). For each, groups calculate precision and recall, identify which type of error is more frequent, and write a recommendation about whether the model is ready to deploy given the stated use case. Groups compare recommendations during debrief.
Prepare & details
Justify the selection of specific evaluation metrics based on the problem context.
Facilitation Tip: During Gallery Walk: Confusion Matrix Interpretation, assign each student a different matrix to present so that the class collectively covers precision, recall, and F1 across multiple scenarios.
Setup: Wall space or tables arranged around room perimeter
Materials: Large paper/poster boards, Markers, Sticky notes for feedback
Simulation Activity: Learning Curve Analysis
Students train the same model on increasing amounts of data (10%, 25%, 50%, 75%, and 100% of the training set) and plot training accuracy versus test accuracy at each point. They identify where the model overfits (training and test accuracy diverge) and discuss which strategies (more data, regularization, a simpler model) would help in each case.
Prepare & details
How do we measure the success or failure of an intelligent system using appropriate metrics?
Facilitation Tip: During Simulation Activity: Learning Curve Analysis, have students sketch expected curves for underfitting and overfitting on the board before running the simulation to anchor their predictions.
Setup: Groups at tables with case materials
Materials: Case study packet (3-5 pages), Analysis framework worksheet, Presentation template
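The learning-curve procedure can be sketched end to end with synthetic data. The labeling rule and the 1-nearest-neighbour "memorizer" below are hypothetical choices, picked because a memorizer makes the train/test gap easy to see:

```python
# A minimal learning-curve sketch on synthetic data.
# The ground-truth rule and the 1-NN memorizer are illustrative only.
import random

random.seed(0)

def true_label(x):
    return 1 if (x % 7) < 3 else 0  # hypothetical ground-truth rule

data = [(x, true_label(x)) for x in random.sample(range(1000), 400)]
train_full, test = data[:300], data[300:]

def predict(train, x):
    # 1-NN: copy the label of the closest training point (pure memorization).
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def accuracy(train, dataset):
    return sum(predict(train, x) == y for x, y in dataset) / len(dataset)

for frac in (0.10, 0.25, 0.50, 0.75, 1.00):
    subset = train_full[: int(len(train_full) * frac)]
    print(f"{frac:>4.0%}: train={accuracy(subset, subset):.2f} "
          f"test={accuracy(subset, test):.2f}")
```

Training accuracy is pinned at 1.00 at every fraction (the memorizer always recalls its own points) while test accuracy climbs with more data, so the shrinking gap is the signal students should be reading off their plots.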
Teaching This Topic
Students often accept high accuracy as proof of a good model without questioning the data distribution. To counter this, use side-by-side comparisons: show a model with 95% accuracy on imbalanced data next to one with 80% accuracy that correctly flags the minority class. This contrast reveals why metrics must align with problem goals. Avoid rushing to formulas; instead, anchor discussions in tangible consequences such as missed diagnoses or wasted resources.
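That side-by-side contrast can be built with hypothetical counts, assuming 95 negatives and 5 positives:

```python
# Illustrative contrast: high accuracy vs useful recall on imbalanced data.
# Labels and predictions are fabricated to match the 95/5 split above.
labels = [0] * 95 + [1] * 5

# Model A: always predicts the majority class.
preds_a = [0] * 100
# Model B: noisier overall (16 false alarms), but catches 4 of the 5 positives.
preds_b = [0] * 79 + [1] * 16 + [1] * 4 + [0] * 1

def accuracy(y, p):
    return sum(a == b for a, b in zip(y, p)) / len(y)

def recall(y, p):
    positives = [(a, b) for a, b in zip(y, p) if a == 1]
    return sum(b == 1 for _, b in positives) / len(positives)

for name, preds in (("A", preds_a), ("B", preds_b)):
    print(f"{name}: accuracy={accuracy(labels, preds):.2f} "
          f"recall={recall(labels, preds):.2f}")
```

Model A scores 95% accuracy with zero recall; model B scores 83% accuracy with 80% recall. Which one you'd deploy depends entirely on the cost of a missed positive.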
What to Expect
Students will explain why accuracy alone is insufficient, interpret precision and recall in context, and identify overfitting from learning curves. They will justify metric choices by connecting consequences to real-world decisions.
Watch Out for These Misconceptions
Common Misconception: During Think-Pair-Share: Which Metric Matters Here?, watch for students who default to accuracy without considering class imbalance.
What to Teach Instead
Use the activity’s provided imbalanced datasets to prompt students to calculate precision and recall, then ask them to explain why accuracy is misleading in each case.
Common Misconception: During Collaborative Problem-Solving: Breaking a Classifier, students may assume overfitting only happens with large models.
What to Teach Instead
Have students compare training and validation performance across different model sizes, then ask them to explain why even a simple model can overfit when trained on limited data.
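The point that even a tiny model can overfit on limited data can be shown with a two-parameter line. The data below are hypothetical noisy samples of y = x:

```python
# A simple model overfitting scarce data: a line fitted exactly through
# two noisy points memorizes their noise. Data values are made up.
train = [(0.0, 0.4), (1.0, 0.7)]           # two noisy samples of y = x
valid = [(2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

# Fit slope/intercept exactly through the two training points.
(x0, y0), (x1, y1) = train
slope = (y1 - y0) / (x1 - x0)
intercept = y0 - slope * x0

def mse(points):
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

print(f"train MSE={mse(train):.3f} valid MSE={mse(valid):.3f}")
```

Training error is essentially zero while validation error is large: the model fitted the noise in its two samples, with no large architecture involved.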
Common Misconception: During Simulation Activity: Learning Curve Analysis, students might believe a single high test score guarantees deployment readiness.
What to Teach Instead
Use the activity’s distribution shift scenario to guide students to recognize that test set performance does not account for future data changes, then discuss ongoing monitoring needs.
Assessment Ideas
After Think-Pair-Share: Which Metric Matters Here?, ask students to write a short paragraph explaining which metric they would prioritize for a medical test and why, using evidence from their discussion.
During Collaborative Problem-Solving: Breaking a Classifier, circulate and ask each group to explain whether their model is overfitting or underfitting based on training vs. validation scores, then have them adjust hyperparameters accordingly.
During Gallery Walk: Confusion Matrix Interpretation, facilitate a class discussion where students justify their metric choices for the medical diagnosis scenario, linking their decisions to consequences such as false alarms or missed detections.
Extensions & Scaffolding
- Challenge: Ask students to design a model that maximizes recall for a fraud detection system, then defend their approach in a one-minute pitch using precision-recall trade-offs.
- Scaffolding: Provide a partially completed confusion matrix template for students to fill in during the Gallery Walk before calculating metrics independently.
- Deeper exploration: Invite students to research adversarial attacks on deployed models and present findings on how distribution shift affects real-world performance.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Confusion Matrix | A table that summarizes the performance of a classification model, showing true positives, true negatives, false positives, and false negatives. |
| Precision | The proportion of true positive predictions among all positive predictions made by the model; it answers, 'Of all the instances predicted as positive, how many were actually positive?' |
| Recall (Sensitivity) | The proportion of actual positive instances that were correctly identified by the model; it answers, 'Of all the actual positive instances, how many did the model find?' |
| F1-Score | The harmonic mean of precision and recall, providing a single score that balances both metrics, useful when class distribution is uneven. |
| Overfitting | A phenomenon where a machine learning model learns the training data too well, including noise and outliers, leading to poor generalization on unseen data. |
| Underfitting | A phenomenon where a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and unseen data. |