Fundamentals of Machine Learning: Supervised Learning
Students are introduced to supervised learning, exploring concepts like regression and classification and how models learn from labeled data.
About This Topic
Supervised learning underlies most machine learning systems deployed today. Students in US 12th-grade CS learn that in supervised learning a model is trained on a labeled dataset (pairs of input features and correct outputs) to learn a mapping function that generalizes to new, unseen inputs. The term 'supervised' reflects that the training process is guided by known correct answers.
Two major task types fall under supervised learning: classification, where the output is a category (spam or not spam, tumor type, digit label), and regression, where the output is a continuous value (house price, temperature forecast, credit score). Both tasks follow the same pipeline: collect labeled data, choose a model architecture, train by minimizing a loss function, evaluate on held-out test data, and iterate. Students also learn why splitting data into training and test sets is essential: using the same data for both produces inflated performance estimates that do not predict real-world behavior.
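The pipeline described above can be sketched in plain Python for a small regression task. This is a minimal illustration under stated assumptions, not a recommended implementation: the dataset is made up and noise-free (y = 2x + 1), the held-out examples are picked by hand, and the names `fit` and `mse`, the learning rate, and the step count are all choices made for this example.

```python
# Step 1: collect labeled data. Made-up, noise-free pairs (x, y) with y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in range(1, 9)]

# Hold out two examples for evaluation; the model never sees them in training.
test = [pair for pair in data if pair[0] in (3, 6)]
train = [pair for pair in data if pair[0] not in (3, 6)]

def mse(w, b, pairs):
    """Loss function: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in pairs) / len(pairs)

def fit(pairs, lr=0.01, steps=5000):
    """Steps 2-3: choose a linear model and train it by gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit(train)

# Step 4: evaluate on the held-out examples only.
test_mse = mse(w, b, test)
print(f"learned w={w:.3f}, b={b:.3f}, test MSE={test_mse:.6f}")
```

In a scikit-learn notebook, `train_test_split` and `LinearRegression` would typically stand in for the hand-written split and `fit`; the hand-written version is shown only to make the loss-minimization step visible.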
Active learning approaches are productive here because students can experience the training feedback loop directly using tools like Teachable Machine or scikit-learn, building intuition for what 'learning from data' actually means rather than treating it as a black box.
Key Questions
- How does a machine learning model differ from a traditional rule-based program?
- How do classification and regression tasks differ in supervised learning?
- How is a supervised learning model trained and evaluated?
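One way to make the first question concrete is to put a hand-written rule next to a rule learned from labeled examples. The sketch below is an assumption-laden toy: the feature (number of exclamation marks in a message), the data, the hand-picked cutoff of 3, and the function names are all invented for illustration.

```python
# Made-up labeled examples: (number of exclamation marks, label), where 1 = spam.
examples = [(0, 0), (1, 0), (2, 0), (1, 0), (5, 1), (7, 1), (4, 1), (6, 1)]

def rule_based(count):
    # Traditional program: a person chose the cutoff of 3 by hand.
    return 1 if count >= 3 else 0

def accuracy(threshold, pairs):
    # Fraction of examples that the rule "count >= threshold" labels correctly.
    correct = sum((1 if x >= threshold else 0) == y for x, y in pairs)
    return correct / len(pairs)

def learn_threshold(pairs):
    # "Training": try every observed count as a cutoff and keep the most accurate.
    candidates = sorted({x for x, _ in pairs})
    return max(candidates, key=lambda t: accuracy(t, pairs))

learned = learn_threshold(examples)

def learned_model(count):
    # Same form as the hand-written rule, but the cutoff came from the data.
    return 1 if count >= learned else 0

print("hand-picked cutoff: 3, learned cutoff:", learned)
```

Both functions have the same shape; the difference is where the cutoff comes from. Changing the labeled examples changes the learned cutoff without anyone editing the code.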
Learning Objectives
- Compare and contrast classification and regression tasks within supervised machine learning.
- Explain the fundamental process of training a supervised learning model using labeled data.
- Evaluate the performance of a trained supervised learning model using appropriate metrics.
- Design a simple supervised learning experiment to predict a categorical or numerical outcome.
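For the evaluation objective, "appropriate metrics" usually means something different for each task type. The sketch below assumes accuracy for classification and mean absolute error and mean squared error for regression; the labels and prices are made up, and the function names are chosen for this example only.

```python
def accuracy(y_true, y_pred):
    """Classification metric: fraction of predictions that match the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Regression metric: average size of the error, in the target's own units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Regression metric: average squared error, which penalizes large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up classification results.
true_labels = ["cat", "dog", "dog", "cat"]
predicted_labels = ["cat", "dog", "cat", "cat"]

# Made-up regression results (house prices in thousands of dollars).
true_prices = [200, 310, 150]
predicted_prices = [210, 300, 150]

acc = accuracy(true_labels, predicted_labels)
mae = mean_absolute_error(true_prices, predicted_prices)
mse = mean_squared_error(true_prices, predicted_prices)
print(acc, mae, mse)
```

Accuracy does not apply to the price predictions, and squared error does not apply to the cat/dog labels, which is one practical reason to decide the task type first.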
Before You Start
Why: Students need basic programming skills to implement and experiment with machine learning models.
Why: Understanding how to structure and process data is essential before applying machine learning algorithms.
Key Vocabulary
| Term | Definition |
| --- | --- |
| Labeled Data | A dataset where each data point is paired with a correct output or 'label', used to train supervised learning models. |
| Classification | A supervised learning task that predicts a discrete category or class label, such as 'spam' or 'not spam'. |
| Regression | A supervised learning task that predicts a continuous numerical value, such as a house price or temperature. |
| Training Set | The portion of labeled data used to teach the machine learning model by adjusting its parameters. |
| Test Set | A separate portion of labeled data, unseen during training, used to evaluate the model's generalization ability. |
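The training set and test set defined above can be produced with a shuffle followed by a cut. This is a minimal sketch: the function name `split_train_test`, the 20% test fraction, the fixed seed, and the toy labeled data are assumptions made for illustration.

```python
import random

def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the labeled examples, then cut it into two disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up labeled data: each number is paired with a 'big' or 'small' label.
labeled = [(x, "big" if x >= 5 else "small") for x in range(10)]

train_set, test_set = split_train_test(labeled)
print(len(train_set), "training examples,", len(test_set), "test examples")
```

Shuffling before cutting matters when the data arrives sorted (as it does here); otherwise the test set could contain only one class.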
Watch Out for These Misconceptions
Common Misconception: More training data always produces a better model.
What to Teach Instead
More data usually helps, but data quality and relevance can matter more than volume. A large dataset with systematic labeling errors or missing important features can train a confident but wrong model. Having students deliberately corrupt a portion of their training labels and observe the effect makes this concrete.
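The label-corruption exercise can be previewed with a tiny nearest-neighbor classifier. Everything in this sketch is an assumption for illustration: the one-dimensional data is made up, the two flipped labels are chosen by hand, and the test points sit deliberately close to the corrupted examples so the effect is easy to see. Real datasets usually degrade less dramatically.

```python
def nearest_label(x, train):
    # 1-nearest-neighbor: copy the label of the closest training point.
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

def accuracy(train, test):
    # Fraction of test points whose predicted label matches the true label.
    return sum(nearest_label(x, train) == y for x, y in test) / len(test)

# Made-up data: small x values are class 0, large x values are class 1.
clean_train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
test = [(1.4, 0), (2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1), (8.6, 1)]

# Corrupt two training labels: flip the points at x = 3 and x = 7.
noisy_train = [(x, 1 - y) if x in (3, 7) else (x, y) for x, y in clean_train]

clean_acc = accuracy(clean_train, test)
noisy_acc = accuracy(noisy_train, test)
print(f"clean labels: {clean_acc:.2f}, corrupted labels: {noisy_acc:.2f}")
```

The training set has the same size in both runs; only the correctness of two labels changed.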
Common Misconception: High accuracy on the training set means the model is good.
What to Teach Instead
A model can memorize training examples without learning generalizable patterns, a problem called overfitting. Evaluating on a separate test set is essential. Students who train on the full dataset and then 'test' on the same data regularly see near-100% accuracy, and experiencing the drop when they apply their model to new examples is a lesson that sticks.
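An extreme caricature of memorization makes the gap between training and test accuracy visible. In this sketch, one "model" is just a lookup table of the training examples, and the other learns a single cutoff. The data, the default guess of 0 for unseen inputs, and the midpoint rule for the cutoff are all assumptions chosen for illustration.

```python
# Made-up data: inputs below 5 are class 0, inputs of 5 or more are class 1.
train = [(1, 0), (2, 0), (4, 0), (6, 1), (8, 1), (9, 1)]
test = [(0, 0), (3, 0), (5, 1), (7, 1), (10, 1)]

# Model A memorizes: a lookup table of training inputs, guessing 0 for anything new.
memory = dict(train)

def memorizer(x):
    return memory.get(x, 0)

# Model B generalizes: a cutoff halfway between the two classes in the training data.
largest_zero = max(x for x, y in train if y == 0)
smallest_one = min(x for x, y in train if y == 1)
cutoff = (largest_zero + smallest_one) / 2

def threshold_model(x):
    return 1 if x >= cutoff else 0

def accuracy(model, pairs):
    # Fraction of examples the model labels correctly.
    return sum(model(x) == y for x, y in pairs) / len(pairs)

for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

Both models are perfect on the training set, so training accuracy alone cannot tell them apart; only the held-out examples reveal the difference.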
Common Misconception: Supervised learning models understand the meaning of the data they process.
What to Teach Instead
Models learn statistical associations between input features and outputs. They do not understand context, causation, or meaning. A spam classifier that achieves 98% accuracy has no idea what spam is; it has found patterns that correlate with the label. This distinction matters enormously when discussing model failures and AI ethics.
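A toy word-counting spam filter shows what "patterns that correlate with the label" can look like in practice. This is a deliberately crude sketch, not how production spam filters work: the six training messages, the scoring rule (spam word count minus non-spam word count), and the name `predict` are assumptions made for this example.

```python
from collections import Counter

# Made-up labeled messages.
spam = ["win free prize now", "free money now", "claim free prize"]
ham = ["lunch at noon", "meeting notes attached", "see you at lunch"]

# "Training": count how often each word appears under each label.
spam_counts = Counter(word for msg in spam for word in msg.split())
ham_counts = Counter(word for msg in ham for word in msg.split())

def predict(message):
    # Score each word by how much more often it appeared in spam than in ham.
    # Words never seen in training contribute 0.
    score = sum(spam_counts[w] - ham_counts[w] for w in message.lower().split())
    return "spam" if score > 0 else "ham"

print(predict("see you at lunch"))
print(predict("the library offers free books now"))
```

The second message is a harmless announcement, yet it is flagged because "free" and "now" appeared in spam during training. The model is matching word statistics, not judging what the message means.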
Active Learning Ideas
Hands-On Lab: Train Your First Classifier
Students use Google's Teachable Machine or a simple scikit-learn notebook to train an image or text classifier on a dataset they collect themselves. They deliberately include mislabeled examples and observe how this degrades accuracy. The lab closes with each pair reporting their accuracy and one insight about what made their training data better or worse.
Think-Pair-Share: Classification or Regression?
Present eight real-world prediction problems and ask pairs to categorize each as classification or regression and justify the choice. Include ambiguous cases like predicting customer satisfaction (score 1-10 versus positive/negative). Whole-class discussion reveals that the distinction sometimes depends on how you frame the business problem, not just the data.
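The customer-satisfaction case can be shown as two different target columns built from the same survey responses. The scores and the cutoff of 7 for "positive" are made-up assumptions; a different business question could justify a different cutoff.

```python
# Made-up survey responses on a 1-10 scale.
survey_scores = [2, 9, 7, 4, 10, 6]

# Framing 1 (regression): predict the numeric score itself.
regression_targets = [float(s) for s in survey_scores]

# Framing 2 (classification): predict a category derived from the score.
POSITIVE_CUTOFF = 7  # assumed cutoff, chosen for illustration
classification_targets = [
    "positive" if s >= POSITIVE_CUTOFF else "negative" for s in survey_scores
]

print(regression_targets)
print(classification_targets)
```

The input features would be identical in both framings; only the definition of the label changes, which is why the choice depends on the question being asked.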
Socratic Seminar: What Does 'Learning' Mean?
Open with the question: 'Is a model that scores 99% accuracy on training data but 60% on new data actually learning?' Students draw on their lab experience to discuss generalization, memorization, and the purpose of the train/test split. The teacher facilitates without providing answers, letting student reasoning drive the conversation toward overfitting.
Gallery Walk: Algorithm Comparison
Post four posters around the room (linear regression, decision trees, k-nearest neighbors, and naive Bayes), each with a brief description, a sample use case, and a blank section labeled 'when this would struggle.' Groups rotate, add sticky notes to the struggle section, then rotate again to critique and extend each other's entries.
Real-World Connections
- Financial analysts use classification models to predict loan default risk, helping banks decide whether to approve applications for individuals in cities like New York or Chicago.
- Medical researchers employ regression models to forecast patient recovery times based on various health indicators, aiding treatment planning in hospitals worldwide.
- E-commerce platforms like Amazon utilize classification algorithms to categorize products and recommend items to customers based on their past purchases and browsing history.
Assessment Ideas
Provide students with two scenarios: one describing predicting house prices and another describing identifying images of cats or dogs. Ask them to identify which scenario is a classification task and which is a regression task, and to briefly explain why.
Present students with a small, pre-labeled dataset (e.g., fruit type and color). Ask them to verbally explain how they would use this data to train a model to identify new fruits, focusing on the role of the labels.
Pose the question: 'Why is it crucial to evaluate a supervised learning model on data it has not seen during training?' Facilitate a discussion where students explain the concept of overfitting and the importance of the test set for assessing real-world performance.
Frequently Asked Questions
What is supervised learning and how is it different from regular programming?
What is the difference between classification and regression in machine learning?
Why do you need separate training and test datasets in supervised learning?
How does active learning help students understand supervised machine learning?