Computer Science · 11th Grade · Artificial Intelligence and Ethics · Weeks 19-27

Unsupervised Learning: Clustering

Discovering patterns and structures in unlabeled data using algorithms like K-Means.

Common Core State Standards · CSTA: 3B-AP-09 · CSTA: 3B-DA-07

About This Topic

Unsupervised learning deals with data that has no labels: no pre-assigned categories or target values. The goal is to find structure, patterns, or groupings that exist naturally in the data. Clustering is one of the most widely used unsupervised techniques, and K-Means is among the most common clustering algorithms, making it a natural entry point for this unit.

Starting from K initial cluster centers, K-Means assigns each data point to the nearest center, then recalculates each center as the mean of all points assigned to it. This repeats until assignments stabilize. The algorithm is elegant in its simplicity and visually intuitive: students can trace it by hand on a small dataset or watch animated visualizations that make the convergence process concrete.
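The assign-then-recalculate loop can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the choice to start from K randomly sampled data points, the fixed `seed`, and the six-point dataset are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster 2D points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption of this sketch: use k distinct data points as starting centers.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments

# Hypothetical dataset: two visually obvious groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))
print(assignments)
```

On this dataset the first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random starting centers.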

Contrasting unsupervised with supervised learning is an important conceptual move. Students often struggle at first with the idea of finding patterns in data without knowing what to look for. Active learning activities that ask students to cluster data by hand, before introducing the algorithm, help them see that structure can be discovered without labels, which makes the algorithmic approach feel like a formalization of something they already did intuitively.

Key Questions

  1. Explain how unsupervised learning identifies patterns without explicit labels.
  2. Analyze the purpose and mechanics of clustering algorithms like K-Means.
  3. Differentiate between supervised and unsupervised learning applications.

Learning Objectives

  • Explain the fundamental difference between supervised and unsupervised learning, citing examples of each.
  • Analyze the iterative process of the K-Means clustering algorithm, including centroid initialization and reassignment.
  • Calculate the mean of a small dataset to determine a cluster centroid.
  • Classify data points into distinct clusters based on proximity to centroids.
  • Evaluate the effectiveness of K-Means clustering on a given dataset, considering the choice of K.

Before You Start

Introduction to Data Visualization

Why: Students need to be able to interpret scatter plots to visually identify potential groupings in data.

Basic Statistics: Mean and Averages

Why: The K-Means algorithm relies on calculating the mean to determine cluster centroids.
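As a quick refresher, a centroid is just the mean taken separately along each axis. The sketch below assumes 2D points stored as `(x, y)` tuples; the function name `centroid` and the sample cluster are made up for illustration.

```python
def centroid(points):
    """Return the mean position of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n  # average of the x-coordinates
    mean_y = sum(y for _, y in points) / n  # average of the y-coordinates
    return (mean_x, mean_y)

cluster = [(1, 2), (3, 4), (5, 6)]
print(centroid(cluster))  # (3.0, 4.0)
```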

Key Vocabulary

Unsupervised Learning: A type of machine learning where algorithms learn patterns from data that has not been labeled or classified. The goal is to find inherent structure in the data.
Clustering: An unsupervised learning technique used to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
K-Means Algorithm: A popular clustering algorithm that aims to partition 'n' observations into 'k' clusters in which each observation belongs to the cluster with the nearest mean (cluster centroid).
Centroid: The center of a cluster, calculated as the mean of all data points assigned to that cluster. It is used to determine which cluster a data point belongs to.
Iteration: A single pass through the K-Means algorithm, involving the reassignment of data points to centroids and the recalculation of centroids.

Watch Out for These Misconceptions

Common Misconception: Unsupervised learning is less useful than supervised learning because there are no labels.

What to Teach Instead

Unsupervised learning is often more practical precisely because labels are expensive or impossible to obtain. Most real-world data is unlabeled. Clustering has been used to discover cancer subtypes, segment customers, detect fraud, and compress images. The absence of labels doesn't make the technique weaker; it makes it applicable to a much larger set of problems.

Common Misconception: K-Means always finds the correct clusters.

What to Teach Instead

K-Means finds clusters that minimize within-cluster variance given the starting positions and the value of K, but those clusters may not match any meaningful real-world groupings. The algorithm is sensitive to initialization and can converge to local optima. It also assumes clusters are roughly spherical and similar in size, which isn't always true. Evaluating whether clusters are meaningful is always a human judgment.
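A tiny example can make the sensitivity to initialization concrete. The sketch below runs the same K-Means loop twice on four points at the corners of a wide rectangle, once from each of two hand-picked sets of starting centers. The function name `run_kmeans`, the dataset, and the starting positions are assumptions chosen for illustration; the second value returned is the within-cluster sum of squared distances, where lower means tighter clusters.

```python
import math

def run_kmeans(points, centroids, max_iters=100):
    """Run K-Means from the given starting centroids; return (assignments, wcss)."""
    centroids = list(centroids)
    k = len(centroids)
    assignments = None
    for _ in range(max_iters):
        # Assign each point to its nearest centroid.
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new == assignments:
            break
        assignments = new
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # Within-cluster sum of squared distances to the final centroids.
    wcss = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return assignments, wcss

# Four corners of a rectangle that is 4 units wide and 1 unit tall.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
left_right = run_kmeans(points, [(0, 0.5), (4, 0.5)])
top_bottom = run_kmeans(points, [(2, 0), (2, 1)])
print(left_right)  # ([0, 0, 1, 1], 1.0)
print(top_bottom)  # ([0, 1, 0, 1], 16.0)
```

Both runs stop because assignments no longer change, yet the second is stuck in a much worse grouping (top row versus bottom row). Students can verify both results by hand, which makes this a useful discussion starter about local optima.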

Common Misconception: You have to pick K before running the algorithm, so you need to know the answer in advance.

What to Teach Instead

Choosing K is part of the process, not a prerequisite for knowing the answer. Techniques like the elbow method, silhouette analysis, and domain knowledge help select a reasonable K. Running the algorithm at multiple K values and comparing results is standard practice. The 'right' K is often the one that produces clusters that make sense given what you know about the data.
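The elbow method can be sketched by running K-Means for several values of K and recording the within-cluster sum of squares (WCSS) each time. To keep the result reproducible, this sketch swaps random initialization for a deterministic farthest-first heuristic; that choice, the function names, and the nine-point dataset are all assumptions made for illustration.

```python
import math

def farthest_first(points, k):
    """Deterministic starting centers: the first point, then repeatedly
    the point farthest from every center chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def wcss_for_k(points, k, max_iters=100):
    """Run K-Means with k clusters and return the within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    assignments = None
    for _ in range(max_iters):
        new = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new == assignments:
            break
        assignments = new
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))

# Hypothetical dataset with three visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 1), (8, 2), (9, 1), (5, 8), (5, 9), (6, 8)]
wcss_by_k = {k: wcss_for_k(points, k) for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On this dataset the WCSS falls from roughly 176 at K=1 to about 77.5 at K=2 and 4 at K=3, then barely improves at K=4 and K=5. That sharp bend at K=3 is the "elbow" students would look for on a plot.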

Real-World Connections

  • Retail companies use clustering to segment customers based on purchasing behavior, allowing for targeted marketing campaigns. For example, Amazon might group shoppers who frequently buy electronics and books separately from those who buy groceries.
  • Biologists employ clustering to identify distinct groups of genes with similar functions or expression patterns, aiding in the understanding of biological processes. This can help researchers find patterns in DNA sequences for disease research.
  • Social media platforms utilize clustering to group users with similar interests or network connections, which can inform content recommendations and friend suggestions. This helps platforms like TikTok or Instagram personalize user feeds.

Assessment Ideas

Quick Check

Present students with a small 2D dataset (e.g., 6-8 points) and ask them to manually perform one iteration of the K-Means algorithm. They should draw the initial centroids, assign points, and calculate the new centroid locations.
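For checking student work, a single iteration can be computed with a short helper. This is only a sketch: the function name `one_iteration`, the six sample points, and the two starting centroids are hypothetical and should be replaced with whatever dataset is handed out.

```python
import math

def one_iteration(points, centroids):
    """Perform one assign-and-update pass; return (assignments, new_centroids)."""
    # Assign each point to the index of its nearest centroid.
    assignments = [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == c]
        if members:
            # New centroid is the per-axis mean of the cluster's members.
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
    return assignments, new_centroids

points = [(1, 1), (2, 1), (1, 3), (6, 5), (7, 7), (8, 6)]
assignments, new_centroids = one_iteration(points, [(2, 2), (5, 5)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print([tuple(round(v, 2) for v in c) for c in new_centroids])  # [(1.33, 1.67), (7.0, 6.0)]
```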

Discussion Prompt

Pose the question: 'Imagine you are a data scientist for a streaming service. How could you use clustering to improve user experience without knowing exactly what each user likes beforehand? What challenges might you face?'

Exit Ticket

Ask students to write down one key difference between supervised and unsupervised learning and provide a real-world example of where clustering is applied, explaining briefly why it's unsupervised.

Frequently Asked Questions

What is the difference between supervised and unsupervised learning?
Supervised learning trains a model on labeled data: each training example has an input and a known correct output. The model learns to predict outputs for new inputs. Unsupervised learning works with unlabeled data, so there are no correct outputs. Instead, the algorithm finds structure, patterns, or groupings that exist in the data. Clustering, dimensionality reduction, and anomaly detection are all unsupervised tasks.
How does the K-Means clustering algorithm work?
K-Means starts by placing K cluster centers at random positions. It then assigns each data point to the nearest center, recalculates each center as the mean of all points assigned to it, and repeats until assignments stop changing. The result is K groups in which each point is closer to its own group's center than to any other center. The number K must be chosen before running the algorithm.
How can active learning help students understand unsupervised learning?
The key challenge in unsupervised learning is the conceptual shift from 'finding the right answer' to 'finding meaningful structure.' Activities that ask students to cluster data by hand, before seeing the algorithm, let them discover that groupings can be found without labels. Tracing K-Means step by step on a small printed dataset makes the algorithm's mechanics visible in a way that code output alone doesn't.
What are real-world applications of clustering?
Customer segmentation is one of the most common applications, grouping customers by purchasing behavior to tailor marketing. Clustering is also used to group similar news articles, identify gene expression patterns in biology, detect anomalous network traffic for cybersecurity, and segment medical images. In each case, the goal is to discover naturally occurring groups in unlabeled data that inform decisions or further analysis.