
Teaching a Machine to Ask Smart Questions

The Big Idea: What is Active Learning?

Imagine you're a student preparing for an exam. You have two choices:


  1. Passive Learning: Reread the entire textbook from start to finish, spending equal time on every page.

  2. Active Learning: Take a practice test, find the questions you're most unsure about, and then focus your study time on those specific topics.


Which method is more efficient? The second one, right?

 

Active Learning in machine learning is the same principle. Instead of training a model on a huge, fully labeled dataset (which can be expensive and time-consuming to create), an active learning system starts with a few labeled examples and then intelligently chooses new, unlabeled examples to be labeled. It specifically picks the ones it's most "confused" about, as these are the examples it will learn the most from.

 

My Goal: Train an accurate model for 1D-MNIST while labeling only a small fraction of the data, letting the model choose which data points get a label.

Step 1: Preparing the MNIST Data

The MNIST dataset contains images of handwritten digits. The '1D' version simply means we'll "unroll" each 28×28 pixel image into a single flat vector of 784 numbers (28 × 28 = 784). torchvision does this for us easily.
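Here's a minimal sketch of the loading step (the data path is just a placeholder choice):

import torch
from torchvision import datasets, transforms

# Each 28x28 image becomes a flat vector of 784 values.
flatten = transforms.Compose([
    transforms.ToTensor(),                    # tensor of shape (1, 28, 28)
    transforms.Lambda(lambda x: x.view(-1)),  # tensor of shape (784,)
])

train_set = datasets.MNIST(root="./data", train=True, download=True, transform=flatten)
test_set = datasets.MNIST(root="./data", train=False, download=True, transform=flatten)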


We need to split our data very carefully (see the code sketch after this list):

  1. Unlabeled Pool (U): The vast majority of our training data. We pretend we don't know the labels for these.

  2. Initial Labeled Set (L): A tiny, randomly selected set of data points from the training set to train our very first, "seed" model.

  3. Test Set: A completely separate set of data that the model never sees during training. We use this at the very end to evaluate how well the model performs in the real world.
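A sketch of how these splits might be set up, assuming the train_set from the Step 1 snippet:

import numpy as np

num_train = len(train_set)                  # 60,000 images in the MNIST training set
all_indices = np.random.permutation(num_train)

labeled_idx = list(all_indices[:100])       # initial labeled set L (100 samples)
unlabeled_idx = list(all_indices[100:])     # unlabeled pool U (59,900 samples)
# The official MNIST test split (10,000 images) serves as our held-out test set.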

Step 2: The Query Strategy - Uncertainty Sampling

This is the core of active learning. How does the model choose which data point to ask for a label? We'll use a simple and very effective method called 'Least Confidence Sampling'.


Here's how it works (a code sketch follows the list):

  1. Our model will look at an unlabeled data point and predict the probability for each class (0 through 9). For example: [0.1, 0.05, 0.3, 0.1, 0.05, 0.15, 0.05, 0.1, 0.05, 0.05]

  2. The model's confidence is the highest probability in that list. In the example above, the highest probability is 0.3 for the digit '2'.

  3. The model repeats this for all unlabeled data points.

  4. It then finds the data point where its confidence was the lowest. This is the data point it's most "confused" about.

  5. This is the one we choose to label!
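In code, the whole strategy might look like the sketch below. It assumes the model outputs raw logits and that unlabeled_loader iterates over the pool U in a fixed order, without shuffling:

import torch
import torch.nn.functional as F

@torch.no_grad()
def least_confident(model, unlabeled_loader, k=10):
    model.eval()
    confidences = []
    for x, _ in unlabeled_loader:                    # labels are ignored (we "don't know" them)
        probs = F.softmax(model(x), dim=1)           # step 1: class probabilities, shape (batch, 10)
        confidences.append(probs.max(dim=1).values)  # step 2: the highest probability per sample
    confidences = torch.cat(confidences)             # step 3: one confidence score per pool sample
    # steps 4-5: the k samples with the LOWEST confidence are the most "confusing" ones
    return torch.topk(confidences, k, largest=False).indices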

Step 3: The Active Learning Loop

This is the main algorithm. We will repeat this process several times, adding a few new labels in each "query" step.


Start:

We have our tiny labeled set L (100 samples) and our huge unlabeled pool U (59,900 samples).


The Loop:

  1. Train: Train our neural network model using only the data in L.

  2. Predict: Use the trained model to predict probabilities for every sample in the unlabeled pool U.

  3. Query: Using our "Least Confidence" strategy, find the 10 most uncertain samples from U.

  4. Label & Update: "Ask the oracle" for their labels (in our simulation, we just look up the true labels). Move these 10 newly labeled samples from U to L.

  5. Evaluate (Optional but Recommended): Check the model's accuracy on the separate test set. This tells us how our model is improving with each query.

  6. Repeat: Go back to Step 1 and retrain the model on the now slightly larger labeled set L.

 

We repeat this loop 50 times. At the end, we'll have trained a model on just 100 + (50 × 10) = 600 labels, but because we chose them intelligently, our model should be surprisingly accurate!
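Here's what the loop might look like in code. This is a sketch: train_model, evaluate, and make_loader stand in for ordinary PyTorch training, evaluation, and DataLoader-over-Subset helpers (make_loader is assumed to preserve index order), and MLP is the model defined in Step 4 below.

for round_num in range(50):
    model = MLP()                                              # start fresh each round
    train_model(model, make_loader(train_set, labeled_idx))    # 1. Train on L only
    pool_loader = make_loader(train_set, unlabeled_idx)
    query = least_confident(model, pool_loader, k=10)          # 2 & 3. Predict and query
    for pos in sorted(query.tolist(), reverse=True):           # 4. Move samples from U to L
        labeled_idx.append(unlabeled_idx.pop(pos))             #    (the dataset plays the oracle)
    accuracy = evaluate(model, test_set)                       # 5. Evaluate on the test set
    print(f"Round {round_num + 1}: {len(labeled_idx)} labels, test accuracy {accuracy:.3f}")
# 6. The loop then repeats with the slightly larger labeled set.

Popping the queried positions in reverse-sorted order keeps the remaining pool indices valid while samples are removed.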


Step 4: The Code - Putting it all Together


The Model: First, let's define a simple neural network. A small Multi-Layer Perceptron (MLP) is perfect for MNIST.
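A sketch of such a model (the hidden-layer size here is an illustrative choice, not a tuned value):

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=784, hidden=128, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # raw logits for the 10 digit classes
        )

    def forward(self, x):
        return self.net(x)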


Output

[Plot: test accuracy vs. number of labeled samples]

Part 1: The Sharp Rise

 

Look at the section from 100 to about 250 labeled samples.

  • The accuracy jumps dramatically, from under 40% to nearly 80%. This is the steepest part of the curve.

  • In the beginning, the model is very ignorant. Every single sample it asks for is highly informative. It's learning the most fundamental differences between digits (e.g., what makes a '1' different from an '8'). By intelligently choosing the most "confusing" examples, it learns the most important lessons very quickly. This is where active learning provides the biggest return on your "labeling budget."

Part 2: The Plateau


Now look at the section from 300 to 600 labeled samples.

  • The curve becomes much flatter. The accuracy is still climbing, but much more slowly (from roughly 82% to 88%).

  • By this point, the model has already learned the basic, easy-to-distinguish features of the digits. The examples it's now querying are more nuanced and difficult—perhaps a strangely written '7' that looks a bit like a '1', or a '5' that looks like a '6'. While these new labels still help, they offer less new information than the initial ones did. This is a classic case of diminishing returns. The model is now fine-tuning its knowledge rather than learning broad new concepts.
