
Semi-Supervised Learning Explained: Pseudo-Labeling, FixMatch, and More

The Promise of Learning from Almost-Free Data

You have 1,000 labeled medical images and 100,000 unlabeled ones. Training only on the labeled data gives 78% accuracy. Adding the unlabeled data through semi-supervised learning pushes it to 93%. No extra labels required.

That single sentence explains why semi-supervised learning has quietly become one of the most consequential ideas in modern machine learning. Labels are expensive. A radiologist annotating a chest X-ray costs real money and takes real minutes. A crowd worker labeling toxic comments has to read each one carefully. A self-driving engineer hand-segmenting pedestrians in a video frame might spend ten minutes per frame. But the raw data — the unlabeled X-rays sitting on a hospital server, the billions of comments on Reddit, the petabytes of driving footage on a car’s hard drive — is essentially free.

Semi-supervised learning (SSL) is the set of techniques that lets you train models using both kinds of data simultaneously: a small pile of labeled examples and a much larger pile of unlabeled ones. When it works, it works dramatically: modern methods like FixMatch can match fully-supervised performance using 10 to 100 times fewer labels. When it fails, it fails for subtle reasons — confirmation bias, distribution shift, class imbalance — that we’ll explore in detail.

Important Disambiguation: This post is about semi-supervised learning. It is not about self-supervised learning, even though both are sometimes abbreviated “SSL.” These are different paradigms solving different problems. If you’re looking for the self-supervised post (pretext tasks, contrastive learning, masked image modeling), see our dedicated guide to self-supervised learning. We’ll explain the distinction in detail in the next section — it matters a lot.

By the end of this article you’ll understand the full arc: why SSL works in theory, how the classical methods from the 1960s evolved into today’s state-of-the-art, how FixMatch became the new default, and how to implement it from scratch in PyTorch. You’ll also know when not to use SSL — because applying it blindly to a dataset with domain shift between your labeled and unlabeled splits will quietly destroy your accuracy.

What Semi-Supervised Learning Is (and Isn’t)

The formal definition is simple. In semi-supervised learning you have two datasets:

  • A labeled set D_L = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, typically small.
  • An unlabeled set D_U = {x_{n+1}, x_{n+2}, …, x_{n+m}}, typically large — often m is 10 to 1000 times larger than n.

The labels come from the same target task you care about (say, “cat” or “dog” or “pneumonia”). The unlabeled data comes from roughly the same distribution as the labeled data but lacks annotations. Your job is to train a model that performs well on that target task — and the hope is that the unlabeled data, used cleverly, improves performance beyond what the labeled data alone would allow.

It sits on a spectrum of supervision:

  • Fully supervised: every example has a label. The default. Expensive.
  • Semi-supervised: some examples labeled, most not. Solves the downstream task directly.
  • Self-supervised: no human labels at all. Invents labels from data structure (predict masked pixels, predict next token, match augmented views). Usually produces a backbone that’s then fine-tuned.
  • Unsupervised: no labels, no downstream task — just clustering, density estimation, dimensionality reduction.
  • Weakly supervised: labels exist but are noisy, imprecise, or indirect (e.g., image-level labels used for segmentation).

[Figure: The supervision spectrum, from 100% labels to 0% labels — supervised (all examples labeled), semi-supervised (few labeled + many unlabeled), self-supervised (no labels; invents pretext tasks), unsupervised (no labels, no task). Inset: a semi-supervised data mixture with n = 1,000 labeled (green) and m = 100,000 unlabeled (grey) examples; the goal is to jointly train a model using both sets for the downstream task.]

Semi-Supervised vs Self-Supervised: The Critical Distinction

These two paradigms get conflated constantly, partly because of the shared “SSL” abbreviation and partly because both involve using unlabeled data. They are genuinely different. Getting this straight will save you hours of confusion.

Self-supervised learning uses zero human-provided labels at training time. It invents labels from the structure of the data itself. You mask 15% of tokens in a sentence and predict them (BERT). You crop two patches of an image and ask the network to tell which pair came from the same image (contrastive). You predict whether a rotated image was rotated 0°, 90°, 180°, or 270°. The “label” is automatic. The output of self-supervised learning is usually not a task-solving model — it’s a pretrained backbone that you then fine-tune on some downstream task with labels.

Semi-supervised learning uses some human-provided labels plus unlabeled data. The labels correspond directly to your downstream task (“cat” vs “dog,” “malignant” vs “benign,” “spam” vs “ham”). The output is a model that solves that task. There is no pretext task. The unlabeled data is used to enforce consistency, propagate labels, or minimize entropy — but the objective is always tied back to the labeled task.

| Aspect | Semi-Supervised | Self-Supervised |
| --- | --- | --- |
| Goal | Solve downstream task directly | Learn general representations (pretraining) |
| Human labels used | Yes, a small number | None during pretraining |
| Label source | Humans (partial coverage) | Invented from data (masking, pairs, rotations) |
| Typical methods | FixMatch, Mean Teacher, MixMatch, pseudo-labeling | MAE, SimCLR, MoCo, DINO, BERT, GPT |
| Output artifact | Task-ready classifier/regressor | Frozen backbone to be fine-tuned later |
| When to use | You have some labels and can’t afford more | You have massive unlabeled corpora and want reusable features |
| Example | 250 labeled CIFAR-10 + 50k unlabeled → 94% accuracy | Pretrain on 1B images → fine-tune on ImageNet |

[Figure: Semi-supervised vs self-supervised pipelines. Semi-supervised: a small labeled set (e.g. 1,000 images + labels) and a large unlabeled set (e.g. 100,000 images) feed one joint training run (supervised loss + consistency loss) that directly produces a downstream classifier predicting cat/dog, tumor/benign, etc. — one pipeline, one model, solving the task directly. Self-supervised: unlabeled data only (no human labels at all) feeds a pretext task (mask, contrast, rotate, predict next token) that yields a pretrained backbone — a generic feature extractor fine-tuned later on a labeled downstream task.]

A useful slogan: self-supervised learning produces backbones; semi-supervised learning produces task solvers. You can combine them — pretrain with self-supervision, then fine-tune with semi-supervised learning — and in practice this is how state-of-the-art pipelines work today. For the self-supervised half of that combination, our self-supervised learning guide walks through masked image modeling, contrastive learning, and the DINO family in depth.

The Four Assumptions That Make SSL Work

Semi-supervised learning cannot succeed unconditionally. If the unlabeled data were unrelated to the labeled data, no amount of cleverness would help. SSL relies on structural assumptions about how inputs and labels relate. Four assumptions are most commonly cited:

  • Smoothness: if two points are close in input space, their labels should be similar. This is what enables consistency regularization — perturb the input slightly, and the prediction shouldn’t change.
  • Cluster assumption: data naturally forms clusters, and points in the same cluster share labels. Decision boundaries should run between clusters, not through them.
  • Low-density separation: the optimal decision boundary lies in a low-density region of the input space. This is the cluster assumption restated in terms of density — semi-supervised SVMs (S³VM) directly encode it.
  • Manifold assumption: high-dimensional data actually lies on a lower-dimensional manifold, and the relevant variation for labels happens along the manifold. Graph-based methods exploit this by defining similarity along the data manifold.
Key Takeaway: When SSL “works magically,” it’s because one or more of these assumptions are approximately true for your data. When SSL fails silently, it’s usually because the unlabeled data violates the cluster or manifold assumption — for example, your unlabeled set contains classes that don’t exist in your labeled set, or a different sensor/population.

Classical Semi-Supervised Methods

Before deep learning, researchers developed a rich set of semi-supervised algorithms. Many are still useful, and their ideas recur in modern deep methods.

Self-Training (Pseudo-Labeling)

The oldest idea, going back to Scudder in 1965 and popularized for deep learning by Dong-Hyun Lee in 2013. The recipe is embarrassingly simple:

  1. Train a model on the labeled set.
  2. Predict labels for the unlabeled set.
  3. Keep the predictions where the model is very confident (softmax > threshold).
  4. Add those pseudo-labeled examples to the training set.
  5. Retrain. Optionally iterate.

The danger is confirmation bias: if the model’s initial predictions are biased, retraining on those biased predictions reinforces the bias. Pseudo-labeling alone is rarely state-of-the-art, but it’s the backbone of every modern method (including FixMatch).

Co-Training

Blum and Mitchell (1998) proposed training two classifiers on two different “views” of the input — say, the URL of a web page and the text on it. Each classifier labels the unlabeled examples on which it is most confident; those pseudo-labels are used to train the other classifier. The assumption is that the two views are conditionally independent given the label. When that holds, co-training can dramatically reduce the number of labels needed.

Label Propagation

Build a k-nearest-neighbor graph over all examples (labeled and unlabeled). Let labels “flow” through the graph, where each node’s label becomes a weighted average of its neighbors’. Iterate until convergence. Labeled nodes stay pinned to their true labels; unlabeled nodes absorb labels from their neighborhood. This is a direct implementation of the manifold assumption and pairs naturally with graph neural networks — see our graph attention networks (GAT) guide for the modern deep counterpart.

Transductive SVM (S³VM)

A standard SVM finds the maximum-margin hyperplane separating labeled points. A transductive SVM considers both labeled and unlabeled points, and seeks a hyperplane that (i) separates labels correctly and (ii) passes through a low-density region of the unlabeled data. The optimization is non-convex and tricky, but the idea — decision boundaries should avoid data-dense regions — is central.

Generative Methods

Fit a generative model (a Gaussian mixture, a naive Bayes, a variational autoencoder) jointly on labeled and unlabeled data. Use EM-style updates where unlabeled examples are treated as having latent class labels. Provided the generative model is well-specified, unlabeled data tightens your parameter estimates and improves the classifier. Misspecify the model — for example, your data isn’t actually Gaussian — and unlabeled data can hurt.

Entropy Minimization

Grandvalet and Bengio (2005) observed that if the cluster assumption holds, the model should make confident predictions on unlabeled data. So add a term to the loss that minimizes the entropy of predictions on unlabeled inputs:

L_total = L_supervised + lambda * H(p_model(y | x_unlabeled))

This nudges the model away from decision boundaries running through unlabeled data. Entropy minimization is a small building block of nearly every modern method — FixMatch implements it indirectly through confidence thresholding and pseudo-labeling.
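In PyTorch the entropy term is a few lines; a sketch (the surrounding training-step variables `model`, `x_l`, `y_l`, `x_u`, `lam` are assumed context, not defined here):

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the predicted class distributions."""
    p = F.softmax(logits, dim=-1)
    log_p = F.log_softmax(logits, dim=-1)   # numerically stable log-probs
    return -(p * log_p).sum(dim=-1).mean()

# Inside a training step this becomes:
#   loss = F.cross_entropy(model(x_l), y_l) + lam * entropy_loss(model(x_u))

# Sanity check: confident predictions have low entropy, uniform ones high
confident = torch.tensor([[10.0, 0.0, 0.0]])
uniform = torch.tensor([[1.0, 1.0, 1.0]])
assert entropy_loss(confident) < entropy_loss(uniform)
```

Minimizing this term pushes unlabeled predictions toward one-hot, which is exactly the "decision boundary should avoid dense regions" behavior the cluster assumption calls for.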

The Deep Learning Era of SSL

Deep networks changed the game for SSL in two ways. First, they made representation learning on unlabeled data actually useful (shallow models can’t benefit much from unlabeled data once the feature space is fixed). Second, they made consistency regularization — a powerful new tool — practical.

Consistency Regularization

The core idea: predictions should be invariant to small perturbations of the input. If you flip an image horizontally, crop it, add a tiny bit of noise, or run the model with different dropout masks, the output probability distribution should hardly change. We can enforce that directly in the loss, and crucially we can do it on unlabeled examples — because the constraint “prediction should be stable under noise” doesn’t require a label.

Π-model (Laine and Aila, 2017). For each unlabeled example, run two forward passes with different stochastic augmentations/dropout. Minimize the squared difference between the two softmax outputs. Combined with the standard cross-entropy on the labeled data, this is a complete SSL algorithm.

Temporal Ensembling. The Π-model’s two predictions are noisy. Temporal Ensembling replaces one of them with an exponential moving average of predictions across epochs — a smoother, more stable target. The downside is memory: you have to store running predictions for every unlabeled example.

Mean Teacher (Tarvainen and Valpola, 2017). Instead of averaging predictions over time, average model weights over time. You maintain two networks: a “student” trained via SGD, and a “teacher” whose weights are an EMA of the student’s weights. The teacher produces the target for the consistency loss. Mean Teacher is more stable and more memory-efficient than Temporal Ensembling, and it’s still an excellent baseline, especially for regression and segmentation tasks.
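The EMA weight update at the heart of Mean Teacher is only a few lines of PyTorch. A sketch — the tiny `nn.Linear` stands in for a real backbone, and the 0.999 decay is the conventional default:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, decay: float = 0.999):
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
    for t, s in zip(teacher.buffers(), student.buffers()):
        t.copy_(s)                     # copy BatchNorm running stats as-is

student = nn.Linear(4, 2)              # stand-in for the real backbone
teacher = copy.deepcopy(student)       # teacher starts as an exact copy
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher is never trained by SGD

# ... then, after every optimizer step on the student:
ema_update(teacher, student)
```

The teacher's predictions on unlabeled inputs serve as the consistency target for the student; only the student ever sees a gradient.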

Pseudo-Labeling, Revisited

Noisy Student (Xie et al., 2020). This was the method that put pseudo-labeling back on the state-of-the-art map. The recipe: train a teacher on labeled ImageNet. Use it to pseudo-label 300 million unlabeled images from JFT. Train a larger student on the combined set, with heavy noise (RandAugment, dropout, stochastic depth). The noisy student generalizes better than its teacher. Iterate — today’s student becomes tomorrow’s teacher. Noisy Student pushed ImageNet accuracy beyond what fully supervised models had achieved.

Hybrid Methods

MixMatch (Berthelot et al., 2019). Combine (a) K augmented predictions averaged and sharpened into a soft pseudo-label, (b) MixUp between labeled and unlabeled batches, and (c) consistency. Very strong at the time of publication.

ReMixMatch. Adds distribution alignment (unlabeled pseudo-label distribution should match labeled class distribution) and augmentation anchoring (anchor predictions from weakly-augmented copies, not averages).

FixMatch (Sohn et al., 2020). The current default. Strips away most of MixMatch’s complexity and keeps only what works: weak augmentation for pseudo-labels, strong augmentation for the consistency target, and a confidence threshold. We’ll implement it from scratch later.

FlexMatch. Replaces FixMatch’s single global threshold with per-class dynamic thresholds that reflect each class’s learning difficulty. Helps on imbalanced or curriculum-style problems.
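A simplified sketch of the per-class threshold idea (this is a rough paraphrase: it omits FlexMatch's warmup handling and its optional non-linear mapping, and the counting scheme is condensed):

```python
import torch

def flexmatch_thresholds(confident_counts: torch.Tensor,
                         tau: float = 0.95) -> torch.Tensor:
    """Per-class confidence thresholds scaled by estimated learning status.

    confident_counts[c] counts how many unlabeled examples have so far
    received a confident pseudo-label of class c. The best-learned class
    keeps the full threshold tau; lagging classes get a lower bar so
    they are not starved of pseudo-labels.
    """
    counts = confident_counts.float()
    beta = counts / counts.max().clamp(min=1.0)   # normalized learning status
    return tau * beta

counts = torch.tensor([400, 380, 120, 40])        # class 3 is lagging
print(flexmatch_thresholds(counts))
# class 0 keeps the full tau = 0.95; class 3's threshold drops
```

This is what "dynamic thresholds that reflect each class's learning difficulty" means mechanically: the confidence gate itself becomes class-dependent and evolves during training.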

Graph-Based Deep SSL

When your data naturally lives on a graph — citation networks, molecular graphs, social networks — semi-supervised node classification with a Graph Convolutional Network or Graph Attention Network is the canonical approach. You have a handful of labeled nodes and millions of unlabeled ones; information flows through edges. The GAT architecture is essentially learned label propagation with attention-weighted edges.

Deep Dive: How FixMatch Actually Works

FixMatch deserves a close look. It’s surprisingly simple, remarkably effective, and a useful mental model for what “modern SSL” means.

The Idea in One Sentence

For every unlabeled example, if the model is confidently predicting the same class from a weakly augmented version of the image, then force the model to predict that class from a strongly augmented version of the same image.

Ingredients

  • A backbone network f (ResNet, WideResNet, etc.) with a classification head.
  • A weak augmentation α: typically random horizontal flip and random crop.
  • A strong augmentation A: RandAugment or CTAugment (color, rotation, shear, contrast), followed by Cutout.
  • A labeled batch of size B and an unlabeled batch of size μB (usually μ = 7, so 7× more unlabeled per step).
  • A confidence threshold τ, commonly 0.95.
  • A loss weight λ for the unsupervised term, commonly 1.0.

The Loss

On each training step, compute two losses:

Supervised loss on the labeled batch:

L_s = (1/B) * sum over labeled examples of CE(y_b, f(alpha(x_b)))

Unsupervised loss on the unlabeled batch:

# For each unlabeled example x_u:
q_u    = softmax(f(alpha(x_u)))        # weak-aug prediction
p_hat  = argmax(q_u)                   # pseudo-label
mask   = 1 if max(q_u) >= tau else 0   # confidence gate
L_u   += mask * CE(p_hat, f(A(x_u)))   # strong-aug prediction vs pseudo-label

The total loss is L = L_s + λ · L_u.

Two subtleties that matter in practice:

  1. The weak-aug forward pass is done with torch.no_grad() or gradients are stopped on q_u. You do not backpropagate through the pseudo-label target.
  2. The confidence mask is element-wise. Early in training most unlabeled examples are ignored (they’re below threshold); as the model improves, more examples get pseudo-labels. This is natural curriculum learning.

[Figure: One FixMatch training step. Top path: labeled image x_b → weak aug alpha (flip + crop) → shared model f → L_s = CE(y_b, f(alpha(x_b))) (supervised cross-entropy). Bottom path: unlabeled image x_u → weak aug alpha → model f with no grad → softmax output q_u → pseudo-label p_hat = argmax(q_u) and mask = 1 if max(q_u) >= tau (0.95); in parallel, x_u → strong aug A (RandAugment + Cutout) → model f → prediction f(A(x_u)); L_u = mask * CE(p_hat, f(A(x_u))) (consistency cross-entropy). Total loss: L = L_s + lambda * L_u, with lambda typically 1.0 and tau typically 0.95. The backbone is shared; gradients flow from both losses; the pseudo-label target carries no gradient. Early in training most unlabeled examples fall below threshold and are ignored — a natural curriculum.]

Full PyTorch Implementation of FixMatch

Here is a complete, runnable FixMatch implementation on CIFAR-10. It uses a simple WideResNet-style backbone and follows the original paper’s recipe closely enough to hit ~90%+ accuracy with 250 labels given sufficient training (the paper reports 94.93%). For illustration we’ll keep the training loop short; extend the number of epochs and iterations for full results.

Tip: FixMatch needs many iterations — the original paper trains for 1,048,576 steps (2^20). You won’t see the magic in 10 epochs. Plan compute accordingly, or use a faster dataset like MNIST to prototype.
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import RandAugment

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---------- 1. Dataset split: labeled + unlabeled ----------

def split_labeled_unlabeled(dataset, n_labeled_per_class=25, n_classes=10):
    """Create a small labeled subset and treat the rest as unlabeled."""
    labels = np.array(dataset.targets)
    labeled_idx, unlabeled_idx = [], []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        np.random.shuffle(idx)
        labeled_idx.extend(idx[:n_labeled_per_class])
        unlabeled_idx.extend(idx[n_labeled_per_class:])
    return labeled_idx, unlabeled_idx

# ---------- 2. Weak and strong augmentation ----------

CIFAR_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_STD  = (0.2470, 0.2435, 0.2616)

class WeakAug:
    def __init__(self):
        self.t = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
            transforms.ToTensor(),
            transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])
    def __call__(self, x): return self.t(x)

class StrongAug:
    """Weak flip/crop + RandAugment + Cutout."""
    def __init__(self):
        self.base = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
            RandAugment(num_ops=2, magnitude=10),
            transforms.ToTensor(),
            transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])
    def __call__(self, x):
        img = self.base(x)
        # Cutout: random 16x16 zero patch
        _, H, W = img.shape
        y, x_ = np.random.randint(H), np.random.randint(W)
        y1, y2 = max(0, y-8), min(H, y+8)
        x1, x2 = max(0, x_-8), min(W, x_+8)
        img[:, y1:y2, x1:x2] = 0
        return img

class LabeledDataset(Dataset):
    def __init__(self, base, idx):
        self.base, self.idx, self.aug = base, idx, WeakAug()
    def __len__(self): return len(self.idx)
    def __getitem__(self, i):
        img, y = self.base[self.idx[i]]
        return self.aug(img), y

class UnlabeledDataset(Dataset):
    """Returns (weak_aug, strong_aug) pair."""
    def __init__(self, base, idx):
        self.base, self.idx = base, idx
        self.weak, self.strong = WeakAug(), StrongAug()
    def __len__(self): return len(self.idx)
    def __getitem__(self, i):
        img, _ = self.base[self.idx[i]]
        return self.weak(img), self.strong(img)

# ---------- 3. Simple WideResNet-ish backbone ----------

class BasicBlock(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(cin)
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.shortcut = (nn.Conv2d(cin, cout, 1, stride, bias=False)
                         if stride != 1 or cin != cout else nn.Identity())
    def forward(self, x):
        h = self.conv1(F.relu(self.bn1(x)))
        h = self.conv2(F.relu(self.bn2(h)))
        return h + self.shortcut(x)

class WideResNet(nn.Module):
    def __init__(self, num_classes=10, widen=2):
        super().__init__()
        n = 16
        self.stem = nn.Conv2d(3, n, 3, 1, 1, bias=False)
        widths = [n, n*widen, n*2*widen, n*4*widen]
        layers = []
        for i in range(3):
            stride = 1 if i == 0 else 2
            layers.append(BasicBlock(widths[i], widths[i+1], stride))
            layers.append(BasicBlock(widths[i+1], widths[i+1], 1))
        self.blocks = nn.Sequential(*layers)
        self.bn = nn.BatchNorm2d(widths[-1])
        self.fc = nn.Linear(widths[-1], num_classes)
    def forward(self, x):
        h = self.blocks(self.stem(x))
        h = F.relu(self.bn(h))
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)
        return self.fc(h)

# ---------- 4. Data pipeline ----------

raw = datasets.CIFAR10("./data", train=True, download=True)
test = datasets.CIFAR10("./data", train=False, download=True,
                        transform=transforms.Compose([
                            transforms.ToTensor(),
                            transforms.Normalize(CIFAR_MEAN, CIFAR_STD)]))

lab_idx, unlab_idx = split_labeled_unlabeled(raw, n_labeled_per_class=25)
lab_ds   = LabeledDataset(raw, lab_idx)           # 250 images
unlab_ds = UnlabeledDataset(raw, unlab_idx)       # ~49,750 images

B, mu = 64, 7
lab_loader   = DataLoader(lab_ds,   batch_size=B,    shuffle=True,
                          num_workers=2, drop_last=True)
unlab_loader = DataLoader(unlab_ds, batch_size=B*mu, shuffle=True,
                          num_workers=2, drop_last=True)
test_loader  = DataLoader(test, batch_size=256, num_workers=2)

# ---------- 5. FixMatch training loop ----------

model = WideResNet(num_classes=10, widen=2).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.03,
                      momentum=0.9, nesterov=True, weight_decay=5e-4)
tau, lam = 0.95, 1.0

def infinite(loader):
    while True:
        for batch in loader:
            yield batch

lab_iter   = infinite(lab_loader)
unlab_iter = infinite(unlab_loader)

for step in range(5000):         # paper uses 2**20; 5k is illustrative
    model.train()
    x_l, y_l        = next(lab_iter)
    x_u_w, x_u_s    = next(unlab_iter)
    x_l, y_l        = x_l.to(device), y_l.to(device)
    x_u_w, x_u_s    = x_u_w.to(device), x_u_s.to(device)

    # One concatenated forward pass for speed (interleaved BN trick):
    x = torch.cat([x_l, x_u_w, x_u_s], dim=0)
    logits = model(x)
    l_logits = logits[:B]
    u_w_logits, u_s_logits = logits[B:].chunk(2)

    # Supervised loss
    loss_s = F.cross_entropy(l_logits, y_l)

    # Pseudo-label from weak aug (no grad through target)
    with torch.no_grad():
        probs_w = F.softmax(u_w_logits, dim=-1)
        max_probs, pseudo = probs_w.max(dim=-1)
        mask = (max_probs >= tau).float()

    # Unsupervised loss on strong aug
    loss_u = (F.cross_entropy(u_s_logits, pseudo, reduction="none") * mask).mean()

    loss = loss_s + lam * loss_u
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 500 == 0:
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for xb, yb in test_loader:
                xb, yb = xb.to(device), yb.to(device)
                pred = model(xb).argmax(-1)
                correct += (pred == yb).sum().item()
                total   += yb.size(0)
        print(f"step {step:5d}  loss_s={loss_s.item():.3f}  "
              f"loss_u={loss_u.item():.3f}  mask_used={mask.mean().item():.2f}  "
              f"test_acc={100*correct/total:.2f}%")

A few notes on what you will observe when you run this:

  • For the first few hundred steps, mask_used stays near zero — the model isn’t confident on anything yet, so the unsupervised term contributes nothing. This is fine; the supervised loss is doing the work.
  • Somewhere between step 1k and 3k, mask_used starts climbing into the 0.2–0.6 range, and test accuracy jumps noticeably. This is FixMatch “kicking in.”
  • The 5,000-step budget here is an order of magnitude short of the paper. To reproduce their 94.93% on CIFAR-10 with 250 labels you need to train much longer and use a cosine learning-rate schedule plus EMA weights at evaluation time.

A realistic labeled-only baseline (same backbone, same 250 labels, no unlabeled data, just heavy augmentation) will land somewhere around 50–60% test accuracy. FixMatch approaches 95%. That 30+ point gap — from the same 250 labels — is the whole story of modern semi-supervised learning.

Real-World Applications Across Domains

Semi-supervised learning earns its keep wherever the labeled/unlabeled data ratio is extreme and the cost of labeling is high.

| Domain | Why SSL fits | Typical setup |
| --- | --- | --- |
| Medical imaging | Radiologist time is expensive; raw DICOMs accumulate | 5k labeled scans + 500k unlabeled; FixMatch or Mean Teacher |
| Manufacturing QA | Defects are rare; passing parts flood the line | Few labeled defects, many unlabeled parts; SSL + one-class anomaly models |
| NLP (sentiment, NER) | Labeled corpora small; web text infinite | Backtranslation or UDA on top of a pretrained transformer |
| Autonomous driving | Segmentation labels cost minutes/frame; fleet logs petabytes | Mean Teacher for segmentation; auto-labeling pipelines |
| Fraud detection | Confirmed frauds are rare; transactions are billions | Graph SSL + entropy minimization + active learning loop |
| Speech recognition | Transcribed audio scarce; raw audio abundant | wav2vec 2.0 pretrain + semi-supervised fine-tuning |
| Industrial anomaly detection | Very few examples of failure; many normal runs | Deep SAD (semi-supervised variant of Deep SVDD) |

The manufacturing and anomaly-detection cases deserve a special note: there is a semi-supervised variant of one-class classification called Deep SAD that builds directly on the Deep SVDD framework. It leverages the few labeled abnormal examples to tighten the hypersphere around normal data. If you’re doing anomaly detection with even a handful of confirmed anomalies, Deep SAD typically beats pure Deep SVDD.

Paradigm Comparison: SSL, Self-SSL, Transfer, Active

When a stakeholder asks “what approach should we use?” they often mean “can we avoid labeling more data?” Several paradigms answer that question in different ways.

| Paradigm | Data | Labeling cost | Typical performance | When to use |
| --- | --- | --- | --- | --- |
| Fully supervised | All labeled | High | Baseline | Labels are cheap or already exist |
| Semi-supervised | Few labeled + many unlabeled | Low | Matches supervised at 1–10% labels | Labels scarce, unlabeled data plentiful, distributions match |
| Self-supervised | Unlabeled only (pretrain) | None for pretraining | Great when scaled to huge data | You need reusable backbones; massive unlabeled corpus |
| Transfer learning | Pretrained weights + small labeled set | Low | Strong and fast | A suitable pretrained model exists in your modality |
| Active learning | Iteratively label the most informative examples | Medium | Maximizes label ROI | Labeling is possible but slow/expensive; you want to budget it |
| Domain adaptation | Labeled source + unlabeled target | Medium | Bridges distribution shift | Your deployment data differs from your labeled data |

These paradigms combine freely. A strong 2026 pipeline might: (1) pretrain a backbone with self-supervised learning, (2) fine-tune with semi-supervised learning on the actual task, (3) apply DANN-style domain adaptation when deploying to a new facility, and (4) use active learning to prioritize which stubborn examples to send back to human annotators.

Method Comparison Within SSL

| Method | Complexity | Typical CIFAR-10 (250 labels) | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Pseudo-labeling | Very low | ~60–70% | Trivial to implement | Confirmation bias, error amplification |
| Mean Teacher | Medium | ~80% | Stable; good for regression/segmentation | Weaker on classification vs FixMatch |
| MixMatch | High | ~88% | Strong with limited tricks | Many moving parts; sensitive to sharpening temperature |
| FixMatch | Medium | ~95% | Simple, state-of-the-art, broadly applicable | Global threshold can stall on hard classes |
| FlexMatch | Medium-high | ~95.5% | Per-class dynamic thresholds; handles curriculum | More hyperparameters |

Practical Guide: Thresholds, Data Ratios, Pitfalls

How Much Labeled Data Do You Need?

Empirically, SSL gains are largest when you have very few labels (say, 4–40 per class) and shrink as you approach thousands per class. Above roughly 10% of your dataset labeled, the performance of FixMatch and friends tends to converge to that of the fully supervised baseline. That doesn’t mean SSL is useless above 10% — it means the marginal win of SSL over “just label a few more” gets smaller. The sweet spot is genuinely label-starved regimes.

Key Takeaway: The classic SSL gain curve: huge wins with tiny labeled fractions (1–5%), steadily diminishing through 10%, marginal by 20%. Design your labeling budget accordingly.

Choosing a Method

  • Standard image classification? Start with FixMatch. It’s a strong default with minimal hyperparameter drama.
  • Regression or segmentation? Mean Teacher adapts more naturally — the consistency target can be a continuous prediction or pixel map, not just a class.
  • Imbalanced classes or class-dependent difficulty? FlexMatch’s dynamic thresholds prevent the majority classes from eating all the pseudo-labels.
  • Graph-structured data? Use GCN or GAT directly — they are natively semi-supervised.

Hyperparameter Tips

  • Confidence threshold τ: 0.95 is the FixMatch default. Lower it (0.7–0.8) if mask_used stays near zero for too long; raise it if pseudo-labels look noisy.
  • Unsupervised weight λ: 1.0 usually works. If the supervised loss is unstable early, ramp λ from 0 to 1 over the first few epochs.
  • EMA decay (Mean Teacher): 0.999 is standard. Too low and the teacher tracks the student noisily; too high and it stops learning.
  • Batch size ratio μ: FixMatch uses μ = 7 (7× more unlabeled per labeled). The unlabeled batch needs to be big enough that confidence-gated pseudo-labels aren’t all the same class.
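The λ ramp mentioned above is often implemented with the sigmoid-shaped schedule popularized by the Temporal Ensembling paper; a sketch (the 5,000-step ramp length is illustrative — tune it to your training budget):

```python
import math

def lambda_rampup(step: int, rampup_steps: int = 5000,
                  lam_max: float = 1.0) -> float:
    """Sigmoid-shaped ramp exp(-5 * (1 - t)^2) for the unsupervised weight."""
    if step >= rampup_steps:
        return lam_max
    t = step / rampup_steps
    return lam_max * math.exp(-5.0 * (1.0 - t) ** 2)

# Near-zero at the start, lam_max once the ramp finishes:
print(lambda_rampup(0), lambda_rampup(2500), lambda_rampup(5000))
```

Multiply `loss_u` by `lambda_rampup(step)` instead of a fixed `lam` so the consistency term stays quiet while the supervised loss finds its footing.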

Common Pitfalls

  • Confirmation bias: the model pseudo-labels unlabeled data confidently but incorrectly, then trains on those wrong labels. Strong augmentation and confidence thresholding mitigate this.
  • Class imbalance: if your labeled set is 90% class A, pseudo-labels will skew toward class A on unlabeled data, reinforcing the imbalance. FlexMatch and distribution alignment (ReMixMatch) fight this.
  • Distribution shift: if labeled data is from Hospital A and unlabeled from Hospital B, SSL can hurt. You need domain adaptation, not SSL, or both.
  • Open-set contamination: the unlabeled set contains classes that aren’t in the labeled set. Pseudo-labeling forces them into known classes, poisoning the model.
  • Too few iterations: FixMatch needs long training to let mask_used climb. Don’t judge after one epoch.
Caution: If your labeled set and unlabeled set come from different distributions — different hospitals, sensors, geographies, time periods — semi-supervised learning can actively hurt performance. Always measure SSL vs supervised baseline on a held-out set that reflects deployment conditions.
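A cheap diagnostic for the imbalance and confirmation-bias pitfalls above is to compare, each epoch, the class distribution of accepted pseudo-labels against the labeled set. A sketch (the helper name and interface are illustrative):

```python
from collections import Counter

def pseudo_label_skew(pseudo_labels, labeled_labels, num_classes):
    """Per-class ratio of pseudo-label share to labeled-set share.

    Ratios drifting far above 1 for a majority class are a red flag that
    pseudo-labels are reinforcing the labeled set's imbalance.
    """
    def dist(labels):
        counts = Counter(labels)
        total = max(len(labels), 1)
        return [counts.get(c, 0) / total for c in range(num_classes)]

    p, q = dist(pseudo_labels), dist(labeled_labels)
    # None marks classes absent from the labeled set (an open-set warning sign)
    return [pi / qi if qi > 0 else None for pi, qi in zip(p, q)]
```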

Tools and Libraries

  • USB (Unified Semi-supervised learning Benchmark): PyTorch framework with 15+ SSL algorithms and a common evaluation harness.
  • TorchSSL: curated implementations of the classic SSL algorithms for image classification.
  • MMClassification / MMSegmentation: OpenMMLab tools with SSL support for image classification and segmentation.
  • Google’s official FixMatch repo: the paper authors’ reference TensorFlow implementation.

Connections to Transfer, Active, and Domain Adaptation

Semi-supervised learning is most powerful when you stop thinking of it as a standalone technique and start combining it with its cousins.

Semi-Supervised + Transfer Learning

Start with a pretrained backbone (ImageNet, CLIP, wav2vec). Fine-tune it using FixMatch with your small labeled set plus the unlabeled data. This combination routinely beats either alone. The pretrained features give you a head start on representation; SSL lets you adapt to the specific label structure. Our transfer learning guide shows a concrete version of this pipeline for a cobot anomaly-detection project.

Semi-Supervised + Active Learning

Active learning picks which unlabeled examples are most worth labeling next. SSL uses the unlabeled examples without labeling them. Together, the flow is: train with SSL → identify examples where the model is least confident or where the SSL pseudo-label flipped across epochs → send those to a human annotator → return them as labeled data → repeat. This is how most production labeling pipelines actually work.
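The “least confident” selection step in that loop can be as simple as a least-confidence acquisition function (the name and interface here are hypothetical):

```python
def select_for_labeling(probs, k):
    """Least-confidence acquisition: return the indices of the k unlabeled
    examples whose top predicted probability is lowest.

    probs: one per-class probability list per unlabeled example.
    """
    ranked = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return ranked[:k]  # indices to send to a human annotator
```

The returned indices go to annotators; the freshly labeled examples rejoin the labeled pool, and the SSL training round repeats.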

Semi-Supervised + Domain Adaptation

If your labeled data (source domain) and unlabeled data (target domain) come from different distributions, plain SSL will fail. Domain-adversarial training (DANN) or maximum-mean-discrepancy methods align the feature distributions, and once aligned, SSL can do its job. This is effectively how many medical AI systems generalize across hospitals.
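The core DANN mechanism, a gradient reversal layer, fits in a few lines of PyTorch. This is a sketch of the layer only, not the full adversarial training loop:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the
    backward pass. The feature extractor upstream is thereby pushed to
    *confuse* a downstream domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

In a DANN setup, features pass through `grad_reverse` into a small domain classifier; minimizing the domain loss then drives the shared features toward domain invariance, after which SSL can operate on the aligned features.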

Semi-Supervised + Self-Supervised

Don’t choose between them — stack them. Pretrain with self-supervised learning on a massive unlabeled corpus (see our self-supervised learning guide), then fine-tune with FixMatch on your small labeled set plus a focused unlabeled set. This is close to the “modern recipe” used in speech (wav2vec 2.0), vision (MAE + FixMatch fine-tune), and NLP (pretrain + UDA).

Statistical intuition also helps explain why more data tends to help: as unlabeled examples contribute to parameter estimation, the effective sample size grows and the variance of the estimates shrinks roughly as 1/n — the same standard-error behavior that underpins the central limit theorem in parameter estimation.
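That shrinking-variance intuition is easy to check numerically. The toy demo below (uniform draws, nothing SSL-specific) measures the empirical spread of a mean estimate at different sample sizes:

```python
import random
from statistics import mean, pstdev

def stderr_of_mean(n, trials=2000, seed=0):
    """Empirical spread of the sample mean of n uniform(0,1) draws.

    Theory: the standard error is sqrt(1/12) / sqrt(n), so quadrupling n
    should roughly halve the spread.
    """
    rng = random.Random(seed)
    estimates = [mean(rng.random() for _ in range(n)) for _ in range(trials)]
    return pstdev(estimates)
```

Running it at n = 25, 100, 400 shows the spread falling at each step, mirroring how a larger effective sample size tightens parameter estimates.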

Frequently Asked Questions

What’s the difference between semi-supervised and self-supervised learning?

Semi-supervised learning uses some human-labeled data plus unlabeled data to solve a specific downstream task directly. Self-supervised learning uses only unlabeled data and invents its own labels from data structure (masking, contrastive pairs) to produce a reusable pretrained backbone, which is later fine-tuned with labeled data on a downstream task. Semi-supervised is a training strategy for a task; self-supervised is a pretraining strategy for representations.

How many labeled samples do I need for semi-supervised learning?

It depends on the task complexity, but as a rule of thumb, FixMatch-class methods produce huge gains with as few as 4–40 labeled examples per class for image classification. Returns diminish once roughly 10% of your dataset is labeled. For NLP and tabular data the curve is similar but often kicks in with slightly more labels per class due to higher input variability.

When does semi-supervised learning hurt rather than help?

SSL can hurt when (a) the unlabeled data distribution differs materially from the labeled data distribution, (b) the unlabeled set contains novel classes not present in the labeled set, (c) class imbalance in the labeled set biases the pseudo-labels, or (d) the core assumptions (smoothness, cluster, manifold) don’t hold for your data. Always measure the SSL model against a strong supervised baseline on a held-out set that reflects deployment.

FixMatch vs MixMatch — which should I use?

FixMatch is simpler, performs better on most benchmarks, and has fewer hyperparameters. Start there unless you have a specific reason to use MixMatch (e.g., you need MixUp regularization for other reasons). MixMatch’s averaging-and-sharpening is conceptually elegant but its empirical gains have been surpassed by FixMatch’s weak/strong pseudo-label trick.

Can I combine semi-supervised learning with transfer learning?

Yes, and you usually should. Initialize with a pretrained backbone (ImageNet, CLIP, a domain-specific model) and then apply FixMatch or Mean Teacher on top. The pretrained weights give you strong features from the start, which means FixMatch’s mask threshold is reached earlier in training and pseudo-labels are more reliable. This combination is close to the default recipe in modern practice.
