Summary
What this post covers: A detailed examination of semi-supervised learning (SSL), from classical methods through modern consistency-based approaches, with a full PyTorch implementation of FixMatch that enables a model to match supervised accuracy using 10 to 100 times fewer labels.
Key insights:
- Modern SSL methods like FixMatch can match fully-supervised performance with 10x to 100x fewer labels by combining weak augmentation, confidence thresholding (tau = 0.95), and strong-augmentation consistency.
- Semi-supervised learning is not self-supervised learning: SSL uses some task labels plus unlabeled data, while self-supervised invents labels from data structure and produces a pretrained backbone.
- SSL only works when the smoothness, cluster, manifold, or low-density assumption holds; applying it blindly across distribution shift between labeled and unlabeled splits will silently destroy accuracy.
- The confidence-gated pseudo-label is a natural curriculum: early in training most unlabeled examples fall below threshold and are ignored, so the model is not poisoned by its own bad predictions.
- FixMatch’s effectiveness comes mostly from strong augmentation (RandAugment + Cutout) and high confidence thresholds, not from complex architectures, which is why it generalizes across vision, audio, NLP, and medical imaging.
Main topics: The Promise of Learning from Almost-Free Data, What Semi-Supervised Learning Is (and Isn’t), Semi-Supervised vs Self-Supervised: The Critical Distinction, The Four Assumptions That Make SSL Work, Classical Semi-Supervised Methods, The Deep Learning Era of SSL, FixMatch in Detail: How the Method Works, Full PyTorch Implementation of FixMatch, Real-World Applications Across Domains, Paradigm Comparison: SSL, Self-SSL, Transfer, Active, Practical Guide: Thresholds, Data Ratios, Pitfalls, Connections to Transfer, Active, and Domain Adaptation, Frequently Asked Questions, References and Further Reading.
The Promise of Learning from Almost-Free Data
Consider a setting with 1,000 labelled medical images and 100,000 unlabelled ones. Training only on the labelled portion yields 78% accuracy. Adding the unlabelled data through semi-supervised learning raises that figure to 93%, with no additional labels required.
That single observation explains why semi-supervised learning has quietly become one of the most consequential ideas in modern machine learning. Labels are expensive. A radiologist annotating a chest X-ray represents both real cost and real time. A crowd worker labelling toxic comments must read each one carefully. An engineer hand-segmenting pedestrians in a video frame may require ten minutes per frame. The raw data, however, is largely free: unlabelled X-rays accumulate on hospital servers, billions of comments sit on social platforms, and petabytes of driving footage occupy onboard storage.
Semi-supervised learning (SSL) refers to the set of techniques that train models using both kinds of data simultaneously: a small set of labelled examples and a much larger set of unlabelled ones. When SSL succeeds, the gains can be substantial. On standard benchmarks, modern methods such as FixMatch can match fully supervised performance with 10 to 100 times fewer labels. When SSL fails, the causes are typically subtle (confirmation bias, distribution shift, and class imbalance) and are examined in detail below.
By the end of the article, a reader should understand the full arc: why SSL works in theory, how the classical methods of the 1960s evolved into today’s state of the art, how FixMatch became the default, and how to implement it from scratch in PyTorch. The article also identifies cases in which SSL should not be applied, since applying it without consideration of distribution shift between labelled and unlabelled splits can quietly degrade accuracy.
What Semi-Supervised Learning Is (and Isn’t)
The formal definition is straightforward. Semi-supervised learning involves two datasets:
- A labeled set DL = {(x1, y1), (x2, y2),…, (xn, yn)}, typically small.
- An unlabeled set DU = {xn+1, xn+2,…, xn+m}, typically large—often m is 10 to 1000 times larger than n.
The labels correspond to the same target task of interest (for example, “cat” or “dog” or “pneumonia”). The unlabelled data is drawn from approximately the same distribution as the labelled data, but lacks annotations. The objective is to train a model that performs well on that target task, with the expectation that the unlabelled data, used judiciously, improves performance beyond what the labelled data alone would permit.
It sits on a spectrum of supervision:
- Fully supervised: every example has a label. The default. Expensive.
- Semi-supervised: some examples labeled, most not. Solves the downstream task directly.
- Self-supervised: no human labels at all. Invents labels from data structure (predict masked pixels, predict next token, match augmented views). Usually produces a backbone that’s then fine-tuned.
- Unsupervised: no labels, no downstream task, just clustering, density estimation, dimensionality reduction.
- Weakly supervised: labels exist but are noisy, imprecise, or indirect (e.g., image-level labels used for segmentation).
Semi-Supervised vs Self-Supervised: The Critical Distinction
The two paradigms are frequently conflated, partly because of the shared “SSL” abbreviation and partly because both involve unlabelled data. They are nonetheless distinct, and a clear separation prevents considerable downstream confusion.
Self-supervised learning uses no human-provided labels at training time. It generates labels from the structure of the data itself. A common pattern is to mask 15% of tokens in a sentence and predict them (BERT). Another is to take two augmented crops of each image and train the network to recognise which crops came from the same image (contrastive learning). A third is to predict whether a rotated image was rotated 0°, 90°, 180°, or 270°. In each case the “label” is generated automatically. The output of self-supervised learning is typically not a task-solving model but a pretrained backbone that is subsequently fine-tuned on a downstream task with labels.
Semi-supervised learning uses some human-provided labels together with unlabelled data. The labels correspond directly to the downstream task (“cat” versus “dog,” “malignant” versus “benign,” “spam” versus “ham”). The output is a model that solves that task. There is no pretext task. Unlabelled data is used to enforce consistency, propagate labels, or minimise entropy, but the objective is always tied back to the labelled task.
| Aspect | Semi-Supervised | Self-Supervised |
|---|---|---|
| Goal | Solve downstream task directly | Learn general representations (pretraining) |
| Human labels used | Yes, a small number | None during pretraining |
| Label source | Humans (partial coverage) | Invented from data (masking, pairs, rotations) |
| Typical methods | FixMatch, Mean Teacher, MixMatch, pseudo-labeling | MAE, SimCLR, MoCo, DINO, BERT, GPT |
| Output artifact | Task-ready classifier/regressor | Pretrained backbone to be fine-tuned later |
| When to use | You have some labels and can’t afford more | You have substantial unlabeled corpora and want reusable features |
| Example | 250 labeled CIFAR-10 + 50k unlabeled → 94% accuracy | Pretrain on 1B images → fine-tune on ImageNet |
A useful summary: self-supervised learning produces backbones; semi-supervised learning produces task solvers. The two can be combined: pretrain with self-supervision, then fine-tune with semi-supervised learning. In practice, this combination underlies many of the strongest current pipelines. For the self-supervised half of that combination, the self-supervised learning guide covers masked image modelling, contrastive learning, and the DINO family in detail.
The Four Assumptions That Make SSL Work
Semi-supervised learning does not succeed unconditionally. If the unlabelled data were unrelated to the labelled data, no algorithmic refinement would help. SSL relies on structural assumptions about the relationship between inputs and labels. Four assumptions are most commonly cited:
- Smoothness: if two points are close in input space, their labels should be similar. This is what enables consistency regularization—perturb the input slightly, and the prediction shouldn’t change.
- Cluster assumption: data naturally forms clusters, and points in the same cluster share labels. Decision boundaries should run between clusters, not through them.
- Low-density separation: the optimal decision boundary lies in a low-density region of the input space. This is the cluster assumption restated in terms of density; semi-supervised SVMs (S³VM) encode it directly.
- Manifold assumption: high-dimensional data actually lies on a lower-dimensional manifold, and the relevant variation for labels happens along the manifold. Graph-based methods exploit this by defining similarity along the data manifold.
Classical Semi-Supervised Methods
Before deep learning, researchers developed a substantial body of semi-supervised algorithms. Many remain useful, and their ideas recur in modern deep methods.
Self-Training (Pseudo-Labelling)
This is the oldest approach, dating to Scudder in 1965 and popularised for deep learning by Dong-Hyun Lee in 2013. The procedure is simple:
- Train a model on the labeled set.
- Predict labels for the unlabeled set.
- Keep the predictions where the model is very confident (softmax > threshold).
- Add those pseudo-labeled examples to the training set.
- Retrain. Optionally iterate.
The principal risk is confirmation bias: if the model’s initial predictions are biased, retraining on those biased predictions reinforces the bias. Pseudo-labelling alone is rarely the strongest method, but it is a core ingredient of many modern approaches, including FixMatch.
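The self-training steps can be sketched without any deep learning machinery. The example below is a minimal illustration, not a reference implementation: it assumes 1-D inputs and a toy nearest-centroid classifier, and the helper names (`centroids`, `predict_proba`, `self_train`) are hypothetical.

```python
import math

def centroids(points, labels):
    """Mean of each class's 1-D points (the toy 'model')."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def predict_proba(x, cents):
    """Softmax over negative squared distance to each centroid."""
    scores = {c: -(x - m) ** 2 for c, m in cents.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def self_train(xs_l, ys_l, xs_u, threshold=0.95, rounds=5):
    """Iteratively move confidently predicted points into the labelled set."""
    xs_l, ys_l, pool = list(xs_l), list(ys_l), list(xs_u)
    for _ in range(rounds):
        cents = centroids(xs_l, ys_l)          # 1. train on labelled data
        remaining = []
        for x in pool:                          # 2. predict on unlabelled data
            probs = predict_proba(x, cents)
            label = max(probs, key=probs.get)
            if probs[label] > threshold:        # 3. keep confident predictions
                xs_l.append(x)                  # 4. add them as pseudo-labels
                ys_l.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):         # nothing new: stop early
            break
        pool = remaining                        # 5. retrain on the next round
    return centroids(xs_l, ys_l), pool

model, still_unlabeled = self_train(
    xs_l=[0.0, 4.0], ys_l=["a", "b"],
    xs_u=[0.2, 0.5, 1.9, 2.1, 3.6, 3.9],
)
```

In this toy run the two points near the midpoint (1.9 and 2.1) never clear the threshold and stay unlabelled, which is the behaviour the confidence gate is meant to produce.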
Co-Training
Blum and Mitchell (1998) proposed training two classifiers on two different “views” of the input, such as the URL of a web page and the text on the page. Each classifier labels the unlabelled examples on which it is most confident, and those pseudo-labels are used to train the other classifier. The underlying assumption is that the two views are conditionally independent given the label. When this assumption holds, co-training can substantially reduce the number of labels required.
Label Propagation
The procedure constructs a k-nearest-neighbour graph over all examples (labelled and unlabelled). Labels propagate through the graph, with each node’s label becoming a weighted average of its neighbours’ labels. Iteration continues until convergence. Labelled nodes remain pinned to their true labels; unlabelled nodes absorb labels from their neighbourhood. This represents a direct implementation of the manifold assumption and pairs naturally with graph neural networks. See the graph attention networks (GAT) guide for the modern deep counterpart.
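As a rough illustration of the propagation step, the sketch below builds a symmetric k-nearest-neighbour graph over 1-D points and repeatedly averages neighbour scores while clamping the labelled nodes. It uses unweighted edges and a fixed iteration count for brevity; both are simplifying assumptions, and the function names are hypothetical.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour adjacency sets for 1-D points."""
    n = len(points)
    nbrs = [set() for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(points[i] - points[j]))
        for j in others[:k]:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def propagate(points, known, n_classes, k=2, iters=200):
    """known maps node index -> class id; returns one class id per node."""
    nbrs = knn_graph(points, k)
    n = len(points)
    # Start unlabelled nodes at the uniform distribution.
    scores = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, c in known.items():
        scores[i] = [1.0 if j == c else 0.0 for j in range(n_classes)]
    for _ in range(iters):
        new = []
        for i in range(n):
            if i in known:
                new.append(scores[i])  # labelled nodes stay pinned
            else:
                new.append([
                    sum(scores[j][c] for j in nbrs[i]) / len(nbrs[i])
                    for c in range(n_classes)
                ])
        scores = new
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]

points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
labels = propagate(points, known={0: 0, 7: 1}, n_classes=2)
```

With one labelled node per cluster, each label spreads to the rest of its own cluster and does not cross the gap between them.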
Transductive SVM (S³VM)
A standard SVM finds the maximum-margin hyperplane separating labelled points. A transductive SVM considers both labelled and unlabelled points, and seeks a hyperplane that (i) separates labels correctly and (ii) passes through a low-density region of the unlabelled data. The optimisation is non-convex and difficult, but the underlying idea—that decision boundaries should avoid data-dense regions—is central.
Generative Methods
The approach fits a generative model (a Gaussian mixture, a naive Bayes model, a variational autoencoder) jointly on labelled and unlabelled data. EM-style updates treat unlabelled examples as having latent class labels. Provided the generative model is well-specified, unlabelled data tightens parameter estimates and improves the classifier. If the model is misspecified—for example, if the data is not Gaussian—unlabelled data can degrade performance.
Entropy Minimisation
Grandvalet and Bengio (2005) observed that if the cluster assumption holds, the model should make confident predictions on unlabelled data. Their approach adds a term to the loss that minimises the entropy of predictions on unlabelled inputs:
L_total = L_supervised + lambda * H(p_model(y | x_unlabeled))
This term encourages the model to avoid decision boundaries that run through unlabelled data. Entropy minimisation is a small but pervasive component of nearly every modern method. FixMatch implements it indirectly through confidence thresholding and pseudo-labelling.
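To make the entropy term concrete, here is a small dependency-free sketch of the loss above. The weight `lam=0.1` is an arbitrary illustrative value, not one taken from the paper, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in probs)

def total_loss(sup_loss, unlabeled_logits, lam=0.1):
    """L_total = L_supervised + lambda * mean entropy on unlabelled inputs."""
    mean_entropy = sum(entropy(softmax(l)) for l in unlabeled_logits)
    mean_entropy /= len(unlabeled_logits)
    return sup_loss + lam * mean_entropy

confident = entropy(softmax([8.0, 0.0, 0.0]))   # peaked prediction
uncertain = entropy(softmax([0.1, 0.0, 0.0]))   # near-uniform prediction
```

A peaked prediction has lower entropy than a near-uniform one, so minimising this term pushes unlabelled predictions away from the decision boundary.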
The Deep Learning Era of SSL
Deep networks transformed SSL in two principal ways. First, they made representation learning on unlabelled data genuinely useful, whereas shallow models gain little from unlabelled data once the feature space is fixed. Second, they made consistency regularisation, a powerful tool, practical at scale.
Consistency Regularisation
The central idea is that predictions should be invariant to small perturbations of the input. Flipping an image horizontally, cropping it, adding a small amount of noise, or applying different dropout masks should not materially change the output probability distribution. This constraint can be enforced directly in the loss, and importantly it can be applied to unlabelled examples, because stability under noise does not require a label.
Π-model (Laine and Aila, 2017). For each unlabelled example, two forward passes are run with different stochastic augmentations and dropout masks. The squared difference between the two softmax outputs is minimised. Combined with standard cross-entropy on the labelled data, this constitutes a complete SSL algorithm.
Temporal Ensembling. The Π-model’s two predictions are noisy. Temporal Ensembling replaces one of them with an exponential moving average of predictions across epochs, producing a smoother and more stable target. The drawback is memory consumption: running predictions must be stored for every unlabelled example.
Mean Teacher (Tarvainen and Valpola, 2017). Rather than averaging predictions over time, the method averages model weights over time. Two networks are maintained: a “student” trained via SGD, and a “teacher” whose weights are an exponential moving average of the student’s weights. The teacher produces the target for the consistency loss. Mean Teacher is more stable and more memory-efficient than Temporal Ensembling, and it remains an excellent baseline, particularly for regression and segmentation tasks.
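The weight-averaging step and the squared-difference consistency loss are both short enough to sketch directly. The example below represents model weights as a plain dictionary of floats, which is an assumption made only to keep the illustration dependency-free; the function names are hypothetical.

```python
def ema_update(teacher, student, decay=0.999):
    """In place: teacher <- decay * teacher + (1 - decay) * student."""
    for name, w in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * w

def consistency_loss(p_student, p_teacher):
    """Mean squared difference between two probability vectors."""
    diffs = [(a - b) ** 2 for a, b in zip(p_student, p_teacher)]
    return sum(diffs) / len(diffs)

student = {"w": 0.0}
teacher = dict(student)        # teacher starts as a copy of the student
for step in range(1000):
    student["w"] = 1.0         # stand-in for an SGD update of the student
    ema_update(teacher, student, decay=0.99)
```

The teacher trails the student and closes the gap geometrically, so a decay closer to 1 gives a smoother but slower-moving target.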
Pseudo-Labelling, Revisited
Noisy Student (Xie et al., 2020). This method returned pseudo-labelling to the front rank of techniques. The procedure trains a teacher on labelled ImageNet, uses it to pseudo-label 300 million unlabelled images from JFT, and trains a larger student on the combined set under heavy noise (RandAugment, dropout, stochastic depth). The noisy student generalises better than its teacher; iteration follows, with each student becoming the next teacher. Noisy Student raised ImageNet accuracy beyond what fully supervised models had achieved.
Hybrid Methods
MixMatch (Berthelot et al., 2019). Combines (a) K augmented predictions averaged and sharpened into a soft pseudo-label, (b) MixUp between labelled and unlabelled batches, and (c) consistency. The method was strong at the time of publication.
ReMixMatch. Adds distribution alignment (the unlabelled pseudo-label distribution should match the labelled class distribution) and augmentation anchoring (predictions are anchored to weakly-augmented copies, not averages).
FixMatch (Sohn et al., 2020). The current default. It strips away most of MixMatch’s complexity and retains only what works: weak augmentation to generate pseudo-labels, strong augmentation for the inputs that must reproduce them, and a confidence threshold. The method is implemented from scratch later in this article.
FlexMatch. Replaces FixMatch’s single global threshold with per-class dynamic thresholds that reflect each class’s learning difficulty. It is helpful on imbalanced or curriculum-style problems.
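A simplified sketch of per-class thresholds, loosely following the published FlexMatch description: each class's threshold is the global τ scaled by how many confident pseudo-labels that class has received relative to the best-learned class, passed through the mapping x / (2 - x). The warm-up term from the paper is omitted, the fallback when no class has any confident predictions is an assumption of this sketch, and the function name is hypothetical.

```python
def flexible_thresholds(confident_counts, tau=0.95):
    """Per-class thresholds from counts of confident pseudo-labels.

    confident_counts[c] is the number of unlabelled examples currently
    predicted as class c with confidence above the global threshold tau.
    """
    top = max(confident_counts)
    if top == 0:
        # Assumption of this sketch: fall back to the global threshold.
        return [tau] * len(confident_counts)
    thresholds = []
    for count in confident_counts:
        beta = count / top                 # normalised learning status in [0, 1]
        thresholds.append(tau * beta / (2.0 - beta))
    return thresholds

thresholds = flexible_thresholds([100, 50, 0])
```

Classes that the model has not yet learned well receive lower thresholds, so they start contributing pseudo-labels sooner than they would under a single global cut-off.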
Graph-Based Deep SSL
When data naturally lives on a graph—citation networks, molecular graphs, social networks—semi-supervised node classification with a Graph Convolutional Network or Graph Attention Network is the canonical approach. A handful of labelled nodes coexist with millions of unlabelled ones, and information flows along edges. The GAT architecture is, in effect, learned label propagation with attention-weighted edges.
FixMatch in Detail: How the Method Works
FixMatch warrants close examination. The method is simple, highly effective, and offers a useful mental model for what “modern SSL” entails.
The Idea in One Sentence
For every unlabelled example, if the model produces a confident prediction for a particular class from a weakly augmented version of the image, the model is then required to predict that class from a strongly augmented version of the same image.
Ingredients
- A backbone network f (ResNet, WideResNet, etc.) with a classification head.
- A weak augmentation α: typically random horizontal flip and random crop.
- A strong augmentation A: RandAugment or CTAugment (color, rotation, shear, contrast), followed by Cutout.
- A labeled batch of size B and an unlabeled batch of size μB (usually μ = 7, so 7× more unlabeled per step).
- A confidence threshold τ, commonly 0.95.
- A loss weight λ for the unsupervised term, commonly 1.0.
The Loss
On each training step, compute two losses:
Supervised loss on the labeled batch:
L_s = (1/B) * sum over labeled examples of CE(y_b, f(alpha(x_b)))
Unsupervised loss on the unlabeled batch:
# For each unlabeled example x_u:
q_u = softmax(f(alpha(x_u))) # weak-aug prediction
p_hat = argmax(q_u) # pseudo-label
mask = 1 if max(q_u) >= tau else 0 # confidence gate
L_u += mask * CE(p_hat, f(A(x_u))) # strong-aug prediction vs pseudo-label
The total loss is L = L_s + λ · L_u.
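For readers who want to check the arithmetic by hand, the unsupervised term can be written as a small dependency-free function. This is a sketch that operates on plain lists of logits rather than tensors; the function name and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def fixmatch_unsup_loss(weak_logits, strong_logits, tau=0.95):
    """Return (L_u, mask_rate) for one unlabelled batch."""
    total, used = 0.0, 0
    for lw, ls in zip(weak_logits, strong_logits):
        q = softmax(lw)                               # weak-aug prediction
        pseudo = max(range(len(q)), key=q.__getitem__)  # hard pseudo-label
        if q[pseudo] >= tau:                          # confidence gate
            p_strong = softmax(ls)
            total += -math.log(p_strong[pseudo])      # cross-entropy vs pseudo-label
            used += 1
    n = len(weak_logits)
    # Average over the whole batch, including masked-out examples.
    return total / n, used / n

weak = [[6.0, 0.0, 0.0], [0.5, 0.4, 0.0]]     # first is confident, second is not
strong = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
loss_u, mask_rate = fixmatch_unsup_loss(weak, strong)
```

Only the first example clears the threshold here, so half the batch is masked out and the loss is the first example's cross-entropy divided by the batch size of two.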
Two practical subtleties matter:
- The weak-augmentation forward pass uses torch.no_grad(), or gradients are otherwise stopped on q_u. Backpropagation through the pseudo-label target is not permitted.
- The confidence mask is applied element-wise. Early in training, most unlabelled examples fall below the threshold and are ignored. As the model improves, an increasing fraction of examples receive pseudo-labels. This produces a natural curriculum.
Full PyTorch Implementation of FixMatch
The following is a complete FixMatch implementation on CIFAR-10. It uses a small WideResNet-style backbone (considerably shallower than the WRN-28-2 used in the paper) and follows the original recipe closely enough that roughly 90% accuracy or higher with 250 labels is plausible given sufficient training (the paper reports 94.93%). For illustration, the training loop is kept short; reaching full results requires far more training steps.
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import RandAugment

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ---------- 1. Dataset split: labeled + unlabeled ----------
def split_labeled_unlabeled(dataset, n_labeled_per_class=25, n_classes=10):
    """Create a small labeled subset and treat the rest as unlabeled."""
    labels = np.array(dataset.targets)
    labeled_idx, unlabeled_idx = [], []
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        np.random.shuffle(idx)
        labeled_idx.extend(idx[:n_labeled_per_class])
        unlabeled_idx.extend(idx[n_labeled_per_class:])
    return labeled_idx, unlabeled_idx
# ---------- 2. Weak and strong augmentation ----------
CIFAR_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR_STD = (0.2470, 0.2435, 0.2616)

class WeakAug:
    """Random flip + random crop, then normalise."""
    def __init__(self):
        self.t = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
            transforms.ToTensor(),
            transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])

    def __call__(self, x):
        return self.t(x)

class StrongAug:
    """Weak flip/crop + RandAugment + Cutout."""
    def __init__(self):
        self.base = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
            RandAugment(num_ops=2, magnitude=10),
            transforms.ToTensor(),
            transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])

    def __call__(self, x):
        img = self.base(x)
        # Cutout: zero a random patch of up to 16x16 pixels (clipped at the border).
        # torch's RNG is used because it is seeded separately in each DataLoader worker.
        _, H, W = img.shape
        y = int(torch.randint(H, (1,)))
        x_ = int(torch.randint(W, (1,)))
        y1, y2 = max(0, y - 8), min(H, y + 8)
        x1, x2 = max(0, x_ - 8), min(W, x_ + 8)
        img[:, y1:y2, x1:x2] = 0
        return img

class LabeledDataset(Dataset):
    """Returns (weak_aug, label) for the labeled indices."""
    def __init__(self, base, idx):
        self.base, self.idx, self.aug = base, idx, WeakAug()

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, i):
        img, y = self.base[self.idx[i]]
        return self.aug(img), y

class UnlabeledDataset(Dataset):
    """Returns (weak_aug, strong_aug) pair."""
    def __init__(self, base, idx):
        self.base, self.idx = base, idx
        self.weak, self.strong = WeakAug(), StrongAug()

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, i):
        img, _ = self.base[self.idx[i]]  # the true label is deliberately ignored
        return self.weak(img), self.strong(img)
# ---------- 3. Simple WideResNet-ish backbone ----------
class BasicBlock(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(cin)
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.shortcut = (nn.Conv2d(cin, cout, 1, stride, bias=False)
                         if stride != 1 or cin != cout else nn.Identity())

    def forward(self, x):
        h = self.conv1(F.relu(self.bn1(x)))
        h = self.conv2(F.relu(self.bn2(h)))
        return h + self.shortcut(x)

class WideResNet(nn.Module):
    def __init__(self, num_classes=10, widen=2):
        super().__init__()
        n = 16
        self.stem = nn.Conv2d(3, n, 3, 1, 1, bias=False)
        widths = [n, n*widen, n*2*widen, n*4*widen]
        layers = []
        for i in range(3):
            stride = 1 if i == 0 else 2
            layers.append(BasicBlock(widths[i], widths[i+1], stride))
            layers.append(BasicBlock(widths[i+1], widths[i+1], 1))
        self.blocks = nn.Sequential(*layers)
        self.bn = nn.BatchNorm2d(widths[-1])
        self.fc = nn.Linear(widths[-1], num_classes)

    def forward(self, x):
        h = self.blocks(self.stem(x))
        h = F.relu(self.bn(h))
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)
        return self.fc(h)
# ---------- 4. Data pipeline ----------
raw = datasets.CIFAR10("./data", train=True, download=True)
test = datasets.CIFAR10("./data", train=False, download=True,
                        transform=transforms.Compose([
                            transforms.ToTensor(),
                            transforms.Normalize(CIFAR_MEAN, CIFAR_STD)]))

lab_idx, unlab_idx = split_labeled_unlabeled(raw, n_labeled_per_class=25)
lab_ds = LabeledDataset(raw, lab_idx)        # 250 images
unlab_ds = UnlabeledDataset(raw, unlab_idx)  # 49,750 images

B, mu = 64, 7
lab_loader = DataLoader(lab_ds, batch_size=B, shuffle=True,
                        num_workers=2, drop_last=True)
unlab_loader = DataLoader(unlab_ds, batch_size=B*mu, shuffle=True,
                          num_workers=2, drop_last=True)
test_loader = DataLoader(test, batch_size=256, num_workers=2)
# ---------- 5. FixMatch training loop ----------
model = WideResNet(num_classes=10, widen=2).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.03,
                      momentum=0.9, nesterov=True, weight_decay=5e-4)
tau, lam = 0.95, 1.0

def infinite(loader):
    """Cycle through a DataLoader forever."""
    while True:
        for batch in loader:
            yield batch

lab_iter = infinite(lab_loader)
unlab_iter = infinite(unlab_loader)

for step in range(5000):  # paper uses 2**20; 5k is illustrative
    model.train()
    x_l, y_l = next(lab_iter)
    x_u_w, x_u_s = next(unlab_iter)
    x_l, y_l = x_l.to(device), y_l.to(device)
    x_u_w, x_u_s = x_u_w.to(device), x_u_s.to(device)

    # One concatenated forward pass for speed; BatchNorm then sees
    # labeled and unlabeled images in the same batch.
    x = torch.cat([x_l, x_u_w, x_u_s], dim=0)
    logits = model(x)
    l_logits = logits[:B]
    u_w_logits, u_s_logits = logits[B:].chunk(2)

    # Supervised loss
    loss_s = F.cross_entropy(l_logits, y_l)

    # Pseudo-label from weak aug (no grad through target)
    with torch.no_grad():
        probs_w = F.softmax(u_w_logits, dim=-1)
        max_probs, pseudo = probs_w.max(dim=-1)
        mask = (max_probs >= tau).float()

    # Unsupervised loss on strong aug, averaged over the full unlabeled batch
    loss_u = (F.cross_entropy(u_s_logits, pseudo, reduction="none") * mask).mean()

    loss = loss_s + lam * loss_u
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 500 == 0:
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for xb, yb in test_loader:
                xb, yb = xb.to(device), yb.to(device)
                pred = model(xb).argmax(-1)
                correct += (pred == yb).sum().item()
                total += yb.size(0)
        print(f"step {step:5d} loss_s={loss_s.item():.3f} "
              f"loss_u={loss_u.item():.3f} mask_used={mask.mean().item():.2f} "
              f"test_acc={100*correct/total:.2f}%")
Several observations follow from running the code above:
- For the first few hundred steps, mask_used remains near zero: the model is not yet confident on anything, so the unsupervised term contributes nothing. This is expected; the supervised loss is performing the work.
- Between approximately step 1,000 and step 3,000, mask_used begins climbing into the 0.2 to 0.6 range, and test accuracy increases noticeably. This is the point at which FixMatch begins to contribute substantively.
- The 5,000-step budget here is roughly 200 times shorter than the 2**20 (about one million) steps used in the paper. Reproducing the reported 94.93% on CIFAR-10 with 250 labels requires much longer training, a cosine learning-rate schedule, and EMA weights at evaluation time.
A realistic labelled-only baseline (the same backbone, the same 250 labels, no unlabelled data, with only heavy augmentation) tends to land in the range of 50% to 60% test accuracy. FixMatch approaches 95%. That gap of more than 30 percentage points, from the same 250 labels, is the central result of modern semi-supervised learning.
Real-World Applications Across Domains
Semi-supervised learning is most valuable wherever the ratio of labelled to unlabelled data is extreme and the cost of labelling is high.
| Domain | Why SSL fits | Typical setup |
|---|---|---|
| Medical imaging | Radiologist time is expensive; raw DICOMs accumulate | 5k labeled scans + 500k unlabeled; FixMatch or Mean Teacher |
| Manufacturing QA | Defects are rare; passing parts flood the line | Few labeled defects, many unlabeled parts; SSL + one-class anomaly models |
| NLP (sentiment, NER) | Labeled corpora small; web text infinite | Backtranslation or UDA on top of a pretrained transformer |
| Autonomous driving | Segmentation labels cost minutes/frame; fleet logs petabytes | Mean Teacher for segmentation; auto-labeling pipelines |
| Fraud detection | Confirmed frauds are rare; transactions are billions | Graph SSL + entropy minimization + active learning loop |
| Speech recognition | Transcribed audio scarce; raw audio abundant | wav2vec 2.0 pretrain + semi-supervised fine-tuning |
| Industrial anomaly detection | Very few examples of failure; many normal runs | Deep SAD (semi-supervised variant of Deep SVDD) |
The manufacturing and anomaly-detection cases deserve a particular note: a semi-supervised variant of one-class classification, Deep SAD, builds directly on the Deep SVDD framework. It uses the few labelled abnormal examples to tighten the hypersphere around normal data. For anomaly detection with even a handful of confirmed anomalies, Deep SAD typically outperforms pure Deep SVDD.
Paradigm Comparison: SSL, Self-SSL, Transfer, Active
When a stakeholder asks which approach to use, the underlying question is often whether more labelling can be avoided. Several paradigms address this question in different ways.
| Paradigm | Data | Labeling cost | Typical performance | When to use |
|---|---|---|---|---|
| Fully supervised | All labeled | High | Baseline | Labels are cheap or already exist |
| Semi-supervised | Few labeled + many unlabeled | Low | Matches supervised at 1–10% labels | Labels scarce, unlabeled data plentiful, distributions match |
| Self-supervised | Unlabeled only (pretrain) | None for pretraining | Great when scaled to considerable data | You need reusable backbones; substantial unlabeled corpus |
| Transfer learning | Pretrained weights + small labeled | Low | Strong and fast | A suitable pretrained model exists in your modality |
| Active learning | Iteratively label smartly | Medium | Maximizes labels ROI | Labeling is possible but slow/expensive; you want to budget it |
| Domain adaptation | Labeled source + unlabeled target | Medium | Bridges distribution shift | Your deployment data differs from your labeled data |
These paradigms combine freely. A strong 2026 pipeline might: (1) pretrain a backbone with self-supervised learning, (2) fine-tune with semi-supervised learning on the actual task, (3) apply DANN-style domain adaptation when deploying to a new facility, and (4) use active learning to prioritise which difficult examples to return to human annotators.
Method Comparison Within SSL
| Method | Complexity | Typical CIFAR-10 (250 labels) | Strengths | Weaknesses |
|---|---|---|---|---|
| Pseudo-labeling | Very low | ~60–70% | Trivial to implement | Confirmation bias, error amplification |
| Mean Teacher | Medium | ~80% | Stable; good for regression/segmentation | Weaker on classification vs FixMatch |
| MixMatch | High | ~88% | Strong with limited tricks | Many moving parts; sensitive to sharpening temperature |
| FixMatch | Medium | ~95% | Simple, strong default, broadly applicable | Global threshold can stall on hard classes |
| FlexMatch | Medium-high | ~95.5% | Per-class dynamic thresholds; handles curriculum | More hyperparameters |
Practical Guide: Thresholds, Data Ratios, Pitfalls
How Much Labelled Data Is Required?
Empirically, SSL gains are largest when very few labels are available (for example, 4 to 40 per class) and diminish as the count approaches thousands per class. Above roughly 10% of the dataset labelled, FixMatch and related methods tend to converge with the fully supervised baseline. This does not mean SSL is useless above 10%; rather, the marginal advantage of SSL over additional labelling becomes smaller. The most favourable regime is one in which labels are genuinely scarce.
Choosing a Method
- Standard image classification. Start with FixMatch. It is a strong default with minimal hyperparameter sensitivity.
- Regression or segmentation. Mean Teacher adapts more naturally, because the consistency target can be a continuous prediction or pixel map rather than a class.
- Imbalanced classes or class-dependent difficulty. FlexMatch’s dynamic thresholds prevent the majority classes from absorbing all pseudo-labels.
- Graph-structured data. Use GCN or GAT directly; both are natively semi-supervised.
Hyperparameter Tips
- Confidence threshold τ: 0.95 is the FixMatch default. Lower it (0.7 to 0.8) if mask_used remains near zero for an extended period; raise it if pseudo-labels appear noisy.
- Unsupervised weight λ: 1.0 typically works. If the supervised loss is unstable early in training, ramp λ from 0 to 1 over the first few epochs.
- EMA decay (Mean Teacher): 0.999 is standard. Lower values cause the teacher to track the student noisily; higher values make the teacher lag so far behind the student that it effectively stops updating.
- Batch size ratio μ: FixMatch uses μ = 7 (seven times more unlabelled per labelled batch). The unlabelled batch must be large enough that confidence-gated pseudo-labels are not all of the same class.
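The λ ramp mentioned above can be as simple as a linear schedule. The sketch below is one possible form, with a hypothetical function name and an arbitrary ramp length; other shapes (for example a sigmoid ramp) are also used in practice.

```python
def ramp_lambda(step, ramp_steps, lam_max=1.0):
    """Linearly increase the unsupervised weight from 0 to lam_max."""
    if ramp_steps <= 0:
        return lam_max
    return lam_max * min(1.0, step / ramp_steps)

schedule = [ramp_lambda(s, ramp_steps=1000) for s in (0, 250, 1000, 5000)]
```

Inside a training loop, the fixed weight would be replaced by the scheduled one, for instance `loss = loss_s + ramp_lambda(step, 1000) * loss_u`.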
Common Pitfalls
- Confirmation bias. The model pseudo-labels unlabelled data confidently but incorrectly, then trains on those incorrect labels. Strong augmentation and confidence thresholding mitigate this risk.
- Class imbalance. If the labelled set is 90% class A, pseudo-labels will skew toward class A on unlabelled data, reinforcing the imbalance. FlexMatch and distribution alignment (ReMixMatch) address this.
- Distribution shift. If the labelled data originates from Hospital A and the unlabelled data from Hospital B, SSL can degrade performance. The appropriate response is domain adaptation, either instead of SSL or in conjunction with it.
- Open-set contamination. The unlabelled set contains classes that are absent from the labelled set. Pseudo-labelling forces these into known classes, corrupting the model.
- Insufficient iterations. FixMatch requires extended training for mask_used to rise. Judgments should not be made after a single epoch.
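Several of these pitfalls show up first in the distribution of accepted pseudo-labels, so it can help to log it alongside mask_used. The sketch below is one hypothetical way to do that; the skew test assumes the true classes are roughly balanced, and the factor of 2 is an arbitrary illustrative choice.

```python
from collections import Counter

def pseudo_label_report(pseudo_labels, mask, n_classes):
    """Share of accepted pseudo-labels per class, plus the overall mask rate."""
    accepted = [y for y, m in zip(pseudo_labels, mask) if m]
    counts = Counter(accepted)
    total = len(accepted)
    shares = [counts.get(c, 0) / total if total else 0.0
              for c in range(n_classes)]
    mask_rate = total / len(mask) if mask else 0.0
    return shares, mask_rate

def is_skewed(shares, factor=2.0):
    """Flag when one class takes more than `factor` times its uniform share."""
    return max(shares) > factor / len(shares)

shares, rate = pseudo_label_report(
    pseudo_labels=[0, 0, 0, 0, 1, 2, 0, 0],
    mask=[1, 1, 1, 1, 1, 0, 1, 0],
    n_classes=3,
)
warning = is_skewed(shares)
```

A class share that keeps growing over training while the others shrink is a common early sign of confirmation bias or class-imbalance drift.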
Tools and Libraries
- USB (Unified Semi-supervised learning Benchmark). PyTorch framework with more than 15 SSL algorithms and a common evaluation harness.
- TorchSSL. Curated implementations of the classical SSL algorithms for image classification.
- MMClassification / MMSegmentation. OpenMMLab tools with SSL support for image classification and segmentation.
- Google’s official FixMatch repository. The paper authors’ reference TensorFlow implementation.
Connections to Transfer, Active, and Domain Adaptation
Semi-supervised learning is most powerful when treated not as a standalone technique but as one element of a broader set of complementary methods.
Semi-Supervised plus Transfer Learning
A common pattern is to begin with a pretrained backbone (ImageNet, CLIP, wav2vec) and fine-tune it using FixMatch on a small labelled set together with the unlabelled data. This combination routinely outperforms either approach in isolation. The pretrained features provide a head start on representation; SSL allows the model to adapt to the specific label structure. The transfer learning guide presents a concrete version of this pipeline for a cobot anomaly-detection project.
Semi-Supervised plus Active Learning
Active learning selects which unlabelled examples are most worth labelling next, while SSL uses the unlabelled examples without labelling them. The combined workflow trains with SSL, identifies examples on which the model is least confident or on which the SSL pseudo-label fluctuated across epochs, sends those to a human annotator, returns them as labelled data, and repeats. This pattern characterises most production labelling pipelines.
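The selection step of that loop can be sketched as follows. This is one plausible heuristic rather than a standard algorithm: it assumes the softmax outputs on the unlabelled pool have been saved after each epoch into a tensor of shape (epochs, N, classes), and the function name `select_for_annotation` is hypothetical. Examples are ranked first by how often their pseudo-label flipped between consecutive epochs, then by low final confidence.

```python
import torch


def select_for_annotation(probs_history: torch.Tensor, budget: int) -> torch.Tensor:
    """Pick unlabelled examples to send to a human annotator.

    probs_history: softmax outputs with shape (epochs, N, classes).
    Returns the indices of up to `budget` examples, most worth labelling first.
    """
    labels = probs_history.argmax(dim=2)                 # (epochs, N)
    flips = (labels[1:] != labels[:-1]).sum(dim=0)       # label changes per example
    confidence = probs_history[-1].max(dim=1).values     # final-epoch confidence
    # Flip counts are integers and (1 - confidence) is below 1, so flips dominate
    # the ranking and confidence only breaks ties.
    score = flips.float() + (1.0 - confidence)
    k = min(budget, score.numel())
    return torch.topk(score, k).indices
```

The returned indices would be labelled, moved from the unlabelled pool to the labelled set, and the SSL training restarted or resumed.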
Semi-Supervised plus Domain Adaptation
If the labelled data (source domain) and unlabelled data (target domain) originate from different distributions, plain SSL is likely to degrade performance rather than improve it. Domain-adversarial training (DANN) or maximum-mean-discrepancy methods align the feature distributions, and once alignment is achieved, SSL can operate effectively. This combination is the basis on which many medical AI systems generalise across hospitals.
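As an illustration of the maximum-mean-discrepancy idea, the sketch below computes a biased estimate of squared MMD between a batch of source features and a batch of target features using a single RBF kernel. The function name `rbf_mmd2` and the fixed `bandwidth` parameter are assumptions made for this example; practical implementations often use several bandwidths or a median heuristic.

```python
import torch


def rbf_mmd2(source: torch.Tensor, target: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Biased estimate of squared MMD between feature batches of shape (n, d) and (m, d)."""

    def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # RBF kernel on pairwise squared Euclidean distances.
        sq_dist = torch.cdist(a, b).pow(2)
        return torch.exp(-sq_dist / (2.0 * bandwidth ** 2))

    return (
        kernel(source, source).mean()
        + kernel(target, target).mean()
        - 2.0 * kernel(source, target).mean()
    )
```

Added to the training loss with a small weight, this term pushes the backbone toward features whose distribution looks similar across the two domains. It can also be used on its own as a quick check of how far apart the labelled and unlabelled splits are before committing to SSL.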
Semi-Supervised plus Self-Supervised
The two approaches need not be alternatives; they can be stacked. Pretrain with self-supervised learning on a substantial unlabelled corpus (see the self-supervised learning guide), then fine-tune with FixMatch on a small labelled set together with a focused unlabelled set. This combination underlies the modern recipe used in speech (wav2vec 2.0), vision (MAE plus FixMatch fine-tune), and NLP (pretraining plus UDA).
Statistical reasoning helps explain why additional data tends to help: when the SSL assumptions hold and unlabelled examples contribute to parameter estimation, the effective sample size grows and the variance of the estimates falls, a phenomenon closely related to the central limit theorem in parameter estimation.
Frequently Asked Questions
What’s the difference between semi-supervised and self-supervised learning?
Semi-supervised learning uses some human-labeled data plus unlabeled data to solve a specific downstream task directly. Self-supervised learning uses only unlabeled data and invents its own labels from data structure (masking, contrastive pairs) to produce a reusable pretrained backbone, which is later fine-tuned with labeled data on a downstream task. Semi-supervised is a training strategy for a task; self-supervised is a pretraining strategy for representations.
How many labeled samples do I need for semi-supervised learning?
The requirement depends on task complexity, but as a rule of thumb, FixMatch-class methods produce substantial gains with as few as 4 to 40 labelled examples per class for image classification. Returns diminish once approximately 10% of the dataset is labelled. For NLP and tabular data the curve is similar, though the inflection often arises with slightly more labels per class due to greater input variability.
When does semi-supervised learning hurt rather than help?
SSL can degrade performance when (a) the unlabelled data distribution differs materially from the labelled data distribution, (b) the unlabelled set contains novel classes not present in the labelled set, (c) class imbalance in the labelled set biases the pseudo-labels, or (d) the core assumptions (smoothness, cluster, manifold) do not hold for the data. The SSL model should always be measured against a strong supervised baseline on a held-out set that reflects deployment conditions.
FixMatch vs MixMatch: which should I use?
FixMatch is simpler, performs better on most benchmarks, and has fewer hyperparameters. It is the recommended starting point unless a specific reason argues for MixMatch (for example, a separate requirement for MixUp regularisation). MixMatch’s averaging-and-sharpening is conceptually elegant, but its empirical gains have been surpassed by FixMatch’s weak/strong pseudo-label scheme.
Can I combine semi-supervised learning with transfer learning?
Yes, and combining them is generally recommended. Initialise with a pretrained backbone (ImageNet, CLIP, or a domain-specific model) and apply FixMatch or Mean Teacher on top. Pretrained weights provide strong features from the start, which means FixMatch’s mask threshold is reached earlier in training and pseudo-labels are more reliable. This combination approximates the default recipe in modern practice.
References and Further Reading
External References
- Sohn, K. et al. (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. arXiv:2001.07685
- Tarvainen, A. and Valpola, H. (2017). Mean teachers are better role models. arXiv:1703.01780
- Xie, Q. et al. (2020). Self-training with Noisy Student improves ImageNet classification. arXiv:1911.04252
- Berthelot, D. et al. (2019). MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv:1905.02249
- Chapelle, O., Schölkopf, B., Zien, A. (eds.) (2006). Semi-Supervised Learning. MIT Press.
- USB benchmark. github.com/microsoft/Semi-supervised-learning
- Google FixMatch reference implementation. github.com/google-research/fixmatch