Summary
What this post covers: A side-by-side, math-and-code walkthrough of Support Vector Machines (SVM) and One-Class SVM (OCSVM), showing when each is the right tool and how their kernel-based machinery diverges despite the shared name.
Key insights:
- SVM is a supervised binary classifier that maximizes the margin between two labeled classes; OCSVM is a semi-supervised anomaly detector that wraps a boundary around a single “normal” class and flags everything outside as suspicious.
- Use SVM only when you have labeled examples of both classes; use OCSVM when anomalies are rare, diverse, or absent from training data. Applying the wrong one will either fail to train or throw away half your information.
- Feature scaling and the RBF gamma parameter dominate practical performance: a factor-of-two change in gamma can be the difference between a working model and a useless one, more impactful than any algorithmic substitution.
- OCSVM is highly sensitive to contamination: even a small fraction of anomalies leaking into the “normal” training set produces an overly permissive boundary, so curating clean training data, or setting nu high enough to cover the expected contamination rate, is essential.
- For datasets with millions of samples, kernel SVM and OCSVM become impractical due to O(n^2) memory; Isolation Forest or SGD-based linear variants are better choices at that scale.
Main topics: Introduction, What Is SVM (Support Vector Machine)?, What Is OCSVM (One-Class SVM)?, SVM vs OCSVM: Head-to-Head Comparison, Implementation: Complete Python Code, Real-World Use Cases, Practical Decision Guide: When to Use Which?, Advanced Topics, Performance Comparison, Hyperparameter Tuning Guide, Common Pitfalls, Putting It Together, References.
Introduction
Consider a manufacturing engineer monitoring an assembly line that produces ten thousand circuit boards per day. Of those ten thousand, perhaps three are defective. A machine learning model must catch those three, yet the available data consists overwhelmingly of examples of good boards, with very few examples of defective ones. The choice is between waiting months to collect sufficient defective samples and building a model that learns the structure of “normal” and flags everything else.
This dilemma marks the fundamental divide between two of the most important algorithms in machine learning: the Support Vector Machine (SVM) and its less widely recognised counterpart, the One-Class SVM (OCSVM). Despite a shared name and mathematical lineage, the two algorithms address fundamentally different problems. SVM is a supervised classifier that draws a boundary between two labelled groups. OCSVM is a semi-supervised anomaly detector that wraps a boundary around a single group and treats any point falling outside it as suspicious.
Choosing the wrong method has serious consequences. Applying SVM in the absence of labelled anomalies prevents the model from training at all. Applying OCSVM to perfectly balanced, labelled data discards half of the available information. Yet in tutorials across the internet, the two are routinely conflated, treated cursorily, or illustrated with identical toy examples that obscure their substantive differences.
The present article addresses these gaps. Both algorithms are presented from first principles, with inline SVG diagrams that render the geometry visible. The mathematics is covered with sufficient depth but without excess, and complete runnable Python implementations of both algorithms are provided. A practical decision framework follows, intended to support correct method selection. The treatment is suitable both for a data scientist choosing between approaches in a fraud detection system and for a student aiming to understand when single-class modelling is appropriate.
What Is SVM (Support Vector Machine)?
The Support Vector Machine is one of the more elegant algorithms in machine learning. Developed in the 1990s by Vladimir Vapnik and colleagues at AT&T Bell Labs, SVM is a supervised binary classifier that identifies the optimal hyperplane—a decision boundary—that separates two classes of data with the maximum possible margin.
The intuition is as follows. Consider a scatterplot with blue points on one side and red points on the other. Infinitely many lines could separate them. SVM selects the line that sits as far as possible from the nearest points of both classes. Those nearest points are termed support vectors, and they support the position of the boundary in a literal sense: removing them shifts the boundary. All other points in the dataset are irrelevant to the final model.
Visualising the Standard SVM
The following diagram shows how SVM operates in two dimensions. The decision boundary (solid line) sits exactly between the two classes, with the margin (the gap between the dashed lines) maximised:
This is the central insight of SVM: only the support vectors are consequential. The trained model is compact precisely because every other training point, typically the vast majority, receives zero weight in the solution and can be discarded once training is complete.
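The claim that only the support vectors matter can be checked empirically. The following is a minimal sketch, assuming synthetic two-cluster data from make_blobs (the data, the linear kernel, and C = 1.0 are illustrative choices): a linear SVM is fitted on all points, refitted on its own support vectors alone, and the predictions of the two models are compared.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative data: two well-separated Gaussian clusters (an assumption).
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]],
                  cluster_std=1.0, random_state=0)

# Fit on every training point.
full = SVC(kernel="linear", C=1.0).fit(X, y)

# Refit using only the points the first model kept as support vectors.
sv_idx = full.support_
reduced = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

# Fraction of training points on which the two models agree.
agreement = np.mean(full.predict(X) == reduced.predict(X))
print(f"Training points: {len(X)}, support vectors: {len(sv_idx)}")
print(f"Agreement after dropping non-support vectors: {agreement:.3f}")
```

If the two models agree on essentially every point, the discarded points were indeed contributing nothing to the boundary.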
Mathematical Formulation
For readers interested in the mathematics, SVM optimises the following objective. Given training data {(x₁, y₁),…, (xₙ, yₙ)} where yᵢ ∈ {-1, +1}, the hard-margin SVM solves:
Minimise over w, b: ½ ||w||²
Subject to: yᵢ(w · xᵢ + b) ≥ 1 for all i
Here, w is the weight vector (perpendicular to the hyperplane), b is the bias term, and the constraint ensures that every point lies on the correct side of the margin. The term ||w||² controls the margin width: minimising it maximises the margin.
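The link between ||w|| and the margin can be made concrete: the two margin lines lie at w · x + b = ±1, so the gap between them is 2 / ||w||. The sketch below assumes linearly separable synthetic data and approximates the hard margin with a very large C (scikit-learn's SVC has no true hard-margin switch, so this is an approximation, not an exact hard-margin solver).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative, clearly separable clusters (an assumption for this sketch).
X, y = make_blobs(n_samples=100, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=1)
y_signed = np.where(y == 0, -1, 1)   # relabel to {-1, +1} as in the formulation

# A very large C leaves almost no room for slack, approximating the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y_signed)
w = clf.coef_.ravel()
b = clf.intercept_[0]

margin_width = 2.0 / np.linalg.norm(w)        # distance between the margin lines
functional_margins = y_signed * (X @ w + b)   # left-hand side of the constraint

print(f"||w|| = {np.linalg.norm(w):.3f}, margin width = {margin_width:.3f}")
print(f"Smallest y_i (w . x_i + b) = {functional_margins.min():.3f}")
```

The smallest constraint value should sit close to 1: the support vectors are the points for which the constraint holds with (near) equality.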
Soft Margin SVM and the C Parameter
Real-world data is rarely clean. Classes overlap, and outliers occur. The hard-margin SVM fails on any dataset that is not perfectly separable. The soft-margin SVM introduces slack variables ξᵢ that allow some points to violate the margin or even be misclassified:
Minimise over w, b, ξ: ½ ||w||² + C Σᵢ ξᵢ
Subject to: yᵢ(w · xᵢ + b) ≥ 1 – ξᵢ, ξᵢ ≥ 0 for all i
The parameter C is the regularisation constant. A large C penalises misclassifications heavily (tight fit, risk of overfitting). A small C allows more misclassifications (smoother boundary, better generalisation). Tuning C is among the most important decisions in SVM usage.
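The effect of C is easiest to see by sweeping it. The following sketch assumes noisy synthetic data (10% of labels flipped via flip_y) and three arbitrary C values; it reports cross-validated accuracy and the number of support vectors for each setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative overlapping classes with label noise (an assumption).
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=0.8, flip_y=0.1, random_state=0)

results = {}
for C in [0.01, 1, 100]:
    model = make_pipeline(StandardScaler(),
                          SVC(kernel="rbf", C=C, gamma="scale"))
    cv_acc = cross_val_score(model, X, y, cv=5).mean()     # generalisation estimate
    n_sv = int(model.fit(X, y).named_steps["svc"].n_support_.sum())
    results[C] = (cv_acc, n_sv)
    print(f"C={C:<6} CV accuracy={cv_acc:.3f}  support vectors={n_sv}")
```

A small C typically leaves many points inside the margin, so the support vector count tends to be high; a large C shrinks the margin and usually reduces that count, at the risk of fitting the flipped labels.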
The Kernel Trick
What if the data is not linearly separable in its original space, so that no hyperplane can divide the classes? The kernel trick is SVM’s principal mechanism for handling this case. It implicitly maps data into a higher-dimensional feature space in which a linear separator does exist, without ever computing coordinates in that space. Instead, every dot product x · x’ is replaced by a kernel function K(x, x’).
Common kernels include:
- Linear: K(x, x’) = x · x’, appropriate for linearly separable data.
- RBF (Gaussian): K(x, x’) = exp(-γ ||x – x’||²), the default choice for most nonlinear problems.
- Polynomial: K(x, x’) = (γ x · x’ + r)^d, used for polynomial decision boundaries.
The advantage of the kernel trick is computational. SVM optimisation requires only dot products between data points. Replacing those dot products with a kernel function produces the effect of operating in a high-dimensional (possibly infinite-dimensional) space without computing the explicit transformation. This is why SVM with an RBF kernel can handle strongly nonlinear boundaries at reasonable computational cost.
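A small numerical check makes the kernel trick tangible. The sketch below uses the homogeneous degree-2 polynomial kernel K(x, x’) = (x · x’)² (that is, γ = 1, r = 0, d = 2), chosen because its explicit feature map in two dimensions is short enough to write out; phi_poly2 is a helper defined here for illustration. It also verifies the RBF formula against scikit-learn's rbf_kernel.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi_poly2(x):
    """Explicit feature map for K(x, z) = (x . z)^2 with 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi_poly2(x) @ phi_poly2(z)   # dot product in the 3-D feature space
implicit = (x @ z) ** 2                  # kernel evaluated in the original 2-D space
print(explicit, implicit)                # both equal 1.0 up to rounding

# RBF kernel: the formula from the list above versus scikit-learn's implementation.
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - z) ** 2))
library_rbf = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print(manual_rbf, library_rbf)
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is available; the polynomial case shows that the two routes give the same number when both can be computed.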
When to Use SVM
SVM performs particularly well in the following scenarios:
- Binary classification with labelled data: spam versus non-spam, tumour versus healthy, positive versus negative sentiment.
- High-dimensional data: text classification (TF-IDF vectors with thousands of features) and genomics data.
- Small to medium datasets: SVM’s training complexity of O(n²) to O(n³) makes it impractical for millions of samples, but it is highly effective on datasets in the thousands.
- When a clear margin is desired: the margin provides a geometric notion of confidence.
- When support vector interpretability matters: a practitioner can inspect which training examples serve as support vectors.
Strengths and Weaknesses
Strengths: Strong generalisation with appropriate tuning, effectiveness in high dimensions, memory efficiency (only support vectors are stored), robustness to overfitting when C is tuned, and versatility through different kernels.
Weaknesses: Limited scalability beyond roughly 100,000 samples, sensitivity to feature scaling, substantial dependence on kernel choice and hyperparameter settings, no direct provision of probability estimates (though Platt scaling can approximate them), and difficulty with highly noisy or strongly overlapping classes.
What Is OCSVM (One-Class SVM)?
The One-Class SVM, introduced by Bernhard Schölkopf and colleagues in 2001, inverts the standard SVM paradigm. Instead of learning a boundary between two classes, OCSVM learns a boundary around a single class. Points inside the boundary are treated as normal; points outside are treated as anomalous.
This formulation matches many real-world problems in which only one class is represented in the training data. Examples include:
- Millions of legitimate credit card transactions but only a handful of fraudulent ones.
- Years of sensor data from healthy machines but only a few recordings from moments preceding failure.
- Vast archives of normal network traffic but very few examples of novel attacks—and future attacks tend to differ from past ones.
In each of these cases, training a standard SVM is not feasible because representative examples of the negative class are unavailable. OCSVM addresses this constraint by requiring only normal data for training.
Visualising One-Class SVM
Unlike standard SVM, which requires two classes to construct a decision boundary, OCSVM requires only normal data. It learns the shape of the normal class and draws a tight boundary around it. Any new data point falling outside that boundary is flagged as anomalous.
Mathematical Formulation
Schölkopf’s formulation maps the data into a feature space via a kernel and then identifies a hyperplane that separates the data from the origin with maximum margin. For n training points, the optimisation problem is:
Minimise over w, ξ, ρ: ½ ||w||² + (1/(νn)) Σᵢ ξᵢ – ρ
Subject to: w · φ(xᵢ) ≥ ρ – ξᵢ, ξᵢ ≥ 0
Here ρ is the offset from the origin, and ν serves a dual role: it is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors. Setting ν = 0.05 means that at most 5% of the training data is expected to be outliers, and at least 5% of the points will serve as support vectors.
The ν Parameter
The ν (nu) parameter is the most important hyperparameter in OCSVM and warrants careful consideration:
- ν = 0.01: A very tight setting, permitting only 1% of training data outside the boundary. Appropriate when the training data is clean.
- ν = 0.05: A common starting point, allowing 5% as potential outliers.
- ν = 0.1: A more relaxed setting, useful when the training data is suspected to contain some contamination.
- ν = 0.5: A very loose setting under which up to half the data may fall outside the boundary. Rarely useful in practice.
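The dual role of ν can be observed directly on a fitted model. The sketch below assumes clean synthetic Gaussian data and a fixed γ = 0.5 (both illustrative choices) and reports, for each ν in the list above, the fraction of training points flagged as outliers and the fraction retained as support vectors.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Illustrative "normal" data: a single 2-D Gaussian cloud (an assumption).
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(1000, 2)))

nu_report = {}
for nu in [0.01, 0.05, 0.1, 0.5]:
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=nu).fit(X)
    outlier_frac = float(np.mean(model.predict(X) == -1))   # training points outside
    sv_frac = len(model.support_) / len(X)                   # training points kept as SVs
    nu_report[nu] = (outlier_frac, sv_frac)
    print(f"nu={nu:<5} outliers={outlier_frac:.3f}  support vectors={sv_frac:.3f}")
```

The support vector fraction should not fall below ν, while the outlier fraction should stay near or below it; small excesses in the outlier fraction can occur because the solver stops at a numerical tolerance.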
The Effect of γ (Gamma) on the Boundary
When OCSVM is used with an RBF kernel (the most common configuration), the γ parameter controls how tightly the boundary wraps around the data. It is arguably the most sensitive parameter in the entire model:
The diagrams above illustrate the substantial effect of γ. At excessively low values, the boundary becomes so loose that it includes actual anomalies. At excessively high values, the boundary wraps so tightly that normal data is flagged as anomalous. Identifying an appropriate setting requires either domain knowledge of how tight the boundary should be or systematic evaluation against a validation set containing known anomalies.
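The sensitivity to γ can also be measured numerically. The following sketch assumes a small validation set with known anomalies, as the paragraph above suggests; the data, the γ values, and ν = 0.05 are illustrative. For each γ it reports the false-alarm rate on held-out normal points and the detection rate on the anomalies.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
train = rng.normal(loc=[2, 2], scale=0.5, size=(300, 2))           # normal, for fitting
holdout_normal = rng.normal(loc=[2, 2], scale=0.5, size=(200, 2))  # normal, unseen
known_anomalies = rng.uniform(low=-2, high=6, size=(50, 2))        # labelled anomalies

scaler = StandardScaler().fit(train)
train_s = scaler.transform(train)
holdout_s = scaler.transform(holdout_normal)
anomalies_s = scaler.transform(known_anomalies)

gamma_report = {}
for gamma in [0.001, 0.1, 1.0, 100.0]:
    model = OneClassSVM(kernel="rbf", gamma=gamma, nu=0.05).fit(train_s)
    false_alarm = float(np.mean(model.predict(holdout_s) == -1))
    detected = float(np.mean(model.predict(anomalies_s) == -1))
    gamma_report[gamma] = (false_alarm, detected)
    print(f"gamma={gamma:<7} false alarms={false_alarm:.3f}  detected={detected:.3f}")
```

A useful γ keeps the false-alarm rate close to ν while detecting most anomalies. At very high γ the boundary hugs individual training points, so unseen normal points are flagged far more often.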
When to Use OCSVM
- Anomaly or novelty detection: identifying unusual data points.
- Only normal data available: no labelled anomalies are present for training.
- Rare event detection: anomalies occur so infrequently that balanced classification is not feasible.
- Open-set recognition: the form of future anomalies is unknown.
- Manufacturing quality control: training on good parts, detecting defective ones.
Strengths and Weaknesses
Strengths: The method requires only normal data for training, naturally handles class imbalance, performs effectively in novelty detection (identifying anomaly types not previously observed), supports kernels for nonlinear boundaries, and provides a decision function score for ranking anomalies.
Weaknesses: The method shares the scalability constraints of SVM (O(n²) to O(n³)), is highly sensitive to the γ and ν parameters, offers no performance guarantee without labelled anomalies for validation, assumes that normal data is well clustered and anomalies are diffuse, and can struggle when the normal data exhibits multiple modes or clusters.
SVM and OCSVM: A Direct Comparison
The two algorithms are now placed side by side. The following diagram illustrates the fundamental difference in what each algorithm does:
Comprehensive Comparison Table
| Feature | SVM (SVC) | OCSVM (OneClassSVM) |
|---|---|---|
| Type | Supervised classification | Semi-supervised anomaly detection |
| Training Data | Labeled examples from BOTH classes | Only normal class (unlabeled or single-label) |
| Output | Class label (+1 or -1) | Normal (+1) or anomaly (-1), plus decision score |
| Objective | Maximize margin between two classes | Maximize margin between normal data and the origin in feature space, giving a tight boundary around normal data |
| Key Parameters | C (regularization), kernel, γ | ν (outlier fraction), kernel, γ |
| Primary Use Case | Binary/multi-class classification | Anomaly detection, novelty detection |
| Scalability | O(n²) to O(n³); practical up to ~100K | O(n²) to O(n³); practical up to ~100K |
| Interpretability | Support vectors show boundary examples | Decision function score, support vectors on boundary |
| sklearn Class | sklearn.svm.SVC | sklearn.svm.OneClassSVM |
| Handles Class Imbalance? | With class_weight parameter | Naturally (only trains on one class) |
Implementation: Complete Python Code
Theory now gives way to practice. The following sections present complete, runnable Python scripts for both algorithms. Each script generates synthetic data, trains the model, visualises the results, and prints evaluation metrics.
SVM Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
classification_report, confusion_matrix, accuracy_score, f1_score
)
# --- Generate synthetic 2D data ---
X, y = make_classification(
n_samples=300, n_features=2, n_redundant=0,
n_informative=2, n_clusters_per_class=1,
class_sep=1.2, random_state=42
)
# --- Split and scale ---
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# --- Train SVM with RBF kernel ---
svm = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm.fit(X_train_s, y_train)
# --- Evaluate ---
y_pred = svm.predict(X_test_s)
print("=== SVM Results ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred):.3f}")
print(f"Support Vectors: {svm.n_support_}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# --- Plot decision boundary ---
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
xx, yy = np.meshgrid(
np.linspace(X_train_s[:, 0].min()-1, X_train_s[:, 0].max()+1, 300),
np.linspace(X_train_s[:, 1].min()-1, X_train_s[:, 1].max()+1, 300)
)
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 20),
cmap='RdBu', alpha=0.3)
ax.contour(xx, yy, Z, levels=[-1, 0, 1],
linestyles=['--', '-', '--'], colors='k')
ax.scatter(X_train_s[y_train==0, 0], X_train_s[y_train==0, 1],
c='#3b82f6', label='Class 0', edgecolors='k', s=40)
ax.scatter(X_train_s[y_train==1, 0], X_train_s[y_train==1, 1],
c='#ef4444', label='Class 1', edgecolors='k', s=40)
ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
s=120, facecolors='none', edgecolors='gold', linewidths=2,
label='Support Vectors')
ax.set_title("SVM Decision Boundary (RBF Kernel)")
ax.legend()
plt.tight_layout()
plt.savefig("svm_decision_boundary.png", dpi=150)
plt.show()
# --- Hyperparameter tuning ---
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': ['scale', 'auto', 0.01, 0.1, 1],
'kernel': ['rbf', 'poly']
}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='f1', n_jobs=-1)
grid.fit(X_train_s, y_train)
print(f"\nBest params: {grid.best_params_}")
print(f"Best CV F1: {grid.best_score_:.3f}")
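One caveat about the tuning step in the script above: the scaler is fitted on the whole training split before cross-validation, so each validation fold has already influenced the scaling statistics. A common remedy is to place the scaler inside a Pipeline so that it is refitted within every fold. The sketch below is one way to do this, reusing the same synthetic data settings; the step names "scaler" and "svc" are arbitrary labels chosen here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           class_sep=1.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaling happens inside the pipeline, so it is refitted on each CV training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],          # "<step>__<param>" addresses a pipeline step
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)              # raw, unscaled features go in

test_f1 = f1_score(y_test, search.predict(X_test))
print(f"Best params: {search.best_params_}")
print(f"Best CV F1: {search.best_score_:.3f}, test F1: {test_f1:.3f}")
```

On a small, well-behaved dataset the difference is usually minor, but the pipeline form also guarantees that the same scaling is applied at prediction time.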
OCSVM Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
# --- Generate synthetic normal data + anomalies ---
np.random.seed(42)
n_normal = 300
n_anomaly = 30
# Normal data: two Gaussian clusters
normal_data = np.vstack([
np.random.randn(n_normal // 2, 2) * 0.5 + [2, 2],
np.random.randn(n_normal // 2, 2) * 0.5 + [3, 3],
])
# Anomalies: scattered uniformly in a wider region
anomalies = np.random.uniform(low=-2, high=7, size=(n_anomaly, 2))
# Labels: +1 = normal, -1 = anomaly (OCSVM convention)
y_normal = np.ones(n_normal)
y_anomaly = -np.ones(n_anomaly)
# --- Scale features (critical for SVM-based methods!) ---
scaler = StandardScaler()
normal_scaled = scaler.fit_transform(normal_data)
# --- Train OCSVM on normal data only ---
ocsvm = OneClassSVM(kernel='rbf', gamma=0.3, nu=0.05)
ocsvm.fit(normal_scaled)
# --- Evaluate on combined dataset ---
X_all = np.vstack([normal_data, anomalies])
X_all_scaled = scaler.transform(X_all)
y_true = np.concatenate([y_normal, y_anomaly])
y_pred = ocsvm.predict(X_all_scaled)
scores = ocsvm.decision_function(X_all_scaled)
print("=== OCSVM Results ===")
print(f"Precision: {precision_score(y_true, y_pred, pos_label=-1):.3f}")
print(f"Recall: {recall_score(y_true, y_pred, pos_label=-1):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred, pos_label=-1):.3f}")
print(f"Support Vectors: {ocsvm.support_vectors_.shape[0]}")
print("\nClassification Report:")
print(classification_report(y_true, y_pred,
target_names=['Anomaly (-1)', 'Normal (+1)']))
# --- Plot decision boundary ---
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
xx, yy = np.meshgrid(
np.linspace(X_all_scaled[:, 0].min()-1, X_all_scaled[:, 0].max()+1, 300),
np.linspace(X_all_scaled[:, 1].min()-1, X_all_scaled[:, 1].max()+1, 300)
)
Z = ocsvm.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 10),
cmap='Reds_r', alpha=0.3)
ax.contourf(xx, yy, Z, levels=np.linspace(0, Z.max(), 10),
cmap='Greens', alpha=0.3)
ax.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
ax.scatter(normal_scaled[:, 0], normal_scaled[:, 1],
c='#10b981', s=30, label='Normal', edgecolors='k', linewidths=0.5)
anomalies_scaled = scaler.transform(anomalies)
ax.scatter(anomalies_scaled[:, 0], anomalies_scaled[:, 1],
c='#ef4444', s=60, marker='D', label='Anomaly', edgecolors='k')
ax.set_title("OCSVM Decision Boundary")
ax.legend()
plt.tight_layout()
plt.savefig("ocsvm_decision_boundary.png", dpi=150)
plt.show()
# --- Tune nu and gamma ---
best_f1 = 0
best_params = {}
for nu in [0.01, 0.03, 0.05, 0.1, 0.2]:
    for gamma in [0.01, 0.05, 0.1, 0.3, 0.5, 1.0]:
        model = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu)
        model.fit(normal_scaled)
        preds = model.predict(X_all_scaled)
        f1 = f1_score(y_true, preds, pos_label=-1)
        if f1 > best_f1:
            best_f1 = f1
            best_params = {'nu': nu, 'gamma': gamma}
print(f"\nBest params: {best_params}")
print(f"Best F1: {best_f1:.3f}")
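The tuning loop above scores each candidate on a set that includes the very points the model was trained on, which tends to flatter the result. A more conservative variant, sketched below under the assumption that some normal data can be held out and a few labelled anomalies exist for validation, fits on one portion of the normal data and ranks candidates by ROC AUC of the decision scores on unseen points. The split sizes and grids are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
normal = np.vstack([rng.normal([2, 2], 0.5, size=(200, 2)),
                    rng.normal([3, 3], 0.5, size=(200, 2))])
rng.shuffle(normal)                               # shuffle rows in place
train, val_normal = normal[:300], normal[300:]    # hold out 100 normal points
val_anomalies = rng.uniform(-2, 7, size=(30, 2))

scaler = StandardScaler().fit(train)              # statistics from training data only
X_train = scaler.transform(train)
X_val = scaler.transform(np.vstack([val_normal, val_anomalies]))
y_val = np.r_[np.ones(len(val_normal)), -np.ones(len(val_anomalies))]

best_auc, best_params = -np.inf, None
for nu in [0.01, 0.03, 0.05, 0.1, 0.2]:
    for gamma in [0.01, 0.05, 0.1, 0.3, 0.5, 1.0]:
        model = OneClassSVM(kernel="rbf", gamma=gamma, nu=nu).fit(X_train)
        # Higher decision scores mean "more normal", matching the +1 label.
        auc = roc_auc_score(y_val, model.decision_function(X_val))
        if auc > best_auc:
            best_auc, best_params = auc, {"nu": nu, "gamma": gamma}

print(f"Best params: {best_params}, validation AUC: {best_auc:.3f}")
```

AUC is threshold-independent, so it separates the question of how well the scores rank anomalies from the question of where to place the cut-off.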
Side-by-Side Comparison Script
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC, OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, accuracy_score
np.random.seed(42)
# Generate data: normal class + rare anomaly class
n_normal, n_anomaly = 400, 20
X_normal = np.random.randn(n_normal, 2) * 0.8 + [3, 3]
X_anomaly = np.random.uniform(0, 6, size=(n_anomaly, 2))
X_all = np.vstack([X_normal, X_anomaly])
y_all = np.array([1]*n_normal + [-1]*n_anomaly)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)
X_normal_scaled = scaler.transform(X_normal)
# --- Approach 1: SVM (supervised — uses BOTH labels) ---
svm = SVC(kernel='rbf', C=10, gamma='scale')
svm.fit(X_scaled, y_all)
y_pred_svm = svm.predict(X_scaled)
# --- Approach 2: OCSVM (semi-supervised — trained on normal only) ---
ocsvm = OneClassSVM(kernel='rbf', gamma=0.3, nu=0.05)
ocsvm.fit(X_normal_scaled)
y_pred_ocsvm = ocsvm.predict(X_scaled)
# --- Compare metrics ---
print("=" * 50)
print(f"{'Metric':<25} {'SVM':>10} {'OCSVM':>10}")
print("=" * 50)
print(f"{'Accuracy':<25} {accuracy_score(y_all, y_pred_svm):>10.3f} "
f"{accuracy_score(y_all, y_pred_ocsvm):>10.3f}")
print(f"{'F1 (anomaly class)':<25} {f1_score(y_all, y_pred_svm, pos_label=-1):>10.3f} "
f"{f1_score(y_all, y_pred_ocsvm, pos_label=-1):>10.3f}")
print(f"{'F1 (normal class)':<25} {f1_score(y_all, y_pred_svm, pos_label=1):>10.3f} "
f"{f1_score(y_all, y_pred_ocsvm, pos_label=1):>10.3f}")
print("=" * 50)
# --- Plot both ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, model, title, preds in zip(
    axes, [svm, ocsvm],
    ["SVM (supervised)", "OCSVM (normal-only training)"],
    [y_pred_svm, y_pred_ocsvm]
):
    xx, yy = np.meshgrid(
        np.linspace(X_scaled[:,0].min()-1, X_scaled[:,0].max()+1, 200),
        np.linspace(X_scaled[:,1].min()-1, X_scaled[:,1].max()+1, 200)
    )
    Z = model.decision_function(
        np.c_[xx.ravel(), yy.ravel()]
    ).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[0], colors='k', linewidths=2)
    ax.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 20),
                cmap='RdYlGn', alpha=0.3)
    ax.scatter(X_scaled[y_all==1, 0], X_scaled[y_all==1, 1],
               c='#10b981', s=20, label='Normal')
    ax.scatter(X_scaled[y_all==-1, 0], X_scaled[y_all==-1, 1],
               c='#ef4444', s=60, marker='D', label='Anomaly')
    ax.set_title(title)
    ax.legend(loc='lower right')
plt.suptitle("SVM vs OCSVM on the Same Dataset", fontsize=14, y=1.02)
plt.tight_layout()
plt.savefig("svm_vs_ocsvm_comparison.png", dpi=150, bbox_inches='tight')
plt.show()
Real-World Use Cases
SVM Use Cases
Standard SVM has served as a reliable instrument for classification tasks for more than two decades. The following are among its most consequential applications:
| Use Case | Dataset Example | Why SVM Works |
|---|---|---|
| Email spam detection | SpamAssassin Corpus | High-dimensional text features, clear binary labels |
| Image classification | CIFAR-10, MNIST | Kernel trick handles nonlinear pixel relationships |
| Medical diagnosis | Wisconsin Breast Cancer | Small dataset, high-dimensional features, labeled outcomes |
| Sentiment analysis | IMDB Reviews, Yelp | TF-IDF vectors are high-dimensional and sparse |
| Gene expression classification | Microarray datasets | Very high dimensionality (thousands of genes), few samples |
| Handwriting recognition | USPS, MNIST digits | RBF kernel handles pixel-space nonlinearity well |
OCSVM Use Cases
OCSVM is particularly well suited to problems in which anomalies are rare, undefined, or continually evolving:
| Use Case | Industry | Why OCSVM over SVM |
|---|---|---|
| Manufacturing defect detection | Automotive, electronics | Defects are rare (< 0.1%) and come in unpredictable forms |
| Network intrusion detection | Cybersecurity | New attack types emerge constantly—can’t label them in advance |
| Credit card fraud detection | Finance | Fraud is < 0.01% of transactions; fraudsters change tactics |
| Predictive maintenance | Manufacturing, energy | Machines rarely fail, abundant healthy data, minimal failure data |
| IoT sensor anomaly detection | Smart buildings, agriculture | Continuous stream of normal readings; anomalies are diverse |
| Medical device monitoring | Healthcare | Train on healthy patients, flag unusual vital signs |
Practical Decision Guide: When to Use Which
The decision between SVM and OCSVM for a new problem can be approached through the following sequence of questions:
Question 1: Are labelled examples available from both classes?
- Yes → Consider SVM. The data permits training of a supervised classifier.
- No → Use OCSVM. Learning is possible only from the available class.
Question 2: Is one class extremely rare (less than 1% of the data)?
- Yes → OCSVM is likely the better choice. Even when some labelled anomalies are available, the extreme imbalance degrades SVM performance unless heavy resampling is applied.
- No → SVM with appropriate class weighting should perform well.
Question 3: Is the objective classification or anomaly detection?
- Classification (assigning examples to known categories) → SVM.
- Anomaly detection (identifying examples that do not belong) → OCSVM.
Question 4: Does the abnormal class have a clear, stable definition?
- Yes (for example, spam exhibits consistent patterns) → SVM can learn these patterns.
- No (for example, novel attacks or unprecedented failures) → OCSVM, since it does not require explicit knowledge of how anomalies appear.
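The four questions can be encoded as a short function for use in a project checklist. The sketch below is purely illustrative: recommend_method, its argument names, and the "anomaly_detection" goal label are hypothetical, and the 1% threshold is the one stated in Question 2. The questions are applied in order, and the first answer pointing to OCSVM decides.

```python
def recommend_method(has_both_class_labels, minority_fraction,
                     goal, anomaly_definition_stable):
    """Hypothetical helper applying Questions 1 to 4 in order."""
    if not has_both_class_labels:
        return "OCSVM"    # Question 1: only one class can be learned from
    if minority_fraction < 0.01:
        return "OCSVM"    # Question 2: one class is extremely rare
    if goal == "anomaly_detection":
        return "OCSVM"    # Question 3: the task is to find what does not belong
    if not anomaly_definition_stable:
        return "OCSVM"    # Question 4: the abnormal class keeps changing
    return "SVM"

# 10K spam + 10K ham, stable spam patterns
spam = recommend_method(True, 0.5, "classification", True)
# 1M normal transactions and 50 fraud cases
fraud = recommend_method(True, 50 / 1_000_050, "anomaly_detection", False)
# New machine with no failure data at all
machine = recommend_method(False, 0.0, "anomaly_detection", False)
print(spam, fraud, machine)   # SVM OCSVM OCSVM
```

A real decision would weigh these factors rather than short-circuit on the first one, so the function is best read as a summary of the questions, not a replacement for judgement.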
Scenario Recommendations
| Scenario | Recommendation | Reason |
|---|---|---|
| 10K spam + 10K ham emails | SVM | Balanced labeled data available |
| 1M normal transactions, 50 fraud cases | OCSVM | Extreme imbalance, fraud evolves |
| Tumor vs healthy tissue (labeled) | SVM | Both classes labeled by pathologists |
| Monitoring a new machine (no failure data) | OCSVM | Only healthy operation data exists |
| Sentiment analysis (positive/negative) | SVM | Large labeled corpora available |
| Detecting unknown malware variants | OCSVM | New variants are undefined a priori |
| Dog vs cat image classifier | SVM | Clear binary task with labeled images |
| Rare disease screening in population | OCSVM | Disease prevalence < 0.01% |
Advanced Topics
SVDD: Support Vector Data Description
SVDD, proposed by Tax and Duin (2004), is closely related to OCSVM. Where OCSVM identifies a hyperplane in feature space that separates the data from the origin, SVDD identifies the minimum enclosing hypersphere that contains most of the data. Points outside the sphere are anomalies.
In practice, SVDD with an RBF kernel produces results identical to those of OCSVM (the two are mathematically equivalent under Gaussian kernels). The principal difference is conceptual: SVDD frames the problem in terms of spheres, while OCSVM frames it in terms of hyperplanes. Most practitioners use OCSVM via scikit-learn because of its wider availability.
Multi-Class SVM
Standard SVM is inherently binary, but two strategies extend it to multi-class problems:
- One-vs-Rest (OvR): Train K binary classifiers, each separating one class from all others. Assign the class with the highest decision function value. K classifiers are required.
- One-vs-One (OvO): Train K(K-1)/2 binary classifiers, one for each pair of classes, and use majority voting. This is the default for scikit-learn’s SVC and often performs better in practice, though more models must be trained.
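The classifier counts for the two strategies can be confirmed in scikit-learn. The sketch below assumes the bundled digits dataset (truncated to 500 samples for speed) and compares the number of pairwise decision values exposed by SVC with the number of estimators built by OneVsRestClassifier.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]        # small subset, purely to keep the example fast
K = len(set(y))                # number of classes present

# SVC trains one classifier per pair of classes; "ovo" exposes the raw pairwise scores.
ovo = SVC(kernel="rbf", gamma="scale", decision_function_shape="ovo").fit(X, y)
n_ovo = ovo.decision_function(X[:1]).shape[1]

# OneVsRestClassifier wraps a binary SVC and trains one classifier per class.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)
n_ovr = len(ovr.estimators_)

print(f"K = {K}: OvO classifiers = {n_ovo}, OvR classifiers = {n_ovr}")
```

Note that SVC's default decision_function_shape is "ovr", which only reshapes the output; the underlying training is still pairwise.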
Deep SVDD: Neural Networks and OCSVM
Deep SVDD (Ruff et al., 2018) replaces the kernel trick with a deep neural network. Instead of mapping data to a kernel-defined feature space and identifying a hypersphere, it trains a neural network to map data into a learned representation space in which normal data clusters tightly around a centre point. The loss function minimises the distance from the centre of normal data representations.
This approach scales considerably better than kernel-based OCSVM and can handle high-dimensional data such as images and time series. Libraries such as PyOD include a Deep SVDD implementation.
OCSVM Alternatives: Isolation Forest and LOF
| Method | Approach | Scalability | Best For |
|---|---|---|---|
| OCSVM | Kernel-based boundary | O(n²) to O(n³); up to ~50K | Small-medium data, smooth boundaries |
| Isolation Forest | Random tree partitioning | O(n log n); millions | Large datasets, tabular data |
| LOF | Local density comparison | O(n²); up to ~50K | Varying density clusters |
| Autoencoder | Reconstruction error | Depends on architecture | High-dimensional data (images, sequences) |
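The three scikit-learn detectors in the table share a common interface, which makes a quick comparison straightforward. The sketch below assumes synthetic data with a single Gaussian normal class and uniformly scattered anomalies; the hyperparameters are illustrative rather than tuned, so the numbers it prints say little about the methods in general.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(500, 2))          # normal data only
test_normal = rng.normal(0, 1, size=(200, 2))
test_anomalies = rng.uniform(-6, 6, size=(40, 2))

scaler = StandardScaler().fit(train)
X_train = scaler.transform(train)
X_test = scaler.transform(np.vstack([test_normal, test_anomalies]))
y_test = np.r_[np.ones(200), -np.ones(40)]       # +1 = normal, -1 = anomaly

detectors = {
    "OCSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
    "Isolation Forest": IsolationForest(n_estimators=100, random_state=0),
    # novelty=True lets LOF score unseen points after fitting on normal data.
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
}

aucs = {}
for name, detector in detectors.items():
    detector.fit(X_train)
    # For all three, higher decision_function values mean "more normal".
    aucs[name] = roc_auc_score(y_test, detector.decision_function(X_test))
    print(f"{name:<17} AUC = {aucs[name]:.3f}")
```

Because the interface is identical, swapping OCSVM for Isolation Forest when a dataset outgrows kernel methods is usually a one-line change.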
OCSVM for Time-Series Anomaly Detection
OCSVM does not natively handle time-series data, but with appropriate feature engineering it becomes an effective time-series anomaly detector. The standard procedure is as follows:
- Sliding window: Convert the time series into fixed-length windows (for example, 60-second windows).
- Feature extraction: For each window, compute statistical features—mean, standard deviation, minimum, maximum, skewness, kurtosis, spectral features, and rolling statistics.
- Train OCSVM: Fit on feature vectors drawn from known-normal periods.
- Detect: Score new windows; those below the decision threshold are flagged as anomalies.
# Time-series anomaly detection with OCSVM
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
def extract_features(window):
    """Extract statistical features from a time-series window."""
    return [
        np.mean(window), np.std(window),
        np.min(window), np.max(window),
        np.percentile(window, 25), np.percentile(window, 75),
        np.max(window) - np.min(window),      # range
        np.mean(np.abs(np.diff(window))),     # mean abs change
    ]
# Simulate normal time series + anomaly
np.random.seed(42)
normal_ts = np.sin(np.linspace(0, 20*np.pi, 2000)) + np.random.randn(2000)*0.1
anomaly_ts = np.sin(np.linspace(0, 2*np.pi, 100)) + np.random.randn(100)*0.5 + 3
# Sliding window feature extraction
window_size = 50
stride = 10
features_normal = [
extract_features(normal_ts[i:i+window_size])
for i in range(0, len(normal_ts)-window_size, stride)
]
features_anomaly = [
extract_features(anomaly_ts[i:i+window_size])
for i in range(0, len(anomaly_ts)-window_size, stride)
]
X_normal = np.array(features_normal)
X_anomaly = np.array(features_anomaly)
scaler = StandardScaler()
X_normal_s = scaler.fit_transform(X_normal)
X_anomaly_s = scaler.transform(X_anomaly)
ocsvm = OneClassSVM(kernel='rbf', gamma=0.1, nu=0.05)
ocsvm.fit(X_normal_s)
print(f"Normal windows flagged as anomaly: "
f"{(ocsvm.predict(X_normal_s) == -1).sum()}/{len(X_normal_s)}")
print(f"Anomaly windows detected: "
f"{(ocsvm.predict(X_anomaly_s) == -1).sum()}/{len(X_anomaly_s)}")
Performance Comparison
How do these methods compare on standard anomaly detection benchmarks? The following table summarises typical performance across commonly used datasets. Exact figures vary with preprocessing and hyperparameter choices, although the relative rankings tend to be similar across studies:
| Method | Shuttle (AUC) | Thyroid (AUC) | Satellite (AUC) | Training Time |
|---|---|---|---|---|
| OCSVM (RBF) | 0.995 | 0.920 | 0.850 | Medium |
| Isolation Forest | 0.997 | 0.940 | 0.830 | Fast |
| LOF | 0.540 | 0.910 | 0.820 | Medium |
| Autoencoder | 0.985 | 0.935 | 0.880 | Slow |
| SVM (supervised) | 0.999 | 0.980 | 0.920 | Medium |
Key observations:
- Supervised SVM consistently outperforms all unsupervised methods, but it requires labelled anomalies, which are often unavailable.
- OCSVM performs competitively with Isolation Forest on most benchmarks, with the additional advantage of producing a smooth decision boundary.
- Isolation Forest is typically the first choice for large datasets owing to its O(n log n) complexity.
- OCSVM is particularly effective when the normal data has a clear, compact structure in feature space.
Computational Complexity and Scalability
Both SVM and OCSVM have a training complexity of O(n²) to O(n³), where n denotes the number of training samples. This arises from solving a quadratic programming problem. In practice:
- Up to 10,000 samples: Both train in seconds to minutes without concern.
- 10,000 to 50,000 samples: Training takes minutes to an hour, and remains feasible.
- 50,000 to 100,000 samples: Training may take hours. Subsampling or approximate methods should be considered.
- Above 100,000 samples: Direct application is impractical without workarounds.
Workarounds at this scale include sklearn.linear_model.SGDOneClassSVM for linear OCSVM; Nystroem or RBFSampler, which approximate the kernel with explicit feature maps and allow subsequent use of a linear SVM; or switching to Isolation Forest, which handles millions of samples efficiently.
Hyperparameter Tuning Guide
Appropriate hyperparameter settings often determine whether a model works at all. The tables below give a starting value and a search range for each parameter:
Tuning SVM
| Parameter | What It Controls | Starting Value | Search Range |
|---|---|---|---|
| C | Regularization—trade-off between margin width and misclassification penalty | 1.0 | [0.001, 0.01, 0.1, 1, 10, 100, 1000] |
| kernel | Shape of the decision boundary | ‘rbf’ | [‘rbf’, ‘poly’, ‘linear’] |
| γ (gamma) | RBF kernel width—controls influence radius of each point | ‘scale’ (= 1/(n_features * X.var())) | [0.001, 0.01, 0.1, 1, 10, ‘scale’, ‘auto’] |
Use GridSearchCV or RandomizedSearchCV with 5-fold cross-validation. The appropriate metric depends on the problem: accuracy for balanced classes, F1 for imbalanced classes, and AUC-ROC when threshold-independent evaluation is desired.
Tuning OCSVM
| Parameter | What It Controls | Starting Value | Search Range |
|---|---|---|---|
| ν (nu) | Upper bound on outlier fraction, lower bound on SV fraction | 0.05 | [0.001, 0.01, 0.03, 0.05, 0.1, 0.2] |
| kernel | Shape of the boundary around normal data | ‘rbf’ | [‘rbf’, ‘poly’] |
| γ (gamma) | Boundary tightness, most sensitive parameter | ‘scale’ | [0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0] |
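OCSVM is trained without labels, so the usual cross-validated search cannot score it directly. One common workaround, sketched below under the assumption that a small labelled validation set with a few known anomalies is available, is to fit each candidate on normal data and rank candidates by validation AUC. The data are synthetic and the grid is a reduced version of the ranges above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Synthetic data (illustrative assumption): normal-only training set and a
# labelled validation set that mixes normal points with a few anomalies.
X_train = rng.normal(size=(400, 3))
X_val = np.vstack([rng.normal(size=(150, 3)), rng.uniform(-5, 5, size=(30, 3))])
y_val = np.r_[np.zeros(150), np.ones(30)]  # 1 = anomaly

scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

grid = ParameterGrid({"nu": [0.01, 0.05, 0.1], "gamma": [0.01, 0.1, 1.0]})

results = []
for params in grid:
    model = OneClassSVM(kernel="rbf", **params).fit(X_train_s)
    # Negate decision_function so that larger values mean "more anomalous".
    auc = roc_auc_score(y_val, -model.decision_function(X_val_s))
    results.append((auc, params))

best_auc, best_params = max(results, key=lambda r: r[0])
print(f"Best: {best_params} -> validation AUC = {best_auc:.3f}")
```

When no labelled anomalies exist at all, this approach is unavailable, and parameter choices have to rest on domain knowledge about the expected outlier fraction.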
Grid Search and Random Search
For SVM with three parameters (C, γ, kernel), a full grid search over the ranges above covers 7 × 7 × 3 = 147 combinations, each of which must be fitted once per CV fold (735 fits with 5-fold cross-validation). Random search (Bergstra and Bengio, 2012) often finds good hyperparameters more quickly by sampling random combinations, particularly when certain parameters matter more than others. In this setting, γ almost always carries more weight than the remaining parameters.
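from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Sample C and gamma on a log scale, since both span several orders of magnitude
param_dist = {
    'C': loguniform(0.01, 1000),
    'gamma': loguniform(0.001, 10),
    'kernel': ['rbf', 'poly'],
}
# 50 random draws, each scored by 5-fold cross-validated F1
random_search = RandomizedSearchCV(
    SVC(), param_dist, n_iter=50, cv=5,
    scoring='f1', random_state=42, n_jobs=-1
)
# Assumes X_train_scaled and y_train are already defined
random_search.fit(X_train_scaled, y_train)
print(f"Best: {random_search.best_params_} → F1={random_search.best_score_:.3f}")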
Common Pitfalls
The following mistakes recur frequently among practitioners using these algorithms:
Using SVM Without Labelled Anomalies
The mistake is straightforward in principle but common in practice. A team aims to detect anomalies, selects SVM out of familiarity, and then either fabricates anomaly labels or uses the few available anomalies as a tiny minority class. The resulting model performs poorly because SVM requires representative examples from both classes. When labelled anomalies are unavailable—and in most anomaly detection problems they are not—OCSVM should be used instead.
Setting ν Too Low or Too High
Setting ν = 0.001 when the training data contains 5% contamination causes the model to enclose nearly everything, including real anomalies, within the normal boundary. Setting ν = 0.5 produces a boundary so tight that roughly half of the normal training data is flagged. The value of ν should match the best available estimate of contamination; when uncertain, prefer a moderately higher value (0.05 is a reasonable default).
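The effect of ν is easy to check empirically. The sketch below, on synthetic Gaussian data (an assumption for illustration), fits OCSVM at several ν values and reports the share of training points that end up outside the boundary; in practice that share tends to track ν closely.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))  # synthetic "normal" data, already on unit scale

flagged = {}
for nu in [0.01, 0.05, 0.2, 0.5]:
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    # predict returns -1 for points outside the learned boundary.
    flagged[nu] = (model.predict(X) == -1).mean()
    print(f"nu = {nu:<5} -> {flagged[nu]:.1%} of training points flagged")
```

Running the same loop on real training data is a quick sanity check that the chosen ν yields a false-alarm rate the application can tolerate.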
Failing to Scale Features
This is the most common mistake encountered with SVM and OCSVM. Both algorithms are based on distances (through their kernels), and features of larger magnitude will dominate. Features should always be standardised (zero mean, unit variance) before training. Use StandardScaler and fit it on training data only:
# CORRECT: fit on training data, transform both
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test) # use training statistics!
# WRONG: fitting scaler on test data leaks information
# scaler.fit_transform(X_test) # NEVER do this
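The same leakage can occur inside cross-validation if the scaler is fitted once on the full training set before the folds are split. A minimal sketch of one way to avoid this is to wrap the scaler and the model in a scikit-learn Pipeline, so that the scaler is refitted on the training portion of every fold. The dataset here is synthetic, with one feature deliberately inflated to mimic mismatched units.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustrative assumption); one feature is multiplied by
# 1000 to imitate a column measured in much larger units.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000.0

# The pipeline refits StandardScaler on each fold's training split only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

The same pipeline object can be passed to GridSearchCV or RandomizedSearchCV, with parameters addressed as svc__C and svc__gamma.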
Using a Linear Kernel on Nonlinear Data
A linear kernel produces a hyperplane decision boundary. If the classes are arranged in concentric circles, spirals, or any other nonlinear pattern, a linear kernel will fail outright. When in doubt, RBF is the preferred starting point: it can approximate linear boundaries with appropriate γ, so little is lost by defaulting to it.
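The concentric-circles case can be reproduced in a few lines. The sketch below uses scikit-learn's make_circles generator with illustrative settings and compares test accuracy for a linear and an RBF kernel; the features are already on a comparable scale, so scaling is omitted here for brevity.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two classes arranged as an inner and an outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

acc = {}
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, gamma="scale", C=1.0).fit(X_train, y_train)
    acc[kernel] = clf.score(X_test, y_test)
    print(f"{kernel:>6} kernel: test accuracy = {acc[kernel]:.3f}")
```

No straight line can separate an inner ring from an outer one, so the linear kernel stays near chance level while the RBF kernel separates the rings almost perfectly.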
Failing to Tune γ
The γ parameter for the RBF kernel is arguably the most important and most sensitive hyperparameter in both SVM and OCSVM. The default (‘scale’ in scikit-learn) is reasonable but rarely optimal. γ should always be included in the hyperparameter search. Small changes in γ can produce substantial changes in model behaviour; the difference between a working model and an ineffective one can amount to a factor of two in γ.
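A sweep over γ makes this sensitivity visible. The following sketch uses scikit-learn's validation_curve on a synthetic two-moons dataset (an illustrative assumption) and prints mean training and validation accuracy for each γ on a logarithmic grid.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

gammas = np.logspace(-3, 3, 7)  # 0.001 ... 1000
train_scores, val_scores = validation_curve(
    pipe, X, y, param_name="svc__gamma", param_range=gammas, cv=5
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for g, tr, va in zip(gammas, train_mean, val_mean):
    print(f"gamma = {g:>8.3f}  train = {tr:.3f}  validation = {va:.3f}")
```

A large gap between training and validation accuracy at high γ signals overfitting, while low scores on both at very small γ signal underfitting. A finer grid around the best coarse value is the natural next step.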
Training OCSVM on Contaminated Data
OCSVM assumes that its training data is “normal.” When anomalies enter the training set, which occurs frequently in practice, the model learns an overly permissive boundary that incorporates those anomalies as normal. Mitigation strategies include careful curation of training data, setting ν to a small but nonzero value at or slightly above the estimated contamination rate so that contaminating points can fall outside the boundary, and pre-filtering of obvious outliers before training.
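The effect of contamination can be demonstrated with a small experiment. In the sketch below, all data are synthetic and the parameter values are illustrative assumptions: one model is trained on clean normal data, another on the same data plus a tight cluster of anomalies, and both are then asked to detect fresh anomalies of the same type.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Clean normal data, plus a contaminated copy that also contains a tight
# cluster of anomalies (about 9% of the contaminated training set).
X_clean = rng.normal(0.0, 1.0, size=(500, 2))
X_leak = rng.normal(loc=4.0, scale=0.3, size=(50, 2))
X_contaminated = np.vstack([X_clean, X_leak])

# Fresh anomalies of the same type, unseen during training.
X_new_anomalies = rng.normal(loc=4.0, scale=0.3, size=(100, 2))

detected = {}
for label, X_fit in [("clean", X_clean), ("contaminated", X_contaminated)]:
    # Same hyperparameters for both models so only the training data differs.
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.02).fit(X_fit)
    detected[label] = (model.predict(X_new_anomalies) == -1).mean()
    print(f"Trained on {label} data: {detected[label]:.0%} of new anomalies detected")
```

The contaminated model treats the leaked cluster as a dense region of normal behaviour, which is why later anomalies of the same kind pass unnoticed.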
Putting It Together
SVM and OCSVM share a name, a mathematical foundation, and a kernel-based approach to learning, but they address fundamentally different problems. SVM is a supervised classifier that requires labelled examples from both classes to draw a separating boundary between them. OCSVM is a semi-supervised anomaly detector that requires only normal data to draw a boundary around the normal class.
The choice between them is not a matter of which is preferable in general, but of which matches the problem:
- If labelled data from both classes is available, use SVM. It will almost always outperform OCSVM, since it uses more information.
- If only normal data is available, or anomalies are too rare and diverse to label, OCSVM is the appropriate tool. It builds a model of normality and detects anything unusual, including anomaly types not previously observed.
- If scaling to millions of samples is required, consider Isolation Forest or SGD-based variants in place of kernel SVM or OCSVM.
Several essential practices apply throughout: scale features, tune γ and C (or ν), start with an RBF kernel unless a specific reason argues otherwise, and validate the model as rigorously as the labelled data permits. With these principles in place, choosing between the two variants for a given classification or anomaly detection problem becomes straightforward.
If the two were easy to conflate before, the basis for distinguishing them, and the circumstances in which each is appropriate, should now be clear.
References
- Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag.
- Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (2001). “Estimating the Support of a High-Dimensional Distribution.” Neural Computation, 13(7), 1443-1471.
- Tax, D. M. J., & Duin, R. P. W. (2004). “Support Vector Data Description.” Machine Learning, 54(1), 45-66.
- Ruff, L., et al. (2018). “Deep One-Class Classification.” Proceedings of the 35th International Conference on Machine Learning (ICML).
- Bergstra, J., & Bengio, Y. (2012). “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research, 13, 281-305.
- Pedregosa, F., et al. (2011). “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, 12, 2825-2830.
- Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). “Isolation Forest.” Proceedings of the 8th IEEE International Conference on Data Mining.
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). “LOF: Identifying Density-Based Local Outliers.” Proceedings of the 2000 ACM SIGMOD.
- scikit-learn documentation: Support Vector Machines.
- scikit-learn documentation: Novelty and Outlier Detection.