Your fraud detector achieves 99.9% accuracy. Sounds great—until you realize 99.9% of transactions are legitimate, and your model just flags everything as “normal.” Accuracy is a lie when anomalies are rare. Picking the wrong metric is the #1 mistake in anomaly detection.
This guide walks through every metric that matters for anomaly detection—from the basics like Precision and Recall, to threshold-independent ranking metrics like AUROC and AUPRC, to specialized time-series metrics like PA-F1 and VUS. We’ll cover the formulas, the trade-offs, and full Python implementations you can drop into a project today.
Summary
What this post covers: A complete reference for choosing and computing anomaly detection metrics — Precision, Recall, F1, FAR, MCC, AUROC, AUPRC, time-series variants, and Top-K — with formulas, trade-offs, and Python implementations aimed at ML engineers building rare-event detectors (fraud, intrusion, defects, biometrics).
Key insights:
- Accuracy is degenerate when anomalies are rare — a constant “normal” predictor can score 99.9% — so the first decision in any anomaly-detection project is to discard accuracy as the headline metric.
- For severely imbalanced data (anomalies under 1%), AUPRC is the primary ranking metric and AUROC is secondary; AUROC can look misleadingly high on heavily imbalanced data because the huge TN count in the FPR denominator keeps FPR small even when false positives are numerous.
- Different stakeholders need different metrics on the same model — engineers care about AUROC/AUPRC, operations about FAR and alert volume, finance about dollar-weighted recall — so a single number is always a stakeholder choice in disguise.
- Standard point-wise F1 breaks for time-series anomalies because real anomalies are contiguous events, not isolated samples; use range-based F1, VUS, or NAB Score instead.
- Most production teams should report a small bundle — AUPRC + Precision@K + Recall + FAR — which covers model quality, operational alert volume, miss rate, and false-alarm rate together.
Main topics: why anomaly metrics matter, the confusion matrix foundation, threshold-dependent metrics, threshold-independent metrics, a decision framework for picking metrics, time-series-specific metrics, Top-K ranking metrics, Python implementations, threshold selection for production, common pitfalls, and domain reporting templates.
Why Anomaly Detection Metrics Matter (and Why Accuracy Doesn’t)
Suppose you build a fraud detector and proudly announce it hits 99.9% accuracy. Your manager is impressed. The board is impressed. You’re a genius. Then someone asks how many actual fraud cases it caught last quarter, and the answer is—none. Zero. Your model achieves 99.9% accuracy by predicting “not fraud” on every single transaction, because in a payment processor, fraud rates hover around 0.1%. The “model” is a constant. The accuracy is real. And it’s worthless.
This is the foundational truth of anomaly detection: the positive class (the anomaly) is rare. Sometimes extremely rare. Network intrusions, manufacturing defects, credit-card fraud, rare diseases—all of them follow base rates from 0.01% to maybe 5%. When the negative class dominates, accuracy becomes a degenerate metric. A model that predicts “normal” for everything will look brilliant.
That’s the imbalance problem. But there’s a second, equally important issue: cost asymmetry. Missing a true anomaly (false negative) almost always costs more than flagging a legitimate event by mistake (false positive). A missed credit-card fraud could cost $5,000. An unnecessary alert costs maybe 30 seconds of an analyst’s time. These aren’t symmetric mistakes, and the metric you choose has to reflect that asymmetry.
Different stakeholders care about different metrics for the same model:
- The ML engineer wants AUROC and AUPRC to compare model architectures.
- The product manager wants Precision@K because the UI shows the top 50 alerts per day.
- The operations lead wants False Alarm Rate (FAR) and Mean Time To Detect (MTTD) because their analysts have to triage every alert.
- The CFO wants dollar-weighted recall, the fraction of fraud value caught, not just the count.
If you pick a single number to optimize, you’re implicitly making a stakeholder choice. The right answer is to report a small set of complementary metrics so each audience sees what they need.
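To make the CFO's metric concrete, here is a minimal sketch of dollar-weighted recall next to ordinary count-based recall. The `fraud_cases` list and both function names are hypothetical, chosen only for illustration; the point is that the two numbers can diverge sharply on the same model.

```python
# Hypothetical fraud cases: (amount_in_dollars, was_caught_by_model)
fraud_cases = [
    (5000.0, True),
    (120.0, False),
    (80.0, False),
    (2400.0, True),
    (60.0, False),
]

def count_recall(cases):
    """Fraction of fraud cases caught, ignoring their dollar value."""
    return sum(1 for _, caught in cases if caught) / len(cases)

def dollar_weighted_recall(cases):
    """Fraction of total fraud value caught."""
    total = sum(amount for amount, _ in cases)
    caught = sum(amount for amount, was_caught in cases if was_caught)
    return caught / total

print(f"Count recall  = {count_recall(fraud_cases):.3f}")
print(f"Dollar recall = {dollar_weighted_recall(fraud_cases):.3f}")
```

In this toy data the model catches only two of five cases, yet those two carry most of the money, so the two stakeholders would read the same detector very differently.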
The Confusion Matrix Foundation
Every metric in this guide is built from four numbers—the cells of the confusion matrix. By convention, in anomaly detection the anomaly is the positive class and the normal point is the negative class.
| Term | Definition | Fraud Example |
|---|---|---|
| True Positive (TP) | Model predicts anomaly, truly is anomaly | Caught a fraudulent transaction |
| False Positive (FP) | Model predicts anomaly, truly is normal | Flagged a legitimate purchase |
| True Negative (TN) | Model predicts normal, truly is normal | Correctly cleared a normal payment |
| False Negative (FN) | Model predicts normal, truly is anomaly | Missed a fraudulent transaction |
Here’s a worked example. Imagine 10,000 credit-card transactions where 100 are fraudulent (1% anomaly rate) and your model predicts as follows:
| | Predicted anomaly | Predicted normal | Total |
|---|---|---|---|
| Actual anomaly | TP = 95 | FN = 5 | 100 |
| Actual normal | FP = 30 | TN = 9,870 | 9,900 |
From the cells above, every metric we discuss in this guide is derivable. Note something important: the accuracy for this model is (95 + 9870) / 10000 = 99.65%, which sounds excellent. But a constant “always normal” model would score 99.0%. The lift from a real model is just 0.65 percentage points. If you compare two models on accuracy alone, you’ll learn almost nothing.
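The arithmetic can be checked in a few lines. This sketch uses the cell counts implied by the figures quoted for this example (TP = 95, FP = 30, TN = 9,870, FN = 5); the variable names are illustrative.

```python
# Cell counts implied by the worked example
TP, FP, TN, FN = 95, 30, 9870, 5
total = TP + FP + TN + FN

model_accuracy = (TP + TN) / total
# A constant "always normal" predictor gets every normal right
# and every anomaly wrong.
baseline_accuracy = (TN + FP) / total

precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(f"Model accuracy    = {model_accuracy:.4f}")
print(f"Baseline accuracy = {baseline_accuracy:.4f}")
print(f"Accuracy lift     = {100 * (model_accuracy - baseline_accuracy):.2f} points")
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```

Accuracy moves by well under one point between a useless model and a useful one, while precision and recall move from undefined and zero to 0.76 and 0.95.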
The fundamental trade-off in any threshold-based detector is this: lower the threshold → catch more anomalies (TP↑) but also flag more normals (FP↑). Raise the threshold → fewer false alarms (FP↓) but you’ll miss anomalies (FN↑). Every metric in this guide either picks one threshold and reports performance there, or sweeps over all thresholds and summarizes the trade-off.
Threshold-Dependent Metrics: Precision, Recall, F1, FAR, MCC
These metrics require you to commit to one decision threshold (typically 0.5 for probabilities, or some calibrated value for anomaly scores). Once committed, you can compute the four-cell confusion matrix and derive everything below.
Precision—How Pure Are My Alerts?
Precision = TP / (TP + FP). It answers: “Of everything I flagged as anomalous, how many actually were?” In our worked example, Precision = 95/125 = 0.76. That means 76% of the alerts were real fraud, and 24% were false alarms.
When precision matters most:
- Alert fatigue. If a SOC analyst gets 100 alerts a day and 90 are wrong, they stop trusting the system. Precision = 0.10.
- Costly interventions. If acting on an alert means freezing a customer’s account, you’d better be right.
- Limited human review capacity. When you can only investigate the top 50 cases, you want those to be high-quality.
Recall (Sensitivity, True Positive Rate)—How Many Did I Catch?
Recall = TP / (TP + FN). “Of all true anomalies, how many did I catch?” In our example, Recall = 95/100 = 0.95. That’s a 95% catch rate.
When recall matters most:
- Catastrophic miss costs. Cancer screening. Cybersecurity intrusions. Aircraft engine faults. Missing one is unacceptable.
- Rare but serious anomalies. When the cost of FN dwarfs the cost of FP.
- Compliance and regulatory contexts. Anti-money-laundering regulations effectively mandate high recall.
F1 Score: The Compromise
F1 = 2·P·R / (P + R). It’s the harmonic mean of Precision and Recall, designed so a low score in either drags the F1 down. In our example, F1 = 2·(0.76)(0.95)/(0.76+0.95) = 0.844.
Why harmonic mean and not arithmetic? Because Precision = 1.0 and Recall = 0.01 (you flagged exactly one true anomaly out of 100) shouldn’t average to 0.505—that’s misleading. The harmonic mean gives 0.0198, which is closer to the truth: this model is bad.
For asymmetric costs, use F-beta:
Fβ = (1 + β²) · P · R / (β²·P + R)
- β = 1 → standard F1, equal weight
- β = 2 → F2, recall weighted twice as much (medical, security)
- β = 0.5 → F0.5, precision weighted twice as much (alert fatigue contexts)
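A small sketch of the F-beta formula, applied to the worked example's precision and recall (0.76 and 0.95) and to the degenerate case discussed above. The function name `f_beta` is an illustrative choice, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: score 0.0 when precision and recall are both 0
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

p, r = 0.76, 0.95  # precision and recall from the worked example
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.4f}")

# Degenerate detector: perfect precision, 1% recall
print(f"Harmonic mean (F1) = {f_beta(1.0, 0.01):.4f}")
print(f"Arithmetic mean    = {(1.0 + 0.01) / 2:.4f}")
```

Because recall exceeds precision in the worked example, the recall-weighted F2 comes out highest and the precision-weighted F0.5 lowest.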
Specificity (TNR) and False Alarm Rate (FAR/FPR)
Specificity = TN / (TN + FP). The fraction of true normals correctly left alone. FAR (= FPR = 1 − Specificity) is the fraction of normals that got flagged. In our example FAR = 30/9900 = 0.30%.
FAR is the metric your operations team will actually quote. If you process 1 million events per day at FAR = 0.5%, that’s 5,000 false alarms daily—completely unworkable. Most operational systems target FAR < 0.1% or even < 0.01%, and accept whatever recall results.
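The alert-volume arithmetic is worth scripting, because it also shows what a given FAR does to precision. This is a rough sketch: both function names are illustrative, and the detector in the last line (0.1% anomaly rate, 80% recall) is a hypothetical chosen for the example.

```python
def daily_false_alarms(normal_events_per_day, far):
    """Expected false alarms per day = number of normal events x FAR."""
    return normal_events_per_day * far

def alert_precision(events_per_day, anomaly_rate, recall, far):
    """Precision implied by a given recall and FAR at a given base rate."""
    anomalies = events_per_day * anomaly_rate
    normals = events_per_day - anomalies
    tp = anomalies * recall
    fp = normals * far
    return tp / (tp + fp)

# Figures from the text: about 1M (almost all normal) events per day
for far in (0.005, 0.001, 0.0001):
    alarms = daily_false_alarms(1_000_000, far)
    print(f"FAR = {far:.2%}: ~{alarms:,.0f} false alarms/day")

# Hypothetical detector: 0.1% anomaly rate, 80% recall, 0.5% FAR
print(f"Implied precision = {alert_precision(1_000_000, 0.001, 0.80, 0.005):.3f}")
```

Even with 80% recall, a 0.5% FAR at a 0.1% base rate leaves most alerts false, which is why operations teams push FAR down by orders of magnitude.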
False Reject Rate (FRR)
FRR = FN / (FN + TP) = 1 − Recall. This is biometrics terminology: in face recognition or fingerprint authentication, FRR is the fraction of legitimate users incorrectly rejected. The “False Acceptance Rate” in biometrics is the same as FAR/FPR here.
Matthews Correlation Coefficient (MCC)
MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Range: [−1, +1]. +1 = perfect, 0 = random, −1 = perfectly wrong. Unlike F1, MCC uses all four cells of the confusion matrix and remains informative even when the imbalance is severe. It’s particularly useful when you want a single, balanced number that doesn’t get fooled by a model that just predicts the majority class.
Balanced Accuracy
Balanced Accuracy = (Sensitivity + Specificity) / 2. A simple average of the per-class accuracies. The “always normal” model gets 50% balanced accuracy, regardless of the imbalance. Use this when you want an accuracy-like number that doesn’t reward majority-class prediction.
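To see how MCC and balanced accuracy treat a majority-class predictor, the sketch below evaluates both on the worked example's cells and on an "always normal" model over the same 10,000 transactions. Returning 0.0 for MCC when the denominator is zero is an assumed convention (the formula itself is undefined there), and the function names are illustrative.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient.
    Assumed convention: return 0.0 when a row or column of the matrix is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

worked = dict(tp=95, fp=30, tn=9870, fn=5)
always_normal = dict(tp=0, fp=0, tn=9900, fn=100)

for name, cells in (("worked example", worked), ("always normal", always_normal)):
    print(f"{name:15s} MCC = {mcc(**cells):.4f}  "
          f"BalancedAcc = {balanced_accuracy(**cells):.4f}")
```

The constant predictor that scored 99% on plain accuracy drops to 0 on MCC and 0.5 on balanced accuracy.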
| Metric | Formula | Range | When to Use |
|---|---|---|---|
| Precision | TP / (TP + FP) | [0, 1] | Alert fatigue, costly interventions |
| Recall (TPR, Sensitivity) | TP / (TP + FN) | [0, 1] | Catastrophic miss costs, security, medical |
| F1 | 2PR / (P + R) | [0, 1] | Single threshold, balanced trade-off |
| Fβ | (1+β²)PR / (β²P+R) | [0, 1] | Asymmetric costs (β>1: recall, β<1: precision) |
| Specificity (TNR) | TN / (TN + FP) | [0, 1] | Medical screening (avoid false positives) |
| FAR (FPR) | FP / (FP + TN) | [0, 1] | Operations, alert volume control |
| FRR (FNR) | FN / (FN + TP) | [0, 1] | Biometrics |
| MCC | see formula above | [−1, 1] | Balanced single number for imbalanced data |
| Balanced Accuracy | (TPR + TNR) / 2 | [0, 1] | Accuracy-like, imbalance-aware |
| AUROC | ∫TPR d(FPR) | [0, 1] | Threshold-free comparison, mild imbalance |
| AUPRC (AP) | ∫P d(R) | [0, 1] | Severe imbalance—preferred over AUROC |
Threshold-Independent Metrics: AUROC, AUPRC, DET
The metrics above all assume you’ve picked a threshold. But during model development, you usually want a single number that summarizes the model’s quality across all possible thresholds. That’s where ranking metrics come in.
ROC Curve and AUROC
The Receiver Operating Characteristic (ROC) curve plots TPR (y-axis) against FPR (x-axis) as the threshold varies. Each point on the curve corresponds to a different decision threshold. The area under this curve, AUROC, has a beautiful probabilistic interpretation:
AUROC = P(score(positive) > score(negative))
“If I randomly draw one anomaly and one normal, AUROC is the probability the model scores the anomaly higher.” 0.5 is random guessing. 1.0 is perfect ranking. 0.95 means 95% of randomly chosen pairs are correctly ordered.
AUROC has lovely properties: it’s threshold-independent, scale-invariant (only the rank order of scores matters), and the random baseline is always exactly 0.5 regardless of class balance. That last point is also its weakness.
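The pairwise interpretation can be implemented directly. This is a brute-force sketch for intuition only (it is quadratic in the number of samples, so it is not how you would compute AUROC at scale); counting ties as half a win is the usual convention, and the toy labels and scores are made up.

```python
def auroc_pairwise(labels, scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.3]

print("AUROC =", auroc_pairwise(labels, scores))

# Scale invariance: a monotone transform of the scores leaves AUROC unchanged
cubed = [s ** 3 for s in scores]
print("AUROC after cubing scores =", auroc_pairwise(labels, cubed))
```

Seven of the eight anomaly/normal pairs are ordered correctly here, and cubing the scores changes nothing because only the rank order matters.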
When AUROC Misleads
Here’s a realistic scenario. You have 1 million transactions, 1,000 of which are fraud (0.1% rate). Your model achieves AUROC = 0.97. Sounds amazing. Now look at the actual usability: at the threshold that produces 1,000 alerts, you might catch 600 frauds and raise 400 false positives. Precision = 60%, Recall = 60%. The model still misses 400 frauds, and 40% of alerts are false. AUROC = 0.97 sold us a story that the operational reality didn’t deliver.
The reason: AUROC averages TPR over the full FPR range from 0 to 1. But in production you only care about FPR < 1% or so. Most of the AUROC area is contributed by regions you’ll never operate in. With severe imbalance, even a sub-1% FPR generates massive numbers of false positives because the negative class is huge.
Precision-Recall Curve and AUPRC
The PR curve plots Precision (y-axis) against Recall (x-axis) as threshold varies. The area under this curve—AUPRC, also called Average Precision (AP)—is far more honest for imbalanced data. Saito and Rehmsmeier (2015) showed empirically that PR curves provide a more informative picture than ROC curves when class imbalance is severe.
The random baseline for AUPRC equals the positive class fraction. So if anomalies are 1% of your data, a coin-flip detector gets AUPRC ≈ 0.01. Beating that baseline by a wide margin is much harder than beating AUROC’s 0.5 baseline.
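The base-rate baseline is easy to verify empirically. The sketch below computes average precision from scratch as the mean of the precision at each true anomaly's rank, which assumes there are no tied scores, and then scores a purely random detector on synthetic labels with roughly 1% positives. The seed and sample size are arbitrary.

```python
import random

def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each true anomaly.
    Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

rng = random.Random(0)
n = 20_000
labels = [1 if rng.random() < 0.01 else 0 for _ in range(n)]
random_scores = [rng.random() for _ in range(n)]  # detector with no signal

base_rate = sum(labels) / n
print(f"Base rate           = {base_rate:.4f}")
print(f"AP of random scores = {average_precision(labels, random_scores):.4f}")
```

The random detector's average precision lands close to the positive class fraction rather than anywhere near 0.5, which is what makes AUPRC a stricter yardstick under imbalance.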
Below is the canonical illustration of the same model evaluated by both curves on a severely imbalanced dataset.
The two curves describe the same model. AUROC = 0.95 sounds like a top-tier detector. AUPRC = 0.42 says “decent, but you’ll see lots of false positives in production.” The PR curve is closer to operational reality.
Detection Error Tradeoff (DET) Curve
Popular in biometrics and speaker recognition. DET plots FAR (x-axis) vs FRR (y-axis), but both axes are on a probit (normal-deviate) scale. This stretches the small-error region and makes near-perfect detectors easier to compare. The Equal Error Rate (EER)—where FAR = FRR—is a single-number summary often quoted in this domain.
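As a rough illustration of the Equal Error Rate, the sketch below sweeps the observed scores as candidate thresholds and returns the point where FAR and FRR are closest. Real toolkits usually interpolate between thresholds; this simplified version does not, and the toy data and function names are hypothetical.

```python
def far_frr(labels, scores, threshold):
    """FAR and FRR when everything scoring >= threshold is flagged positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp / n_neg, fn / n_pos

def equal_error_rate(labels, scores):
    """Approximate EER: the observed threshold where |FAR - FRR| is smallest."""
    def gap(t):
        far, frr = far_frr(labels, scores, t)
        return abs(far - frr)
    best_t = min(sorted(set(scores)), key=gap)
    far, frr = far_frr(labels, scores, best_t)
    return (far + frr) / 2, best_t

labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

eer, eer_threshold = equal_error_rate(labels, scores)
print(f"EER = {eer:.2f} at threshold = {eer_threshold}")
```

On this toy data one negative and one positive sit on the wrong side of the chosen threshold, so both error rates equal 25%.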
When to Use Which Metric: A Decision Framework
If you only remember one decision tree from this article, make it this one:
| Situation | Recommended Metric(s) |
|---|---|
| Severe imbalance (anomalies < 1%) | AUPRC (primary), AUROC (secondary) |
| Need a single threshold for production | F1 (or F-beta if asymmetric costs) |
| Operations team cares about alert volume | FAR + Recall, or Precision@K |
| Cost-sensitive (FN ≫ FP) | Recall, F2, cost-weighted score |
| Cost-sensitive (FP ≫ FN) | Precision, F0.5 |
| Model selection across architectures | AUROC for general comparison; AUPRC if imbalanced |
| Reporting to non-technical stakeholders | Precision@K, Recall@K, dollar-weighted recall |
| Time-series anomaly detection | Range-based F1, VUS, NAB Score |
| Biometrics / authentication | EER, DET curve, FAR @ fixed FRR |
Most production teams report a small bundle: AUPRC + Precision@K + Recall + FAR. That set covers model quality, operational alert volume, miss rate, and false-alarm rate—enough for a useful conversation across stakeholder groups.
Time-Series-Specific Metrics
Time-series anomaly detection is where most “standard” metrics fall apart. The core issue: anomalies are typically events, meaning contiguous segments of points rather than isolated samples. If a real anomaly lasts from t=100 to t=120 (21 timesteps) and your model detects it at t=103 only, did you detect it? Standard point F1 says “1 TP, 20 FN”, so recall = 1/21 ≈ 4.8%. Operationally, you caught the event, but the point-wise metric says you almost completely missed it.
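The t=100 to t=120 scenario can be reproduced in a few lines. This sketch contrasts point-wise recall with a simple event-level recall in which an event counts as caught if any point inside it is flagged. That event-level definition is just one possible reading, used here for illustration, and the helper names are made up.

```python
def segments(labels):
    """Return (start, end_inclusive) pairs for each run of 1s."""
    runs, start = [], None
    for i, y in enumerate(labels):
        if y == 1 and start is None:
            start = i
        elif y == 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(labels) - 1))
    return runs

def point_recall(y_true, y_pred):
    """Fraction of anomalous points that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tp / sum(y_true)

def event_recall(y_true, y_pred):
    """Fraction of anomalous events with at least one flagged point."""
    events = segments(y_true)
    detected = sum(1 for s, e in events if any(y_pred[s:e + 1]))
    return detected / len(events)

y_true = [0] * 200
for t in range(100, 121):   # anomaly from t=100 to t=120 inclusive
    y_true[t] = 1
y_pred = [0] * 200
y_pred[103] = 1             # single detection at t=103

print(f"Point-wise recall  = {point_recall(y_true, y_pred):.3f}")
print(f"Event-level recall = {event_recall(y_true, y_pred):.3f}")
```

The same prediction scores about 0.048 point-wise and 1.0 event-wise, which is exactly the gap the metrics below try to handle in a principled way.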
Several alternatives have been proposed. None are perfect, and there’s an active debate about what’s right. For a deeper survey of the models that produce these scores, see our companion guide on time-series anomaly detection models.
Point-Adjusted (PA) F1
Proposed in early time-series benchmarks (Xu et al., 2018), Point-Adjusted F1 says: if at least one point inside a true anomaly segment is detected, mark the entire segment as detected. This patches the miss-by-one-point problem dramatically—but it inflates scores in misleading ways. Kim et al. (2022) showed that even random scores can achieve PA-F1 above 0.9 on common benchmarks. Use PA-F1 with extreme caution and never as your only metric.
Range-Based Precision/Recall (Tatbul et al., 2018)
The seminal Tatbul et al. paper introduced a parametric framework for range-based recall and precision. Each detection range overlapping a real anomaly range earns partial credit, with knobs for: how to reward partial overlap (existence vs cardinality vs size), bias toward early or late detection, and penalty for fragmentation. It’s principled, configurable, and widely cited, but the parameters need careful selection per use case.
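To give a feel for the existence-versus-size knob, here is a deliberately simplified sketch of range-based recall. It is not the full Tatbul et al. formulation: it assumes a flat positional bias and applies no fragmentation penalty. The `alpha` weight trades existence credit against the covered fraction, and all names are illustrative.

```python
def range_recall(true_ranges, pred_ranges, alpha=0.5):
    """Simplified range-based recall, averaged over true ranges.
    Each true range scores alpha * existence + (1 - alpha) * covered fraction.
    Ranges are (start, end_inclusive) tuples."""
    total = 0.0
    for ts, te in true_ranges:
        covered = set()
        for ps, pe in pred_ranges:
            lo, hi = max(ts, ps), min(te, pe)
            if lo <= hi:                      # the two ranges overlap
                covered.update(range(lo, hi + 1))
        existence = 1.0 if covered else 0.0
        overlap = len(covered) / (te - ts + 1)
        total += alpha * existence + (1 - alpha) * overlap
    return total / len(true_ranges)

true_ranges = [(100, 120)]
pred_ranges = [(103, 103)]

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha}: recall = {range_recall(true_ranges, pred_ranges, alpha):.4f}")
```

With `alpha = 0` the score collapses to point-wise recall, with `alpha = 1` it becomes pure event detection, and values in between give partial credit for coverage.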
NAB Score (Numenta Anomaly Benchmark)
Built around streaming detection. Each true anomaly segment has a “detection window,” and points inside that window earn weighted positive credit (more for early detection); points outside earn weighted negative credit. The result is normalized so a perfect detector scores 100 and a “no detection” baseline scores 0. NAB is opinionated—it explicitly rewards early detection—which makes it appropriate for streaming applications and inappropriate for retrospective analysis.
VUS (Volume Under the Surface, Paparrizos et al., 2022)
A range-aware extension of AUROC and AUPRC. Instead of computing area under a 2D curve, VUS computes volume under a 3D surface where the third dimension is the detection-tolerance buffer. This produces a smooth, parameter-free range-aware metric. VUS-PR is currently among the most defensible single-number summaries for time-series anomaly detection benchmarks.
Affiliation-Based Metrics (Huet et al., 2022)
Defines a continuous “affiliation” between predicted and true segments based on temporal distance, with statistical normalization that makes results comparable across datasets. More principled than PA-F1 but less widely tooled.
| Metric | Range-Aware? | Threshold-Free? | Notes |
|---|---|---|---|
| Point F1 | No | No | Penalizes brief detection lag harshly |
| Point-Adjusted F1 | Partially | No | Inflates scores; controversial |
| Range-Based F1 (Tatbul) | Yes | No | Configurable; needs parameters per use case |
| NAB Score | Yes | No | Rewards early detection; for streaming |
| VUS-ROC / VUS-PR | Yes | Yes | Modern, parameter-free, recommended |
| Affiliation Metrics | Yes | No | Statistical normalization; less tooled |
Top-K Metrics for Ranking
In many production environments, what matters isn’t the binary classification quality—it’s the ranking quality at the top of the list. A SOC analyst reviews the top 50 alerts per shift; a fraud team escalates the top 100 highest-risk transactions per day. For these, top-K metrics fit better.
- Precision@K: of the top K most anomalous predictions, how many are true anomalies. Concrete and operationally meaningful.
- Recall@K: of all true anomalies, how many appear in the top K. Useful when you have a fixed review budget.
- Mean Average Precision (MAP@K): average precision computed up to position K, sometimes used in ranking contexts.
- Lift@K: Precision@K divided by base rate. A lift of 50 means alerts in your top-K are 50× more likely to be anomalies than random samples.
Top-K metrics require fixing K—typically determined by the human review capacity. They’re less useful for academic comparisons (different K values produce different rankings) but invaluable for production health monitoring.
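A minimal sketch of the Top-K metrics, assuming labels and scores are plain Python lists and K does not exceed the number of samples. The function name and the toy data are illustrative.

```python
def top_k_metrics(labels, scores, k):
    """Precision@K, Recall@K and Lift@K for the K highest-scoring items."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(labels[i] for i in top)
    total_pos = sum(labels)
    precision_at_k = hits / k
    recall_at_k = hits / total_pos
    base_rate = total_pos / len(labels)
    return {
        "precision@k": precision_at_k,
        "recall@k": recall_at_k,
        "lift@k": precision_at_k / base_rate,
    }

# Toy data: 10 items, 3 true anomalies, scores already in descending order
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for name, value in top_k_metrics(labels, scores, k=3).items():
    print(f"{name:12s} = {value:.3f}")
```

Two of the top three items are true anomalies, so precision and recall at K = 3 are both 2/3 and the lift over the 30% base rate is a little above 2.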
Practical Implementation in Python
Time to code. We’ll build everything from a confusion matrix up through bootstrapped AUROC confidence intervals, with both scikit-learn shortcuts and from-scratch implementations.
Setup and Synthetic Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score,
    fbeta_score, roc_auc_score, average_precision_score,
    roc_curve, precision_recall_curve, matthews_corrcoef,
    balanced_accuracy_score
)
np.random.seed(42)
# 10,000 samples, 1% anomaly rate
n = 10_000
anomaly_rate = 0.01
y_true = np.random.binomial(1, anomaly_rate, size=n)
# Synthetic anomaly score: anomalies tend to score higher
# Normal points: Beta(2, 5) -> mean ~0.29
# Anomalies: shifted up by 0.4 (clipped at 1.0)
y_score = np.random.beta(2, 5, size=n) + y_true * 0.4
y_score = np.clip(y_score, 0, 1)
print(f"Total samples: {n}")
print(f"Anomalies: {y_true.sum()} ({y_true.mean()*100:.2f}%)")
print(f"Score range: [{y_score.min():.3f}, {y_score.max():.3f}]")
Building the Confusion Matrix from Scratch
def confusion_from_scratch(y_true, y_pred):
    """Compute (TN, FP, FN, TP) without sklearn."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    return TN, FP, FN, TP

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
print(f"TP = {TP}, FP = {FP}, TN = {TN}, FN = {FN}")
# Verify against sklearn
cm = confusion_matrix(y_true, y_pred)
assert (TN, FP, FN, TP) == (cm[0,0], cm[0,1], cm[1,0], cm[1,1])
All Threshold-Dependent Metrics, From Scratch
def metrics_from_confusion(TN, FP, FN, TP):
    """Compute every threshold-dependent metric from a confusion matrix."""
    eps = 1e-12
    precision = TP / (TP + FP + eps)
    recall = TP / (TP + FN + eps)       # TPR / sensitivity
    specificity = TN / (TN + FP + eps)  # TNR
    fpr = FP / (FP + TN + eps)          # FAR / FPR
    fnr = FN / (FN + TP + eps)          # FRR
    accuracy = (TP + TN) / (TP + TN + FP + FN + eps)
    balanced_acc = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall + eps)
    f2 = 5 * precision * recall / (4 * precision + recall + eps)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall + eps)
    # MCC
    num = TP * TN - FP * FN
    den = np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN) + eps)
    mcc = num / den
    return {
        "Precision": precision, "Recall": recall, "Specificity": specificity,
        "FAR (FPR)": fpr, "FRR (FNR)": fnr, "Accuracy": accuracy,
        "BalancedAcc": balanced_acc, "F1": f1, "F2": f2, "F0.5": f05, "MCC": mcc,
    }

m = metrics_from_confusion(TN, FP, FN, TP)
for k, v in m.items():
print(f" {k:14s} = {v:.4f}")
# Verify with sklearn
assert abs(m["F1"] - f1_score(y_true, y_pred)) < 1e-6
assert abs(m["MCC"] - matthews_corrcoef(y_true, y_pred)) < 1e-6
assert abs(m["BalancedAcc"] - balanced_accuracy_score(y_true, y_pred)) < 1e-6
AUROC and AUPRC With sklearn
auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
print(f"AUROC = {auroc:.4f} (random baseline = 0.5)")
print(f"AUPRC = {auprc:.4f} (random baseline = {y_true.mean():.4f})")
Plotting ROC and PR Curves
fpr, tpr, _ = roc_curve(y_true, y_score)
prec, rec, _ = precision_recall_curve(y_true, y_score)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(fpr, tpr, lw=2, label=f"Model (AUROC = {auroc:.3f})")
ax1.plot([0, 1], [0, 1], "--", color="gray", label="Random")
ax1.set_xlabel("False Positive Rate")
ax1.set_ylabel("True Positive Rate")
ax1.set_title("ROC Curve")
ax1.legend()
ax1.grid(alpha=0.3)
ax2.plot(rec, prec, lw=2, color="crimson", label=f"Model (AUPRC = {auprc:.3f})")
ax2.axhline(y=y_true.mean(), linestyle="--", color="gray",
            label=f"Random = {y_true.mean():.3f}")
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.set_title("Precision-Recall Curve")
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=120)
Finding the Optimal F1 Threshold
prec, rec, thresholds = precision_recall_curve(y_true, y_score)
# precision_recall_curve returns one extra point; align with thresholds
prec_t, rec_t = prec[:-1], rec[:-1]
f1_curve = 2 * prec_t * rec_t / (prec_t + rec_t + 1e-12)
best_idx = int(np.argmax(f1_curve))
best_threshold = thresholds[best_idx]
best_f1 = f1_curve[best_idx]
print(f"Best F1 = {best_f1:.4f} at threshold = {best_threshold:.4f}")
print(f" Precision = {prec_t[best_idx]:.4f}")
print(f" Recall = {rec_t[best_idx]:.4f}")
Sweeping the Threshold
def threshold_sweep(y_true, y_score, n_thresholds=100):
    """Compute Precision, Recall, F1, FAR for a grid of thresholds."""
    grid = np.linspace(y_score.min(), y_score.max(), n_thresholds)
    rows = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
        m = metrics_from_confusion(TN, FP, FN, TP)
        rows.append([t, m["Precision"], m["Recall"], m["F1"], m["FAR (FPR)"]])
    return np.asarray(rows)

sweep = threshold_sweep(y_true, y_score, 200)
t_grid, prec_g, rec_g, f1_g, far_g = sweep.T
plt.figure(figsize=(9, 5))
plt.plot(t_grid, prec_g, color="#e74c3c", label="Precision")
plt.plot(t_grid, rec_g, color="#3498db", label="Recall")
plt.plot(t_grid, f1_g, color="#27ae60", label="F1")
plt.plot(t_grid, far_g, color="#f39c12", label="FAR")
plt.axvline(best_threshold, linestyle="--", color="black", alpha=0.6,
            label=f"Best F1 t={best_threshold:.3f}")
plt.xlabel("Threshold")
plt.ylabel("Metric value")
plt.title("Metric vs Threshold (1% anomaly rate)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
Cost-Weighted Metric
def cost_weighted_score(y_true, y_pred, c_fp=1.0, c_fn=10.0):
    """Lower is better. Useful when FN costs ~10x more than FP."""
    TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
    return c_fp * FP + c_fn * FN

def best_threshold_by_cost(y_true, y_score, c_fp=1.0, c_fn=10.0, n=200):
    grid = np.linspace(y_score.min(), y_score.max(), n)
    costs = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        costs.append(cost_weighted_score(y_true, y_pred, c_fp, c_fn))
    best = int(np.argmin(costs))
    return grid[best], costs[best]

t_cost, c_cost = best_threshold_by_cost(y_true, y_score, c_fp=1, c_fn=20)
print(f"Cost-optimal threshold = {t_cost:.4f}, total cost = {c_cost:.0f}")
Bootstrap Confidence Intervals (the Underrated Step)
Single-number reports without uncertainty are dangerous. A 1,000-sample test set with 10 positives can produce wildly different AUPRC values across reasonable bootstrap resamples. The bootstrap is the standard way to attach a confidence interval. The intuition behind why averaging across many resamples produces a stable estimate goes back to the Central Limit Theorem.
def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap percentile CI for any score-based metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        y_t, y_s = y_true[idx], y_score[idx]
        if y_t.sum() == 0 or y_t.sum() == n:
            continue  # degenerate resample
        scores.append(metric_fn(y_t, y_s))
    scores = np.asarray(scores)
    lo = np.quantile(scores, alpha/2)
    hi = np.quantile(scores, 1 - alpha/2)
    return float(np.mean(scores)), (float(lo), float(hi))

mean_auroc, ci_auroc = bootstrap_ci(y_true, y_score, roc_auc_score, n_boot=500)
mean_auprc, ci_auprc = bootstrap_ci(y_true, y_score, average_precision_score, n_boot=500)
print(f"AUROC = {mean_auroc:.4f} 95% CI [{ci_auroc[0]:.4f}, {ci_auroc[1]:.4f}]")
print(f"AUPRC = {mean_auprc:.4f} 95% CI [{ci_auprc[0]:.4f}, {ci_auprc[1]:.4f}]")
Time-Series PA-F1 Implementation
def get_event_segments(y):
    """Return list of (start, end_inclusive) for runs of 1s."""
    y = np.asarray(y).astype(int)
    if len(y) == 0:
        return []
    diff = np.diff(np.concatenate(([0], y, [0])))
    starts = np.where(diff == 1)[0]
    ends = np.where(diff == -1)[0] - 1
    return list(zip(starts.tolist(), ends.tolist()))

def point_adjusted_predictions(y_true, y_pred):
    """Apply Point-Adjusted (PA) protocol: if any point inside a true
    anomaly segment is detected, flag the entire segment as detected."""
    y_pred = y_pred.copy().astype(int)
    for s, e in get_event_segments(y_true):
        if y_pred[s:e+1].any():
            y_pred[s:e+1] = 1
    return y_pred

# Worked example
y_t = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_p = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print("Raw point F1 =", round(f1_score(y_t, y_p), 4))
y_pa = point_adjusted_predictions(y_t, y_p)
print("PA-adjusted pred =", y_pa.tolist())
print("PA-F1 =", round(f1_score(y_t, y_pa), 4))
In this example the raw point F1 is 0.20: one TP at index 4, four FN inside the first event, three FN for the undetected second event, and one FP outside both (precision 0.50, recall 0.125). After point adjustment, the entire first event becomes "detected" because we flagged one point inside it, so recall jumps from 0.125 to 0.625 and PA-F1 rises to about 0.71. This is the inflation effect Kim et al. (2022) warned about: PA-F1 can look impressive even when the underlying detection is weak. For range-aware alternatives, look at the VUS package or the Tatbul range-based implementation in the tsad Python library.
How to Choose the Threshold for Production
You've trained the model. AUROC and AUPRC look fine. Now what threshold do you actually deploy with? Here are the five common strategies, in order from simplest to most sophisticated.
Maximize F1 on Validation
Sweep thresholds on a held-out validation set and pick the one with the highest F1. Simple, defensible, gives a balanced precision/recall point. Watch out: never select the threshold on your test set—that's data leakage. Always reserve validation data for hyperparameter and threshold selection.
Fixed FAR Budget
The operations-driven approach. "We can handle 100 alerts/day. With 1M events/day, that's FAR ≤ 0.01%." Pick the threshold where FAR = 0.0001 on the validation set, then report the corresponding recall. This is how most real cybersecurity and network monitoring systems are tuned.
def threshold_for_far_budget(y_true, y_score, far_budget=0.001):
    """Largest recall achievable subject to FAR ≤ far_budget."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    feasible = fpr <= far_budget
    if not feasible.any():
        return None, 0.0, 0.0
    idx = np.argmax(tpr * feasible)
    return float(thr[idx]), float(tpr[idx]), float(fpr[idx])

t, r, f = threshold_for_far_budget(y_true, y_score, far_budget=0.005)
print(f"Threshold = {t:.4f}, Recall = {r:.4f} at FAR = {f:.4f}")
Cost-Weighted Optimization
If you can quantify the dollar cost of a false positive (analyst time, customer impact) and a false negative (missed fraud value), pick the threshold that minimizes C_FP·FP + C_FN·FN. This is the most defensible approach when the asymmetry is well understood.
Top-K Selection
Skip the threshold entirely. Rank scores and take the top K. Useful when the human review capacity is the binding constraint and the alert volume per period is fixed.
Sliding / Contextual Threshold
Time-of-day, day-of-week, or per-segment thresholds. A retail fraud detector might use threshold = 0.6 on weekday afternoons and 0.4 on holiday weekends. Implementation usually involves a small lookup table or a contextual model that outputs both score and threshold.
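The lookup-table variant might look like the sketch below. Everything here is hypothetical: the context names, the fallback threshold of 0.5, the definition of "afternoon" as 12:00 to 18:00, and the hard-coded holiday dates. Only the 0.6 and 0.4 values come from the example above.

```python
from datetime import datetime

# Per-context thresholds; 0.6 and 0.4 mirror the retail example in the text
THRESHOLDS = {"weekday_afternoon": 0.6, "holiday_weekend": 0.4}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for every other context

# Hypothetical holiday-weekend dates as (year, month, day)
HOLIDAY_WEEKENDS = {(2024, 11, 30), (2024, 12, 1)}

def context_for(ts):
    """Map a timestamp to a threshold context."""
    if (ts.year, ts.month, ts.day) in HOLIDAY_WEEKENDS:
        return "holiday_weekend"
    if ts.weekday() < 5 and 12 <= ts.hour < 18:  # Mon-Fri, 12:00-17:59
        return "weekday_afternoon"
    return "other"

def is_alert(score, ts):
    """Flag the event if its score clears the threshold for its context."""
    threshold = THRESHOLDS.get(context_for(ts), DEFAULT_THRESHOLD)
    return score >= threshold

score = 0.5
print(is_alert(score, datetime(2024, 11, 27, 14, 0)))  # Wednesday afternoon
print(is_alert(score, datetime(2024, 11, 30, 10, 0)))  # holiday Saturday
```

The same score of 0.5 is ignored on a weekday afternoon and alerted on a holiday weekend. Each context's threshold should still be tuned on validation data from that context.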
Common Pitfalls to Avoid
After dozens of anomaly detection projects across fraud, manufacturing, security, and healthcare, here are the recurring mistakes I see most often.
- Reporting AUROC without AUPRC on imbalanced data. An AUROC of 0.99 at 0.1% positives can coexist with an AUPRC around 0.40. Report both, always.
- Reporting accuracy. For anomaly detection, accuracy is almost always meaningless. The "always negative" baseline already scores within a fraction of a point of most real models, so accuracy cannot separate good detectors from useless ones.
- Cherry-picking the threshold on the test set. Tune on validation, evaluate on test. If you maximize F1 across thresholds on the same test set, you've overfit.
- Not using stratified k-fold. With 1% positives in 1,000 samples, a random fold could end up with zero positives in the validation split. Use StratifiedKFold.
- Ignoring confidence intervals. A reported AUPRC of 0.42 ± 0.15 (95% CI) is qualitatively different from 0.42 ± 0.02. Bootstrap and report.
- Comparing models on different test sets. Apples to oranges. Use the same fixed test set across all model comparisons.
- Using point F1 for time series. One-off detection lag tanks the score. Use range-based metrics or VUS instead.
- Microaverage vs macroaverage confusion in multi-class anomaly settings. Microaverage favors common classes; macroaverage equalizes them. Choose intentionally and document.
- Treating PA-F1 as gospel. It can be inflated by random noise. Report it alongside non-PA metrics if you must use it.
- Optimizing offline metrics that don't translate to deployment. If your business runs on alert-volume budgets, optimize for the metric that respects that constraint, not just F1.
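The zero-positive risk from the stratified k-fold pitfall can be put in numbers. This sketch uses the hypergeometric probability that a randomly drawn validation fold contains no positives, assuming 1,000 samples, 10 positives, and 5 folds of 200 samples each. The function name is illustrative.

```python
from math import comb

def prob_fold_has_no_positives(n_samples, n_positives, fold_size):
    """Hypergeometric probability that a random fold contains zero positives."""
    return comb(n_samples - n_positives, fold_size) / comb(n_samples, fold_size)

# 1% positives in 1,000 samples, 5-fold split (200 samples per fold)
p = prob_fold_has_no_positives(n_samples=1000, n_positives=10, fold_size=200)
print(f"P(a given random fold has no positives) = {p:.3f}")
```

That is roughly a one-in-ten chance per fold that recall, AUROC, and AUPRC are undefined. Stratified splitting removes the problem by placing two positives in every fold.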
Real-World Reporting Templates by Domain
Different domains converge on different metric stacks. Here's a recommendation distilled from real production systems. For deeper dives into the underlying anomaly detection methods, see our companion guides on Deep SVDD and One-Class SVM.
| Domain | Recommended Metric Stack | Why |
|---|---|---|
| Fraud detection | AUPRC, Precision@K, Recall, $-weighted recall | Severe imbalance + dollar asymmetry |
| Network intrusion | AUROC, Precision, FAR @ fixed Recall | Operations cares about alert volume |
| Medical screening | Sensitivity (Recall), Specificity, AUROC | Regulatory norms; symmetric reporting |
| Industrial sensor | Range-based F1, Precision@K, time-to-detect | Time-series events; early detection valued |
| Server monitoring | Precision@K, MTTD, false-alert-per-day | Streaming context, on-call workload |
| Biometrics / authentication | EER, DET curve, FAR @ fixed FRR | Field-standard reporting |
| Anti-money-laundering | Recall + Precision@K, regulatory alert quality | Compliance sets minimum recall |
| Manufacturing defect | Recall, Precision, cost-weighted score | Defect cost vs over-inspection cost |
If your model is built on transfer learning or fine-tuning, the same metric framework applies. Be especially cautious about confidence intervals, though: distribution gaps between the pre-training source and the target domain can make results on small test sets very noisy.
Frequently Asked Questions
Why isn't accuracy a good metric for anomaly detection?
Because anomalies are rare. If 99% of your data is normal, a "predict normal always" model achieves 99% accuracy without learning anything. Real models typically lift accuracy by only a few tenths of a percentage point, so accuracy can't distinguish good models from useless ones. Use AUPRC, F1, or Precision@K instead.
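A small worked example shows how little accuracy moves. The data below is made up for illustration: 1,000 samples with 10 anomalies, an "always normal" baseline, and a hypothetical detector that catches 8 of the 10 anomalies at the cost of 5 false alarms.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of true anomalies (label 1) that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# 10 anomalies followed by 990 normal samples (1% positives).
y_true = [1] * 10 + [0] * 990

# Baseline: never flag anything.
always_normal = [0] * 1000

# Hypothetical detector: 8 true positives, 2 misses, 5 false positives.
detector = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

print(accuracy(y_true, always_normal), recall(y_true, always_normal))
print(accuracy(y_true, detector), recall(y_true, detector))
```

The baseline scores 0.99 accuracy with zero recall, while the detector scores 0.993 accuracy with 0.8 recall: a 0.3 percentage-point accuracy gap hides the entire difference between a useless model and a useful one.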
AUROC vs AUPRC—when should I use which?
For mild imbalance (positives 5–50%), AUROC and AUPRC tell roughly similar stories, so AUROC is fine. For severe imbalance (positives below 1%), AUROC inflates because most of its area comes from FPR regions you'll never operate in. AUPRC is more honest because its random baseline equals the positive class fraction. Best practice: report both, but rely on AUPRC for imbalanced anomaly detection.
How do I pick a threshold for production?
Pick the strategy that matches your business constraint. If your team has a fixed alert-review budget, use top-K or fixed-FAR. If you can quantify costs, optimize C_FP·FP + C_FN·FN. If neither, maximize F1 on a held-out validation set. Always select the threshold on validation, evaluate on test, and re-tune monthly as data shifts.
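The cost-based option can be sketched as a direct search over candidate thresholds on validation data. The function names, the toy validation set, and the cost values below are assumptions for illustration; the objective is the C_FP·FP + C_FN·FN expression from the answer above.

```python
def total_cost(y_true, scores, threshold, c_fp, c_fn):
    """C_FP * FP + C_FN * FN when flagging every score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return c_fp * fp + c_fn * fn


def best_threshold(y_true, scores, c_fp, c_fn):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates, key=lambda t: total_cost(y_true, scores, t, c_fp, c_fn))


# Hypothetical validation split: label 1 = anomaly.
y_val = [0, 0, 1, 0, 1, 1]
s_val = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

# Misses are 10x as costly as false alarms: the threshold moves down.
t_miss_heavy = best_threshold(y_val, s_val, c_fp=1, c_fn=10)

# False alarms are 10x as costly as misses: the threshold moves up.
t_alarm_heavy = best_threshold(y_val, s_val, c_fp=10, c_fn=1)

print(t_miss_heavy, t_alarm_heavy)
```

On ties, `min` keeps the lowest candidate, which favors recall. The chosen value should then be frozen and evaluated once on the test set.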
What's the difference between FAR and FPR?
None—they're the same metric: FP / (FP + TN). "False Alarm Rate" is the operations and biometrics term; "False Positive Rate" is the statistical term. Some literature also uses "False Acceptance Rate" (biometrics, identical concept) or "Type I Error rate" (classical statistics). Don't be confused by the multiple names.
Are time-series anomaly detection metrics different?
Yes. Anomalies in time series are typically contiguous events, not isolated points, so naive point-wise F1 over-penalizes brief detection lag. Use range-based metrics (Tatbul et al., 2018), VUS-PR (Paparrizos et al., 2022), or NAB Score for streaming. Avoid using only Point-Adjusted F1—recent work has shown it can be gamed by random noise.
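The gap between point-wise and event-level scoring is easy to see on a toy series. The sketch below uses a deliberately simplified rule, counting a true event as detected if any prediction falls inside it. This is not the range-based metric of Tatbul et al., which adds overlap-size and position terms, and the function names and data are assumptions.

```python
def point_recall(y_true, y_pred):
    """Share of anomalous time steps that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0


def events(y):
    """Contiguous runs of 1s as (start, end) index pairs, end inclusive."""
    runs, start = [], None
    for i, v in enumerate(y):
        if v == 1 and start is None:
            start = i
        elif v != 1 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(y) - 1))
    return runs


def event_recall(y_true, y_pred):
    """Share of true events containing at least one flagged step (simplified)."""
    true_events = events(y_true)
    if not true_events:
        return 0.0
    found = sum(1 for s, e in true_events if any(y_pred[s:e + 1]))
    return found / len(true_events)


# Two true events; the detector fires late on both.
y_true = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0]

print(point_recall(y_true, y_pred), event_recall(y_true, y_pred))
```

Both events are caught, yet point-wise recall is only 0.5 because of the lag. The any-overlap rule is the most lenient choice and, like point adjustment, rewards a single lucky hit, so pair it with an event-level precision or false-alarm count.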
References and Further Reading
External References:
- scikit-learn metrics documentation—https://scikit-learn.org/stable/modules/model_evaluation.html
- Saito, T. & Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLOS ONE.
- Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., & Gottschlich, J. (2018). "Precision and Recall for Time Series." NeurIPS.
- Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R., Elmore, A., & Franklin, M. (2022). "Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection." VLDB.
- Numenta Anomaly Benchmark (NAB), https://github.com/numenta/NAB
- Huet, A., Navarro, J. M., & Rossi, D. (2022). "Local Evaluation of Time Series Anomaly Detection Algorithms." KDD.
- Kim, S. et al. (2022). "Towards a Rigorous Evaluation of Time-Series Anomaly Detection." AAAI.
This article is for informational purposes only and does not constitute investment, security, or medical advice. Always validate metrics against your specific operational context.