Anomaly Detection Metrics Explained: AUROC, AUPRC, F1, Precision, Recall, FAR

Q: Why isn't accuracy a good metric for anomaly detection?

Because anomalies are rare. If 99% of your data is normal, a 'predict normal always' model achieves 99% accuracy without learning anything. Real models barely lift accuracy by a few tenths of a percentage point, so accuracy can't distinguish good models from useless ones. Use AUPRC, F1, or Precision@K instead.

Q: AUROC vs AUPRC: when should I use which?

For mild imbalance (positives 5–50%), AUROC and AUPRC tell roughly similar stories. For severe imbalance (positives below 1%), AUROC inflates because most of its area comes from FPR regions you'll never operate in. AUPRC is more honest because its random baseline equals the positive class fraction. Best practice: report both, but rely on AUPRC for imbalanced anomaly detection.

Q: How do I pick a threshold for production?

Pick the strategy that matches your business constraint. If your team has a fixed alert-review budget, use top-K or fixed-FAR. If you can quantify costs, optimize C_FP·FP + C_FN·FN. If neither, maximize F1 on a held-out validation set. Always select the threshold on validation, evaluate on test, and re-tune monthly as data shifts.

Q: What's the difference between FAR and FPR?

None—they're the same metric: FP / (FP + TN). 'False Alarm Rate' is the operations and biometrics term; 'False Positive Rate' is the statistical term. Some literature also uses 'False Acceptance Rate' (biometrics, identical concept) or 'Type I Error rate' (classical statistics).

Q: Are time-series anomaly detection metrics different?

Yes. Anomalies in time series are typically contiguous events, not isolated points, so naive point-wise F1 over-penalizes brief detection lag. Use range-based metrics (Tatbul et al., 2018), VUS-PR (Paparrizos et al., 2022), or NAB Score for streaming. Avoid using only Point-Adjusted F1—recent work has shown it can be gamed by random noise.

Last updated: May 17, 2026

By kongastral

Published April 30, 2026

Your fraud detector achieves 99.9% accuracy. Sounds great—until you realize 99.9% of transactions are legitimate, and your model just flags everything as “normal.” Accuracy is a lie when anomalies are rare. Picking the wrong metric is the #1 mistake in anomaly detection.

This guide walks through every metric that matters for anomaly detection—from the basics like Precision and Recall, to threshold-independent ranking metrics like AUROC and AUPRC, to specialized time-series metrics like PA-F1 and VUS. We’ll cover the formulas, the trade-offs, and full Python implementations you can drop into a project today.

Summary

What this post covers: A complete reference for choosing and computing anomaly detection metrics — Precision, Recall, F1, FAR, MCC, AUROC, AUPRC, time-series variants, and Top-K — with formulas, trade-offs, and Python implementations aimed at ML engineers building rare-event detectors (fraud, intrusion, defects, biometrics).

Key insights:

Accuracy is degenerate when anomalies are rare — a constant “normal” predictor can score 99.9% — so the first decision in any anomaly-detection project is to discard accuracy as the headline metric.
For severely imbalanced data (anomalies under 1%), AUPRC is the primary ranking metric and AUROC is secondary; AUROC can look misleadingly high on heavily imbalanced data because the TN count dominates the denominator.
Different stakeholders need different metrics on the same model — engineers care about AUROC/AUPRC, operations about FAR and alert volume, finance about dollar-weighted recall — so a single number is always a stakeholder choice in disguise.
Standard point-wise F1 breaks for time-series anomalies because real anomalies are contiguous events, not isolated samples; use range-based F1, VUS, or NAB Score instead.
Most production teams should report a small bundle — AUPRC + Precision@K + Recall + FAR — which covers model quality, operational alert volume, miss rate, and false-alarm rate together.

Main topics: why anomaly metrics matter, the confusion matrix foundation, threshold-dependent metrics, threshold-independent metrics, a decision framework for picking metrics, time-series-specific metrics, Top-K ranking metrics, Python implementations, threshold selection for production, common pitfalls, and domain reporting templates.

Why Anomaly Detection Metrics Matter (and Why Accuracy Doesn’t)

Suppose you build a fraud detector and proudly announce it hits 99.9% accuracy. Your manager is impressed. The board is impressed. You’re a genius. Then someone asks how many actual fraud cases it caught last quarter, and the answer is—none. Zero. Your model achieves 99.9% accuracy by predicting “not fraud” on every single transaction, because in a payment processor, fraud rates hover around 0.1%. The “model” is a constant. The accuracy is real. And it’s worthless.

This is the foundational truth of anomaly detection: the positive class (the anomaly) is rare. Sometimes extremely rare. Network intrusions, manufacturing defects, credit-card fraud, rare diseases—all of them follow base rates from 0.01% to maybe 5%. When the negative class dominates, accuracy becomes a degenerate metric. A model that predicts “normal” for everything will look brilliant.

That’s the imbalance problem. But there’s a second, equally important issue: cost asymmetry. Missing a true anomaly (false negative) almost always costs more than flagging a legitimate event by mistake (false positive). A missed credit-card fraud could cost $5,000. An unnecessary alert costs maybe 30 seconds of an analyst’s time. These aren’t symmetric mistakes, and the metric you choose has to reflect that asymmetry.

Different stakeholders care about different metrics for the same model:

The ML engineer wants AUROC and AUPRC to compare model architectures.
The product manager wants Precision@K because the UI shows the top 50 alerts per day.
The operations lead wants False Alarm Rate (FAR) and Mean Time To Detect (MTTD) because their analysts have to triage every alert.
The CFO wants dollar-weighted recall, the fraction of fraud value caught, not just the count.

If you pick a single number to optimize, you’re implicitly making a stakeholder choice. The right answer is to report a small set of complementary metrics so each audience sees what they need.

Key Takeaway: Accuracy is almost never the right metric for anomaly detection. The base rate is too low, and the cost of false negatives is too high. Use Precision, Recall, F1, AUPRC, and FAR depending on what you actually care about.

The Confusion Matrix Foundation

Every metric in this guide is built from four numbers—the cells of the confusion matrix. By convention, in anomaly detection the anomaly is the positive class and the normal point is the negative class.

Term	Definition	Fraud Example
True Positive (TP)	Model predicts anomaly, truly is anomaly	Caught a fraudulent transaction
False Positive (FP)	Model predicts anomaly, truly is normal	Flagged a legitimate purchase
True Negative (TN)	Model predicts normal, truly is normal	Correctly cleared a normal payment
False Negative (FN)	Model predicts normal, truly is anomaly	Missed a fraudulent transaction

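Here’s a worked example. Imagine 10,000 credit-card transactions where 100 are fraudulent (1% anomaly rate) and your model predicts as follows:

Actual Class	Predicted Anomaly	Predicted Normal	Total
Anomaly (fraud)	TP = 95	FN = 5	100
Normal (legitimate)	FP = 30	TN = 9,870	9,900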

From the cells above, every metric we discuss in this guide is derivable. Note something important: the accuracy for this model is (95 + 9870) / 10000 = 99.65%, which sounds excellent. But a constant “always normal” model would score 99.0%. The lift from a real model is just 0.65 percentage points. If you compare two models on accuracy alone, you’ll learn almost nothing.
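To make the accuracy comparison concrete, here is a minimal sketch that recomputes it from the cell counts implied by the figures quoted in this guide (TP = 95, FP = 30, FN = 5, TN = 9,870). The variable names are illustrative only.

```python
# Confusion-matrix cells from the worked example (10,000 transactions, 100 fraud).
TP, FN, FP, TN = 95, 5, 30, 9870
total = TP + FN + FP + TN

# Accuracy of the real model: all correct decisions over all decisions.
model_accuracy = (TP + TN) / total

# A constant "always normal" predictor gets every normal right (FP + TN of them)
# and every fraud wrong.
baseline_accuracy = (FP + TN) / total

lift_pp = (model_accuracy - baseline_accuracy) * 100
print(f"Model accuracy:    {model_accuracy:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Lift:              {lift_pp:.2f} percentage points")
```

The gap between the two numbers is the only part of accuracy the model actually earned, which is why accuracy is a poor yardstick here.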

The fundamental trade-off in any threshold-based detector is this: lower the threshold → catch more anomalies (TP↑) but also flag more normals (FP↑). Raise the threshold → fewer false alarms (FP↓) but you’ll miss anomalies (FN↑). Every metric in this guide either picks one threshold and reports performance there, or sweeps over all thresholds and summarizes the trade-off.

Threshold-Dependent Metrics: Precision, Recall, F1, FAR, MCC

These metrics require you to commit to one decision threshold (typically 0.5 for probabilities, or some calibrated value for anomaly scores). Once committed, you can compute the four-cell confusion matrix and derive everything below.

Precision—How Pure Are My Alerts?

Precision = TP / (TP + FP). It answers: “Of everything I flagged as anomalous, how many actually were?” In our worked example, Precision = 95/125 = 0.76. That means 76% of the alerts were real fraud, and 24% were false alarms.

When precision matters most:

Alert fatigue. If a SOC analyst gets 100 alerts a day and 90 are wrong, they stop trusting the system. Precision = 0.10.
Costly interventions. If acting on an alert means freezing a customer’s account, you’d better be right.
Limited human review capacity. When you can only investigate the top 50 cases, you want those to be high-quality.

Recall (Sensitivity, True Positive Rate)—How Many Did I Catch?

Recall = TP / (TP + FN). “Of all true anomalies, how many did I catch?” In our example, Recall = 95/100 = 0.95. That’s 95% catch rate.

When recall matters most:

Catastrophic miss costs. Cancer screening. Cybersecurity intrusions. Aircraft engine faults. Missing one is unacceptable.
Rare but serious anomalies. When the cost of FN dwarfs the cost of FP.
Compliance and regulatory contexts. Anti-money-laundering regulations effectively mandate high recall.

F1 Score: The Compromise

F1 = 2·P·R / (P + R). It’s the harmonic mean of Precision and Recall, designed so a low score in either drags the F1 down. In our example, F1 = 2·(0.76)(0.95)/(0.76+0.95) = 0.844.

Why harmonic mean and not arithmetic? Because Precision = 1.0 and Recall = 0.01 (you flagged exactly one true anomaly out of 100) shouldn’t average to 0.505—that’s misleading. The harmonic mean gives 0.0198, which is closer to the truth: this model is bad.

For asymmetric costs, use F-beta:

F_β = (1 + β²) · P · R / (β²·P + R)

β = 1 → standard F1, equal weight
β = 2 → F2, recall weighted twice as much (medical, security)
β = 0.5 → F0.5, precision weighted twice as much (alert fatigue contexts)
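The F-beta formula above can be checked numerically. The sketch below uses a hypothetical helper, `f_beta`, and plugs in the worked-example values (Precision = 0.76, Recall = 0.95) as well as the degenerate "perfect precision, 1% recall" case used to motivate the harmonic mean.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score from precision and recall; returns 0.0 if both are zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

# Worked example from this guide: Precision = 0.76, Recall = 0.95.
p, r = 0.76, 0.95
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")

# Degenerate case: perfect precision but only 1% recall.
harmonic = f_beta(1.0, 0.01)
arithmetic = (1.0 + 0.01) / 2
print(f"Harmonic mean (F1) = {harmonic:.4f}, arithmetic mean = {arithmetic:.3f}")
```

Because recall exceeds precision in the worked example, F2 comes out above F1 and F0.5 below it, which is the direction the weighting is supposed to push.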

Specificity (TNR) and False Alarm Rate (FAR/FPR)

Specificity = TN / (TN + FP). The fraction of true normals correctly left alone. FAR (= FPR = 1 − Specificity) is the fraction of normals that got flagged. In our example FAR = 30/9900 = 0.30%.

FAR is the metric your operations team will actually quote. If you process 1 million events per day at FAR = 0.5%, that’s 5,000 false alarms daily—completely unworkable. Most operational systems target FAR < 0.1% or even < 0.01%, and accept whatever recall results.
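A quick way to sanity-check a FAR target is to convert it into expected alert volume. The helper below, `daily_false_alarms`, is a hypothetical name, and the optional `anomaly_rate` argument is an assumption added so that FAR is applied only to normal events.

```python
def daily_false_alarms(events_per_day: int, far: float, anomaly_rate: float = 0.0) -> float:
    """Expected false alarms per day. FAR applies to normal events only."""
    normals = events_per_day * (1.0 - anomaly_rate)
    return normals * far

# The 1-million-events-per-day scenario from the text, at three FAR levels.
for far in (0.005, 0.001, 0.0001):
    alarms = daily_false_alarms(1_000_000, far)
    print(f"FAR = {far:.2%}: about {alarms:,.0f} false alarms per day")
```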

False Reject Rate (FRR)

FRR = FN / (FN + TP) = 1 − Recall. This is biometrics terminology: in face recognition or fingerprint authentication, FRR is the fraction of legitimate users incorrectly rejected. The “False Acceptance Rate” in biometrics is the same as FAR/FPR here.

Matthews Correlation Coefficient (MCC)

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Range: [−1, +1]. +1 = perfect, 0 = random, −1 = perfectly wrong. Unlike F1, MCC uses all four cells of the confusion matrix and remains informative even when the imbalance is severe. It’s particularly useful when you want a single, balanced number that doesn’t get fooled by a model that just predicts the majority class.
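As a rough illustration of why MCC is hard to fool, the sketch below evaluates it on the worked-example cells and on a constant "always normal" predictor. The function name `mcc` and the zero-denominator convention (return 0.0) are choices made for this sketch; the convention matches what is commonly used in practice.

```python
import math

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 by convention if a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Worked example: 95 frauds caught, 30 false alarms, 5 misses.
worked = mcc(tp=95, fp=30, tn=9870, fn=5)

# Constant "always normal" predictor on the same data: nothing is ever flagged.
always_normal = mcc(tp=0, fp=0, tn=9900, fn=100)

print(f"MCC (worked example) = {worked:.3f}")
print(f"MCC (always normal)  = {always_normal:.3f}")
```

The constant predictor, which scores 99% on accuracy, gets an MCC of zero.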

Balanced Accuracy

Balanced Accuracy = (Sensitivity + Specificity) / 2. A simple average of the per-class accuracies. The “always normal” model gets 50% balanced accuracy, regardless of the imbalance. Use this when you want an accuracy-like number that doesn’t reward majority-class prediction.
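The claim that a constant predictor is pinned at 50% can be verified directly. This sketch, using a hypothetical `balanced_accuracy` helper, compares plain accuracy and balanced accuracy for an "always normal" model at several imbalance levels; the dataset size of 100,000 is an arbitrary assumption.

```python
def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

n = 100_000
for rate in (0.10, 0.01, 0.001):
    positives = round(n * rate)
    negatives = n - positives
    # "Always normal": every anomaly is a FN, every normal is a TN.
    acc = negatives / n
    bal = balanced_accuracy(tp=0, fp=0, tn=negatives, fn=positives)
    print(f"anomaly rate {rate:.1%}: accuracy = {acc:.3f}, balanced accuracy = {bal:.3f}")

# Worked example from this guide for comparison.
print(f"worked example: {balanced_accuracy(tp=95, fp=30, tn=9870, fn=5):.4f}")
```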

Metric	Formula	Range	When to Use
Precision	TP / (TP + FP)	[0, 1]	Alert fatigue, costly interventions
Recall (TPR, Sensitivity)	TP / (TP + FN)	[0, 1]	Catastrophic miss costs, security, medical
F1	2PR / (P + R)	[0, 1]	Single threshold, balanced trade-off
F_β	(1+β²)PR / (β²P+R)	[0, 1]	Asymmetric costs (β>1: recall, β<1: precision)
Specificity (TNR)	TN / (TN + FP)	[0, 1]	Medical screening (avoid false positives)
FAR (FPR)	FP / (FP + TN)	[0, 1]	Operations, alert volume control
FRR (FNR)	FN / (FN + TP)	[0, 1]	Biometrics
MCC	see formula above	[−1, 1]	Balanced single number for imbalanced data
Balanced Accuracy	(TPR + TNR) / 2	[0, 1]	Accuracy-like, imbalance-aware
AUROC	∫TPR d(FPR)	[0, 1]	Threshold-free comparison, mild imbalance
AUPRC (AP)	∫P d(R)	[0, 1]	Severe imbalance—preferred over AUROC

Threshold-Independent Metrics: AUROC, AUPRC, DET

The metrics above all assume you’ve picked a threshold. But during model development, you usually want a single number that summarizes the model’s quality across all possible thresholds. That’s where ranking metrics come in.

ROC Curve and AUROC

The Receiver Operating Characteristic (ROC) curve plots TPR (y-axis) against FPR (x-axis) as the threshold varies. Each point on the curve corresponds to a different decision threshold. The area under this curve, AUROC, has a beautiful probabilistic interpretation:

AUROC = P(score(positive) > score(negative))

“If I randomly draw one anomaly and one normal, AUROC is the probability the model scores the anomaly higher.” 0.5 is random guessing. 1.0 is perfect ranking. 0.95 means 95% of randomly chosen pairs are correctly ordered.
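The pairwise interpretation can be computed literally on a small sample. The sketch below is a brute-force illustration, not an efficient implementation: `auroc_pairwise` is a hypothetical helper, the toy labels and scores are made up, and ties are counted as half a win, which is the usual convention.

```python
from itertools import product

def auroc_pairwise(labels, scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly.

    Ties count as half. Cost is O(P * N), so use only on small samples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal point")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Toy data: 3 anomalies, 5 normals.
labels = [0, 0, 1, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.9, 0.3]
auroc = auroc_pairwise(labels, scores)
print(f"AUROC = {auroc:.4f}")  # 13 of 15 pairs are ordered correctly
```

Only the anomaly scored 0.35 loses any comparisons (to the normals at 0.4 and 0.7), so 13 of the 15 pairs are correct.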

AUROC has lovely properties: it’s threshold-independent, scale-invariant (only the rank order of scores matters), and the random baseline is always exactly 0.5 regardless of class balance. That last point is also its weakness.

When AUROC Misleads

Here’s a real scenario. You have 1 million transactions, 1,000 of which are fraud (0.1% rate). Your model achieves AUROC = 0.97. Sounds amazing. Now look at the actual usability: at the threshold that produces 1,000 alerts, you might catch 600 frauds and raise 400 false positives. Precision = 60%, Recall = 60%. The model still misses 400 frauds, and 40% of alerts are false. AUROC = 0.97 sold us a story that the operational reality didn’t deliver.

The reason: AUROC averages TPR over the full FPR range from 0 to 1. But in production you only care about FPR < 1% or so. Most of the AUROC area is contributed by regions you’ll never operate in. With severe imbalance, even a sub-1% FPR generates massive numbers of false positives because the negative class is huge.
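One way to see the effect is to hold a single ROC operating point fixed and vary only the base rate. The sketch below assumes an illustrative operating point of TPR = 0.90 at FPR = 1%; `precision_from_rates` is a hypothetical helper that applies Bayes' rule to the class-conditional rates.

```python
def precision_from_rates(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a ROC operating point at a given anomaly prevalence."""
    tp = tpr * prevalence          # expected true-positive mass
    fp = fpr * (1.0 - prevalence)  # expected false-positive mass
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# Same ROC point, different base rates.
for prevalence in (0.5, 0.05, 0.01, 0.001):
    prec = precision_from_rates(tpr=0.90, fpr=0.01, prevalence=prevalence)
    print(f"prevalence = {prevalence:>6.1%}  ->  precision = {prec:.3f}")
```

The ROC curve is identical in every row, yet at a 0.1% base rate fewer than one alert in ten is a real anomaly.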

Precision-Recall Curve and AUPRC

The PR curve plots Precision (y-axis) against Recall (x-axis) as threshold varies. The area under this curve—AUPRC, also called Average Precision (AP)—is far more honest for imbalanced data. Saito and Rehmsmeier (2015) showed empirically that PR curves provide a more informative picture than ROC curves when class imbalance is severe.

The random baseline for AUPRC equals the positive class fraction. So if anomalies are 1% of your data, a coin-flip detector gets AUPRC ≈ 0.01. Beating that baseline by a wide margin is much harder than beating AUROC’s 0.5 baseline.
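The baseline claim is easy to check empirically. The following sketch implements a simple step-wise Average Precision (mean precision at the rank of each true anomaly) and applies it to uninformative random scores and to a perfect ranking. The helper name, the sample size, and the seed are assumptions for illustration; unlike library implementations, this version breaks tied scores by input order.

```python
import random

def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of each true anomaly."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

rng = random.Random(0)
n, prevalence = 20_000, 0.01
labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]

random_scores = [rng.random() for _ in range(n)]   # carries no information
perfect_scores = [float(y) for y in labels]        # anomalies always on top

ap_random = average_precision(labels, random_scores)
ap_perfect = average_precision(labels, perfect_scores)
print(f"positive fraction = {sum(labels) / n:.4f}")
print(f"AP (random)       = {ap_random:.4f}")
print(f"AP (perfect)      = {ap_perfect:.4f}")
```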

Below is the canonical illustration of the same model evaluated by both curves on a severely imbalanced dataset.

The two curves describe the same model. AUROC = 0.95 sounds like a top-tier detector. AUPRC = 0.42 says “decent, but you’ll see lots of false positives in production.” The PR curve is closer to operational reality.

Caution: Always report both AUROC and AUPRC for imbalanced anomaly detection. Reporting only AUROC on a 0.1% anomaly task is, at best, misleading; at worst, dishonest.

Detection Error Tradeoff (DET) Curve

Popular in biometrics and speaker recognition. DET plots FAR (x-axis) vs FRR (y-axis), but both axes are on a probit (normal-deviate) scale. This stretches the small-error region and makes near-perfect detectors easier to compare. The Equal Error Rate (EER)—where FAR = FRR—is a single-number summary often quoted in this domain.
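A minimal sketch of how an Equal Error Rate could be approximated from scores, using this guide's convention that the anomaly is the positive class. The function `equal_error_rate` is hypothetical; it sweeps the observed scores as thresholds and picks the point where FAR and FRR are closest, without the interpolation a production implementation would typically add.

```python
def equal_error_rate(labels, scores):
    """Approximate EER: threshold where |FAR - FRR| is smallest (no interpolation)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both classes to compute EER")
    best = None
    for t in sorted(set(scores)):
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < t)
        far, frr = fp / n_neg, fn / n_pos
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, far, frr)
    _, threshold, far, frr = best
    return (far + frr) / 2, threshold

# Toy data: 5 normals and 5 anomalies with some overlap.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9, 0.35]
eer, eer_threshold = equal_error_rate(labels, scores)
print(f"EER = {eer:.2f} at threshold = {eer_threshold}")
```

At threshold 0.5, one normal (0.6) is flagged and one anomaly (0.35) is missed, so FAR and FRR are both 1/5.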

When to Use Which Metric: A Decision Framework

If you only remember one decision table from this article, make it this one:

Situation	Recommended Metric(s)
Severe imbalance (anomalies < 1%)	AUPRC (primary), AUROC (secondary)
Need a single threshold for production	F1 (or F-beta if asymmetric costs)
Operations team cares about alert volume	FAR + Recall, or Precision@K
Cost-sensitive (FN ≫ FP)	Recall, F2, cost-weighted score
Cost-sensitive (FP ≫ FN)	Precision, F0.5
Model selection across architectures	AUROC for general comparison; AUPRC if imbalanced
Reporting to non-technical stakeholders	Precision@K, Recall@K, dollar-weighted recall
Time-series anomaly detection	Range-based F1, VUS, NAB Score
Biometrics / authentication	EER, DET curve, FAR @ fixed FRR

Most production teams report a small bundle: AUPRC + Precision@K + Recall + FAR. That set covers model quality, operational alert volume, miss rate, and false-alarm rate—enough for a useful conversation across stakeholder groups.

Time-Series-Specific Metrics

Time-series anomaly detection is where most “standard” metrics fall apart. The core issue: anomalies are typically events, meaning contiguous segments of points rather than isolated samples. If a real anomaly lasts from t=100 to t=120 (21 timesteps) and your model flags only t=103, did you detect it? Standard point-wise scoring says “1 TP, 20 FN,” so recall = 1/21 ≈ 4.8%. Operationally, you caught the event; the metric says you almost completely missed it.
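The t=100 to t=120 scenario can be reproduced in a few lines. The sketch below contrasts point-wise recall with a simple event-level recall that counts an event as caught if any point inside it is flagged. Both helper names and the series length of 200 are assumptions made for illustration.

```python
def point_recall(y_true, y_pred):
    """Fraction of anomalous points that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

def event_recall(y_true, y_pred):
    """Fraction of contiguous anomaly events with at least one flagged point."""
    events, detected = 0, 0
    in_event, hit = False, False
    for t, p in zip(y_true, y_pred):
        if t == 1:
            if not in_event:            # a new event starts here
                in_event, hit = True, False
                events += 1
            if p == 1 and not hit:      # first flagged point inside this event
                hit = True
                detected += 1
        else:
            in_event = False
    return detected / events if events else 0.0

# One anomaly spanning t = 100..120 (21 points), flagged only at t = 103.
y_true = [1 if 100 <= t <= 120 else 0 for t in range(200)]
y_pred = [1 if t == 103 else 0 for t in range(200)]

print(f"point-wise recall  = {point_recall(y_true, y_pred):.3f}")
print(f"event-level recall = {event_recall(y_true, y_pred):.3f}")
```

The two numbers describe the same prediction; the gap between them is what the range-aware metrics in this section try to handle in a principled way.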

Several alternatives have been proposed. None are perfect, and there’s an active debate about what’s right. For a deeper survey of the models that produce these scores, see our companion guide on time-series anomaly detection models.

Point-Adjusted (PA) F1

Proposed in early time-series benchmarks (Xu et al., 2018), Point-Adjusted F1 says: if at least one point inside a true anomaly segment is detected, mark the entire segment as detected. This removes the penalty for catching an event at only a few of its points, but it inflates scores in misleading ways. Kim et al. (2022) showed that even random scores can achieve PA-F1 above 0.9 on common benchmarks. Use PA-F1 with extreme caution and never as your only metric.

Range-Based Precision/Recall (Tatbul et al., 2018)

The seminal Tatbul et al. paper introduced a parametric framework for range-based recall and precision. Each detection range overlapping a real anomaly range earns partial credit, with knobs for: how to reward partial overlap (existence vs cardinality vs size), bias toward early or late detection, and penalty for fragmentation. It’s principled, configurable, and widely cited, but the parameters need careful selection per use case.
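To give a feel for the knobs, here is a deliberately simplified sketch in the spirit of range-based recall. It is not the full Tatbul et al. framework: it assumes a flat positional bias, applies no fragmentation (cardinality) penalty, and blends only an existence reward with the covered fraction of each true range via a weight `alpha`. The function name and the inclusive `(start, end)` range format are assumptions.

```python
def simple_range_recall(true_ranges, pred_ranges, alpha=0.5):
    """Simplified range-based recall.

    For each true range: alpha * existence reward
                       + (1 - alpha) * fraction of the range covered.
    Ranges are (start, end) integer index pairs, end inclusive.
    """
    if not true_ranges:
        return 0.0
    total = 0.0
    for ts, te in true_ranges:
        covered = set()
        for ps, pe in pred_ranges:
            lo, hi = max(ts, ps), min(te, pe)
            if lo <= hi:                       # the two ranges intersect
                covered.update(range(lo, hi + 1))
        existence = 1.0 if covered else 0.0
        overlap = len(covered) / (te - ts + 1)
        total += alpha * existence + (1 - alpha) * overlap
    return total / len(true_ranges)

# One true event at t = 100..120, detected only at t = 103.
truth, preds = [(100, 120)], [(103, 103)]
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha}: recall = {simple_range_recall(truth, preds, alpha):.3f}")
```

With `alpha = 0` the score collapses to the covered fraction (1/21), and with `alpha = 1` any overlap counts as a full detection. Choosing a value in between is the kind of per-use-case decision the framework asks you to document.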

NAB Score (Numenta Anomaly Benchmark)

Built around streaming detection. Each true anomaly segment has a “detection window,” and points inside that window earn weighted positive credit (more for early detection); points outside earn weighted negative credit. The result is normalized so a perfect detector scores 100 and a “no detection” baseline scores 0. NAB is opinionated—it explicitly rewards early detection—which makes it appropriate for streaming applications and inappropriate for retrospective analysis.

VUS (Volume Under the Surface, Paparrizos et al., 2022)

A range-aware extension of AUROC and AUPRC. Instead of computing area under a 2D curve, VUS computes volume under a 3D surface where the third dimension is the detection-tolerance buffer. This produces a smooth, parameter-free range-aware metric. VUS-PR is currently among the most defensible single-number summaries for time-series anomaly detection benchmarks.

Affiliation-Based Metrics (Huet et al., 2022)

Defines a continuous “affiliation” between predicted and true segments based on temporal distance, with statistical normalization that makes results comparable across datasets. More principled than PA-F1 but less widely tooled.

Metric	Range-Aware?	Threshold-Free?	Notes
Point F1	No	No	Penalizes brief detection lag harshly
Point-Adjusted F1	Partially	No	Inflates scores; controversial
Range-Based F1 (Tatbul)	Yes	No	Configurable; needs parameters per use case
NAB Score	Yes	No	Rewards early detection; for streaming
VUS-ROC / VUS-PR	Yes	Yes	Modern, parameter-free, recommended
Affiliation Metrics	Yes	No	Statistical normalization; less tooled

Tip: For new time-series benchmarks, report VUS-PR and range-based F1 with documented parameters. Avoid relying solely on PA-F1; recent literature has shown it can be gamed by random scores.

Top-K Metrics for Ranking

In many production environments, what matters isn’t the binary classification quality—it’s the ranking quality at the top of the list. A SOC analyst reviews the top 50 alerts per shift; a fraud team escalates the top 100 highest-risk transactions per day. For these, top-K metrics fit better.

Precision@K: of the top K most anomalous predictions, how many are true anomalies. Concrete and operationally meaningful.
Recall@K: of all true anomalies, how many appear in the top K. Useful when you have a fixed review budget.
Mean Average Precision (MAP@K): average precision computed up to position K, sometimes used in ranking contexts.
Lift@K: Precision@K divided by base rate. A lift of 50 means alerts in your top-K are 50× more likely to be anomalies than random samples.

Top-K metrics require fixing K—typically determined by the human review capacity. They’re less useful for academic comparisons (different K values produce different rankings) but invaluable for production health monitoring.
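The top-K definitions above translate almost directly into code. The sketch below uses a hypothetical `top_k_metrics` helper and a made-up ten-item example; ties in score are broken by input order, which is an arbitrary choice.

```python
def top_k_metrics(labels, scores, k):
    """Precision@K, Recall@K and Lift@K for the K highest-scoring items."""
    if not 0 < k <= len(labels):
        raise ValueError("k must be between 1 and the number of items")
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in order[:k])
    n_pos = sum(labels)
    base_rate = n_pos / len(labels)
    precision_at_k = hits / k
    return {
        "precision@k": precision_at_k,
        "recall@k": hits / n_pos if n_pos else 0.0,
        "lift@k": precision_at_k / base_rate if base_rate else 0.0,
    }

# Toy data: 10 items, 3 true anomalies.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.6]
result = top_k_metrics(labels, scores, k=3)
for name, value in result.items():
    print(f"{name:12s} = {value:.3f}")
```

The top three scores (0.9, 0.8, 0.7) contain two of the three anomalies, so Precision@3 and Recall@3 are both 2/3 and the lift over the 30% base rate is about 2.2.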

Practical Implementation in Python

Time to code. We’ll build everything from a confusion matrix up through bootstrapped AUROC confidence intervals, with both scikit-learn shortcuts and from-scratch implementations.

Setup and Synthetic Data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score,
    fbeta_score, roc_auc_score, average_precision_score,
    roc_curve, precision_recall_curve, matthews_corrcoef,
    balanced_accuracy_score
)

np.random.seed(42)

# 10,000 samples, 1% anomaly rate
n = 10_000
anomaly_rate = 0.01
y_true = np.random.binomial(1, anomaly_rate, size=n)

# Synthetic anomaly score: anomalies tend to score higher
# Normal points: Beta(2, 5) -> mean ~0.29
# Anomalies: shifted up by 0.4 (clipped at 1.0)
y_score = np.random.beta(2, 5, size=n) + y_true * 0.4
y_score = np.clip(y_score, 0, 1)

print(f"Total samples: {n}")
print(f"Anomalies: {y_true.sum()} ({y_true.mean()*100:.2f}%)")
print(f"Score range: [{y_score.min():.3f}, {y_score.max():.3f}]")

Building the Confusion Matrix from Scratch

def confusion_from_scratch(y_true, y_pred):
    """Compute (TN, FP, FN, TP) without sklearn."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    TP = int(((y_pred == 1) & (y_true == 1)).sum())
    FP = int(((y_pred == 1) & (y_true == 0)).sum())
    TN = int(((y_pred == 0) & (y_true == 0)).sum())
    FN = int(((y_pred == 0) & (y_true == 1)).sum())
    return TN, FP, FN, TP

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)

TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
print(f"TP = {TP}, FP = {FP}, TN = {TN}, FN = {FN}")

# Verify against sklearn
cm = confusion_matrix(y_true, y_pred)
assert (TN, FP, FN, TP) == (cm[0,0], cm[0,1], cm[1,0], cm[1,1])

All Threshold-Dependent Metrics, From Scratch

def metrics_from_confusion(TN, FP, FN, TP):
    """Compute every threshold-dependent metric from a confusion matrix."""
    eps = 1e-12
    precision = TP / (TP + FP + eps)
    recall    = TP / (TP + FN + eps)        # TPR / sensitivity
    specificity = TN / (TN + FP + eps)       # TNR
    fpr = FP / (FP + TN + eps)               # FAR / FPR
    fnr = FN / (FN + TP + eps)               # FRR
    accuracy = (TP + TN) / (TP + TN + FP + FN + eps)
    balanced_acc = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall + eps)
    f2 = 5 * precision * recall / (4 * precision + recall + eps)
    f05 = 1.25 * precision * recall / (0.25 * precision + recall + eps)
    # MCC
    num = TP * TN - FP * FN
    den = np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN) + eps)
    mcc = num / den

    return {
        "Precision": precision, "Recall": recall, "Specificity": specificity,
        "FAR (FPR)": fpr, "FRR (FNR)": fnr, "Accuracy": accuracy,
        "BalancedAcc": balanced_acc, "F1": f1, "F2": f2, "F0.5": f05, "MCC": mcc,
    }

m = metrics_from_confusion(TN, FP, FN, TP)
for k, v in m.items():
    print(f"  {k:14s} = {v:.4f}")

# Verify with sklearn
assert abs(m["F1"] - f1_score(y_true, y_pred)) < 1e-6
assert abs(m["MCC"] - matthews_corrcoef(y_true, y_pred)) < 1e-6
assert abs(m["BalancedAcc"] - balanced_accuracy_score(y_true, y_pred)) < 1e-6

AUROC and AUPRC With sklearn

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
print(f"AUROC = {auroc:.4f}  (random baseline = 0.5)")
print(f"AUPRC = {auprc:.4f}  (random baseline = {y_true.mean():.4f})")

Plotting ROC and PR Curves

fpr, tpr, _ = roc_curve(y_true, y_score)
prec, rec, _ = precision_recall_curve(y_true, y_score)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(fpr, tpr, lw=2, label=f"Model (AUROC = {auroc:.3f})")
ax1.plot([0, 1], [0, 1], "--", color="gray", label="Random")
ax1.set_xlabel("False Positive Rate")
ax1.set_ylabel("True Positive Rate")
ax1.set_title("ROC Curve")
ax1.legend()
ax1.grid(alpha=0.3)

ax2.plot(rec, prec, lw=2, color="crimson", label=f"Model (AUPRC = {auprc:.3f})")
ax2.axhline(y=y_true.mean(), linestyle="--", color="gray",
            label=f"Random = {y_true.mean():.3f}")
ax2.set_xlabel("Recall")
ax2.set_ylabel("Precision")
ax2.set_title("Precision-Recall Curve")
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig("roc_pr_curves.png", dpi=120)

Finding the Optimal F1 Threshold

prec, rec, thresholds = precision_recall_curve(y_true, y_score)
# precision_recall_curve returns one extra point; align with thresholds
prec_t, rec_t = prec[:-1], rec[:-1]

f1_curve = 2 * prec_t * rec_t / (prec_t + rec_t + 1e-12)
best_idx = int(np.argmax(f1_curve))
best_threshold = thresholds[best_idx]
best_f1 = f1_curve[best_idx]

print(f"Best F1 = {best_f1:.4f} at threshold = {best_threshold:.4f}")
print(f"  Precision = {prec_t[best_idx]:.4f}")
print(f"  Recall    = {rec_t[best_idx]:.4f}")

Sweeping the Threshold

def threshold_sweep(y_true, y_score, n_thresholds=100):
    """Compute Precision, Recall, F1, FAR for a grid of thresholds."""
    grid = np.linspace(y_score.min(), y_score.max(), n_thresholds)
    rows = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
        m = metrics_from_confusion(TN, FP, FN, TP)
        rows.append([t, m["Precision"], m["Recall"], m["F1"], m["FAR (FPR)"]])
    return np.asarray(rows)

sweep = threshold_sweep(y_true, y_score, 200)
t_grid, prec_g, rec_g, f1_g, far_g = sweep.T

plt.figure(figsize=(9, 5))
plt.plot(t_grid, prec_g, color="#e74c3c", label="Precision")
plt.plot(t_grid, rec_g,  color="#3498db", label="Recall")
plt.plot(t_grid, f1_g,   color="#27ae60", label="F1")
plt.plot(t_grid, far_g,  color="#f39c12", label="FAR")
plt.axvline(best_threshold, linestyle="--", color="black", alpha=0.6,
            label=f"Best F1 t={best_threshold:.3f}")
plt.xlabel("Threshold")
plt.ylabel("Metric value")
plt.title("Metric vs Threshold (1% anomaly rate)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()

Cost-Weighted Metric

def cost_weighted_score(y_true, y_pred, c_fp=1.0, c_fn=10.0):
    """Lower is better. Useful when FN costs ~10x more than FP."""
    TN, FP, FN, TP = confusion_from_scratch(y_true, y_pred)
    return c_fp * FP + c_fn * FN

def best_threshold_by_cost(y_true, y_score, c_fp=1.0, c_fn=10.0, n=200):
    grid = np.linspace(y_score.min(), y_score.max(), n)
    costs = []
    for t in grid:
        y_pred = (y_score >= t).astype(int)
        costs.append(cost_weighted_score(y_true, y_pred, c_fp, c_fn))
    best = int(np.argmin(costs))
    return grid[best], costs[best]

t_cost, c_cost = best_threshold_by_cost(y_true, y_score, c_fp=1, c_fn=20)
print(f"Cost-optimal threshold = {t_cost:.4f}, total cost = {c_cost:.0f}")

Bootstrap Confidence Intervals (the Underrated Step)

Single-number reports without uncertainty are dangerous. A 1,000-sample test set with 10 positives can produce wildly different AUPRC values across reasonable bootstrap resamples. The bootstrap is the standard way to attach a confidence interval. The intuition behind why averaging across many resamples produces a stable estimate goes back to the Central Limit Theorem.

def bootstrap_ci(y_true, y_score, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap percentile CI for any score-based metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        y_t, y_s = y_true[idx], y_score[idx]
        if y_t.sum() == 0 or y_t.sum() == n:
            continue  # degenerate resample
        scores.append(metric_fn(y_t, y_s))
    scores = np.asarray(scores)
    lo = np.quantile(scores, alpha/2)
    hi = np.quantile(scores, 1 - alpha/2)
    return float(np.mean(scores)), (float(lo), float(hi))

mean_auroc, ci_auroc = bootstrap_ci(y_true, y_score, roc_auc_score, n_boot=500)
mean_auprc, ci_auprc = bootstrap_ci(y_true, y_score, average_precision_score, n_boot=500)

print(f"AUROC = {mean_auroc:.4f}  95% CI [{ci_auroc[0]:.4f}, {ci_auroc[1]:.4f}]")
print(f"AUPRC = {mean_auprc:.4f}  95% CI [{ci_auprc[0]:.4f}, {ci_auprc[1]:.4f}]")

Time-Series PA-F1 Implementation

def get_event_segments(y):
    """Return list of (start, end_inclusive) for runs of 1s."""
    y = np.asarray(y).astype(int)
    if len(y) == 0:
        return []
    diff = np.diff(np.concatenate(([0], y, [0])))
    starts = np.where(diff == 1)[0]
    ends   = np.where(diff == -1)[0] - 1
    return list(zip(starts.tolist(), ends.tolist()))

def point_adjusted_predictions(y_true, y_pred):
    """Apply Point-Adjusted (PA) protocol: if any point inside a true
    anomaly segment is detected, flag the entire segment as detected."""
    y_pred = y_pred.copy().astype(int)
    for s, e in get_event_segments(y_true):
        if y_pred[s:e+1].any():
            y_pred[s:e+1] = 1
    return y_pred

# Worked example
y_t = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0])
y_p = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

print("Raw point F1     =", round(f1_score(y_t, y_p), 4))
y_pa = point_adjusted_predictions(y_t, y_p)
print("PA-adjusted pred =", y_pa.tolist())
print("PA-F1            =", round(f1_score(y_t, y_pa), 4))

In this example the raw point F1 is 0.20: one TP and four FN inside the first event, three FN for the undetected second event, and one FP outside any event (precision = 1/2, recall = 1/8). After point adjustment, the entire first event counts as detected because we flagged one point inside it, so recall jumps from 0.125 to 0.625 and PA-F1 rises to about 0.71. This is the inflation effect Kim et al. (2022) warned about: PA-F1 can look impressive even when the underlying detection is weak. For range-aware alternatives, look at the VUS package or the Tatbul range-based implementation in the tsad Python library.

How to Choose the Threshold for Production

You've trained the model. AUROC and AUPRC look fine. Now what threshold do you actually deploy with? Here are the five common strategies, in order from simplest to most sophisticated.

Maximize F1 on Validation

Sweep thresholds on a held-out validation set and pick the one with the highest F1. Simple, defensible, gives a balanced precision/recall point. Watch out: never select the threshold on your test set—that's data leakage. Always reserve validation data for hyperparameter and threshold selection.
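The validation/test discipline described here can be sketched as follows. Everything in the block is illustrative: the synthetic score distributions, the split size, the 2% anomaly rate, and the helper names `f1_at` and `select_threshold` are all assumptions. The point is only that the threshold is chosen on one split and then frozen before the other split is touched.

```python
import random

def f1_at(labels, scores, threshold):
    """F1 when flagging every score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_threshold(labels, scores):
    """Observed score that maximizes F1 on the data it is given."""
    return max(sorted(set(scores)), key=lambda t: f1_at(labels, scores, t))

def make_split(rng, n=1000, rate=0.02):
    """Synthetic split: normals score in [0, 0.6), anomalies in [0.4, 1.0)."""
    y = [1 if rng.random() < rate else 0 for _ in range(n)]
    s = [0.4 + 0.6 * rng.random() if label else 0.6 * rng.random() for label in y]
    return y, s

rng = random.Random(7)
y_val, s_val = make_split(rng)
y_test, s_test = make_split(rng)

t_star = select_threshold(y_val, s_val)   # chosen on validation only
val_f1 = f1_at(y_val, s_val, t_star)
test_f1 = f1_at(y_test, s_test, t_star)   # reported once, threshold frozen

print(f"threshold = {t_star:.3f}, validation F1 = {val_f1:.3f}, test F1 = {test_f1:.3f}")
```

The test F1 will usually be a little lower than the validation F1, because the threshold was tuned to the validation sample; that gap is the honest estimate you would have hidden by tuning on test.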

Fixed FAR Budget

The operations-driven approach. "We can handle 100 alerts/day. With 1M events/day, that's FAR ≤ 0.01%." Pick the threshold where FAR = 0.0001 on the validation set, then report the corresponding recall. This is how most real cybersecurity and network monitoring systems are tuned.

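def threshold_for_far_budget(y_true, y_score, far_budget=0.001):
    """Largest recall achievable subject to FAR ≤ far_budget.

    Returns (threshold, recall, far) at the chosen operating point."""
    fpr, tpr, thr = roc_curve(y_true, y_score)
    feasible = fpr <= far_budget  # operating points inside the FAR budget
    if not feasible.any():
        return None, 0.0, 0.0
    # Zero out infeasible points, then take the feasible point with highest recall.
    idx = np.argmax(tpr * feasible)
    return float(thr[idx]), float(tpr[idx]), float(fpr[idx])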

t, r, f = threshold_for_far_budget(y_true, y_score, far_budget=0.005)
print(f"Threshold = {t:.4f}, Recall = {r:.4f} at FAR = {f:.4f}")

Cost-Weighted Optimization

If you can quantify the dollar cost of a false positive (analyst time, customer impact) and a false negative (missed fraud value), pick the threshold that minimizes C_FP·FP + C_FN·FN. This is the most defensible approach when the asymmetry is well understood.
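
A minimal sketch of that cost sweep, assuming higher scores mean "more anomalous" and that a point is flagged when its score is at or above the threshold. The function name `threshold_for_min_cost`, the default costs, and the toy arrays are illustrative assumptions.

```python
import numpy as np

def threshold_for_min_cost(y_true, y_score, c_fp=1.0, c_fn=10.0):
    """Return (threshold, cost) minimizing c_fp * FP + c_fn * FN."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    # Candidates: every observed score, plus +inf meaning "flag nothing".
    candidates = np.append(np.unique(y_score), np.inf)
    best_t, best_cost = None, np.inf
    for t in candidates:
        y_pred = y_score >= t
        fp = np.sum(y_pred & ~y_true)   # normal points flagged
        fn = np.sum(~y_pred & y_true)   # anomalies missed
        cost = c_fp * fp + c_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

# Hypothetical toy data: 2 anomalies among 8 points.
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
s_demo = np.array([0.1, 0.2, 0.3, 0.4, 0.7, 0.5, 0.6, 0.9])
t_cost, cost = threshold_for_min_cost(y_demo, s_demo, c_fp=1.0, c_fn=10.0)
print(f"Threshold = {t_cost:.2f}, total cost = {cost:.1f}")
```

With a miss costing ten times a false alarm, the sweep accepts one false positive (the normal point scored 0.7) rather than lose the anomaly scored 0.6. As with F1, run the sweep on validation data, not the test set.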

Top-K Selection

Skip the threshold entirely. Rank scores and take the top K. Useful when the human review capacity is the binding constraint and the alert volume per period is fixed.
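
As a rough sketch, top-K selection is just a sort on the scores. The function name `top_k_alerts` and the toy scores and labels below are hypothetical.

```python
import numpy as np

def top_k_alerts(y_score, k):
    """Return indices of the k highest-scoring items, highest first."""
    y_score = np.asarray(y_score, dtype=float)
    k = min(k, y_score.size)
    # Stable sort on negated scores so ties keep their original order.
    return np.argsort(-y_score, kind="stable")[:k]

# Hypothetical scores and labels for six events; review budget is K = 3.
scores = np.array([0.10, 0.95, 0.30, 0.80, 0.05, 0.60])
labels = np.array([0, 1, 0, 0, 0, 1])

alerts = top_k_alerts(scores, k=3)
hit_rate = float(labels[alerts].mean())   # precision among the K reviewed items
print("Review queue:", alerts.tolist(), "precision in queue:", round(hit_rate, 3))
```

The review queue always has exactly K items, so alert volume stays fixed even when the score distribution shifts.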

Sliding / Contextual Threshold

Time-of-day, day-of-week, or per-segment thresholds. A retail fraud detector might use threshold = 0.6 on weekday afternoons and 0.4 on holiday weekends. Implementation usually involves a small lookup table or a contextual model that outputs both score and threshold.
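
The lookup-table variant can be sketched as below. The table keys, the 12:00 to 18:00 "afternoon" window, the 0.5 fallback, and treating any weekend or listed holiday as "holiday weekend" are all assumptions made for illustration; only the 0.6 and 0.4 values come from the example above.

```python
from datetime import datetime

DEFAULT_THRESHOLD = 0.5  # assumed fallback for contexts not in the table

# Hypothetical lookup table keyed by context name.
THRESHOLDS = {
    "weekday_afternoon": 0.6,
    "holiday_weekend": 0.4,
}

def contextual_threshold(ts, holidays=frozenset()):
    """Pick a threshold from the lookup table based on the event timestamp."""
    is_weekend = ts.weekday() >= 5          # Saturday = 5, Sunday = 6
    if is_weekend or ts.date() in holidays:
        return THRESHOLDS["holiday_weekend"]
    if 12 <= ts.hour < 18:
        return THRESHOLDS["weekday_afternoon"]
    return DEFAULT_THRESHOLD

def is_alert(score, ts, holidays=frozenset()):
    """Flag the event when its score reaches the threshold for its context."""
    return score >= contextual_threshold(ts, holidays)

weekday_pm = datetime(2026, 5, 13, 15, 0)   # a Wednesday afternoon
weekend = datetime(2026, 5, 16, 15, 0)      # a Saturday
print(is_alert(0.5, weekday_pm), is_alert(0.5, weekend))
```

The same score of 0.5 is ignored on the weekday afternoon and flagged on the weekend. Each context's threshold still needs to be selected on validation data from that context.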

Caution: Thresholds drift. As your data distribution shifts (seasonal effects, fraud-pattern evolution), the threshold that maximized F1 in January may produce twice the alert volume in June. Schedule monthly threshold re-tuning and monitor precision and FAR continuously.

Common Pitfalls to Avoid

After dozens of anomaly detection projects across fraud, manufacturing, security, and healthcare, here are the recurring mistakes I see most often.

Reporting AUROC without AUPRC on imbalanced data. An AUROC of 0.99 at 0.1% positives can coexist with an AUPRC near 0.40. Report both, always.
Reporting accuracy. For anomaly detection, accuracy is almost always meaningless. The "always negative" baseline already scores close to 100% and can match or beat real models on accuracy.
Cherry-picking the threshold on the test set. Tune on validation, evaluate on test. If you maximize F1 across thresholds on the same test set, you've overfit.
Not using stratified k-fold. With 1% positives in 1,000 samples, a random fold could end up with zero positives in the validation split. Use StratifiedKFold.
Ignoring confidence intervals. A reported AUPRC of 0.42 ± 0.15 (95% CI) is qualitatively different from 0.42 ± 0.02. Bootstrap and report.
Comparing models on different test sets. Apples to oranges. Use the same fixed test set across all model comparisons.
Using point F1 for time series. One-off detection lag tanks the score. Use range-based metrics or VUS instead.
Micro-average vs. macro-average confusion in multi-class anomaly settings. Micro-averaging favors common classes; macro-averaging weights all classes equally. Choose intentionally and document.
Treating PA-F1 as gospel. It can be inflated by random noise. Report it alongside non-PA metrics if you must use it.
Optimizing offline metrics that don't translate to deployment. If your business runs on alert-volume budgets, optimize for the metric that respects that constraint, not just F1.
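
The stratification pitfall is easy to check directly. This sketch counts positives per validation fold for plain and stratified splitting on the 1%-of-1,000 case from the list above; the seed, the 5-fold setup, and the placeholder feature matrix are assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)

# 1,000 samples with 1% positives placed at random positions.
y = np.zeros(1000, dtype=int)
y[rng.choice(1000, size=10, replace=False)] = 1
X = np.zeros((1000, 1))  # placeholder features; only labels matter for the split

def positives_per_fold(splitter):
    """Count positives landing in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per validation fold:          ", plain)
print("StratifiedKFold positives per validation fold:", strat)
```

StratifiedKFold places two positives in every validation fold here, while plain KFold leaves the count to chance, so precision, recall, and AUPRC on a fold can become unstable or undefined.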

Real-World Reporting Templates by Domain

Different domains converge on different metric stacks. Here's a recommendation distilled from real production systems. For deeper dives into the underlying anomaly detection methods, see our companion guides on Deep SVDD and One-Class SVM.

Domain	Recommended Metric Stack	Why
Fraud detection	AUPRC, Precision@K, Recall, $-weighted recall	Severe imbalance + dollar asymmetry
Network intrusion	AUROC, Precision, FAR @ fixed Recall	Operations cares about alert volume
Medical screening	Sensitivity (Recall), Specificity, AUROC	Regulatory norms; symmetric reporting
Industrial sensor	Range-based F1, Precision@K, time-to-detect	Time-series events; early detection valued
Server monitoring	Precision@K, MTTD, false-alert-per-day	Streaming context, on-call workload
Biometrics / authentication	EER, DET curve, FAR @ fixed FRR	Field-standard reporting
Anti-money-laundering	Recall + Precision@K, regulatory alert quality	Compliance sets minimum recall
Manufacturing defect	Recall, Precision, cost-weighted score	Defect cost vs over-inspection cost

If your model is built on transfer learning or fine-tuning, the same metric framework applies. Just be especially cautious about confidence intervals: a distribution gap between the pre-training source and the target domain can make metrics on small test sets very noisy.

Key Takeaway: A solid default reporting set for any anomaly detection project: AUPRC, Precision@K, Recall, and FAR—each with bootstrap 95% confidence intervals and a documented threshold. That covers model quality, top-of-list usefulness, miss rate, and operational alert volume.

Frequently Asked Questions

Why isn't accuracy a good metric for anomaly detection?

Because anomalies are rare. If 99% of your data is normal, a "predict normal always" model achieves 99% accuracy without learning anything. Real models barely lift accuracy by a few tenths of a percentage point, so accuracy can't distinguish good models from useless ones. Use AUPRC, F1, or Precision@K instead.

AUROC vs AUPRC—when should I use which?

For mild imbalance (positives 5–50%), AUROC and AUPRC tell roughly similar stories, so AUROC is fine. For severe imbalance (positives below 1%), AUROC inflates because most of its area comes from FPR regions you'll never operate in. AUPRC is more honest because its random baseline equals the positive class fraction. Best practice: report both, but rely on AUPRC for imbalanced anomaly detection.

How do I pick a threshold for production?

Pick the strategy that matches your business constraint. If your team has a fixed alert-review budget, use top-K or fixed-FAR. If you can quantify costs, optimize C_FP·FP + C_FN·FN. If neither, maximize F1 on a held-out validation set. Always select the threshold on validation, evaluate on test, and re-tune monthly as data shifts.

What's the difference between FAR and FPR?

None—they're the same metric: FP / (FP + TN). "False Alarm Rate" is the operations and biometrics term; "False Positive Rate" is the statistical term. Some literature also uses "False Acceptance Rate" (biometrics, identical concept) or "Type I Error rate" (classical statistics). Don't be confused by the multiple names.

Are time-series anomaly detection metrics different?

Yes. Anomalies in time series are typically contiguous events, not isolated points, so naive point-wise F1 over-penalizes brief detection lag. Use range-based metrics (Tatbul et al., 2018), VUS-PR (Paparrizos et al., 2022), or NAB Score for streaming. Avoid using only Point-Adjusted F1—recent work has shown it can be gamed by random noise.

References and Further Reading

External References:

scikit-learn metrics documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
Saito, T. & Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." PLOS ONE.
Tatbul, N., Lee, T. J., Zdonik, S., Alam, M., & Gottschlich, J. (2018). "Precision and Recall for Time Series." NeurIPS.
Paparrizos, J., Boniol, P., Palpanas, T., Tsay, R., Elmore, A., & Franklin, M. (2022). "Volume Under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection." VLDB.
Numenta Anomaly Benchmark (NAB): https://github.com/numenta/NAB
Huet, A., Navarro, J. M., & Rossi, D. (2022). "Local Evaluation of Time Series Anomaly Detection Algorithms." KDD.
Kim, S. et al. (2022). "Towards a Rigorous Evaluation of Time-Series Anomaly Detection." AAAI.

This article is for informational purposes only and does not constitute investment, security, or medical advice. Always validate metrics against your specific operational context.


