Home AI/ML GP-Based Hyperparameter Optimization: Bayesian Tuning for ML Models

GP-Based Hyperparameter Optimization: Bayesian Tuning for ML Models

Last updated: May 17, 2026
k
Published April 26, 2026 · Updated May 17, 2026 · 35 min read

Summary

What this post covers: A practitioner’s guide to tuning ML hyperparameters with Gaussian Process Bayesian Optimization, walking through the full BayesOpt pipeline, acquisition functions, search-space design, and four working Python implementations (scikit-optimize, BoTorch, qNEHVI multi-objective, Optuna+BoTorch).

Key insights:

  • GP-based BayesOpt typically reaches a good configuration in ~20 trials versus ~60 for random search and millions for grid search, making it the right default whenever each training run costs serious GPU time.
  • GPs win for HPO because they natively model observation noise, quantify uncertainty everywhere, and produce a smooth surrogate that an acquisition function can exploit; this is the source of their sample efficiency in low-to-moderate dimensions.
  • Acquisition function choice matters: Expected Improvement is the safe default, UCB exposes an explicit explore/exploit knob, and Thompson Sampling or qNEHVI dominate when you need parallel batches or multi-objective Pareto fronts.
  • Search-space design (log-uniform priors for learning rate, integer dimensions, conditional parameters) often determines success more than the optimizer itself, and mixing GP-BO with Hyperband (BOHB) is the practical winner once you have tens of GPUs.
  • For most teams the right stack is Optuna with the BoTorch sampler: it handles mixed/conditional spaces, parallelizes, and gives you GP-grade sample efficiency without writing BoTorch directly.

Main topics: Why Hyperparameter Tuning Is Hard, The HPO Landscape: A Method Tour, Why Gaussian Processes Win for HPO, The Full BayesOpt Pipeline for HPO, Acquisition Functions Deep Dive, Search Space Design, Full Python Implementation, Multi-Fidelity and Parallel HPO, Tools Comparison, Real-World Case Studies, Practical Guide and Pitfalls.

Tuning a 10-hyperparameter neural network by grid search with 5 values each requires 9.7 million experiments. Random search needs about 60 to find a good config. Gaussian Process Bayesian Optimization needs 20. Same accuracy, 500,000 times less compute.

That gap is the entire reason GP-based hyperparameter optimization moved from a curious research idea to the production default at Google, Meta, and OpenAI. If a single training run takes hours and costs hundreds of dollars in GPU time, you literally cannot afford grid search. You also cannot afford random search to “stumble onto” something good. You need the optimizer to think between trials, picking the next configuration with knowledge of every previous one.

Gaussian Processes are the mathematical machinery that makes this possible. A GP fits a smooth surrogate over your validation loss landscape, quantifies its own uncertainty everywhere, and an acquisition function turns that uncertainty into a principled “where should I look next?” decision.

This post is the practitioner guide. We’re not re-deriving GP regression—for the underlying math (kernels, posterior, marginal likelihood), see the Gaussian Process fundamentals post with Python and GPyTorch. Here we focus on the applied question: how do you actually tune XGBoost, a CNN, or a transformer using GP-based BayesOpt in production?

You’ll get four working code examples (scikit-optimize on XGBoost, BoTorch on a CNN, multi-objective BO with qNEHVI, and Optuna with the BoTorch sampler), a walk through every common acquisition function, three SVG diagrams, and an opinionated tools recommendation. Let’s go.

Why Hyperparameter Tuning Is Hard

Before we praise GPs, it’s worth being honest about what makes HPO genuinely difficult—because the difficulty is what justifies all the BayesOpt machinery in the first place.

The Combinatorial Explosion

A typical modern ML model has somewhere between 10 and 30 tunable hyperparameters. A baseline XGBoost has 10–15 (learning rate, max depth, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda, scale_pos_weight…). A vision transformer has more (depth, width, heads, MLP ratio, patch size, dropout, attention dropout, learning rate, weight decay, warmup, label smoothing, mixup alpha, drop path, EMA…).

If you grid-search 10 hyperparameters with 5 values each, that’s 510 ≈ 9.77 million configurations. At 30 minutes per training run on a single GPU, that’s 5,580 GPU-years. Even with massive parallelism, this is dead on arrival.

Non-Trivial Interactions

Hyperparameters are not independent. The optimal learning rate depends on batch size (linear scaling rule), depends on optimizer (Adam vs SGD), depends on weight initialization, depends on architecture depth. Grid search assumes you can study them one at a time, which is exactly wrong.

Random search handles this better, by sampling jointly, it sees interactions. But it still wastes compute on unpromising regions because it has no memory between trials.

Each Evaluation Is Expensive

Training a single config can take anywhere from minutes (small XGBoost) to days (large language model fine-tune). When each evaluation costs $50–$500 in cloud GPU time, sample efficiency stops being academic and becomes a budget line item.

Noise

The same hyperparameters give different validation losses across random seeds. Variance from data shuffling, dropout randomness, weight initialization, and stochastic optimization means every observation is noisy. A naive optimizer treats this noise as signal and chases ghosts. GPs handle observation noise natively through the kernel—that’s a built-in advantage.

Mixed Types and Conditional Spaces

Real search spaces include continuous parameters (learning rate), integers (max depth, number of layers), categoricals (activation function, optimizer choice), and conditional dimensions (dropout rate only matters if dropout is enabled; momentum only matters for SGD, not Adam). Standard GPs assume continuous Euclidean inputs, so this is a real engineering challenge that we’ll address in the search space section.

Key Takeaway: HPO is hard because the search space is huge and weirdly shaped, evaluations are expensive, observations are noisy, and there’s no gradient. Every characteristic of HPO points away from grid search and toward a sample-efficient, model-based optimizer. That’s exactly what GP-based BayesOpt is.

The HPO Landscape: A Method Tour

Before zooming into GPs, here’s the practical taxonomy of methods you’ll see in the wild.

Cartesian product of values for each hyperparameter. Easy to implement, easy to parallelize, embarrassingly inefficient. Breaks past 4–5 hyperparameters because of the curse of dimensionality. Use only for tiny problems or final pinning of 2–3 parameters.

Sample uniformly at random from the search space. Bergstra & Bengio (2012) famously showed it dominates grid search because most hyperparameters don’t matter equally—random search projects effectively onto the “important” axes. It’s the baseline every other method should beat. If a fancy method can’t beat random search, it’s broken.

Evolutionary / Genetic Algorithms

Maintain a population of configurations, mutate and recombine them, select the fittest. Parallelizes beautifully, no gradient needed, handles weird search spaces. Sample efficiency is moderate, better than random, usually worse than BO. Used heavily in NAS (Regularized Evolution, AmoebaNet). For a deeper dive, see the genetic algorithm Python implementation guide.

Bandit-Based: Hyperband and ASHA

Frame HPO as a multi-armed bandit problem. Run many configs for a small budget, kill the worst, double the budget for survivors, repeat. Successive Halving is the core idea. Hyperband sweeps over different initial budgets to hedge against bad fidelity choices. ASHA is the asynchronous variant that scales to massive parallelism. These are multi-fidelity methods—they use cheap proxies (early epochs) to filter expensive ones.

Bayesian Optimization with GPs

Fit a GP surrogate over (hyperparameter, validation_loss) pairs. Use an acquisition function to pick the next trial. Sample-efficient, principled uncertainty, smooth and theoretically grounded. The focus of this post.

TPE (Tree-Structured Parzen Estimator)

Bayesian optimization with a different surrogate: instead of a GP, model two densities, p(x | y < threshold) and p(x | y ≥ threshold), and pick x maximizing their ratio. Handles conditional spaces natively, scales well in dimensions, dominates the Optuna and HyperOpt defaults. Less sample-efficient than GPs in low-D, but more flexible in high-D and with mixed types.

Hybrid: BOHB

Falkner et al. 2018: combine Bayesian Optimization (TPE) with Hyperband. Best of both—Hyperband’s compute efficiency via early stopping, BO’s smart sampling instead of random sampling within rungs. Often the right default for deep learning HPO when you have tens of GPUs.

HPO Method Sample Efficiency on 10-D Problem Approximate trials needed to reach a good configuration (lower is better) Trials (log scale) 10 100 1k 10k 100k 1M+ Grid Search ~9.7M (off-chart) Random Search ~100 Genetic Algorithm ~80 TPE (Optuna) ~40 GP-BO (BoTorch) ~25 BOHB (multi-fid.) ~15 Winners: GP-BO (low-D) BOHB (deep nets) Note: BOHB advantage assumes you can early-stop confidently from partial training curves.

Quick Decision: When to Use What

Method Sample Efficiency Parallelism Complexity Categorical Support Best For
Grid Search Very low Trivial Trivial Native ≤3 hyperparams, final pinning
Random Search Low Trivial Trivial Native Baseline, exploration phase
Genetic Algorithm Medium Excellent Medium Native NAS, irregular spaces
Hyperband / ASHA Medium Excellent Medium Native Big compute, slow training
TPE High Good Medium Native, conditional Mixed types, conditional spaces
GP-BO Highest Good (qEI/Thompson) High Custom kernels needed ≤20 dims, expensive evals
BOHB Highest Excellent High Native (TPE-based) Deep learning at scale

 

Why Gaussian Processes Win for HPO

For the majority of “real” HPO problems, under 20 dimensions, expensive evaluations, mostly continuous space—GP-based BO is the strongest method on every benchmark we’ve seen. Here’s why specifically.

Sample Efficiency Is Everything

When each evaluation costs hours of GPU time, you don’t care about wall-clock overhead of fitting a GP (a few seconds). You care about making every trial count. GPs use the full information of every prior observation to decide the next one. Random search throws that information away.

Principled Uncertainty

A GP doesn’t just predict the loss—it predicts the loss and a confidence interval. This is what makes intelligent exploration possible. The GP knows where it doesn’t know, and the acquisition function exploits that. Without a probabilistic surrogate, “exploration” becomes random sampling with extra steps.

Smooth Surrogate, Smooth Landscape

Hyperparameter loss landscapes are usually smooth, especially in log-space (learning rate, weight decay). The Matérn 5/2 kernel is a near-perfect inductive bias for this. GPs interpolate cleanly between observations, giving you a believable map of the space after just 10–20 trials.

Calibrated Exploration vs Exploitation

Acquisition functions like Expected Improvement automatically balance “go where we think it’s good” (exploitation) and “go where we don’t know yet” (exploration). The trade-off emerges from the math, not from a hand-tuned epsilon-greedy.

Sweet Spot: ≤20 Dimensions

GPs become unwieldy past ~20 dimensions because the kernel struggles to model meaningful similarity in high-D Euclidean space. The good news: the vast majority of HPO problems are in this regime. For higher dimensions, see the discussion of TuRBO and random embeddings.

Tip: If your search space has fewer than 20 dimensions, you can afford a few seconds of GP fitting overhead per trial, and each trial is expensive (more than a minute), GP-based BO is almost always the right choice. The only reasons not to use it: extreme parallelism (use Thompson sampling), conditional spaces (use TPE), or genuinely high dimensions (use TuRBO).

The Full BayesOpt Pipeline for HPO

Here’s what GP-based BayesOpt actually does, step by step. We’re describing the loop you’ll see in BoTorch, scikit-optimize, and Optuna’s GP sampler.

Step 1: Define the Search Space

Specify the bounds and type for each hyperparameter. Continuous (with optional log scale), integer, categorical. This is where most production mistakes happen, bounds too tight (miss the optimum), bounds too wide (waste trials in bad regions), wrong scale (linear instead of log for learning rate).

Step 2: Initial Random Trials

Run 5–10 random configurations to seed the GP. Without seed observations, the GP has no signal and the acquisition function picks the center of the box, every time. A common rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension.

Step 3: Fit the GP Surrogate

Given observations (x1, y1), …, (xn, yn), fit a GP with Matérn 5/2 kernel (default for HPO). Optimize kernel hyperparameters (lengthscales, signal variance, noise) via maximum marginal likelihood. This takes seconds for n < 1000.

Step 4: Optimize the Acquisition Function

The acquisition function α(x) takes the GP posterior and outputs a scalar—”how valuable is it to evaluate at x?” Maximize α(x) over the search space (using L-BFGS, multi-start, or random sampling for non-smooth cases). The argmax is your next trial.

Step 5: Run the Trial

Train the model with the proposed hyperparameters. Record (xn+1, yn+1).

Step 6: Update and Repeat

Append the new observation, refit the GP, optimize the acquisition again, propose the next trial. Loop until budget exhausted.

BayesOpt Iterations: GP Posterior + Acquisition Function Iteration 5: wide uncertainty, exploring next x Iteration 10: narrowing on promising area next x Iteration 20: converged on local minimum explore Iteration 30: confirmed global optimum ★ optimum acquisition flat—converged GP mean GP ±2σ observed trial next trial acquisition α(x)

Caution: The trade-off no one mentions: GP fitting + acquisition optimization adds 1–10 seconds of overhead per trial. If your individual trials take 5 seconds (a tiny model on a tiny dataset), this overhead dominates and BO loses to random search. BO wins specifically when each trial takes minutes to days. Don’t BO a sklearn linear regression.

Acquisition Functions Deep Dive

The acquisition function is where exploration vs exploitation lives. Choosing the right one matters less than people think, Expected Improvement is fine 90% of the time—but understanding the options helps you debug when things go wrong.

Expected Improvement (EI)

EI(x) = E[max(0, fbest − f(x))]. “How much do we expect to improve over the current best?” For a Gaussian posterior with mean μ(x) and standard deviation σ(x), this has a closed form:

EI(x) = (fbest − μ(x)) · Φ(z) + σ(x) · φ(z),    where z = (fbest − μ(x)) / σ(x)

Φ is the standard normal CDF, φ is the PDF. Smooth, differentiable, well-behaved. The default choice. Slight bias toward exploitation, but in practice it explores plenty because σ(x) is large in unexplored regions.

Upper Confidence Bound (UCB)

UCB(x) = μ(x) − β · σ(x) (for minimization; flip the sign for maximization). The β coefficient explicitly controls exploration. Larger β → more exploration. Theoretical regret bounds (Srinivas et al. 2010): with βt growing logarithmically, UCB has sublinear cumulative regret. In practice, β = 2 is a fine default. UCB is more aggressive about exploring than EI when σ is large.

Probability of Improvement (PI)

PI(x) = P(f(x) < fbest) = Φ(z). Just the probability you’ll beat the current best by any amount. Purely greedy—picks any tiny improvement, can stagnate by exploiting near the current best forever. Rarely used in modern HPO except as a teaching example.

Thompson Sampling

Sample a function from the GP posterior, take its argmin. This is naturally diverse: independent samples from the posterior pick different points. The killer feature: trivial parallelization. For batch HPO of size k, draw k posterior samples and run all argmins simultaneously. Used heavily in production systems with many parallel workers.

Knowledge Gradient (KG)

EI is myopic, it only cares about immediate improvement. KG looks one step ahead: “what’s the expected best after I observe at x and update the GP?” More principled, more expensive (requires nested optimization), and roughly 10–20% better in practice for noisy problems. BoTorch’s qKnowledgeGradient is the standard implementation.

Max-Value Entropy Search (MES)

Information-theoretic: pick x that gives maximum mutual information about the location of the optimum. Robust to noise, handles batches well, but more complex to compute. Wang & Jegelka 2017. Available as qMaxValueEntropy in BoTorch.

Acquisition Formula Intuition Strength Weakness When to Use
EI Expected gain over best so far Closed-form, balanced Slight exploitation bias Default—start here
UCB μ − β·σ Tunable exploration, regret bounds Need to set β When EI underexplores
PI Probability of any improvement Simplest Stagnates, no exploration Almost never
Thompson argmin of posterior sample Trivial parallelization Higher variance Batch / parallel HPO
KG Look-ahead expected best Robust to noise Expensive to compute Very noisy objectives
MES Mutual info about optimum Strong batch behavior Implementation complexity Research / best-of-best

 

Search Space Design

This is the most underappreciated part of HPO. A GP can only optimize what you tell it to optimize, and most HPO failures trace back to a poorly-defined search space.

Log-Scale for Multiplicative Parameters

Learning rates, weight decay, regularization coefficients. These have an effect that is fundamentally multiplicative—going from 1e-3 to 1e-4 is “the same kind of step” as going from 1e-4 to 1e-5. Use log-uniform sampling. Bounds typically 1e-5 to 1e-1 for learning rate.

Linear Scale for Additive Parameters

Layer sizes, number of estimators, batch size, number of layers. These are additive and roughly linear in their effect.

Integer Handling

Most BO libraries treat integers as continuous and round at evaluation time. This works but creates plateaus in the objective. BoTorch’s OneHotToNumeric and Round input transforms handle it cleanly. For Optuna and skopt, just declare the param as integer and the library handles rounding.

Categorical Handling

Three options: (1) one-hot encode and treat as continuous (works, slight efficiency loss), (2) use a custom kernel like the categorical Hamming kernel (cleaner), (3) use TPE which handles categoricals natively. BoTorch’s MixedSingleTaskGP handles mixed continuous-categorical.

Conditional Spaces

“Dropout rate is only meaningful if dropout is enabled.” “Momentum is only relevant for SGD, not Adam.” TPE handles this natively, it learns the conditional structure. GP-based BO needs custom handling: typically you flatten to the union and let the optimizer learn that some dimensions are irrelevant. For deeply conditional spaces, TPE often wins.

Hyperparameter Type Recommended Representation Typical Range
Learning rate Log-uniform continuous 1e-5 to 1e-1
Weight decay / L2 Log-uniform continuous 1e-6 to 1e-2
Dropout rate Linear continuous 0.0 to 0.5
Hidden size / width Log-uniform integer 32 to 1024
Number of layers Linear integer 2 to 12
Batch size Log-uniform integer (powers of 2) 8 to 512
Optimizer choice Categorical {Adam, SGD, AdamW, RMSprop}
Activation Categorical {ReLU, GELU, SiLU, Mish}
XGBoost max_depth Linear integer 3 to 12
XGBoost subsample Linear continuous 0.5 to 1.0

 

Caution: GPs extrapolate poorly outside their training data. If your “best” hyperparameter is on the boundary of the search space, that’s a strong signal you set the bounds too tight. Widen and re-run.

Full Python Implementation

Four working examples, in increasing complexity. Use any of these as a starting template for your own HPO.

Example 1: Tuning XGBoost with scikit-optimize

scikit-optimize is the gentlest entry point—pip install, sklearn-style API, GP-based by default. Great for tabular ML.

"""
GP-BO for XGBoost using scikit-optimize.
pip install scikit-optimize xgboost scikit-learn matplotlib
"""
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from skopt.plots import plot_convergence, plot_objective
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

# Load a real tabular dataset
data = fetch_openml("adult", version=2, as_frame=True)
X = data.data.select_dtypes(include=[np.number]).fillna(0).values
y = (data.target == ">50K").astype(int).values

# Define the search space
space = [
    Real(1e-3, 0.3, prior="log-uniform", name="learning_rate"),
    Integer(3, 12, name="max_depth"),
    Integer(50, 500, name="n_estimators"),
    Real(0.5, 1.0, name="subsample"),
    Real(0.5, 1.0, name="colsample_bytree"),
    Real(1e-6, 1.0, prior="log-uniform", name="reg_alpha"),
    Real(1e-6, 1.0, prior="log-uniform", name="reg_lambda"),
    Real(0.0, 5.0, name="gamma"),
]

@use_named_args(space)
def objective(**params):
    """We minimize negative ROC AUC (skopt minimizes)."""
    clf = XGBClassifier(
        **params,
        tree_method="hist",
        eval_metric="logloss",
        n_jobs=-1,
        random_state=42,
        verbosity=0,
    )
    score = cross_val_score(
        clf, X, y, cv=3, scoring="roc_auc", n_jobs=1
    ).mean()
    return -score

# Run GP-BO with EI acquisition
result = gp_minimize(
    objective,
    space,
    n_calls=50,            # total trials
    n_initial_points=10,   # random seed trials
    acq_func="EI",         # Expected Improvement
    random_state=42,
    verbose=True,
)

print(f"Best AUC: {-result.fun:.4f}")
print("Best hyperparameters:")
for name, val in zip([s.name for s in space], result.x):
    print(f"  {name}: {val}")

# Diagnostics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_convergence(result, ax=axes[0])
axes[0].set_title("Convergence")
plot_objective(result, ax=axes[1] if False else None)  # separate fig
plt.tight_layout()
plt.savefig("xgb_bo_convergence.png", dpi=120)

What’s happening: 10 random seed trials, then 40 GP-guided trials using Expected Improvement. The plot_convergence shows the running best score vs trial number—the classic “BO crushes random search” plot. plot_objective shows partial dependence plots for each hyperparameter, revealing which ones actually mattered.

On the Adult dataset with 50 trials, you’ll typically beat random search’s 50-trial best by 0.5–1.5% AUC. That sounds small until you remember it’s “free” (same trial budget) and reproducible.

Example 2: Tuning a PyTorch CNN with BoTorch

BoTorch is what you reach for when you outgrow scikit-optimize. It’s PyTorch-native, GPU-accelerated, and built on top of GPyTorch (the same library used in the GP fundamentals post). For research and production deep learning HPO, this is the standard.

"""
GP-BO for a PyTorch CNN using BoTorch.
pip install botorch gpytorch torch torchvision
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

device = "cuda" if torch.cuda.is_available() else "cpu"

# Search space: [log_lr, log_wd, dropout, log_hidden]
# Bounds in normalized space [0,1] mapped to actual ranges below.
BOUNDS = torch.tensor(
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    device=device, dtype=torch.double,
)

def unnormalize(x):
    """Map [0,1]^4 to actual hyperparameter ranges."""
    log_lr   = -5.0 + x[..., 0] * 3.0   # 1e-5 to 1e-2
    log_wd   = -6.0 + x[..., 1] * 4.0   # 1e-6 to 1e-2
    dropout  = x[..., 2] * 0.5          # 0 to 0.5
    log_hidden = 5.0 + x[..., 3] * 4.0  # 32 to 512 (log2)
    return {
        "lr": float(10 ** log_lr),
        "wd": float(10 ** log_wd),
        "dropout": float(dropout),
        "hidden": int(2 ** round(log_hidden.item())),
    }

class SmallCNN(nn.Module):
    def __init__(self, hidden, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, hidden), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 10),
        )
    def forward(self, x):
        return self.net(x)

# Load FashionMNIST (small enough to iterate quickly)
tfm = transforms.Compose([transforms.ToTensor()])
train_ds = datasets.FashionMNIST("./data", train=True, download=True, transform=tfm)
val_ds = datasets.FashionMNIST("./data", train=False, download=True, transform=tfm)
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=512, num_workers=2)

def train_eval(params, epochs=3):
    """Train CNN with given hyperparams, return validation accuracy."""
    model = SmallCNN(params["hidden"], params["dropout"]).to(device)
    opt = optim.AdamW(model.parameters(), lr=params["lr"], weight_decay=params["wd"])
    crit = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            opt.zero_grad()
            crit(model(xb), yb).backward()
            opt.step()
    # Evaluate
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            preds = model(xb).argmax(1)
            correct += (preds == yb).sum().item()
            total += yb.size(0)
    return correct / total

# Initial random trials
N_INIT = 8
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=torch.double)
Y_obs = torch.tensor(
    [[train_eval(unnormalize(x))] for x in X_obs],
    device=device, dtype=torch.double,
)
print(f"Init complete. Best so far: {Y_obs.max().item():.4f}")

# BO loop
N_BO_ITERS = 20
for it in range(N_BO_ITERS):
    # Fit GP (BoTorch handles standardization, kernel, MLL)
    gp = SingleTaskGP(X_obs, Y_obs)
    mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
    fit_gpytorch_mll(mll)

    # qEI acquisition (q=1 for sequential)
    acq = qExpectedImprovement(model=gp, best_f=Y_obs.max())
    candidate, _ = optimize_acqf(
        acq_function=acq,
        bounds=BOUNDS,
        q=1,
        num_restarts=10,
        raw_samples=512,
    )
    # Evaluate candidate
    new_y = train_eval(unnormalize(candidate.squeeze(0)))
    X_obs = torch.cat([X_obs, candidate], dim=0)
    Y_obs = torch.cat([Y_obs, torch.tensor([[new_y]], device=device, dtype=torch.double)], dim=0)
    print(f"Iter {it+1}: y={new_y:.4f} | best={Y_obs.max().item():.4f}")

best_idx = Y_obs.argmax()
print("\nBest hyperparameters:")
print(unnormalize(X_obs[best_idx]))

Notes on this implementation:

  • We work in normalized [0,1]d space and unnormalize for the actual training. BoTorch strongly prefers normalized inputs.
  • BoTorch’s SingleTaskGP uses Matérn 5/2 kernel by default with automatic relevance determination (ARD), which learns per-dimension lengthscales.
  • optimize_acqf uses 10 multi-start L-BFGS optimizations with 512 random initial points to find the global optimum of the acquisition function.
  • This loop runs 28 trials total (8 random + 20 BO). On a single GPU with 3-epoch FashionMNIST, that’s ~30 minutes total.

Example 3: Multi-Objective BO with qNEHVI

Real-world deployment cares about more than accuracy. Latency matters. Memory matters. With multi-objective BO, you find the entire Pareto frontier between competing objectives.

"""
Multi-objective HPO: maximize accuracy AND minimize latency.
Returns the Pareto frontier instead of a single best.
"""
import time
import torch
from botorch.models import SingleTaskGP, ModelListGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.multi_objective.monte_carlo import qNoisyExpectedHypervolumeImprovement
from botorch.optim import optimize_acqf
from botorch.utils.multi_objective.box_decompositions.dominated import DominatedPartitioning
from gpytorch.mlls import ExactMarginalLogLikelihood

device = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.double

# Search space: same 4-dim CNN tuning problem
BOUNDS = torch.tensor([[0.0]*4, [1.0]*4], device=device, dtype=DTYPE)

# Two objectives: accuracy (maximize) and -latency_ms (maximize, since BoTorch maximizes)
REF_POINT = torch.tensor([0.5, -200.0], device=device, dtype=DTYPE)  # worst-case bounds

def objective_2d(x_norm):
    """Returns [accuracy, -latency_ms]."""
    params = unnormalize(x_norm)  # reuse from Example 2
    acc = train_eval(params, epochs=3)
    # Measure latency on a batch
    model = SmallCNN(params["hidden"], params["dropout"]).to(device).eval()
    dummy = torch.randn(64, 1, 28, 28, device=device)
    # Warm up
    with torch.no_grad():
        _ = model(dummy)
    torch.cuda.synchronize() if device == "cuda" else None
    t0 = time.perf_counter()
    with torch.no_grad():
        for _ in range(20):
            _ = model(dummy)
    torch.cuda.synchronize() if device == "cuda" else None
    latency_ms = (time.perf_counter() - t0) * 1000 / 20
    return torch.tensor([acc, -latency_ms], device=device, dtype=DTYPE)

# Initial design
N_INIT = 10
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=DTYPE)
Y_obs = torch.stack([objective_2d(x) for x in X_obs])

# Multi-objective BO loop
for it in range(20):
    # Fit independent GPs for each objective
    models = [SingleTaskGP(X_obs, Y_obs[:, i:i+1]) for i in range(2)]
    model_list = ModelListGP(*models)
    for m in models:
        mll = ExactMarginalLogLikelihood(m.likelihood, m)
        fit_gpytorch_mll(mll)

    # qNEHVI acquisition
    acq = qNoisyExpectedHypervolumeImprovement(
        model=model_list,
        ref_point=REF_POINT,
        X_baseline=X_obs,
        prune_baseline=True,
    )
    candidate, _ = optimize_acqf(
        acq_function=acq, bounds=BOUNDS,
        q=2, num_restarts=10, raw_samples=512,
    )
    new_y = torch.stack([objective_2d(x) for x in candidate])
    X_obs = torch.cat([X_obs, candidate])
    Y_obs = torch.cat([Y_obs, new_y])
    # Compute hypervolume
    hv = DominatedPartitioning(ref_point=REF_POINT, Y=Y_obs).compute_hypervolume()
    print(f"Iter {it+1}: HV={hv.item():.3f} | n_obs={len(X_obs)}")

# Extract Pareto frontier
from botorch.utils.multi_objective.pareto import is_non_dominated
mask = is_non_dominated(Y_obs)
pareto = Y_obs[mask]
print(f"\nPareto frontier: {len(pareto)} points")
for acc, neg_lat in pareto.cpu().numpy():
    print(f"  acc={acc:.4f}, latency={-neg_lat:.2f}ms")

The output is not “the best config” but a frontier: configs that are Pareto-optimal, for each one, you can’t improve accuracy without sacrificing latency, and vice versa. The hypervolume metric quantifies the size of the dominated region; bigger is better.

Example 4: Optuna with BoTorch Sampler

Optuna is the most popular HPO library by adoption—and the underrated detail is that you can swap its default TPE sampler for a GP-based BoTorch sampler in one line.

"""
Optuna with GP (BoTorch) sampler vs default TPE.
pip install optuna botorch
"""
import optuna
from optuna.samplers import TPESampler
from optuna.integration import BoTorchSampler
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-6, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-6, 1.0, log=True),
    }
    clf = xgb.XGBClassifier(
        **params, tree_method="hist", eval_metric="logloss",
        n_jobs=-1, random_state=42, verbosity=0,
    )
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# A: TPE sampler (Optuna default)
study_tpe = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=42, n_startup_trials=10),
)
study_tpe.optimize(objective, n_trials=50, show_progress_bar=True)

# B: BoTorch (GP) sampler
study_gp = optuna.create_study(
    direction="maximize",
    sampler=BoTorchSampler(n_startup_trials=10, seed=42),
)
study_gp.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"TPE best AUC: {study_tpe.best_value:.4f}")
print(f"GP-BO best AUC: {study_gp.best_value:.4f}")

# Visualize convergence
import matplotlib.pyplot as plt
def running_best(trials):
    vals = [t.value for t in trials]
    return np.maximum.accumulate(vals)

plt.figure(figsize=(10, 5))
plt.plot(running_best(study_tpe.trials), label="TPE", linewidth=2)
plt.plot(running_best(study_gp.trials), label="GP-BO (BoTorch)", linewidth=2)
plt.xlabel("Trial")
plt.ylabel("Best AUC so far")
plt.legend()
plt.grid(alpha=0.3)
plt.title("TPE vs GP-BO convergence")
plt.savefig("tpe_vs_gp.png", dpi=120, bbox_inches="tight")

Empirically on smaller search spaces (≤10 dims) and noisy objectives, GP-BO converges faster than TPE in trial count. On bigger spaces or with conditional dimensions, TPE catches up. The key benefit of Optuna is the framework: pruning, distributed trials, web dashboard, and easy sampler swapping.

Tip: For an end-to-end HPO orchestration pipeline (queue trials, distribute to workers, persist results), pair Optuna with Apache Airflow for orchestration. Each Airflow task = one trial; the study state lives in a shared database.

Multi-Fidelity and Parallel HPO

The compute reality of modern deep learning: full training is expensive, but partial training is informative. A 100-epoch run is 10× more expensive than a 10-epoch run, but the 10-epoch result correlates strongly with the 100-epoch result. Multi-fidelity HPO exploits this.

BOHB (Falkner et al. 2018)

Combine Hyperband (early stopping based on partial training curves) with BO (smart sampling instead of random). Hyperband decides when to kill a trial; BO decides which configs to try at each rung. Empirically dominates either method alone for deep learning HPO.

BOHB uses TPE rather than GP for the BO part because the sampling-based density model handles the high-dimensional and conditional spaces of NN architectures well. There are GP variants (Falkner mentions trade-offs) but TPE is the default.

Multi-Fidelity BO (MFBO)

Add fidelity (e.g., training epochs, dataset fraction) as an extra dimension in the GP. The GP learns the relationship between fidelity and final performance. The acquisition function picks both x AND a fidelity, balancing information gained against compute cost. BoTorch has qMultiFidelityKnowledgeGradient for this.

Asynchronous BO (Kriging Believer)

For batch parallelism: when a trial is running, “fantasize” its result using the GP posterior mean, add the hallucinated observation to the training set, fit a temporary GP, and pick the next trial assuming the in-flight one will land at its predicted value. Update for real when the trial finishes. This decouples scheduling from observations, enabling many parallel workers without serializing on the GP fit.

Trust Region BO (TuRBO)

Eriksson et al. 2019. For high-D HPO (50+ dimensions), maintain a small “trust region” around the current best, fit a local GP, optimize within. Expand if successful, shrink if not. Effectively decomposes a high-D problem into many local low-D problems. Available in BoTorch.

Key Takeaway: If you have 8+ GPUs and slow training, BOHB usually beats vanilla GP-BO. If you have 1 GPU and ≤20 hyperparameters, vanilla GP-BO with EI is the best ROI. If you have 50+ hyperparameters (NAS territory), look at TuRBO or evolutionary methods.

Tools Comparison

HPO Tools Landscape X-axis: Simplicity → Research flexibility  |  Y-axis: Sample efficiency Simplicity (left) ⟶ Research flexibility (right) Sample efficiency ⟶ simple API framework research-flexible high med low skopt GP-based Optuna TPE / GP Ax GP-based BoTorch GP-based HyperOpt TPE Ray Tune mixed W&B mixed Vizier GP-based SageMaker managed Backend legend GP-based TPE Mixed Managed cloud

Tool Default Backend Multi-Objective Constraints Conditional Spaces Best For
Optuna TPE (GP via BoTorch) Yes Limited Native Production engineering
Ax GP (BoTorch) Yes (Pareto) Yes Yes Adaptive experimentation
BoTorch GP (PyTorch) Yes Yes Custom Research, custom algorithms
scikit-optimize GP / RF No No No Quickstart, sklearn integration
HyperOpt TPE Limited No Native Mature distributed TPE
Ray Tune Pluggable (BO/TPE/PBT/ASHA) Yes (via Ax) Via backend Via backend Distributed orchestration
W&B Sweeps Bayes / Random / Grid No No Limited Experiment tracking integration
Vertex AI Vizier GP (Google) Yes Yes Yes Managed, GCP-native
SageMaker AMT GP / Hyperband No No Limited Managed, AWS-native

 

Opinionated Recommendation

For 90% of practitioners doing 90% of HPO problems:

  • Start with Optuna. The API is the cleanest, defaults are sensible, the dashboard is great, and you can swap in BoTorch sampler when you outgrow TPE.
  • Move to Ax when you need multi-objective with constraints, or you want a higher-level service-style API for ongoing experimentation.
  • Use BoTorch directly when you’re writing custom acquisition functions, doing research, or need fine control over GP fitting (custom kernels, priors, multi-task).
  • Use scikit-optimize for one-off tabular ML tuning where simplicity beats power.
  • Use Ray Tune when distributed orchestration is the bottleneck—you have hundreds of workers and need scheduling.

Real-World Case Studies

Google Vizier

Vizier is Google’s internal Bayesian Optimization service, used to tune everything from ad models to ranking systems to LLM training pipelines. The original 2017 paper reports thousands of studies per day across the company. The default algorithm is GP-based BO with batched parallel evaluation. Vertex AI Vizier exposes this externally on GCP.

Meta’s Ax / BoTorch

Meta open-sourced Ax and BoTorch from the work of optimizing ranking models. Public reports show ranking-quality lifts of >40% relative to random search, with significantly fewer trials required. The same stack tunes hyperparameters in their video encoding research, ad auction simulators, and infrastructure scheduling.

AlphaGo and AlphaFold

DeepMind has used Bayesian optimization in the inner loop for years. AlphaGo reportedly used GP-based BO to tune the MCTS hyperparameters and training schedule. AlphaFold 2’s training pipeline used multi-fidelity BO for architecture-related hyperparameters where each evaluation was prohibitively expensive.

Drug Discovery and Protein Design

Beyond ML hyperparameters, GP-BO is the standard for “real” experimental design, choosing which molecules to synthesize next, which protein variants to screen, which experimental conditions to test. Each “trial” costs days of lab time and thousands of dollars in reagents. Sample efficiency stops being a nice-to-have.

Key Takeaway: GP-based BO is not a research toy. It runs in production at scale at every major tech company and most pharmaceutical companies. The tools (BoTorch, Ax, Optuna, Vizier) have hundreds of person-years of engineering. If you’re not using BO for HPO, you’re probably leaving 0.5–5% accuracy on the table.

Practical Guide and Pitfalls

Initial Design: Don’t Start Cold

Run 5–10 random trials before BO kicks in. Without seed observations, the GP has no signal and the acquisition function picks the geometric center of the box. Rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension.

Parallelize 4–8 Trials per BO Step

Modern HPO at scale uses batch acquisition (qEI, qNEI, qNEHVI) to propose 4–8 candidates per BO iteration. This is the sweet spot—enough parallelism to use a multi-GPU node, not so much that the GP information gain saturates within a batch.

Stopping Criteria

  • Trial budget (most common): “run 100 trials.” Simple and reproducible.
  • Time budget: “run for 24 hours.” Useful in production where wallclock matters more than trial count.
  • Convergence: stop when running-best improvement < ε for k consecutive trials. Risky alone—BO can stall before finding the global optimum.
  • Combination: max(trial_budget, no_improvement_for_k_trials). Practical default.

Reproducibility

Seed everything: numpy, torch, the BO library, the model training. Log every (config, score, wallclock, seed) tuple. The cheapest way to lose value from HPO is being unable to reproduce the best config. Pair with experiment tracking (W&B, MLflow) and you’re set.

Debugging GP Fits

If BO recommendations look weird (clustered in a corner, wildly oscillating), check:

  • Lengthscales: are they reasonable? Tiny lengthscales mean the GP thinks every observation is noise; huge lengthscales mean it thinks the function is constant.
  • Output standardization: BoTorch handles this internally; some libraries don’t. Standardize y manually if in doubt.
  • Input normalization: always normalize inputs to [0,1]d before passing to a GP.
  • Noise: is observation noise too low? Refit with a slightly higher noise prior.

High-Dimensional Pitfalls

Past ~20 dimensions, vanilla GPs degrade. Symptoms: BO doesn’t beat random search, GP lengthscales hit boundary values. Solutions: TuRBO (trust regions), random embeddings (REMBO), dimensionality reduction via PCA on a random sample, or switch to evolutionary methods. For more on high-dim optimization paradigms, see our posts on genetic algorithms and mixed-integer programming.

Constrained BO

Don’t waste evaluations on infeasible configurations. If your model has a memory budget, latency budget, or hardware constraint, model the constraint as a separate GP and use a constrained acquisition function (e.g., expected feasible improvement, qNEHVI with constraints in BoTorch). Saves enormous trial budget.

The Cold-Start Problem

When tuning a new but related task, you typically have prior trials from similar tasks. Transferable BO initializes the GP using observations from prior studies (with some weighting), giving you an informative prior instead of starting from scratch. Available in Ax (multi-task BO) and the academic literature.

Trial Replication and Noise

For genuinely noisy objectives (RL training rewards, small-data classification), replicate the best candidates to reduce noise. The Central Limit Theorem guide covers the math: averaging k noisy observations reduces standard error by √k. Spend 20% of trial budget on replication and you’ll get a much more reliable best.

Caution: The most common HPO failure mode is not “wrong method”,it’s “wrong objective.” If your validation loss isn’t a good proxy for test loss (small validation set, leakage, distribution shift), no optimizer can save you. Audit your evaluation pipeline before tuning. Cross-validation, held-out validation, and the techniques in semi-supervised learning matter more than the optimizer choice.

Frequently Asked Questions

Why is GP-based BO better than random search for HPO?

GP-based BO uses information from prior trials to pick the next one. Random search throws that information away. On benchmark HPO problems with 5–20 hyperparameters, GP-BO typically reaches the same accuracy as random search using 3–10× fewer trials. When each trial costs hours of GPU time, that compounds into significant compute savings—typically 60–90% of the budget.

When does TPE beat GP-based BO?

Three regimes: (1) high-dimensional spaces (30+ hyperparameters) where GPs degrade, (2) heavily conditional spaces (this hyperparameter only exists if that one is true) where TPE handles structure natively, (3) when you need very fast wall-clock per BO iteration because TPE’s sampling is cheaper than GP fitting + acquisition optimization. For most “normal” HPO with ≤20 dims, GP-BO is more sample-efficient.

How many initial random trials should I run before starting BO?

Rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension. For a 4-dimensional space, 8–10 random trials. For 10 dimensions, 20 random trials. Without enough seeds, the GP has no signal and BO collapses to picking the box center repeatedly.

Can GP-BO handle categorical hyperparameters like activation function or optimizer choice?

Yes, three approaches: (1) one-hot encode and treat as continuous (works, slight efficiency loss), (2) use a custom kernel like Hamming distance for categoricals (cleaner, BoTorch’s MixedSingleTaskGP does this), (3) switch to TPE which handles categoricals natively. For 1–2 categorical dimensions, one-hot is fine. For many categoricals, use TPE or a properly mixed kernel.

BoTorch vs Optuna—which should I use?

For most production HPO, start with Optuna: cleaner API, better tooling (dashboard, study persistence, distributed trials), and you can swap in the BoTorch sampler for GP-BO when needed. Use BoTorch directly when you need custom acquisition functions, multi-task GPs, advanced features (qNEHVI, qKG, MES), or are doing research. Many production setups use both: Optuna for orchestration, BoTorch sampler under the hood.

References and Further Reading

  • Bergstra & Bengio (2012). Random Search for Hyper-Parameter Optimization. JMLR. The paper that established random search as the baseline.
  • Frazier (2018). A Tutorial on Bayesian Optimization. arXiv:1807.02811. The clearest intro to BO mathematics.
  • Falkner et al. (2018). BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML. The BOHB paper.
  • Eriksson et al. (2019). Scalable Global Optimization via Local Bayesian Optimization. NeurIPS. TuRBO.
  • Wang & Jegelka (2017). Max-value Entropy Search for Efficient Bayesian Optimization. ICML.
  • BoTorch documentation,official docs for Meta’s Bayesian optimization library.
  • Optuna documentation—practical HPO framework with TPE and GP samplers.
  • scikit-optimize documentation—sklearn-style GP and forest-based BO.
  • Ax (Adaptive Experimentation Platform),Meta’s higher-level wrapper around BoTorch.
Related Reading:

You Might Also Like

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *