Summary
What this post covers: A practitioner’s guide to tuning ML hyperparameters with Gaussian Process Bayesian Optimization, walking through the full BayesOpt pipeline, acquisition functions, search-space design, and four working Python implementations (scikit-optimize, BoTorch, qNEHVI multi-objective, Optuna+BoTorch).
Key insights:
- GP-based BayesOpt typically reaches a good configuration in ~20 trials versus ~60 for random search and millions for grid search, making it the right default whenever each training run costs serious GPU time.
- GPs win for HPO because they natively model observation noise, quantify uncertainty everywhere, and produce a smooth surrogate that an acquisition function can exploit; this is the source of their sample efficiency in low-to-moderate dimensions.
- Acquisition function choice matters: Expected Improvement is the safe default, UCB exposes an explicit explore/exploit knob, and Thompson Sampling or qNEHVI dominate when you need parallel batches or multi-objective Pareto fronts.
- Search-space design (log-uniform priors for learning rate, integer dimensions, conditional parameters) often determines success more than the optimizer itself, and mixing GP-BO with Hyperband (BOHB) is the practical winner once you have tens of GPUs.
- For most teams the right stack is Optuna with the BoTorch sampler: it handles mixed/conditional spaces, parallelizes, and gives you GP-grade sample efficiency without writing BoTorch directly.
Main topics: Why Hyperparameter Tuning Is Hard, The HPO Landscape: A Method Tour, Why Gaussian Processes Win for HPO, The Full BayesOpt Pipeline for HPO, Acquisition Functions Deep Dive, Search Space Design, Full Python Implementation, Multi-Fidelity and Parallel HPO, Tools Comparison, Real-World Case Studies, Practical Guide and Pitfalls.
Tuning a 10-hyperparameter neural network by grid search with 5 values each requires 9.7 million experiments. Random search needs about 60 to find a good config. Gaussian Process Bayesian Optimization needs 20. Same accuracy, 500,000 times less compute.
That gap is the entire reason GP-based hyperparameter optimization moved from a curious research idea to the production default at Google, Meta, and OpenAI. If a single training run takes hours and costs hundreds of dollars in GPU time, you literally cannot afford grid search. You also cannot afford random search to “stumble onto” something good. You need the optimizer to think between trials, picking the next configuration with knowledge of every previous one.
Gaussian Processes are the mathematical machinery that makes this possible. A GP fits a smooth surrogate over your validation loss landscape, quantifies its own uncertainty everywhere, and an acquisition function turns that uncertainty into a principled “where should I look next?” decision.
This post is the practitioner guide. We’re not re-deriving GP regression—for the underlying math (kernels, posterior, marginal likelihood), see the Gaussian Process fundamentals post with Python and GPyTorch. Here we focus on the applied question: how do you actually tune XGBoost, a CNN, or a transformer using GP-based BayesOpt in production?
You’ll get four working code examples (scikit-optimize on XGBoost, BoTorch on a CNN, multi-objective BO with qNEHVI, and Optuna with the BoTorch sampler), a walk through every common acquisition function, three SVG diagrams, and an opinionated tools recommendation. Let’s go.
Why Hyperparameter Tuning Is Hard
Before we praise GPs, it’s worth being honest about what makes HPO genuinely difficult—because the difficulty is what justifies all the BayesOpt machinery in the first place.
The Combinatorial Explosion
A typical modern ML model has somewhere between 10 and 30 tunable hyperparameters. A baseline XGBoost has 10–15 (learning rate, max depth, n_estimators, subsample, colsample_bytree, min_child_weight, gamma, reg_alpha, reg_lambda, scale_pos_weight…). A vision transformer has more (depth, width, heads, MLP ratio, patch size, dropout, attention dropout, learning rate, weight decay, warmup, label smoothing, mixup alpha, drop path, EMA…).
If you grid-search 10 hyperparameters with 5 values each, that’s 510 ≈ 9.77 million configurations. At 30 minutes per training run on a single GPU, that’s 5,580 GPU-years. Even with massive parallelism, this is dead on arrival.
Non-Trivial Interactions
Hyperparameters are not independent. The optimal learning rate depends on batch size (linear scaling rule), depends on optimizer (Adam vs SGD), depends on weight initialization, depends on architecture depth. Grid search assumes you can study them one at a time, which is exactly wrong.
Random search handles this better, by sampling jointly, it sees interactions. But it still wastes compute on unpromising regions because it has no memory between trials.
Each Evaluation Is Expensive
Training a single config can take anywhere from minutes (small XGBoost) to days (large language model fine-tune). When each evaluation costs $50–$500 in cloud GPU time, sample efficiency stops being academic and becomes a budget line item.
Noise
The same hyperparameters give different validation losses across random seeds. Variance from data shuffling, dropout randomness, weight initialization, and stochastic optimization means every observation is noisy. A naive optimizer treats this noise as signal and chases ghosts. GPs handle observation noise natively through the kernel—that’s a built-in advantage.
Mixed Types and Conditional Spaces
Real search spaces include continuous parameters (learning rate), integers (max depth, number of layers), categoricals (activation function, optimizer choice), and conditional dimensions (dropout rate only matters if dropout is enabled; momentum only matters for SGD, not Adam). Standard GPs assume continuous Euclidean inputs, so this is a real engineering challenge that we’ll address in the search space section.
The HPO Landscape: A Method Tour
Before zooming into GPs, here’s the practical taxonomy of methods you’ll see in the wild.
Grid Search
Cartesian product of values for each hyperparameter. Easy to implement, easy to parallelize, embarrassingly inefficient. Breaks past 4–5 hyperparameters because of the curse of dimensionality. Use only for tiny problems or final pinning of 2–3 parameters.
Random Search
Sample uniformly at random from the search space. Bergstra & Bengio (2012) famously showed it dominates grid search because most hyperparameters don’t matter equally—random search projects effectively onto the “important” axes. It’s the baseline every other method should beat. If a fancy method can’t beat random search, it’s broken.
Evolutionary / Genetic Algorithms
Maintain a population of configurations, mutate and recombine them, select the fittest. Parallelizes beautifully, no gradient needed, handles weird search spaces. Sample efficiency is moderate, better than random, usually worse than BO. Used heavily in NAS (Regularized Evolution, AmoebaNet). For a deeper dive, see the genetic algorithm Python implementation guide.
Bandit-Based: Hyperband and ASHA
Frame HPO as a multi-armed bandit problem. Run many configs for a small budget, kill the worst, double the budget for survivors, repeat. Successive Halving is the core idea. Hyperband sweeps over different initial budgets to hedge against bad fidelity choices. ASHA is the asynchronous variant that scales to massive parallelism. These are multi-fidelity methods—they use cheap proxies (early epochs) to filter expensive ones.
Bayesian Optimization with GPs
Fit a GP surrogate over (hyperparameter, validation_loss) pairs. Use an acquisition function to pick the next trial. Sample-efficient, principled uncertainty, smooth and theoretically grounded. The focus of this post.
TPE (Tree-Structured Parzen Estimator)
Bayesian optimization with a different surrogate: instead of a GP, model two densities, p(x | y < threshold) and p(x | y ≥ threshold), and pick x maximizing their ratio. Handles conditional spaces natively, scales well in dimensions, dominates the Optuna and HyperOpt defaults. Less sample-efficient than GPs in low-D, but more flexible in high-D and with mixed types.
Hybrid: BOHB
Falkner et al. 2018: combine Bayesian Optimization (TPE) with Hyperband. Best of both—Hyperband’s compute efficiency via early stopping, BO’s smart sampling instead of random sampling within rungs. Often the right default for deep learning HPO when you have tens of GPUs.
Quick Decision: When to Use What
| Method | Sample Efficiency | Parallelism | Complexity | Categorical Support | Best For |
|---|---|---|---|---|---|
| Grid Search | Very low | Trivial | Trivial | Native | ≤3 hyperparams, final pinning |
| Random Search | Low | Trivial | Trivial | Native | Baseline, exploration phase |
| Genetic Algorithm | Medium | Excellent | Medium | Native | NAS, irregular spaces |
| Hyperband / ASHA | Medium | Excellent | Medium | Native | Big compute, slow training |
| TPE | High | Good | Medium | Native, conditional | Mixed types, conditional spaces |
| GP-BO | Highest | Good (qEI/Thompson) | High | Custom kernels needed | ≤20 dims, expensive evals |
| BOHB | Highest | Excellent | High | Native (TPE-based) | Deep learning at scale |
Why Gaussian Processes Win for HPO
For the majority of “real” HPO problems, under 20 dimensions, expensive evaluations, mostly continuous space—GP-based BO is the strongest method on every benchmark we’ve seen. Here’s why specifically.
Sample Efficiency Is Everything
When each evaluation costs hours of GPU time, you don’t care about wall-clock overhead of fitting a GP (a few seconds). You care about making every trial count. GPs use the full information of every prior observation to decide the next one. Random search throws that information away.
Principled Uncertainty
A GP doesn’t just predict the loss—it predicts the loss and a confidence interval. This is what makes intelligent exploration possible. The GP knows where it doesn’t know, and the acquisition function exploits that. Without a probabilistic surrogate, “exploration” becomes random sampling with extra steps.
Smooth Surrogate, Smooth Landscape
Hyperparameter loss landscapes are usually smooth, especially in log-space (learning rate, weight decay). The Matérn 5/2 kernel is a near-perfect inductive bias for this. GPs interpolate cleanly between observations, giving you a believable map of the space after just 10–20 trials.
Calibrated Exploration vs Exploitation
Acquisition functions like Expected Improvement automatically balance “go where we think it’s good” (exploitation) and “go where we don’t know yet” (exploration). The trade-off emerges from the math, not from a hand-tuned epsilon-greedy.
Sweet Spot: ≤20 Dimensions
GPs become unwieldy past ~20 dimensions because the kernel struggles to model meaningful similarity in high-D Euclidean space. The good news: the vast majority of HPO problems are in this regime. For higher dimensions, see the discussion of TuRBO and random embeddings.
The Full BayesOpt Pipeline for HPO
Here’s what GP-based BayesOpt actually does, step by step. We’re describing the loop you’ll see in BoTorch, scikit-optimize, and Optuna’s GP sampler.
Step 1: Define the Search Space
Specify the bounds and type for each hyperparameter. Continuous (with optional log scale), integer, categorical. This is where most production mistakes happen, bounds too tight (miss the optimum), bounds too wide (waste trials in bad regions), wrong scale (linear instead of log for learning rate).
Step 2: Initial Random Trials
Run 5–10 random configurations to seed the GP. Without seed observations, the GP has no signal and the acquisition function picks the center of the box, every time. A common rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension.
Step 3: Fit the GP Surrogate
Given observations (x1, y1), …, (xn, yn), fit a GP with Matérn 5/2 kernel (default for HPO). Optimize kernel hyperparameters (lengthscales, signal variance, noise) via maximum marginal likelihood. This takes seconds for n < 1000.
Step 4: Optimize the Acquisition Function
The acquisition function α(x) takes the GP posterior and outputs a scalar—”how valuable is it to evaluate at x?” Maximize α(x) over the search space (using L-BFGS, multi-start, or random sampling for non-smooth cases). The argmax is your next trial.
Step 5: Run the Trial
Train the model with the proposed hyperparameters. Record (xn+1, yn+1).
Step 6: Update and Repeat
Append the new observation, refit the GP, optimize the acquisition again, propose the next trial. Loop until budget exhausted.
Acquisition Functions Deep Dive
The acquisition function is where exploration vs exploitation lives. Choosing the right one matters less than people think, Expected Improvement is fine 90% of the time—but understanding the options helps you debug when things go wrong.
Expected Improvement (EI)
EI(x) = E[max(0, fbest − f(x))]. “How much do we expect to improve over the current best?” For a Gaussian posterior with mean μ(x) and standard deviation σ(x), this has a closed form:
EI(x) = (fbest − μ(x)) · Φ(z) + σ(x) · φ(z), where z = (fbest − μ(x)) / σ(x)
Φ is the standard normal CDF, φ is the PDF. Smooth, differentiable, well-behaved. The default choice. Slight bias toward exploitation, but in practice it explores plenty because σ(x) is large in unexplored regions.
Upper Confidence Bound (UCB)
UCB(x) = μ(x) − β · σ(x) (for minimization; flip the sign for maximization). The β coefficient explicitly controls exploration. Larger β → more exploration. Theoretical regret bounds (Srinivas et al. 2010): with βt growing logarithmically, UCB has sublinear cumulative regret. In practice, β = 2 is a fine default. UCB is more aggressive about exploring than EI when σ is large.
Probability of Improvement (PI)
PI(x) = P(f(x) < fbest) = Φ(z). Just the probability you’ll beat the current best by any amount. Purely greedy—picks any tiny improvement, can stagnate by exploiting near the current best forever. Rarely used in modern HPO except as a teaching example.
Thompson Sampling
Sample a function from the GP posterior, take its argmin. This is naturally diverse: independent samples from the posterior pick different points. The killer feature: trivial parallelization. For batch HPO of size k, draw k posterior samples and run all argmins simultaneously. Used heavily in production systems with many parallel workers.
Knowledge Gradient (KG)
EI is myopic, it only cares about immediate improvement. KG looks one step ahead: “what’s the expected best after I observe at x and update the GP?” More principled, more expensive (requires nested optimization), and roughly 10–20% better in practice for noisy problems. BoTorch’s qKnowledgeGradient is the standard implementation.
Max-Value Entropy Search (MES)
Information-theoretic: pick x that gives maximum mutual information about the location of the optimum. Robust to noise, handles batches well, but more complex to compute. Wang & Jegelka 2017. Available as qMaxValueEntropy in BoTorch.
| Acquisition | Formula Intuition | Strength | Weakness | When to Use |
|---|---|---|---|---|
| EI | Expected gain over best so far | Closed-form, balanced | Slight exploitation bias | Default—start here |
| UCB | μ − β·σ | Tunable exploration, regret bounds | Need to set β | When EI underexplores |
| PI | Probability of any improvement | Simplest | Stagnates, no exploration | Almost never |
| Thompson | argmin of posterior sample | Trivial parallelization | Higher variance | Batch / parallel HPO |
| KG | Look-ahead expected best | Robust to noise | Expensive to compute | Very noisy objectives |
| MES | Mutual info about optimum | Strong batch behavior | Implementation complexity | Research / best-of-best |
Search Space Design
This is the most underappreciated part of HPO. A GP can only optimize what you tell it to optimize, and most HPO failures trace back to a poorly-defined search space.
Log-Scale for Multiplicative Parameters
Learning rates, weight decay, regularization coefficients. These have an effect that is fundamentally multiplicative—going from 1e-3 to 1e-4 is “the same kind of step” as going from 1e-4 to 1e-5. Use log-uniform sampling. Bounds typically 1e-5 to 1e-1 for learning rate.
Linear Scale for Additive Parameters
Layer sizes, number of estimators, batch size, number of layers. These are additive and roughly linear in their effect.
Integer Handling
Most BO libraries treat integers as continuous and round at evaluation time. This works but creates plateaus in the objective. BoTorch’s OneHotToNumeric and Round input transforms handle it cleanly. For Optuna and skopt, just declare the param as integer and the library handles rounding.
Categorical Handling
Three options: (1) one-hot encode and treat as continuous (works, slight efficiency loss), (2) use a custom kernel like the categorical Hamming kernel (cleaner), (3) use TPE which handles categoricals natively. BoTorch’s MixedSingleTaskGP handles mixed continuous-categorical.
Conditional Spaces
“Dropout rate is only meaningful if dropout is enabled.” “Momentum is only relevant for SGD, not Adam.” TPE handles this natively, it learns the conditional structure. GP-based BO needs custom handling: typically you flatten to the union and let the optimizer learn that some dimensions are irrelevant. For deeply conditional spaces, TPE often wins.
| Hyperparameter Type | Recommended Representation | Typical Range |
|---|---|---|
| Learning rate | Log-uniform continuous | 1e-5 to 1e-1 |
| Weight decay / L2 | Log-uniform continuous | 1e-6 to 1e-2 |
| Dropout rate | Linear continuous | 0.0 to 0.5 |
| Hidden size / width | Log-uniform integer | 32 to 1024 |
| Number of layers | Linear integer | 2 to 12 |
| Batch size | Log-uniform integer (powers of 2) | 8 to 512 |
| Optimizer choice | Categorical | {Adam, SGD, AdamW, RMSprop} |
| Activation | Categorical | {ReLU, GELU, SiLU, Mish} |
| XGBoost max_depth | Linear integer | 3 to 12 |
| XGBoost subsample | Linear continuous | 0.5 to 1.0 |
Full Python Implementation
Four working examples, in increasing complexity. Use any of these as a starting template for your own HPO.
Example 1: Tuning XGBoost with scikit-optimize
scikit-optimize is the gentlest entry point—pip install, sklearn-style API, GP-based by default. Great for tabular ML.
"""
GP-BO for XGBoost using scikit-optimize.
pip install scikit-optimize xgboost scikit-learn matplotlib
"""
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from skopt.plots import plot_convergence, plot_objective
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
# Load a real tabular dataset
data = fetch_openml("adult", version=2, as_frame=True)
X = data.data.select_dtypes(include=[np.number]).fillna(0).values
y = (data.target == ">50K").astype(int).values
# Define the search space
space = [
Real(1e-3, 0.3, prior="log-uniform", name="learning_rate"),
Integer(3, 12, name="max_depth"),
Integer(50, 500, name="n_estimators"),
Real(0.5, 1.0, name="subsample"),
Real(0.5, 1.0, name="colsample_bytree"),
Real(1e-6, 1.0, prior="log-uniform", name="reg_alpha"),
Real(1e-6, 1.0, prior="log-uniform", name="reg_lambda"),
Real(0.0, 5.0, name="gamma"),
]
@use_named_args(space)
def objective(**params):
"""We minimize negative ROC AUC (skopt minimizes)."""
clf = XGBClassifier(
**params,
tree_method="hist",
eval_metric="logloss",
n_jobs=-1,
random_state=42,
verbosity=0,
)
score = cross_val_score(
clf, X, y, cv=3, scoring="roc_auc", n_jobs=1
).mean()
return -score
# Run GP-BO with EI acquisition
result = gp_minimize(
objective,
space,
n_calls=50, # total trials
n_initial_points=10, # random seed trials
acq_func="EI", # Expected Improvement
random_state=42,
verbose=True,
)
print(f"Best AUC: {-result.fun:.4f}")
print("Best hyperparameters:")
for name, val in zip([s.name for s in space], result.x):
print(f" {name}: {val}")
# Diagnostics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_convergence(result, ax=axes[0])
axes[0].set_title("Convergence")
plot_objective(result, ax=axes[1] if False else None) # separate fig
plt.tight_layout()
plt.savefig("xgb_bo_convergence.png", dpi=120)
What’s happening: 10 random seed trials, then 40 GP-guided trials using Expected Improvement. The plot_convergence shows the running best score vs trial number—the classic “BO crushes random search” plot. plot_objective shows partial dependence plots for each hyperparameter, revealing which ones actually mattered.
On the Adult dataset with 50 trials, you’ll typically beat random search’s 50-trial best by 0.5–1.5% AUC. That sounds small until you remember it’s “free” (same trial budget) and reproducible.
Example 2: Tuning a PyTorch CNN with BoTorch
BoTorch is what you reach for when you outgrow scikit-optimize. It’s PyTorch-native, GPU-accelerated, and built on top of GPyTorch (the same library used in the GP fundamentals post). For research and production deep learning HPO, this is the standard.
"""
GP-BO for a PyTorch CNN using BoTorch.
pip install botorch gpytorch torch torchvision
"""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood
device = "cuda" if torch.cuda.is_available() else "cpu"
# Search space: [log_lr, log_wd, dropout, log_hidden]
# Bounds in normalized space [0,1] mapped to actual ranges below.
BOUNDS = torch.tensor(
[[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
device=device, dtype=torch.double,
)
def unnormalize(x):
"""Map [0,1]^4 to actual hyperparameter ranges."""
log_lr = -5.0 + x[..., 0] * 3.0 # 1e-5 to 1e-2
log_wd = -6.0 + x[..., 1] * 4.0 # 1e-6 to 1e-2
dropout = x[..., 2] * 0.5 # 0 to 0.5
log_hidden = 5.0 + x[..., 3] * 4.0 # 32 to 512 (log2)
return {
"lr": float(10 ** log_lr),
"wd": float(10 ** log_wd),
"dropout": float(dropout),
"hidden": int(2 ** round(log_hidden.item())),
}
class SmallCNN(nn.Module):
def __init__(self, hidden, dropout):
super().__init__()
self.net = nn.Sequential(
nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
nn.MaxPool2d(2),
nn.Flatten(),
nn.Linear(32 * 7 * 7, hidden), nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(hidden, 10),
)
def forward(self, x):
return self.net(x)
# Load FashionMNIST (small enough to iterate quickly)
tfm = transforms.Compose([transforms.ToTensor()])
train_ds = datasets.FashionMNIST("./data", train=True, download=True, transform=tfm)
val_ds = datasets.FashionMNIST("./data", train=False, download=True, transform=tfm)
train_loader = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=2)
val_loader = DataLoader(val_ds, batch_size=512, num_workers=2)
def train_eval(params, epochs=3):
"""Train CNN with given hyperparams, return validation accuracy."""
model = SmallCNN(params["hidden"], params["dropout"]).to(device)
opt = optim.AdamW(model.parameters(), lr=params["lr"], weight_decay=params["wd"])
crit = nn.CrossEntropyLoss()
for _ in range(epochs):
model.train()
for xb, yb in train_loader:
xb, yb = xb.to(device), yb.to(device)
opt.zero_grad()
crit(model(xb), yb).backward()
opt.step()
# Evaluate
model.eval()
correct = total = 0
with torch.no_grad():
for xb, yb in val_loader:
xb, yb = xb.to(device), yb.to(device)
preds = model(xb).argmax(1)
correct += (preds == yb).sum().item()
total += yb.size(0)
return correct / total
# Initial random trials
N_INIT = 8
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=torch.double)
Y_obs = torch.tensor(
[[train_eval(unnormalize(x))] for x in X_obs],
device=device, dtype=torch.double,
)
print(f"Init complete. Best so far: {Y_obs.max().item():.4f}")
# BO loop
N_BO_ITERS = 20
for it in range(N_BO_ITERS):
# Fit GP (BoTorch handles standardization, kernel, MLL)
gp = SingleTaskGP(X_obs, Y_obs)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)
# qEI acquisition (q=1 for sequential)
acq = qExpectedImprovement(model=gp, best_f=Y_obs.max())
candidate, _ = optimize_acqf(
acq_function=acq,
bounds=BOUNDS,
q=1,
num_restarts=10,
raw_samples=512,
)
# Evaluate candidate
new_y = train_eval(unnormalize(candidate.squeeze(0)))
X_obs = torch.cat([X_obs, candidate], dim=0)
Y_obs = torch.cat([Y_obs, torch.tensor([[new_y]], device=device, dtype=torch.double)], dim=0)
print(f"Iter {it+1}: y={new_y:.4f} | best={Y_obs.max().item():.4f}")
best_idx = Y_obs.argmax()
print("\nBest hyperparameters:")
print(unnormalize(X_obs[best_idx]))
Notes on this implementation:
- We work in normalized [0,1]d space and unnormalize for the actual training. BoTorch strongly prefers normalized inputs.
- BoTorch’s
SingleTaskGPuses Matérn 5/2 kernel by default with automatic relevance determination (ARD), which learns per-dimension lengthscales. optimize_acqfuses 10 multi-start L-BFGS optimizations with 512 random initial points to find the global optimum of the acquisition function.- This loop runs 28 trials total (8 random + 20 BO). On a single GPU with 3-epoch FashionMNIST, that’s ~30 minutes total.
Example 3: Multi-Objective BO with qNEHVI
Real-world deployment cares about more than accuracy. Latency matters. Memory matters. With multi-objective BO, you find the entire Pareto frontier between competing objectives.
"""
Multi-objective HPO: maximize accuracy AND minimize latency.
Returns the Pareto frontier instead of a single best.
"""
import time
import torch
from botorch.models import SingleTaskGP, ModelListGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition.multi_objective.monte_carlo import qNoisyExpectedHypervolumeImprovement
from botorch.optim import optimize_acqf
from botorch.utils.multi_objective.box_decompositions.dominated import DominatedPartitioning
from gpytorch.mlls import ExactMarginalLogLikelihood
device = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.double
# Search space: same 4-dim CNN tuning problem
BOUNDS = torch.tensor([[0.0]*4, [1.0]*4], device=device, dtype=DTYPE)
# Two objectives: accuracy (maximize) and -latency_ms (maximize, since BoTorch maximizes)
REF_POINT = torch.tensor([0.5, -200.0], device=device, dtype=DTYPE) # worst-case bounds
def objective_2d(x_norm):
"""Returns [accuracy, -latency_ms]."""
params = unnormalize(x_norm) # reuse from Example 2
acc = train_eval(params, epochs=3)
# Measure latency on a batch
model = SmallCNN(params["hidden"], params["dropout"]).to(device).eval()
dummy = torch.randn(64, 1, 28, 28, device=device)
# Warm up
with torch.no_grad():
_ = model(dummy)
torch.cuda.synchronize() if device == "cuda" else None
t0 = time.perf_counter()
with torch.no_grad():
for _ in range(20):
_ = model(dummy)
torch.cuda.synchronize() if device == "cuda" else None
latency_ms = (time.perf_counter() - t0) * 1000 / 20
return torch.tensor([acc, -latency_ms], device=device, dtype=DTYPE)
# Initial design
N_INIT = 10
torch.manual_seed(0)
X_obs = torch.rand(N_INIT, 4, device=device, dtype=DTYPE)
Y_obs = torch.stack([objective_2d(x) for x in X_obs])
# Multi-objective BO loop
for it in range(20):
# Fit independent GPs for each objective
models = [SingleTaskGP(X_obs, Y_obs[:, i:i+1]) for i in range(2)]
model_list = ModelListGP(*models)
for m in models:
mll = ExactMarginalLogLikelihood(m.likelihood, m)
fit_gpytorch_mll(mll)
# qNEHVI acquisition
acq = qNoisyExpectedHypervolumeImprovement(
model=model_list,
ref_point=REF_POINT,
X_baseline=X_obs,
prune_baseline=True,
)
candidate, _ = optimize_acqf(
acq_function=acq, bounds=BOUNDS,
q=2, num_restarts=10, raw_samples=512,
)
new_y = torch.stack([objective_2d(x) for x in candidate])
X_obs = torch.cat([X_obs, candidate])
Y_obs = torch.cat([Y_obs, new_y])
# Compute hypervolume
hv = DominatedPartitioning(ref_point=REF_POINT, Y=Y_obs).compute_hypervolume()
print(f"Iter {it+1}: HV={hv.item():.3f} | n_obs={len(X_obs)}")
# Extract Pareto frontier
from botorch.utils.multi_objective.pareto import is_non_dominated
mask = is_non_dominated(Y_obs)
pareto = Y_obs[mask]
print(f"\nPareto frontier: {len(pareto)} points")
for acc, neg_lat in pareto.cpu().numpy():
print(f" acc={acc:.4f}, latency={-neg_lat:.2f}ms")
The output is not “the best config” but a frontier: configs that are Pareto-optimal, for each one, you can’t improve accuracy without sacrificing latency, and vice versa. The hypervolume metric quantifies the size of the dominated region; bigger is better.
Example 4: Optuna with BoTorch Sampler
Optuna is the most popular HPO library by adoption—and the underrated detail is that you can swap its default TPE sampler for a GP-based BoTorch sampler in one line.
"""
Optuna with GP (BoTorch) sampler vs default TPE.
pip install optuna botorch
"""
import optuna
from optuna.samplers import TPESampler
from optuna.integration import BoTorchSampler
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
def objective(trial):
params = {
"learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
"max_depth": trial.suggest_int("max_depth", 3, 12),
"n_estimators": trial.suggest_int("n_estimators", 50, 500),
"subsample": trial.suggest_float("subsample", 0.5, 1.0),
"colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
"reg_alpha": trial.suggest_float("reg_alpha", 1e-6, 1.0, log=True),
"reg_lambda": trial.suggest_float("reg_lambda", 1e-6, 1.0, log=True),
}
clf = xgb.XGBClassifier(
**params, tree_method="hist", eval_metric="logloss",
n_jobs=-1, random_state=42, verbosity=0,
)
return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
# A: TPE sampler (Optuna default)
study_tpe = optuna.create_study(
direction="maximize",
sampler=TPESampler(seed=42, n_startup_trials=10),
)
study_tpe.optimize(objective, n_trials=50, show_progress_bar=True)
# B: BoTorch (GP) sampler
study_gp = optuna.create_study(
direction="maximize",
sampler=BoTorchSampler(n_startup_trials=10, seed=42),
)
study_gp.optimize(objective, n_trials=50, show_progress_bar=True)
print(f"TPE best AUC: {study_tpe.best_value:.4f}")
print(f"GP-BO best AUC: {study_gp.best_value:.4f}")
# Visualize convergence
import matplotlib.pyplot as plt
def running_best(trials):
vals = [t.value for t in trials]
return np.maximum.accumulate(vals)
plt.figure(figsize=(10, 5))
plt.plot(running_best(study_tpe.trials), label="TPE", linewidth=2)
plt.plot(running_best(study_gp.trials), label="GP-BO (BoTorch)", linewidth=2)
plt.xlabel("Trial")
plt.ylabel("Best AUC so far")
plt.legend()
plt.grid(alpha=0.3)
plt.title("TPE vs GP-BO convergence")
plt.savefig("tpe_vs_gp.png", dpi=120, bbox_inches="tight")
Empirically on smaller search spaces (≤10 dims) and noisy objectives, GP-BO converges faster than TPE in trial count. On bigger spaces or with conditional dimensions, TPE catches up. The key benefit of Optuna is the framework: pruning, distributed trials, web dashboard, and easy sampler swapping.
Multi-Fidelity and Parallel HPO
The compute reality of modern deep learning: full training is expensive, but partial training is informative. A 100-epoch run is 10× more expensive than a 10-epoch run, but the 10-epoch result correlates strongly with the 100-epoch result. Multi-fidelity HPO exploits this.
BOHB (Falkner et al. 2018)
Combine Hyperband (early stopping based on partial training curves) with BO (smart sampling instead of random). Hyperband decides when to kill a trial; BO decides which configs to try at each rung. Empirically dominates either method alone for deep learning HPO.
BOHB uses TPE rather than GP for the BO part because the sampling-based density model handles the high-dimensional and conditional spaces of NN architectures well. There are GP variants (Falkner mentions trade-offs) but TPE is the default.
Multi-Fidelity BO (MFBO)
Add fidelity (e.g., training epochs, dataset fraction) as an extra dimension in the GP. The GP learns the relationship between fidelity and final performance. The acquisition function picks both x AND a fidelity, balancing information gained against compute cost. BoTorch has qMultiFidelityKnowledgeGradient for this.
Asynchronous BO (Kriging Believer)
For batch parallelism: when a trial is running, “fantasize” its result using the GP posterior mean, add the hallucinated observation to the training set, fit a temporary GP, and pick the next trial assuming the in-flight one will land at its predicted value. Update for real when the trial finishes. This decouples scheduling from observations, enabling many parallel workers without serializing on the GP fit.
Trust Region BO (TuRBO)
Eriksson et al. 2019. For high-D HPO (50+ dimensions), maintain a small “trust region” around the current best, fit a local GP, optimize within. Expand if successful, shrink if not. Effectively decomposes a high-D problem into many local low-D problems. Available in BoTorch.
Tools Comparison
| Tool | Default Backend | Multi-Objective | Constraints | Conditional Spaces | Best For |
|---|---|---|---|---|---|
| Optuna | TPE (GP via BoTorch) | Yes | Limited | Native | Production engineering |
| Ax | GP (BoTorch) | Yes (Pareto) | Yes | Yes | Adaptive experimentation |
| BoTorch | GP (PyTorch) | Yes | Yes | Custom | Research, custom algorithms |
| scikit-optimize | GP / RF | No | No | No | Quickstart, sklearn integration |
| HyperOpt | TPE | Limited | No | Native | Mature distributed TPE |
| Ray Tune | Pluggable (BO/TPE/PBT/ASHA) | Yes (via Ax) | Via backend | Via backend | Distributed orchestration |
| W&B Sweeps | Bayes / Random / Grid | No | No | Limited | Experiment tracking integration |
| Vertex AI Vizier | GP (Google) | Yes | Yes | Yes | Managed, GCP-native |
| SageMaker AMT | GP / Hyperband | No | No | Limited | Managed, AWS-native |
Opinionated Recommendation
For 90% of practitioners doing 90% of HPO problems:
- Start with Optuna. The API is the cleanest, defaults are sensible, the dashboard is great, and you can swap in BoTorch sampler when you outgrow TPE.
- Move to Ax when you need multi-objective with constraints, or you want a higher-level service-style API for ongoing experimentation.
- Use BoTorch directly when you’re writing custom acquisition functions, doing research, or need fine control over GP fitting (custom kernels, priors, multi-task).
- Use scikit-optimize for one-off tabular ML tuning where simplicity beats power.
- Use Ray Tune when distributed orchestration is the bottleneck—you have hundreds of workers and need scheduling.
Real-World Case Studies
Google Vizier
Vizier is Google’s internal Bayesian Optimization service, used to tune everything from ad models to ranking systems to LLM training pipelines. The original 2017 paper reports thousands of studies per day across the company. The default algorithm is GP-based BO with batched parallel evaluation. Vertex AI Vizier exposes this externally on GCP.
Meta’s Ax / BoTorch
Meta open-sourced Ax and BoTorch from the work of optimizing ranking models. Public reports show ranking-quality lifts of >40% relative to random search, with significantly fewer trials required. The same stack tunes hyperparameters in their video encoding research, ad auction simulators, and infrastructure scheduling.
AlphaGo and AlphaFold
DeepMind has used Bayesian optimization in the inner loop for years. AlphaGo reportedly used GP-based BO to tune the MCTS hyperparameters and training schedule. AlphaFold 2’s training pipeline used multi-fidelity BO for architecture-related hyperparameters where each evaluation was prohibitively expensive.
Drug Discovery and Protein Design
Beyond ML hyperparameters, GP-BO is the standard for “real” experimental design, choosing which molecules to synthesize next, which protein variants to screen, which experimental conditions to test. Each “trial” costs days of lab time and thousands of dollars in reagents. Sample efficiency stops being a nice-to-have.
Practical Guide and Pitfalls
Initial Design: Don’t Start Cold
Run 5–10 random trials before BO kicks in. Without seed observations, the GP has no signal and the acquisition function picks the geometric center of the box. Rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension.
Parallelize 4–8 Trials per BO Step
Modern HPO at scale uses batch acquisition (qEI, qNEI, qNEHVI) to propose 4–8 candidates per BO iteration. This is the sweet spot—enough parallelism to use a multi-GPU node, not so much that the GP information gain saturates within a batch.
Stopping Criteria
- Trial budget (most common): “run 100 trials.” Simple and reproducible.
- Time budget: “run for 24 hours.” Useful in production where wallclock matters more than trial count.
- Convergence: stop when running-best improvement < ε for k consecutive trials. Risky alone—BO can stall before finding the global optimum.
- Combination: max(trial_budget, no_improvement_for_k_trials). Practical default.
Reproducibility
Seed everything: numpy, torch, the BO library, the model training. Log every (config, score, wallclock, seed) tuple. The cheapest way to lose value from HPO is being unable to reproduce the best config. Pair with experiment tracking (W&B, MLflow) and you’re set.
Debugging GP Fits
If BO recommendations look weird (clustered in a corner, wildly oscillating), check:
- Lengthscales: are they reasonable? Tiny lengthscales mean the GP thinks every observation is noise; huge lengthscales mean it thinks the function is constant.
- Output standardization: BoTorch handles this internally; some libraries don’t. Standardize y manually if in doubt.
- Input normalization: always normalize inputs to [0,1]d before passing to a GP.
- Noise: is observation noise too low? Refit with a slightly higher noise prior.
High-Dimensional Pitfalls
Past ~20 dimensions, vanilla GPs degrade. Symptoms: BO doesn’t beat random search, GP lengthscales hit boundary values. Solutions: TuRBO (trust regions), random embeddings (REMBO), dimensionality reduction via PCA on a random sample, or switch to evolutionary methods. For more on high-dim optimization paradigms, see our posts on genetic algorithms and mixed-integer programming.
Constrained BO
Don’t waste evaluations on infeasible configurations. If your model has a memory budget, latency budget, or hardware constraint, model the constraint as a separate GP and use a constrained acquisition function (e.g., expected feasible improvement, qNEHVI with constraints in BoTorch). Saves enormous trial budget.
The Cold-Start Problem
When tuning a new but related task, you typically have prior trials from similar tasks. Transferable BO initializes the GP using observations from prior studies (with some weighting), giving you an informative prior instead of starting from scratch. Available in Ax (multi-task BO) and the academic literature.
Trial Replication and Noise
For genuinely noisy objectives (RL training rewards, small-data classification), replicate the best candidates to reduce noise. The Central Limit Theorem guide covers the math: averaging k noisy observations reduces standard error by √k. Spend 20% of trial budget on replication and you’ll get a much more reliable best.
Frequently Asked Questions
Why is GP-based BO better than random search for HPO?
GP-based BO uses information from prior trials to pick the next one. Random search throws that information away. On benchmark HPO problems with 5–20 hyperparameters, GP-BO typically reaches the same accuracy as random search using 3–10× fewer trials. When each trial costs hours of GPU time, that compounds into significant compute savings—typically 60–90% of the budget.
When does TPE beat GP-based BO?
Three regimes: (1) high-dimensional spaces (30+ hyperparameters) where GPs degrade, (2) heavily conditional spaces (this hyperparameter only exists if that one is true) where TPE handles structure natively, (3) when you need very fast wall-clock per BO iteration because TPE’s sampling is cheaper than GP fitting + acquisition optimization. For most “normal” HPO with ≤20 dims, GP-BO is more sample-efficient.
How many initial random trials should I run before starting BO?
Rule of thumb: n_init = max(5, 2 · d) where d is the search space dimension. For a 4-dimensional space, 8–10 random trials. For 10 dimensions, 20 random trials. Without enough seeds, the GP has no signal and BO collapses to picking the box center repeatedly.
Can GP-BO handle categorical hyperparameters like activation function or optimizer choice?
Yes, three approaches: (1) one-hot encode and treat as continuous (works, slight efficiency loss), (2) use a custom kernel like Hamming distance for categoricals (cleaner, BoTorch’s MixedSingleTaskGP does this), (3) switch to TPE which handles categoricals natively. For 1–2 categorical dimensions, one-hot is fine. For many categoricals, use TPE or a properly mixed kernel.
BoTorch vs Optuna—which should I use?
For most production HPO, start with Optuna: cleaner API, better tooling (dashboard, study persistence, distributed trials), and you can swap in the BoTorch sampler for GP-BO when needed. Use BoTorch directly when you need custom acquisition functions, multi-task GPs, advanced features (qNEHVI, qKG, MES), or are doing research. Many production setups use both: Optuna for orchestration, BoTorch sampler under the hood.
References and Further Reading
- Bergstra & Bengio (2012). Random Search for Hyper-Parameter Optimization. JMLR. The paper that established random search as the baseline.
- Frazier (2018). A Tutorial on Bayesian Optimization. arXiv:1807.02811. The clearest intro to BO mathematics.
- Falkner et al. (2018). BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML. The BOHB paper.
- Eriksson et al. (2019). Scalable Global Optimization via Local Bayesian Optimization. NeurIPS. TuRBO.
- Wang & Jegelka (2017). Max-value Entropy Search for Efficient Bayesian Optimization. ICML.
- BoTorch documentation,official docs for Meta’s Bayesian optimization library.
- Optuna documentation—practical HPO framework with TPE and GP samplers.
- scikit-optimize documentation—sklearn-style GP and forest-based BO.
- Ax (Adaptive Experimentation Platform),Meta’s higher-level wrapper around BoTorch.
- Gaussian Processes Explained: Python and GPyTorch Guide—the foundational math behind everything in this post
- Genetic Algorithm Explained: Python Implementation Guide—alternative HPO method for high-D and irregular spaces
- Apache Airflow Data Pipeline Orchestration Guide,orchestrate HPO sweeps across compute clusters
- Central Limit Theorem Explained: Python Guide—the statistics behind trial replication and noisy HPO
- Self-Supervised Learning Pretraining Guide—pretraining strategies that change what hyperparameters matter
Leave a Reply