Two years ago, training an LLM meant either renting time at a research lab or accepting that fine-tuning was for billion-dollar companies. By May 2026, you can take Qwen3.6-27B from a Hugging Face download to a domain-specialized model on a single rented H100 for under $15. The tools changed. The math did not, but the people who use it did. This post walks through how to actually train an open-source LLM today — what hardware you need, which model to pick, how to format your data so the trainer does not silently throw it away, and how to put the result behind a serving endpoint that responds in milliseconds.
Summary
What this post covers: A working 2026 playbook for fine-tuning open-source LLMs using three concrete anchors — the dense Qwen3.6-27B, the MoE Qwen3.5-122B-A10B, and OpenAI’s GPT-OSS-120B — from environment setup through deployment.
Key insights:
- QLoRA on a single H100 (80GB) now fine-tunes a 27B dense model in 8 to 12 hours for $10 to $16 of cloud rental, retaining 80 to 90 percent of full fine-tuning quality.
- MoE models like Qwen3.5-122B-A10B (10B active) and GPT-OSS-120B (5.1B active) need VRAM to hold all 122B or 117B weights, even though per-token compute is small — the “active parameter” headline number is a runtime FLOPs claim, not a memory one.
- Chat-template mismatch between training and inference is the single most common cause of a “trained but acts untrained” model: Qwen’s <|im_start|> markers and GPT-OSS’s harmony format are not interchangeable.
- GPT-OSS-120B ships post-trained with MXFP4 quantization on the MoE weights, which is why a 117B-total-parameter model fits in a single 80GB H100 at inference time.
- For anything past 70B at full precision, FSDP2 or DeepSpeed ZeRO-3 sharding is no longer optional — single-node training caps out around 32B dense in FP16 even on H200 (141GB) hardware.
Main topics: The State of Open-Source LLM Training in 2026, Meet the Three Anchor Models, Choosing Full Fine-Tune, LoRA, or QLoRA, Setting Up the Training Environment, Preparing the Dataset, The Actual Training Run, Evaluation That Isn’t Theatre, Deployment, Common Pitfalls and Debugging.
The State of Open-Source LLM Training in 2026
The open-source LLM landscape in May 2026 looks nothing like it did in early 2024. Two structural shifts changed what a single engineer can do alone.
The first shift is architectural. Mixture-of-Experts (MoE) models — where each token activates only a small subset of total parameters — became the dominant shape for any model past 30B. A dense model uses every weight on every token. An MoE model uses a router to send each token to a small fraction of “expert” sub-networks. Qwen3.5-122B-A10B has 122B total parameters but only ~10B active per forward pass. GPT-OSS-120B is 117B total, 5.1B active. The runtime FLOPs look like a small model. The VRAM footprint does not.
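A toy sketch of why active parameters and resident parameters diverge in an MoE layer. The expert count, top-k value, and per-expert size are hypothetical numbers chosen for illustration and do not correspond to either model; the router here is just random scores.

```python
import random

NUM_EXPERTS = 8                 # hypothetical; real models use many more
TOP_K = 2                       # hypothetical number of experts used per token
PARAMS_PER_EXPERT = 1_000_000   # hypothetical size of one expert

def route(router_scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

random.seed(0)
# One row of made-up router scores per token.
tokens = [[random.random() for _ in range(NUM_EXPERTS)] for _ in range(1000)]

used: set[int] = set()
for scores in tokens:
    used.update(route(scores, TOP_K))

resident_params = NUM_EXPERTS * PARAMS_PER_EXPERT  # must all sit in VRAM
active_params = TOP_K * PARAMS_PER_EXPERT          # touched for any one token

print(f"experts hit across the batch: {len(used)} of {NUM_EXPERTS}")
print(f"resident: {resident_params:,}  active per token: {active_params:,}")
```

Each token touches only `TOP_K` experts, but across a batch the set of experts that get hit covers most or all of them, which is why every expert has to stay loaded.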
The second shift is post-training tooling. QLoRA — fine-tuning where the base weights are frozen at 4-bit NF4 (NormalFloat-4, a quantization format optimized for the distribution of neural network weights) and only a small low-rank adapter is trained — went from “research curiosity” in 2023 to “default starting point” in 2026. LoRA (Low-Rank Adaptation) retains 90 to 95 percent of full fine-tuning performance. QLoRA retains 80 to 90 percent while cutting VRAM by roughly 75 percent versus FP16.
What that means in practice: a 7B model whose FP16 weights alone take about 14GB now fits in 5 to 6GB under QLoRA. A 70B model whose FP16 weights take about 140GB now squeezes into 46GB. The hardware bar dropped enough that the question shifted from “can I afford to train this” to “what should I train it on.”
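The weight-only part of those figures is simple arithmetic, sketched below. The helper name is hypothetical, and the result covers the frozen weights only: adapters, optimizer state, quantization constants, and activations account for the gap between these numbers and the totals quoted above.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for size in (7, 70):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 weights {fp16:.1f} GB, 4-bit weights {four_bit:.1f} GB")
# 7B: FP16 weights 14.0 GB, 4-bit weights 3.5 GB
# 70B: FP16 weights 140.0 GB, 4-bit weights 35.0 GB
```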
What this means for someone wanting to actually train a model today: a single H100 or H200, or even a 48GB workstation card like the RTX 6000 Ada, can handle QLoRA on models up to 70B. Past that, you are looking at multi-GPU LoRA or sharded full fine-tuning. We will work through specific recipes for each.
Pretraining from scratch — the 2.1 million H100-hour run that produced GPT-OSS-120B — is still out of reach for almost everyone. What’s well within reach is taking one of these three checkpoints and shaping it to your data, your domain, or your task. That’s what “training an open-source LLM” means in practice in 2026.
Meet the Three Anchor Models
Three models cover the practical range of what people fine-tune today: a dense 27B that fits comfortably on prosumer hardware, a sparse 122B that needs cluster-class memory but cheap compute, and a 117B MoE that ships pre-quantized to fit on a single 80GB card.
Qwen3.6-27B
Released on April 22, 2026 by Alibaba’s Qwen team. Dense — every one of the 27 billion parameters participates in every forward pass. Uses Gated DeltaNet, a hybrid attention scheme that combines a linear-attention path (constant memory cost per token) with traditional softmax self-attention. The linear path handles long-range context; the softmax path keeps short-range precision sharp.
Native context is 262,144 tokens, extensible to 1 million via position-encoding extrapolation. Natively multimodal — the same checkpoint takes images and text. There is a “Thinking Preservation” mechanism that keeps a chain-of-thought reasoning mode and a fast non-thinking mode in one set of weights.
Benchmark numbers from the Qwen team: SWE-bench Verified 77.2 (compared to Qwen3.5-397B-A17B at 76.2), SWE-bench Pro 53.5 (vs 50.9), Terminal-Bench 2.0 59.3 (vs 52.5), SkillsBench 48.2 (vs 30.0). A 27B dense model beating its 397B MoE predecessor on code-related work is the kind of result that makes architecture choice matter again.
Download from the QwenLM/Qwen3.6 official repo or the Hugging Face Qwen/Qwen3.6-27B mirror. Apache 2.0 license — commercial use permitted, attribution required.
Qwen3.5-122B-A10B
Released on February 24, 2026. Sparse MoE: 122 billion total parameters, approximately 10 billion active per forward pass. The “A10B” suffix is the active-parameter count. Each token gets routed through a small subset of experts; the rest of the network stays idle for that token.
Same Gated DeltaNet hybrid attention as 3.6-27B, same 262K native context with extension to 1M+. Text-only at this size. The MoE structure means inference compute looks like a 10B model, but VRAM must still hold all 122B weights — the router doesn’t know in advance which expert any given token will need.
This is the model to pick when you want strong quality but cheap per-token serving. The active-parameter count is what determines latency and energy cost; the total parameter count is what determines hardware purchasing decisions. Most people get this trade-off backwards on first encounter.
GPT-OSS-120B
OpenAI’s first open-weight LLMs since GPT-2 (2019), released August 2025. 117 billion total parameters, 5.1 billion active. Apache 2.0 license. Trained on NVIDIA H100 GPUs using PyTorch with expert-optimized Triton kernels. The training run consumed 2.1 million H100-hours — at $2/hour cloud pricing, that’s roughly $4.2 million in compute alone.
What makes GPT-OSS-120B unusual: it ships post-trained with MXFP4 quantization on the MoE weights. MXFP4 is a 4-bit floating-point format with a shared scale per micro-block. Because the bulk of the parameter count lives in the MoE expert layers, quantizing those to 4-bit pulls the on-disk and in-VRAM footprint low enough to fit on a single 80GB GPU (H100 or AMD MI300X). The non-expert layers stay in higher precision.
Benchmark posture: near-parity with OpenAI’s o4-mini on core reasoning. For a model you can run on a single rented GPU, that’s a notable result. Model card and weights at huggingface.co/openai/gpt-oss-120b; official repo at github.com/openai/gpt-oss; launch announcement at openai.com/index/introducing-gpt-oss.
| Attribute | Qwen3.6-27B | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| Total params | 27B | 122B | 117B |
| Active params | 27B (dense) | ~10B | 5.1B |
| Architecture | Dense, Gated DeltaNet | MoE, Gated DeltaNet | MoE, grouped-query attn |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Release date | 2026-04-22 | 2026-02-24 | August 2025 |
| Native context | 262K (extensible to 1M) | 262K (extensible to 1M+) | 128K |
| Multimodal | Yes (vision + text) | Text only | Text only |
| Download | HF: Qwen/Qwen3.6-27B | HF: Qwen/Qwen3.5-122B-A10B | HF: openai/gpt-oss-120b |
Choosing Full Fine-Tune, LoRA, or QLoRA
Three fine-tuning methods cover essentially the whole field. They sit on a cost-versus-quality spectrum, and the right choice depends on how much data you have and how distinct your domain is from the base model’s training distribution.
Full fine-tuning updates every parameter. It needs roughly four times the model’s memory footprint during training — model weights, gradients, optimizer states (two for AdamW: first and second moment), and activations. A 7B model needs ~14GB in FP16 just for weights, but with optimizer states and gradients you’re closer to 60GB peak.
LoRA (Low-Rank Adaptation) freezes the base weights and inserts trainable low-rank matrices into the attention projection layers. Instead of updating the full weight matrix W (say 4096×4096 = ~16.7M parameters), you train two small matrices B (4096×r) and A (r×4096), where r is typically 8, 16, or 32. The model effectively learns ΔW = B·A, which is added to the frozen W at inference. For r=16, that’s about 131K trainable parameters per layer instead of 16.7M — roughly 128× fewer.
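The parameter arithmetic in that paragraph can be checked directly. This is a small sketch; the function name is hypothetical, and it counts a single 4096×4096 projection as in the example above.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full_matrix_params, lora_params) for one weight matrix."""
    full = d_out * d_in            # W is d_out x d_in
    lora = d_out * r + r * d_in    # B is d_out x r, A is r x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128

# The ratio shrinks as the rank grows:
for r in (8, 16, 32):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r}: {lora:,} trainable params, {full // lora}x fewer than full")
```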
QLoRA takes LoRA a step further. The frozen base weights are quantized to 4-bit NF4 (NormalFloat-4, designed to match the typical Gaussian distribution of neural network weights). LoRA adapters sit on top in FP16 or BF16. The weights are de-quantized on the fly only during forward and backward passes. Memory drops by roughly 75 percent compared to FP16 training.
| Method | VRAM (7B) | VRAM (70B) | Wall time (1 H100) | Cost (cloud) | Quality retention |
|---|---|---|---|---|---|
| Full FT | ~60 GB | ~560 GB (needs 8×H100) | 24-48h on 8×H100 | $250-510 | 100% (baseline) |
| LoRA | ~16 GB | ~160 GB (2-4 GPUs) | 10-15h | $20-40 | 90-95% |
| QLoRA | ~6 GB | ~46 GB (1 H100/H200) | 8-12h | $10-16 | 80-90% |
The picking heuristic in practice: start with QLoRA. If quality is not enough after a sweep over rank, learning rate, and data size, step up to LoRA. Reserve full fine-tuning for cases where the domain shift is so large that the base model’s representation is genuinely wrong (a model trained mostly on English needing to operate in a low-resource language, for instance). The 80-90 percent quality retention number for QLoRA is enough for the vast majority of production tasks.
Note how GPT-OSS-120B’s 4-bit inference number (~35 GB) is dramatically lower than Qwen3.5-122B’s 62 GB despite similar total parameter counts. That’s the MXFP4-native quantization advantage. Qwen3.5 has to be quantized after the fact (AWQ or GPTQ), with some additional accuracy loss; GPT-OSS-120B was post-trained with that 4-bit format already in mind.
Setting Up the Training Environment
Three years ago this section would have been a nightmare. CUDA versions, PyTorch builds, mismatched Triton, broken bitsandbytes. In May 2026 it’s still finicky but the recipe is more stable.
You need: CUDA 12.6 or newer (CUDA 12.8 ships well with the H100/H200 SXM5 drivers), cuDNN 9.5+, PyTorch 2.7 stable or 2.8 nightly, and a recent transformers, peft, accelerate, trl, bitsandbytes, and vllm. Flash Attention 3 needs Hopper (H100/H200) or newer; on Ampere (A100) you fall back to Flash Attention 2.
The cleanest approach is a Docker container that pins all of this. Building locally is the second-cleanest. Doing it in a bare Python environment is asking for an evening of debugging mismatched CUDA symbols. Containerizing the training environment with a known-good base image — typically nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04 — is the standard play.
Here’s a working pyproject.toml for a fine-tuning project as of May 2026:
[project]
name = "llm-finetune"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"torch==2.7.0",
"transformers==4.50.2",
"peft==0.14.1",
"bitsandbytes==0.46.0",
"accelerate==1.4.0",
"trl==0.16.0",
"datasets==3.5.0",
"unsloth==2026.5.3",
"flash-attn==3.0.1",
"vllm==0.9.2",
"wandb==0.19.5",
"sentencepiece==0.2.0",
"tiktoken==0.7.0",
"lm-eval==0.4.7",
]
[tool.uv]
index-strategy = "unsafe-best-match"
[[tool.uv.index]]
name = "pytorch-cuda128"
url = "https://download.pytorch.org/whl/cu128"
And a Dockerfile that produces a known-good training image:
FROM nvidia/cuda:12.8.0-cudnn-devel-ubuntu24.04
ENV DEBIAN_FRONTEND=noninteractive \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
HF_HOME=/workspace/.cache/huggingface \
TORCH_CUDA_ARCH_LIST="9.0;10.0"
# Ubuntu 24.04 ships Python 3.12 as python3; python3.11 is not in its default repositories
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-venv python3-pip git curl ca-certificates \
    build-essential ninja-build cmake \
    && rm -rf /var/lib/apt/lists/*
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:${PATH}"
WORKDIR /workspace
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
# Flash Attention 3 needs to compile against the installed torch
RUN uv pip install --no-build-isolation flash-attn==3.0.1
COPY . .
CMD ["uv", "run", "python", "-m", "train"]
The framework picture in 2026: TRL is Hugging Face’s official trainer for SFT (supervised fine-tuning) and reinforcement learning post-training. Axolotl is a YAML-config layer on top of TRL that handles much of the data-prep boilerplate. Unsloth is a package of hand-tuned Triton kernels that claims up to 2× faster training and 60 percent less VRAM; it’s now stable enough for production use. torchtitan is Meta’s reference scaffolding for large-scale pretraining and full fine-tuning with FSDP2.
| Framework | Primary use case | Scaling target | Ergonomics | Recent activity |
|---|---|---|---|---|
| TRL | SFT, DPO, GRPO, PPO | 1-8 GPUs, single node | Python API, flexible | Very active |
| Axolotl | SFT, DPO with YAML config | 1-8 GPUs | YAML, low boilerplate | Active |
| Unsloth | Single-GPU QLoRA/LoRA, speed | 1 GPU (multi-GPU in 2025) | Drop-in trainer wrapper | Very active |
| torchtitan | Full FT, pretraining at scale | Multi-node FSDP2 | Lower-level, reference impl | Active |
For 99 percent of fine-tuning work, the pick is TRL (or Unsloth-wrapped TRL for single-GPU work). Reserve torchtitan for actual full fine-tuning at 70B+ where FSDP2 sharding becomes mandatory.
Preparing the Dataset
The training loop is the easy part. Dataset preparation is where most fine-tuning runs silently fail. The model trains for hours and the loss curve looks fine, but at eval time the outputs are wrong, off-topic, or use the wrong format. Almost always the problem is upstream of the trainer.
A modern instruction-tuning dataset is a list of conversations. Each conversation is a list of turns, each turn has a role (system, user, assistant) and content. JSONL is the standard on-disk format: one JSON object per line. Parquet is the high-performance alternative for large datasets where you want columnar compression.
A single training example for Qwen3.6-27B looks like this:
{"messages": [
{"role": "system", "content": "You are an industrial controls expert."},
{"role": "user", "content": "Explain why a PID loop on a cobot joint might oscillate after a payload change."},
{"role": "assistant", "content": "When the end-effector payload changes, the effective inertia at the joint changes. The PID gains were tuned for the original inertia. With higher effective mass, the proportional term applies the same torque correction to a slower-responding system, which can overshoot the setpoint and drive sustained oscillation. Two practical fixes: gain scheduling (load the right PID set when the gripper picks up the part) or an adaptive controller (online inertia estimation). The first is easier to validate in safety review."}
]}
The tokenizer then applies the model’s chat template, a Jinja-style template defined inside tokenizer_config.json, to turn that list of turns into a single tokenized sequence with the model’s special tokens. For Qwen3.6, the chat template wraps each turn in <|im_start|>role\ncontent<|im_end|>. For GPT-OSS-120B, it uses the harmony format, which delimits turns with <|start|>, <|message|>, and <|end|> and adds channel markers. These are not interchangeable. A model trained with one template and served with another will behave as if it had not been trained at all.
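To make the Qwen-style wrapping concrete, here is a hand-rolled sketch of the turn format described above. `render_chatml` is a hypothetical helper, and the newline between turns is an assumption; the real template may add more (a default system prompt, thinking tags), so `tokenizer.apply_chat_template` remains the source of truth.

```python
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Wrap each turn as <|im_start|>role\\ncontent<|im_end|>, one per line."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    return "".join(parts)

example = [
    {"role": "user", "content": "Why does the joint oscillate?"},
    {"role": "assistant", "content": "The inertia changed."},
]
print(render_chatml(example))
```

Comparing a string like this against the output of the real template is a cheap way to spot a mismatch before a long run.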
The standard loss masking pattern: the model is trained to predict assistant tokens, but the loss is masked (set to -100, the standard ignore_index for PyTorch’s CrossEntropyLoss) on system and user tokens. You don’t want to teach the model to generate user messages.
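A minimal sketch of that masking step, using made-up token IDs. `build_labels` and the `is_assistant` flags are hypothetical; in practice the trainer or a data collator derives the assistant spans from the chat template.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_labels(token_ids: list[int], is_assistant: list[bool]) -> list[int]:
    """Copy assistant tokens into labels; mask every other position."""
    if len(token_ids) != len(is_assistant):
        raise ValueError("token_ids and is_assistant must be the same length")
    return [tok if keep else IGNORE_INDEX
            for tok, keep in zip(token_ids, is_assistant)]

# Made-up IDs: 3 system tokens, 2 user tokens, 3 assistant tokens.
token_ids = [11, 12, 13, 21, 22, 31, 32, 33]
is_assistant = [False] * 5 + [True] * 3

labels = build_labels(token_ids, is_assistant)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33]
```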
Here’s how a real data-loading pipeline looks for Qwen3.6-27B, using the Hugging Face datasets library:
from datasets import load_dataset
from transformers import AutoTokenizer
MODEL_ID = "Qwen/Qwen3.6-27B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
def format_example(example):
    """Apply Qwen's chat template and return the formatted text (not yet tokenized)."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(format_example, remove_columns=ds.column_names)
# Train/eval split with a fixed seed for reproducibility
split = ds.train_test_split(test_size=0.05, seed=42)
train_ds, eval_ds = split["train"], split["test"]
print(f"Train: {len(train_ds)}, Eval: {len(eval_ds)}")
print("Sample formatted text:")
print(train_ds[0]["text"][:500])
Before training, run two more passes on your dataset. First, dedup: exact-match dedup is cheap (compute a hash per example), and MinHash/SimHash near-dedup catches paraphrases. Duplicates distort the loss curve (repeated examples are easy to fit, so training loss looks better than it is) and bias the model toward memorizing common patterns.
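The exact-match pass can be sketched with the standard library alone. The function names are hypothetical, and the sketch assumes each example has the `messages` field shown earlier; near-dedup with MinHash is not covered here.

```python
import hashlib
import json

def example_key(example: dict) -> str:
    """Stable hash of the conversation content."""
    canonical = json.dumps(example["messages"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup_exact(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct conversation, in order."""
    seen: set[str] = set()
    unique = []
    for ex in examples:
        key = example_key(ex)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

data = [
    {"messages": [{"role": "user", "content": "a"}]},
    {"messages": [{"role": "user", "content": "b"}]},
    {"messages": [{"role": "user", "content": "a"}]},  # exact duplicate
]
print(len(dedup_exact(data)))  # 2
```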
Second, contamination check — make sure none of your training data overlaps with your eval benchmarks. If your eval is MMLU and your training data was scraped from Common Crawl, there’s a real chance MMLU questions are in there. Run a substring search of eval questions against your training set. Drop anything that matches.
If your data prep is complex enough to warrant orchestration, Airflow data pipelines are a reasonable fit — the dedup, contamination check, and tokenization steps map well to a DAG.
Verify with tokenizer.apply_chat_template that the output you’re feeding the trainer matches the format the model expects. Print the first 1000 characters of a formatted example before starting a long run.
The Actual Training Run
Three concrete recipes covering the three anchor models, at three hardware budgets. Each is a known-working starting point — tune learning rate, rank, and data mixture from there.
Recipe 1: QLoRA on Qwen3.6-27B, single H100 (80GB)
The most accessible setup. One rented H100 from Lambda Labs, RunPod, or a cloud provider runs about $1.80-$2.50/hour as of May 2026. 50,000 training examples, 3 epochs, target wall time 8 to 12 hours, total bill $10-16. This is the recipe most teams actually use.
# train_qlora_qwen36.py
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

MODEL_ID = "Qwen/Qwen3.6-27B"
OUTPUT_DIR = "out/qwen36-27b-qlora"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # nested quantization of the quant constants
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer.padding_side = "right"  # important: right-pad for SFT

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # cache is not used during training; saves VRAM

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,  # alpha/r = 2 is a common starting ratio
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

train_ds = load_dataset("json", data_files="data/train.jsonl", split="train")
eval_ds = load_dataset("json", data_files="data/eval.jsonl", split="train")

sft_config = SFTConfig(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    gradient_checkpointing=True,  # trade compute for VRAM
    learning_rate=2e-4,  # LoRA-typical; full FT would use ~1e-5
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_8bit",  # 8-bit optimizer to save more VRAM
    bf16=True,
    max_seq_length=4096,
    packing=True,  # pack short examples to maximize GPU use
    eval_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    save_total_limit=3,
    logging_steps=20,
    report_to="wandb",
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # replaces the `tokenizer=` argument of older TRL releases
    args=sft_config,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
Key choices in that script worth understanding:
- NF4 + double quantization: NF4 quantizes the weights themselves; double quantization also quantizes the per-block scaling constants, saving another ~0.4 bits per parameter on average.
- Gradient checkpointing: re-computes activations during the backward pass instead of storing them all. In the classic formulation, stored activations grow with roughly the square root of the number of layers rather than linearly, at a cost of about 30 percent more compute. Almost always worth it for LoRA/QLoRA.
- Gradient accumulation: with per-device batch size 2 and accumulation steps 8, the effective batch is 16. Useful when VRAM caps your per-step batch but you want the optimization signal of a larger one.
- Paged AdamW 8-bit: optimizer states (first and second moments) at 8-bit precision with paging to CPU when not in use. Cuts optimizer state memory by 4× vs FP32 AdamW.
- Packing: concatenates multiple short examples into one sequence up to max_seq_length. Without packing, padding to 4096 tokens wastes most of the compute on short examples.
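The packing idea from the last bullet can be illustrated with a simple greedy packer. This is a sketch under assumptions: `pack_sequences` is a hypothetical helper, the token IDs are made up, and TRL's own packing strategy may differ (for example in how it handles examples that straddle a boundary).

```python
def pack_sequences(examples: list[list[int]], max_len: int) -> list[list[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len."""
    packed: list[list[int]] = []
    current: list[int] = []
    for ids in examples:
        ids = ids[:max_len]  # truncate anything longer than one full sequence
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current = current + ids
    if current:
        packed.append(current)
    return packed

# Four short examples with made-up token IDs, lengths 3, 2, 4, 1.
examples = [[1, 1, 1], [2, 2], [3, 3, 3, 3], [4]]
max_len = 5

packed = pack_sequences(examples, max_len)
padded_positions = len(examples) * max_len   # one padded row per example
packed_positions = len(packed) * max_len     # rows after packing

print([len(p) for p in packed])              # [5, 5]
print(padded_positions, packed_positions)    # 20 10
```

In this toy case packing halves the number of positions the GPU processes for the same ten real tokens.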
Recipe 2: Multi-GPU LoRA on Qwen3.5-122B-A10B
122B total parameters means roughly 244GB in FP16 just for the weights. Two H200s (141GB each = 282GB total) or four H100s (320GB total) can hold the sharded weights, but with little headroom for activations and adapter state. The accelerate config below instead shards the model across 8 GPUs with FSDP2.
# accelerate_config_fsdp.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_processes: 8
num_machines: 1
machine_rank: 0
gpu_ids: all
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_offload_params: false
  fsdp_use_orig_params: true
  fsdp_sync_module_states: true
  fsdp_cpu_ram_efficient_loading: true
  fsdp_activation_checkpointing: true
Launch with: accelerate launch --config_file accelerate_config_fsdp.yaml train_lora_qwen35.py
The training script is structurally the same as Recipe 1, with three changes: no BitsAndBytesConfig (LoRA, not QLoRA), device_map=None (FSDP handles placement), and per-device batch size dropped to 1 with accumulation steps raised to keep effective batch around 32. Wall time for 50K examples × 3 epochs on 8× H100: roughly 18-24 hours.
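The effective batch size in both recipes is the product of per-device batch size, accumulation steps, and GPU count. A small sketch of that arithmetic follows; the helper names are hypothetical.

```python
def effective_batch(per_device: int, accumulation_steps: int, num_gpus: int) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_gpus

def accumulation_for(target_batch: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach target_batch (rounded up)."""
    per_step = per_device * num_gpus
    return -(-target_batch // per_step)  # ceiling division

print(effective_batch(2, 8, 1))    # Recipe 1: 16
print(accumulation_for(32, 1, 8))  # Recipe 2: 4 accumulation steps on 8 GPUs
```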
Recipe 3: Multi-node full fine-tune on GPT-OSS-120B
Full fine-tuning a 117B MoE is genuinely expensive. The model weights in BF16 alone are ~234GB. Add gradients, optimizer states (AdamW keeps two moments per parameter; in FP32 that is 8 bytes per parameter, or ~940GB), and activations, and you need cluster-class memory. 32 H100 GPUs across 4 nodes is the lower bound. Use torchtitan with FSDP2 sharding across all 32 GPUs, plus tensor parallelism within each node.
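The memory arithmetic above can be written out as a rough estimator. It is a sketch under stated assumptions: BF16 weights and gradients (2 bytes each), FP32 AdamW moments (8 bytes per parameter), no separate FP32 master copy of the weights, and activations excluded. The function name is hypothetical.

```python
def full_ft_memory_gb(params_billions: float,
                      weight_bytes: int = 2,   # BF16 weights
                      grad_bytes: int = 2,     # BF16 gradients (assumption)
                      adam_bytes: int = 8      # two FP32 moments per parameter
                      ) -> dict[str, float]:
    """Per-component training memory in GB, excluding activations."""
    parts = {
        "weights": params_billions * weight_bytes,
        "gradients": params_billions * grad_bytes,
        "adamw_moments": params_billions * adam_bytes,
    }
    parts["total"] = sum(parts.values())
    return parts

estimate = full_ft_memory_gb(117)
for name, gb in estimate.items():
    print(f"{name:>14}: {gb:,.0f} GB")

cluster_gb = 32 * 80  # 32 H100s at 80 GB each
print(f"cluster VRAM: {cluster_gb:,} GB")
```

Under these assumptions the static state alone is about 1.4TB, which is why the remaining cluster memory has to absorb activations, communication buffers, and fragmentation.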
For most use cases this is not the right call. Beyond the cost, full fine-tuning risks eroding the post-training calibration and safety tuning baked into the released checkpoint. The pragmatic path for GPT-OSS-120B is LoRA with rank 32, with the adapter applied to attention and MoE expert gate projections only.
| Setup | Combined VRAM | What it can train |
|---|---|---|
| Single H100 QLoRA | 80 GB | Up to ~70B with QLoRA; Qwen3.6-27B comfortably |
| Single H200 QLoRA | 141 GB | Up to ~120B with QLoRA; comfortable 70B LoRA |
| 2× H200 LoRA | 282 GB | Full LoRA on Qwen3.5-122B-A10B with FSDP2 |
| 8× H100 LoRA | 640 GB | LoRA on any model up to ~200B with sharding |
| 8× H100 full FT | 640 GB | Full FT up to ~70B with FSDP2 + activation checkpointing |
| 32× H100 multi-node | 2,560 GB | Full FT on 120B+ MoE; small pretraining runs |
Across all three recipes, the optimizer choice matters more than people expect. AdamW with a cosine learning rate schedule and 3 percent warmup is the strong default. For LoRA the learning rate is typically 1e-4 to 2e-4 — much higher than the 1e-5 to 5e-5 you’d use for full fine-tuning, because LoRA’s adapter layers start near zero and need bigger steps to learn meaningful deltas. Checkpoint every 1000 steps. Save adapter-only (PEFT) checkpoints, not full model — they’re 100× smaller.
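The schedule described above (linear warmup over the first 3 percent of steps, then cosine decay) can be sketched as a plain function. This is an illustrative formulation with a hypothetical name; library schedulers may differ in details such as the final learning rate or how warmup steps are rounded.

```python
import math

def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_ratio: float = 0.03) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
for step in (0, 15, 30, 515, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, total):.2e}")
```

With 1000 total steps the rate climbs for 30 steps, peaks at 2e-4, passes through half the peak at the midpoint of the decay, and reaches zero at the end.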
If you want to optimize the choice of learning rate and rank systematically, Bayesian hyperparameter optimization with Gaussian processes handles this efficiently. Random search is fine if you don’t want the extra complexity; grid search is almost never worth it for LoRA.
Evaluation That Isn’t Theatre
Most fine-tuning eval is theatre. The model is trained, the training loss goes down, an “eval” runs on a sliver of the training set (or the same data slightly shuffled), and the team declares victory. Then the model goes to production and embarrasses everyone.
Real eval needs three properties: (1) the eval data was never seen during training, (2) the eval metric measures the actual task, not a proxy, and (3) the metric is reproducible across runs.
For general language understanding and reasoning, the standard benchmarks are MMLU (multi-task language understanding, 57 subjects), HumanEval (function-completion code), GSM8K (grade-school math word problems), and MT-Bench (multi-turn instruction following, judged by a strong LLM). For code-heavy use cases, SWE-bench Verified and Terminal-Bench 2.0 are the current standards.
The community-standard tool is lm-evaluation-harness from EleutherAI. It runs your model against a registered benchmark suite in a reproducible way:
lm_eval \
    --model hf \
    --model_args pretrained=Qwen/Qwen3.6-27B,peft=out/qwen36-27b-qlora,trust_remote_code=True \
    --tasks mmlu,gsm8k,humaneval \
    --batch_size auto \
    --output_path eval_results/qwen36-qlora.json
The contamination problem is real and frequently ignored. If your training data was scraped from the public web, there’s a non-trivial chance benchmark questions are in there. The decontamination check is to run an n-gram (typically 8-gram) overlap test between your training set and each benchmark’s question text. Anything that matches gets dropped from training. Without this check, your eval scores are an upper bound that hides the contamination effect.
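A minimal sketch of the 8-gram overlap test described above. It assumes whitespace tokenization and lowercasing, which is a simplification; the function names and sample strings are hypothetical.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word-level n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_text: str,
                    benchmark_ngrams: set[tuple[str, ...]],
                    n: int = 8) -> bool:
    """True if the training text shares at least one n-gram with the benchmark."""
    return not ngrams(train_text, n).isdisjoint(benchmark_ngrams)

# Build the benchmark index once, then filter the training set against it.
benchmark_questions = [
    "A train leaves the station at noon travelling at sixty miles per hour",
]
bench_index: set[tuple[str, ...]] = set()
for q in benchmark_questions:
    bench_index |= ngrams(q)

train_texts = [
    "Quiz: a train leaves the station at noon travelling at sixty miles per hour. How far?",
    "Explain why a PID loop on a cobot joint might oscillate after a payload change.",
]
clean = [t for t in train_texts if not is_contaminated(t, bench_index)]
print(len(clean))  # 1
```

Texts shorter than n words produce no n-grams and so never match; a production check would also normalize punctuation.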
Beyond benchmarks, hold out a domain-specific eval set that you constructed yourself, with realistic prompts from your actual use case. The benchmark suites measure general capability; your eval set measures whether the model is actually better at your job. The two metrics frequently disagree, and your eval set is the one that matters.
Deployment
Training finishes; the adapter or full checkpoint sits in a directory. Now serve it.
The two standard serving stacks in 2026 are vLLM and SGLang. vLLM is the broadest-supported and the production default for most teams. SGLang is faster for structured-output workloads (JSON, regex-constrained generation) and has better RadixAttention KV-cache reuse for repeated-prefix workloads (RAG, multi-turn chat).
Both implement continuous batching — a serving technique that keeps the GPU saturated by dynamically inserting new requests into the batch as old ones complete, rather than waiting for the whole batch to finish. The throughput multiplier from continuous batching versus static batching is typically 3-5×, sometimes more.
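A toy simulation can show where the gain comes from. It is a sketch under strong assumptions: one token per request per decode step, no prefill cost, no memory limits, and made-up output lengths. It is not how vLLM or SGLang schedule internally, and the function names are hypothetical.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = list(lengths)
    slots: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        steps += 1
        slots = [remaining - 1 for remaining in slots if remaining > 1]
    return steps

# Made-up output lengths: two long requests mixed with short ones.
lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))      # 20
print(continuous_batching_steps(lengths, batch_size=4))  # 12
```

In the static case the short requests sit idle behind the long one in each batch; in the continuous case their slots are reused immediately. Real gains depend on the length distribution of the workload.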
For a fine-tuned Qwen3.6-27B served on a single H100, the launch command is:
vllm serve Qwen/Qwen3.6-27B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --enable-lora \
    --lora-modules my-adapter=out/qwen36-27b-qlora \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --tensor-parallel-size 1
The serving endpoint exposes an OpenAI-compatible API at http://localhost:8000/v1. Client-side, it’s a drop-in for the OpenAI SDK:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY", # vLLM ignores the key by default
)
response = client.chat.completions.create(
model="my-adapter",
messages=[
{"role": "system", "content": "You are an industrial controls expert."},
{"role": "user", "content": "What causes oscillation after a payload change on a cobot joint?"},
],
temperature=0.2,
max_tokens=512,
)
print(response.choices[0].message.content)
If the deployment is part of a larger application, consider running the serving pods on Kubernetes with a GPU-aware scheduler. For tool-augmented workflows, tool calling support in vLLM via Hermes-style JSON output works out of the box for Qwen3.6 and GPT-OSS. To bridge to broader integrations, the Model Context Protocol (MCP) is becoming the de facto integration standard for tool-using LLM applications.
Common Pitfalls and Debugging
Most training failures come from a small set of recurring mistakes. Knowing them in advance saves days.
Chat template mismatch. Already mentioned, but it bears repeating because it’s the most common silent failure. The training-time template and the inference-time template must be identical. Always print a fully tokenized example with special tokens visible (tokenizer.decode(input_ids, skip_special_tokens=False)) before kicking off a long run.
OOM mid-training. Loss curve was fine for 5,000 steps, then a single long sequence in the batch pushed the activation memory over the edge. Fix: lower max_seq_length, or enable packing=True with a sequence cap, or drop per-device batch size and raise gradient accumulation to compensate.
Tokenizer drift. You loaded the base model with one tokenizer revision and inference with another. The vocabulary or special-token IDs shifted. Lock the tokenizer commit hash explicitly: AutoTokenizer.from_pretrained(MODEL_ID, revision="abc123def...").
Loss spikes. Big jump upward in loss at a specific step. Almost always a bad batch — corrupted data, a tokenization error on one example, or an unusually long sequence. Inspect the data at that step. If recurrence is rare, add gradient clipping (max_grad_norm=1.0) and resume from the last good checkpoint.
Eval/train distribution mismatch. Training loss low, eval loss high and not improving. Your eval set is drawn from a different distribution than your training set. Either generate eval from the same source as training (with a fresh seed split) or accept that your eval represents the generalization gap, not a training failure.
Gradient explosion. Loss goes to NaN within a few steps. Learning rate is too high for the task, or you forgot gradient clipping, or your data has an extreme outlier in numerical features. Restart with learning_rate halved and max_grad_norm=1.0.
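What max_grad_norm=1.0 does can be sketched in a few lines: if the global L2 norm of the gradients exceeds the threshold, every gradient is scaled down by the same factor. This is a plain-Python illustration with a hypothetical function name; framework implementations operate on tensors and add a small epsilon.

```python
import math

def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Scale gradients so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

spiky = [3.0, 4.0]                 # global norm 5.0
clipped = clip_by_global_norm(spiky, max_norm=1.0)
print(clipped)                     # approximately [0.6, 0.8]

small = [0.1, 0.2]                 # already under the threshold, left unchanged
print(clip_by_global_norm(small))
```

The direction of the update is preserved; only its magnitude is capped, which is why clipping tames a bad batch without changing what the model learns from ordinary ones.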
MoE-specific: expert collapse. Specific to MoE training (Qwen3.5-122B, GPT-OSS-120B). The router learns to route everything to one or two experts; the rest of the model atrophies. Mitigation is an auxiliary load-balancing loss, which training stacks such as TRL and torchtitan support. Verify that it’s actually enabled (it is often gated by a model config flag) and not silently dropped by a config override.
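One common formulation of a load-balancing loss, in the style of the Switch Transformer with top-1 routing, multiplies the fraction of tokens sent to each expert by that expert's mean router probability. The sketch below illustrates the idea only; whether the anchor models use this exact form is an assumption, and the function name is hypothetical.

```python
def load_balancing_loss(assignments: list[int],
                        router_probs: list[list[float]],
                        num_experts: int) -> float:
    """num_experts * sum_i(fraction_routed_i * mean_router_prob_i)."""
    n_tokens = len(assignments)
    frac_tokens = [assignments.count(e) / n_tokens for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n_tokens
                 for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Balanced routing: four tokens, one per expert, uniform router probabilities.
balanced = load_balancing_loss(
    assignments=[0, 1, 2, 3],
    router_probs=[[0.25, 0.25, 0.25, 0.25]] * 4,
    num_experts=4,
)

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed = load_balancing_loss(
    assignments=[0, 0, 0, 0],
    router_probs=[[1.0, 0.0, 0.0, 0.0]] * 4,
    num_experts=4,
)

print(balanced, collapsed)  # 1.0 4.0
```

The loss is at its minimum of 1.0 when routing is uniform and grows toward the expert count as routing collapses, so adding it to the training objective pushes the router back toward balance.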
FAQ
Can I fine-tune any of these models on a consumer GPU like an RTX 4090?
Qwen3.6-27B yes, with QLoRA — the 24GB of VRAM on a 4090 is tight but workable with gradient checkpointing, paged 8-bit optimizer, and a short sequence length (around 2048 tokens). Qwen3.5-122B-A10B and GPT-OSS-120B require at least 80GB of VRAM, which means H100/H200/MI300X class hardware. The released GPT-OSS-120B can be served (not trained) on a single 80GB card thanks to MXFP4 quantization.
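The arithmetic behind these limits is worth doing by hand. The sketch below estimates memory for the weights alone, using the total parameter counts quoted in this post. It deliberately ignores activations, LoRA adapters, optimizer state, and KV cache, and it treats quantization as a flat bits-per-parameter figure, which real formats only approximate.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return params_billion * bits_per_param / 8


for name, params_b in [("Qwen3.6-27B", 27), ("Qwen3.5-122B-A10B", 122),
                       ("GPT-OSS-120B", 117)]:
    print(f"{name}: {weight_memory_gb(params_b, 4):.1f} GB at 4 bits, "
          f"{weight_memory_gb(params_b, 16):.1f} GB at 16 bits")
```

At 4 bits the 27B model's weights come to about 13.5 GB, which is why a 24GB card is tight but workable, while the two MoE models land near 60 GB before any training overhead is counted.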
How much data do I actually need?
Less than people expect. For domain adaptation with LoRA or QLoRA, 5,000 to 20,000 high-quality examples is enough for most domains. Quality matters far more than quantity — a tightly curated 10K is consistently better than a noisy 100K. For format adaptation (teaching the model a new structured output schema), 1,000-2,000 examples often suffice.
How does this compare to using a managed API?
Different problem space. Managed APIs (OpenAI, Anthropic) win on convenience and latest-model access. Self-hosted fine-tuned models win on cost per million tokens at scale, data sovereignty, custom domain adaptation, and predictable cost (no per-call billing). The crossover point is typically around 100M tokens per month — below that, managed wins; above that, self-hosted is usually cheaper.
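Your own crossover point depends on your prices, and the calculation is short. In the sketch below every number is a placeholder chosen for illustration, not a quote from any provider, and `break_even_tokens_millions` is a hypothetical helper.

```python
def break_even_tokens_millions(fixed_monthly_usd: float,
                               api_usd_per_million: float,
                               self_hosted_usd_per_million: float) -> float:
    """Monthly volume (in millions of tokens) above which self-hosting
    is cheaper than the API, given a fixed monthly hosting cost."""
    saving_per_million = api_usd_per_million - self_hosted_usd_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_usd / saving_per_million


# Placeholder prices, not quotes from any provider.
print(break_even_tokens_millions(2000.0, 10.0, 2.0))  # 250.0
```

Plug in your actual GPU rental, engineering time, and API pricing before relying on any rule of thumb.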
What’s the quality difference between LoRA and full fine-tuning?
LoRA retains 90-95 percent of full fine-tuning quality across most tasks. QLoRA retains 80-90 percent. The remaining gap is largest on tasks that require substantial representational shift from the base model — for example, fine-tuning an English-pretrained model to operate fluently in a low-resource language. For typical instruction tuning, code adaptation, or structured-output tasks, the gap is small enough that the cost savings of LoRA dominate.
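The cost savings come from how few parameters LoRA trains. For a single weight matrix of shape d_out x d_in, a rank-r adapter adds r x (d_in + d_out) trainable parameters. The sketch below uses a hypothetical hidden size of 4096 and rank 16; neither value is taken from the anchor models.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


d = 4096  # hypothetical hidden size, not taken from any of the anchor models
full = d * d
lora = lora_param_count(d, d, rank=16)
print(lora, full, f"{100 * lora / full:.2f}%")  # 131072 16777216 0.78%
```

Training under one percent of a matrix's parameters is what shrinks gradient and optimizer-state memory enough to fit on a single GPU.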
Should I do continued pretraining before instruction tuning?
Only if your domain is genuinely far from the base model’s training distribution — medical literature, legal contracts in a non-English language, highly specialized scientific notation. For most domains, the base model has enough coverage that instruction tuning alone closes the gap. Continued pretraining is expensive and easy to do wrong (catastrophic forgetting of the base model’s general competence).
Related Reading
- Self-supervised learning is the foundation underneath every modern LLM pretraining run
- Transfer learning and fine-tuning for domain adaptation — a working applied example
- Containerizing the training environment with Docker
- Orchestrating data prep pipelines with Apache Airflow
- Bayesian hyperparameter optimization for tuning learning rate and rank
- Kubernetes for distributed training and serving
- Tool calling and function calling for post-trained models
- The Model Context Protocol for deployment integration
References
- Qwen Team — Qwen3.6-27B announcement (April 22, 2026)
- QwenLM — Qwen3.6 official repository
- OpenAI — Introducing GPT-OSS (August 2025)
- OpenAI — GPT-OSS-120B model card on Hugging Face
- OpenAI — openai/gpt-oss GitHub repository
Conclusion
Training open-source LLMs in 2026 is no longer the closed shop it was two years ago. The combination of Apache 2.0 base models with frontier-class reasoning (GPT-OSS-120B near o4-mini), QLoRA on a single rented GPU, and serving infrastructure that handles thousands of concurrent users on commodity hardware puts production-grade LLM customization within reach of any team with a modest budget and a clear use case.
The three anchor models cover the practical range: Qwen3.6-27B for the single-GPU dense workflow, Qwen3.5-122B-A10B for cheap MoE serving when you have multi-GPU capacity, and GPT-OSS-120B for single-GPU serving of a frontier-class reasoner thanks to MXFP4. None of them is the “best” — they answer different questions about hardware, latency, and quality.
The hard part is no longer the technology. It’s the data — assembling, deduplicating, formatting, and contamination-checking a dataset that actually teaches the model what you want it to do. The trainer runs in eight hours. The dataset takes eight weeks. Plan accordingly.