PatchTST set the bar for transformer-based time series forecasting. Then a paper from KAIST showed something uncomfortable: a non-transformer model with two simple streams — an MLP and a CNN — beats it. xPatch does this with 4× less compute and an old idea: exponential moving averages.
That paper is xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition by Artyom Stitsyuk and Jaesik Choi, published at AAAI 2025 (arXiv:2412.17323). It is the kind of paper that quietly recalibrates the field. No new attention variant. No 100B-parameter foundation model. Just a careful re-examination of which inductive biases actually pay off when you forecast electricity load, traffic, weather, or stock returns.
This deep-dive walks through every load-bearing piece of the paper: the EMA decomposition, the dual-stream architecture, the arctangent loss, the sigmoid learning-rate schedule, the experimental results, and what it all means for the practitioner shipping forecasts to production.
Summary
What this post covers: A deep-dive into the AAAI 2025 xPatch paper by Stitsyuk and Choi, breaking down its EMA decomposition, dual-stream MLP+CNN architecture, training tricks (arctangent loss, sigmoid LR, RevIN), benchmark results, and what it implies for transformer-dominated time-series forecasting.
Key insights:
- A non-transformer dual-stream model (linear stream for trend, depthwise-separable CNN for seasonal) beats CARD, the previous SOTA, by an average of 2.46% MSE and 2.34% MAE across 8 standard benchmarks while running roughly 4x faster.
- The right inductive bias (EMA trend-seasonal decomposition + patching + dual specialization) consistently outperforms brute-force attention for typical multivariate forecasting, echoing DLinear’s earlier “are transformers effective?” critique.
- Training-side tricks do real work: the arctangent loss (horizon-weighted MAE that prevents any single horizon from dominating the gradient) and sigmoid LR schedule transfer to PatchTST and CARD as well, suggesting many architecture comparisons in the literature have used suboptimal training recipes.
- Default the EMA alpha to 0.3 for large benchmarks (Weather, Traffic, Electricity) and sweep {0.1, 0.3, 0.5, 0.7, 0.9} on smaller or noisier datasets; smaller alpha gives smoother trends, larger alpha gives more reactive trends.
- Use xPatch by default over PatchTST in production unless you have heavy channel correlations that require cross-channel attention or need a look-back longer than 96 steps. It is faster to train, faster to infer, slightly more accurate, and easier to debug because the two streams are individually interpretable.
Main topics: Why this paper matters, The EMA decomposition: heart of xPatch, The dual-stream architecture, Training tricks: arctangent loss, sigmoid LR, RevIN, Results that hurt the transformers, Ablations: what actually drives performance, How to use xPatch (PyTorch sketch), When to use xPatch vs alternatives, Limitations and open questions, What this means for the field, Frequently asked questions.
Why this paper matters
For about three years, time series forecasting has been a transformer story. Informer (2021) made attention practical for long sequences. Autoformer (2021) plugged in series decomposition. FEDformer (2022) moved attention to the frequency domain. PatchTST (2023) borrowed the patching trick from Vision Transformers and made it the strongest model on a long list of benchmarks. iTransformer (2024) inverted the embedding dimension. CARD (2024) tightened the channel-aligned attention design.
Partway through that run, in 2022, DLinear had already asked an awkward question: do you actually need attention for forecasting? A tiny linear model (essentially a fully-connected layer applied after a moving-average decomposition) could match or beat several transformer variants on standard benchmarks. The community responded with a wave of “are transformers effective?” papers, and the answer that emerged was nuanced: transformers help on some datasets, hurt on others, and the gains are often smaller than the speedups you give up.
xPatch takes the next logical step. Instead of dropping the transformer entirely (DLinear) or sticking with a transformer and tuning attention (CARD, iTransformer), it builds a dual-stream non-transformer model with stronger inductive biases. One stream is a simple MLP. The other is a small depthwise-separable CNN. Glue them together with EMA-based decomposition and a smarter loss function, and the result lands ahead of CARD, the previous state of the art, while training roughly 4× faster.
For an end-to-end primer on the broader landscape these models live in, see our companion overview of time series forecasting models in 2026; xPatch is one of the cleanest examples of a non-foundation-model approach that still pulls its weight on real benchmarks.
The EMA decomposition: heart of xPatch
If you have to remember one thing about xPatch, remember this: the model’s first move is to split every channel of the input series into a slow part and a fast part, and then learn each part with a different kind of network. That split is done with an exponential moving average.
Why decomposition matters
Trend and seasonality have fundamentally different dynamics. A trend is slow, often nearly linear over short windows, and dominated by accumulating shifts in level. A seasonal component is fast, often locally periodic, frequently bursty (think traffic spikes or weather fronts). If you ask one network to model both at once, it has to compromise — smooth filters blur the seasonal spikes; sharp filters chase the trend’s drift. Decomposition removes that conflict by handing each component to a specialist.
This is hardly a new idea. Classical statistics has been doing it for decades:
- STL (Seasonal-Trend decomposition using Loess): local polynomial regression to extract seasonality.
- Holt-Winters: three exponential smoothers (level, trend, seasonal) chained together.
- X-11 / X-13ARIMA-SEATS: the government-statistics workhorse, built on iterative moving averages.
Recent ML approaches kept the spirit but used different tools: DLinear used a simple moving-average filter, and FEDformer projected into the frequency domain via Fourier transforms. xPatch makes a different choice: an exponential moving average.
The recursive formula
The EMA decomposition is defined by Equation 2 of the paper:
s₀ = x₀
sₜ = α · xₜ + (1 − α) · sₜ₋₁ for t > 0
X_T = EMA(X) (trend)
X_S = X − X_T (seasonal residual)
Here α is the smoothing factor in (0, 1). Small α (like 0.1) gives a very smooth trend dominated by old observations; large α (like 0.9) makes the trend track the latest value almost immediately. The seasonal stream is whatever the trend cannot explain.
The recursion looks expensive — it is sequential by definition — but Appendix D of the paper shows a vectorized form with O(1) per-step cost in terms of GPU operations. The trick is to expand the recursion into a closed-form weighted sum and compute it as a single matrix multiply with a Toeplitz-style weight matrix. In practice, the EMA pre-processing is essentially free compared to the rest of the forward pass.
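To see why the recursion can be vectorized, unroll it: sₜ = (1 − α)ᵗ · x₀ + α · Σ (1 − α)^(t−j) · xⱼ, summed over j = 1..t. The sketch below is plain Python, not the paper's GPU implementation, and the function names are illustrative rather than taken from the official repo. It only checks that the explicit weighted sum matches the step-by-step recursion.

```python
def ema_recursive(x, alpha):
    """Step-by-step EMA: s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    trend = [x[0]]
    for value in x[1:]:
        trend.append(alpha * value + (1.0 - alpha) * trend[-1])
    return trend


def ema_closed_form(x, alpha):
    """Same EMA written as an explicit weighted sum for each step t."""
    decay = 1.0 - alpha
    trend = []
    for t in range(len(x)):
        # Weight on x_0 is decay**t; weight on x_j (j >= 1) is alpha * decay**(t - j).
        s_t = decay ** t * x[0]
        s_t += sum(alpha * decay ** (t - j) * x[j] for j in range(1, t + 1))
        trend.append(s_t)
    return trend


series = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]
trend = ema_recursive(series, alpha=0.3)
seasonal = [x_t - s_t for x_t, s_t in zip(series, trend)]
```

The weights depend only on α and the step index, so they can be precomputed once for a given look-back length and applied to a whole batch with tensor operations instead of a Python loop.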
Why α = 0.3 wins for big datasets
The paper sweeps α over {0.1, 0.3, 0.5, 0.7, 0.9}. On Weather, Traffic, and Electricity, the larger and more channel-rich benchmarks, α = 0.3 is consistently optimal. Why? With many noisy channels, you want the trend to be genuinely slow so it filters short-lived noise but still tracks the multi-step drift. Smaller α oversmooths, so the trend lags real drift and that slow structure ends up in the seasonal residual; larger α lets too much high-frequency content leak into the “trend.” 0.3 sits in a sweet spot.
On smaller and noisier datasets the picture is murkier — sometimes α = 0.5 or 0.7 wins, because the trend has to react faster to abrupt regime changes. The paper treats α as a hyperparameter, not a learnable parameter; that is one of the obvious extensions for follow-up work.
Simple MA vs EMA
| Property | Simple Moving Average (DLinear-style) | Exponential Moving Average (xPatch) |
|---|---|---|
| Weight scheme | Uniform inside a window | Geometric decay, recent > old |
| Hyperparameter | Window length k | Smoothing factor α |
| Edge effects | Hard window boundary | Smooth, no boundary discontinuity |
| Reactivity to recent shocks | Slow (averaged equally with old data) | Fast (recent point gets weight α) |
| Implementation cost | O(k) per step | O(1) per step (vectorized) |
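The reactivity row is easy to check by hand. The following sketch feeds a level shift into both filters; the window length k = 8 and the helper names are arbitrary choices for illustration, not values from the paper.

```python
def simple_moving_average(x, k):
    """Trailing window mean; shorter windows are used at the very start."""
    out = []
    for t in range(len(x)):
        window = x[max(0, t - k + 1): t + 1]
        out.append(sum(window) / len(window))
    return out


def exponential_moving_average(x, alpha):
    out = [x[0]]
    for value in x[1:]:
        out.append(alpha * value + (1.0 - alpha) * out[-1])
    return out


# A level shift from 0 to 1 at step 10.
step = [0.0] * 10 + [1.0] * 10
sma = simple_moving_average(step, k=8)
ema = exponential_moving_average(step, alpha=0.3)

# First step after the shock: the SMA has moved by 1/k, the EMA by alpha.
first_sma, first_ema = sma[10], ema[10]
```

With these settings the EMA covers 30% of the jump immediately while the SMA covers 12.5%. The SMA then reaches the new level exactly once the window has filled, whereas the EMA approaches it geometrically.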
The dual-stream architecture
Once we have X_T (trend) and X_S (seasonal), xPatch processes them in two specialized streams. The design philosophy: use the right tool for each component, then glue them together at the end.
The linear stream (handles X_T)
The trend is, by construction, smooth. There is not much non-linear structure left in it after the EMA filter. So xPatch processes it with two MLP-style blocks, each consisting of:
- A fully-connected (FC) projection
- A 1D average pooling with kernel size k = 2
- A LayerNorm
Critically, there is no non-linear activation function anywhere in the linear stream. The whole stream is — up to the LayerNorm — a sequence of linear operators. The final output is projected to dimension T (the forecast horizon). If you have read the DLinear paper, this should feel familiar; xPatch is essentially saying “DLinear had the right idea for the trend, so let’s keep that as our trend model.”
Why the LayerNorm? It is the only nonlinear-flavored operator in the stream (LayerNorm divides by an instance-computed std, which is data-dependent), and it stabilizes training when the trend’s scale changes between samples. The average pooling acts as additional smoothing, defensively reducing the chance that the linear stream over-fits to high-frequency noise that leaked through the decomposition.
The CNN stream (handles X_S)
The seasonal stream is where the action is. Seasonal residuals are bursty, locally periodic, and channel-correlated. xPatch handles them with a depthwise-separable CNN:
- Patching: the input is segmented into patches of length P = 16 with stride S = 8. The number of patches is ⌊(L − P) / S⌋ + 2, matching the PatchTST setup. With L = 96, that gives 12 patches per channel.
- Depthwise convolution: kernel size P = 16, stride P = 16, groups equal to the number of channels N. Each channel gets its own filter aligned to patch boundaries; no cross-channel mixing happens here.
- Pointwise convolution: a 1×1 convolution that mixes information across channels.
- GELU activation: the only major nonlinearity in the entire model. GELU's smooth saturating shape works well for the spiky residuals.
- BatchNorm: for training stability across batches.
- Residual connection: the input is added back to the output, which makes optimization easier and lets the stream behave like an identity if the seasonal component happens to be near-zero.
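The patch-count formula in the first bullet can be verified with a few lines of plain Python. This sketch assumes the "+ 2" comes from repeating the final value S times at the end of the window before slicing, as in PatchTST-style end padding; the helper names are hypothetical.

```python
def num_patches(L, P=16, S=8):
    """Patch count when the window is end-padded by S values before slicing."""
    return (L - P) // S + 2


def make_patches(x, P=16, S=8):
    """Split one channel into overlapping patches after end-padding by S."""
    padded = list(x) + [x[-1]] * S            # repeat the final value S times
    starts = range(0, len(padded) - P + 1, S)
    return [padded[i:i + P] for i in starts]


lookback = list(range(96))                     # one channel, L = 96
patches = make_patches(lookback)
```

For L = 96 this yields 12 patches of length 16, each overlapping its neighbor by 8 steps.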
The depthwise + pointwise pattern is the classic MobileNet-style separable convolution. It dramatically reduces parameters versus a full convolution while keeping a similar receptive field. For time series with many channels (Traffic has 862, Electricity has 321), this is essential — a full Conv1D would be enormous.
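A quick parameter count makes the saving concrete. The numbers below assume a convolution over 862 feature maps with kernel size 16, chosen to mirror the Traffic example in the text; the layer shapes in the official model may differ.

```python
def full_conv1d_params(c_in, c_out, kernel):
    """Weights plus biases of a standard Conv1d."""
    return c_in * c_out * kernel + c_out


def separable_conv1d_params(channels, kernel):
    """Depthwise (one filter per channel) followed by a 1x1 pointwise conv."""
    depthwise = channels * kernel + channels
    pointwise = channels * channels + channels
    return depthwise + pointwise


channels, kernel = 862, 16
full = full_conv1d_params(channels, channels, kernel)
separable = separable_conv1d_params(channels, kernel)
ratio = full / separable
```

Under these assumptions the full convolution needs about 11.9M parameters against roughly 0.76M for the separable pair, a reduction of more than 15×.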
Why this division of labor works
An MLP can learn arbitrary linear projections but has to spend capacity to “discover” any local structure. A patch-aligned CNN bakes locality and translation-equivariance into the architecture from day one. By feeding only the seasonal residual into the CNN, xPatch lets the CNN focus on what it is good at — local patterns — without wasting capacity re-learning the trend. Conversely, the linear stream is not asked to model the seasonal spikes that would force it to compromise.
This is the same lesson that graph attention networks teach in a different domain: the architecture’s inductive biases should match the structure of the signal you are modeling. Attention is a powerful general-purpose mixer, but generality is not free.
Combining the two streams
The outputs of the linear and CNN streams are concatenated and passed through a final linear layer (Equation 12 in the paper) to produce the forecast of horizon T. This is intentionally simple. The model is not asked to learn a complex gating mechanism; it just learns a linear combination of the two specialists’ outputs.
Training tricks: arctangent loss, sigmoid LR, RevIN
The architecture is half the story. The other half is the training recipe, and the paper makes a strong case that some of the gains come from techniques that any forecasting model can adopt.
RevIN (Reversible Instance Normalization)
Distribution shift is endemic to time series. The mean and variance of a channel during training rarely match those at inference time — especially in non-stationary domains like finance, traffic, or weather. RevIN solves this with a deceptively simple trick:
- Before the model: subtract the per-instance mean and divide by the per-instance standard deviation. The instance is a single look-back window.
- After the model: multiply by the same std and add back the same mean (plus learnable affine parameters).
The model only ever sees standardized inputs, so it does not have to memorize the level or scale of any particular channel. The de-normalization at the output puts the forecast back on the original scale. RevIN is now standard equipment in modern forecasting models, and xPatch uses it exactly as PatchTST and CARD do.
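As a minimal sketch of the two steps for a single channel, in plain Python and without the learnable affine parameters (the function names are illustrative, and the "model" is a stand-in that simply repeats the last normalized value):

```python
import math


def revin_normalize(window, eps=1e-5):
    """Standardize one look-back window; return the stats needed to undo it."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in window], (mean, std)


def revin_denormalize(values, stats):
    """Map model outputs back to the original scale of the window."""
    mean, std = stats
    return [v * std + mean for v in values]


window = [100.0, 102.0, 101.0, 105.0, 104.0]
normed, stats = revin_normalize(window)

# Stand-in for a forecaster: repeat the last normalized value three times.
forecast_norm = [normed[-1]] * 3
forecast = revin_denormalize(forecast_norm, stats)
```

Because the statistics are computed per window, the same code handles a channel whose level has drifted far from anything seen in training.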
Arctangent loss: the smart twist
This is one of the most novel parts of the paper. CARD popularized a horizon-weighted loss that scales down the error on far-future steps, on the reasoning that distant predictions are inherently noisier. The intuition is reasonable, but if the weights fall off too quickly, the long end of the forecast window barely contributes to the gradient.
xPatch replaces it with a slower-changing function based on the arctangent (Equations 16-17):
ρ_arctan(i) = −arctan(i) + π/4 + 1
L_arctan = (1/T) · Σᵢ ρ_arctan(i) · ||Ŷᵢ − yᵢ||₁
Why arctangent? It is bounded, monotonic, and smooth. With the horizon index i counted from 1, ρ_arctan starts at exactly 1 and decays toward 1 − π/4 ≈ 0.215, so far horizons are down-weighted but never switched off, and no part of the window dominates the gradient. The result is a more even learning signal across the entire forecast window, which empirically translates to better performance on long horizons without hurting short ones.
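Plugging numbers into the formula makes the weighting concrete. The sketch below assumes the horizon index i runs from 1 to T and handles a single univariate forecast; it is an illustration in plain Python, not the paper's training code.

```python
import math


def arctan_weight(i):
    """rho_arctan(i) = -arctan(i) + pi/4 + 1, with i starting at 1."""
    return -math.atan(i) + math.pi / 4 + 1.0


def arctangent_loss(pred, target):
    """Horizon-weighted MAE for one univariate forecast of length T."""
    T = len(pred)
    total = sum(
        arctan_weight(i + 1) * abs(p - y)
        for i, (p, y) in enumerate(zip(pred, target))
    )
    return total / T


weights = [arctan_weight(i) for i in (1, 2, 96, 720)]
floor = 1.0 - math.pi / 4          # limit of the weight as i grows
loss = arctangent_loss([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
```

The weight is exactly 1 at the first step, about 0.68 at the second, and stays above the floor of roughly 0.215 even at horizon 720.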
The paper’s most striking ablation finding is that arctangent loss helps even when applied to other models. Drop it into PatchTST or CARD and accuracy improves. This makes it a genuinely transferable trick — a free upgrade for any forecasting pipeline.
Sigmoid learning-rate schedule
Standard schedules in this literature are step decay (cut LR by 0.5 every K epochs) or cosine annealing. xPatch introduces a sigmoid-shaped schedule (Equation 23) with a warmup parameter w. The shape is a smooth ramp-up from a low initial value, a flat plateau in the middle, and a gentle ramp-down. Compared to step decay, it avoids the discontinuities that can destabilize training; compared to cosine, the explicit warmup gives the optimizer time to find a good basin before the LR is high.
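Equation 23 is not reproduced in this post, so the function below is only one plausible way to get a sigmoid-shaped schedule: a difference of two sigmoids, one rising around the warmup epoch w and one falling much more slowly. The constants k and s, and the function names, are hypothetical choices for illustration; check the paper and the official repo for the exact form.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_lr(epoch, base_lr=1e-4, w=10, k=0.5, s=5.0):
    """Fast ramp-up centred on epoch w, slow decay centred on epoch s * w."""
    rise = sigmoid(k * (epoch - w))
    fall = sigmoid((k / s) * (epoch - s * w))
    return base_lr * (rise - fall)


schedule = [sigmoid_lr(epoch) for epoch in range(100)]
peak_epoch = max(range(100), key=lambda e: schedule[e])
```

With these constants the rate starts at zero, climbs smoothly to its peak around epoch 20, and then eases down over the rest of training with no step discontinuities.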
Like the arctangent loss, the paper shows the sigmoid schedule transfers cleanly to other models. It is a reminder that learning-rate schedules are often under-tuned in benchmark comparisons — everyone uses the same default, and any model that wants to win has to prove its architecture beats every competitor’s also-suboptimal training.
Compute footprint
xPatch is trained for 100 epochs on a single NVIDIA Quadro RTX 6000. That is a single mid-range GPU and a short schedule by modern standards. There is no foundation-model pre-training, no distributed setup, no clever quantization. This is part of the paper’s argument: state-of-the-art forecasting does not require state-of-the-art compute.
Results that hurt the transformers
The experimental setup is the standard long-horizon forecasting suite that has dominated the literature since Informer.
Datasets
| Dataset | Dim | Frequency | Forecast horizons |
|---|---|---|---|
| ETTh1, ETTh2 | 7 | Hourly | 96, 192, 336, 720 |
| ETTm1, ETTm2 | 7 | 15 min | 96, 192, 336, 720 |
| Weather | 21 | 10 min | 96, 192, 336, 720 |
| Traffic | 862 | Hourly | 96, 192, 336, 720 |
| Electricity | 321 | Hourly | 96, 192, 336, 720 |
| Exchange-rate | 8 | Daily | 96, 192, 336, 720 |
| Solar | 137 | 10 min | 96, 192, 336, 720 |
| ILI | 7 | Weekly | 24, 36, 48, 60 |
Look-back window is L = 96 for all datasets except ILI, which uses L = 36. The baselines are the heavy-hitters of the last few years: Autoformer, FEDformer, ETSformer, TimesNet, DLinear, RLinear, MICN, PatchTST, iTransformer, TimeMixer, and CARD.
Headline numbers
| Dataset | Horizon | xPatch MSE | xPatch MAE |
|---|---|---|---|
| ETTh1 | 96 | 0.428 | 0.419 |
| Weather | 720 | 0.310 | 0.322 |
Across all 8 datasets and all 4 horizons, xPatch beats CARD — the previous SOTA — by an average of 2.46% in MSE and 2.34% in MAE. That is a small-but-clear margin given how saturated these benchmarks have become; gains of 1-3% are now considered meaningful in the literature, and they are won at the cost of new attention variants, larger models, or longer training.
Speed: the real punchline
Accuracy is the headline; speed is the body blow. Table 3 in the paper reports per-step training and inference times.
| Model | Training (msec/step) | Inference (msec/step) | Relative speed vs xPatch |
|---|---|---|---|
| xPatch | 3.099 | 1.303 | 1.0× |
| CARD | 14.877 | — | 4.8× slower |
Training is roughly 4.8× faster than CARD per step. For PatchTST and DLinear, the paper does not give the same precise per-step numbers, but the general ordering reported is: DLinear < xPatch < PatchTST < CARD in training time. In production, where you may retrain forecasting models daily on streaming data, this kind of speed-up matters more than the marginal MSE gain.
Ablations: what actually drives performance
The ablation studies are where you learn whether a paper’s gains are robust or fragile. xPatch’s ablations are honest and informative.
EMA α sweep
| α | Weather | Traffic | Electricity | Notes |
|---|---|---|---|---|
| 0.1 | slightly worse | slightly worse | slightly worse | Trend too smooth, leaks structure |
| 0.3 | best | best | best | Optimal balance for big datasets |
| 0.5 | close | close | close | Reasonable fallback |
| 0.7 | worse | worse | worse | Trend tracks too fast |
| 0.9 | worst | worst | worst | Trend ~= input, decomposition fails |
The pattern is clear: 0.3 dominates on the larger datasets. The paper does report that smaller and noisier datasets sometimes prefer higher α values, so do not blindly fix α = 0.3 for every problem — sweep it on a held-out validation split.
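A sweep like this is a few lines of code once you have a function that trains with a given α and returns a validation score. In the sketch below, `fit_and_score` is a hypothetical callback standing in for "train xPatch and evaluate on the validation split"; the toy scorer that follows only forecasts with the last EMA value so the example runs on its own.

```python
def sweep_alpha(fit_and_score, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return (best_alpha, scores); a lower validation score is better."""
    scores = {alpha: fit_and_score(alpha) for alpha in grid}
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores


def ema(x, alpha):
    out = [x[0]]
    for value in x[1:]:
        out.append(alpha * value + (1.0 - alpha) * out[-1])
    return out


def toy_validation_mae(alpha, series, horizon=4, warmup=8):
    """Toy stand-in for a real model: forecast the next `horizon` points
    with the last EMA trend value and report the mean absolute error."""
    errors = []
    for end in range(warmup, len(series) - horizon + 1):
        level = ema(series[:end], alpha)[-1]
        errors.extend(abs(level - y) for y in series[end:end + horizon])
    return sum(errors) / len(errors)


# Synthetic validation series: a slow ramp with a level shift halfway through.
val_series = [0.1 * t + (5.0 if t >= 30 else 0.0) for t in range(60)]
best_alpha, scores = sweep_alpha(lambda a: toy_validation_mae(a, val_series))
```

In a real pipeline the callback would retrain the model for each α, so the sweep costs five training runs; at xPatch's training speed that is usually affordable.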
Dual-stream necessity
The paper ablates removing each stream. Removing the linear stream (so the CNN handles both trend and seasonal) hurts. Removing the CNN stream (so the linear stream tries to capture seasonality) hurts more. The two streams are genuinely complementary; neither is dispensable.
Arctangent loss is transferable
This is, in my view, the most important ablation in the paper. When you swap the standard MSE loss in PatchTST or CARD for the arctangent loss, those models also improve. That makes the loss a free upgrade for the entire field. If you are running an existing forecasting pipeline today, you can ship a new loss function as a one-line change and probably gain a few percentage points.
Sigmoid LR is also transferable
Same story: the sigmoid schedule helps other models too. The implication is uncomfortable for the literature: a non-trivial fraction of “architecture wins” in past papers may have been confounded by suboptimal training schedules. xPatch is at least transparent about this, isolating how much of its margin comes from the loss and the schedule versus the dual-stream design itself.
How to use xPatch (PyTorch sketch)
The official implementation lives at github.com/stitsyuk/xPatch and follows the structure of the well-known long-horizon forecasting library scaffolds. The full code includes data loaders, evaluation harnesses, and configurations for each benchmark, but the model itself is small enough to sketch in one screen.
Here is a minimal-but-faithful PyTorch outline. It is not a drop-in replacement for the official repo — use the official code for benchmarking — but it captures the architecture clearly.
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class EMADecomp(nn.Module):
    """Exponential moving-average decomposition (Eq. 2)."""

    def __init__(self, alpha: float = 0.3):
        super().__init__()
        self.alpha = alpha

    def forward(self, x):
        # x shape: (B, L, N) = batch, look-back, channels
        steps = [x[:, 0, :]]  # s_0 = x_0
        for t in range(1, x.size(1)):
            steps.append(self.alpha * x[:, t, :] + (1.0 - self.alpha) * steps[-1])
        trend = torch.stack(steps, dim=1)  # (B, L, N)
        seasonal = x - trend
        return trend, seasonal


class LinearStream(nn.Module):
    """Two FC + AvgPool + LayerNorm blocks, no activation."""

    def __init__(self, L: int, T: int, hidden: int = 128):
        super().__init__()
        # AvgPool1d(kernel_size=2, stride=1, padding=1) maps length h to h + 1.
        self.fc1 = nn.Linear(L, hidden)
        self.pool1 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln1 = nn.LayerNorm(hidden + 1)
        self.fc2 = nn.Linear(hidden + 1, hidden)
        self.pool2 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln2 = nn.LayerNorm(hidden + 1)
        self.proj = nn.Linear(hidden + 1, T)

    def forward(self, x):
        x = x.transpose(1, 2)        # (B, L, N) -> (B, N, L)
        h = self.pool1(self.fc1(x))  # pool over the feature axis: (B, N, hidden + 1)
        h = self.ln1(h)
        h = self.pool2(self.fc2(h))  # (B, N, hidden + 1)
        h = self.ln2(h)
        return self.proj(h)          # (B, N, T)


class CNNStream(nn.Module):
    """Depthwise -> pointwise -> GELU -> BN (residual omitted for brevity)."""

    def __init__(self, N: int, L: int, T: int, P: int = 16, S: int = 8):
        super().__init__()
        # S is kept for reference only: this sketch uses non-overlapping
        # patches (kernel = stride = P), so the overlap is not modelled.
        self.P, self.S = P, S
        n_out = (L - P) // P + 1  # output length of the depthwise conv
        self.depthwise = nn.Conv1d(
            in_channels=N, out_channels=N,
            kernel_size=P, stride=P, groups=N,
        )
        self.pointwise = nn.Conv1d(N, N, kernel_size=1)
        self.bn = nn.BatchNorm1d(N)
        self.proj = nn.Linear(n_out, T)

    def forward(self, x):
        x = x.transpose(1, 2)  # (B, L, N) -> (B, N, L)
        h = self.depthwise(x)  # (B, N, n_out), one filter per channel
        h = self.pointwise(h)  # mix across channels
        h = F.gelu(h)
        h = self.bn(h)
        return self.proj(h)    # (B, N, T)


class XPatch(nn.Module):
    def __init__(self, L: int, T: int, N: int, alpha: float = 0.3):
        super().__init__()
        self.decomp = EMADecomp(alpha)
        self.linear_stream = LinearStream(L, T)
        self.cnn_stream = CNNStream(N, L, T)
        self.fuse = nn.Linear(2 * T, T)

    def forward(self, x):
        # RevIN (learnable affine parameters omitted in this sketch)
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mean) / std

        trend, seasonal = self.decomp(x_norm)
        y_lin = self.linear_stream(trend)    # (B, N, T)
        y_cnn = self.cnn_stream(seasonal)    # (B, N, T)
        y = torch.cat([y_lin, y_cnn], dim=-1)
        y = self.fuse(y).transpose(1, 2)     # (B, T, N)

        # de-RevIN
        return y * std + mean


def arctangent_loss(pred, target):
    """L_arctan from Eq. 16-17; pred and target have shape (B, T, N)."""
    T = pred.size(1)
    # Horizon index runs from 1 to T, so the first step gets weight 1.
    i = torch.arange(1, T + 1, device=pred.device, dtype=torch.float32)
    rho = -torch.atan(i) + math.pi / 4 + 1.0
    abs_err = (pred - target).abs().mean(dim=-1)  # (B, T)
    return (rho * abs_err).mean()
```
A few practical notes:
- Replace the Python loop in EMADecomp with the vectorized closed form for a real speed-up; the paper’s Appendix D has the math, and the official repo implements it.
- The CNN stream’s output projection is sketched lazily here; the real implementation handles the patching dimensions more carefully.
- For a clean start, use L = 96, P = 16, S = 8, α = 0.3, 100 epochs, sigmoid LR with a warmup of about 10 epochs, and the arctangent loss.
If you are also experimenting with anomaly detection on the same series, see our overview of time series anomaly detection models — many of the same training tricks (RevIN, patching, decomposition) apply.
Hyperparameter cheat sheet
| Hyperparameter | Default | When to change |
|---|---|---|
| Look-back L | 96 (36 for ILI) | Increase if your seasonality is longer than 96 steps |
| Patch size P | 16 | Should align with your series’ natural local period |
| Stride S | 8 | Smaller for more overlap, larger for fewer patches |
| EMA α | 0.3 | Sweep {0.1, 0.3, 0.5, 0.7, 0.9} on small/noisy data |
| Epochs | 100 | Use early stopping to cut wasted compute |
| Loss | Arctangent | Switch to standard MAE if all horizons matter equally |
When to use xPatch vs alternatives
No model is a universal answer. xPatch sits in a specific corner of the design space: low-latency, accuracy-competitive, supervised, point-forecast, multivariate. Here is how I think about choosing.
| Need | Recommended approach | Why |
|---|---|---|
| Fastest training/inference, good accuracy | xPatch | Beats CARD, ~5× faster than CARD per training step |
| Foundation model / zero-shot | TimesFM, Chronos, Moirai | Pretrained at scale, generalize across domains without fine-tuning |
| Calibrated uncertainty estimates | Gaussian processes | Native posterior variances, principled credible intervals |
| Long-context attention reasoning | PatchTST, iTransformer | When channel relationships are essential and context exceeds ~512 steps |
| Tabular-style features without temporal structure | XGBoost / LightGBM | If you can engineer good lag/window features, GBMs are unbeatable on tabular forecasting |
| Linear/stationary signal, minimal compute | DLinear, classical ARIMA | If the data is genuinely simple, simpler is better |
| High-throughput streaming infra | xPatch + Kafka time-series engine | Low-latency model fits well with streaming pipelines |
For tuning the hyperparameters of any of these alternatives in a principled way, our note on Bayesian hyperparameter optimization is worth reading.
Limitations and open questions
xPatch is a strong paper, but no paper is perfect. The honest weak spots:
- α is a hyperparameter, not learned. A natural extension is to make α differentiable (or even per-channel and per-timescale). The paper acknowledges this and leaves it for future work.
- Datasets are relatively small. The largest is Traffic with 862 channels and ~17k timesteps. That is small compared to what foundation models like Chronos and TimesFM are pretrained on. xPatch’s behavior on much larger streams is untested in the paper.
- Two streams = two forward passes. Inference is still fast, but a fused single-pass implementation would be even faster, and might be feasible with a careful architectural redesign.
- Point forecasts only. xPatch produces a single-trajectory forecast with no probabilistic interpretation. For risk-sensitive applications — finance, energy, healthcare — you want quantiles or distributions, which xPatch does not natively provide. You would need to bolt on a quantile head or wrap it in a Bayesian framework.
- Benchmark saturation. The community has been honest that ETTh, Weather, and the related benchmarks are showing signs of saturation: gains of 2-3% may not transfer to messier real-world data with more drift, missing values, and concept shift. xPatch’s results are state of the art on these benchmarks; whether they generalize to a finance trading desk’s tick data is an empirical question.
- No theoretical analysis. The paper is empirical. There is no guarantee about generalization, no convergence proof for the recursion, no analysis of the loss landscape. That is fine for an applied paper but leaves room for follow-up theory.
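On the point-forecast limitation above, the usual way to train a quantile head is the pinball (quantile) loss. The sketch below is a generic plain-Python version for a single quantile level; it is not part of xPatch, and the function name is illustrative.

```python
def pinball_loss(pred, target, q):
    """Quantile (pinball) loss for one quantile level q in (0, 1)."""
    total = 0.0
    for p, y in zip(pred, target):
        diff = y - p
        # Under-prediction is penalized by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(pred)


target = [10.0, 12.0, 11.0]
under = pinball_loss([9.0, 11.0, 10.0], target, q=0.9)    # forecasts too low
over = pinball_loss([11.0, 13.0, 12.0], target, q=0.9)    # forecasts too high
```

At q = 0.9 an under-prediction costs nine times as much as an over-prediction of the same size, which pushes the head toward the 90th percentile. Training one head per quantile level gives a simple prediction interval.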
What this means for the field
Step back from the architecture details and the broader story is more interesting:
- Inductive biases keep winning. Decomposition (separating trend from seasonality) has been valuable since the 1950s, and it remains valuable in 2025. Patching, locality, and dual-specialization all encode useful priors. Brute-force attention without these priors is rarely the right move for time series.
- Loss functions and LR schedules are underrated. The fact that arctangent loss and sigmoid LR transfer to other models suggests the field has been comparing architectures under suboptimal training. Future benchmark papers should probably standardize the training recipe before claiming architectural wins.
- The Pareto frontier is the right axis. A model that is 1% more accurate but 10× slower may not be worth deploying. xPatch sits in the corner where accuracy is competitive and speed is meaningfully better. That is the right place to be for production systems.
- Foundation models are not the only path forward. The same year that brought TimesFM and Chronos also brought xPatch, which is task-specific, small, fast, and competitive. Both styles will coexist; choose based on your deployment constraints.
- Self-supervised pre-training is still on the menu. xPatch is fully supervised. There is an open question whether SSL pre-training of the CNN stream — analogous to what TS2Vec and similar methods do — would unlock further gains. Our overview of self-supervised pretraining covers the relevant techniques.
For a quick reminder of the statistical foundations that all of these models stand on (independence, the role of variance, why sample sizes matter for stable estimators), see our explainer on the Central Limit Theorem. And if you are about to put a forecasting model into production, the data layer matters too — our comparison of databases for preprocessed time series walks through the trade-offs.
Frequently asked questions
Why does a non-transformer model outperform PatchTST?
Three reasons stack: (1) the EMA decomposition gives the model two cleaner sub-signals instead of one mixed signal, (2) the dual-stream architecture matches the right tool to each component (linear for the smooth trend, CNN for the bursty seasonal), and (3) the arctangent loss and sigmoid LR schedule give a free training-side boost. PatchTST applies channel-independent attention over patch embeddings, but it asks one stack of attention layers to handle both trend and seasonal at once. Against CARD, the strongest baseline in the paper, xPatch’s specialization wins on average by 2.46% MSE while training about 4.8× faster per step.
Should I use xPatch or PatchTST in production?
Default to xPatch unless you have a specific reason not to. It is faster to train, faster to infer, slightly more accurate on the standard benchmarks, and easier to debug because the streams are interpretable. Reach for an attention model if you have a heavily channel-correlated dataset where cross-channel mixing is essential (iTransformer or CARD fit that case; PatchTST itself is channel-independent), or if you need a longer look-back than 96 steps and want attention’s global receptive field.
How do I tune the EMA alpha parameter?
Start with α = 0.3, which is optimal for the largest paper benchmarks (Weather, Traffic, Electricity). For smaller or noisier datasets, sweep {0.1, 0.3, 0.5, 0.7, 0.9} on a held-out validation split. Smaller α produces smoother trends (good when noise is dominant); larger α produces more reactive trends (good when regime changes are abrupt). The paper deliberately keeps α non-learnable; making it learnable is a sensible research extension.
What is the arctangent loss and why does it help?
It replaces the standard MSE/MAE loss with a horizon-weighted MAE where the weights follow ρ(i) = −arctan(i) + π/4 + 1. These weights change much more slowly with the horizon than the weighting CARD uses, starting at 1 and levelling off near 0.215, which means no part of the forecast window dominates the gradient or drops out of it. The result is a more uniform learning signal across all forecast horizons. Empirically, the loss helps not just xPatch but also other models (PatchTST, CARD), which makes it a transferable upgrade for any forecasting pipeline.
Does xPatch support multivariate forecasting?
Yes. The architecture is designed for multivariate inputs. The depthwise convolution in the CNN stream operates per-channel (groups = N), and the pointwise convolution mixes across channels. The linear stream processes each channel through the same weights but maintains the channel dimension. The paper evaluates on datasets with up to 862 channels (Traffic) without modification.
Related reading
- Time Series Forecasting Models in 2026 — the broader landscape xPatch competes in.
- Time Series Anomaly Detection Models — the same primitives (RevIN, patching, decomposition) used for a different task.
- Gaussian Processes Explained — for when you need uncertainty estimates that xPatch does not provide.
- Self-Supervised Pretraining — the SSL angle that could complement xPatch’s supervised training.
- Apache Kafka for Multivariate Time Series — the streaming infrastructure to deploy a fast forecasting model.
External references
- xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition — Stitsyuk & Choi, AAAI 2025 (arXiv:2412.17323).
- Official xPatch implementation on GitHub.
- PatchTST: A Time Series is Worth 64 Words — Nie et al., the patching baseline xPatch dethrones.
- Are Transformers Effective for Time Series Forecasting? (DLinear) — Zeng et al., the paper that started the back-to-basics thread.
- CARD — Wang et al., the previous SOTA xPatch is benchmarked against.
This article is for informational and educational purposes only. It summarizes a publicly available academic paper and is not a substitute for reading the original. Implementation details should be verified against the official repository before production use.