xPatch Explained: Dual-Stream Time Series Forecasting with EMA Decomposition

Q: Why does a non-transformer model outperform PatchTST?

Three reasons stack: (1) the EMA decomposition gives the model two cleaner sub-signals instead of one mixed signal, (2) the dual-stream architecture matches the right tool to each component (linear for the smooth trend, CNN for the bursty seasonal), and (3) the arctangent loss and sigmoid LR schedule give a free training-side boost. PatchTST does have channel-independent attention and patching, but its patches are fixed-length rather than learned, and it asks one stack of attention layers to handle both trend and seasonal at once. Measured against CARD, the previous SOTA, xPatch's specialization wins on average by 2.46% MSE while training about 4.8x faster per step.

Q: Should I use xPatch or PatchTST in production?

Default to xPatch unless you have a specific reason not to. It is faster to train, faster to infer, slightly more accurate on standard benchmarks, and easier to debug because the streams are interpretable. Reach for an attention model if you need a look-back much longer than 96 steps and want a global receptive field (PatchTST), or if your dataset is heavily channel-correlated and cross-channel attention is essential. For the latter, pick a channel-mixing design such as iTransformer or CARD, since PatchTST itself processes each channel independently.

Q: How do I tune the EMA alpha parameter?

Start with alpha = 0.3, which is optimal for the largest benchmarks (Weather, Traffic, Electricity). For smaller or noisier datasets, sweep {0.1, 0.3, 0.5, 0.7, 0.9} on a held-out validation split. Smaller alpha produces smoother trends (good when noise dominates); larger alpha produces more reactive trends (good when regime changes are abrupt). The paper deliberately keeps alpha non-learnable.

Q: What is the arctangent loss and why does it help?

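It replaces standard MSE/MAE with a horizon-weighted MAE where weights follow rho(i) = -arctan(i) + pi/4 + 1 for horizon step i = 1, ..., T. The weight starts at 1 for the first step and decays smoothly toward 1 - pi/4 (about 0.215), so far-horizon errors are down-weighted but never ignored. CARD's signal-decay weighting shrinks far-horizon weights much faster; the arctangent changes more slowly, so every horizon keeps contributing to the gradient and no single one dominates. The result is more uniform learning across all forecast horizons. Empirically, the loss helps not just xPatch but also PatchTST and CARD, making it a transferable upgrade for any forecasting pipeline.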

Q: Does xPatch support multivariate forecasting?

Yes. The depthwise convolution in the CNN stream operates per-channel (groups = N) and the pointwise convolution mixes across channels. The linear stream processes each channel with shared weights while preserving the channel dimension. The paper evaluates on datasets with up to 862 channels (Traffic) without modification.

Last updated: May 17, 2026

By kongastral

Published May 7, 2026

PatchTST set the bar for transformer-based time series forecasting. Then a paper from KAIST showed something uncomfortable: a non-transformer model with two simple streams (an MLP and a CNN) beats it. xPatch does this while training roughly 4.8× faster per step than CARD, the strongest transformer baseline, and with an old idea: exponential moving averages.

That paper is xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition by Artyom Stitsyuk and Jaesik Choi, published at AAAI 2025 (arXiv:2412.17323). It is the kind of paper that quietly recalibrates the field. No new attention variant. No 100B-parameter foundation model. Just a careful re-examination of which inductive biases actually pay off when you forecast electricity load, traffic, weather, or stock returns.

This deep-dive walks through every load-bearing piece of the paper: the EMA decomposition, the dual-stream architecture, the arctangent loss, the sigmoid learning-rate schedule, the experimental results, and what it all means for the practitioner shipping forecasts to production.

Summary

What this post covers: A deep-dive into the AAAI 2025 xPatch paper by Stitsyuk and Choi, breaking down its EMA decomposition, dual-stream MLP+CNN architecture, training tricks (arctangent loss, sigmoid LR, RevIN), benchmark results, and what it implies for transformer-dominated time-series forecasting.

Key insights:

A non-transformer dual-stream model (linear stream for trend, depthwise-separable CNN for seasonal) beats CARD, the previous SOTA, by an average of 2.46% MSE and 2.34% MAE across 8 standard benchmarks while running roughly 4x faster.
The right inductive bias (EMA trend-seasonal decomposition + patching + dual specialization) consistently outperforms brute-force attention for typical multivariate forecasting, echoing DLinear’s earlier “are transformers effective?” critique.
Training-side tricks do real work: the arctangent loss (horizon-weighted MAE that prevents any single horizon from dominating the gradient) and sigmoid LR schedule transfer to PatchTST and CARD as well, suggesting many architecture comparisons in the literature have used suboptimal training recipes.
Default the EMA alpha to 0.3 for large benchmarks (Weather, Traffic, Electricity) and sweep {0.1, 0.3, 0.5, 0.7, 0.9} on smaller or noisier datasets; smaller alpha gives smoother trends, larger alpha gives more reactive trends.
Use xPatch by default in production: it is faster to train, faster to infer, slightly more accurate, and easier to debug because the two streams are individually interpretable. Consider an attention model instead if you have heavy channel correlations that need cross-channel attention (iTransformer or CARD, since PatchTST is channel-independent) or a look-back longer than 96 steps (PatchTST).

Main topics: Why this paper matters, The EMA decomposition: heart of xPatch, The dual-stream architecture, Training tricks: arctangent loss, sigmoid LR, RevIN, Results that hurt the transformers, Ablations: what actually drives performance, How to use xPatch (PyTorch sketch), When to use xPatch vs alternatives, Limitations and open questions, What this means for the field, Frequently asked questions.

Why this paper matters

For about three years, time series forecasting has been a transformer story. Informer (2021) made attention practical for long sequences. Autoformer (2021) plugged in series decomposition. FEDformer (2022) moved attention to the frequency domain. PatchTST (2023) borrowed the patching trick from Vision Transformers and made it the strongest model on a long list of benchmarks. iTransformer (2024) inverted the embedding dimension. CARD (2024) tightened the channel-aligned attention design.

DLinear (2022) interrupted that story with an awkward question: do you actually need attention for forecasting? A nearly trivial linear model (a moving-average decomposition followed by one fully-connected layer per component) could match or beat several transformer variants on standard benchmarks. The community responded with a wave of “are transformers effective?” papers, and the answer that emerged was nuanced: transformers help on some datasets, hurt on others, and the gains are often smaller than the speedups you give up.

xPatch takes the next logical step. Instead of falling back to a purely linear model (DLinear) or sticking with a transformer and tuning attention (CARD, iTransformer), it builds a dual-stream non-transformer model with stronger inductive biases. One stream is a simple MLP. The other is a small depthwise-separable CNN. Glue them together with EMA-based decomposition and a smarter loss function, and the result lands ahead of CARD, the previous state of the art, while training roughly 4.8× faster per step.

For an end-to-end primer on the broader landscape these models live in, see our companion overview of time series forecasting models in 2026; xPatch is one of the cleanest examples of a non-foundation-model approach that still pulls its weight on real benchmarks.

Key Takeaway: xPatch is evidence that for typical multivariate forecasting, the right inductive biases (decomposition + patching + dual specialization) matter more than attention itself. Architecture is not the only frontier — loss functions and learning-rate schedules are doing a lot of the work too.

The EMA decomposition: heart of xPatch

If you have to remember one thing about xPatch, remember this: the model’s first move is to split every channel of the input series into a slow part and a fast part, and then learn each part with a different kind of network. That split is done with an exponential moving average.

Why decomposition matters

Trend and seasonality have fundamentally different dynamics. A trend is slow, often nearly linear over short windows, and dominated by accumulating shifts in level. A seasonal component is fast, often locally periodic, frequently bursty (think traffic spikes or weather fronts). If you ask one network to model both at once, it has to compromise — smooth filters blur the seasonal spikes; sharp filters chase the trend’s drift. Decomposition removes that conflict by handing each component to a specialist.

This is hardly a new idea. Classical statistics has been doing it for decades:

STL (Seasonal-Trend decomposition using Loess): local polynomial regression to extract seasonality.
Holt-Winters: three exponential smoothers (level, trend, seasonal) chained together.
X-11 / X-13ARIMA-SEATS: the government-statistics workhorse, built on iterative moving averages.

Recent ML approaches kept the spirit but used different tools: DLinear used a simple moving-average filter, and FEDformer projected into the frequency domain via Fourier transforms. xPatch makes a different choice: an exponential moving average.

The recursive formula

The EMA decomposition is defined by Equation 2 of the paper:

s₀ = x₀
sₜ = α · xₜ + (1 − α) · sₜ₋₁    for t > 0

X_T = EMA(X)         (trend)
X_S = X − X_T        (seasonal residual)

Here α is the smoothing factor in (0, 1). Small α (like 0.1) gives a very smooth trend dominated by old observations; large α (like 0.9) makes the trend track the latest value almost immediately. The seasonal stream is whatever the trend cannot explain.
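
To make the effect of α concrete, here is a minimal pure-Python sketch of the recursion on a single channel. The function name `ema_decompose` and the toy step series are illustrative assumptions, not taken from the paper or the official repo.

```python
def ema_decompose(x, alpha):
    """Split a 1-D series into (trend, seasonal) using the EMA recursion."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    trend = [x[0]]  # s_0 = x_0
    for value in x[1:]:
        # s_t = alpha * x_t + (1 - alpha) * s_{t-1}
        trend.append(alpha * value + (1.0 - alpha) * trend[-1])
    # The seasonal part is whatever the trend does not explain: X_S = X - X_T
    seasonal = [xi - si for xi, si in zip(x, trend)]
    return trend, seasonal


# Toy data (hypothetical): a flat series with a level shift at t = 3.
series = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
slow_trend, slow_seasonal = ema_decompose(series, alpha=0.1)
fast_trend, fast_seasonal = ema_decompose(series, alpha=0.9)
# After the shift, the alpha=0.1 trend climbs slowly (about 0.1, 0.19, 0.271)
# while the alpha=0.9 trend jumps almost at once (about 0.9, 0.99, 0.999).
```

With α = 0.1 most of the level shift stays in the seasonal residual for several steps; with α = 0.9 the trend absorbs it almost immediately and the residual is close to zero.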

The recursion looks expensive — it is sequential by definition — but Appendix D of the paper shows a vectorized form with O(1) per-step cost in terms of GPU operations. The trick is to expand the recursion into a closed-form weighted sum and compute it as a single matrix multiply with a Toeplitz-style weight matrix. In practice, the EMA pre-processing is essentially free compared to the rest of the forward pass.
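
The closed-form expansion can be checked in a few lines. Unrolling the recursion gives sₜ = (1 − α)ᵗ · x₀ + Σ_{k=1..t} α (1 − α)^(t−k) · xₖ, which is a lower-triangular weight matrix applied to the input. The sketch below builds that matrix in plain Python and compares it with the loop. It is only a demonstration of the idea, not necessarily how Appendix D or the official repo implements it, and the helper names are made up for this example.

```python
def ema_recursive(x, alpha):
    """Reference implementation: the sequential recursion."""
    s = [x[0]]
    for v in x[1:]:
        s.append(alpha * v + (1.0 - alpha) * s[-1])
    return s


def ema_weight_matrix(length, alpha):
    """Lower-triangular W such that trend = W @ x."""
    W = [[0.0] * length for _ in range(length)]
    for t in range(length):
        W[t][0] = (1.0 - alpha) ** t  # weight on the initial value x_0
        for k in range(1, t + 1):
            W[t][k] = alpha * (1.0 - alpha) ** (t - k)
    return W


def ema_closed_form(x, alpha):
    """Apply the weight matrix; in a tensor library this is one matmul."""
    W = ema_weight_matrix(len(x), alpha)
    return [sum(w * v for w, v in zip(row, x)) for row in W]


toy = [1.0, 3.0, 2.0, 5.0, 4.0]  # hypothetical data
recursive = ema_recursive(toy, alpha=0.3)
closed = ema_closed_form(toy, alpha=0.3)
```

Each row of the matrix sums to 1, so the trend is always a weighted average of past inputs. Because the matrix depends only on α and the look-back length, it can be built once and reused for every batch.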

Why α = 0.3 wins for big datasets

The paper sweeps α over {0.1, 0.3, 0.5, 0.7, 0.9}. On Weather, Traffic, and Electricity (the larger, more channel-rich benchmarks) α = 0.3 is consistently optimal. Why? With many noisy channels, you want the trend to be genuinely slow so it filters short-lived noise but still tracks the multi-step drift. Smaller α oversmooths, so slow drift that the trend should own spills into the seasonal residual; larger α lets too much high-frequency content leak into the “trend.” 0.3 sits in a sweet spot.

On smaller and noisier datasets the picture is murkier — sometimes α = 0.5 or 0.7 wins, because the trend has to react faster to abrupt regime changes. The paper treats α as a hyperparameter, not a learnable parameter; that is one of the obvious extensions for follow-up work.

Simple MA vs EMA

Property	Simple Moving Average (DLinear-style)	Exponential Moving Average (xPatch)
Weight scheme	Uniform inside a window	Geometric decay, recent > old
Hyperparameter	Window length k	Smoothing factor α
Edge effects	Hard window boundary	Smooth, no boundary discontinuity
Reactivity to recent shocks	Slow (averaged equally with old data)	Fast (recent point gets weight α)
Implementation cost	O(k) per step	O(1) per step (vectorized)

The dual-stream architecture

Once we have X_T (trend) and X_S (seasonal), xPatch processes them in two specialized streams. The design philosophy: use the right tool for each component, then glue them together at the end.

The linear stream (handles X_T)

The trend is, by construction, smooth. There is not much non-linear structure left in it after the EMA filter. So xPatch processes it with two MLP-style blocks, each consisting of:

A fully-connected (FC) projection
A 1D average pooling with kernel size k = 2
A LayerNorm

Critically, there is no non-linear activation function anywhere in the linear stream. The whole stream is — up to the LayerNorm — a sequence of linear operators. The final output is projected to dimension T (the forecast horizon). If you have read the DLinear paper, this should feel familiar; xPatch is essentially saying “DLinear had the right idea for the trend, so let’s keep that as our trend model.”

Why the LayerNorm? It is the only nonlinear-flavored operator in the stream (LayerNorm divides by an instance-computed std, which is data-dependent), and it stabilizes training when the trend’s scale changes between samples. The average pooling acts as additional smoothing, defensively reducing the chance that the linear stream over-fits to high-frequency noise that leaked through the decomposition.

The CNN stream (handles X_S)

The seasonal stream is where the action is. Seasonal residuals are bursty, locally periodic, and channel-correlated. xPatch handles them with a depthwise-separable CNN:

Patching: the input is segmented into patches of length P = 16 with stride S = 8. The number of patches is ⌊(L − P) / S⌋ + 2, matching the PatchTST setup. With L = 96, that gives ⌊80 / 8⌋ + 2 = 12 patches per channel.
Depthwise convolution: kernel size P = 16, stride P = 16, groups equal to the number of channels N. Each channel gets its own filter aligned to patch boundaries; no cross-channel mixing happens here.
Pointwise convolution: a 1×1 convolution that mixes information across channels.
GELU activation: the only major nonlinearity in the entire model. GELU's smooth, non-saturating shape works well for the spiky residuals.
BatchNorm: for training stability across batches.
Residual connection: the input is added back to the output, which makes optimization easier and lets the stream behave like an identity if the seasonal component happens to be near-zero.
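
The patch count in the first step above can be reproduced with a short sketch. It assumes the PatchTST convention of repeating the last value S times at the end of the window before sliding, which is where the "+ 2" in the formula comes from. Whether xPatch pads in exactly this way is an assumption here, and `make_patches` is a hypothetical helper.

```python
def n_patches(L, P, S):
    """Patch count from the formula floor((L - P) / S) + 2."""
    return (L - P) // S + 2


def make_patches(x, P=16, S=8):
    """Repeat the last value S times, then slide a length-P window by S."""
    padded = list(x) + [x[-1]] * S
    return [padded[i:i + P] for i in range(0, len(padded) - P + 1, S)]


window = list(range(96))        # stand-in for one channel with L = 96
patches = make_patches(window)  # 12 patches, each of length 16
```

Without the end padding the same window would give only ⌊(96 − 16) / 8⌋ + 1 = 11 patches, so the padding contributes the twelfth.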

The depthwise + pointwise pattern is the classic MobileNet-style separable convolution. It dramatically reduces parameters versus a full convolution while keeping a similar receptive field. For time series with many channels (Traffic has 862, Electricity has 321), this is essential — a full Conv1D would be enormous.
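
A quick parameter count shows the size of the saving. The sketch uses the standard Conv1d formula (out_channels × in_channels / groups × kernel, plus one bias per output channel) and takes 862 channels with kernel 16 from the description above. It is back-of-the-envelope arithmetic, not a measurement of the official model.

```python
def conv1d_params(in_ch, out_ch, kernel, groups=1, bias=True):
    """Parameter count of a 1-D convolution layer."""
    weights = out_ch * (in_ch // groups) * kernel
    return weights + (out_ch if bias else 0)


def separable_params(channels, kernel):
    """Depthwise conv (groups = channels) followed by a 1x1 pointwise conv."""
    depthwise = conv1d_params(channels, channels, kernel, groups=channels)
    pointwise = conv1d_params(channels, channels, 1)
    return depthwise + pointwise


C, P = 862, 16
full = conv1d_params(C, C, P)        # 11,889,566 parameters
separable = separable_params(C, P)   # 758,560 parameters
ratio = full / separable             # roughly 15.7x fewer parameters
```

The depthwise part is tiny (14,654 parameters); almost all of the remaining cost sits in the pointwise layer that mixes channels.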

Why this division of labor works

An MLP can learn arbitrary linear projections but has to spend capacity to “discover” any local structure. A patch-aligned CNN bakes locality and translation-equivariance into the architecture from day one. By feeding only the seasonal residual into the CNN, xPatch lets the CNN focus on what it is good at — local patterns — without wasting capacity re-learning the trend. Conversely, the linear stream is not asked to model the seasonal spikes that would force it to compromise.

This is the same lesson that graph attention networks teach in a different domain: the architecture’s inductive biases should match the structure of the signal you are modeling. Attention is a powerful general-purpose mixer, but generality is not free.

Combining the two streams

The outputs of the linear and CNN streams are concatenated and passed through a final linear layer (Equation 12 in the paper) to produce the forecast of horizon T. This is intentionally simple. The model is not asked to learn a complex gating mechanism; it just learns a linear combination of the two specialists’ outputs.

Tip: If you are implementing xPatch from scratch and want a sanity check, start with just the linear stream and verify it matches DLinear performance on ETTh1. Then add the CNN stream and watch the gains appear on the noisier datasets like Weather and Traffic.

Training tricks: arctangent loss, sigmoid LR, RevIN

The architecture is half the story. The other half is the training recipe, and the paper makes a strong case that some of the gains come from techniques that any forecasting model can adopt.

RevIN (Reversible Instance Normalization)

Distribution shift is endemic to time series. The mean and variance of a channel during training rarely match those at inference time — especially in non-stationary domains like finance, traffic, or weather. RevIN solves this with a deceptively simple trick:

Before the model: subtract the per-instance mean and divide by the per-instance standard deviation. The instance is a single look-back window.
After the model: multiply by the same std and add back the same mean (plus learnable affine parameters).

The model only ever sees standardized inputs, so it does not have to memorize the level or scale of any particular channel. The de-normalization at the output puts the forecast back on the original scale. RevIN is now standard equipment in modern forecasting models, and xPatch uses it exactly as PatchTST and CARD do.
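
The normalize/de-normalize round trip is easy to see on a single channel. This is a simplified sketch that leaves out RevIN's learnable affine parameters; the function names and toy windows are illustrative assumptions.

```python
import math


def revin_normalize(window, eps=1e-5):
    """Standardize one look-back window; return the stats for later reuse."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)
    return [(v - mean) / std for v in window], mean, std


def revin_denormalize(values, mean, std):
    """Map model outputs back to the original scale of that window."""
    return [v * std + mean for v in values]


# Two hypothetical windows with the same shape but different levels.
low = [10.0, 12.0, 11.0, 13.0]
high = [110.0, 112.0, 111.0, 113.0]
low_norm, low_mean, low_std = revin_normalize(low)
high_norm, high_mean, high_std = revin_normalize(high)
restored = revin_denormalize(high_norm, high_mean, high_std)
```

Both windows normalize to the same values, so the model sees identical inputs and the level difference is carried only by the stored mean, which is added back at the output.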

Arctangent loss: the smart twist

This is one of the most novel parts of the paper. CARD popularized a horizon-weighted loss that scales down the error on far-future steps, on the grounds that distant predictions have higher variance and should not dominate training. The intuition is reasonable, but CARD's weights shrink quickly with the horizon, so the far end of the forecast window ends up contributing very little to the optimization.

xPatch replaces it with a slower-growing function based on the arctangent (Equations 16-17):

ρ_arctan(i) = −arctan(i) + π/4 + 1

L_arctan = (1/T) · Σ_{i=1..T} ρ_arctan(i) · ||ŷᵢ − yᵢ||₁

Why arctangent? It is bounded, monotonic, and smooth. Plugging into the formula, ρ_arctan(1) = 1 and the weight decays toward 1 − π/4 ≈ 0.215 as i grows, so it never reaches zero. Unlike a faster-decaying weighting, it neither lets the early horizons dominate the gradient nor silences the late ones. The result is more even attention across the entire forecast window, which empirically translates to better performance on long horizons without hurting short ones.
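
Evaluating ρ_arctan directly shows how the weights behave across the horizon. The sketch below also compares it with an inverse-square-root decay, which is used here as an assumed stand-in for CARD's weighting rather than a quotation from the CARD paper, and gives a single-channel version of the loss with the horizon index starting at 1.

```python
import math


def rho_arctan(i):
    """Horizon weight rho(i) = -arctan(i) + pi/4 + 1, for i = 1, 2, ..."""
    return -math.atan(i) + math.pi / 4 + 1.0


def rho_inverse_sqrt(i):
    """Assumed comparison weighting: i ** -0.5, which decays toward zero."""
    return i ** -0.5


def arctangent_loss(pred, target):
    """Horizon-weighted MAE for one channel; pred and target have length T."""
    T = len(pred)
    weighted = (
        rho_arctan(i) * abs(p - y)
        for i, (p, y) in enumerate(zip(pred, target), start=1)
    )
    return sum(weighted) / T


horizons = [1, 96, 336, 720]
arctan_weights = [rho_arctan(i) for i in horizons]
sqrt_weights = [rho_inverse_sqrt(i) for i in horizons]
# rho_arctan stays above its floor of 1 - pi/4 (about 0.215) at every horizon,
# while the inverse-square-root weights keep shrinking toward zero.
```

At i = 720 the arctangent weight is still about 0.216, whereas the inverse-square-root weight has dropped below 0.04, so late steps keep a meaningful share of the gradient under the arctangent scheme.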

The paper’s most striking ablation finding is that arctangent loss helps even when applied to other models. Drop it into PatchTST or CARD and accuracy improves. This makes it a genuinely transferable trick — a free upgrade for any forecasting pipeline.

Sigmoid learning-rate schedule

Standard schedules in this literature are step decay (cut LR by 0.5 every K epochs) or cosine annealing. xPatch introduces a sigmoid-shaped schedule (Equation 23) with a warmup parameter w. The shape is a smooth ramp-up from a low initial value, a flat plateau in the middle, and a gentle ramp-down. Compared to step decay, it avoids the discontinuities that can destabilize training; compared to cosine, the explicit warmup gives the optimizer time to find a good basin before the LR is high.

Like the arctangent loss, the paper shows the sigmoid schedule transfers cleanly to other models. It is a reminder that learning-rate schedules are often under-tuned in benchmark comparisons — everyone uses the same default, and any model that wants to win has to prove its architecture beats every competitor’s also-suboptimal training.

Compute footprint

xPatch is trained for 100 epochs on a single NVIDIA Quadro RTX 6000. That is a single mid-range GPU and a short schedule by modern standards. There is no foundation-model pre-training, no distributed setup, no clever quantization. This is part of the paper’s argument: state-of-the-art forecasting does not require state-of-the-art compute.

Caution: The arctangent loss does not treat all horizons equally: it gives the first step full weight and tapers later steps toward roughly 0.215. If your downstream application depends almost entirely on the next-step forecast (e.g., real-time anomaly detection on the next minute), you may want an even steeper decay; if late horizons matter as much as early ones, use plain MAE or a custom ρ function. The paper’s choice is well-motivated for the standard all-horizons benchmark, not necessarily for every production setting.

Results that hurt the transformers

The experimental setup is the standard long-horizon forecasting suite that has dominated the literature since Informer.

Datasets

Dataset	Dim	Frequency	Forecast horizons
ETTh1, ETTh2	7	Hourly	96, 192, 336, 720
ETTm1, ETTm2	7	15 min	96, 192, 336, 720
Weather	21	10 min	96, 192, 336, 720
Traffic	862	Hourly	96, 192, 336, 720
Electricity	321	Hourly	96, 192, 336, 720
Exchange-rate	8	Daily	96, 192, 336, 720
Solar	137	10 min	96, 192, 336, 720
ILI	7	Weekly	24, 36, 48, 60

Look-back window is L = 96 for all datasets except ILI, which uses L = 36. The baselines are the heavy-hitters of the last few years: Autoformer, FEDformer, ETSformer, TimesNet, DLinear, RLinear, MICN, PatchTST, iTransformer, TimeMixer, and CARD.

Headline numbers

Dataset	Horizon	xPatch MSE	xPatch MAE
ETTh1	96	0.428	0.419
Weather	720	0.310	0.322

Across all 8 datasets and all 4 horizons, xPatch beats CARD — the previous SOTA — by an average of 2.46% in MSE and 2.34% in MAE. That is a small-but-clear margin given how saturated these benchmarks have become; gains of 1-3% are now considered meaningful in the literature, and they are won at the cost of new attention variants, larger models, or longer training.

Speed: the real punchline

Accuracy is the headline; speed is the body blow. Table 3 in the paper reports per-step training and inference times.

Model	Training (msec/step)	Inference (msec/step)	Relative speed vs xPatch
xPatch	3.099	1.303	1.0×
CARD	14.877	—	4.8× slower

Training is roughly 4.8× faster than CARD per step. For PatchTST and DLinear, the paper does not give the same precise per-step numbers, but the general ordering reported is: DLinear < xPatch < PatchTST < CARD in training time. In production, where you may retrain forecasting models daily on streaming data, this kind of speed-up matters more than the marginal MSE gain.

Ablations: what actually drives performance

The ablation studies are where you learn whether a paper’s gains are robust or fragile. xPatch’s ablations are honest and informative.

EMA α sweep

α	Weather	Traffic	Electricity	Notes
0.1	slightly worse	slightly worse	slightly worse	Trend too smooth, leaks structure
0.3	best	best	best	Optimal balance for big datasets
0.5	close	close	close	Reasonable fallback
0.7	worse	worse	worse	Trend tracks too fast
0.9	worst	worst	worst	Trend ~= input, decomposition fails

The pattern is clear: 0.3 dominates on the larger datasets. The paper does report that smaller and noisier datasets sometimes prefer higher α values, so do not blindly fix α = 0.3 for every problem — sweep it on a held-out validation split.

Dual-stream necessity

The paper ablates removing each stream. Removing the linear stream (so the CNN handles both trend and seasonal) hurts. Removing the CNN stream (so the linear stream tries to capture seasonality) hurts more. The two streams are genuinely complementary; neither is dispensable.

Arctangent loss is transferable

This is, in my view, the most important ablation in the paper. When you swap the default training loss in PatchTST or CARD for the arctangent loss, those models also improve. That makes the loss a nearly free upgrade for the entire field. If you are running an existing forecasting pipeline today, you can ship the new loss as a small, self-contained change and may see a modest accuracy gain; measure it on your own validation split before relying on it.

Sigmoid LR is also transferable

Same story: the sigmoid schedule helps other models too. The implication is uncomfortable for the literature: a non-trivial fraction of “architecture wins” in past papers may have been confounded by suboptimal training schedules. xPatch is at least transparent about this, isolating how much of its margin comes from the loss and the schedule versus the dual-stream design itself.

Key Takeaway: A meaningful share of xPatch’s gains come from training tricks, not architecture. The honest reading is that xPatch wins on multiple axes — better decomposition, better dual-stream design, better loss, better schedule — and you should think carefully about which of those you want to adopt independently.

How to use xPatch (PyTorch sketch)

The official implementation lives at github.com/stitsyuk/xPatch and follows the structure of the well-known long-horizon forecasting library scaffolds. The full code includes data loaders, evaluation harnesses, and configurations for each benchmark, but the model itself is small enough to sketch in one screen.

Here is a minimal-but-faithful PyTorch outline. It is not a drop-in replacement for the official repo — use the official code for benchmarking — but it captures the architecture clearly.

import torch
import torch.nn as nn
import torch.nn.functional as F


class EMADecomp(nn.Module):
    """Exponential moving-average decomposition (Eq. 2)."""
    def __init__(self, alpha: float = 0.3):
        super().__init__()
        self.alpha = alpha

    def forward(self, x):
        # x shape: (B, L, N)  batch, look-back, channels
        B, L, N = x.shape
        trend = torch.zeros_like(x)
        trend[:, 0, :] = x[:, 0, :]
        for t in range(1, L):
            trend[:, t, :] = (
                self.alpha * x[:, t, :]
                + (1.0 - self.alpha) * trend[:, t - 1, :]
            )
        seasonal = x - trend
        return trend, seasonal


class LinearStream(nn.Module):
    """2 FC + AvgPool + LayerNorm blocks, no activation."""
    def __init__(self, L: int, T: int, hidden: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(L, hidden)
        self.pool1 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln1 = nn.LayerNorm(hidden + 1)
        self.fc2 = nn.Linear(hidden + 1, hidden)
        self.pool2 = nn.AvgPool1d(kernel_size=2, stride=1, padding=1)
        self.ln2 = nn.LayerNorm(hidden + 1)
        self.proj = nn.Linear(hidden + 1, T)

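    def forward(self, x):
        # x: (B, L, N) -> (B, N, L) so each channel is a row of length L
        x = x.transpose(1, 2)
        # AvgPool1d pools over the last (feature) axis; with kernel 2,
        # stride 1, padding 1 it maps `hidden` features to `hidden + 1`,
        # which is what the LayerNorm and the next Linear layer expect.
        h = self.ln1(self.pool1(self.fc1(x)))    # (B, N, hidden + 1)
        h = self.ln2(self.pool2(self.fc2(h)))    # (B, N, hidden + 1)
        return self.proj(h)  # (B, N, T)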


class CNNStream(nn.Module):
    """Patch -> depthwise -> pointwise -> GELU -> BN -> residual."""
    def __init__(self, N: int, L: int, T: int,
                 P: int = 16, S: int = 8):
        super().__init__()
        self.P, self.S = P, S
        n_patches = (L - P) // S + 2
        self.depthwise = nn.Conv1d(
            in_channels=N, out_channels=N,
            kernel_size=P, stride=P, groups=N,
        )
        self.pointwise = nn.Conv1d(N, N, kernel_size=1)
        self.bn = nn.BatchNorm1d(N)
        self.proj = nn.Linear(n_patches * P, T)

    def forward(self, x):
        # x: (B, L, N) -> (B, N, L)
        x = x.transpose(1, 2)
        h = self.depthwise(x)
        h = self.pointwise(h)
        h = F.gelu(h)
        h = self.bn(h)
        # residual: pad and add (omitted for brevity)
        h = h.flatten(start_dim=2)
        h = F.pad(h, (0, max(0, self.proj.in_features - h.size(-1))))
        return self.proj(h[..., :self.proj.in_features])


class XPatch(nn.Module):
    def __init__(self, L: int, T: int, N: int, alpha: float = 0.3):
        super().__init__()
        self.decomp = EMADecomp(alpha)
        self.linear_stream = LinearStream(L, T)
        self.cnn_stream = CNNStream(N, L, T)
        self.fuse = nn.Linear(2 * T, T)

    def forward(self, x):
        # RevIN
        mean = x.mean(dim=1, keepdim=True)
        std = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mean) / std

        trend, seasonal = self.decomp(x_norm)
        y_lin = self.linear_stream(trend)        # (B, N, T)
        y_cnn = self.cnn_stream(seasonal)        # (B, N, T)
        y = torch.cat([y_lin, y_cnn], dim=-1)
        y = self.fuse(y).transpose(1, 2)         # (B, T, N)

        # de-RevIN
        return y * std + mean


def arctangent_loss(pred, target):
    """L_arctan from Eq. 16-17; pred and target are (B, T, N)."""
    T = pred.size(1)
    # Horizon index runs 1..T, so the first step gets rho(1) = 1.
    i = torch.arange(1, T + 1, device=pred.device, dtype=torch.float32)
    rho = -torch.atan(i) + torch.pi / 4 + 1.0
    abs_err = (pred - target).abs().mean(dim=-1)  # (B, T)
    return (rho * abs_err).mean()

A few practical notes:

Replace the Python loop in EMADecomp with the vectorized closed-form for a real speed-up — the paper’s Appendix D has the math, and the official repo implements it.
The CNN stream’s output projection is sketched lazily here; the real implementation handles the patching dimensions more carefully.
For a clean start, use L = 96, P = 16, S = 8, α = 0.3, 100 epochs, sigmoid LR with a warmup of about 10 epochs, and the arctangent loss.

If you are also experimenting with anomaly detection on the same series, see our overview of time series anomaly detection models — many of the same training tricks (RevIN, patching, decomposition) apply.

Hyperparameter cheat sheet

Hyperparameter	Default	When to change
Look-back L	96 (36 for ILI)	Increase if your seasonality is longer than 96 steps
Patch size P	16	Should align with your series’ natural local period
Stride S	8	Smaller for more overlap, larger for fewer patches
EMA α	0.3	Sweep {0.1, 0.3, 0.5, 0.7, 0.9} on small/noisy data
Epochs	100	Use early stopping to cut wasted compute
Loss	Arctangent	Switch to standard MAE if all horizons matter equally

When to use xPatch vs alternatives

No model is a universal answer. xPatch sits in a specific corner of the design space: low-latency, accuracy-competitive, supervised, point-forecast, multivariate. Here is how I think about choosing.

Need	Recommended approach	Why
Fastest training/inference, good accuracy	xPatch	Beats CARD, ~5× faster than CARD per training step
Foundation model / zero-shot	TimesFM, Chronos, Moirai	Pretrained at scale, generalize across domains without fine-tuning
Calibrated uncertainty estimates	Gaussian processes	Native posterior variances, principled credible intervals
Long-context attention reasoning	PatchTST, iTransformer	iTransformer when cross-channel relationships are essential; PatchTST (channel-independent) when context exceeds ~512 steps
Tabular-style features without temporal structure	XGBoost / LightGBM	If you can engineer good lag/window features, GBMs are hard to beat on tabular forecasting
Linear/stationary signal, minimal compute	DLinear, classical ARIMA	If the data is genuinely simple, simpler is better
High-throughput streaming infra	xPatch + Kafka time-series engine	Low-latency model fits well with streaming pipelines

For tuning the hyperparameters of any of these alternatives in a principled way, our note on Bayesian hyperparameter optimization is worth reading.

Limitations and open questions

xPatch is a strong paper, but no paper is perfect. The honest weak spots:

α is a hyperparameter, not learned. A natural extension is to make α differentiable (or even per-channel and per-timescale). The paper acknowledges this and leaves it for future work.
Datasets are relatively small. The largest is Traffic with 862 channels and ~17k timesteps. That is small compared to what foundation models like Chronos and TimesFM are pretrained on. xPatch’s behavior on much larger streams is untested in the paper.
Two streams = two forward passes. Inference is still fast, but a fused single-pass implementation would be even faster, and might be feasible with a careful architectural redesign.
Point forecasts only. xPatch produces a single-trajectory forecast with no probabilistic interpretation. For risk-sensitive applications — finance, energy, healthcare — you want quantiles or distributions, which xPatch does not natively provide. You would need to bolt on a quantile head or wrap it in a Bayesian framework.
Benchmark saturation. The community has been honest that ETTh, Weather, and the related benchmarks are showing signs of saturation, so gains of 2-3% may not transfer to messier real-world data with more drift, missing values, and concept shift. xPatch’s results are state of the art on these benchmarks; whether they generalize to a finance trading desk’s tick data is an empirical question.
No theoretical analysis. The paper is empirical. There is no guarantee about generalization, no convergence proof for the recursion, no analysis of the loss landscape. That is fine for an applied paper but leaves room for follow-up theory.
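
On the point-forecast limitation listed above: one common way to get quantiles out of a point forecaster is to train additional output heads with the pinball (quantile) loss. The sketch below shows only that loss for one quantile level. It is not part of xPatch or its repo, and how the extra heads would be attached to the model is left open.

```python
def pinball_loss(pred, target, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie in (0, 1)")
    total = 0.0
    for p, y in zip(pred, target):
        diff = y - p
        # Under-prediction (diff > 0) costs q * diff;
        # over-prediction (diff < 0) costs (1 - q) * |diff|.
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(pred)


# For q = 0.9, predicting too low is penalized nine times as much as too high.
under = pinball_loss([0.0], [1.0], q=0.9)  # about 0.9
over = pinball_loss([1.0], [0.0], q=0.9)   # about 0.1
```

Training one head per quantile level (for example 0.1, 0.5, 0.9) with this loss yields a prediction interval rather than a single trajectory.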

Caution: If your application has heavy concept drift (e.g., post-COVID demand forecasting, regime-changing financial markets), benchmark gains do not automatically transfer. Always evaluate on your own data with a realistic backtest before believing the leaderboard.

What this means for the field

Step back from the architecture details and the broader story is more interesting:

Inductive biases keep winning. Decomposition (separating trend from seasonality) has been valuable since the 1950s, and it remains valuable in 2025. Patching, locality, and dual-specialization all encode useful priors. Brute-force attention without these priors is rarely the right move for time series.
Loss functions and LR schedules are underrated. The fact that arctangent loss and sigmoid LR transfer to other models suggests the field has been comparing architectures under suboptimal training. Future benchmark papers should probably standardize the training recipe before claiming architectural wins.
The Pareto frontier is the right axis. A model that is 1% more accurate but 10× slower may not be worth deploying. xPatch sits in the corner where accuracy is competitive and speed is meaningfully better. That is the right place to be for production systems.
Foundation models are not the only path forward. The same year that brought TimesFM and Chronos also brought xPatch, which is task-specific, small, fast, and competitive. Both styles will coexist; choose based on your deployment constraints.
Self-supervised pre-training is still on the menu. xPatch is fully supervised. There is an open question whether SSL pre-training of the CNN stream — analogous to what TS2Vec and similar methods do — would unlock further gains. Our overview of self-supervised pretraining covers the relevant techniques.

For a quick reminder of the statistical foundations that all of these models stand on (independence, the role of variance, why sample sizes matter for stable estimators), see our explainer on the Central Limit Theorem. And if you are about to put a forecasting model into production, the data layer matters too — our comparison of databases for preprocessed time series walks through the trade-offs.

Frequently asked questions

Why does a non-transformer model outperform PatchTST?

Three reasons stack: (1) the EMA decomposition gives the model two cleaner sub-signals instead of one mixed signal, (2) the dual-stream architecture matches the right tool to each component (linear for the smooth trend, CNN for the bursty seasonal), and (3) the arctangent loss and sigmoid LR schedule give a free training-side boost. PatchTST does have channel-independent attention and learned patch embeddings, but it asks one stack of attention layers to handle both trend and seasonal at once. xPatch’s specialization wins on average by 2.46% MSE while running about 4.8× faster than CARD.

Should I use xPatch or PatchTST in production?

Default to xPatch unless you have a specific reason not to. It is faster to train, faster to infer, slightly more accurate on the standard benchmarks, and easier to debug because the streams are interpretable. Consider PatchTST if you need a look-back longer than 96 steps and want attention’s global receptive field across patches. Note that PatchTST is channel-independent, so it is not a fix for heavily channel-correlated data; if cross-channel mixing is essential, a model with explicit channel attention such as CARD is the more natural alternative.

How do I tune the EMA alpha parameter?

Start with α = 0.3, which is optimal for the largest paper benchmarks (Weather, Traffic, Electricity). For smaller or noisier datasets, sweep {0.1, 0.3, 0.5, 0.7, 0.9} on a held-out validation split. Smaller α produces smoother trends (good when noise is dominant); larger α produces more reactive trends (good when regime changes are abrupt). The paper deliberately keeps α non-learnable; making it learnable is a sensible research extension.
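To make the smooth-versus-reactive trade-off concrete, here is a minimal plain-Python sketch of an EMA trend/seasonal split and the suggested α sweep. It assumes the usual recursion s_t = α·x_t + (1 − α)·s_{t−1} with s_0 = x_0 and takes the seasonal part as the residual x − s; the helper names, the toy series, and the roughness measure are illustrative choices, not taken from the paper. In practice you would pick α by validation forecast error, not by roughness.

```python
import math


def ema_decompose(x, alpha):
    """Split x into an EMA trend and a residual 'seasonal' component."""
    trend = [x[0]]  # assumption: initialize the recursion with the first value
    for value in x[1:]:
        trend.append(alpha * value + (1.0 - alpha) * trend[-1])
    seasonal = [v - t for v, t in zip(x, trend)]
    return trend, seasonal


def roughness(series):
    """Mean absolute first difference: lower means a smoother curve."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)


# Toy series: a slow ramp plus a period-8 oscillation.
series = [0.05 * t + math.sin(2 * math.pi * t / 8) for t in range(64)]

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
trend_roughness = {a: roughness(ema_decompose(series, a)[0]) for a in alphas}

for a in alphas:
    print(f"alpha={a:.1f}  trend roughness={trend_roughness[a]:.3f}")
```

On this toy series the trend gets rougher as α increases, which matches the rule of thumb above: small α filters out more of the oscillation, large α lets the trend follow the raw signal closely.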

What is the arctangent loss and why does it help?

It replaces the standard MSE/MAE loss with a horizon-weighted MAE where the weights follow ρ(i) = −arctan(i) + π/4 + 1. The weight starts at ρ(1) = 1 and decays toward a floor of 1 − π/4 ≈ 0.21, which is much slower than the decay weighting CARD uses, so distant horizons are down-weighted without being silenced and near horizons do not dominate the gradient. The result is a more uniform learning signal across all forecast horizons. Empirically, the loss helps not just xPatch but also other models (PatchTST, CARD), which makes it a plausible drop-in upgrade for other forecasting pipelines.
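A small numeric sketch of the weighting, in plain Python. The weight formula is the one quoted above; indexing the horizon from i = 1 and averaging over the horizon are assumptions made for this illustration, and the function names are hypothetical. Check the official repository for the exact reduction used in training.

```python
import math


def arctan_weight(i):
    """rho(i) = -arctan(i) + pi/4 + 1 for horizon step i = 1, 2, ..."""
    return -math.atan(i) + math.pi / 4 + 1.0


def arctan_weighted_mae(y_true, y_pred):
    """Horizon-weighted MAE: step i contributes rho(i) * |error_i|."""
    total = 0.0
    for i, (y, p) in enumerate(zip(y_true, y_pred), start=1):
        total += arctan_weight(i) * abs(y - p)
    return total / len(y_true)


# Weights at a few horizon steps: they fall from 1.0 toward 1 - pi/4.
weights = [arctan_weight(i) for i in (1, 2, 10, 96, 720)]
floor = 1.0 - math.pi / 4

# Toy forecast with the same absolute error (0.5) at every step.
loss = arctan_weighted_mae([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])
```

Because the weights never drop below roughly 0.21, an error at step 720 still contributes about a fifth as much as the same error at step 1.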

Does xPatch support multivariate forecasting?

Yes. The architecture is designed for multivariate inputs. The depthwise convolution in the CNN stream operates per-channel (groups = N), and the pointwise convolution mixes across channels. The linear stream processes each channel through the same weights but maintains the channel dimension. The paper evaluates on datasets with up to 862 channels (Traffic) without modification.

External references

xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition — Stitsyuk & Choi, AAAI 2025 (arXiv:2412.17323).
Official xPatch implementation on GitHub.
PatchTST: A Time Series is Worth 64 Words — Nie et al., the patching baseline xPatch dethrones.
Are Transformers Effective for Time Series Forecasting? (DLinear) — Zeng et al., the paper that started the back-to-basics thread.
CARD — Wang et al., the previous SOTA xPatch is benchmarked against.

This article is for informational and educational purposes only. It summarizes a publicly available academic paper and is not a substitute for reading the original. Implementation details should be verified against the official repository before production use.
