1. Introduction: Why Time Series Forecasting Matters More Than Ever
Time series forecasting — the art and science of predicting future values based on historical patterns — has quietly become one of the most consequential applications of artificial intelligence. From predicting stock market movements and energy demand to forecasting supply chain bottlenecks and patient hospital admissions, accurate time series predictions can mean the difference between billions in profit and catastrophic losses.
Yet for decades, the field was dominated by classical statistical methods like ARIMA (AutoRegressive Integrated Moving Average), Exponential Smoothing, and Prophet. These methods, while reliable and interpretable, struggled with the complexity of modern datasets: thousands of interrelated variables, irregular sampling intervals, and the need to generalize across entirely different domains without retraining.
That changed dramatically between 2023 and 2026. A wave of innovation — driven by the same transformer architectures powering ChatGPT and other large language models — swept through the time series community. The result is a new generation of models that can forecast with remarkable accuracy, often with zero or minimal fine-tuning on the target data.
In this comprehensive guide, we will explore the latest and most impactful time series forecasting models, explain how they work in plain language, compare their strengths and weaknesses, and provide practical guidance for choosing the right model for your use case. Whether you are a data scientist, a quantitative investor, or a business leader trying to understand the technology, this article will give you the knowledge you need.
2. The Evolution from Statistical to Deep Learning Models
To appreciate the significance of the latest models, it helps to understand the journey that brought us here. Time series forecasting has evolved through several distinct eras, each building on the limitations of its predecessor.
2.1 The Classical Era (1970s-2010s): ARIMA, ETS, and Prophet
The workhorse of time series forecasting for nearly half a century was the ARIMA family of models. Developed by Box and Jenkins in the 1970s, ARIMA models decompose a time series into autoregressive (AR) components, integrated (differencing) components, and moving average (MA) components. They work beautifully for univariate, stationary time series with clear patterns.
Exponential Smoothing (ETS) offered a complementary approach, assigning exponentially decreasing weights to older observations. Facebook’s Prophet (released in 2017) made time series accessible to non-specialists by automatically handling seasonality, holidays, and trend changes.
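The exponentially decaying weights at the heart of ETS reduce to a one-line recursion: each smoothed value blends the newest observation with the previous smoothed value. A minimal Python sketch of simple exponential smoothing (the smoothing factor alpha and the sample series are illustrative):

```python
def exponential_smoothing(values, alpha=0.3):
    """Simple exponential smoothing: s_t = alpha * y_t + (1 - alpha) * s_{t-1}.

    An observation k steps old effectively receives weight alpha * (1 - alpha)^k,
    so older data decays exponentially. The last smoothed value doubles as the
    one-step-ahead forecast.
    """
    smoothed = [values[0]]  # initialize with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 13.0, 12.5]
forecast = exponential_smoothing(series)[-1]  # approx 11.73
```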
However, all of these methods share fundamental limitations: they are univariate (or handle multivariate data awkwardly), they require manual feature engineering, and they must be trained separately for each time series. If you have 10,000 product SKUs to forecast, you need 10,000 separate models.
2.2 The Early Deep Learning Era (2017-2022): DeepAR, N-BEATS, and Temporal Fusion Transformer
Deep learning entered the time series arena with Amazon’s DeepAR (2017), which used recurrent neural networks (RNNs) to produce probabilistic forecasts across related time series. N-BEATS (2019) from Element AI showed that pure deep learning architectures could beat statistical ensembles on the M4 competition benchmark, a prestigious forecasting competition.
The Temporal Fusion Transformer (TFT), published by Google in 2021, combined attention mechanisms with gating layers to handle multiple input types (static metadata, known future inputs, and observed past values). TFT became one of the most popular deep learning forecasting models, offering both accuracy and interpretability through its attention weights.
Despite these advances, these models still required substantial training data from the target domain and significant computational resources to train. They were not “general-purpose” forecasters.
2.3 The Foundation Model Era (2023-2026): Zero-Shot Forecasting
The breakthrough came when researchers applied the “foundation model” paradigm — pre-training on massive, diverse datasets and then applying the model to new tasks without fine-tuning — to time series data. Just as GPT-3 could answer questions about topics it was never explicitly trained on, these new models can forecast time series they have never seen before.
This paradigm shift was enabled by three key insights:
- Tokenization of time series: Converting continuous numerical values into discrete tokens (similar to how text is tokenized for language models) allows transformer architectures to process time series data effectively.
- Cross-domain pre-training: Training on hundreds of thousands of diverse time series (energy, finance, weather, retail, healthcare) teaches the model general patterns like seasonality, trends, and level shifts that transfer across domains.
- Scaling laws apply: Larger models trained on more data consistently produce better forecasts, following the same scaling behavior observed in large language models.
3. Foundation Models for Time Series: The 2024-2026 Revolution
Foundation models represent the most exciting development in time series forecasting. These models are pre-trained on vast collections of time series data and can generate forecasts for entirely new datasets without any task-specific training. Here are the most important ones.
3.1 Amazon Chronos
Released by Amazon Science in March 2024, Chronos is a family of pre-trained probabilistic time series forecasting models based on the T5 (Text-to-Text Transfer Transformer) architecture. What makes Chronos unique is its approach to tokenization: it converts real-valued time series into a sequence of discrete tokens using scaling and quantization, then trains a language model to predict the next token in the sequence.
How It Works
Chronos treats time series forecasting as a language modeling problem. Given a sequence of historical values [v1, v2, …, vT], the model:
- Scales the values using mean absolute scaling to normalize different magnitudes
- Quantizes the scaled values into a fixed vocabulary of bins (e.g., 4096 bins)
- Feeds the token sequence into a T5 encoder-decoder transformer
- Generates future tokens autoregressively, which are then mapped back to real values
- Produces probabilistic forecasts by sampling multiple trajectories
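The scaling and quantization steps above can be sketched in a few lines of NumPy. This is an illustration of the idea, not the actual Chronos implementation: the uniform binning scheme, the bin count, and the clipping limit here are assumptions for demonstration.

```python
import numpy as np

def tokenize(series, num_bins=4096, limit=15.0):
    """Mean-absolute scaling followed by uniform quantization into bin ids."""
    series = np.asarray(series, dtype=float)
    scale = np.mean(np.abs(series))                 # mean absolute scaling
    scaled = series / scale
    edges = np.linspace(-limit, limit, num_bins + 1)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, num_bins - 1)
    return tokens, scale

def detokenize(tokens, scale, num_bins=4096, limit=15.0):
    """Map bin ids back to bin centers and undo the scaling."""
    edges = np.linspace(-limit, limit, num_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale

series = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0]
tokens, scale = tokenize(series)        # discrete vocabulary for the transformer
recovered = detokenize(tokens, scale)   # close to the original values
```

Forecasting then reduces to next-token prediction over this vocabulary; sampling many token trajectories and de-tokenizing them yields the probabilistic forecast.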
Key Strengths
- Zero-shot capability: Performs competitively with models trained specifically on the target dataset
- Multiple model sizes: Available in Tiny (8M), Mini (20M), Small (46M), Base (200M), and Large (710M) parameter variants
- Data augmentation: Uses synthetic data generated by Gaussian processes during pre-training to improve robustness
- Open source: Fully available on Hugging Face under Apache 2.0 license
Benchmark Results
On the extensive benchmark of 27 datasets compiled by the Chronos team, the Large model achieved the best aggregate zero-shot performance, outperforming task-specific models like DeepAR and AutoARIMA on many datasets. On the widely-used Monash Forecasting Archive, Chronos ranked first or second on the majority of datasets.
3.2 Google TimesFM
TimesFM (Time Series Foundation Model) was released by Google Research in February 2024. Unlike Chronos, which adapts a language model architecture, TimesFM was designed from scratch specifically for time series forecasting. It uses a decoder-only transformer architecture with a unique patched decoding approach.
How It Works
TimesFM introduces the concept of “input patches” — contiguous segments of the time series that are fed into the model as single tokens. Rather than processing one time step at a time, the model processes chunks of, say, 32 consecutive values as a single input patch. This dramatically reduces sequence length and allows the model to capture longer-range dependencies.
The key innovation is variable output patch lengths: during training, the model learns to output predictions at different granularities (e.g., 1 step, 16 steps, or 128 steps at a time), which gives it flexibility at inference time to handle arbitrary forecast horizons efficiently.
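Input patching itself is just a reshape. A sketch (the patch length of 32 matches the example above; nothing here is the actual TimesFM code):

```python
import numpy as np

def make_patches(series, patch_len=32):
    """Split a series into contiguous, non-overlapping input patches.

    Each patch is later embedded as a single token, so a 512-step history
    becomes a 16-token sequence instead of a 512-token one.
    """
    series = np.asarray(series)
    n_patches = len(series) // patch_len
    trimmed = series[:n_patches * patch_len]  # drop any ragged tail
    return trimmed.reshape(n_patches, patch_len)

history = np.arange(512, dtype=np.float32)
patches = make_patches(history)  # shape (16, 32)
```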
Key Strengths
- 200M parameters: Trained on a massive corpus of 100 billion time points from Google Trends, Wiki Pageviews, and synthetic data
- Handles variable horizons: A single model can forecast 1 step ahead or 1000 steps ahead without retraining
- Point and probabilistic forecasts: Provides both median forecasts and prediction intervals
- Very fast inference: The patched architecture makes it significantly faster than autoregressive models at long horizons
Benchmark Results
Google’s benchmarks show TimesFM achieving state-of-the-art zero-shot performance on the Darts, Monash, and Informer benchmarks, often matching or exceeding supervised baselines that were trained on the target data. It was particularly strong on long-horizon forecasting tasks (96 to 720 steps ahead).
3.3 Salesforce Moirai
Moirai (released by Salesforce AI Research in February 2024) takes yet another approach. It is built on a masked encoder architecture and is designed as a universal forecasting transformer that handles multiple frequencies, prediction lengths, and variable counts within a single model.
How It Works
Moirai’s key innovation is the Any-Variate Attention mechanism. Traditional transformers process multivariate time series by either flattening all variables into one sequence (which loses variable identity) or processing each variable independently (which misses cross-variable relationships). Moirai’s Any-Variate Attention allows the model to dynamically attend to any combination of variables and time steps, regardless of how many variables are present.
The model also uses multiple input/output projection layers for different data frequencies (minutely, hourly, daily, weekly, etc.), allowing a single model to handle data at any sampling rate.
Key Strengths
- True multivariate forecasting: Unlike Chronos and TimesFM (which are primarily univariate), Moirai natively handles multivariate time series
- Frequency-agnostic: A single model works across different sampling frequencies
- Three model sizes: Small (14M), Base (91M), and Large (311M) parameters
- Pre-trained on LOTSA: The Large-scale Open Time Series Archive, a curated collection of 27 billion observations across 9 domains
3.4 Nixtla TimeGPT
TimeGPT-1, developed by Nixtla, was actually one of the earliest time series foundation models (first announced in October 2023). Unlike the open-source models above, TimeGPT is offered as a commercial API service, similar to how OpenAI offers GPT access.
How It Works
TimeGPT uses a proprietary transformer-based architecture trained on over 100 billion data points from publicly available datasets spanning finance, weather, energy, web traffic, and more. The exact architecture details are not fully published, but the model follows an encoder-decoder design with attention mechanisms optimized for temporal patterns.
Key Strengths
- Easiest to use: Simple API call — no model loading, no GPU required
- Fine-tuning support: Can be fine-tuned on your data through the API for improved performance
- Anomaly detection: Built-in anomaly detection capabilities alongside forecasting
- Conformal prediction intervals: Statistically rigorous uncertainty quantification
4. Transformer-Based Architectures That Changed the Game
Beyond the foundation models, several transformer-based architectures have pushed the boundaries of supervised time series forecasting. These models require training on your specific dataset but often achieve the highest accuracy when sufficient training data is available.
4.1 PatchTST (Patch Time Series Transformer)
Published at ICLR 2023 by researchers from Princeton and IBM, PatchTST introduced two simple but powerful ideas that dramatically improved transformer performance on time series data.
The Two Key Innovations
Patching: Instead of feeding individual time steps as tokens to the transformer (which creates very long sequences for high-frequency data), PatchTST divides the time series into fixed-length patches (e.g., segments of 16 consecutive values). Each patch becomes a single token, reducing sequence length by a factor of 16 and allowing the attention mechanism to capture much longer-range dependencies within the same computational budget.
Channel Independence: Rather than mixing all variables together (which often confuses the model), PatchTST processes each variable independently through a shared transformer backbone. This counterintuitive design choice turned out to be remarkably effective, as it prevents the model from overfitting to spurious cross-variable correlations in the training data.
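Both ideas show up purely in how the input is reshaped before it reaches the transformer backbone. The sketch below is illustrative (patch length and stride are example values, not necessarily the paper's settings):

```python
import numpy as np

def patchtst_tokens(x, patch_len=16, stride=8):
    """Turn a (seq_len, n_vars) series into per-channel patch tokens.

    Channel independence: every variable is patched separately, giving an
    array of shape (n_vars, n_patches, patch_len); the same transformer
    backbone is then applied to each channel's token sequence.
    """
    seq_len, n_vars = x.shape
    starts = range(0, seq_len - patch_len + 1, stride)  # overlapping patches
    return np.stack([
        np.stack([x[s:s + patch_len, v] for s in starts])
        for v in range(n_vars)
    ])

x = np.random.randn(96, 7)     # 96 time steps, 7 variables
tokens = patchtst_tokens(x)    # shape (7, 11, 16)
```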
Why It Matters
PatchTST demonstrated that transformers can excel at time series forecasting when the tokenization strategy is right. Prior to PatchTST, several papers (notably “Are Transformers Effective for Time Series Forecasting?” by Zeng et al., 2023) had argued that simple linear models outperform transformers on long-term forecasting. PatchTST comprehensively refuted this claim, achieving state-of-the-art results on all major benchmarks at the time.
4.2 iTransformer
Published at ICLR 2024 by researchers from Tsinghua University and Ant Group, iTransformer (Inverted Transformer) takes a radically different approach to applying transformers to multivariate time series.
The Inversion Idea
In a standard transformer for time series, each token represents a time step across all variables. The attention mechanism then captures relationships between different time steps. iTransformer inverts this: each token represents an entire variable’s history, and the attention mechanism captures relationships between different variables.
Concretely, if you have a multivariate time series with 7 variables and 96 historical time steps:
- Standard transformer: 96 tokens, each containing 7 values
- iTransformer: 7 tokens, each containing 96 values
This inversion allows the feed-forward layers to learn temporal patterns within each variable, while the attention mechanism learns cross-variable dependencies — a much more natural decomposition of the problem.
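The inversion itself is nothing more exotic than a transpose before the embedding layer; the architectural consequences come from what attention and the feed-forward layers then operate over. A shape-level sketch:

```python
import numpy as np

# A multivariate series: 96 historical time steps, 7 variables.
x = np.random.randn(96, 7)

# Standard transformer tokenization: one token per time step,
# so attention relates time steps to each other.
standard_tokens = x       # 96 tokens, each a 7-dim vector

# iTransformer tokenization: one token per variable, so attention
# relates variables to each other while the feed-forward layers
# model each variable's temporal pattern.
inverted_tokens = x.T     # 7 tokens, each a 96-dim vector
```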
Benchmark Results
iTransformer achieved state-of-the-art results on multiple long-term forecasting benchmarks including ETTh1, ETTh2, ETTm1, ETTm2, Weather, Electricity, and Traffic datasets. It showed particular strength on datasets with strong cross-variable correlations, where its inverted attention mechanism could exploit the relationships effectively.
4.3 TimeMixer
Published at ICLR 2024, TimeMixer introduces a multi-scale mixing architecture that decomposes time series at multiple temporal resolutions and mixes the resulting views together.
How It Works
TimeMixer operates on the insight that time series patterns exist at multiple scales: daily patterns, weekly patterns, monthly patterns, and so on. The model:
- Past Decomposable Mixing (PDM): Decomposes the historical data into multiple temporal resolutions using average pooling, then mixes seasonal and trend components across scales
- Future Multipredictor Mixing (FMM): Generates predictions at each scale independently, then combines them using learnable weights
This multi-scale approach is particularly effective for datasets with complex, multi-period seasonality (e.g., electricity consumption with daily, weekly, and annual patterns).
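The decomposition into coarser temporal resolutions can be sketched with repeated average pooling (the number of scales and the pooling factor here are illustrative):

```python
import numpy as np

def multiscale_views(series, n_scales=3, factor=2):
    """Build progressively coarser views of a series by average pooling.

    Scale 0 is the raw series; each subsequent scale halves the resolution
    (with factor=2), exposing slower patterns such as weekly trends on top
    of daily ones.
    """
    views = [np.asarray(series, dtype=float)]
    for _ in range(n_scales - 1):
        prev = views[-1]
        trimmed = prev[:len(prev) // factor * factor]   # drop ragged tail
        views.append(trimmed.reshape(-1, factor).mean(axis=1))
    return views

views = multiscale_views(np.arange(96.0))  # lengths 96, 48, 24
```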
5. Lightweight Models That Rival Deep Learning
Not every use case requires a billion-parameter model. Recent research has shown that well-designed lightweight models can match or even exceed the performance of complex transformer architectures, while being orders of magnitude faster to train and deploy.
5.1 TSMixer and TSMixer-Rev
TSMixer, published by Google Research in 2023, is an MLP-based (Multi-Layer Perceptron) architecture that uses only simple fully-connected layers and achieves competitive performance with transformer models. The key innovation is alternating time-mixing and feature-mixing operations:
- Time-mixing MLPs: Apply shared weights across variables to capture temporal patterns
- Feature-mixing MLPs: Apply shared weights across time steps to capture cross-variable relationships
TSMixer-Rev (Revised), published in early 2024, added reversible instance normalization to handle distribution shifts in time series data more effectively, further improving performance.
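The alternating pattern can be sketched at the shape level with plain matrix products. This illustrates only the two mixing directions; the real TSMixer interleaves MLPs with nonlinearities, normalization, and residual connections:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multivariate input: 24 time steps, 5 variables.
x = rng.standard_normal((24, 5))
w_time = rng.standard_normal((24, 24))  # time-mixing weights, shared across variables
w_feat = rng.standard_normal((5, 5))    # feature-mixing weights, shared across time steps

# Time-mixing: operate along the time axis (transpose in, transpose out).
x = (x.T @ w_time).T   # still (24, 5)
# Feature-mixing: operate along the variable axis.
x = x @ w_feat         # still (24, 5)
```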
Why Consider TSMixer
- 10-100x faster than transformer models to train
- Minimal memory footprint — runs on CPUs
- Competitive accuracy on most benchmarks
- Easy to understand, debug, and maintain
5.2 TiDE (Time-series Dense Encoder)
TiDE, also from Google Research (2023), is another MLP-based model that uses an encoder-decoder architecture with dense layers. It encodes the historical time series and covariates into a fixed-size representation, then decodes it into future predictions.
TiDE’s main advantage is its linear computational complexity with respect to both the lookback window and the forecast horizon. While transformers have quadratic complexity (O(n^2)) due to self-attention, TiDE’s MLP-based design scales linearly, making it practical for very long sequences and real-time applications.
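That linear scaling is visible in a shape-level sketch: the history is compressed once into a fixed-size representation and the entire horizon is produced in a single decode, so cost grows linearly in both lookback and horizon. All sizes below, and the single-layer encoder/decoder, are illustrative stand-ins for TiDE's deeper MLP blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
lookback, horizon, hidden = 96, 24, 32  # illustrative sizes

# Dense encoder and decoder weights.
w_enc = rng.standard_normal((lookback, hidden)) / np.sqrt(lookback)
w_dec = rng.standard_normal((hidden, horizon)) / np.sqrt(hidden)

history = rng.standard_normal(lookback)
encoding = np.maximum(history @ w_enc, 0.0)  # fixed-size representation (ReLU)
forecast = encoding @ w_dec                  # all 24 horizon steps at once
```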
6. Head-to-Head Comparison: Which Model Should You Use?
Choosing the right model depends on your specific requirements. The table below summarizes the key characteristics of each model discussed in this article.
| Model | Type | Zero-Shot | Multivariate | Open Source | Best For |
|---|---|---|---|---|---|
| Chronos | Foundation | Yes | No (univariate) | Yes | General-purpose, quick start |
| TimesFM | Foundation | Yes | No (univariate) | Yes | Long-horizon forecasting |
| Moirai | Foundation | Yes | Yes | Yes | Multivariate, mixed frequency |
| TimeGPT | Foundation | Yes | Yes | No (API) | Non-technical users, fast prototyping |
| PatchTST | Supervised | No | Yes (channel-ind.) | Yes | Long-term forecasting with training data |
| iTransformer | Supervised | No | Yes (native) | Yes | Cross-variable correlation datasets |
| TimeMixer | Supervised | No | Yes | Yes | Multi-scale seasonality |
| TSMixer | Supervised | No | Yes | Yes | Resource-constrained, fast training |
| TiDE | Supervised | No | Yes | Yes | Real-time, low-latency applications |
Decision Framework
Use the following decision framework to choose the right model for your situation:
Do you have training data for your specific use case?
- No (or very little): Use a foundation model (Chronos, TimesFM, or Moirai)
- Yes (substantial): Consider supervised models (PatchTST, iTransformer) for potentially higher accuracy
Do you need multivariate forecasting?
- Yes: Moirai (zero-shot) or iTransformer (supervised)
- No: Chronos or TimesFM for simplicity
Are you resource-constrained?
- Yes: TSMixer or TiDE (MLP-based, run on CPU)
- No: Any transformer-based model
Do you need interpretability?
- Yes: TFT (Temporal Fusion Transformer) remains the best choice for interpretable forecasting
- No: Choose based on accuracy
7. Practical Guide: Getting Started with Modern Time Series Models
Let us walk through how to get started with the two most accessible models: Chronos (for zero-shot forecasting) and PatchTST (for supervised forecasting).
7.1 Getting Started with Chronos
Chronos is available through the Hugging Face Transformers library, making it extremely easy to use:
```python
# Install dependencies:
#   pip install chronos-forecasting torch

import torch
import numpy as np
from chronos import ChronosPipeline

# Load the pre-trained model (choose: tiny, mini, small, base, large)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="auto",
    torch_dtype=torch.float32,
)

# Your historical data (a 1D tensor; lists and NumPy arrays also work)
historical_data = torch.tensor([
    112, 118, 132, 129, 121, 135, 148, 148, 136, 119,
    104, 118, 115, 126, 141, 135, 125, 149, 170, 170,
    158, 133, 114, 140,  # ... more data points
], dtype=torch.float32)

# Generate forecasts (12 steps ahead, 20 sample paths)
forecast = pipeline.predict(
    context=historical_data,
    prediction_length=12,
    num_samples=20,
)

# Get median forecast and prediction intervals
median_forecast = np.quantile(forecast[0].numpy(), 0.5, axis=0)
lower_bound = np.quantile(forecast[0].numpy(), 0.1, axis=0)
upper_bound = np.quantile(forecast[0].numpy(), 0.9, axis=0)

print("Median forecast:", median_forecast)
print("80% prediction interval:", lower_bound, "to", upper_bound)
```
That is it — no training, no feature engineering, no hyperparameter tuning. The model works out of the box on any univariate time series.
7.2 Key Libraries and Frameworks
The time series ecosystem has several excellent frameworks that implement many of these models under a unified API:
- NeuralForecast (Nixtla): Implements PatchTST, iTransformer, TimeMixer, TiDE, TSMixer, and more under a scikit-learn-like API. Great for supervised models.
- GluonTS (Amazon): Production-grade framework for probabilistic time series modeling. Includes DeepAR, TFT, and integrates with Chronos.
- Darts (Unit8): User-friendly library supporting both classical (ARIMA, ETS) and deep learning models. Good for beginners.
- UniTS: A unified framework for training and evaluating time series foundation models.
8. Investment and Business Implications
The rapid advancement in time series forecasting models has significant implications for investors and businesses across multiple sectors.
8.1 Companies Leading the Charge
Several publicly traded companies are at the forefront of time series AI development and deployment:
- Amazon (AMZN): Developer of Chronos, DeepAR, and GluonTS. Uses time series forecasting extensively in supply chain optimization and demand forecasting across its retail operations.
- Google/Alphabet (GOOGL): Developer of TimesFM, TiDE, TSMixer, and the original Temporal Fusion Transformer. Applies these models in Google Cloud’s Vertex AI forecasting service.
- Salesforce (CRM): Developer of Moirai and other AI research. Integrates forecasting capabilities into its CRM and analytics products.
- Palantir (PLTR): Uses advanced time series models in its Foundry platform for defense, healthcare, and commercial forecasting applications.
- Snowflake (SNOW): Offers time series forecasting as part of its Cortex AI capabilities within the data cloud platform.
8.2 Industries Being Transformed
| Industry | Application | Impact |
|---|---|---|
| Energy | Demand forecasting, renewable output prediction | 10-30% reduction in forecasting error |
| Finance | Volatility modeling, risk assessment, algorithmic trading | Improved risk-adjusted returns |
| Retail | Demand forecasting, inventory optimization | 15-25% reduction in stockouts |
| Healthcare | Patient admissions, resource planning | Better capacity planning, fewer bottlenecks |
| Manufacturing | Predictive maintenance, quality control | 20-40% reduction in unplanned downtime |
8.3 ETFs and Investment Vehicles
For investors interested in gaining exposure to the AI and data analytics companies driving time series forecasting innovation, consider these ETFs:
- Global X Artificial Intelligence & Technology ETF (AIQ): Broad exposure to AI companies including cloud providers
- iShares Exponential Technologies ETF (XT): Includes companies at the intersection of AI, big data, and cloud computing
- ARK Autonomous Technology & Robotics ETF (ARKQ): Focuses on companies leveraging AI for automation
- First Trust Cloud Computing ETF (SKYY): Cloud infrastructure providers that host and serve these models
9. Conclusion: The Future of Time Series Forecasting
The time series forecasting landscape has undergone a remarkable transformation in just a few years. We have moved from a world where every forecasting problem required building a custom model from scratch to one where pre-trained foundation models can generate competitive forecasts out of the box, across domains they have never seen before.
Here are the key takeaways from our exploration:
Foundation models are the most important development. Chronos, TimesFM, Moirai, and TimeGPT represent a paradigm shift comparable to what GPT did for natural language processing. They democratize forecasting by making state-of-the-art predictions accessible without deep machine learning expertise.
Transformers have proven their worth for time series. After initial skepticism about whether transformers could outperform simple linear models, architectures like PatchTST, iTransformer, and TimeMixer have conclusively demonstrated that transformer-based models excel at capturing complex temporal patterns when designed with the right inductive biases.
Lightweight models should not be overlooked. TSMixer and TiDE show that well-designed MLP architectures can match transformer performance at a fraction of the computational cost. For production systems where latency and resource efficiency matter, these models are invaluable.
The field is still rapidly evolving. New models and architectures continue to emerge at a remarkable pace. The integration of time series capabilities into multimodal foundation models (combining text, images, and time series) is an active area of research that could unlock even more powerful forecasting capabilities in the coming years.
For practitioners, the recommended approach is clear: start with a foundation model like Chronos for a quick zero-shot baseline, then experiment with supervised models if more accuracy is needed, and consider lightweight alternatives for production deployment. The barrier to entry for world-class time series forecasting has never been lower.
References
- Ansari, A. F., et al. (2024). “Chronos: Learning the Language of Time Series.” Amazon Science. arXiv:2403.07815
- Das, A., et al. (2024). “A Decoder-Only Foundation Model for Time-Series Forecasting.” Google Research. arXiv:2310.10688
- Woo, G., et al. (2024). “Unified Training of Universal Time Series Forecasting Transformers.” Salesforce AI Research. arXiv:2402.02592
- Garza, A. and Mergenthaler-Canseco, M. (2023). “TimeGPT-1.” Nixtla. arXiv:2310.03589
- Nie, Y., et al. (2023). “A Time Series is Worth 64 Words: Long-term Forecasting with Transformers.” ICLR 2023. arXiv:2211.14730
- Liu, Y., et al. (2024). “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.” ICLR 2024. arXiv:2310.06625
- Wang, S., et al. (2024). “TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting.” ICLR 2024. arXiv:2405.14616
- Chen, S., et al. (2023). “TSMixer: An All-MLP Architecture for Time Series Forecasting.” Google Research. arXiv:2303.06053
- Das, A., et al. (2023). “Long-term Forecasting with TiDE: Time-series Dense Encoder.” Google Research. arXiv:2304.08424
- Lim, B., et al. (2021). “Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.” International Journal of Forecasting. arXiv:1912.09363