Apache Iceberg vs Delta Lake vs Hudi: Choosing a Table Format

By kongastral

Published June 13, 2026 · 26 min read

Open table formats turned commodity object storage into a transactional database layer, and the choice among the three principal implementations — Apache Iceberg, Delta Lake, and Apache Hudi — is one of the foundational decisions in a modern lakehouse. The decision is consequential because the table format governs how data is written, updated, queried, and shared across every engine that touches the storage layer. It is also a decision that has changed character over the past two years. Where engineers once treated the choice as a long-term lock-in, the three formats have begun to converge toward interoperability, so that the question is increasingly about defaults and operational fit rather than permanent commitment.

This article examines what a table format is and why it became necessary, how each of the three formats is designed internally, how they compare across the dimensions that matter in practice, and why the so-called format war is giving way to a model in which one physical dataset can be read as more than one format. It closes with concrete guidance on how to choose in 2026.

Summary

What this post covers: a comparison of Apache Iceberg, Delta Lake, and Apache Hudi as lakehouse table formats (their internal architecture, their multi-engine reach, and the convergence that is reshaping the selection decision), followed by practical guidance for choosing one in 2026.

Key insights:

A table format adds atomic commits, schema evolution, and time travel on top of plain Parquet files, replacing the fragile “directory of files” model with a transactional metadata layer.
Iceberg, Delta Lake, and Hudi differ most in their metadata models and their design centers of gravity: Iceberg favours engine-neutral interoperability, Delta Lake favours deep Spark and Databricks integration, and Hudi favours high-frequency streaming upserts.
Apache Iceberg has become the de-facto industry standard in 2026, adopted by every major cloud provider and query engine, owing to its vendor-neutral governance, partition evolution, and broad engine support.
The format war is ending: Databricks acquired Tabular for more than one billion dollars in 2024, Delta UniForm exposes Delta tables as Iceberg, Hudi can output Iceberg metadata, and Apache XTable translates metadata omni-directionally with no data copying.
For most new, multi-engine deployments Iceberg is the safe default; Delta Lake fits Databricks-centric stacks; Hudi fits upsert-heavy streaming ingestion; and XTable interoperability can serve consumers that expect different formats from one dataset.

Main topics: Why Table Formats Exist, Apache Iceberg, Delta Lake, Apache Hudi, A Head-to-Head Comparison, The Convergence Story, How to Choose in 2026.

Why Table Formats Exist

A data lake in its simplest form is a collection of files — most commonly Apache Parquet, a columnar file format — stored in an object store such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. A query engine reads those files and treats them collectively as a table. This arrangement is inexpensive and scalable, but the abstraction is weak. The storage layer knows only about files and directories; it has no concept of a table, a transaction, or a consistent snapshot. The problems that follow from this gap are what gave rise to table formats.

A table format is a specification and metadata layer that sits between a physical collection of data files and the query engines that read them, presenting that collection as a single logical table with database-like guarantees. The most important of those guarantees is ACID — atomicity, consistency, isolation, and durability — the set of properties that ensure a group of changes either fully applies or does not apply at all, that concurrent readers and writers do not observe partial state, and that committed data survives failures. Without a table format, the lake cannot offer these properties.

Consider what happens when a job appends a new day of data to a plain Parquet directory and fails halfway. Some files are written and some are not, and a reader that lists the directory during the failure observes an inconsistent table. The same fragility affects updates and deletes: changing a single record requires rewriting whole files, and there is no transactional boundary to make the change appear atomically. This is the “directory of files” problem — the table is defined by whatever files happen to be present when the engine lists the path, with no authoritative record of which files belong to a committed version.
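The failure mode described above can be reproduced with a toy in-memory model. This is only a sketch under stated assumptions: `ObjectStore`, `append_job`, and the two reader functions are hypothetical names invented for illustration, and the "commit" is a single Python assignment standing in for whatever atomic publish step a real table format uses.

```python
# Toy model: a "directory of files" reader vs. a reader that trusts a commit record.
# All names are hypothetical; no real table format API is used.

class ObjectStore:
    def __init__(self):
        self.files = {}      # path -> rows physically present in storage
        self.committed = []  # paths that belong to the last committed version

    def put(self, path, rows):
        self.files[path] = rows


def append_job(store, new_files, fail_after=None):
    """Write files one by one, then publish them all in a single step."""
    written = []
    for i, (path, rows) in enumerate(new_files.items()):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("job crashed mid-write")
        store.put(path, rows)
        written.append(path)
    # The commit is one assignment: readers see all new files or none.
    store.committed = store.committed + written


def list_directory_table(store):
    """Plain-Parquet view: whatever files happen to be present."""
    return sorted(store.files)


def committed_table(store):
    """Table-format view: only files named by the last commit."""
    return sorted(store.committed)


store = ObjectStore()
append_job(store, {"day=1/a.parquet": [1, 2]})
try:
    # Crash after the first of two files for day 2 has been written.
    append_job(store, {"day=2/a.parquet": [3], "day=2/b.parquet": [4]}, fail_after=1)
except RuntimeError:
    pass

print(list_directory_table(store))  # includes the orphaned day=2 file
print(committed_table(store))       # still only the day=1 file
```

The directory listing exposes the half-written day, while the commit-record view still shows the last consistent version.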

Three further capabilities are absent from the plain-Parquet model and are central to why table formats were created. Schema evolution is the ability to add, rename, drop, or reorder columns without rewriting historical data or breaking existing readers. Time travel is the ability to query the table as it existed at a previous point in time, which supports reproducibility, auditing, and rollback. And efficient query planning depends on metadata that records which files contain which ranges of values, so the engine can skip files that cannot contain matching rows.

Table formats deliver these capabilities through a metadata layer: a set of files that record the table’s schema, its partitioning, and, critically, the exact list of data files that constitute each committed version of the table. In Iceberg the central structure is the manifest, a metadata file that enumerates a group of data files together with statistics about their contents, such as the minimum and maximum value of each column within each file; Delta Lake and Hudi keep comparable per-file statistics in their own metadata structures. With this information the engine performs file pruning, reading only the files that could contain rows matching a query predicate. The three formats examined below implement these ideas differently, and those differences drive their respective strengths.
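File pruning from minimum and maximum statistics reduces to a range-overlap test. The sketch below assumes a single integer column and a hypothetical `FileStats` record; real formats store such statistics per column and in their own file layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: int
    max_value: int


def prune(files, lo, hi):
    """Keep files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


manifest = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]

# Query: WHERE x BETWEEN 150 AND 210
selected = prune(manifest, 150, 210)
print(selected)
```

Only the files whose recorded range can intersect the predicate are read. `f1.parquet` is skipped without being opened.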

Key Takeaway: A table format is the layer that turns a passive directory of Parquet files into a transactional table with atomic commits, schema evolution, and time travel. The differences among Iceberg, Delta Lake, and Hudi are largely differences in how that metadata layer is structured and what workloads it was optimized for.

Apache Iceberg: Hierarchical Metadata and Engine Neutrality

Apache Iceberg was created to provide a table format that no single engine owns and that scales to very large tables without the planning bottlenecks of earlier approaches. Its defining characteristic is a hierarchical metadata architecture and a specification-first design philosophy. The format is defined by a written specification, and any engine that implements the specification can read and write the tables, which is the basis for Iceberg’s wide adoption.

The metadata hierarchy

An Iceberg table is described by a tree of metadata files. At the root is a metadata.json file (often called the table metadata file) that records the current schema, partition specification, and a pointer to the current snapshot. Each snapshot references a manifest list, a file that enumerates the manifest files belonging to that snapshot. Each manifest file, in turn, tracks a set of individual data files and stores column-level statistics for them, such as per-column lower and upper bounds and null counts (Apache Iceberg documentation; Dremio, as of 2026).

This hierarchy has two practical consequences. First, query planning is efficient even for tables with millions of files, because the engine reads the manifest list and the relevant manifests rather than listing the object store. Second, because the statistics are recorded in the manifests, the engine can prune files during planning — eliminating files whose recorded value ranges cannot satisfy the query predicate — without opening the data files. This is the same principle that motivates careful storage selection for analytical workloads, a topic explored in the discussion of choosing databases for preprocessed time-series data.
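The walk from the root metadata file down to data files can be sketched with plain dictionaries. The field names below are simplified for illustration and do not follow the Iceberg specification exactly; the point is only the shape of the traversal and the fact that an older snapshot remains addressable.

```python
# Simplified, in-memory stand-in for the Iceberg metadata tree.
# Field names are illustrative assumptions, not the specification's names.

table_metadata = {
    "current_snapshot_id": 2,
    "snapshots": {
        1: {"manifest_list": ["m1"]},
        2: {"manifest_list": ["m1", "m2"]},
    },
}

# Each manifest tracks data files plus per-file bounds for one column.
manifests = {
    "m1": [{"path": "a.parquet", "lower": 0, "upper": 49}],
    "m2": [{"path": "b.parquet", "lower": 50, "upper": 99}],
}


def plan_files(metadata, manifests, value, snapshot_id=None):
    """Walk root -> snapshot -> manifest list -> manifests, pruning by bounds."""
    sid = metadata["current_snapshot_id"] if snapshot_id is None else snapshot_id
    snapshot = metadata["snapshots"][sid]
    selected = []
    for manifest_name in snapshot["manifest_list"]:
        for entry in manifests[manifest_name]:
            if entry["lower"] <= value <= entry["upper"]:
                selected.append(entry["path"])
    return selected


current = plan_files(table_metadata, manifests, 75)
as_of_first = plan_files(table_metadata, manifests, 75, snapshot_id=1)
print(current, as_of_first)
```

Planning touches only metadata. No object-store listing is needed, and passing an earlier snapshot identifier yields the table as it stood before the second commit.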

Partition evolution and hidden partitioning

Two features distinguish Iceberg’s treatment of partitioning. Hidden partitioning means the table records how a column is transformed into partition values — for example, by truncating a timestamp to its day — so that queries filtering on the raw column automatically benefit from partition pruning without the writer or reader having to reference a separate partition column. Partition evolution means the partitioning scheme of a table can be changed over time without rewriting the historical data; new data is written under the new scheme while old data remains valid under the old one. This is significant operationally, because it removes one of the most expensive and disruptive migrations in traditional partitioned tables.
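Both ideas can be sketched together in a few lines. The model below is an assumption-laden toy: the spec identifiers, file names, and the `SPECS` mapping are hypothetical, and only two transforms (month and day) are shown. The query filters on the raw timestamp, and each file is checked against the partition spec that was in force when it was written.

```python
from datetime import date, datetime


def month_transform(ts: datetime) -> tuple:
    return (ts.year, ts.month)


def day_transform(ts: datetime) -> date:
    return ts.date()


# Spec 0 was the original scheme; spec 1 replaced it later without a rewrite.
SPECS = {0: month_transform, 1: day_transform}

# Each data file remembers which partition spec wrote it.
files = [
    {"path": "old-2025-12.parquet", "spec_id": 0, "partition": (2025, 12)},
    {"path": "new-2026-01-14.parquet", "spec_id": 1, "partition": date(2026, 1, 14)},
    {"path": "new-2026-01-15.parquet", "spec_id": 1, "partition": date(2026, 1, 15)},
]


def files_for_timestamp(files, ts):
    """Filter on the raw timestamp; the partition value is derived per spec."""
    return [f["path"] for f in files if SPECS[f["spec_id"]](ts) == f["partition"]]


recent = files_for_timestamp(files, datetime(2026, 1, 15, 9, 30))
historical = files_for_timestamp(files, datetime(2025, 12, 3, 0, 0))
print(recent, historical)
```

The caller never names a partition column, and files written under the old monthly scheme stay queryable alongside files written under the new daily scheme.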

Engine reach

Iceberg’s specification-first, engine-agnostic design has produced the broadest multi-engine support of the three formats. As of 2026 it is read and written by Apache Spark, Apache Flink, Trino, and DuckDB, and is supported as a table format by managed warehouses including Snowflake and Google BigQuery (RisingWave; Dremio; Onehouse comparison guides, as of 2026). This breadth is the principal reason Iceberg has become the common denominator across the industry, a point developed in the convergence section. A practical example of landing data into Iceberg from an operational source is described in the guide to building an InfluxDB-to-Iceberg data pipeline with Telegraf, and stream processors such as Flink commonly write into Iceberg as part of complex event processing pipelines.

Delta Lake: A Transaction Log Born at Databricks

Delta Lake originated at Databricks and was designed first and foremost to bring transactional reliability to data stored for use with Apache Spark. Its integration with Spark is the deepest of the three formats, reflecting that shared origin (Databricks; Dremio, as of 2026). Where Iceberg organizes metadata as a tree of manifest files, Delta Lake organizes it as an ordered transaction log.

The transaction log

A Delta table keeps a directory named _delta_log alongside its data files. Each successful commit writes a new JSON file to this directory, numbered sequentially, recording the actions that the commit performed — which data files were added and which were removed, along with metadata changes. The current state of the table is obtained by replaying these JSON commits in order. To prevent the replay from growing unbounded, Delta Lake periodically writes a checkpoint in Parquet format that captures the full table state up to a given commit. A reader then loads the most recent checkpoint and applies only the JSON commits that follow it, which keeps state reconstruction efficient.
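State reconstruction from a checkpoint plus later commits can be sketched as follows. The action shape is deliberately simplified (a real Delta commit carries richer add and remove records plus metadata actions), and `replay` is a hypothetical helper, not a Delta Lake API.

```python
def replay(checkpoint, commits):
    """Rebuild the live file set from a checkpoint plus later commits.

    checkpoint: (version, set_of_live_files)
    commits: {version: [{"add": path} or {"remove": path}, ...]}
    """
    checkpoint_version, live = checkpoint
    live = set(live)
    version = checkpoint_version
    for v in sorted(commits):
        if v <= checkpoint_version:
            continue  # already captured by the checkpoint
        for action in commits[v]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
        version = v
    return version, live


checkpoint = (10, {"part-0.parquet", "part-1.parquet"})
commits = {
    9: [{"add": "part-1.parquet"}],                                   # before checkpoint
    11: [{"add": "part-2.parquet"}],
    12: [{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}],
}

version, live = replay(checkpoint, commits)
print(version, sorted(live))
```

Commit 9 is never read because the checkpoint at version 10 already reflects it, which is why periodic checkpoints keep reader start-up cost bounded.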

The transaction log model gives Delta Lake straightforward time travel — a reader can request the table state as of any commit version — and reliable concurrency control through optimistic commits against the log. Because the log is the single ordered authority on table state, the semantics map cleanly onto Spark’s execution model, which is part of why Delta Lake remains the most natural choice within Spark-centric and Databricks-centric environments. Transformation workflows that run on top of such tables, for instance with dbt-based transformation pipelines, treat the table as the consistent input and output of each model run.
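Optimistic commits against an ordered log can be illustrated with a toy class. This is a sketch, assuming that a commit succeeds only if the next version number is still unclaimed; `Log` and `CommitConflict` are hypothetical names, and a real implementation would also decide whether a losing writer's changes actually conflict before retrying.

```python
class CommitConflict(Exception):
    pass


class Log:
    def __init__(self):
        self.entries = {}  # version -> list of actions

    def latest(self):
        return max(self.entries, default=-1)

    def try_commit(self, read_version, actions):
        """Succeed only if nobody else has written version read_version + 1."""
        target = read_version + 1
        if target in self.entries:
            raise CommitConflict(f"version {target} already exists")
        self.entries[target] = actions
        return target


log = Log()
log.try_commit(log.latest(), ["add a.parquet"])  # becomes version 0

# Two writers read the same table version before committing.
seen_by_a = log.latest()
seen_by_b = log.latest()

log.try_commit(seen_by_a, ["add b.parquet"])     # writer A claims version 1
try:
    log.try_commit(seen_by_b, ["add c.parquet"])  # writer B loses the race
    conflict = False
except CommitConflict:
    conflict = True

# Writer B re-reads the log and retries against the new latest version.
retry_version = log.try_commit(log.latest(), ["add c.parquet"])
print(conflict, retry_version)

# Time travel in this model is just reading the actions up to a chosen version.
as_of_v1 = [a for v in sorted(log.entries) if v <= 1 for a in log.entries[v]]
```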

UniForm and interoperability

For most of Delta Lake’s history its log-based metadata was readable only by Delta-aware engines, which limited reach relative to Iceberg. Delta UniForm addresses this directly. UniForm provides interoperability across Delta Lake, Iceberg, and Hudi by generating the metadata that other formats expect alongside the Delta log, and it supports the Iceberg REST catalog interface so that Iceberg clients can discover and read the tables (Databricks, as of 2026). In effect, a table written as Delta can be presented to an Iceberg reader without copying the underlying Parquet data. This is one of the developments that has softened the practical cost of choosing Delta in a mixed-engine environment, and it is examined further in the convergence section.

Apache Hudi: Built for Streaming Upserts

Apache Hudi was designed around a workload that the other two formats addressed only later: continuous ingestion of streaming data with frequent record-level updates. The name itself abbreviates “Hadoop Upserts Deletes and Incrementals,” which signals its orientation. As of 2026, Hudi offers the most mature tooling for pure streaming ingestion involving high-frequency upserts (Onehouse; lakeFS, as of 2026). An upsert is an operation that inserts a record if its key does not yet exist and updates the existing record if it does — the natural operation when a change-data-capture stream delivers a steady flow of inserts, updates, and deletes from an operational database. Feeds of this kind commonly arrive through change data capture with Debezium and Kafka or through a Kafka consumer that lands events into the lake.

Copy-on-Write and Merge-on-Read

Hudi offers two table types, and the choice between them is the central tuning decision when adopting the format. In a Copy-on-Write (CoW) table, an update rewrites the base file that contains the affected record, so every version of the data is materialized at write time. Reads are therefore fast and simple, because the engine reads finished base files, but writes are more expensive because they rewrite whole files. In a Merge-on-Read (MoR) table, an update is written to a delta log file associated with the base file rather than rewriting the base file immediately; readers merge the base file with its delta logs at query time. Writes are cheaper and lower-latency, which suits high-frequency upserts, at the cost of more work at read time until a background compaction merges the deltas into new base files.
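The trade-off between the two table types can be made concrete with a toy model in which one dictionary stands for one base file. The class names and the `files_rewritten` counter are illustrative assumptions; the sketch does not use Hudi's API.

```python
class CopyOnWriteTable:
    def __init__(self, rows):
        self.base = dict(rows)
        self.files_rewritten = 0

    def upsert(self, key, value):
        new_base = dict(self.base)  # rewrite the whole base file
        new_base[key] = value
        self.base = new_base
        self.files_rewritten += 1

    def read(self):
        return dict(self.base)      # readers see finished base files


class MergeOnReadTable:
    def __init__(self, rows):
        self.base = dict(rows)
        self.delta_log = []         # appended (key, value) updates
        self.files_rewritten = 0

    def upsert(self, key, value):
        self.delta_log.append((key, value))  # cheap append, no rewrite

    def read(self):
        merged = dict(self.base)
        for key, value in self.delta_log:    # merge at query time
            merged[key] = value
        return merged

    def compact(self):
        """Fold the delta log into a new base file."""
        self.base = self.read()
        self.delta_log = []
        self.files_rewritten += 1


cow = CopyOnWriteTable({"k1": 1})
mor = MergeOnReadTable({"k1": 1})
for table in (cow, mor):
    table.upsert("k1", 2)
    table.upsert("k2", 3)

print(cow.read() == mor.read(), cow.files_rewritten, mor.files_rewritten)
```

Both tables return the same rows, but the Copy-on-Write table paid for two rewrites at write time while the Merge-on-Read table deferred that cost to reads and to a later `compact()` call.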

Record-level indexing and table management

To apply an upsert, Hudi must locate the existing record for a given key quickly. It maintains record-level indexing for this purpose — a mapping from record keys to the files that contain them — so an incoming update can be routed to the correct file without scanning the whole table. This indexing is a meaningful part of why Hudi handles high-frequency update workloads efficiently. Hudi also includes built-in table management services: compaction merges MoR delta logs into base files, clustering reorganizes data to improve query locality, and cleaning removes obsolete file versions according to a retention policy. These services can run as part of the ingestion pipeline or as separate jobs, and orchestrating them alongside ingestion is a common use of a workflow scheduler such as Apache Airflow.
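The routing role of a record-level index can be sketched as a key-to-file mapping. `IndexedTable` and its file identifiers are hypothetical; Hudi's actual index implementations are more elaborate, and this toy only shows why an update does not require a table scan.

```python
class IndexedTable:
    def __init__(self):
        self.files = {}  # file id -> {record key: row}
        self.index = {}  # record key -> file id

    def upsert(self, key, row, new_file_id):
        """Route an update to the file that already holds the key, if any."""
        file_id = self.index.get(key)  # index lookup, no table scan
        if file_id is None:
            file_id = new_file_id      # unseen key: place it in a new file
            self.index[key] = file_id
        self.files.setdefault(file_id, {})[key] = row
        return file_id


table = IndexedTable()
first = table.upsert("user-1", {"plan": "free"}, new_file_id="f1")
second = table.upsert("user-2", {"plan": "free"}, new_file_id="f2")
update = table.upsert("user-1", {"plan": "pro"}, new_file_id="f3")
print(first, second, update)
```

The third call is an update to an existing key, so it lands in the file that already contains `user-1` and the proposed new file is never created.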

Hudi’s primary ingestion utility is Hudi Streamer, historically known as DeltaStreamer, which reads from sources such as Kafka and continuously applies inserts, updates, and deletes to a Hudi table. This tooling, together with record-level indexing and the MoR table type, is the basis for Hudi’s reputation in upsert-heavy streaming ingestion.

Native Iceberg output

Hudi has also moved toward interoperability. Tables created from Hudi version 0.14.0 onwards can be synced to Iceberg and/or Delta Lake through Apache XTable, and Hudi’s native Iceberg support lets a team use Hudi’s managed services — compaction, indexing, and Hudi Streamer ingestion — while outputting Iceberg-compatible tables for downstream consumers (Apache Hudi documentation, as of 2026). This means a pipeline can keep Hudi’s strengths on the write side while presenting Iceberg on the read side, a pattern that the convergence section places in context.

A Head-to-Head Comparison

The three formats overlap substantially in their core guarantees — all provide ACID transactions, schema evolution, and time travel — so the meaningful differences lie in their metadata models, their maturity for specific operations, and their ecosystem reach. The matrix below summarizes the breadth of engine support that distinguishes the formats, followed by a feature comparison and a workload mapping.

The grid is directional rather than a precise capability audit; engine support changes with each release, and warehouse vendors continue to add native read paths. The pattern it conveys is the one reported consistently across the comparison literature: Iceberg has the widest native reach, Delta Lake is strongest within Spark and reaches other engines primarily through UniForm, and Hudi is strong in Spark and Flink and reaches warehouses through translation. The detailed feature comparison follows.

Dimension	Apache Iceberg	Delta Lake	Apache Hudi
Design origin	Engine-neutral, specification-first	Databricks; deepest Spark integration	Streaming ingestion and upserts
Metadata model	Hierarchy: metadata.json → manifest list → manifests → data files	Ordered transaction log (_delta_log) + Parquet checkpoints	Timeline + base files and delta logs; record-level index
Upsert / CDC maturity	Supported; improving	Strong within Spark	Most mature for high-frequency upserts
Partition evolution	Yes, plus hidden partitioning	Limited	Limited
Engine support	Broadest: Spark, Flink, Trino, Snowflake, BigQuery, DuckDB	Spark-first; others via UniForm	Spark, Flink; warehouses via translation
Built-in table management	Via engine / catalog services	Via engine; optimize and vacuum	Built-in compaction, clustering, cleaning
Governance / catalog	Vendor-neutral; Iceberg REST catalog standard	Unity Catalog; UniForm exposes Iceberg REST	Hive / catalog integrations; XTable sync

The feature comparison clarifies that the formats are not interchangeable on every axis even though their core guarantees overlap. Partition evolution and hidden partitioning are genuine Iceberg differentiators; built-in table management is a genuine Hudi differentiator; and the deepest Spark integration remains a Delta Lake characteristic. The second table maps common workloads to the format that most naturally fits each, before interoperability is taken into account.

Workload	Best-fit format	Why
New multi-engine, vendor-neutral lakehouse	Iceberg	Broadest engine reach and neutral governance
Databricks-centric analytics and ML	Delta Lake	Deepest Spark integration; UniForm for outside reach
High-frequency CDC upserts from operational databases	Hudi	Record-level indexing, MoR, Hudi Streamer
Append-mostly analytical tables on a warehouse	Iceberg	Native support in Snowflake and BigQuery
One dataset consumed by several engines expecting different formats	Any + XTable / UniForm	Metadata translation avoids copying data

The Convergence Story

The most consequential change in this area is not a new capability in any single format but the erosion of the boundaries between them. Several developments, taken together, have moved the ecosystem from competition toward interoperability, and they explain why the choice of format is now less of a permanent commitment than it was.

The Databricks–Tabular acquisition

In a development that reframed the competitive landscape, Databricks acquired Tabular — the startup founded by the original creators of Apache Iceberg — for more than one billion dollars, a deal revealed on June 4, 2024 (Databricks blog; TechTarget, as of 2024-06-04). Because Databricks is the company behind Delta Lake, the acquisition brought the architects of the rival format inside the same organization and signaled an intent to support both formats rather than insist on one. It is widely read as the moment the “format war” framing began to lose force in favour of interoperability.

UniForm, Hudi output, and Apache XTable

Three technical mechanisms now allow a single physical dataset to be read as more than one format. Delta UniForm, described earlier, exposes Delta tables as Iceberg and supports the Iceberg REST catalog interface (Databricks, as of 2026). Hudi’s native Iceberg support lets a team manage a table with Hudi and serve it as Iceberg (Apache Hudi documentation, as of 2026). And Apache XTable — an incubating project backed by Microsoft, Google, and Onehouse — provides omni-directional metadata translation between all three formats: any format to any format, with no copying of the underlying data (Dremio; Onehouse, as of 2026). XTable works by generating the metadata that each target format expects, pointing it at the same Parquet files, so the cost of translation is metadata generation rather than data duplication.
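The idea that translation costs metadata generation rather than data duplication can be shown with a deliberately small sketch. Everything here is an assumption made for illustration: the bucket path is invented, and the two functions produce toy structures that only resemble a Delta-style log entry and an Iceberg-style manifest. They are not what XTable or UniForm actually emit.

```python
# One physical set of Parquet files (hypothetical paths).
data_files = [
    "s3://example-bucket/tbl/part-0.parquet",
    "s3://example-bucket/tbl/part-1.parquet",
]


def to_delta_like(files, version):
    """Toy log entry: one add action per data file."""
    return {"version": version, "actions": [{"add": f} for f in files]}


def to_iceberg_like(files, snapshot_id):
    """Toy snapshot: one manifest entry per data file."""
    return {"snapshot_id": snapshot_id, "manifest": [{"path": f} for f in files]}


delta_view = to_delta_like(data_files, 0)
iceberg_view = to_iceberg_like(data_files, 1)

# Both metadata views reference exactly the same files; nothing was copied.
delta_paths = [a["add"] for a in delta_view["actions"]]
iceberg_paths = [e["path"] for e in iceberg_view["manifest"]]
print(delta_paths == iceberg_paths)
```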

The cumulative effect of these developments is that Apache Iceberg has become the de-facto common denominator. It is the format that the other two can be exposed as, the one with the broadest native reach, and the one adopted by every major cloud provider and query engine (RisingWave; Dremio; Onehouse, as of 2026). The reasons most often cited are the same throughout the literature: vendor-neutral governance, partition evolution, and the widest multi-engine support.

Caution: Convergence reduces lock-in but does not eliminate operational specialization. Interoperability layers translate metadata, not behaviour; a workload that depends on Hudi’s record-level upsert path, for example, still benefits from writing as Hudi even if it is read as Iceberg. Translation is a serving convenience, not a substitute for choosing the right write path.

How to Choose in 2026

Given convergence, the practical decision reduces to selecting the format that best fits the write path and the operational center of gravity, while relying on interoperability to satisfy diverse readers. The decision tree below summarizes the guidance, and the discussion that follows expands on each branch.

Iceberg as the safe default

For a new deployment that aims to remain vendor-neutral and to be queried by several engines, Iceberg is the safe default. It carries the lowest risk of lock-in, offers partition evolution and hidden partitioning, and is natively supported across the widest range of engines and warehouses. Choosing Iceberg also aligns with the direction of the ecosystem, since the other formats can be exposed as Iceberg but the reverse arrangement is less central to current tooling. For teams running engines on Kubernetes, where portability is already a goal, the neutrality argument extends naturally to the storage layer; the operational considerations of running such engines are discussed in the guide to database connections from Kubernetes pods.

Delta Lake for Databricks-centric stacks

When the platform is built around Databricks and Spark, Delta Lake remains the natural choice. Its integration is the deepest available, its tooling and governance through Unity Catalog are mature, and UniForm now mitigates the historical drawback of limited external reach by exposing the same tables as Iceberg. A team already invested in Databricks gains little by writing Hudi or Iceberg directly and may give up integration depth by doing so.

Hudi for streaming upserts

For ingestion dominated by high-frequency upserts and change-data-capture streams, Hudi remains the strongest fit. Its record-level indexing, Merge-on-Read table type, and built-in compaction and cleaning were designed for exactly this pattern, and its Hudi Streamer utility provides a tested ingestion path. The native Iceberg output then allows the same data to be served to analytical consumers as Iceberg, combining Hudi’s write strengths with Iceberg’s read reach.

When to rely on interoperability instead

Some organizations do not need to pick a single winner. When one dataset must serve consumers that expect different formats — for example, a Spark-on-Databricks team reading Delta and a Trino team reading Iceberg — relying on XTable or UniForm to translate metadata is often preferable to maintaining duplicate copies of the data. The decision then shifts from “which format” to “which write path produces the data, and which translations are needed for the readers.” This framing is the clearest sign of how far the field has moved from the original format competition.

Tip: Choose the format that fits the write path, not the one that fits a single reader. The write path determines update efficiency and operational tooling, which interoperability layers cannot retrofit; readers in other formats can be served afterward through UniForm or XTable.
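The guidance in this section can be condensed into a small function. This is a sketch rather than a rule: the parameter names are hypothetical, and the precedence (upsert-heavy streaming first, then a Databricks-centric platform, then the Iceberg default) is an assumption about how the branches should be ordered when more than one applies.

```python
def choose_format(databricks_centric: bool,
                  upsert_heavy_streaming: bool,
                  mixed_format_readers: bool) -> dict:
    """Pick a write format first, then a translation layer for other readers."""
    if upsert_heavy_streaming:
        write_format = "Hudi"
    elif databricks_centric:
        write_format = "Delta Lake"
    else:
        write_format = "Iceberg"

    translation = None
    if mixed_format_readers:
        # UniForm applies to Delta tables; XTable translates among all three.
        translation = "UniForm" if write_format == "Delta Lake" else "XTable"
    return {"write_format": write_format, "translation": translation}


print(choose_format(False, False, False))
print(choose_format(True, False, True))
print(choose_format(False, True, True))
```

The function decides the write path before it considers readers, which mirrors the order of reasoning recommended in the tip above.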

Frequently Asked Questions

What is a lakehouse table format, and how does it differ from a file format like Parquet?

A file format such as Parquet defines how the bytes of a single data file are laid out. A table format is a metadata layer above many such files that presents them as one logical table with ACID transactions, schema evolution, and time travel. Parquet stores the data; the table format records which files belong to which committed version of the table and how to read them consistently.

Is Apache Iceberg replacing Delta Lake and Hudi?

Not exactly. Iceberg has become the de-facto common denominator that the other formats translate toward, and it is the safe default for new neutral deployments. Delta Lake and Hudi retain distinct strengths — deepest Spark integration and the most mature streaming-upsert tooling, respectively — and both can now expose their tables as Iceberg, so they continue to be used as write paths even where Iceberg is the serving format.

What is the difference between Hudi Copy-on-Write and Merge-on-Read?

Copy-on-Write rewrites the base file whenever a record is updated, which makes reads fast and writes heavier; it suits read-heavy tables with moderate update rates. Merge-on-Read appends updates to delta log files and merges them with the base file at query time, which makes writes light and low-latency; it suits high-frequency upserts, with background compaction periodically consolidating the deltas.

Does choosing one format lock an organization into one vendor or engine?

Less than it once did. Delta UniForm exposes Delta tables as Iceberg through the Iceberg REST catalog, Hudi can output Iceberg-compatible tables, and Apache XTable translates metadata among all three with no data copying. Iceberg in particular is vendor-neutral by design. Lock-in is now mainly a function of operational tooling and the write path rather than the storage format itself.

What did the Databricks acquisition of Tabular mean for the format landscape?

Databricks, the company behind Delta Lake, acquired Tabular — the startup founded by Iceberg’s original creators — for more than one billion dollars, with the deal revealed on June 4, 2024. Bringing Iceberg’s architects inside the Delta vendor signaled support for both formats and is widely read as the point at which the industry pivoted from a format competition toward interoperability.

References

Comparative characterizations of engine reach and format maturity reflect 2026 guidance from RisingWave, Dremio, Onehouse, and lakeFS; figures and dated claims are attributed inline.

Conclusion

The lakehouse table format is the layer that turns object storage into a transactional database surface, and the three principal implementations express three design priorities: Iceberg’s engine-neutral interoperability, Delta Lake’s depth within the Spark and Databricks ecosystem, and Hudi’s strength in high-frequency streaming upserts. Their core guarantees converge, but their metadata models and operational characters remain distinct enough to matter on the write path.

The decisive shift of the past two years is that selecting a format no longer means accepting permanent lock-in. The Databricks–Tabular acquisition, Delta UniForm, Hudi’s Iceberg output, and Apache XTable have made it possible for one physical dataset to be presented as more than one format, with Iceberg emerging as the de-facto common denominator. The practical recommendation that follows is straightforward: default to Iceberg for new, multi-engine, vendor-neutral deployments; choose Delta Lake within Databricks-centric stacks; choose Hudi for upsert-heavy streaming ingestion; and lean on interoperability layers when one dataset must serve consumers expecting different formats. The right choice is the one that fits the write path, because interoperability can serve the readers afterward.
