Open table formats turned commodity object storage into a transactional database layer, and the choice among the three principal implementations — Apache Iceberg, Delta Lake, and Apache Hudi — is one of the foundational decisions in a modern lakehouse. The decision is consequential because the table format governs how data is written, updated, queried, and shared across every engine that touches the storage layer. It is also a decision that has changed character over the past two years. Where engineers once treated the choice as a long-term lock-in, the three formats have begun to converge toward interoperability, so that the question is increasingly about defaults and operational fit rather than permanent commitment.
This article examines what a table format is and why it became necessary, how each of the three formats is designed internally, how they compare across the dimensions that matter in practice, and why the so-called format war is giving way to a model in which one physical dataset can be read as more than one format. It closes with concrete guidance on how to choose in 2026.
Summary
What this post covers: A comparison of Apache Iceberg, Delta Lake, and Apache Hudi as lakehouse table formats (their internal architecture, multi-engine reach, and the convergence that is reshaping the selection decision), with practical guidance for choosing one in 2026.
Key insights:
- A table format adds an atomicity, schema-evolution, and time-travel layer on top of plain Parquet files, replacing the fragile “directory of files” model with a transactional metadata layer.
- Iceberg, Delta Lake, and Hudi differ most in their metadata models and their design centers of gravity: Iceberg favours engine-neutral interoperability, Delta Lake favours deep Spark and Databricks integration, and Hudi favours high-frequency streaming upserts.
- Apache Iceberg has become the de-facto industry standard in 2026, adopted by every major cloud provider and query engine, owing to its vendor-neutral governance, partition evolution, and broad engine support.
- The format war is ending: Databricks acquired Tabular for more than one billion dollars in 2024, Delta UniForm exposes Delta tables as Iceberg, Hudi can output Iceberg metadata, and Apache XTable translates metadata omni-directionally with no data copying.
- For most new, multi-engine deployments Iceberg is the safe default; Delta Lake fits Databricks-centric stacks; Hudi fits upsert-heavy streaming ingestion; and XTable interoperability can serve consumers that expect different formats from one dataset.
Main topics: Why Table Formats Exist, Apache Iceberg, Delta Lake, Apache Hudi, A Head-to-Head Comparison, The Convergence Story, How to Choose in 2026.
Why Table Formats Exist
A data lake in its simplest form is a collection of files — most commonly Apache Parquet, a columnar file format — stored in an object store such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. A query engine reads those files and treats them collectively as a table. This arrangement is inexpensive and scalable, but the abstraction is weak. The storage layer knows only about files and directories; it has no concept of a table, a transaction, or a consistent snapshot. The problems that follow from this gap are what gave rise to table formats.
A table format is a specification and metadata layer that sits between a physical collection of data files and the query engines that read them, presenting that collection as a single logical table with database-like guarantees. The most important of those guarantees is ACID — atomicity, consistency, isolation, and durability — the set of properties that ensure a group of changes either fully applies or does not apply at all, that concurrent readers and writers do not observe partial state, and that committed data survives failures. Without a table format, the lake cannot offer these properties.
Consider what happens when a job appends a new day of data to a plain Parquet directory and fails halfway. Some files are written and some are not, and a reader that lists the directory during the failure observes an inconsistent table. The same fragility affects updates and deletes: changing a single record requires rewriting whole files, and there is no transactional boundary to make the change appear atomically. This is the “directory of files” problem — the table is defined by whatever files happen to be present when the engine lists the path, with no authoritative record of which files belong to a committed version.
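The failure mode can be made concrete with a small in-memory sketch. The `store` set and `committed` list are hypothetical stand-ins for an object store and for the metadata a table format would maintain; no real storage API is involved, and the file names are invented for illustration.

```python
# Hypothetical in-memory stand-ins: `store` plays the object store,
# `committed` plays the metadata that a table format would maintain.
store = set()
committed = []  # authoritative list of files in committed versions


def append_day(files, fail_after=None):
    """Write files one by one, then publish them in a single commit step."""
    written = []
    for i, name in enumerate(files):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("job crashed mid-write")
        store.add(name)
        written.append(name)
    # The commit happens only after every file has been written.
    committed.extend(written)


def read_by_listing():
    """Plain-Parquet reader: the table is whatever the listing returns."""
    return sorted(store)


def read_by_metadata():
    """Table-format reader: the table is whatever was committed."""
    return sorted(committed)


append_day(["day1-a.parquet", "day1-b.parquet"])
try:
    append_day(["day2-a.parquet", "day2-b.parquet"], fail_after=1)
except RuntimeError:
    pass

print(read_by_listing())   # includes the orphaned day2-a.parquet
print(read_by_metadata())  # only the two committed day1 files
```

The listing-based reader sees the half-written second day, while the metadata-based reader sees only the last complete commit.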
Three further capabilities are absent from the plain-Parquet model and are central to why table formats were created. Schema evolution is the ability to add, rename, drop, or reorder columns without rewriting historical data or breaking existing readers. Time travel is the ability to query the table as it existed at a previous point in time, which supports reproducibility, auditing, and rollback. And efficient query planning depends on metadata that records which files contain which ranges of values, so the engine can skip files that cannot contain matching rows.
Table formats deliver these capabilities through a metadata layer: a set of files that record the table’s schema, its partitioning, and, critically, the exact list of data files that constitute each committed version of the table. A central concept, and the one Iceberg names explicitly, is the manifest: a metadata file that enumerates a group of data files together with statistics about their contents, such as the minimum and maximum value of each column within each file. Delta Lake and Hudi record comparable per-file information in their own metadata structures. With this information the engine performs file pruning, reading only the files that could contain rows matching a query predicate. The three formats examined below implement these ideas differently, and those differences drive their respective strengths.
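File pruning from minimum and maximum statistics can be sketched in a few lines. The `FileStats` class, the file names, and the single integer column are assumptions made for illustration; real formats track statistics per column and in richer types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def prune(files, lo, hi):
    """Keep files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Hypothetical per-file statistics, as a metadata layer might record them.
file_stats = [
    FileStats("part-000.parquet", 1, 100),
    FileStats("part-001.parquet", 101, 200),
    FileStats("part-002.parquet", 201, 300),
]

# Query: WHERE id BETWEEN 150 AND 210
selected = prune(file_stats, 150, 210)
print(selected)  # ['part-001.parquet', 'part-002.parquet']
```

The first file is skipped without being opened, because its recorded range cannot overlap the predicate.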
Apache Iceberg: Hierarchical Metadata and Engine Neutrality
Apache Iceberg was created to provide a table format that no single engine owns and that scales to very large tables without the planning bottlenecks of earlier approaches. Its defining characteristics are a hierarchical metadata architecture and a specification-first design philosophy. The format is defined by a written specification, and any engine that implements the specification can read and write the tables, which is the basis for Iceberg’s wide adoption.
The metadata hierarchy
An Iceberg table is described by a tree of metadata files. At the root is a metadata.json file (often called the table metadata file) that records the current schema, partition specification, and a pointer to the current snapshot. Each snapshot references a manifest list, a file that enumerates the manifest files belonging to that snapshot. Each manifest file, in turn, tracks a set of individual data files and stores column-level statistics for them, such as per-column lower and upper bounds and null counts (Apache Iceberg documentation; Dremio, as of 2026).
This hierarchy has two practical consequences. First, query planning is efficient even for tables with millions of files, because the engine reads the manifest list and the relevant manifests rather than listing the object store. Second, because the statistics are recorded in the manifests, the engine can prune files during planning — eliminating files whose recorded value ranges cannot satisfy the query predicate — without opening the data files. This is the same principle that motivates careful storage selection for analytical workloads, a topic explored in the discussion of choosing databases for preprocessed time-series data.
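The traversal from table metadata down to data files can be sketched with plain data classes. The class and field names below are simplified assumptions and do not reproduce the field names of the Iceberg specification; the point is only that planning follows pointers through metadata instead of listing the object store.

```python
from dataclasses import dataclass, field


@dataclass
class Manifest:
    data_files: list  # paths of the data files tracked by this manifest


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # the manifests that make up this snapshot


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict = field(default_factory=dict)  # snapshot_id -> Snapshot


def plan_files(table, snapshot_id=None):
    """Walk metadata, snapshot, manifest list, and manifests to data files."""
    sid = table.current_snapshot_id if snapshot_id is None else snapshot_id
    snapshot = table.snapshots[sid]
    return [path for m in snapshot.manifest_list for path in m.data_files]


m1 = Manifest(["a.parquet", "b.parquet"])
m2 = Manifest(["c.parquet"])
table = TableMetadata(
    current_snapshot_id=2,
    snapshots={
        1: Snapshot(1, [m1]),      # first commit
        2: Snapshot(2, [m1, m2]),  # second commit reuses m1 and adds m2
    },
)

print(plan_files(table))                 # ['a.parquet', 'b.parquet', 'c.parquet']
print(plan_files(table, snapshot_id=1))  # ['a.parquet', 'b.parquet']
```

Passing an older snapshot identifier returns the file set of that earlier version, which is how the same structure supports time travel.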
Partition evolution and hidden partitioning
Two features distinguish Iceberg’s treatment of partitioning. Hidden partitioning means the table records how a column is transformed into partition values — for example, by truncating a timestamp to its day — so that queries filtering on the raw column automatically benefit from partition pruning without the writer or reader having to reference a separate partition column. Partition evolution means the partitioning scheme of a table can be changed over time without rewriting the historical data; new data is written under the new scheme while old data remains valid under the old one. This is significant operationally, because it removes one of the most expensive and disruptive migrations in traditional partitioned tables.
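Both ideas can be illustrated together under simplifying assumptions. In the sketch below, each file records which transform produced its partition value, and the reader filters on the raw timestamp only. The dictionary layout, the `spec` labels, and the choice of a month-to-day change are hypothetical; this is not how Iceberg serializes partition specs.

```python
from datetime import date, datetime, timezone


def month_transform(ts):
    """Older scheme in this sketch: partition by (year, month)."""
    return (ts.year, ts.month)


def day_transform(ts):
    """Newer scheme in this sketch: truncate the timestamp to its day."""
    return ts.date()


transforms = {"month": month_transform, "day": day_transform}

# Each file remembers the scheme it was written under, so files from the
# old and the new scheme coexist without any rewrite of historical data.
files = [
    {"path": "f1.parquet", "spec": "month", "value": (2025, 12)},
    {"path": "f2.parquet", "spec": "day", "value": date(2026, 1, 5)},
    {"path": "f3.parquet", "spec": "day", "value": date(2026, 1, 6)},
]


def files_for_timestamp(ts):
    """Prune by applying each file's own transform to the raw column value."""
    return [f["path"] for f in files if transforms[f["spec"]](ts) == f["value"]]


# The query filters on the raw timestamp; no partition column is referenced.
query_ts = datetime(2026, 1, 5, 14, 30, tzinfo=timezone.utc)
print(files_for_timestamp(query_ts))  # ['f2.parquet']
```

A timestamp in December 2025 would match only `f1.parquet`, which was written under the older monthly scheme and remains valid.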
Engine reach
Iceberg’s specification-first, engine-agnostic design has produced the broadest multi-engine support of the three formats. As of 2026 it is read and written by Apache Spark, Apache Flink, Trino, and DuckDB, and is supported as a table format by managed warehouses including Snowflake and Google BigQuery (RisingWave; Dremio; Onehouse comparison guides, as of 2026). This breadth is the principal reason Iceberg has become the common denominator across the industry, a point developed in the convergence section. A practical example of landing data into Iceberg from an operational source is described in the guide to building an InfluxDB-to-Iceberg data pipeline with Telegraf, and stream processors such as Flink commonly write into Iceberg as part of complex event processing pipelines.
Delta Lake: A Transaction Log Born at Databricks
Delta Lake originated at Databricks and was designed first and foremost to bring transactional reliability to data stored for use with Apache Spark. Its integration with Spark is the deepest of the three formats (Databricks; Dremio, as of 2026), which reflects a shared origin: Databricks was founded by Spark’s original creators. Where Iceberg organizes metadata as a tree of manifest files, Delta Lake organizes it as an ordered transaction log.
The transaction log
A Delta table keeps a directory named _delta_log alongside its data files. Each successful commit writes a new JSON file to this directory, numbered sequentially, recording the actions that the commit performed — which data files were added and which were removed, along with metadata changes. The current state of the table is obtained by replaying these JSON commits in order. To prevent the replay from growing unbounded, Delta Lake periodically writes a checkpoint in Parquet format that captures the full table state up to a given commit. A reader then loads the most recent checkpoint and applies only the JSON commits that follow it, which keeps state reconstruction efficient.
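Log replay with a checkpoint can be sketched as follows. The commits are held in memory as newline-delimited JSON with simplified `add` and `remove` actions, and the checkpoint is a plain dictionary standing in for the Parquet checkpoint; both are illustrative assumptions and carry far fewer fields than real log entries.

```python
import json

# Hypothetical commits keyed by version; each value holds one JSON action
# per line, loosely modeled on the add/remove actions described above.
commits = {
    0: '{"add": {"path": "a.parquet"}}\n{"add": {"path": "b.parquet"}}',
    1: '{"remove": {"path": "a.parquet"}}\n{"add": {"path": "c.parquet"}}',
    2: '{"add": {"path": "d.parquet"}}',
}

# A checkpoint captures the full set of live files as of one version.
checkpoint = {"version": 1, "files": {"b.parquet", "c.parquet"}}


def replay(upto, checkpoint=None):
    """Rebuild the set of live files as of version `upto`."""
    if checkpoint is not None and checkpoint["version"] <= upto:
        files, start = set(checkpoint["files"]), checkpoint["version"] + 1
    else:
        files, start = set(), 0
    for version in range(start, upto + 1):
        for line in commits[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(replay(2)))                         # full replay from version 0
print(sorted(replay(2, checkpoint=checkpoint)))  # checkpoint plus later commits
print(sorted(replay(0)))                         # table state as of version 0
```

The first two calls return the same file set, but the second reads only one commit after the checkpoint. The third call shows that requesting an earlier version is simply a shorter replay.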
The transaction log model gives Delta Lake straightforward time travel — a reader can request the table state as of any commit version — and reliable concurrency control through optimistic commits against the log. Because the log is the single ordered authority on table state, the semantics map cleanly onto Spark’s execution model, which is part of why Delta Lake remains the most natural choice within Spark-centric and Databricks-centric environments. Transformation workflows that run on top of such tables, for instance with dbt-based transformation pipelines, treat the table as the consistent input and output of each model run.
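Optimistic commits against an ordered log can be sketched with a put-if-absent primitive. The `Log` class and its methods are hypothetical and in-memory; the sketch also omits the logical conflict check that a real writer performs before retrying.

```python
class CommitConflict(Exception):
    """Raised when another writer already claimed the target version."""


class Log:
    """Hypothetical in-memory log; keys are commit versions."""

    def __init__(self):
        self.entries = {}

    def latest_version(self):
        return max(self.entries, default=-1)

    def put_if_absent(self, version, actions):
        # Stands in for an atomic "create only if it does not exist" write.
        if version in self.entries:
            raise CommitConflict(f"version {version} already committed")
        self.entries[version] = actions


def commit(log, read_version, actions, max_retries=3):
    """Try to write read_version + 1; on conflict, re-read and retry."""
    for _ in range(max_retries):
        try:
            log.put_if_absent(read_version + 1, actions)
            return read_version + 1
        except CommitConflict:
            # A real writer would first check whether the winning commit
            # logically conflicts with its own changes before retrying.
            read_version = log.latest_version()
    raise CommitConflict("gave up after retries")


log = Log()
v_a = commit(log, read_version=-1, actions=["add a.parquet"])
# Writer B also read the empty table, so its first attempt collides.
v_b = commit(log, read_version=-1, actions=["add b.parquet"])
print(v_a, v_b)  # 0 1
```

Neither writer takes a lock. The loser of the race discovers the conflict only at commit time and lands on the next version.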
UniForm and interoperability
For most of Delta Lake’s history its log-based metadata was readable only by Delta-aware engines, which limited reach relative to Iceberg. Delta UniForm addresses this directly. UniForm provides interoperability across Delta Lake, Iceberg, and Hudi by generating the metadata that other formats expect alongside the Delta log, and it supports the Iceberg REST catalog interface so that Iceberg clients can discover and read the tables (Databricks, as of 2026). In effect, a table written as Delta can be presented to an Iceberg reader without copying the underlying Parquet data. This is one of the developments that has softened the practical cost of choosing Delta in a mixed-engine environment, and it is examined further in the convergence section.
Apache Hudi: Built for Streaming Upserts
Apache Hudi was designed around a workload that the other two formats addressed only later: continuous ingestion of streaming data with frequent record-level updates. The name itself abbreviates “Hadoop Upserts Deletes and Incrementals,” which signals its orientation. As of 2026, Hudi offers the most mature tooling for pure streaming ingestion involving high-frequency upserts (Onehouse; lakeFS, as of 2026). An upsert is an operation that inserts a record if its key does not yet exist and updates the existing record if it does — the natural operation when a change-data-capture stream delivers a steady flow of inserts, updates, and deletes from an operational database. Feeds of this kind commonly arrive through change data capture with Debezium and Kafka or through a Kafka consumer that lands events into the lake.
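The upsert semantics described above can be shown on a table held as a dictionary keyed by record key. The event shape (`op`, `key`, `row`) is an assumption made for illustration and is not the Debezium or Hudi record format.

```python
def apply_changes(table, events):
    """Apply change events to a table held as {key: row}."""
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            table.pop(key, None)
        else:
            # Upsert: insert when the key is new, overwrite when it exists.
            table[key] = event["row"]
    return table


table = {1: {"name": "ada", "plan": "free"}}
events = [
    {"op": "upsert", "key": 2, "row": {"name": "lin", "plan": "pro"}},   # insert
    {"op": "upsert", "key": 1, "row": {"name": "ada", "plan": "pro"}},   # update
    {"op": "delete", "key": 2},                                          # delete
]

result = apply_changes(table, events)
print(result)  # {1: {'name': 'ada', 'plan': 'pro'}}
```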
Copy-on-Write and Merge-on-Read
Hudi offers two table types, and the choice between them is the central tuning decision when adopting the format. In a Copy-on-Write (CoW) table, an update rewrites the base file that contains the affected record, so every version of the data is materialized at write time. Reads are therefore fast and simple, because the engine reads finished base files, but writes are more expensive because they rewrite whole files. In a Merge-on-Read (MoR) table, an update is written to a delta log file associated with the base file rather than rewriting the base file immediately; readers merge the base file with its delta logs at query time. Writes are cheaper and lower-latency, which suits high-frequency upserts, at the cost of more work at read time until a background compaction merges the deltas into new base files.
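The trade-off between the two table types can be sketched with two small classes. Both are in-memory illustrations with invented names, where a dictionary stands in for a base file and a list stands in for its delta log; they are not Hudi APIs.

```python
class CopyOnWriteTable:
    """Every upsert rewrites the base file; reads are a plain lookup."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.rewrites = 0

    def upsert(self, key, row):
        # Build a new copy of the whole base file with the change applied.
        self.base = {**self.base, key: row}
        self.rewrites += 1

    def read(self):
        return dict(self.base)


class MergeOnReadTable:
    """Upserts append to a delta log; reads merge base and log."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.delta_log = []

    def upsert(self, key, row):
        self.delta_log.append((key, row))  # cheap append, no rewrite

    def read(self):
        merged = dict(self.base)
        for key, row in self.delta_log:    # merge cost paid at query time
            merged[key] = row
        return merged

    def compact(self):
        # Background compaction folds the deltas into a new base file.
        self.base = self.read()
        self.delta_log = []


rows = {"k1": 1, "k2": 2}
cow, mor = CopyOnWriteTable(rows), MergeOnReadTable(rows)
for t in (cow, mor):
    t.upsert("k1", 10)
    t.upsert("k3", 30)

print(cow.read() == mor.read())  # True: both expose the same logical table
pending = len(mor.delta_log)     # deltas awaiting compaction
mor.compact()
```

Both tables return the same rows. The difference is where the work happens: the Copy-on-Write table performed two rewrites at write time, while the Merge-on-Read table deferred the merge to reads until compaction cleared its log.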
Record-level indexing and table management
To apply an upsert, Hudi must locate the existing record for a given key quickly. It maintains record-level indexing for this purpose — a mapping from record keys to the files that contain them — so an incoming update can be routed to the correct file without scanning the whole table. This indexing is a meaningful part of why Hudi handles high-frequency update workloads efficiently. Hudi also includes built-in table management services: compaction merges MoR delta logs into base files, clustering reorganizes data to improve query locality, and cleaning removes obsolete file versions according to a retention policy. These services can run as part of the ingestion pipeline or as separate jobs, and orchestrating them alongside ingestion is a common use of a workflow scheduler such as Apache Airflow.
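The role of the record-level index can be sketched as a dictionary from record key to the file group that holds it. The file-group names, the row contents, and the `default_group` parameter are hypothetical; Hudi's actual index types and file layout are considerably more involved.

```python
# Hypothetical file groups, each holding rows keyed by record key.
file_groups = {
    "fg-1": {"u1": {"city": "Oslo"}, "u2": {"city": "Lima"}},
    "fg-2": {"u3": {"city": "Pune"}},
}

# The index maps every record key to the file group that contains it.
record_index = {key: fg for fg, rows in file_groups.items() for key in rows}


def upsert(key, row, default_group="fg-2"):
    """Route the write through the index instead of scanning every group."""
    target = record_index.get(key, default_group)  # new keys go to a default
    file_groups[target][key] = row
    record_index[key] = target
    return target


print(upsert("u2", {"city": "Quito"}))  # fg-1: existing key, routed by index
print(upsert("u9", {"city": "Kyiv"}))   # fg-2: new key, placed in the default
```

An update to an existing key touches exactly one file group, and the lookup cost does not grow with the number of groups in the table.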
Hudi’s primary ingestion utility is its streaming ingestion tool, historically known as DeltaStreamer (now Hudi Streamer), which reads from sources such as Kafka and applies inserts, updates, and deletes to a Hudi table continuously. This tooling, together with record-level indexing and the MoR table type, is the basis for Hudi’s reputation in upsert-heavy streaming ingestion.
Native Iceberg output
Hudi has also moved toward interoperability. Tables created from Hudi version 0.14.0 onwards can be synced to Iceberg and/or Delta Lake through Apache XTable, and Hudi’s native Iceberg support lets a team use Hudi’s managed services — compaction, indexing, and Hudi Streamer ingestion — while outputting Iceberg-compatible tables for downstream consumers (Apache Hudi documentation, as of 2026). This means a pipeline can keep Hudi’s strengths on the write side while presenting Iceberg on the read side, a pattern that the convergence section places in context.
A Head-to-Head Comparison
The three formats overlap substantially in their core guarantees — all provide ACID transactions, schema evolution, and time travel — so the meaningful differences lie in their metadata models, their maturity for specific operations, and their ecosystem reach. The matrix below summarizes the breadth of engine support that distinguishes the formats, followed by a feature comparison and a workload mapping.
The grid is directional rather than a precise capability audit; engine support changes with each release, and warehouse vendors continue to add native read paths. The pattern it conveys is the one reported consistently across the comparison literature: Iceberg has the widest native reach, Delta Lake is strongest within Spark and reaches other engines primarily through UniForm, and Hudi is strong in Spark and Flink and reaches warehouses through translation. The detailed feature comparison follows.
| Dimension | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Design origin | Engine-neutral, specification-first | Databricks; deepest Spark integration | Streaming ingestion and upserts |
| Metadata model | Hierarchy: metadata.json → manifest list → manifests → data files | Ordered transaction log (_delta_log) + Parquet checkpoints | Timeline + base files and delta logs; record-level index |
| Upsert / CDC maturity | Supported; improving | Strong within Spark | Most mature for high-frequency upserts |
| Partition evolution | Yes, plus hidden partitioning | Limited | Limited |
| Engine support | Broadest: Spark, Flink, Trino, Snowflake, BigQuery, DuckDB | Spark-first; others via UniForm | Spark, Flink; warehouses via translation |
| Built-in table management | Via engine / catalog services | Via engine; optimize and vacuum | Built-in compaction, clustering, cleaning |
| Governance / catalog | Vendor-neutral; Iceberg REST catalog standard | Unity Catalog; UniForm exposes Iceberg REST | Hive / catalog integrations; XTable sync |
The feature comparison clarifies that the formats are not interchangeable on every axis even though their core guarantees overlap. Partition evolution and hidden partitioning are genuine Iceberg differentiators; built-in table management is a genuine Hudi differentiator; and the deepest Spark integration remains a Delta Lake characteristic. The second table maps common workloads to the format that most naturally fits each, before interoperability is taken into account.
| Workload | Best-fit format | Why |
|---|---|---|
| New multi-engine, vendor-neutral lakehouse | Iceberg | Broadest engine reach and neutral governance |
| Databricks-centric analytics and ML | Delta Lake | Deepest Spark integration; UniForm for outside reach |
| High-frequency CDC upserts from operational databases | Hudi | Record-level indexing, MoR, Hudi Streamer |
| Append-mostly analytical tables on a warehouse | Iceberg | Native support in Snowflake and BigQuery |
| One dataset consumed by several engines expecting different formats | Any + XTable / UniForm | Metadata translation avoids copying data |
The Convergence Story
The most consequential change in this area is not a new capability in any single format but the erosion of the boundaries between them. Several developments, taken together, have moved the ecosystem from competition toward interoperability, and they explain why the choice of format is now less of a permanent commitment than it was.
The Databricks–Tabular acquisition
In a development that reframed the competitive landscape, Databricks acquired Tabular — the startup founded by the original creators of Apache Iceberg — for more than one billion dollars, a deal revealed on June 4, 2024 (Databricks blog; TechTarget, as of 2024-06-04). Because Databricks is the company behind Delta Lake, the acquisition brought the architects of the rival format inside the same organization and signaled an intent to support both formats rather than insist on one. It is widely read as the moment the “format war” framing began to lose force in favour of interoperability.
UniForm, Hudi output, and Apache XTable
Three technical mechanisms now allow a single physical dataset to be read as more than one format. Delta UniForm, described earlier, exposes Delta tables as Iceberg and supports the Iceberg REST catalog interface (Databricks, as of 2026). Hudi’s native Iceberg support lets a team manage a table with Hudi and serve it as Iceberg (Apache Hudi documentation, as of 2026). And Apache XTable — an incubating project backed by Microsoft, Google, and Onehouse — provides omni-directional metadata translation between all three formats: any format to any format, with no copying of the underlying data (Dremio; Onehouse, as of 2026). XTable works by generating the metadata that each target format expects, pointing it at the same Parquet files, so the cost of translation is metadata generation rather than data duplication.
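The idea of translating metadata while leaving data files untouched can be illustrated with a toy converter. The function names, the dictionary layouts, and the `s3://bucket/t/` paths are all hypothetical; this is not XTable's API and does not reproduce either format's real metadata.

```python
def live_files(log_style_actions):
    """Resolve add/remove actions into the list of current data files."""
    files = []
    for action in log_style_actions:
        if "add" in action:
            files.append(action["add"]["path"])
        elif "remove" in action:
            removed = action["remove"]["path"]
            files = [f for f in files if f != removed]
    return files


def to_snapshot_style(log_style_actions, snapshot_id):
    """Emit snapshot-and-manifest metadata that points at the same files."""
    return {
        "snapshot_id": snapshot_id,
        "manifests": [{"data_files": live_files(log_style_actions)}],
    }


# Source table described as an ordered log of actions (hypothetical paths).
source_log = [
    {"add": {"path": "s3://bucket/t/a.parquet"}},
    {"add": {"path": "s3://bucket/t/b.parquet"}},
    {"remove": {"path": "s3://bucket/t/a.parquet"}},
    {"add": {"path": "s3://bucket/t/c.parquet"}},
]

target = to_snapshot_style(source_log, snapshot_id=1)
print(target["manifests"][0]["data_files"])
```

The output metadata references the same two Parquet paths that the log resolves to. Only metadata was produced, which is why the cost scales with the number of files and not with the volume of data.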
The cumulative effect of these developments is that Apache Iceberg has become the de-facto common denominator. It is the format that the other two can be exposed as, the one with the broadest native reach, and the one adopted by every major cloud provider and query engine (RisingWave; Dremio; Onehouse, as of 2026). The reasons most often cited are the same throughout the literature: vendor-neutral governance, partition evolution, and the widest multi-engine support.
How to Choose in 2026
Given convergence, the practical decision reduces to selecting the format that best fits the write path and the operational center of gravity, while relying on interoperability to satisfy diverse readers. The decision tree below summarizes the guidance, and the discussion that follows expands on each branch.
Iceberg as the safe default
For a new deployment that aims to remain vendor-neutral and to be queried by several engines, Iceberg is the safe default. It carries the lowest risk of lock-in, offers partition evolution and hidden partitioning, and is natively supported across the widest range of engines and warehouses. Choosing Iceberg also aligns with the direction of the ecosystem, since the other formats can be exposed as Iceberg but the reverse arrangement is less central to current tooling. For teams running engines on Kubernetes, where portability is already a goal, the neutrality argument extends naturally to the storage layer; the operational considerations of running such engines are discussed in the guide to database connections from Kubernetes pods.
Delta Lake for Databricks-centric stacks
When the platform is built around Databricks and Spark, Delta Lake remains the natural choice. Its integration is the deepest available, its tooling and governance through Unity Catalog are mature, and UniForm now mitigates the historical drawback of limited external reach by exposing the same tables as Iceberg. A team already invested in Databricks gains little by writing Hudi or Iceberg directly and may give up integration depth by doing so.
Hudi for streaming upserts
For ingestion dominated by high-frequency upserts and change-data-capture streams, Hudi remains the strongest fit. Its record-level indexing, Merge-on-Read table type, and built-in compaction and cleaning were designed for exactly this pattern, and its Hudi Streamer utility provides a tested ingestion path. The native Iceberg output then allows the same data to be served to analytical consumers as Iceberg, combining Hudi’s write strengths with Iceberg’s read reach.
When to rely on interoperability instead
Some organizations do not need to pick a single winner. When one dataset must serve consumers that expect different formats — for example, a Spark-on-Databricks team reading Delta and a Trino team reading Iceberg — relying on XTable or UniForm to translate metadata is often preferable to maintaining duplicate copies of the data. The decision then shifts from “which format” to “which write path produces the data, and which translations are needed for the readers.” This framing is the clearest sign of how far the field has moved from the original format competition.
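The guidance in this section can be condensed into a small function. The parameter names are invented for illustration, and the order of the checks (streaming upserts first, then platform) is an assumption; a team on Databricks with upsert-heavy ingestion would need to weigh the two branches for itself.

```python
def choose_format(databricks_centric=False,
                  upsert_heavy_streaming=False,
                  readers_expect_other_formats=False):
    """Return a write format and, if needed, an interoperability bridge."""
    if upsert_heavy_streaming:
        write_format, bridge = "Hudi", "native Iceberg output or XTable"
    elif databricks_centric:
        write_format, bridge = "Delta Lake", "UniForm"
    else:
        # New, multi-engine, vendor-neutral deployments.
        write_format, bridge = "Iceberg", "XTable"
    return {
        "write_format": write_format,
        "interop": bridge if readers_expect_other_formats else None,
    }


print(choose_format())
print(choose_format(databricks_centric=True, readers_expect_other_formats=True))
print(choose_format(upsert_heavy_streaming=True))
```

The function picks the write path first and treats interoperability as a second, independent question, which mirrors the framing in the paragraph above.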
Frequently Asked Questions
What is a lakehouse table format, and how does it differ from a file format like Parquet?
A file format such as Parquet defines how the bytes of a single data file are laid out. A table format is a metadata layer above many such files that presents them as one logical table with ACID transactions, schema evolution, and time travel. Parquet stores the data; the table format records which files belong to which committed version of the table and how to read them consistently.
Is Apache Iceberg replacing Delta Lake and Hudi?
Not exactly. Iceberg has become the de-facto common denominator that the other formats translate toward, and it is the safe default for new neutral deployments. Delta Lake and Hudi retain distinct strengths — deepest Spark integration and the most mature streaming-upsert tooling, respectively — and both can now expose their tables as Iceberg, so they continue to be used as write paths even where Iceberg is the serving format.
What is the difference between Hudi Copy-on-Write and Merge-on-Read?
Copy-on-Write rewrites the base file whenever a record is updated, which makes reads fast and writes heavier; it suits read-heavy tables with moderate update rates. Merge-on-Read appends updates to delta log files and merges them with the base file at query time, which makes writes light and low-latency; it suits high-frequency upserts, with background compaction periodically consolidating the deltas.
Does choosing one format lock an organization into one vendor or engine?
Less than it once did. Delta UniForm exposes Delta tables as Iceberg through the Iceberg REST catalog, Hudi can output Iceberg-compatible tables, and Apache XTable translates metadata among all three with no data copying. Iceberg in particular is vendor-neutral by design. Lock-in is now mainly a function of operational tooling and the write path rather than the storage format itself.
What did the Databricks acquisition of Tabular mean for the format landscape?
Databricks, the company behind Delta Lake, acquired Tabular — the startup founded by Iceberg’s original creators — for more than one billion dollars, with the deal revealed on June 4, 2024. Bringing Iceberg’s architects inside the Delta vendor signaled support for both formats and is widely read as the point at which the industry pivoted from a format competition toward interoperability.
Related Reading
- Building an InfluxDB-to-AWS-Iceberg Data Pipeline with Telegraf
- Choosing Databases for Preprocessed Time-Series Data
- Transformation Pipelines with dbt (data build tool)
- Orchestrating Data Pipelines with Apache Airflow
- Change Data Capture with Debezium and Kafka
- Complex Event Processing with Apache Flink
References
- Apache Iceberg — official documentation
- Delta Lake — official site and documentation
- Apache Hudi — official documentation
- Apache XTable (incubating) — omni-directional table-format interoperability
- Databricks — acquisition of Tabular (announced June 4, 2024)
Comparative characterizations of engine reach and format maturity reflect 2026 guidance from RisingWave, Dremio, Onehouse, and lakeFS; figures and dated claims are attributed inline.
Conclusion
The lakehouse table format is the layer that turns object storage into a transactional database surface, and the three principal implementations express three design priorities: Iceberg’s engine-neutral interoperability, Delta Lake’s depth within the Spark and Databricks ecosystem, and Hudi’s strength in high-frequency streaming upserts. Their core guarantees converge, but their metadata models and operational characters remain distinct enough to matter on the write path.
The decisive shift of the past two years is that selecting a format no longer means accepting permanent lock-in. The Databricks–Tabular acquisition, Delta UniForm, Hudi’s Iceberg output, and Apache XTable have made it possible for one physical dataset to be presented as more than one format, with Iceberg emerging as the de-facto common denominator. The practical recommendation that follows is straightforward: default to Iceberg for new, multi-engine, vendor-neutral deployments; choose Delta Lake within Databricks-centric stacks; choose Hudi for upsert-heavy streaming ingestion; and lean on interoperability layers when one dataset must serve consumers expecting different formats. The right choice is the one that fits the write path, because interoperability can serve the readers afterward.