How to Use Apache Iceberg Tables?

Abhishek Kumar | Last Updated: 13 Mar, 2025
10 min read

Apache Iceberg is a modern table format designed to overcome the limitations of traditional Hive tables, offering improved performance, consistency, and scalability. In this article, we will explore the evolution of Iceberg, its key features like ACID transactions, partition evolution, and time travel, and how it integrates with modern data lakes. We’ll also dive into its architecture, metadata management, and catalog system, while comparing it with the Delta Lake table format and file formats like Parquet and ORC. By the end, you’ll have a clear understanding of how Apache Iceberg enhances large-scale data management and analytics.

Learning Objectives

  • Understand the key features and architecture of Apache Iceberg.
  • Learn how Iceberg enables schema and partition evolution without rewriting data.
  • Explore how ACID transactions and time travel improve data consistency.
  • Compare Iceberg with other formats such as Delta Lake, Parquet, and ORC.
  • Discover use cases where Apache Iceberg enhances data lake performance.

Introduction to Apache Iceberg

Apache Iceberg is a table format developed in 2017 by Ryan Blue and Daniel Weeks at Netflix to address performance bottlenecks, consistency issues, and limitations associated with the Hive table format. In 2018, the project was open-sourced and donated to the Apache Software Foundation, attracting contributions from major companies such as Apple, Dremio, AWS, Tencent, LinkedIn, and Stripe. Over time, many more organizations have joined in supporting and enhancing the project.

Evolution of Apache Iceberg

Netflix identified a fundamental flaw in the Hive table format: tables were tracked using directories and subdirectories, which restricted the level of granularity required for maintaining consistency, improving concurrency, and supporting features commonly found in data warehouses. To overcome these limitations, Netflix set out to develop a new table format with several key objectives:

Consistency

When updates span multiple partitions, users should never experience inconsistent data. Changes should be applied atomically and quickly, ensuring that users either see the data before or after an update, but never in an intermediate state.

Performance

Hive’s reliance on file and directory listings created query planning bottlenecks. The new format needed to provide efficient metadata handling, reducing unnecessary file scans and improving query execution speed.

Ease of Use

Users shouldn’t need to understand the physical structure of a table to benefit from partitioning. The system should automatically optimize queries without requiring additional filtering on derived partition columns.

Evolvability

Schema modifications in Hive often led to unsafe transactions, and changing a table’s partitioning required rewriting the entire dataset. The new format had to allow safe schema and partitioning updates without requiring a full table rewrite.

Scalability

All these improvements had to work at Netflix’s massive scale, handling petabytes of data efficiently.

Introducing the Iceberg Format

To address these challenges, Netflix designed Iceberg to track tables as a canonical list of files rather than directories. Apache Iceberg serves as a standardized table format that defines how metadata should be structured across multiple files. To drive adoption, the project provides libraries that integrate with popular compute engines like Apache Spark and Apache Flink.

Standard for Data Lakes

Apache Iceberg is built to seamlessly integrate with existing storage solutions and compute engines, allowing tools to adopt the standard without requiring major changes. The goal is for Iceberg to become a ubiquitous industry standard, enabling users to interact with tables without worrying about the underlying format.

Many data tools now offer native support for Iceberg, making it possible for users to work with Iceberg tables without even realizing it. Over time, as automated table optimization and ingestion tools evolve, data engineers will be able to interact with data lake storage just as easily as they do with traditional data warehouses, without needing to manage the storage layer manually.

Key Features of Apache Iceberg

Apache Iceberg is designed to go beyond merely addressing the limitations of the Hive table format—it introduces powerful capabilities that enhance data lake and data lakehouse workloads. Below is an overview of its key features:

ACID Transactions

Apache Iceberg provides ACID guarantees using optimistic concurrency control, ensuring that transactions are either fully committed or completely rolled back. Unlike traditional pessimistic locking, which can create bottlenecks, Iceberg’s approach minimizes conflicts while maintaining consistency. The catalog plays a crucial role in managing these transactions, preventing conflicting updates that could lead to data loss.
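
To make this concrete, here is a minimal PySpark sketch. The catalog name (`demo`), warehouse path, and table are all hypothetical, and it assumes the Iceberg Spark runtime jar is on the classpath; the later sketches in this article reuse this `spark` session and catalog.

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a local Hadoop-style catalog named "demo".
# Assumes the Iceberg Spark runtime jar is on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.orders (
        order_id BIGINT,
        amount   DOUBLE
    ) USING iceberg
""")

# The append below is one atomic snapshot commit: readers see the table
# either before or after it, never in between. Conflicting concurrent
# commits are detected optimistically and retried or failed cleanly.
df = spark.createDataFrame([(1, 9.99), (2, 19.99)], ["order_id", "amount"])
df.writeTo("demo.db.orders").append()
```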

Partition Evolution

One of the challenges with traditional data lakes is the inability to modify partitioning without rewriting the entire table. Iceberg solves this by enabling partition evolution, allowing changes to the partitioning scheme without requiring expensive table rewrites. New data can be written using an updated partitioning strategy while old data remains unchanged, ensuring seamless optimization.
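
A hedged sketch of what partition evolution looks like in Spark SQL, assuming the session has Iceberg’s SQL extensions enabled and a hypothetical `demo.db.events` table partitioned daily:

```python
# Requires spark.sql.extensions to include Iceberg's
# IcebergSparkSessionExtensions; table and column names are hypothetical.

# Switch the partition spec from daily to monthly. Existing files keep
# their old layout; only newly written data uses the new spec -- no
# table rewrite takes place.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD months(event_ts)")
spark.sql("ALTER TABLE demo.db.events DROP PARTITION FIELD days(event_ts)")
```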

Hidden Partitioning

Users often don’t need to know how a table is physically partitioned. Iceberg introduces a more intuitive approach by allowing queries to benefit from partitioning automatically. Instead of requiring users to filter by derived partitioning columns (e.g., filtering by event_day when querying timestamps), Iceberg applies transformations such as bucket, truncate, year, month, day, and hour, ensuring efficient query execution without manual intervention.
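
Here is a hedged sketch of hidden partitioning in practice (table and column names are hypothetical):

```python
# The table is partitioned by a hidden transform of event_ts; users
# never see or filter on a derived "event_day" column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# This plain timestamp predicate is mapped onto the daily partitions
# automatically, so only matching data files are scanned.
spark.sql("""
    SELECT count(*) AS cnt
    FROM demo.db.events
    WHERE event_ts >= TIMESTAMP '2025-03-01 00:00:00'
      AND event_ts <  TIMESTAMP '2025-03-02 00:00:00'
""").show()
```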

Row-Level Table Operations

Iceberg supports two strategies for row-level updates (a configuration sketch follows the list):

  • Copy-on-Write (COW): When a row is updated, the entire data file is rewritten, ensuring strong consistency.
  • Merge-on-Read (MOR): Only the modified records are written to a new file, and changes are reconciled during query execution, optimizing for workloads with frequent updates and deletes.
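
The choice between the two is a per-table property and can even differ by operation type. A hedged sketch using Iceberg’s documented write-mode properties (the table name is hypothetical):

```python
# Choose merge-on-read for frequent updates/deletes, copy-on-write
# where read performance matters most.
spark.sql("""
    ALTER TABLE demo.db.orders SET TBLPROPERTIES (
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read',
        'write.merge.mode'  = 'copy-on-write'
    )
""")
```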

Time Travel

Iceberg maintains immutable snapshots of data, enabling time travel queries. This feature allows users to analyze historical table states, making it useful for auditing, reproducing machine learning model outputs, or retrieving data as it appeared at a specific point in time—without requiring separate data copies.
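
A minimal time-travel sketch in Spark SQL (the snapshot ID and timestamp are placeholders; `VERSION AS OF` / `TIMESTAMP AS OF` require a Spark version that supports this syntax):

```python
# Query the table as of a specific snapshot ID (placeholder value).
spark.sql("SELECT * FROM demo.db.orders VERSION AS OF 4348197207783662731").show()
# Query the table as it looked at a point in time (placeholder value).
spark.sql("SELECT * FROM demo.db.orders TIMESTAMP AS OF '2025-03-01 00:00:00'").show()
```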

Version Rollback

Beyond just querying historical data, Iceberg allows rolling back a table to a previous snapshot. This is particularly useful for undoing accidental modifications or restoring data to a known good state.
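
A hedged sketch of a rollback using Iceberg’s stored procedures (requires the Iceberg SQL extensions; the snapshot ID is a placeholder):

```python
# Roll the table back to a known good snapshot.
spark.sql("CALL demo.system.rollback_to_snapshot('db.orders', 4348197207783662731)")
```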

Schema Evolution

Tables naturally evolve over time, requiring changes such as adding or removing columns, renaming fields, or modifying data types. Iceberg supports schema evolution without requiring table rewrites, ensuring flexibility while maintaining compatibility with existing data.
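
These schema changes are metadata-only operations; a hedged sketch (column names are hypothetical):

```python
# None of these statements rewrite data files.
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN quantity INT")
spark.sql("ALTER TABLE demo.db.orders RENAME COLUMN amount TO total_amount")
# Type promotion is limited to safe widenings, e.g. int -> bigint.
spark.sql("ALTER TABLE demo.db.orders ALTER COLUMN quantity TYPE BIGINT")
```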

With these features, Apache Iceberg is shaping the future of data lakes by providing robust, scalable, and user-friendly table management capabilities.

Architecture of Apache Iceberg

In this section, we will discuss the architecture of Apache Iceberg and how that design enables it to resolve the problems inherent in the Hive table format, giving a clear picture of how the format works under the hood.

The Data Layer

The data layer of an Apache Iceberg table is responsible for storing the actual table data. It primarily consists of data files, but it also includes delete files when records are marked for removal. This layer is essential for serving query results, as it provides the underlying data required for processing. While certain queries can be answered using metadata alone—such as retrieving the maximum value of a column—the data layer is typically involved in fulfilling most user queries. Structurally, the files within this layer form the leaves of Apache Iceberg’s tree-based architecture.

In real-world applications, the data layer is hosted on a distributed filesystem like the Hadoop Distributed File System (HDFS) or an object storage system such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). This flexibility allows Apache Iceberg to integrate seamlessly with modern data lakehouse architectures, enabling efficient data management and analytics at scale.

Data Files

Data files store the actual data in an Apache Iceberg table. Iceberg is file format agnostic, supporting Apache Parquet, ORC, and Avro, offering key advantages:

  • Organizations can maintain multiple file formats due to historical or operational needs.
  • Workloads can use the best-suited format (e.g., Parquet for analytical scans, Avro for streaming ingestion).
  • Future-proofing allows easy adoption of new formats as technology evolves.

Despite this flexibility, Parquet is the most widely used format due to its columnar storage, which optimizes query performance, compression, and parallelism across modern analytics engines.
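
The default on-disk format is itself just a table property; a hedged sketch:

```python
# Parquet is the default; switching to ORC (or Avro) affects only
# newly written data files.
spark.sql("""
    ALTER TABLE demo.db.orders
    SET TBLPROPERTIES ('write.format.default' = 'orc')
""")
```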

Delete Files

Since data lake storage is immutable, direct row updates aren’t possible. Instead, delete files track removed records, enabling Merge-on-Read (MOR) updates. There are two types:

  • Positional deletes: Identify rows by file path and row position (e.g., deleting the record at row #234 in a file).
  • Equality deletes: Identify rows by specific column values (e.g., deleting all rows where order_id = 1234).

Delete files apply only to Iceberg v2 tables and ensure that query engines correctly apply updates using sequence numbers, preventing unintended row removals when inserting new data.
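
For example, with merge-on-read enabled (see the table-properties sketch earlier), a row-level delete is recorded as a small delete file rather than a rewrite of the data file; a hedged sketch:

```python
# With 'write.delete.mode' = 'merge-on-read', this writes a delete file
# marking the row as removed (positional or equality, depending on the
# engine) instead of rewriting the underlying data file.
spark.sql("DELETE FROM demo.db.orders WHERE order_id = 1234")
```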

The Metadata Layer in Apache Iceberg

The metadata layer is a crucial component of an Iceberg table’s architecture, responsible for managing all metadata files. It follows a tree structure, which tracks both the data files and the operations that led to their creation.

Key Metadata Components in Iceberg

  • Manifest Files
    • Track data files and delete files at a granular level.
    • Contain statistics like column value ranges, aiding query pruning.
    • Written in Avro format for efficient storage.
  • Manifest Lists
    • Represent snapshots of the table at a given time.
    • Store metadata about manifest files, including partition details and row counts.
    • Help Iceberg maintain a time-travel feature for querying historical states.
  • Metadata Files
    • Track table-wide information such as schema, partition specs, and snapshots.
    • Ensure atomic updates to prevent inconsistencies during concurrent writes.
    • Maintain historical logs of changes to support schema evolution.
  • Puffin Files
    • Store advanced statistics and indexes, like Theta sketches from Apache DataSketches.
    • Optimize queries requiring approximate distinct counts (e.g., unique users per region).
    • Improve performance for analytical queries without requiring full table scans.

By efficiently organizing these metadata files, Iceberg enables key features like time travel (querying historical data states) and schema evolution (modifying table schemas without disrupting existing queries). This structured approach makes Iceberg a powerful solution for managing large-scale datasets.
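
Much of this metadata layer can be inspected directly through Iceberg’s metadata tables; a hedged sketch (the table name is hypothetical):

```python
# Snapshots: one row per commit (this is what backs time travel).
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.orders.snapshots").show()
# Manifests tracked by the current snapshot's manifest list.
spark.sql("SELECT path, added_data_files_count FROM demo.db.orders.manifests").show()
# Data files with per-file statistics used for pruning.
spark.sql("SELECT file_path, record_count FROM demo.db.orders.files").show()
```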

The Catalog in Apache Iceberg

When reading from a table—or managing hundreds or thousands of tables—users need a way to locate the correct metadata file that tells them where to read or write data. The Iceberg catalog serves as this central registry, helping users and systems determine the current metadata file location for any given table.

Role of the Iceberg Catalog

The primary function of the catalog is to store a pointer to the current metadata file for each table. This pointer is crucial because it ensures that all readers and writers interact with the same table state at any given time.
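
As an illustration of the pointer lookup, here is a minimal sketch using PyIceberg (an assumption: the `pyiceberg` package is installed and a catalog named `default` is configured, e.g. in `~/.pyiceberg.yaml`):

```python
from pyiceberg.catalog import load_catalog

# The catalog resolves the table identifier to the current metadata
# file; everything else (snapshots, manifests, data files) is reached
# from that pointer.
catalog = load_catalog("default")          # catalog name is hypothetical
table = catalog.load_table("db.orders")    # looks up the metadata pointer
print(table.metadata_location)             # path of the current metadata.json
```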

How Do Iceberg Catalogs Store Metadata Pointers?

Different backend systems can serve as an Iceberg catalog, each handling the metadata pointer in its own way (a configuration sketch follows the list):

  • Hadoop Catalog (Amazon S3 Example)
    • Uses a file named version-hint.text in the table’s metadata folder.
    • The file contains the version number of the latest metadata file.
    • Since this approach relies on a distributed file system (or a similar abstraction), it is referred to as the Hadoop Catalog.
  • Hive Metastore Catalog
    • Stores the metadata file location in a table property called metadata_location.
    • Commonly used in Hive-based data ecosystems.
  • Nessie Catalog
    • Stores the metadata file location in a table property called metadataLocation.
    • Useful for version-controlled data lake implementations.
  • AWS Glue Catalog
    • Functions similarly to the Hive Metastore but is fully managed within AWS.

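In practice, the catalog choice surfaces as engine configuration. A hedged sketch of Spark settings for two of the options above (catalog names, paths, and URIs are placeholders):

```python
# Hadoop catalog: the pointer is the version-hint.text file under the
# table's metadata folder.
hadoop_catalog_conf = {
    "spark.sql.catalog.hadoop_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hadoop_cat.type": "hadoop",
    "spark.sql.catalog.hadoop_cat.warehouse": "s3://my-bucket/warehouse",
}

# Hive Metastore catalog: the pointer lives in the table's
# metadata_location property in the metastore.
hive_catalog_conf = {
    "spark.sql.catalog.hive_cat": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.hive_cat.type": "hive",
    "spark.sql.catalog.hive_cat.uri": "thrift://metastore-host:9083",
}
```
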
Comparing Apache Iceberg with Other Table Formats

When dealing with large-scale data processing in data lakes, choosing the right file or table format is crucial for performance, consistency, and scalability. Apache Iceberg, Apache Parquet, Apache ORC, and Delta Lake are widely used, but they serve different purposes.

Overview of Each Format

| Format | Type | Key Feature | Best Use Case |
| --- | --- | --- | --- |
| Apache Iceberg | Table format | ACID transactions, time travel, schema evolution | Large-scale analytics, cloud-based data lakes |
| Apache Parquet | File format | Columnar storage, compression | Optimized querying, analytics |
| Apache ORC | File format | Columnar storage, lightweight indexing | Hive-based workloads, big data processing |
| Delta Lake | Table format | ACID transactions, versioning | Streaming + batch workloads, real-time pipelines |

As a modern table format, Apache Iceberg enables large-scale data lakes with ACID transactions, schema evolution, partition evolution, and time travel. Compared to Parquet and ORC, Iceberg is more than just a file format: it provides transactional guarantees and metadata optimizations. While Delta Lake also supports ACID transactions, Iceberg has an edge in schema and partition evolution, making it a strong choice for long-term, cloud-native data lake storage.

Conclusion

Apache Iceberg has emerged as a powerful table format designed to overcome the limitations of the Hive table format, offering improved consistency, performance, scalability, and ease of use. Its innovative features, such as ACID transactions, partition evolution, time travel, and schema evolution, make it a compelling choice for organizations managing large-scale data lakes. By integrating seamlessly with existing storage solutions and compute engines, Iceberg provides a flexible and future-proof approach to data lake management.

Frequently Asked Questions

Q1. What is Apache Iceberg?

A. Apache Iceberg is an open-source table format that improves the performance, consistency, and scalability of data lakes.

Q2. What is the need for Apache Iceberg?

A. Developers created it to overcome the limitations of the Hive table format, such as inefficient metadata handling and the lack of atomic transactions.

Q3. How does Apache Iceberg handle schema evolution?

A. Iceberg supports schema changes like adding, renaming, or removing columns without requiring a full table rewrite.

Q4. What is partition evolution in Apache Iceberg?

A. Partition evolution allows modifying partitioning schemes without rewriting historical data, enabling better query optimization.

Q5. How does Iceberg support ACID transactions?

A. It uses optimistic concurrency control to ensure atomic updates and prevent conflicts in concurrent writes.

Hello, I'm Abhishek, a Data Engineer Trainee at Analytics Vidhya. I'm passionate about data engineering and video games. I have experience with Apache Hadoop, AWS, and SQL, and I keep exploring their intricacies and optimizing data workflows.
