Apache Iceberg is a modern table format designed to overcome the limitations of traditional Hive tables, offering improved performance, consistency, and scalability. In this article, we will explore the evolution of Iceberg, its key features like ACID transactions, partition evolution, and time travel, and how it integrates with modern data lakes. We’ll also dive into its architecture, metadata management, and catalog system while comparing it with Delta Lake and with file formats like Parquet. By the end, you’ll have a clear understanding of how Apache Iceberg enhances large-scale data management and analytics.
Apache Iceberg is a table format developed in 2017 by Ryan Blue and Daniel Weeks at Netflix to address performance bottlenecks, consistency issues, and limitations associated with the Hive table format. In 2018, the project was open-sourced and donated to the Apache Software Foundation, attracting contributions from major companies such as Apple, Dremio, AWS, Tencent, LinkedIn, and Stripe. Over time, many more organizations have joined in supporting and enhancing the project.
Netflix identified a fundamental flaw in the Hive table format: tables were tracked using directories and subdirectories, which restricted the level of granularity required for maintaining consistency, improving concurrency, and supporting features commonly found in data warehouses. To overcome these limitations, Netflix set out to develop a new table format with several key objectives:
When updates span multiple partitions, users should never experience inconsistent data. Changes should be applied atomically and quickly, ensuring that users either see the data before or after an update, but never in an intermediate state.
Hive’s reliance on file and directory listings created query planning bottlenecks. The new format needed to provide efficient metadata handling, reducing unnecessary file scans and improving query execution speed.
Users shouldn’t need to understand the physical structure of a table to benefit from partitioning. The system should automatically optimize queries without requiring additional filtering on derived partition columns.
Schema modifications in Hive often led to unsafe transactions, and changing a table’s partitioning required rewriting the entire dataset. The new format had to allow safe schema and partitioning updates without requiring a full table rewrite.
All these improvements had to work at Netflix’s massive scale, handling petabytes of data efficiently.
To address these challenges, Netflix designed Iceberg to track tables as a canonical list of files rather than directories. Apache Iceberg serves as a standardized table format that defines how metadata should be structured across multiple files. To drive adoption, the project provides libraries that integrate with popular compute engines like Apache Spark and Apache Flink.
Apache Iceberg is built to seamlessly integrate with existing storage solutions and compute engines, allowing tools to adopt the standard without requiring major changes. The goal is for Iceberg to become a ubiquitous industry standard, enabling users to interact with tables without worrying about the underlying format.
Many data tools now offer native support for Iceberg, making it possible for users to work with Iceberg tables without even realizing it. Over time, as automated table optimization and ingestion tools evolve, even data engineers will be able to interact with data lake storage just as easily as they do with traditional data warehouses—without needing to manage the storage layer manually.
Apache Iceberg is designed to go beyond merely addressing the limitations of the Hive table format—it introduces powerful capabilities that enhance data lake and data lakehouse workloads. Below is an overview of its key features:
Apache Iceberg provides ACID guarantees using optimistic concurrency control, ensuring that transactions are either fully committed or completely rolled back. Unlike traditional pessimistic locking, which can create bottlenecks, Iceberg’s approach minimizes conflicts while maintaining consistency. The catalog plays a crucial role in managing these transactions, preventing conflicting updates that could lead to data loss.
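To make this concrete, here is a minimal sketch of an atomic upsert against an Iceberg table from PySpark. The catalog name `demo` and the table names `sales.orders` and `sales.order_updates` are assumptions for illustration, and the snippet assumes the Iceberg SQL extensions are enabled (see the catalog configuration sketch later in this article); the key point is that the MERGE either commits in full or not at all.

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured with an Iceberg catalog named "demo"
# and the Iceberg SQL extensions (required for MERGE INTO).
spark = SparkSession.builder.getOrCreate()

# Upsert order records atomically: writers use optimistic concurrency, so a
# conflicting commit is retried or rejected, never applied partially.
spark.sql("""
    MERGE INTO demo.sales.orders AS t
    USING demo.sales.order_updates AS u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```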
One of the challenges with traditional data lakes is the inability to modify partitioning without rewriting the entire table. Iceberg solves this by enabling partition evolution, allowing changes to the partitioning scheme without requiring expensive table rewrites. New data can be written using an updated partitioning strategy while old data remains unchanged, ensuring seamless optimization.
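For example, a table that was originally partitioned by month can start writing new data by day with metadata-only DDL. The sketch below uses hypothetical table and column names and assumes the Iceberg SQL extensions are enabled; existing data files keep their original layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Evolve the partition spec from monthly to daily granularity.
# These are metadata changes only; no historical data is rewritten.
spark.sql("ALTER TABLE demo.sales.orders DROP PARTITION FIELD months(order_ts)")
spark.sql("ALTER TABLE demo.sales.orders ADD PARTITION FIELD days(order_ts)")
```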
Users shouldn’t have to know how a table is physically partitioned in order to query it efficiently. Iceberg introduces a more intuitive approach, often called hidden partitioning, by allowing queries to benefit from partitioning automatically. Instead of requiring users to filter by derived partition columns (e.g., filtering by event_day when querying timestamps), Iceberg applies transformations such as bucket, truncate, year, month, day, and hour to source columns, ensuring efficient query execution without manual intervention.
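As a sketch, assuming a hypothetical `demo.analytics.events` table: the table is partitioned by a day transform of its timestamp column, and queries simply filter on the timestamp itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Partition by the day derived from event_ts; no separate event_day column is needed.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.analytics.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Readers filter on the raw timestamp; Iceberg maps the predicate onto the
# hidden day partition and prunes files automatically.
df = spark.sql("""
    SELECT * FROM demo.analytics.events
    WHERE event_ts >= TIMESTAMP '2025-01-01 00:00:00'
""")
```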
Iceberg supports two strategies for row-level updates: Copy-on-Write (COW), where affected data files are rewritten whenever rows change, and Merge-on-Read (MOR), where changes are recorded in delete files and reconciled with the data files at read time.
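Which strategy engines use can be controlled per table. A minimal sketch using the standard `write.*.mode` table properties; the table name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Ask engines to handle deletes, updates, and merges with merge-on-read
# (delete files) instead of copy-on-write (rewriting whole data files).
spark.sql("""
    ALTER TABLE demo.sales.orders SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```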
Iceberg maintains immutable snapshots of data, enabling time travel queries. This feature allows users to analyze historical table states, making it useful for auditing, reproducing machine learning model outputs, or retrieving data as it appeared at a specific point in time—without requiring separate data copies.
Beyond just querying historical data, Iceberg allows rolling back a table to a previous snapshot. This is particularly useful for undoing accidental modifications or restoring data to a known good state.
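Both capabilities are exposed directly in SQL. A minimal sketch follows; the table name and snapshot ID are placeholders, the `VERSION AS OF` / `TIMESTAMP AS OF` syntax requires Spark 3.3+, and the rollback uses an Iceberg stored procedure that needs the SQL extensions enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Query the table as of a specific snapshot ID or a wall-clock time.
spark.sql("SELECT * FROM demo.sales.orders VERSION AS OF 4348842035413589111")
spark.sql("SELECT * FROM demo.sales.orders TIMESTAMP AS OF '2025-01-15 00:00:00'")

# Roll the table back to a known-good snapshot (Iceberg Spark procedure).
spark.sql("CALL demo.system.rollback_to_snapshot('sales.orders', 4348842035413589111)")
```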
Tables naturally evolve over time, requiring changes such as adding or removing columns, renaming fields, or modifying data types. Iceberg supports schema evolution without requiring table rewrites, ensuring flexibility while maintaining compatibility with existing data.
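In practice these changes are plain, metadata-only DDL statements. A sketch with hypothetical column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# All of these are metadata-only operations; no data files are rewritten.
spark.sql("ALTER TABLE demo.sales.orders ADD COLUMN discount_pct DOUBLE")
spark.sql("ALTER TABLE demo.sales.orders RENAME COLUMN order_ts TO ordered_at")
# Widen an int column to bigint, one of the safe type promotions Iceberg allows.
spark.sql("ALTER TABLE demo.sales.orders ALTER COLUMN order_id TYPE BIGINT")
spark.sql("ALTER TABLE demo.sales.orders DROP COLUMN discount_pct")
```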
With these features, Apache Iceberg is shaping the future of data lakes by providing robust, scalable, and user-friendly table management capabilities.
In this section, we will discuss the architecture of Apache Iceberg and how it enables Iceberg to resolve the problems inherent in the Hive table format, giving us a clear picture of how the format works under the hood.
The data layer of an Apache Iceberg table is responsible for storing the actual table data. It primarily consists of data files, but it also includes delete files when records are marked for removal. This layer is essential for serving query results, as it provides the underlying data required for processing. While certain queries can be answered using metadata alone—such as retrieving the maximum value of a column—the data layer is typically involved in fulfilling most user queries. Structurally, the files within this layer form the leaves of Apache Iceberg’s tree-based architecture.
In real-world applications, the data layer is hosted on a distributed filesystem like the Hadoop Distributed File System (HDFS) or an object storage system such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). This flexibility allows Apache Iceberg to integrate seamlessly with modern data lakehouse architectures, enabling efficient data management and analytics at scale.
Data files store the actual data in an Apache Iceberg table. Iceberg is file-format agnostic, supporting Apache Parquet, Apache ORC, and Apache Avro, so teams can choose the file format that best fits each workload without changing how the table is managed.
Despite this flexibility, Parquet is the most widely used format due to its columnar storage, which optimizes query performance, compression, and parallelism across modern analytics engines.
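The file format used for new writes is a per-table setting. A sketch using the standard `write.format.default` table property; the table name is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Parquet is the default; switching new writes to ORC (or Avro) is a table property.
spark.sql("""
    ALTER TABLE demo.analytics.events
    SET TBLPROPERTIES ('write.format.default' = 'orc')
""")
```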
Since data lake storage is immutable, direct row updates aren’t possible. Instead, delete files track removed records, enabling Merge-on-Read (MOR) updates. There are two types:
Positional Deletes: Identify rows based on file path and row position (e.g., deleting a record at row #234 in a file).
Equality Deletes: Identify rows by specific column values (e.g., deleting all rows where order_id = 1234).
Delete files apply only to Iceberg v2 tables and ensure that query engines correctly apply updates using sequence numbers, preventing unintended row removals when inserting new data.
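Engines write these delete files automatically; users simply issue row-level DML. For instance, with merge-on-read enabled (as in the earlier table-properties sketch, with the same placeholder table name), a DELETE produces delete files rather than rewriting data files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# With write.delete.mode = merge-on-read, this writes positional or equality
# delete files; readers merge them with the data files at query time.
spark.sql("DELETE FROM demo.sales.orders WHERE order_id = 1234")
```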
The metadata layer is a crucial component of an Iceberg table’s architecture, responsible for managing all metadata files. It follows a tree structure, which tracks both the data files and the operations that led to their creation.
By efficiently organizing these metadata files, Iceberg enables key features like time travel (querying historical data states) and schema evolution (modifying table schemas without disrupting existing queries). This structured approach makes Iceberg a powerful solution for managing large-scale datasets.
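This metadata is itself queryable. Iceberg exposes metadata tables (such as `snapshots`, `manifests`, `files`, and `history`) that engines like Spark can read directly; the table name below is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog named "demo"

# Inspect the metadata tree: snapshots, manifests, and data files.
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.sales.orders.snapshots").show()
spark.sql("SELECT path, added_data_files_count FROM demo.sales.orders.manifests").show()
spark.sql("SELECT file_path, record_count FROM demo.sales.orders.files").show()
```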
When reading from a table—or managing hundreds or thousands of tables—users need a way to locate the correct metadata file that tells them where to read or write data. The Iceberg catalog serves as this central registry, helping users and systems determine the current metadata file location for any given table.
The primary function of the catalog is to store a pointer to the current metadata file for each table. This metadata pointer is crucial because it ensures that all readers and writers interact with the same table state at any given time.
Different backend systems can serve as an Iceberg catalog, each handling the metadata pointer in its own way. Common options include the Hive Metastore, AWS Glue, a relational database via the JDBC catalog, a REST catalog service, and Project Nessie.
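In Spark, the catalog is wired up through configuration. Below is a minimal sketch of a Hadoop (filesystem) catalog named `demo`; the warehouse path is a placeholder, the iceberg-spark-runtime package must be on the classpath, and a Hive, Glue, JDBC, or REST catalog would swap in a different `type` or implementation.

```python
from pyspark.sql import SparkSession

# Register an Iceberg catalog named "demo" backed by a filesystem warehouse.
# Production setups typically point the warehouse at S3/ADLS/GCS and use a
# Hive, Glue, JDBC, or REST catalog instead of the Hadoop catalog shown here.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.sales")
spark.sql("SHOW TABLES IN demo.sales").show()
```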
When dealing with large-scale data processing in data lakes, choosing the right file or table format is crucial for performance, consistency, and scalability. Apache Iceberg, Apache Parquet, Apache ORC, and Delta Lake are widely used, but they serve different purposes.
| Format | Type | Key Feature | Best Use Case |
|---|---|---|---|
| Apache Iceberg | Table format | ACID transactions, time travel, schema evolution | Large-scale analytics, cloud-based data lakes |
| Apache Parquet | File format | Columnar storage, compression | Optimized querying, analytics |
| Apache ORC | File format | Columnar storage, lightweight indexing | Hive-based workloads, big data processing |
| Delta Lake | Table format | ACID transactions, versioning | Streaming + batch workloads, real-time pipelines |
As a modern table format, Apache Iceberg brings ACID transactions, schema evolution, partition evolution, and time travel to large-scale data lakes. Compared to Parquet and ORC, Iceberg is more than just a file format: it provides transactional guarantees and metadata optimizations. While Delta Lake also supports ACID transactions, Iceberg has an edge in schema and partition evolution, making it a strong choice for long-term, cloud-native data lake storage.
Apache Iceberg has emerged as a powerful table format designed to overcome the limitations of the Hive table format, offering improved consistency, performance, scalability, and ease of use. Its innovative features, such as ACID transactions, partition evolution, time travel, and schema evolution, make it a compelling choice for organizations managing large-scale data lakes. By integrating seamlessly with existing storage solutions and compute engines, Iceberg provides a flexible and future-proof approach to data lake management.
Q. What is Apache Iceberg?
A. Apache Iceberg is an open-source table format that improves data lake performance, consistency, and scalability.
Q. Why was Apache Iceberg created?
A. It was created to overcome the limitations of the Hive table format, such as inefficient metadata handling and the lack of atomic transactions.
Q. How does Iceberg handle schema evolution?
A. Iceberg supports schema changes like adding, renaming, or removing columns without requiring a full table rewrite.
Q. What is partition evolution in Iceberg?
A. Partition evolution allows modifying a table’s partitioning scheme without rewriting historical data, enabling better query optimization.
Q. How does Iceberg ensure ACID compliance?
A. It uses optimistic concurrency control to ensure atomic updates and prevent conflicts in concurrent writes.