Basic Apache Iceberg Table Creation Recipe

Overview

This recipe demonstrates how to create a basic Apache Iceberg table from scratch using PySpark, and showcases key Iceberg differentiators such as hidden partitioning and multi-catalog support.

What You’ll Learn

  • How to configure Spark for Iceberg
  • Creating tables with Iceberg catalog
  • Reading and querying Iceberg tables
  • Understanding Iceberg’s snapshot system
  • Working with hidden partitioning

Prerequisites

  • Python 3.8 or later
  • Apache Spark 3.3 or later
  • Basic understanding of Apache Spark
  • Familiarity with DataFrames

Quick Start

# Install dependencies
pip install -r requirements.txt

# Download Iceberg Spark Runtime (if not already available)
# Version should match your Spark version
# Example for Spark 3.3:
# wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.0/iceberg-spark-runtime-3.3_2.12-1.4.0.jar

# Run the solution
python solution.py

# Validate the recipe
./validate.sh

Recipe Structure

basic-iceberg-table/
├── problem.md         # Detailed problem description
├── solution.py        # Complete, commented solution
├── requirements.txt   # Python dependencies
├── validate.sh        # Automated validation script
└── README.md          # This file

Expected Output

When you run the solution, you’ll see:

  1. Spark session initialization with Iceberg configuration
  2. Sample data creation (5 users)
  3. Iceberg table creation with catalog
  4. Table statistics and schema
  5. SQL query demonstration
  6. Snapshot metadata display
  7. Hidden partitioning example

Key Concepts Demonstrated

1. Iceberg Catalog Configuration

spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
    .getOrCreate()
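
If the runtime JAR is not already on the classpath, Spark can also fetch it from Maven at session startup via spark.jars.packages. A minimal sketch, assuming Spark 3.3 with Scala 2.12 and no session already running (the app name is arbitrary; adjust the coordinates to your versions):

from pyspark.sql import SparkSession

# Same catalog setup as above, but resolving the Iceberg runtime from Maven
# instead of passing the JAR to spark-submit.
spark = SparkSession.builder \
    .appName("iceberg-basics") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.0") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
    .getOrCreate()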

2. Creating Iceberg Tables

# Using writeTo API (Iceberg-specific)
df.writeTo("local.db.users").create()

# Using SQL
spark.sql("""
    CREATE TABLE local.db.users (
        user_id INT,
        username STRING,
        email STRING
    ) USING iceberg
""")

3. Hidden Partitioning

# Partition by day transformation
spark.sql("""
    CREATE TABLE local.db.events (
        event_time TIMESTAMP,
        user_id STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_time))
""")

4. Accessing Metadata

# View snapshots
spark.sql("SELECT * FROM local.db.users.snapshots").show()

# View files
spark.sql("SELECT * FROM local.db.users.files").show()

Architecture Diagram

graph TB
    A[Sample Data] --> B[DataFrame]
    B --> C[Iceberg Writer]
    C --> D[Data Files]
    C --> E[Metadata Layer]
    
    E --> F[manifest-list.avro]
    E --> G[manifest.avro]
    E --> H[metadata.json]
    
    D --> I[Parquet/ORC/Avro Files]
    
    F --> J[Iceberg Table]
    G --> J
    H --> J
    I --> J
    
    J --> K[Multi-Engine Access]
    K --> L[Spark]
    K --> M[Trino]
    K --> N[Flink]

Iceberg vs Delta Comparison

Feature        Delta Lake               Apache Iceberg (This Recipe)
Catalog        File/path-based          Catalog-based (Hadoop, Hive, Glue, Nessie, etc.)
Partitioning   Explicit columns         Hidden, via transforms
Multi-Engine   Good                     Excellent
Metadata       JSON transaction log     JSON table metadata + Avro manifests

Advanced Usage

Using Different Catalogs

# AWS Glue Catalog
.config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")

# Hive Catalog
.config("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.hive.type", "hive")
.config("spark.sql.catalog.hive.uri", "thrift://localhost:9083")

Partition Evolution

# Start with one partition strategy
spark.sql("""
    CREATE TABLE local.db.orders (
        order_time TIMESTAMP,
        amount DECIMAL
    )
    USING iceberg
    PARTITIONED BY (days(order_time))
""")

# Later, add another partition field without rewriting data
spark.sql("""
    ALTER TABLE local.db.orders
    ADD PARTITION FIELD bucket(16, order_id)
""")

Validation

The validate.sh script automatically:

  • Checks Python installation
  • Installs dependencies if needed
  • Runs the solution
  • Verifies Iceberg table structure
  • Confirms metadata creation
  • Reports success/failure

Common Issues

Issue: Iceberg JAR not found

Solution: Download and add Iceberg Spark Runtime JAR

# For Spark 3.3
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.0/iceberg-spark-runtime-3.3_2.12-1.4.0.jar

# Add to spark-submit
spark-submit --jars iceberg-spark-runtime-3.3_2.12-1.4.0.jar solution.py

Issue: Catalog not configured

Solution: Ensure catalog configuration matches your environment

# For local testing, use hadoop catalog
.config("spark.sql.catalog.local.type", "hadoop")

# For production, use appropriate catalog (Hive, Glue, Nessie)

Issue: Table already exists

Solution: Use createOrReplace() or drop the table first

df.writeTo("local.db.users").createOrReplace()

# Or
spark.sql("DROP TABLE IF EXISTS local.db.users")

Next Steps

After mastering this basic recipe, explore:

  1. Advanced Operations: MERGE, UPDATE, DELETE
  2. Time Travel: Query historical snapshots (see the snippet after this list)
  3. Partition Evolution: Change partitioning strategy
  4. Multi-Engine: Query with Trino, Flink, Dremio
  5. Table Maintenance: Compaction, snapshot expiration
  6. Catalog Integration: AWS Glue, Hive Metastore, Nessie
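
As a preview of time travel (item 2), any snapshot ID from the snapshots metadata table can be queried directly. A minimal sketch, assuming the users table from this recipe has at least one snapshot:

# Look up the oldest snapshot of the users table
snap_id = spark.sql(
    "SELECT snapshot_id FROM local.db.users.snapshots ORDER BY committed_at LIMIT 1"
).first()["snapshot_id"]

# Query the table as it existed at that snapshot
spark.sql(f"SELECT * FROM local.db.users VERSION AS OF {snap_id}").show()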

Contributing

Found a bug or have an improvement? Please:

  1. Open an issue describing the problem
  2. Submit a PR with your fix
  3. Ensure validation passes

License

This recipe is part of the Delta Lake & Apache Iceberg Knowledge Hub, licensed under Apache 2.0.