Getting Started with Delta Lake and Apache Iceberg

This tutorial provides a hands-on introduction to both Delta Lake and Apache Iceberg, helping you understand when and how to use each technology.

Overview

Both Delta Lake and Apache Iceberg are open-source table formats that bring ACID transactions, schema evolution, and time travel capabilities to data lakes. They transform collections of Parquet files into reliable, transactional data stores.

Prerequisites

  • Basic understanding of data lakes and Parquet files
  • Familiarity with Apache Spark or another query engine
  • Access to a development environment (local or cloud)
  • Java 8 or 11 installed (for Spark)

Choosing Between Delta Lake and Iceberg

Use this decision tree (written in Mermaid syntax) to help choose the right technology for your needs:

graph TD
    A[Start] --> B{Primary compute engine?}
    B -->|Databricks| C[Delta Lake]
    B -->|Apache Spark| D{Need multi-engine support?}
    B -->|Apache Flink| E[Apache Iceberg]
    B -->|Trino/Presto| E
    
    D -->|Yes| E
    D -->|No| F{Which features are critical?}
    
    F -->|Z-ordering, CDC| C
    F -->|Hidden partitioning| E
    F -->|Either works| G[Choose based on team expertise]
    
    C --> H[Implement Delta Lake]
    E --> I[Implement Apache Iceberg]
    G --> J[Start with Delta Lake for Spark]

Part 1: Delta Lake Quickstart

Installation

# Using pip
pip install pyspark delta-spark

# Using conda
conda install -c conda-forge pyspark delta-spark

Your First Delta Table

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Create Spark session with Delta support.
# configure_spark_with_delta_pip puts the Delta Lake JARs that ship with
# the delta-spark pip package on the Spark classpath.
builder = SparkSession.builder \
    .appName("DeltaQuickstart") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])

# Write as Delta table
df.write.format("delta").mode("overwrite").save("/tmp/users-delta")

# Read Delta table
delta_df = spark.read.format("delta").load("/tmp/users-delta")
delta_df.show()

Key Delta Lake Operations

1. Update Records

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/users-delta")

# Update records
delta_table.update(
    condition = "age < 30",
    set = {"age": "age + 1"}
)

2. Delete Records

delta_table.delete("id = 2")

3. Upsert (MERGE)

# New data
new_data = [(2, "Bob", 31), (4, "Diana", 28)]
new_df = spark.createDataFrame(new_data, ["id", "name", "age"])

# Merge
delta_table.alias("target").merge(
    new_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set = {
    "name": "source.name",
    "age": "source.age"
}).whenNotMatchedInsert(values = {
    "id": "source.id",
    "name": "source.name",
    "age": "source.age"
}).execute()

4. Time Travel

# Query historical version
historical_df = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/users-delta")

# Query by timestamp
timestamp_df = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/tmp/users-delta")

# View history
delta_table.history().show()

Part 2: Apache Iceberg Quickstart

Installation

# Using pip (pyiceberg is the Python-native client; the Spark examples
# below only need pyspark plus the Iceberg Spark runtime JAR)
pip install pyspark pyiceberg

# Add the Iceberg Spark runtime JAR to Spark's classpath:
# download it from https://iceberg.apache.org/releases/ or pull it via
# spark.jars.packages, e.g. org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0
# (match the artifact to your Spark and Scala versions)

Your First Iceberg Table

from pyspark.sql import SparkSession

# Create Spark session with Iceberg support.
# spark.jars.packages pulls the Iceberg runtime at startup; adjust the
# artifact to your Spark/Scala/Iceberg versions, or put a downloaded JAR
# on the classpath instead.
spark = SparkSession.builder \
    .appName("IcebergQuickstart") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse") \
    .getOrCreate()

# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])

# Create Iceberg table
df.writeTo("local.db.users").create()

# Read Iceberg table
iceberg_df = spark.table("local.db.users")
iceberg_df.show()

Key Iceberg Operations

1. Update Records

spark.sql("""
    UPDATE local.db.users
    SET age = age + 1
    WHERE age < 30
""")

2. Delete Records

spark.sql("DELETE FROM local.db.users WHERE id = 2")

3. Upsert (MERGE)

spark.sql("""
    MERGE INTO local.db.users AS target
    USING (
        SELECT 2 AS id, 'Bob' AS name, 31 AS age
        UNION ALL
        SELECT 4 AS id, 'Diana' AS name, 28 AS age
    ) AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

4. Time Travel

# Query by snapshot ID (list IDs with: SELECT * FROM local.db.users.snapshots)
historical_df = spark.read.format("iceberg") \
    .option("snapshot-id", "1234567890") \
    .load("local.db.users")

# Query by timestamp (milliseconds since the Unix epoch)
timestamp_df = spark.read.format("iceberg") \
    .option("as-of-timestamp", "1672531200000") \
    .load("local.db.users")

# View history
spark.sql("SELECT * FROM local.db.users.history").show()

Common Patterns

Pattern 1: Incremental Data Loading

Delta Lake

from delta.tables import DeltaTable

# Read new data
new_data = spark.read.parquet("s3://bucket/new-data/")

# Append to Delta table
new_data.write.format("delta").mode("append").save("/path/to/delta")

Iceberg

# Read new data
new_data = spark.read.parquet("s3://bucket/new-data/")

# Append to Iceberg table
new_data.writeTo("local.db.users").append()

Pattern 2: Change Data Capture (CDC)

Delta Lake (Built-in CDC)

# Enable CDC
spark.sql("ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes between versions
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 1) \
    .option("endingVersion", 3) \
    .load("/path/to/table")

changes.show()

Iceberg (Query-based CDC)

# Incremental read: rows appended between two snapshots
# (snapshot IDs come from the local.db.users.snapshots metadata table)
changes = spark.read.format("iceberg") \
    .option("start-snapshot-id", "1234567890") \
    .option("end-snapshot-id", "5678901234") \
    .load("local.db.users")

changes.show()

# For row-level change records (inserts/updates/deletes), newer Iceberg
# releases also provide the create_changelog_view stored procedure.

Pattern 3: Data Compaction

Delta Lake

# Optimize table
spark.sql("OPTIMIZE delta.`/path/to/table`")

# Z-order by frequently queried columns
spark.sql("OPTIMIZE delta.`/path/to/table` ZORDER BY (date, user_id)")

# Clean up old files
spark.sql("VACUUM delta.`/path/to/table` RETAIN 168 HOURS")

Iceberg

from datetime import datetime, timedelta

# Rewrite small files (compaction) via the rewrite_data_files procedure
spark.sql("""
    CALL local.system.rewrite_data_files(
        table => 'db.users',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire snapshots older than 7 days
cutoff = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL local.system.expire_snapshots(
        table => 'db.users',
        older_than => TIMESTAMP '{cutoff}'
    )
""")

Performance Best Practices

For Both Technologies

  1. Partition Wisely: Choose partition columns based on query patterns (see the sketch after this list)
  2. Monitor Small Files: Compact regularly to avoid performance degradation
  3. Use Statistics: Both formats collect statistics; leverage them in queries
  4. Enable Caching: Cache frequently accessed data
  5. Optimize Schema: Use appropriate data types
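
As a sketch of item 1, here is how partitioned writes look in each format, assuming a hypothetical events dataset with an event_date column and a Spark session configured as in the quickstarts above (paths and table names are illustrative): Delta Lake sets partitioning at write time with partitionBy, while Iceberg declares the partition spec on the table itself.

from pyspark.sql.functions import col

# Hypothetical events data (illustrative only)
events = spark.createDataFrame(
    [(1, "2024-01-01", "click"), (2, "2024-01-02", "view")],
    ["id", "event_date", "action"],
)

# Delta Lake: partition the files by a column at write time
events.write.format("delta") \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .save("/tmp/events-delta")

# Iceberg: declare the partition spec on the table itself
events.writeTo("local.db.events") \
    .partitionedBy(col("event_date")) \
    .create()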

Delta Lake Specific

  1. Use Z-Ordering: For multi-dimensional queries
  2. Enable Auto-Optimize: In Databricks environments (see the sketch after this list)
  3. Leverage Data Skipping: Ensure proper statistics collection
  4. Enable CDC: Only when needed (adds overhead)
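
A minimal sketch of item 2, assuming a Databricks workspace where the auto-optimize table properties are available (they are Databricks-specific and have no effect in plain OSS Spark; the path is illustrative):

# Enable optimized writes and auto compaction on an existing Delta table
spark.sql("""
    ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")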

Iceberg Specific

  1. Use Hidden Partitioning: Queries are pruned automatically without filtering on derived partition columns (see the sketch after this list)
  2. Configure Snapshot Retention: Balance history vs. storage
  3. Optimize Metadata: Use table properties effectively
  4. Choose Write Mode: Copy-on-Write vs. Merge-on-Read
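
A sketch of items 1 and 4, assuming the Iceberg quickstart session above and an illustrative table with a ts timestamp column: the days transform gives hidden daily partitioning, and the write-mode table properties switch row-level operations between copy-on-write and merge-on-read.

from pyspark.sql.functions import col, days, current_timestamp

# Hidden partitioning: partition by day(ts) without exposing a date column
events = spark.range(3).withColumn("ts", current_timestamp())
events.writeTo("local.db.events_by_day") \
    .partitionedBy(days(col("ts"))) \
    .create()

# Queries that filter on ts are pruned automatically
spark.table("local.db.events_by_day") \
    .where("ts >= TIMESTAMP '2024-01-01 00:00:00'") \
    .show()

# Write mode: choose copy-on-write or merge-on-read per operation
spark.sql("""
    ALTER TABLE local.db.events_by_day SET TBLPROPERTIES (
        'write.update.mode' = 'merge-on-read',
        'write.delete.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")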

Troubleshooting

Common Issues

Issue: “Delta table not found”

Solution: Ensure Delta Lake extensions are configured in SparkSession

.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

Issue: “Iceberg table already exists”

Solution: Use createOrReplace() or check if table exists first

df.writeTo("local.db.users").createOrReplace()
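
If you prefer to check first, here is a small sketch assuming a recent PySpark version (3.4+) where spark.catalog.tableExists accepts a qualified name, reusing the df from the quickstart:

# Create the table only if it does not already exist, otherwise append
if not spark.catalog.tableExists("local.db.users"):
    df.writeTo("local.db.users").create()
else:
    df.writeTo("local.db.users").append()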

Issue: Slow queries

Solution: Check partitioning and run compaction

# Delta: compact small files
spark.sql("OPTIMIZE delta.`/path/to/table`")

# Iceberg: compact small files
spark.sql("CALL local.system.rewrite_data_files(table => 'db.users')")

Next Steps

After completing this tutorial, explore:

  1. Advanced Features: schema evolution, hidden partitioning transforms, and change data feeds
  2. Production Patterns: scheduled compaction, snapshot retention, and incremental loading
  3. Hands-on Practice: rebuild the examples above against your own data in a dev environment

Resources

Documentation

  • Delta Lake: https://docs.delta.io/
  • Apache Iceberg: https://iceberg.apache.org/docs/latest/

Contributing

Found an issue or have improvements? See our Contributing Guide!


Last Updated: 2025-11-14
Maintainers: Community

Estimated time: 30 minutes | Level: Beginner