Getting Started with Delta Lake and Apache Iceberg
This tutorial provides a hands-on introduction to both Delta Lake and Apache Iceberg, helping you understand when and how to use each technology.
Overview
Both Delta Lake and Apache Iceberg are open-source table formats that bring ACID transactions, schema evolution, and time travel capabilities to data lakes. They transform collections of Parquet files into reliable, transactional data stores.
Prerequisites
- Basic understanding of data lakes and Parquet files
- Familiarity with Apache Spark or another query engine
- Access to a development environment (local or cloud)
- Java 8 or 11 installed (for Spark)
Choosing Between Delta Lake and Iceberg
Use this decision tree to help choose the right technology for your needs:
graph TD
A[Start] --> B{Primary compute engine?}
B -->|Databricks| C[Delta Lake]
B -->|Apache Spark| D{Need multi-engine support?}
B -->|Apache Flink| E[Apache Iceberg]
B -->|Trino/Presto| E
D -->|Yes| E
D -->|No| F{Which features are critical?}
F -->|Z-ordering, CDC| C
F -->|Hidden partitioning| E
F -->|Either works| G[Choose based on team expertise]
C --> H[Implement Delta Lake]
E --> I[Implement Apache Iceberg]
G --> J[Start with Delta Lake for Spark]
Part 1: Delta Lake Quickstart
Installation
# Using pip
pip install pyspark delta-spark
# Using conda
conda install -c conda-forge pyspark delta-spark
Your First Delta Table
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Create Spark session with Delta support; configure_spark_with_delta_pip
# puts the Delta Lake jars from the delta-spark pip package on the classpath
builder = SparkSession.builder \
    .appName("DeltaQuickstart") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Write as Delta table
df.write.format("delta").mode("overwrite").save("/tmp/users-delta")
# Read Delta table
delta_df = spark.read.format("delta").load("/tmp/users-delta")
delta_df.show()
Key Delta Lake Operations
1. Update Records
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/tmp/users-delta")
# Update records
delta_table.update(
condition = "age < 30",
set = {"age": "age + 1"}
)
2. Delete Records
delta_table.delete("id = 2")
3. Upsert (MERGE)
# New data
new_data = [(2, "Bob", 31), (4, "Diana", 28)]
new_df = spark.createDataFrame(new_data, ["id", "name", "age"])
# Merge
delta_table.alias("target").merge(
new_df.alias("source"),
"target.id = source.id"
).whenMatchedUpdate(set = {
"name": "source.name",
"age": "source.age"
}).whenNotMatchedInsert(values = {
"id": "source.id",
"name": "source.name",
"age": "source.age"
}).execute()
4. Time Travel
# Query historical version
historical_df = spark.read.format("delta") \
.option("versionAsOf", 0) \
.load("/tmp/users-delta")
# Query by timestamp
timestamp_df = spark.read.format("delta") \
.option("timestampAsOf", "2024-01-01") \
.load("/tmp/users-delta")
# View history
delta_table.history().show()
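Time travel pairs naturally with restore when a bad write needs to be rolled back. A minimal sketch using the table created above (restoreToVersion is available in recent Delta Lake releases):
# Roll the table back to version 0 (the initial write)
delta_table.restoreToVersion(0)
# Equivalent SQL form
spark.sql("RESTORE TABLE delta.`/tmp/users-delta` TO VERSION AS OF 0")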
Part 2: Apache Iceberg Quickstart
Installation
# Using pip (pyiceberg is the standalone Python client; Spark itself needs the Iceberg runtime jar)
pip install pyspark pyiceberg
# Add the Iceberg runtime jar to Spark via spark.jars.packages (shown below)
# or download it from: https://iceberg.apache.org/releases/
Your First Iceberg Table
from pyspark.sql import SparkSession

# Create Spark session with Iceberg support; spark.jars.packages pulls the
# iceberg-spark-runtime artifact (adjust the Spark/Scala/Iceberg versions to your setup)
spark = SparkSession.builder \
    .appName("IcebergQuickstart") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse") \
    .getOrCreate()
# Create sample data
data = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
df = spark.createDataFrame(data, ["id", "name", "age"])
# Create Iceberg table
df.writeTo("local.db.users").create()
# Read Iceberg table
iceberg_df = spark.table("local.db.users")
iceberg_df.show()
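Schema evolution, one of the capabilities highlighted in the overview, is plain DDL on an Iceberg table. A minimal sketch (the email column is hypothetical):
# Add an optional column; existing data files are not rewritten
spark.sql("ALTER TABLE local.db.users ADD COLUMN email STRING")
spark.table("local.db.users").printSchema()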
Key Iceberg Operations
1. Update Records
spark.sql("""
UPDATE local.db.users
SET age = age + 1
WHERE age < 30
""")
2. Delete Records
spark.sql("DELETE FROM local.db.users WHERE id = 2")
3. Upsert (MERGE)
spark.sql("""
MERGE INTO local.db.users AS target
USING (
SELECT 2 AS id, 'Bob' AS name, 31 AS age
UNION ALL
SELECT 4 AS id, 'Diana' AS name, 28 AS age
) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
4. Time Travel
# Query by snapshot ID
historical_df = spark.read \
.option("snapshot-id", "1234567890") \
.table("local.db.users")
# Query by timestamp (milliseconds since the Unix epoch)
timestamp_df = spark.read \
.option("as-of-timestamp", "1672531200000") \
.table("local.db.users")
# View history
spark.sql("SELECT * FROM local.db.users.history").show()
Common Patterns
Pattern 1: Incremental Data Loading
Delta Lake
# Read new data
new_data = spark.read.parquet("s3://bucket/new-data/")
# Append to Delta table
new_data.write.format("delta").mode("append").save("/path/to/delta")
Iceberg
# Read new data
new_data = spark.read.parquet("s3://bucket/new-data/")
# Append to Iceberg table
new_data.writeTo("local.db.users").append()
Pattern 2: Change Data Capture (CDC)
Delta Lake (Built-in CDC)
# Enable CDC
spark.sql("ALTER TABLE delta.`/path/to/table` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
# Read changes between versions
changes = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 1) \
.option("endingVersion", 3) \
.load("/path/to/table")
changes.show()
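Each change-feed row carries metadata columns (_change_type, _commit_version, _commit_timestamp); downstream consumers typically filter on them, for example:
# Keep only inserted rows and post-update images
changes.filter("_change_type IN ('insert', 'update_postimage')").show()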
Iceberg (Changelog View)
# Build a changelog view over a snapshot range (Iceberg Spark procedure), then query it
spark.sql("""
    CALL local.system.create_changelog_view(
        table => 'db.users',
        changelog_view => 'users_changes',
        options => map('start-snapshot-id', '1234567890')
    )
""")
spark.sql("SELECT * FROM users_changes").show()
Pattern 3: Data Compaction
Delta Lake
# Optimize table
spark.sql("OPTIMIZE delta.`/path/to/table`")
# Z-order by frequently queried columns
spark.sql("OPTIMIZE delta.`/path/to/table` ZORDER BY (date, user_id)")
# Clean up old files
spark.sql("VACUUM delta.`/path/to/table` RETAIN 168 HOURS")
Iceberg
from datetime import datetime, timedelta

# Rewrite small files (compaction) via the Iceberg Spark procedure
spark.sql("""
    CALL local.system.rewrite_data_files(
        table => 'db.users',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire snapshots older than 7 days to clean up unreferenced metadata and data files
cutoff = (datetime.now() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL local.system.expire_snapshots(
        table => 'db.users',
        older_than => TIMESTAMP '{cutoff}'
    )
""")
Performance Best Practices
For Both Technologies
- Partition Wisely: Choose partition columns based on query patterns (see the sketch after this list)
- Monitor Small Files: Compact regularly to avoid performance degradation
- Use Statistics: Both formats keep column-level min/max statistics; filter on those columns so files can be skipped
- Enable Caching: Cache frequently accessed data
- Optimize Schema: Use appropriate data types
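As an illustration of the partitioning point above, a minimal sketch of partitioned writes in both formats (the events DataFrame and its event_date / event_ts columns are hypothetical):
# 'events' is a hypothetical DataFrame with id, event_ts, event_date, payload columns
# Delta: partition the table by a date column at write time
events.write.format("delta") \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .save("/tmp/events-delta")
# Iceberg: declare a partition transform in DDL, then append
spark.sql("""
    CREATE TABLE local.db.events (id BIGINT, event_ts TIMESTAMP, event_date DATE, payload STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")
events.writeTo("local.db.events").append()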
Delta Lake Specific
- Use Z-Ordering: For multi-dimensional queries
- Enable Auto-Optimize: In Databricks environments
- Leverage Data Skipping: Ensure statistics are collected on the columns you filter by (see the sketch after this list)
- Enable CDC: Only when needed (adds overhead)
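A sketch of the table properties behind these points, assuming the Delta table from Part 1; the autoOptimize properties are honored on Databricks and recent open-source Delta releases:
# Reduce small files at write time and collect statistics only on the first 8 columns
spark.sql("""
    ALTER TABLE delta.`/tmp/users-delta` SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true',
        'delta.dataSkippingNumIndexedCols' = '8'
    )
""")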
Iceberg Specific
- Use Hidden Partitioning: Partition by transforms (e.g., days(ts)) so queries prune files without filtering on a separate partition column
- Configure Snapshot Retention: Balance history vs. storage
- Optimize Metadata: Use table properties effectively
- Choose Write Mode: Copy-on-Write vs. Merge-on-Read (see the sketch after this list)
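As an example of the write-mode choice, row-level operations can be switched from the default copy-on-write to merge-on-read with table properties (faster writes, more work at read time); a minimal sketch on the quickstart table:
# Switch DELETE/UPDATE/MERGE to merge-on-read
spark.sql("""
    ALTER TABLE local.db.users SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode' = 'merge-on-read'
    )
""")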
Troubleshooting
Common Issues
Issue: “Delta table not found”
Solution: Ensure Delta Lake extensions are configured in SparkSession
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
Issue: “Iceberg table already exists”
Solution: Use createOrReplace() or check if table exists first
df.writeTo("local.db.users").createOrReplace()
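To check first instead of replacing, Spark's catalog API can be used (tableExists accepts multi-part names in recent PySpark releases); a sketch:
# Create the table on first run, append afterwards
if spark.catalog.tableExists("local.db.users"):
    df.writeTo("local.db.users").append()
else:
    df.writeTo("local.db.users").create()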
Issue: Slow queries
Solution: Check partitioning and run compaction
# Delta
spark.sql("OPTIMIZE table_name")
# Iceberg
spark.sql("CALL local.system.rewrite_data_files(table => 'db.users')")
Next Steps
After completing this tutorial, explore:
- Advanced Features:
- Production Patterns:
- Hands-on Practice:
- Browse the Code Recipes Collection
- Try the Streaming CDC Pipeline
- Explore Time Series Forecasting
Resources
Documentation
Community
Learning
Contributing
Found an issue or have improvements? See our Contributing Guide!
Last Updated: 2025-11-14
Maintainers: Community