Basic Apache Iceberg Table Creation Recipe
Overview
This recipe demonstrates how to create a basic Apache Iceberg table from scratch using PySpark. It showcases Iceberg’s key differentiators like hidden partitioning and multi-catalog support.
What You’ll Learn
- How to configure Spark for Iceberg
- Creating tables with Iceberg catalog
- Reading and querying Iceberg tables
- Understanding Iceberg’s snapshot system
- Working with hidden partitioning
Prerequisites
- Python 3.8 or later
- Apache Spark 3.3 or later
- Basic understanding of Apache Spark
- Familiarity with DataFrames
Quick Start
# Install dependencies
pip install -r requirements.txt
# Download Iceberg Spark Runtime (if not already available)
# Version should match your Spark version
# Example for Spark 3.3:
# wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.0/iceberg-spark-runtime-3.3_2.12-1.4.0.jar
# Run the solution
python solution.py
# Validate the recipe
./validate.sh
Recipe Structure
basic-iceberg-table/
├── problem.md # Detailed problem description
├── solution.py # Complete, commented solution
├── requirements.txt # Python dependencies
├── validate.sh # Automated validation script
└── README.md # This file
Expected Output
When you run the solution, you’ll see:
- Spark session initialization with Iceberg configuration
- Sample data creation (5 users)
- Iceberg table creation with catalog
- Table statistics and schema
- SQL query demonstration
- Snapshot metadata display
- Hidden partitioning example
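For orientation, here is a minimal sketch of how the sample-data step might look. The column names and values are illustrative assumptions, not necessarily what solution.py uses:

from pyspark.sql import Row

# Five illustrative user rows
users = [
    Row(user_id=i, username=f"user{i}", email=f"user{i}@example.com")
    for i in range(1, 6)
]
df = spark.createDataFrame(users)

# Write the DataFrame as an Iceberg table in the configured catalog
df.writeTo("local.db.users").create()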
Key Concepts Demonstrated
1. Iceberg Catalog Configuration
spark = SparkSession.builder \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.local.type", "hadoop") \
.config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse") \
.getOrCreate()
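If the Iceberg runtime JAR is not already on the classpath, the session can also pull it from Maven at startup. The coordinates shown match the Quick Start example; adjust them to your Spark and Scala versions:

# Resolve the Iceberg runtime from Maven instead of supplying a local JAR
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.0")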
2. Creating Iceberg Tables
# Using the DataFrameWriterV2 writeTo API
df.writeTo("local.db.users").create()
# Using SQL
spark.sql("""
CREATE TABLE local.db.users (
user_id INT,
username STRING,
email STRING
) USING iceberg
""")
3. Hidden Partitioning
# Partition by day transformation
spark.sql("""
CREATE TABLE local.db.events (
event_time TIMESTAMP,
user_id STRING
)
USING iceberg
PARTITIONED BY (days(event_time))
""")
4. Accessing Metadata
# View snapshots
spark.sql("SELECT * FROM local.db.users.snapshots").show()
# View files
spark.sql("SELECT * FROM local.db.users.files").show()
Architecture Diagram
graph TB
A[Sample Data] --> B[DataFrame]
B --> C[Iceberg Writer]
C --> D[Data Files]
C --> E[Metadata Layer]
E --> F[manifest-list.avro]
E --> G[manifest.avro]
E --> H[metadata.json]
D --> I[Parquet/ORC/Avro Files]
F --> J[Iceberg Table]
G --> J
H --> J
I --> J
J --> K[Multi-Engine Access]
K --> L[Spark]
K --> M[Trino]
K --> N[Flink]
Iceberg vs Delta Comparison
| Feature | Delta Lake | Apache Iceberg (This Recipe) |
|---|---|---|
| Catalog | File-based | Catalog-based (Hive, Nessie, etc.) |
| Partitioning | Explicit | Hidden with transforms |
| Multi-Engine | Good | Excellent |
| Metadata | JSON transaction log | JSON table metadata + Avro manifest files |
Advanced Usage
Using Different Catalogs
# AWS Glue Catalog
.config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
.config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
# Hive Catalog
.config("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.hive.type", "hive")
.config("spark.sql.catalog.hive.uri", "thrift://localhost:9083")
Partition Evolution
# Start with one partition strategy
spark.sql("""
CREATE TABLE local.db.orders (
order_time TIMESTAMP,
amount DECIMAL
)
USING iceberg
PARTITIONED BY (days(order_time))
""")
# Later, add another partition field without rewriting data
spark.sql("""
ALTER TABLE local.db.orders
ADD PARTITION FIELD bucket(16, order_id)
""")
Validation
The validate.sh script automatically:
- Checks Python installation
- Installs dependencies if needed
- Runs the solution
- Verifies Iceberg table structure
- Confirms metadata creation
- Reports success/failure
Common Issues
Issue: Iceberg JAR not found
Solution: Download and add Iceberg Spark Runtime JAR
# For Spark 3.3
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.0/iceberg-spark-runtime-3.3_2.12-1.4.0.jar
# Add to spark-submit
spark-submit --jars iceberg-spark-runtime-3.3_2.12-1.4.0.jar solution.py
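Alternatively, let spark-submit resolve the runtime from Maven Central rather than downloading the JAR by hand:

# Same version rule applies: match your Spark and Scala versions
spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.0 solution.py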
Issue: Catalog not configured
Solution: Ensure catalog configuration matches your environment
# For local testing, use hadoop catalog
.config("spark.sql.catalog.local.type", "hadoop")
# For production, use appropriate catalog (Hive, Glue, Nessie)
Issue: Table already exists
Solution: Use createOrReplace() or drop the table first
df.writeTo("local.db.users").createOrReplace()
# Or
spark.sql("DROP TABLE IF EXISTS local.db.users")
Next Steps
After mastering this basic recipe, explore:
- Advanced Operations: MERGE, UPDATE, DELETE
- Time Travel: Query historical snapshots (see the sketch below)
- Partition Evolution: Change partitioning strategy
- Multi-Engine: Query with Trino, Flink, Dremio
- Table Maintenance: Compaction, snapshot expiration
- Catalog Integration: AWS Glue, Hive Metastore, Nessie
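As a quick taste of the time-travel item above, tables created in this recipe can be queried at an earlier state directly from Spark SQL (the snapshot ID below is a placeholder; take a real one from the snapshots metadata table):

# Query the table as of a specific snapshot
spark.sql("SELECT * FROM local.db.users VERSION AS OF 1234567890123456789").show()

# Or as of a point in time
spark.sql("SELECT * FROM local.db.users TIMESTAMP AS OF '2024-01-01 00:00:00'").show()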
Contributing
Found a bug or have an improvement? Please:
- Open an issue describing the problem
- Submit a PR with your fix
- Ensure validation passes
License
This recipe is part of the Delta Lake & Apache Iceberg Knowledge Hub, licensed under Apache 2.0.