# Basic Delta Table Creation Recipe

## Overview
This recipe demonstrates how to create a basic Delta Lake table from scratch using PySpark. It’s the perfect starting point for anyone new to Delta Lake.
## What You’ll Learn
- How to configure Spark for Delta Lake
- Creating sample data with proper schema
- Writing data in Delta format
- Reading and querying Delta tables
- Accessing Delta table history (time travel)
## Prerequisites
- Python 3.8 or later
- Basic understanding of Apache Spark
- Familiarity with DataFrames
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run the solution
python solution.py

# Validate the recipe
./validate.sh
```
## Recipe Structure

```
basic-delta-table/
├── problem.md        # Detailed problem description
├── solution.py       # Complete, commented solution
├── requirements.txt  # Python dependencies
├── validate.sh       # Automated validation script
└── README.md         # This file
```
## Expected Output
When you run the solution, you’ll see:
- Spark session initialization
- Sample data creation (5 users; sketched below)
- Delta table creation
- Table statistics and schema
- Sample data display
- SQL query demonstration
- Table history (time travel metadata)
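
The sample-data step can be sketched as follows. The schema and rows here are illustrative only; `solution.py` may use different column names and values:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema; the actual columns in solution.py may differ
schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
])

# Five illustrative rows matching the "5 users" in the expected output
users = [
    (1, "Alice", "Austin"),
    (2, "Bob", "Berlin"),
    (3, "Carol", "Chicago"),
    (4, "Dave", "Denver"),
    (5, "Eve", "Edinburgh"),
]

df = spark.createDataFrame(users, schema=schema)
df.printSchema()
df.show()
```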
## Key Concepts Demonstrated

### 1. Spark Configuration for Delta Lake

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
```
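
These two options register Delta's SQL extension and catalog, but the Delta jars still need to be on Spark's classpath. If Delta Lake is installed via the `delta-spark` pip package (as in the install command under Common Issues), one common way to wire this up is the `configure_spark_with_delta_pip` helper; this is a sketch, not necessarily how `solution.py` builds its session:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("basic-delta-table")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Adds the Delta Lake jars from the pip package to the Spark session
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```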
### 2. Writing in Delta Format

```python
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save(table_path)
```
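
`mode("overwrite")` replaces the table contents. Other modes and options follow the same pattern; for example (the `new_df` DataFrame and `country` column below are hypothetical):

```python
# Append rows to an existing Delta table (the schema must be compatible)
new_df.write.format("delta").mode("append").save(table_path)

# Or create the table partitioned by a column (hypothetical "country" column)
df.write.format("delta").mode("overwrite").partitionBy("country").save(table_path)
```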
### 3. Reading Delta Tables

```python
df = spark.read.format("delta").load(table_path)
```
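
The SQL query demonstration mentioned under Expected Output typically works by registering the DataFrame as a view, or by querying the path directly; a minimal sketch (the `users` view name is arbitrary):

```python
# Query via a temporary view registered from the DataFrame
df.createOrReplaceTempView("users")
spark.sql("SELECT COUNT(*) AS user_count FROM users").show()

# Or query the Delta table directly by path
spark.sql(f"SELECT * FROM delta.`{table_path}`").show()
```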
### 4. Accessing Table History

```python
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)
```
## Validation

The `validate.sh` script automatically:
- Checks Python installation
- Installs dependencies if needed
- Runs the solution
- Verifies Delta table structure
- Confirms transaction log creation
- Reports success/failure
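
`validate.sh` itself is a shell script, but the structure and transaction-log checks above boil down to confirming that Parquet data files and a `_delta_log` folder exist. A rough Python equivalent, assuming the table is written to a local `./delta-table` directory (the actual path in `solution.py` may differ):

```python
import glob
import os

table_path = "./delta-table"  # hypothetical path; adjust to match solution.py

# A Delta table directory holds Parquet data files plus a _delta_log folder
has_parquet = bool(glob.glob(os.path.join(table_path, "*.parquet")))
first_commit = os.path.join(table_path, "_delta_log", f"{0:020d}.json")

if has_parquet and os.path.isfile(first_commit):
    print("Delta table structure looks valid")
else:
    print("Validation failed: missing data files or transaction log")
```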
## Architecture Diagram

```mermaid
graph LR
    A[Sample Data] --> B[DataFrame]
    B --> C[Delta Writer]
    C --> D[Parquet Files]
    C --> E[_delta_log/]
    E --> F[00000.json]
    D --> G[Delta Table]
    E --> G
    G --> H[Time Travel]
    G --> I[ACID Transactions]
    G --> J[Schema Enforcement]
```
## Next Steps

After mastering this basic recipe, explore:

- **Updates and Deletes**: Learn MERGE operations (previewed in the sketch below)
- **Time Travel**: Query historical versions
- **Partitioning**: Improve query performance
- **Optimization**: Use OPTIMIZE and Z-ORDER
- **Change Data Feed**: Enable CDC capabilities
- **Concurrent Writes**: Handle multi-writer scenarios
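
As a preview of the updates-and-deletes recipe, an upsert with the `DeltaTable` API looks roughly like this (`updates_df` and the `user_id` join key are hypothetical):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, table_path)

# Upsert: update rows that match on the key, insert the rest
(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```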
## Common Issues

**Issue:** PySpark not found
**Solution:** Run `pip install pyspark delta-spark`

**Issue:** Java not installed
**Solution:** Install Java 8 or 11 (required by Spark)

**Issue:** Permission denied on validate.sh
**Solution:** Run `chmod +x validate.sh`
## Contributing
Found a bug or have an improvement? Please:
- Open an issue describing the problem
- Submit a PR with your fix
- Ensure validation passes
## License
This recipe is part of the Delta Lake & Apache Iceberg Knowledge Hub, licensed under Apache 2.0.