# Basic Delta Table Creation Recipe

## Overview
This recipe demonstrates how to create a basic Delta Lake table from scratch using PySpark. It’s the perfect starting point for anyone new to Delta Lake.
## What You’ll Learn
- How to configure Spark for Delta Lake
- Creating sample data with proper schema
- Writing data in Delta format
- Reading and querying Delta tables
- Accessing Delta table history (time travel)
## Prerequisites
- Python 3.8 or later
- Basic understanding of Apache Spark
- Familiarity with DataFrames
## Quick Start

```bash
# Install dependencies
pip install -r requirements.txt

# Run the solution
python solution.py

# Validate the recipe
./validate.sh
```
## Recipe Structure

```
basic-delta-table/
├── problem.md        # Detailed problem description
├── solution.py       # Complete, commented solution
├── requirements.txt  # Python dependencies
├── validate.sh       # Automated validation script
└── README.md         # This file
```
## Expected Output
When you run the solution, you’ll see:
- Spark session initialization
- Sample data creation (5 users; sketched below)
- Delta table creation
- Table statistics and schema
- Sample data display
- SQL query demonstration
- Table history (time travel metadata)
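
The sample-data step can be sketched as follows. The schema and rows here are illustrative only; `solution.py` may use different column names and values:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema; the actual columns in solution.py may differ
schema = StructType([
    StructField("user_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
])

# Five illustrative rows matching the "5 users" in the expected output
users = [
    (1, "Alice", "Austin"),
    (2, "Bob", "Berlin"),
    (3, "Carol", "Chicago"),
    (4, "Dave", "Denver"),
    (5, "Eve", "Edinburgh"),
]

df = spark.createDataFrame(users, schema=schema)
df.printSchema()
df.show()
```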
## Key Concepts Demonstrated

### 1. Spark Configuration for Delta Lake

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
```
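
These two options register Delta's SQL extension and catalog, but the Delta jars still need to be on Spark's classpath. If Delta Lake is installed via the `delta-spark` pip package (as in the install command under Common Issues), one common way to wire this up is the `configure_spark_with_delta_pip` helper; this is a sketch, not necessarily how `solution.py` builds its session:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("basic-delta-table")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# Adds the Delta Lake jars from the pip package to the Spark session
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```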
### 2. Writing in Delta Format

```python
df.write \
    .format("delta") \
    .mode("overwrite") \
    .save(table_path)
```
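
`mode("overwrite")` replaces the table contents. Other modes and options follow the same pattern; for example (the `new_df` DataFrame and `country` column below are hypothetical):

```python
# Append rows to an existing Delta table (the schema must be compatible)
new_df.write.format("delta").mode("append").save(table_path)

# Or create the table partitioned by a column (hypothetical "country" column)
df.write.format("delta").mode("overwrite").partitionBy("country").save(table_path)
```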
### 3. Reading Delta Tables

```python
df = spark.read.format("delta").load(table_path)
```
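
The SQL query demonstration mentioned under Expected Output typically works by registering the DataFrame as a view, or by querying the path directly; a minimal sketch (the `users` view name is arbitrary):

```python
# Query via a temporary view registered from the DataFrame
df.createOrReplaceTempView("users")
spark.sql("SELECT COUNT(*) AS user_count FROM users").show()

# Or query the Delta table directly by path
spark.sql(f"SELECT * FROM delta.`{table_path}`").show()
```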
### 4. Accessing Table History

```python
spark.sql(f"DESCRIBE HISTORY delta.`{table_path}`").show(truncate=False)
```
## Validation

The `validate.sh` script automatically:
- Checks Python installation
- Installs dependencies if needed
- Runs the solution
- Verifies Delta table structure
- Confirms transaction log creation
- Reports success/failure
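
`validate.sh` itself is a shell script, but the structure and transaction-log checks above boil down to confirming that Parquet data files and a `_delta_log` folder exist. A rough Python equivalent, assuming the table is written to a local `./delta-table` directory (the actual path in `solution.py` may differ):

```python
import glob
import os

table_path = "./delta-table"  # hypothetical path; adjust to match solution.py

# A Delta table directory holds Parquet data files plus a _delta_log folder
has_parquet = bool(glob.glob(os.path.join(table_path, "*.parquet")))
first_commit = os.path.join(table_path, "_delta_log", f"{0:020d}.json")

if has_parquet and os.path.isfile(first_commit):
    print("Delta table structure looks valid")
else:
    print("Validation failed: missing data files or transaction log")
```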
## Architecture Diagram

```mermaid
graph LR
    A[Sample Data] --> B[DataFrame]
    B --> C[Delta Writer]
    C --> D[Parquet Files]
    C --> E[_delta_log/]
    E --> F[00000.json]
    D --> G[Delta Table]
    E --> G
    G --> H[Time Travel]
    G --> I[ACID Transactions]
    G --> J[Schema Enforcement]
```
## Next Steps

After mastering this basic recipe, explore:

- **Updates and Deletes**: Learn MERGE operations (previewed in the sketch below)
- **Time Travel**: Query historical versions
- **Partitioning**: Improve query performance
- **Optimization**: Use OPTIMIZE and Z-ORDER
- **Change Data Feed**: Enable CDC capabilities
- **Concurrent Writes**: Handle multi-writer scenarios
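
As a preview of the updates-and-deletes recipe, an upsert with the `DeltaTable` API looks roughly like this (`updates_df` and the `user_id` join key are hypothetical):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, table_path)

# Upsert: update rows that match on the key, insert the rest
(
    target.alias("t")
    .merge(updates_df.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```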
## Common Issues

**Issue:** PySpark not found
**Solution:** Run `pip install pyspark delta-spark`

**Issue:** Java not installed
**Solution:** Install Java 8 or 11 (required by Spark)

**Issue:** Permission denied on validate.sh
**Solution:** Run `chmod +x validate.sh`
## Contributing
Found a bug or have an improvement? Please:
- Open an issue describing the problem
- Submit a PR with your fix
- Ensure validation passes
## License
This recipe is part of the Delta Lake & Apache Iceberg Knowledge Hub, licensed under Apache 2.0.