Time Series Forecasting for Weather Prediction

Problem Statement

Weather forecasting is critical for agriculture, transportation, energy management, and disaster preparedness. Traditional weather models rely on complex physical simulations, but data-driven approaches using historical weather data can provide complementary insights and short-term predictions.

This example demonstrates building a time series forecasting system for weather prediction using statistical and machine learning approaches, focusing on:

  • Temperature forecasting using ARIMA and machine learning models
  • Feature engineering for temporal patterns
  • Model evaluation and comparison
  • Production-ready deployment considerations

Business Value

  • Agriculture: Optimize irrigation and crop protection schedules
  • Energy: Predict demand and optimize grid management
  • Transportation: Improve route planning and safety measures
  • Emergency Services: Better resource allocation for weather events

Key Challenges

  1. Seasonal patterns: Daily, weekly, and yearly cycles in weather data
  2. Non-stationarity: Weather patterns change over time (a quick diagnostic for this and the seasonal cycles is sketched after this list)
  3. Missing data: Sensor failures and data transmission issues
  4. Multiple variables: Temperature, humidity, wind speed, precipitation
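
The first two challenges can be checked before any modeling. Below is a minimal diagnostic sketch using pandas and statsmodels; the hourly temperature Series temps is an assumed stand-in for real data, not part of the example's code.

import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

def diagnose_series(temps: pd.Series) -> None:
    """Check stationarity and expose the daily cycle of an hourly series."""
    # Augmented Dickey-Fuller test: a p-value above 0.05 suggests
    # non-stationarity, i.e. ARIMA will likely need differencing (the "I").
    adf_stat, p_value, *_ = adfuller(temps.dropna())
    print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.3f}")

    # Decompose into trend, seasonal, and residual components;
    # period=24 targets the daily cycle in hourly data.
    decomposition = seasonal_decompose(temps.interpolate(), period=24)
    print(decomposition.seasonal.head(24))  # one full daily cycle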

Data Storage Options

The pipeline supports saving processed weather data to either:

Apache Iceberg (Local)

  • Default choice: Local Iceberg tables for fast, ACID-compliant storage
  • Benefits: Schema evolution, time travel, optimized for analytics
  • Location: /app/data/iceberg/

PostgreSQL Database

  • Alternative: Relational database storage with SQL access
  • Requirements: PostgreSQL service running (use docker-compose --profile with-postgres up)
  • Benefits: SQL queries, transactions, joins with other data

The system automatically tries Iceberg first, then falls back to PostgreSQL if available.
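
A minimal sketch of that fallback logic is shown below; the two writer functions are hypothetical stand-ins for the pipeline's actual Iceberg and PostgreSQL writers, not its real API.

import pandas as pd

def write_to_iceberg(df: pd.DataFrame, table: str) -> None:
    """Hypothetical stand-in for the pipeline's Iceberg writer."""
    raise RuntimeError("Iceberg catalog not configured")

def write_to_postgres(df: pd.DataFrame, table: str) -> None:
    """Hypothetical stand-in; in practice e.g. df.to_sql() via SQLAlchemy."""
    print(f"wrote {len(df)} rows to PostgreSQL table {table}")

def save_processed_data(df: pd.DataFrame) -> str:
    """Try Iceberg first, then fall back to PostgreSQL."""
    try:
        write_to_iceberg(df, "weather_forecasting_data")
        return "iceberg"
    except Exception as exc:
        print(f"Iceberg write failed ({exc}); falling back to PostgreSQL")
        write_to_postgres(df, "weather_forecasting_data")
        return "postgresql"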

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Weather Data  │    │  Feature         │    │   Forecasting   │
│   (Delta/CSV)   │───▶│  Engineering     │───▶│   Models        │
│                 │    │                  │    │                 │
│ • Temperature   │    │ • Temporal       │    │ • ARIMA         │
│ • Humidity      │    │ • Lag Features   │    │ • Random Forest │
│ • Wind Speed    │    │ • Rolling Stats  │    │ • XGBoost       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        │
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Evaluation    │    │   Comparison     │    │   Results       │
│   Metrics       │◀───│   & Selection    │───▶│   & Insights    │
│                 │    │                  │    │                 │
│ • MAE/RMSE      │    │ • Model Comp.    │    │ • Forecasts     │
│ • MAPE          │    │ • Best Model     │    │ • Visualizations│
│ • R² Score      │    │ • Performance    │    │ • Analysis      │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Key Components

1. Data Loading & Preprocessing

  • WeatherDataLoader: Handles data ingestion from Delta tables or CSV files
  • FeatureEngineer: Creates temporal features and lag variables
  • Data Validation: Ensures data quality and handles missing values

2. Forecasting Models

Statistical Models

  • ARIMA: AutoRegressive Integrated Moving Average for univariate forecasting
  • Auto ARIMA: Automatic parameter selection for ARIMA models (a minimal fit is sketched below)
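
The sketch below shows an auto-ARIMA fit with the pmdarima package; the toy series and the non-seasonal search settings are illustrative assumptions, not the project's defaults.

import numpy as np
import pandas as pd
import pmdarima as pm

# Toy hourly series standing in for real temperatures (assumption).
rng = np.random.default_rng(42)
temps = pd.Series(15 + 5 * np.sin(np.arange(480) * 2 * np.pi / 24)
                  + rng.normal(0, 0.5, 480))

# Search (p, d, q) by AIC. seasonal=False keeps the stepwise search fast on
# long hourly series; the daily cycle can be captured via features instead.
model = pm.auto_arima(temps, seasonal=False, stepwise=True,
                      suppress_warnings=True, error_action="ignore")
print(model.order)                       # selected (p, d, q)
forecast = model.predict(n_periods=24)   # 24-hour-ahead forecast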

Machine Learning Models

  • Random Forest: Ensemble method capturing non-linear relationships
  • XGBoost: Gradient boosting with excellent performance

3. Model Evaluation

  • MAE (Mean Absolute Error): Average absolute prediction error
  • RMSE (Root Mean Square Error): Square root of mean squared error
  • MAPE (Mean Absolute Percentage Error): Percentage-based error metric
  • R² Score: Proportion of variance explained by the model (all four metrics are computed in the sketch below)
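
All four metrics are available in scikit-learn; the sketch below uses tiny made-up arrays purely to illustrate the calls.

import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

y_true = np.array([15.2, 14.8, 14.5, 13.9])  # actual temperatures (made up)
y_pred = np.array([15.0, 15.1, 14.2, 14.3])  # model forecasts (made up)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))           # penalizes large errors
mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # unstable near 0°C
r2 = r2_score(y_true, y_pred)                                # variance explained
print(f"MAE={mae:.2f}°C RMSE={rmse:.2f}°C MAPE={mape:.1f}% R²={r2:.2f}")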

Running the Example

Prerequisites

pip install -r requirements.txt

Basic Demo

python solution.py

This will:

  1. Generate sample weather data (6 months of hourly temperature data)
  2. Train ARIMA, Random Forest, and XGBoost models
  3. Generate 24-hour ahead temperature forecasts
  4. Evaluate and compare model performance
  5. Display results and visualizations

Run inside Docker

The repo ships with a Docker image that contains all Python, Java, and system-level dependencies (CmdStan, PySpark, TensorFlow, XGBoost, Delta Lake, Iceberg, PostgreSQL JDBC, etc.). Build it from this directory:

docker build -t weather-forecasting .

Run the example (default entrypoint executes solution.py):

docker run --rm -it weather-forecasting

To execute a different script/command, override the container command:

docker run --rm -it weather-forecasting python validate.py

Data Fetching and Storage

Option 1: Fetch Real Weather Data

# Fetch data for a single city
docker run --rm -v $(pwd)/data:/app/data weather-forecasting \
  python fetch_weather_data.py --city "New York" --days 365

# Fetch data for multiple cities concurrently
docker run --rm -v $(pwd)/data:/app/data weather-forecasting \
  python fetch_weather_data.py --cities "New York" "London" "Tokyo" --days 180

Option 2: Use Docker Compose (with PostgreSQL)

# Run with PostgreSQL database
docker-compose --profile with-postgres up

# Run data fetcher separately
docker-compose --profile fetch-data up

Data Storage: The pipeline automatically saves processed data to:

  • Iceberg (default): Local table at /app/data/iceberg/weather_forecasting_data
  • PostgreSQL (fallback): Database table if Postgres service is running

Resource tip: allocate at least 4 CPU cores and 8 GB RAM to Docker Desktop—the Prophet/CmdStan toolchain, PySpark, and TensorFlow can otherwise run out of memory.

Configuration Options

Data Configuration

# Forecasting parameters
forecast_horizon = 24  # Hours ahead to forecast
train_test_split = 0.8  # Train/test split ratio

# Feature engineering
lag_features = [1, 24, 168]  # Lag features in hours
rolling_windows = [6, 12, 24]  # Rolling statistics windows
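
The sketch below shows how such settings typically translate into model features with pandas; the function and column names are illustrative, not the module's actual API.

import pandas as pd

def add_time_features(df: pd.DataFrame,
                      lags=(1, 24, 168),
                      windows=(6, 12, 24)) -> pd.DataFrame:
    """Build lag and rolling-statistic features for an hourly series."""
    out = df.copy()
    for lag in lags:                      # 1 hour, 1 day, 1 week back
        out[f"temp_lag_{lag}"] = out["temperature"].shift(lag)
    for w in windows:                     # short-horizon rolling context
        roll = out["temperature"].rolling(window=w)
        out[f"temp_roll_mean_{w}"] = roll.mean()
        out[f"temp_roll_std_{w}"] = roll.std()
    return out.dropna()                   # drop rows lacking enough history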

Model Configuration

# ARIMA parameters (auto-selected if None)
arima_order = None  # (p, d, q) - set to None for auto-selection

# Random Forest parameters
rf_params = {
    'n_estimators': 100,
    'max_depth': 10,
    'random_state': 42
}

# XGBoost parameters
xgb_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'random_state': 42
}

Model Performance Analysis

Typical Results

Based on the demo data, you can expect:

Model          MAE (°C)  RMSE (°C)  MAPE   R²    Training Time
ARIMA          2.1       2.8        8.5%   0.82  ~2s
Random Forest  1.8       2.4        7.2%   0.87  ~5s
XGBoost        1.5       2.1        6.1%   0.91  ~8s

Model Selection Guidelines

  • Short-term (< 6h): ARIMA or XGBoost
  • Medium-term (6h-24h): XGBoost or Random Forest
  • Interpretability needed: ARIMA or Linear models
  • High accuracy needed: XGBoost with feature engineering
  • Fast inference required: ARIMA or simple tree models

Advanced Usage

Custom Data Loading

from solution import WeatherForecasting

# Initialize forecaster
forecaster = WeatherForecasting()

# Load your data
weather_df = forecaster.load_weather_data("path/to/your/weather_data.csv")

# Preprocess with custom features
processed_df = forecaster.create_features(weather_df, lags=[1, 24, 48])

Training Custom Models

# Train specific models
results = forecaster.train_models(
    processed_df,
    target_col='temperature',
    models=['arima', 'rf', 'xgb']
)

# Get forecasts
forecasts = forecaster.generate_forecasts(results, steps=24)

Model Evaluation

# Evaluate all models
evaluation = forecaster.evaluate_models(
    actual=weather_df['temperature'],
    forecasts=forecasts
)

# Print results
forecaster.print_evaluation_results(evaluation)

Production Deployment

Model Serving Options

1. Batch Processing (Daily Forecasts)

def generate_daily_forecast():
    # load_weather_data, save_forecast_to_delta, and the trained forecaster
    # are assumed to be provided by the surrounding application code.
    latest_data = load_weather_data()

    # Generate 24-hour forecast
    forecast = forecaster.generate_forecasts(latest_data, steps=24)

    # Save forecast results to the Delta table
    save_forecast_to_delta(forecast)

    return forecast

2. Real-time Updates

def update_forecast_model():
    # Retrain with the latest data (latest_weather_data and the
    # storage/serving helpers are assumed from the surrounding application).
    new_model = forecaster.train_best_model(latest_weather_data)

    # Save updated model
    save_model_to_storage(new_model)

    # Update serving endpoint
    update_model_endpoint(new_model)

Monitoring & Alerting

Key Metrics to Monitor

  • Prediction Accuracy: Track MAE/RMSE over time
  • Model Drift: Compare predictions vs actuals (a minimal check is sketched after this list)
  • Data Quality: Missing data rates, outliers
  • Model Performance: Training time, inference latency
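
A minimal drift check along these lines is sketched below; the threshold and function name are illustrative assumptions.

import numpy as np

def check_forecast_drift(actuals, forecasts, baseline_mae, tolerance=1.5):
    """Alert when recent MAE degrades well beyond the validation baseline."""
    errors = np.abs(np.asarray(actuals) - np.asarray(forecasts))
    recent_mae = float(errors.mean())
    if recent_mae > tolerance * baseline_mae:
        # In production this would page an on-call or trigger retraining.
        print(f"ALERT: MAE {recent_mae:.2f} exceeds "
              f"{tolerance}x baseline ({baseline_mae:.2f})")
    return recent_mae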

Best Practices

Data Quality

  1. Validate sensor data: Check for outliers and impossible values
  2. Handle missing data: Use appropriate imputation strategies
  3. Data versioning: Track changes in data distribution

Model Development

  1. Time series split: Use proper cross-validation for temporal data (see the sketch after this list)
  2. Feature importance: Understand what drives predictions
  3. Model updating: Regularly retrain with new data
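
For point 1, scikit-learn's TimeSeriesSplit keeps every fold in temporal order so the model never trains on data from the future; the toy arrays below are stand-ins for real features and targets.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # toy feature matrix (assumption)
y = np.arange(100, dtype=float)    # toy target (assumption)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Each fold trains on an expanding window of the past and tests on
    # the block immediately after it -- no shuffling across time.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]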

Production Considerations

  1. Model versioning: Track model changes and performance
  2. A/B testing: Compare new models against production
  3. Fallback strategies: Handle model failures gracefully

Troubleshooting

Common Issues

  1. Poor Model Performance
    • Check data quality and preprocessing
    • Verify feature engineering is appropriate
    • Consider longer training history
  2. ARIMA Not Converging
    • Try auto_arima for automatic parameter selection
    • Check for stationarity in the time series
    • Consider differencing or transformation
  3. Memory Issues with Large Datasets
    • Reduce rolling window sizes
    • Use fewer lag features
    • Consider sampling or aggregation

Next Steps

  • Advanced Models: Add LSTM or Prophet models
  • Multi-variable Forecasting: Predict multiple weather variables
  • Spatial Correlations: Include data from multiple weather stations
  • Ensemble Methods: Combine multiple models for better performance
  • Real-time Forecasting: Implement streaming predictions

Weather Data Sources

This example supports multiple free weather datasets for global city-level data.

1. Open-Meteo (Historical Weather API)

  • Coverage: Global, city-level, hourly/daily data
  • Variables: Temperature, humidity, precipitation, wind speed, pressure
  • Historical Data: Available from 1940 onwards
  • Usage: Free for non-commercial use, no API key required
  • Pros: Real-time access, comprehensive variables, global coverage
  • Cons: Rate limits apply, data fetching required at runtime

2. NOAA GSOD (Global Summary of the Day)

  • Coverage: 30,000+ weather stations worldwide
  • Variables: Daily temperature, precipitation, wind, visibility
  • Historical Data: Available from 1929 onwards
  • Usage: Completely free, downloadable CSV files
  • Pros: No rate limits, reliable government data
  • Cons: Daily summaries only, station-based (not exactly city centers)

Data Consumption Methods

Fetch fresh data when running the container:

# Fetch data for New York City (30 days)
docker run --rm -v $(pwd)/data:/app/data weather-forecasting \
  python fetch_weather_data.py --city "New York" --days 30

# Fetch NOAA data for a specific station
docker run --rm -v $(pwd)/data:/app/data weather-forecasting \
  python fetch_weather_data.py --source noaa-gsod --station "725030-14732" --year 2023

Build-time Data Download

To include sample data in the image at build time, uncomment the data download line in the Dockerfile.

Running with Docker

Prerequisites

  • Docker installed and running
  • At least 4GB RAM available for the container

Build the Image

docker build -t weather-forecasting .

Quick Start with Docker Compose

# Create required directories
mkdir -p data output

# Fetch sample data and run forecasting
docker-compose up --build

# Or fetch data separately first
docker-compose --profile fetch-data up data-fetcher

# Then run the forecasting
docker-compose up weather-forecasting

Manual Docker Commands

Run with Sample Data Generation

# Create data directory
mkdir -p data

# Run the demo (will generate sample data if none exists)
docker run --rm -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output weather-forecasting

Fetch Real Weather Data First

Fetch real data using the commands shown under "Data Consumption Methods" above, then run forecasting against it:

docker run --rm -v $(pwd)/data:/app/data -v $(pwd)/output:/app/output weather-forecasting

Interactive Development

# Run container with bash for development/debugging
docker run -it --rm -v $(pwd):/app -v $(pwd)/data:/app/data weather-forecasting bash

# Inside container, you can run:
python fetch_weather_data.py --help
python solution.py
python validate.py

Docker Image Details

  • Base Image: python:3.10-slim
  • Size: ~2.5GB (includes all ML libraries and Spark)
  • Java: OpenJDK 21 for PySpark
  • Python Packages: All dependencies listed in requirements.txt
  • Data Volume: /app/data for input data
  • Output Volume: /app/output for results and plots

Data Format Requirements

The weather data should be in CSV format with the following structure (a loading sketch follows the field list):

timestamp,temperature,humidity,precipitation,wind_speed
2023-01-01 00:00:00,15.2,65.5,0.0,3.2
2023-01-01 01:00:00,14.8,67.2,0.0,2.8
...

  • timestamp: ISO format datetime (YYYY-MM-DD HH:MM:SS)
  • temperature: Celsius degrees
  • humidity: Percentage (0-100)
  • precipitation: Millimeters
  • wind_speed: Meters per second
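
A minimal loading-and-sanity-check sketch for that format (the file name and bounds are illustrative assumptions):

import pandas as pd

df = pd.read_csv("weather_data.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Basic sanity checks for impossible sensor values.
assert df["humidity"].between(0, 100).all(), "humidity out of range"
assert (df["precipitation"] >= 0).all(), "negative precipitation"
assert df["temperature"].between(-60, 60).all(), "implausible temperature"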

Dataset Recommendations by Use Case

For Learning/Development

  • Open-Meteo with 30-90 days of data for quick testing
  • Cities: New York, London, Tokyo, Sydney

For Production Model Training

  • NOAA GSOD with 5-10 years of historical data
  • Multiple stations per city for robustness
  • Combine with Open-Meteo for recent data

For Global Analysis

  • ERA5 (if you need gridded data) or multiple NOAA stations
  • Cities from different climate zones for diverse training data

Troubleshooting Data Issues

Open-Meteo API Limits

  • Free tier: 10,000 requests/day
  • Solution: Cache data locally, use during off-peak hours

NOAA Data Gaps

  • Some stations have missing data periods
  • Solution: Use multiple nearby stations, interpolate missing values

Coordinate Accuracy

  • City coordinates may not match exact weather station locations
  • Solution: Use actual weather station coordinates from NOAA database