Data Pipeline Overview

Data Pipeline

The DataPipeline class combines data sources, transformers, and validators into a cohesive data processing workflow.

Components

The pipeline consists of three types of components:

  • Sources fetch raw data from external APIs or files
  • Transformers modify the DataFrame (resample, create lags, convert timezone)
  • Validators check data quality without modifying the DataFrame
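The three roles above can be sketched as structural interfaces. This is a hypothetical sketch using typing.Protocol and pandas; epftoolbox2's actual base classes and method names may differ.

```python
from typing import Protocol, runtime_checkable
import pandas as pd

# Hypothetical interfaces for the three component roles; the real
# epftoolbox2 base classes may use different names and signatures.
@runtime_checkable
class Source(Protocol):
    def fetch(self, start: str, end: str) -> pd.DataFrame:
        """Return raw data with a UTC DatetimeIndex."""
        ...

@runtime_checkable
class Transformer(Protocol):
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Return a modified version of the frame."""
        ...

@runtime_checkable
class Validator(Protocol):
    def validate(self, df: pd.DataFrame) -> None:
        """Inspect the frame; raise if a quality check fails."""
        ...
```

The key distinction is in the return types: transformers return a DataFrame, while validators return nothing and only raise on failure, which is why validators can never alter the data.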

Basic Usage

from epftoolbox2.pipelines import DataPipeline
from epftoolbox2.data.sources import EntsoeSource, CalendarSource
from epftoolbox2.data.transformers import ResampleTransformer, TimezoneTransformer
from epftoolbox2.data.validators import NullCheckValidator
pipeline = (
    DataPipeline()
    .add_source(EntsoeSource(country_code="PL", api_key="...", type=["load", "price"]))
    .add_source(CalendarSource(country="PL", holidays="binary"))
    .add_transformer(ResampleTransformer(freq="1h"))
    .add_transformer(TimezoneTransformer(target_tz="Europe/Warsaw"))
    .add_validator(NullCheckValidator(columns=["load_actual", "price"]))
)

df = pipeline.run(start="2024-01-01", end="2024-06-01", cache=True)

Date Format Options

The start and end parameters accept multiple formats; the simplest is an ISO-8601 date string:

df = pipeline.run(start="2024-01-01", end="2024-06-01")
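Assuming run delegates parsing to pandas (an assumption, not confirmed here), other pandas-parsable inputs such as datetime objects or Timestamps would denote the same instant as the string form:

```python
import datetime as dt
import pandas as pd

# Hypothetical equivalent inputs, assuming pandas-style parsing of
# start/end; all three denote midnight on 2024-01-01.
assert pd.Timestamp("2024-01-01") == pd.Timestamp(dt.date(2024, 1, 1))
assert pd.Timestamp("2024-01-01") == pd.Timestamp("2024-01-01 00:00")
```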

Data Flow

  1. Sources fetch data in UTC

    All data sources output DataFrames with UTC DatetimeIndex.

  2. Transformers modify the data

    Apply resampling, create lags, then convert timezone.

  3. Validators check quality

    Verify data integrity without modifying the DataFrame.
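The three-stage flow above can be sketched as a minimal pipeline loop. This is an illustrative toy using pandas, not epftoolbox2's real internals: the class and method bodies here are assumptions about how the stages compose.

```python
import pandas as pd

# Toy illustration of the fetch -> transform -> validate flow;
# not the real DataPipeline implementation.
class MiniPipeline:
    def __init__(self):
        self.sources, self.transformers, self.validators = [], [], []

    def add_source(self, s):
        self.sources.append(s)
        return self

    def add_transformer(self, t):
        self.transformers.append(t)
        return self

    def add_validator(self, v):
        self.validators.append(v)
        return self

    def run(self, start, end):
        # 1. Fetch from every source (each returns a UTC-indexed frame)
        #    and join the results column-wise on the shared index.
        df = pd.concat([s.fetch(start, end) for s in self.sources], axis=1)
        # 2. Apply transformers in the order they were added.
        for t in self.transformers:
            df = t.transform(df)
        # 3. Run validators; they inspect the frame but never modify it.
        for v in self.validators:
            v.validate(df)
        return df
```

Ordering matters in step 2: because each transformer receives the previous transformer's output, resampling before lag creation before timezone conversion (as in the example above) produces different results than any other ordering.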

Pipeline Components

  • Data Sources
  • Transformers
  • Validators