Data Pipeline
The DataPipeline class combines data sources, transformers, and validators into a cohesive data processing workflow.
Components
The pipeline consists of three types of components:
- Sources fetch raw data from external APIs or files
- Transformers modify the DataFrame (resample, create lags, convert timezone)
- Validators check data quality without modifying the DataFrame
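As a rough mental model of the three roles, they can be sketched as small interfaces. Note that these class names and method signatures (`fetch`, `transform`, `validate`) are illustrative assumptions, not epftoolbox2's actual base classes:

```python
# Illustrative sketch of the three component roles; the real epftoolbox2
# base classes and signatures may differ.
from abc import ABC, abstractmethod

import pandas as pd


class Source(ABC):
    """Fetches raw data; by contract the output has a UTC DatetimeIndex."""

    @abstractmethod
    def fetch(self, start, end) -> pd.DataFrame: ...


class Transformer(ABC):
    """Returns a modified copy of the DataFrame."""

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame: ...


class Validator(ABC):
    """Inspects the DataFrame and raises on bad data; never modifies it."""

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> None: ...


class ConstantSource(Source):
    """Toy source returning a constant hourly load series (illustration only)."""

    def fetch(self, start, end):
        idx = pd.date_range(start, end, freq="1h", tz="UTC")
        return pd.DataFrame({"load_actual": 100.0}, index=idx)
```

A custom component would subclass the matching role and plug into the pipeline the same way as the built-in ones.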
Basic Usage
```python
from epftoolbox2.pipelines import DataPipeline
from epftoolbox2.data.sources import EntsoeSource, CalendarSource
from epftoolbox2.data.transformers import ResampleTransformer, TimezoneTransformer
from epftoolbox2.data.validators import NullCheckValidator

pipeline = (
    DataPipeline()
    .add_source(EntsoeSource(country_code="PL", api_key="...", type=["load", "price"]))
    .add_source(CalendarSource(country="PL", holidays="binary"))
    .add_transformer(ResampleTransformer(freq="1h"))
    .add_transformer(TimezoneTransformer(target_tz="Europe/Warsaw"))
    .add_validator(NullCheckValidator(columns=["load_actual", "price"]))
)

df = pipeline.run(start="2024-01-01", end="2024-06-01", cache=True)
```

Date Format Options
The start and end parameters accept multiple formats:
```python
# ISO date strings
df = pipeline.run(start="2024-01-01", end="2024-06-01")

# Timezone-aware pandas Timestamps
import pandas as pd

df = pipeline.run(
    start=pd.Timestamp("2024-01-01", tz="UTC"),
    end=pd.Timestamp("2024-06-01", tz="UTC"),
)

# The "today" keyword, alone or mixed with other formats
df = pipeline.run(start="today", end="today")
df = pipeline.run(start="2024-01-01", end="today")
```

Data Flow
- Sources fetch data in UTC: all data sources output DataFrames with a UTC DatetimeIndex.
- Transformers modify the data: apply resampling, create lags, then convert the timezone.
- Validators check quality: verify data integrity without modifying the DataFrame.
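The three stages above can be re-created in a few lines. This is a toy sketch of the flow only, with assumed call signatures; the real `DataPipeline.run` additionally handles caching and error reporting:

```python
import pandas as pd


def run_flow(sources, transformers, validators, start, end):
    """Toy re-creation of the fetch -> transform -> validate flow."""
    # 1. Fetch: every source returns a UTC-indexed DataFrame; align on the index.
    df = pd.concat([fetch(start, end) for fetch in sources], axis=1)
    # 2. Transform: each step returns a new (possibly reshaped) frame.
    for transform in transformers:
        df = transform(df)
    # 3. Validate: checks raise on failure but never modify the frame.
    for validate in validators:
        validate(df)
    return df


# Toy components standing in for real sources/transformers/validators.
def load_source(start, end):
    idx = pd.date_range(start, end, freq="1h", tz="UTC")
    return pd.DataFrame({"load": range(len(idx))}, index=idx)


def to_warsaw(df):
    return df.tz_convert("Europe/Warsaw")


def no_nulls(df):
    if df.isna().any().any():
        raise ValueError("null values found")
```

The ordering matters: because validators run last, they see the data exactly as the model will.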
Pipeline Components
Data Sources
- EntsoeSource: ENTSO-E Transparency Platform data
- OpenMeteoSource: weather forecasts
- CalendarSource: holidays and calendar features
- CsvSource: local CSV files
Transformers
Transformers
- ResampleTransformer: resample to a regular frequency
- LagTransformer: create lagged features
- TimezoneTransformer: convert the index timezone
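To make the lag idea concrete, here is a minimal sketch of lag-feature creation with `pandas.Series.shift`; it illustrates what LagTransformer does conceptually, not its actual parameters:

```python
import pandas as pd


def add_lags(df, column, lags):
    """Add shifted copies of one column as new feature columns."""
    out = df.copy()  # leave the input frame untouched
    for lag in lags:
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out
```

Each lagged column starts with `lag` missing values, which is one reason a null-checking validator is typically configured only for the raw input columns.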