Timeseries Tutorial

This notebook demonstrates how to use the timeseries module to preprocess production data for analysis and to build expected energy models.

[20]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings("ignore")
[21]:
from pvops.timeseries import preprocess
from pvops.timeseries.models import linear, iec
from pvops.text2time import utils as t2t_utils, preprocess as t2t_preprocess
[22]:
example_prodpath = os.path.join('example_data', 'example_prod_with_covariates.csv')

Load in data

[23]:
prod_data = pd.read_csv(example_prodpath, on_bad_lines='skip', engine='python')

Production data is loaded with the following columns:

- date: timestamp
- randid: site ID
- generated_kW: generated energy
- expected_kW: expected energy
- irrad_poa_Wm2: plane-of-array irradiance
- temp_amb_C: ambient temperature
- wind_speed_ms: wind speed
- temp_mod_C: module temperature
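As a side note, if you are not using pvops' date converters (introduced later in this tutorial), the timestamp column of a frame with this layout could also be parsed directly with pandas. A minimal sketch on a hypothetical two-row sample mirroring the column layout above:

```python
import pandas as pd

# Hypothetical sample with the same column names as the tutorial data.
sample = pd.DataFrame({
    "date": ["2018-04-01 07:00:00", "2018-04-01 08:00:00"],
    "randid": ["R15", "R15"],
    "generated_kW": [0.475, 1332.547],
})

# Parse the timestamp strings into datetime64 values.
sample["date"] = pd.to_datetime(sample["date"])
assert sample["date"].dtype.kind == "M"  # "M" marks a datetime64 dtype
```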

[24]:
prod_data.head()
[24]:
date randid generated_kW expected_kW irrad_poa_Wm2 temp_amb_C wind_speed_ms temp_mod_C
0 2018-04-01 07:00:00 R15 0.475 0.527845 0.02775 16.570 4.2065 14.1270
1 2018-04-01 08:00:00 R15 1332.547 1685.979445 87.91450 16.998 4.1065 15.8610
2 2018-04-01 09:00:00 R15 6616.573 7343.981135 367.90350 20.168 4.5095 24.5745
3 2018-04-01 10:00:00 R15 8847.800 10429.876422 508.28700 21.987 4.9785 30.7740
4 2018-04-01 11:00:00 R15 11607.389 12981.228814 618.79450 23.417 4.6410 35.8695
[25]:
metadata = pd.DataFrame()
metadata['randid'] = ['R15', 'R10']
metadata['dcsize'] = [20000, 20000]
metadata.head()
[25]:
randid dcsize
0 R15 20000
1 R10 20000

Column dictionaries

Create production and metadata column dictionaries with the format {pvops variable: user-specific column name}. These establish the connection between the user's data columns and the pvops library.

[26]:
prod_col_dict = {'siteid': 'randid',
                 'timestamp': 'date',
                 'powerprod': 'generated_kW',
                 'irradiance': 'irrad_poa_Wm2',
                 'temperature': 'temp_amb_C',  # optional; used by one of the modeling structures
                 'baseline': 'IEC_pstep',  # new column: baseline expected energy, user-defined or calculated per IEC
                 'dcsize': 'dcsize',  # new column: system DC size, extracted from metadata
                 'compared': 'Compared',  # new column
                 'energy_pstep': 'Energy_pstep',  # new column
                 'capacity_normalized_power': 'capacity_normalized_power',  # new column
}

metad_col_dict = {'siteid': 'randid',
                  'dcsize': 'dcsize'}
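With the dictionaries defined, pvops functions (and your own code) look up column labels indirectly through the pvops variable names, so the same workflow applies regardless of how your columns are named. A minimal illustration on a hypothetical two-row frame:

```python
import pandas as pd

# Same mapping style as above: pvops variable -> user column name.
prod_col_dict = {"siteid": "randid", "timestamp": "date"}

df = pd.DataFrame({"randid": ["R15", "R10"],
                   "date": ["2018-04-01", "2018-04-02"]})

# Access the site column via the pvops variable name, not a hard-coded label.
site_col = prod_col_dict["siteid"]
sites = df[site_col].unique().tolist()
```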

Data Formatting

Use the prod_date_convert function to convert date information to Python datetime objects, and use prod_nadate_process to handle entries with no date information; here, pnadrop=True drops such entries.

[27]:
prod_data_converted = t2t_preprocess.prod_date_convert(prod_data, prod_col_dict)
prod_data_datena_d, _ = t2t_preprocess.prod_nadate_process(prod_data_converted, prod_col_dict, pnadrop=True)

Assign production data index to timestamp data, using column dictionary to translate to user columns.

[28]:
prod_data_datena_d.index = prod_data_datena_d[prod_col_dict['timestamp']]

min(prod_data_datena_d.index), max(prod_data_datena_d.index)
[28]:
(Timestamp('2018-04-01 07:00:00'), Timestamp('2019-03-31 18:00:00'))
[29]:
prod_data_datena_d
[29]:
date randid generated_kW expected_kW irrad_poa_Wm2 temp_amb_C wind_speed_ms temp_mod_C
date
2018-04-01 07:00:00 2018-04-01 07:00:00 R15 0.475 0.527845 0.02775 16.570000 4.2065 14.127000
2018-04-01 08:00:00 2018-04-01 08:00:00 R15 1332.547 1685.979445 87.91450 16.998000 4.1065 15.861000
2018-04-01 09:00:00 2018-04-01 09:00:00 R15 6616.573 7343.981135 367.90350 20.168000 4.5095 24.574500
2018-04-01 10:00:00 2018-04-01 10:00:00 R15 8847.800 10429.876422 508.28700 21.987000 4.9785 30.774000
2018-04-01 11:00:00 2018-04-01 11:00:00 R15 11607.389 12981.228814 618.79450 23.417000 4.6410 35.869500
... ... ... ... ... ... ... ... ...
2019-03-31 14:00:00 2019-03-31 14:00:00 R10 19616.000 20179.797303 967.10350 -4.129722 5.9935 35.860694
2019-03-31 15:00:00 2019-03-31 15:00:00 R10 19552.000 20149.254116 983.94500 -4.214444 5.1770 38.549306
2019-03-31 16:00:00 2019-03-31 16:00:00 R10 19520.000 20192.672439 1011.77950 -3.943611 4.2430 42.848889
2019-03-31 17:00:00 2019-03-31 17:00:00 R10 17968.000 19141.761319 895.76800 -4.124167 3.1885 42.321806
2019-03-31 18:00:00 2019-03-31 18:00:00 R10 11616.000 12501.159607 649.48700 -4.663056 2.2540 38.596944

8755 rows × 8 columns

Data Preprocessing

Preprocess data with prod_inverter_clipping_filter using the threshold model. This adds a mask column to the dataframe where True indicates a row to be removed by the filter.

[30]:
masked_prod_data = preprocess.prod_inverter_clipping_filter(prod_data_datena_d, prod_col_dict, metadata, metad_col_dict, 'threshold', freq=60)

filtered_prod_data = masked_prod_data[~masked_prod_data['mask']]

print(f"Detected and removed {sum(masked_prod_data['mask'])} rows with inverter clipping.")
Detected and removed 24 rows with inverter clipping.
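For intuition, a much-simplified version of threshold-style clipping detection (not pvops' exact implementation) flags rows whose power sits near a flat ceiling, e.g. within a fixed fraction of the site's DC capacity. A sketch on synthetic values:

```python
import pandas as pd

dcsize = 20000  # assumed DC capacity in kW, as in the metadata above
power = pd.Series([11607.0, 19980.0, 19995.0, 8847.0])

# Flag rows within 1% of the assumed ceiling as potentially clipped.
mask = power >= 0.99 * dcsize
n_clipped = int(mask.sum())
```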

Normalize the power by capacity

[31]:
for site in metadata[metad_col_dict['siteid']].unique():
    # Select this site's rows in both the metadata and the production data
    site_metad_mask = metadata[metad_col_dict['siteid']] == site
    site_prod_mask = filtered_prod_data[prod_col_dict['siteid']] == site
    # Look up the site's DC capacity and divide its power production by it
    dcsize = metadata.loc[site_metad_mask, metad_col_dict['dcsize']].values[0]
    filtered_prod_data.loc[site_prod_mask, [prod_col_dict['capacity_normalized_power']]] = \
        filtered_prod_data.loc[site_prod_mask, prod_col_dict['powerprod']] / dcsize
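The per-site loop above can also be written as a vectorized map from site ID to DC size, which avoids repeated boolean masking. A sketch assuming the same column names on a small synthetic frame:

```python
import pandas as pd

metadata = pd.DataFrame({"randid": ["R15", "R10"], "dcsize": [20000, 20000]})
prod = pd.DataFrame({"randid": ["R15", "R10"],
                     "generated_kW": [10000.0, 5000.0]})

# Map each row's site ID to its DC size, then normalize in one division.
dc_by_site = metadata.set_index("randid")["dcsize"]
prod["capacity_normalized_power"] = (
    prod["generated_kW"] / prod["randid"].map(dc_by_site)
)
```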

Visualize the power signal versus covariates for one site with normalized values.

[32]:
temp = filtered_prod_data[filtered_prod_data['randid'] == 'R10']
for xcol in ['irrad_poa_Wm2', 'temp_amb_C', 'wind_speed_ms']:
    plt.scatter(temp[xcol], temp[prod_col_dict['capacity_normalized_power']])
    plt.title(xcol)
    plt.grid()
    plt.show()
[Scatter plots of capacity-normalized power versus irrad_poa_Wm2, temp_amb_C, and wind_speed_ms for site R10]

Dynamic linear modeling

We train a linear model on the site's covariates (here irradiance and ambient temperature; wind speed is available but commented out below).

[33]:
model_prod_data = filtered_prod_data.dropna(subset=['irrad_poa_Wm2', 'temp_amb_C', 'wind_speed_ms']+[prod_col_dict['powerprod']])
# Make sure to only pass data for one site! If sites are very similar, you can consider providing both sites.
model, train_df, test_df = linear.modeller(prod_col_dict,
                                           kernel_type='polynomial',
                                           #time_weighted='month',
                                           X_parameters=['irrad_poa_Wm2', 'temp_amb_C'],#, 'wind_speed_ms'],
                                           Y_parameter='generated_kW',
                                           prod_df=model_prod_data,
                                           test_split=0.05,
                                           degree=3,
                                           verbose=1)
train {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
Design matrix shape: (8293, 108)
test {2, 3}
Design matrix shape: (437, 108)

Begin training.
[OLS] Mean squared error: 679448.77
[OLS] Coefficient of determination: 0.99
[OLS] 108 coefficient trained.
[RANSAC] Mean squared error: 812510.79
[RANSAC] Coefficient of determination: 0.98

Begin testing.
[OLS] Mean squared error: 7982573.16
[OLS] Coefficient of determination: 0.85
[OLS] 108 coefficient trained.
[RANSAC] Mean squared error: 81461712.61
[RANSAC] Coefficient of determination: -0.55
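For intuition about what kernel_type='polynomial' with degree=3 implies, the covariates are expanded into polynomial terms (powers and cross-products) before an ordinary least squares fit. A rough scikit-learn sketch of that idea on synthetic data (this is not pvops' exact design matrix, which here also carries additional structure, yielding 108 coefficients):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
irrad = rng.uniform(0, 1000, 200)
temp = rng.uniform(-5, 35, 200)
X = np.column_stack([irrad, temp])
# Synthetic power-like response built from terms the expansion can represent.
y = 0.02 * irrad + 0.001 * irrad * temp

# Degree-3 polynomial expansion of the two covariates, then OLS.
X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
ols = LinearRegression().fit(X_poly, y)
r2 = ols.score(X_poly, y)
```

With two inputs and degree 3, the expansion yields 9 features (x1, x2, x1², x1·x2, x2², x1³, x1²·x2, x1·x2², x2³); a larger covariate set or added structure grows this count quickly, which is why the pvops model above reports 108 coefficients.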

Define a plotting function

[34]:
from sklearn.metrics import mean_squared_error, r2_score

def plot(model, prod_col_dict, data_split='test', npts=50):

    def print_info(real, pred, name):
        mse = mean_squared_error(real, pred)
        r2 = r2_score(real, pred)
        print(f'[{name}] Mean squared error: {mse:.2f}')
        print(f'[{name}] Coefficient of determination: {r2:.2f}')

    fig, ax = plt.subplots(figsize=(14, 8))

    if data_split == 'test':
        df = test_df
    elif data_split == 'train':
        df = train_df

    measured = model.estimators['OLS'][f'{data_split}_y'][:npts]

    ax2 = ax.twinx()
    ax2.plot(model.estimators['OLS'][f'{data_split}_index'][:npts], df[prod_col_dict['irradiance']].values[:npts], 'k', label='irradiance')

    ax.plot(model.estimators['OLS'][f'{data_split}_index'][:npts], df['expected_kW'].values[:npts], label='partner_expected')
    print_info(measured, df['expected_kW'].values[:npts], 'partner_expected')

    ax.plot(model.estimators['OLS'][f'{data_split}_index'][:npts], measured, label='measured')
    for name, info in model.estimators.items():
        predicted = model.estimators[name][f'{data_split}_prediction'][:npts]
        ax.plot(model.estimators[name][f'{data_split}_index'][:npts], predicted, label=name)
        print_info(measured, predicted, name)

    ax2.set_ylabel("Irradiance (W/m2)")
    ax.set_ylabel("Power (W)")
    ax.set_xlabel('Time')
    handles, labels = [(a+b) for a, b in zip(ax.get_legend_handles_labels(), ax2.get_legend_handles_labels())]
    ax.legend(handles, labels, loc='best')
    plt.show()

Observe performance

[35]:
plot(model, prod_col_dict, data_split='train', npts=40)
[partner_expected] Mean squared error: 2883252.79
[partner_expected] Coefficient of determination: 0.93
[OLS] Mean squared error: 225151.82
[OLS] Coefficient of determination: 0.99
[RANSAC] Mean squared error: 253701.63
[RANSAC] Coefficient of determination: 0.99
[Plot of measured, predicted, and partner-expected power with irradiance overlay (train split)]
[36]:
plot(model, prod_col_dict, data_split='test', npts=40)
[partner_expected] Mean squared error: 273446.60
[partner_expected] Coefficient of determination: 0.99
[OLS] Mean squared error: 1037605.79
[OLS] Coefficient of determination: 0.98
[RANSAC] Mean squared error: 1300872.49
[RANSAC] Coefficient of determination: 0.97
[Plot of measured, predicted, and partner-expected power with irradiance overlay (test split)]