Timeseries Survival Analysis Tutorial

This notebook aims to demonstrate how to perform survival analysis in the context of PV assets, using the methodology from https://ieeexplore.ieee.org/document/9272625/. As outlined in that paper, survival analysis is concerned with estimating the failure rate over the lifetime of a system, given discrete observed failure events.

Gunda et al. used two methods: Kaplan-Meier estimation and fitting a Weibull distribution. We will do both here.

[1]:
import pandas as pd
import matplotlib.pyplot as plt
from pvops import timeseries
WARNING:tensorflow:From c:\Users\agmoore\AppData\Local\anaconda3\envs\pvops\Lib\site-packages\keras\src\backend\common\global_state.py:82: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

Loading and preprocessing data

Reading in data

We read in the data below, where each column reflects data from an O&M (operation and maintenance) ticket. Note the following column descriptions:

  • randid is an integer identifying the PV site

  • COD identifies the commissioning date of the site

  • EventStart is the recorded time of the event

  • Asset identifies the asset involved in the event

[2]:
om_df = pd.read_csv('example_data/example_om_survival_analysis_data.csv')
om_df.head(5)
[2]:
randid COD EventStart Asset
0 0 2015-09-24 2016-05-18 09:00:00 Other
1 1 2014-08-01 2016-06-13 09:00:00 Other
2 2 2011-12-27 2016-11-10 09:00:00 Other
3 3 2015-07-22 2016-11-10 09:00:00 Other
4 4 2015-08-13 2016-11-24 09:00:00 Other

Data preprocessing

As a first step, we can calculate the days until the event was observed.

[3]:
om_df['days_until_failure'] = (pd.to_datetime(om_df['EventStart']) - pd.to_datetime(om_df['COD'])).dt.days
om_df.head(5)
[3]:
randid COD EventStart Asset days_until_failure
0 0 2015-09-24 2016-05-18 09:00:00 Other 237
1 1 2014-08-01 2016-06-13 09:00:00 Other 682
2 2 2011-12-27 2016-11-10 09:00:00 Other 1780
3 3 2015-07-22 2016-11-10 09:00:00 Other 477
4 4 2015-08-13 2016-11-24 09:00:00 Other 469

Without knowing more about the nature of the observed events, we will only consider the first event for a given asset type for each site.

If a given site-asset pair wasn’t observed in the dataset, we can still glean information from the lack of a failure: we call this a right-censored datapoint. For these right-censored datapoints, we can take the last point we know the site was recording data (i.e., the latest timestamp for that site) and use that time as our datapoint. Then we just mark those datapoints as censored.

The function pvops.timeseries.preprocess.identify_right_censored_data performs this preprocessing step for us. We can see the result below:

[4]:
censored_om_df = timeseries.preprocess.identify_right_censored_data(om_df, {'group_by': 'Asset', 'site': 'randid'})
censored_om_df.head(10)
[4]:
COD EventStart days_until_failure was_observed
randid Asset
0 Other 2015-09-24 2016-05-18 09:00:00 237 True
Transformer 2015-09-24 2018-10-21 09:00:00 1123 False
Inverter 2015-09-24 2018-10-21 09:00:00 1123 False
Tracker 2015-09-24 2018-10-21 09:00:00 1123 False
Facility 2015-09-24 2018-10-21 09:00:00 1123 False
Combiner 2015-09-24 2018-10-21 09:00:00 1123 False
1 Other 2014-08-01 2016-06-13 09:00:00 682 True
Transformer 2014-08-01 2018-10-08 09:00:00 1529 False
Inverter 2014-08-01 2018-10-08 09:00:00 1529 False
Tracker 2014-08-01 2018-10-08 09:00:00 1529 False

Now we are ready to perform survival analysis on the data.

Failure probability estimation

Kaplan-Meier estimator

The Kaplan-Meier estimator is a nonparametric estimator, which means it does not assume a particular form for the survival probability distribution. This makes it more flexible and potentially more accurrate than parametric methods.

The models.survival.fit_survival_function in the pvops.timeseries module implements the Kaplan-Meier estimator when method='kaplan-meier' is used, shown below.

[5]:
col_dict = {'group_by': 'Asset', 'time_to_fail': 'days_until_failure', 'was_observed': 'was_observed'}
kaplan_meier_results = timeseries.models.survival.fit_survival_function(censored_om_df, col_dict, method='kaplan-meier')

And we can plot the results as follows:

[6]:
plt.figure(figsize=(12,6))
assets = om_df['Asset'].unique()
colors = list(plt.cm.colors.BASE_COLORS.values())[:len(assets)]

for asset, color in zip(assets, colors):
    result = kaplan_meier_results[asset]
    plt.plot(result['times'], result['fail_prob'], color=color, label=asset)
    plt.fill_between(result['times'], result['conf_int'][0], result['conf_int'][1], color=color, alpha=0.2)

plt.legend();
plt.grid();
plt.xlim(0,None);
plt.ylim(0,1);
plt.xlabel('Days since site commissioning')
plt.ylabel('Survival probability');
../../_images/pages_tutorials_tutorial_timeseries_survival_analysis_17_0.png

Weibull distribution fitting

The Weibull distribution describes a family of probability distributions parametrized by a shape parameter and a scale parameter. It is frequently used to model the probability of failure by fitting its parameters to observed data. As a parametric method, it is less flexible than the nonparametric Kaplan-Meier estimation but is simpler and potentially easier to interpret.

The models.survival.fit_survival_function in the pvops.timeseries module implements the Weibull fitting when method=weibull' is used, shown below.

[7]:
col_dict = {'group_by': 'Asset', 'time_to_fail': 'days_until_failure', 'was_observed': 'was_observed'}
weibull_results = timeseries.models.survival.fit_survival_function(censored_om_df, col_dict, method='weibull')

And we can plot the Weibull results as well:

[8]:
plt.figure(figsize=(12,6))
assets = om_df['Asset'].unique()
colors = list(plt.cm.colors.BASE_COLORS.values())[:len(assets)]
days = range(censored_om_df['days_until_failure'].max())

for asset, color in zip(assets, colors):
    fitted_weibull = weibull_results[asset]['distribution']
    weibull_probs = fitted_weibull.sf(days)
    plt.plot(days, weibull_probs, color=color, label=asset)

plt.legend();
plt.grid();
plt.xlim(0,None);
plt.ylim(0,1);
plt.xlabel('Days since site commissioning')
plt.ylabel('Survival probability');
../../_images/pages_tutorials_tutorial_timeseries_survival_analysis_22_0.png

We can also extract the coefficients and format them into a table for analysis:

[10]:
parameters = [list(series.values())[:-1] for series in weibull_results.values()]
param_df = pd.DataFrame(parameters, index=weibull_results.keys(), columns=['shape', 'scale'])
param_df
[10]:
shape scale
Other 0.932064 823.962180
Transformer 0.875983 7396.679666
Inverter 0.943692 961.246107
Tracker 0.724652 14380.096078
Facility 0.742765 1430.950715
Combiner 0.850317 1986.362160

We can see that all asset types have shape parameters < 1, indicating the rates of failure decreases over time for all asset types. Inverters see the slowest reduction in failure rate, while trackers have the fastest.