text2time module

text2time.preprocess module

These functions pre-process user O&M and production data so that the merged data can be visualized.

pvops.text2time.preprocess.data_site_na(pom_df, df_col_dict)[source]

Drops rows where site-ID is missing (NAN) within either production or O&M data.

Parameters:
  • pom_df (DataFrame) – A data frame corresponding to either the production or O&M data.

  • df_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the input pom_df and contains at least:

    • siteid (string), should be assigned to column name for user’s site-ID

Returns:

  • pom_df (DataFrame) – An updated version of the input data frame, where rows with site-IDs of NAN are dropped.

  • addressed (DataFrame) – A data frame showing rows from the input that were removed by this function.
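
The behavior described above can be sketched in plain pandas. This is only an illustration of the documented inputs and outputs, not the pvops implementation; the helper name drop_missing_site and the column name "randid" are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical O&M table; "randid" is an invented site-ID column name.
om_df = pd.DataFrame({
    "randid": ["S1", np.nan, "S2"],
    "note": ["inverter fault", "unknown site", "tracker fault"],
})
# The dictionary maps the expected key (siteid) to the user's column name.
col_dict = {"siteid": "randid"}


def drop_missing_site(df, col_dict):
    """Return (rows that have a site ID, rows that were removed)."""
    missing = df[col_dict["siteid"]].isna()
    return df[~missing].copy(), df[missing].copy()


cleaned, addressed = drop_missing_site(om_df, col_dict)
```

With pvops installed, the corresponding call would presumably be data_site_na(om_df, col_dict), which the signature above shows taking the same two arguments.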

pvops.text2time.preprocess.om_date_convert(om_df, om_col_dict, toffset=0.0)[source]

Converts dates from string format to date time object in O&M dataframe.

Parameters:
  • om_df (DataFrame) – A data frame corresponding to O&M data.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the O&M data, which consist of at least:

    • datestart (string), should be assigned to column name for O&M event start date in om_df

    • dateend (string), should be assigned to column name for O&M event end date in om_df

  • toffset (float) – Value that specifies how many hours the O&M data should be shifted by in case time-stamps in production data and O&M data don’t align as they should.

Returns:

DataFrame – An updated version of the input dataframe, but with time-stamps converted to localized (time-zone agnostic) date-time objects.
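
A minimal pandas sketch of this kind of conversion follows. It is an illustration under assumptions, not the pvops code: the helper name convert_dates and the column names are invented, and the sketch assumes a positive toffset moves time-stamps later.

```python
import pandas as pd

# Hypothetical O&M data with dates stored as strings.
om_df = pd.DataFrame({
    "start": ["2020-01-01 08:00", "2020-01-02 12:30"],
    "end": ["2020-01-01 10:00", "2020-01-03 09:00"],
})
om_col_dict = {"datestart": "start", "dateend": "end"}


def convert_dates(df, col_dict, toffset=0.0):
    """Parse the start/end columns and shift them by toffset hours."""
    out = df.copy()
    shift = pd.Timedelta(hours=toffset)  # assumption: positive means later
    for key in ("datestart", "dateend"):
        col = col_dict[key]
        out[col] = pd.to_datetime(out[col]) + shift
    return out


converted = convert_dates(om_df, om_col_dict, toffset=1.5)
```

The resulting columns carry no time-zone information, which matches the time-zone agnostic output described above.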

pvops.text2time.preprocess.om_datelogic_check(om_df, om_col_dict, om_dflag='swap')[source]

Addresses issues with O&M dates where the start of an event is listed as occurring after its end. Such rows are either dropped or have their dates swapped, depending on the user’s preference.

Parameters:
  • om_df (DataFrame) – A data frame corresponding to O&M data.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the O&M data, which consist of at least:

    • datestart (string), should be assigned to column name for associated O&M event start date in om_df

    • dateend (string), should be assigned to column name for associated O&M event end date in om_df

  • om_dflag (str) – A flag that specifies how to address rows where the start of an event occurs after its conclusion. A flag of ‘drop’ will drop those rows, and a flag of ‘swap’ will swap the two dates for that row.

Returns:

  • om_df (DataFrame) – An updated version of the input dataframe, but with O&M data quality issues addressed to ensure the start of an event precedes the event end date.

  • addressed (DataFrame) – A data frame showing rows from the input that were addressed by this function.
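
The two om_dflag options can be illustrated with a short pandas sketch. The helper fix_date_order and the column names are hypothetical; this shows the documented behavior, not how pvops implements it.

```python
import pandas as pd

# Hypothetical O&M data: the first event ends before it starts.
om_df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-05", "2020-02-01"]),
    "end": pd.to_datetime(["2020-01-03", "2020-02-04"]),
})
om_col_dict = {"datestart": "start", "dateend": "end"}


def fix_date_order(df, col_dict, om_dflag="swap"):
    """Swap or drop rows whose start date falls after the end date."""
    start, end = col_dict["datestart"], col_dict["dateend"]
    bad = df[start] > df[end]
    addressed = df[bad].copy()
    out = df.copy()
    if om_dflag == "swap":
        # to_numpy() avoids label alignment, so the two columns really swap.
        out.loc[bad, [start, end]] = out.loc[bad, [end, start]].to_numpy()
    elif om_dflag == "drop":
        out = out[~bad]
    return out, addressed


swapped, addressed = fix_date_order(om_df, om_col_dict, om_dflag="swap")
dropped, _ = fix_date_order(om_df, om_col_dict, om_dflag="drop")
```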

pvops.text2time.preprocess.om_nadate_process(om_df, om_col_dict, om_dendflag='drop')[source]

Addresses issues with O&M dataframe where dates are missing (NAN). Two operations are performed: (1) rows where the start of an event is missing are dropped, and (2) rows where the conclusion of an event is NAN are either dropped or marked with the time at which the program is run, depending on the user’s preference.

Parameters:
  • om_df (DataFrame) – A data frame corresponding to O&M data.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the O&M data, which consist of at least:

    • datestart (string), should be assigned to column name for user’s O&M event start-date

    • dateend (string), should be assigned to column name for user’s O&M event end-date

  • om_dendflag (str) – A flag that specifies how to address rows where the conclusion of an event is missing (NAN). A flag of ‘drop’ will drop those rows, and a flag of ‘today’ will replace the NAN with the time at which the program is run. Any other value will leave the rows untouched.

Returns:

  • om_df (DataFrame) – An updated version of the input dataframe, but with no missing time-stamps in the O&M data.

  • addressed (DataFrame) – A data frame showing rows from the input that were addressed by this function.
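
The two operations can be sketched as follows. This is a plain pandas illustration with invented names (handle_missing_dates, "start", "end"), written from the description above rather than from the pvops source.

```python
import pandas as pd

# Hypothetical O&M data: row 1 has no start date, row 2 has no end date.
om_df = pd.DataFrame({
    "start": pd.to_datetime(["2020-01-01", None, "2020-03-01"]),
    "end": pd.to_datetime(["2020-01-02", "2020-02-02", None]),
})
om_col_dict = {"datestart": "start", "dateend": "end"}


def handle_missing_dates(df, col_dict, om_dendflag="drop"):
    """Drop rows without a start date, then drop or fill missing end dates."""
    start, end = col_dict["datestart"], col_dict["dateend"]
    no_start = df[start].isna()
    touched = [df[no_start]]
    out = df[~no_start].copy()
    no_end = out[end].isna()
    if om_dendflag == "drop":
        touched.append(out[no_end])
        out = out[~no_end]
    elif om_dendflag == "today":
        touched.append(out[no_end])
        # Fill with the current time, truncated to whole seconds.
        out[end] = out[end].fillna(pd.Timestamp.now().floor("s"))
    # Any other flag leaves rows with a missing end date untouched.
    return out, pd.concat(touched)


dropped, addressed = handle_missing_dates(om_df, om_col_dict, "drop")
filled, _ = handle_missing_dates(om_df, om_col_dict, "today")
untouched, _ = handle_missing_dates(om_df, om_col_dict, "keep")
```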

pvops.text2time.preprocess.prod_date_convert(prod_df, prod_col_dict, toffset=0.0)[source]

Converts dates from string format to datetime format in production dataframe.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to production data.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the production data, which consist of at least:

    • timestamp (string), should be assigned to user’s time-stamp column name

  • toffset (float) – Value that specifies how many hours the production data should be shifted by in case time-stamps in production data and O&M data don’t align as they should.

Returns:

DataFrame – An updated version of the input dataframe, but with time-stamps converted to localized (time-zone agnostic) date-time objects.

pvops.text2time.preprocess.prod_nadate_process(prod_df, prod_col_dict, pnadrop=False)[source]

Processes rows of production data frame for missing time-stamp info (NAN).

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to production data.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the production data, which consist of at least:

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

  • pnadrop (bool) – Boolean flag that determines what to do with rows where time-stamp is missing. A value of True will drop these rows. Leaving the default value of False will identify rows with missing time-stamps for the user, but the function will output the same input data frame with no modifications.

Returns:

  • prod_df (DataFrame) – The output data frame. If pnadrop is True, an updated version of the input data frame is output, with rows that have missing time-stamps removed. If the default value of False is kept, the input data frame is output with no modifications.

  • addressed (DataFrame) – A data frame showing rows from the input that were addressed or identified by this function.
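
As an illustration of the pnadrop flag, the sketch below identifies rows with missing time-stamps and removes them only on request. The helper flag_missing_timestamps and the column names are hypothetical, and the code is not taken from pvops.

```python
import pandas as pd

# Hypothetical production data with one missing time-stamp.
prod_df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-01-01 00:00", None, "2020-01-01 02:00"]),
    "energy": [1.0, 2.0, 3.0],
})
prod_col_dict = {"timestamp": "ts"}


def flag_missing_timestamps(df, col_dict, pnadrop=False):
    """Report rows with a missing time-stamp; drop them only if pnadrop."""
    missing = df[col_dict["timestamp"]].isna()
    addressed = df[missing].copy()
    out = df[~missing].copy() if pnadrop else df
    return out, addressed


same_df, identified = flag_missing_timestamps(prod_df, prod_col_dict)
trimmed, removed = flag_missing_timestamps(prod_df, prod_col_dict, pnadrop=True)
```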

text2time.utils module

These helper functions perform secondary calculations on the O&M and production data so that the merged data can be visualized.

pvops.text2time.utils.interpolate_data(prod_df, om_df, prod_col_dict, om_col_dict, om_cols_to_translate=['asset', 'prod_impact'])[source]

Provides general overview of the overlapping production and O&M data.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to the production data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in prod_col_dict.

  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in om_col_dict.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the production data

    • siteid (string), should be assigned to associated site-ID column name in prod_df

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

    • energyprod (string), should be assigned to associated production column name in prod_df

    • irradiance (string), should be assigned to associated irradiance column name in prod_df

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • siteid (string), should be assigned to associated site-ID column name in om_df

    • datestart (string), should be assigned to associated O&M event start-date column name in om_df

    • dateend (string), should be assigned to associated O&M event end-date column name in om_df

    • Others specified in om_cols_to_translate

  • om_cols_to_translate (list) – List of om_col_dict keys to translate into prod_df

Returns:

  • prod_output (DataFrame) – A data frame that includes statistics for the production data per site in the data frame. Two statistical parameters are calculated and assigned to separate columns:

    • Actual # Time Stamps (int), total number of overlapping production time-stamps

    • Max # Time Stamps (int), maximum number of production time-stamps, including NANs

  • om_out (DataFrame) – A data frame that includes statistics for the O&M data per site in the data frame. Three statistical parameters are calculated and assigned to separate columns:

    • Earliest Event Start (datetime.datetime), column that specifies timestamp of earliest start of all events per site.

    • Latest Event End (datetime.datetime), column that specifies timestamp for latest conclusion of all events per site.

    • Total Events (int), column that specifies total number of events per site

pvops.text2time.utils.om_summary_stats(om_df, meta_df, om_col_dict, meta_col_dict)[source]

Adds columns to the O&M dataframe capturing statistics (e.g., event duration, month of occurrence, and age). The age is calculated using the corresponding site commissioning date within the metadata dataframe.

Parameters:
  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been pre-processed by the QC and overlappingDFs functions. This data frame needs to have the columns specified in om_col_dict.

  • meta_df (DataFrame) – A data frame corresponding to the metadata that contains columns specified in meta_col_dict.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data which consist of at least:

    • siteid (string), should be assigned to column name for associated site-ID

    • datestart (string), should be assigned to column name for associated O&M event start-date

    • dateend (string), should be assigned to column name for associated O&M event end-date

    • eventdur (string), should be assigned to column name desired for calculated event duration (calculated here, in hours)

    • modatestart (string), should be assigned to column name desired for month of event start (calculated here)

    • agedatestart (string), should be assigned to column name desired for calculated age of site when event started (calculated here, in days)

  • meta_col_dict (dict) – A dictionary that contains the column names relevant for the meta-data

    • siteid (string), should be assigned to associated site-ID column name in meta_df

    • COD (string), should be assigned to column name corresponding to associated commissioning dates for all sites captured in om_df

Returns:

om_df (DataFrame) – An updated version of the input dataframe, but with three new columns added for visualizations: event duration, month of event occurrence, and age of system at time of event occurrence. See om_col_dict for mapping of expected variables to user-defined variables.
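
The three derived columns can be sketched in pandas as below. All column names and the helper add_summary_stats are invented for the example, and the month is stored here as a number, which is an assumption about the output format.

```python
import pandas as pd

# Hypothetical O&M events and site metadata.
om_df = pd.DataFrame({
    "site": ["S1", "S2"],
    "start": pd.to_datetime(["2020-03-01 00:00", "2021-06-15 06:00"]),
    "end": pd.to_datetime(["2020-03-02 12:00", "2021-06-15 18:00"]),
})
meta_df = pd.DataFrame({
    "site": ["S1", "S2"],
    "cod": pd.to_datetime(["2019-03-01", "2021-01-01"]),
})
om_col_dict = {
    "siteid": "site", "datestart": "start", "dateend": "end",
    "eventdur": "EventDur", "modatestart": "MonthStart",
    "agedatestart": "AgeStart",
}
meta_col_dict = {"siteid": "site", "COD": "cod"}


def add_summary_stats(om, meta, oc, mc):
    """Add event duration (hours), start month, and site age (days)."""
    out = om.copy()
    duration = out[oc["dateend"]] - out[oc["datestart"]]
    out[oc["eventdur"]] = duration.dt.total_seconds() / 3600.0
    out[oc["modatestart"]] = out[oc["datestart"]].dt.month
    # Look up each event's commissioning date through its site ID.
    cod = out[oc["siteid"]].map(meta.set_index(mc["siteid"])[mc["COD"]])
    out[oc["agedatestart"]] = (out[oc["datestart"]] - cod).dt.days
    return out


stats = add_summary_stats(om_df, meta_df, om_col_dict, meta_col_dict)
```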

pvops.text2time.utils.overlapping_data(prod_df, om_df, prod_col_dict, om_col_dict)[source]

Finds the overlapping time-range between the production data and O&M data for any given site. The outputs are truncated versions of the input data frames that contain only data with overlapping dates between the two.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to the production data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in prod_col_dict. The time-stamp column should not have any NANs for proper operation of this function.

  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in om_col_dict. The time-stamp columns should not have any NANs for proper operation of this function.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the production data

    • siteid (string), should be assigned to associated site-ID column name in prod_df

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

    • energyprod (string), should be assigned to associated production column name in prod_df

    • irradiance (string), should be assigned to associated irradiance column name in prod_df

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • siteid (string), should be assigned to associated site-ID column name in om_df

    • datestart (string), should be assigned to associated O&M event start-date column name in om_df

    • dateend (string), should be assigned to associated O&M event end-date column name in om_df

Returns:

  • prod_df (DataFrame) – Production data frame similar to the input data frame, but truncated to only contain data that overlaps in time with the O&M data.

  • om_df (DataFrame) – O&M data frame similar to the input data frame, but truncated to only contain data that overlaps in time with the production data.
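
One way to picture the truncation is the sketch below. It assumes the overlap window for a site runs from the later of the two earliest dates to the earlier of the two latest dates; that definition, the helper truncate_to_overlap, and the column names are assumptions made for illustration and may differ from what pvops does.

```python
import pandas as pd

# Hypothetical data for a single site.
prod_df = pd.DataFrame({
    "site": ["S1"] * 5,
    "ts": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                          "2020-01-04", "2020-01-05"]),
    "energy": [10.0, 11.0, 12.0, 13.0, 14.0],
})
om_df = pd.DataFrame({
    "site": ["S1", "S1"],
    "start": pd.to_datetime(["2019-12-20", "2020-01-03"]),
    "end": pd.to_datetime(["2019-12-22", "2020-01-04"]),
})
prod_col_dict = {"siteid": "site", "timestamp": "ts"}
om_col_dict = {"siteid": "site", "datestart": "start", "dateend": "end"}


def truncate_to_overlap(prod, om, pc, oc):
    """Keep only rows inside each site's shared production/O&M window."""
    ts, start, end = pc["timestamp"], oc["datestart"], oc["dateend"]
    prod_parts, om_parts = [], []
    for site, p in prod.groupby(pc["siteid"]):
        o = om[om[oc["siteid"]] == site]
        if o.empty:
            continue  # no O&M records for this site, so nothing overlaps
        lo = max(p[ts].min(), o[start].min())
        hi = min(p[ts].max(), o[end].max())
        prod_parts.append(p[(p[ts] >= lo) & (p[ts] <= hi)])
        om_parts.append(o[(o[end] >= lo) & (o[start] <= hi)])
    if not prod_parts:
        return prod.iloc[0:0], om.iloc[0:0]
    return pd.concat(prod_parts), pd.concat(om_parts)


prod_cut, om_cut = truncate_to_overlap(prod_df, om_df, prod_col_dict, om_col_dict)
```

In this example the window runs from 2020-01-01 to 2020-01-04, so the last production row and the December event fall outside it.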

pvops.text2time.utils.prod_anomalies(prod_df, prod_col_dict, minval=1.0, repval=nan, ffill=True)[source]

For production data with cumulative energy entries, 1) addresses time-stamps where production unexpectedly drops to near zero and 2) replaces unexpected production drops with NANs or with user-specified value. If unexpected production drops are replaced with NANs and if ‘ffill’ is set to ‘True’ in the input argument, a forward-fill method is used to replace the unexpected drops.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to production data where production is logged on a cumulative basis.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names associated with the production data, which consist of at least:

    • energyprod (string), should be assigned to the associated cumulative production column name in prod_df

  • minval (float) – Cutoff value for production data that determines where anomalies are defined. Any production values below minval will be addressed by this function. Default minval is 1.0.

  • repval (float) – Value that should replace the anomalies in a cumulative production data format. Default value is numpy’s NAN.

  • ffill (boolean) – Boolean flag that determines whether NANs in production column in prod_df should be filled using a forward-fill method.

Returns:

  • prod_df (DataFrame) – An updated version of the input dataframe, but with production values below minval replaced according to the user’s preference.

  • addressed (DataFrame) – A data frame showing rows from the input that were addressed by this function.
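
The interaction of minval, repval, and ffill can be sketched in pandas as follows. The helper replace_drops and the column name "energy" are hypothetical, and the sketch only mirrors the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical cumulative production with two unexpected drops.
prod_df = pd.DataFrame({"energy": [100.0, 150.0, 0.0, 210.0, 0.5, 300.0]})
prod_col_dict = {"energyprod": "energy"}


def replace_drops(df, col_dict, minval=1.0, repval=np.nan, ffill=True):
    """Replace values below minval with repval, then optionally forward-fill."""
    col = col_dict["energyprod"]
    low = df[col] < minval
    addressed = df[low].copy()
    out = df.copy()
    out.loc[low, col] = repval
    if ffill:
        # Forward-fill carries the last valid cumulative reading onward.
        out[col] = out[col].ffill()
    return out, addressed


fixed, addressed = replace_drops(prod_df, prod_col_dict)
flagged, _ = replace_drops(prod_df, prod_col_dict, ffill=False)
```

Forward-filling suits cumulative data because the true cumulative total cannot fall below the previous reading.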

pvops.text2time.utils.prod_quant(prod_df, prod_col_dict, comp_type, ecumu=True)[source]

Compares performance of observed production data in relation to an expected baseline.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to the production data after having been processed by the QC and overlappingDFs functions. This data frame needs at least the columns specified in prod_col_dict.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the production data

    • siteid (string), should be assigned to associated site-ID column name in prod_df

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

    • energyprod (string), should be assigned to associated production column name in prod_df

    • baseline (string), should be assigned to associated expected baseline production column name in prod_df

    • compared (string), should be assigned to column name desired for quantified production data (calculated here)

    • energy_pstep (string), should be assigned to column name desired for energy per time-step (calculated here)

  • comp_type (str) – Flag that specifies how the energy production should be compared to the expected baseline. A flag of ‘diff’ shows the subtracted difference between the two (baseline - observed). A flag of ‘norm’ shows the ratio of the two (observed/baseline)

  • ecumu (bool) – Boolean flag that specifies whether the production (energy output) data is input as cumulative information (“True”) or on a per time-step basis (“False”).

Returns:

DataFrame – A data frame similar to the input, with an added column for the performance comparisons
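
The two comp_type options can be illustrated with the sketch below. It assumes the baseline column is already on a per time-step basis and that cumulative energy is converted by differencing within each site; those assumptions, the helper compare_to_baseline, and the column names are not taken from pvops.

```python
import pandas as pd

# Hypothetical cumulative production and per-step baseline for one site.
prod_df = pd.DataFrame({
    "site": ["S1"] * 4,
    "energy": [0.0, 10.0, 25.0, 45.0],
    "expected": [10.0, 10.0, 20.0, 20.0],
})
prod_col_dict = {
    "siteid": "site", "energyprod": "energy", "baseline": "expected",
    "compared": "Compared", "energy_pstep": "EnergyStep",
}


def compare_to_baseline(df, col_dict, comp_type, ecumu=True):
    """Add energy-per-step and a baseline comparison column."""
    out = df.copy()
    energy = out[col_dict["energyprod"]]
    if ecumu:
        # Cumulative to per-step; the first row of each site becomes NaN.
        energy = energy.groupby(out[col_dict["siteid"]]).diff()
    out[col_dict["energy_pstep"]] = energy
    baseline = out[col_dict["baseline"]]
    if comp_type == "diff":
        out[col_dict["compared"]] = baseline - energy
    elif comp_type == "norm":
        out[col_dict["compared"]] = energy / baseline
    else:
        raise ValueError("comp_type must be 'diff' or 'norm'")
    return out


diffed = compare_to_baseline(prod_df, prod_col_dict, "diff")
normed = compare_to_baseline(prod_df, prod_col_dict, "norm")
```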

pvops.text2time.utils.summarize_overlaps(prod_df, om_df, prod_col_dict, om_col_dict)[source]

Provides general overview of the overlapping production and O&M data.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to the production data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in prod_col_dict.

  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been processed by the perf_om_NA_qc function. This data frame needs the columns specified in om_col_dict.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the production data

    • siteid (string), should be assigned to associated site-ID column name in prod_df

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

    • energyprod (string), should be assigned to associated production column name in prod_df

    • irradiance (string), should be assigned to associated irradiance column name in prod_df

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • siteid (string), should be assigned to associated site-ID column name in om_df

    • datestart (string), should be assigned to associated O&M event start-date column name in om_df

    • dateend (string), should be assigned to associated O&M event end-date column name in om_df

Returns:

  • prod_output (DataFrame) – A data frame that includes statistics for the production data per site in the data frame. Two statistical parameters are calculated and assigned to separate columns:

    • Actual # Time Stamps (int), total number of overlapping production time-stamps

    • Max # Time Stamps (int), maximum number of production time-stamps, including NANs

  • om_out (DataFrame) – A data frame that includes statistics for the O&M data per site in the data frame. Three statistical parameters are calculated and assigned to separate columns:

    • Earliest Event Start (datetime.datetime), column that specifies timestamp of earliest start of all events per site.

    • Latest Event End (datetime.datetime), column that specifies timestamp for latest conclusion of all events per site.

    • Total Events (int), column that specifies total number of events per site
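
The per-site statistics listed above can be approximated with pandas group-by operations. In this sketch the "actual" count is taken as the number of non-missing production values and the "max" count as the number of rows, which is an interpretation of the description; the helper summarize and the input column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical production and O&M data for two sites.
prod_df = pd.DataFrame({
    "site": ["S1", "S1", "S1", "S2", "S2"],
    "energy": [1.0, np.nan, 3.0, 5.0, 6.0],
})
om_df = pd.DataFrame({
    "site": ["S1", "S1", "S2"],
    "start": pd.to_datetime(["2020-01-02", "2020-01-05", "2020-02-01"]),
    "end": pd.to_datetime(["2020-01-03", "2020-01-09", "2020-02-02"]),
})
prod_col_dict = {"siteid": "site", "energyprod": "energy"}
om_col_dict = {"siteid": "site", "datestart": "start", "dateend": "end"}


def summarize(prod, om, pc, oc):
    """Build per-site summary tables for production and O&M data."""
    counts = prod.groupby(pc["siteid"])[pc["energyprod"]].agg(["count", "size"])
    prod_output = counts.rename(columns={
        "count": "Actual # Time Stamps",  # non-missing values only
        "size": "Max # Time Stamps",      # all rows, including NaNs
    })
    grouped = om.groupby(oc["siteid"])
    om_out = pd.DataFrame({
        "Earliest Event Start": grouped[oc["datestart"]].min(),
        "Latest Event End": grouped[oc["dateend"]].max(),
        "Total Events": grouped.size(),
    })
    return prod_output, om_out


prod_output, om_out = summarize(prod_df, om_df, prod_col_dict, om_col_dict)
```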

text2time.visualize module

These functions focus on visualizing the processed O&M and production data

pvops.text2time.visualize.visualize_categorical_scatter(om_df, om_col_dict, cat_varx, cat_vary, fig_sets)[source]

Produces a seaborn categorical scatter plot to show the relationship between an O&M numerical column and a categorical column using sns.catplot()

Parameters:
  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been pre-processed to address NANs and date consistency, and after applying the om_summary_stats function. This data frame needs at least the columns specified in om_col_dict.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • eventdur (string), should be assigned to column name desired for repair duration. This column is calculated by om_summary_stats

    • agedatestart (string), should be assigned to column name desired for age of site when event started. This column is calculated by om_summary_stats

  • cat_varx (str) – Column name that contains categorical variable to be plotted

  • cat_vary (str) – Column name that contains numerical variable to be plotted

  • fig_sets (dict) – A dictionary that contains the settings to be used for the figure to be generated, and those settings should include:

    • figsize (tuple), which is a tuple of the figure dimensions (e.g., (12, 10))

    • fontsize (int), which is the desired font-size for the figure

Returns:

None

pvops.text2time.visualize.visualize_counts(om_df, om_col_dict, count_var, fig_sets)[source]

Produces a seaborn countplot of an O&M categorical column using sns.countplot()

Parameters:
  • om_df (DataFrame) – A data frame corresponding to the O&M data after having been pre-processed to address NANs and date consistency, and after applying the om_summary_stats function. This data frame needs at least the columns specified in om_col_dict.

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • siteid (string), should be assigned to column name for associated site-ID in om_df.

    • modatestart (string), should be assigned to column name desired for month of event start. This column is calculated by om_summary_stats

  • count_var (str) – Column name that contains categorical variable to be plotted

  • fig_sets (dict) – A dictionary that contains the settings to be used for the figure to be generated, and those settings should include:

    • figsize (tuple), which is a tuple of the figure dimensions (e.g., (12, 10))

    • fontsize (int), which is the desired font-size for the figure

Returns:

None

pvops.text2time.visualize.visualize_om_prod_overlap(prod_df, om_df, prod_col_dict, om_col_dict, prod_fldr, e_cumu, be_cumu, samp_freq='H', pshift=0.0, baselineflag=True)[source]

Creates Plotly figures of performance data overlaid with coinciding O&M tickets. A separate figure for each site in the production data frame (prod_df) is generated.

Parameters:
  • prod_df (DataFrame) – A data frame corresponding to the performance data after (ideally) having been processed by the perf_om_NA_qc and overlappingDFs functions. This data frame needs to contain the columns specified in prod_col_dict.

  • om_df (DataFrame) – A data frame corresponding to the O&M data after (ideally) having been processed by the perf_om_NA_qc and overlappingDFs functions. This data frame needs to contain the columns specified in om_col_dict.

  • prod_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the production data

    • siteid (string), should be assigned to associated site-ID column name in prod_df

    • timestamp (string), should be assigned to associated time-stamp column name in prod_df

    • energyprod (string), should be assigned to associated production column name in prod_df

    • irradiance (string), should be assigned to associated irradiance column name in prod_df. Data should be in [W/m^2].

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the O&M data

    • siteid (string), should be assigned to column name for user’s site-ID

    • datestart (string), should be assigned to column name for user’s O&M event start-date

    • dateend (string), should be assigned to column name for user’s O&M event end-date

    • workID (string), should be assigned to column name for user’s O&M unique event ID

    • worktype (string), should be assigned to column name for user’s O&M ticket type (corrective, predictive, etc)

    • asset (string), should be assigned to column name for affected asset in user’s O&M ticket

  • prod_fldr (str) – Path to directory where plots should be saved.

  • e_cumu (bool) – Boolean flag that specifies whether the production (energy output) data is input as cumulative information (“True”) or on a per time-step basis (“False”).

  • be_cumu (bool) – Boolean that specifies whether the baseline production data is input as cumulative information (“True”) or on a per time-step basis (“False”).

  • samp_freq (str) – Specifies how the performance data should be resampled. String value is any frequency that is valid for pandas.DataFrame.resample(). For example, a value of ‘D’ will resample on a daily basis, and a value of ‘H’ will resample on an hourly basis.

  • pshift (float) – Value that specifies how many hours the performance data should be shifted by to help align performance data with O&M data. Mostly necessary when resampling frequencies are larger than an hour.

  • baselineflag (bool) – Boolean that specifies whether or not to display the baseline (i.e., expected production profile) as calculated with the irradiance data using the baseline production data. A value of ‘True’ will display the baseline production profile on the generated Plotly figures, and a value of ‘False’ will not.

Returns:

list – List of Plotly figure handles generated by function for each site within prod_df.
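
The effect of samp_freq and pshift can be seen in isolation with a small pandas sketch. It assumes per time-step energy that is summed within each resampling bin and a shift applied to the time-stamps before resampling; the helper resample_energy and the column names are hypothetical, and the plotting itself is not reproduced here.

```python
import pandas as pd

# Hypothetical hourly production over two days, one unit of energy per hour.
stamps = pd.Timestamp("2020-01-01") + pd.to_timedelta(list(range(48)), unit="h")
prod_df = pd.DataFrame({"ts": stamps, "energy": 1.0})
prod_col_dict = {"timestamp": "ts", "energyprod": "energy"}


def resample_energy(df, col_dict, samp_freq="D", pshift=0.0):
    """Shift time-stamps by pshift hours, then sum energy per samp_freq bin."""
    shifted = df[col_dict["timestamp"]] + pd.Timedelta(hours=pshift)
    series = pd.Series(
        df[col_dict["energyprod"]].to_numpy(),
        index=pd.DatetimeIndex(shifted),
    )
    return series.resample(samp_freq).sum()


daily = resample_energy(prod_df, prod_col_dict, samp_freq="D")
daily_shifted = resample_energy(prod_df, prod_col_dict, samp_freq="D", pshift=12.0)
```

A 12-hour shift moves half of each day's readings into the next daily bin, which is why the shift matters most for resampling frequencies coarser than an hour.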