Text Module Tutorial

[56]:
import pandas as pd
import matplotlib.pyplot as plt

from pvops.text import utils
import text_class_example

Problem statements:

1. Text Preprocessing

Process the raw ticket text into concise, machine learning-ready documents, and extract any dates mentioned in the text.

2. Text Classification

The ticket text is used to infer the specified event descriptor (the label column chosen below).

Text processing

Import text data

[57]:
folder = 'example_data/'
filename = 'example_ML_ticket_data.csv'
df = pd.read_csv(folder+filename)
df.head(n=3)
[57]:
   Date_EventStart  Date_EventEnd    Asset     CompletionDesc                                     Cause                 ImpactLevel       randid
0  8/16/2018 9:00   8/22/2018 17:00  Combiner  cb 1.18 was found to have contactor issue woul...  0000 - Unknown.       Underperformance  38
1  9/17/2018 18:25  9/18/2018 9:50   Pad       self resolved. techdispatched: no                  004 - Under voltage.  Underperformance  46
2  8/26/2019 9:00   11/5/2019 17:00  Facility  all module rows washed, waiting for final repo...  0000 - Unknown        Underperformance  62

Establish settings

Specify the column names used in this pipeline.

[58]:
DATA_COLUMN = "CompletionDesc"   # Column containing the ticket text (the document)
LABEL_COLUMN = "Asset"           # Event descriptor (label) that the classifiers will infer
DATE_COLUMN = 'Date_EventStart'  # Ticket date (start or end date; any representative date works). The date-extraction pipeline uses it to fill in information missing from the ticket text

Step 0: If needed, map raw labels to a cleaner set of labels

[59]:
asset_remap_filename = 'remappings_asset.csv'
REMAPPING_COL_FROM = 'in'
REMAPPING_COL_TO = 'out_'
remapping_df = pd.read_csv(folder+asset_remap_filename)
[60]:
remapping_col_dict = {
    'attribute_col': LABEL_COLUMN,
    'remapping_col_from': REMAPPING_COL_FROM,
    'remapping_col_to': REMAPPING_COL_TO
}

df_remapped_assets = utils.remap_attributes(df.iloc[30:].copy(), remapping_df.iloc[20:].copy(), remapping_col_dict, allow_missing_mappings=True)

df = df_remapped_assets
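
To make the remapping step concrete, here is a minimal standard-library sketch of the idea: build a lookup from the 'in' column to the 'out_' column and apply it to each label. It is not the pvops implementation. The function name `remap_labels`, the example rows, and the choice to map unmatched labels to `None` when `allow_missing_mappings` is False are all assumptions made for illustration.

```python
def remap_labels(labels, remapping_rows, col_from="in", col_to="out_",
                 allow_missing_mappings=True):
    """Map raw labels to cleaned labels using rows of a remapping table."""
    lookup = {row[col_from]: row[col_to] for row in remapping_rows}
    remapped = []
    for label in labels:
        if label in lookup:
            remapped.append(lookup[label])
        elif allow_missing_mappings:
            # Assumption: labels without a mapping pass through unchanged.
            remapped.append(label)
        else:
            # Assumption: labels without a mapping become missing values.
            remapped.append(None)
    return remapped


# Hypothetical remapping rows, not taken from remappings_asset.csv.
remapping_rows = [
    {"in": "INV", "out_": "inverter"},
    {"in": "Inverters", "out_": "inverter"},
]
raw_labels = ["INV", "Inverters", "Tracker"]

lenient = remap_labels(raw_labels, remapping_rows)
strict = remap_labels(raw_labels, remapping_rows, allow_missing_mappings=False)
print(lenient)
print(strict)
```

Collapsing several spellings into one label, as sketched here, is what keeps the label counts in the next cell small enough to be useful for classification.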
[61]:
df[LABEL_COLUMN].value_counts()
[61]:
Asset
inverter                  26
facility                  24
tracker                    6
combiner                   4
substation                 2
other                      2
transformer                1
ground-mount pv system     1
energy storage             1
energy meter               1
met station                1
pyranometer                1
Name: count, dtype: int64

Step 1: Establish example instance and render preliminary information about the tickets

[62]:
# Establish the class object (found in text_class_example.py)
print(df[LABEL_COLUMN].value_counts())

e = text_class_example.Example(df, LABEL_COLUMN)
e.summarize_text_data(DATA_COLUMN)
Asset
inverter                  26
facility                  24
tracker                    6
combiner                   4
substation                 2
other                      2
transformer                1
ground-mount pv system     1
energy storage             1
energy meter               1
met station                1
pyranometer                1
Name: count, dtype: int64
DETAILS
  70 samples
  0 invalid documents
  29.16 words per sample on average
  Number of unique words 881
  2041.00 total words
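
The DETAILS summary can be approximated with a few lines of plain Python. The sketch below assumes whitespace tokenization and treats any non-string entry as an invalid document; how `summarize_text_data` actually tokenizes or validates documents is not shown in this tutorial, so the function `summarize_documents` is illustrative only.

```python
def summarize_documents(documents):
    """Return basic corpus statistics for a list of documents."""
    # Assumption: a document is "invalid" if it is not a string (e.g. None).
    valid = [doc for doc in documents if isinstance(doc, str)]
    # Assumption: words are separated by whitespace.
    tokens = [word for doc in valid for word in doc.split()]
    return {
        "samples": len(documents),
        "invalid_documents": len(documents) - len(valid),
        "words_per_sample": len(tokens) / len(valid) if valid else 0.0,
        "unique_words": len(set(tokens)),
        "total_words": len(tokens),
    }


docs = [
    "c4 closed remotely. techdispatched: no",
    "resolved.. techdispatched: no",
    None,
]
stats = summarize_documents(docs)
print(stats)
```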

Visualize timeseries of ticket publications

[63]:
fig = e.visualize_attribute_timeseries(DATE_COLUMN)
plt.show()
[Figure: time series of ticket publications]

Functionality 1.1: Extract dates

[64]:
# Extract dates from the ticket text, if any. The extraction is heuristic and is not always correct.
dates_df = e.extract_dates(DATA_COLUMN, DATE_COLUMN, SAVE_DATE_COLUMN='ExtractedDates')
dates_df
[64]:
     CompletionDesc                                     ExtractedDates
0    8/39/19 inverter was faulted with lp15 (low pr...  [2019-08-17 07:35:00]
1    11,july 2018 -upon arrival w-a6-2, inverter is...  [2018-07-11 18:55:00, 2018-06-02 18:55:00, 201...
2    arrived site checked into c4. i was able to pi...  [2020-05-26 14:45:00]
3    c4 closed site remotely. techdispatched: no        []
4    inspection troubleshooting malfunctioning trac...  []
...  ...                                                ...
65   cleared cleared alert however psi is -3 invert...  [2016-11-03 09:28:00]
66   c4 closed remotely. techdispatched: no             []
67   pure power fixed damaged source circuits did f...  [2019-04-16 09:00:00, 2019-03-16 15:15:00]
68   checked network connection to rm-1 didn't see ...  []
69   utility outage from 6/5 7am through 6/8 5:30pm...  [2017-06-05 07:17:00, 2017-06-08 17:30:00]

70 rows × 2 columns
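
The role of DATE_COLUMN becomes clearer with a small example. The sketch below is a deliberately narrow stand-in for the date extractor, not its actual logic: it only recognizes `M/D` and `M/D/YY(YY)` patterns, borrows the year from the ticket's reference date when the text omits it, and skips impossible dates. The names `DATE_PATTERN` and `extract_dates_sketch` are hypothetical.

```python
import re
from datetime import datetime

# Matches M/D or M/D/YY(YY). Illustrative only; a real extractor handles far more formats.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})(?:/(\d{4}|\d{2}))?\b")


def extract_dates_sketch(text, reference_date):
    """Find simple numeric dates in text, filling a missing year from reference_date."""
    found = []
    for month, day, year in DATE_PATTERN.findall(text):
        if not year:
            year_num = reference_date.year      # year not written in the ticket
        elif len(year) == 2:
            year_num = 2000 + int(year)         # assumption: two-digit years are 20xx
        else:
            year_num = int(year)
        try:
            found.append(datetime(year_num, int(month), int(day)))
        except ValueError:
            continue                            # skip impossible dates such as 8/39/19
    return found


ticket_date = datetime(2017, 6, 5)
outage = extract_dates_sketch(
    "utility outage from 6/5 7am through 6/8 5:30pm", ticket_date)
invalid = extract_dates_sketch("8/39/19 inverter was faulted", ticket_date)
print(outage, invalid)
```

Unlike this sketch, the tutorial's extractor also returns a time of day and, as row 0 above shows, can still return a value for text that does not contain a valid date, which is why its output should be reviewed before use.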

Functionality 1.2: Preprocess data for machine learning classification

[65]:
preprocessed_df = e.prep_data_for_ML(DATA_COLUMN, DATE_COLUMN)
preprocessed_df
[65]:
     CompletionDesc                                     CleanDesc
0    either reboot datalogger worked, issue resolve...  either reboot datalogger worked issue resolved...
1    . techdispatched: no                               techdispatched
2    inverter resolved. techdispatched: no              inverter resolved techdispatched
3    10/2/19 e-1, row 51, e1-3-51-1. tracker tracki...  row tracker tracking wrong lubed gear boxes tr...
4    confirmed that cb 1.1.6 was turned off. verifi...  confirmed cb turned verified voltage array tur...
...  ...                                                ...
59   c4 closed remotely. techdispatched: no             closed remotely techdispatched
60   switchgear breaker for 2.6 was tripped. breake...  switchgear breaker tripped breaker inverter tr...
61   . techdispatched: no                               techdispatched
62   resolved.. techdispatched: no                      resolved techdispatched
63   8/39/19 inverter was faulted with lp15 (low pr...  inverter faulted lp low pressure inverter show...

64 rows × 2 columns
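
The CleanDesc column suggests the kind of cleaning applied: lowercasing, removing digits and punctuation, and dropping stopwords. The sketch below reproduces that effect on a few of the rows above. It is an approximation under stated assumptions, not the preprocessing used by `prep_data_for_ML`: the tiny `STOPWORDS` set and the rule that drops single-character tokens are hypothetical choices.

```python
import re

# Hypothetical, deliberately small stopword list for illustration.
STOPWORDS = {"a", "an", "the", "that", "was", "were", "is", "no", "off",
             "to", "of", "and", "for"}


def clean_document(text, stopwords=STOPWORDS):
    """Lowercase, strip non-letters, and drop stopwords and one-letter tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # digits and punctuation become spaces
    tokens = [tok for tok in text.split()
              if tok not in stopwords and len(tok) > 1]
    return " ".join(tokens)


print(clean_document("inverter resolved. techdispatched: no"))
print(clean_document("confirmed that cb 1.1.6 was turned off."))
print(clean_document("10/2/19 e-1, row 51, e1-3-51-1. tracker"))
```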

Results of text processing

[66]:
print("Pre-text processing")
e.summarize_text_data(DATA_COLUMN)

print("\nPost-text processing")
e.summarize_text_data('CleanDesc')
Pre-text processing
DETAILS
  64 samples
  0 invalid documents
  27.95 words per sample on average
  Number of unique words 778
  1789.00 total words

Post-text processing
DETAILS
  64 samples
  0 invalid documents
  17.31 words per sample on average
  Number of unique words 489
  1108.00 total words
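
To quantify the effect of processing, the two summaries can be compared directly. The short calculation below uses only the figures printed above and reports the relative reduction in vocabulary size and total word count.

```python
# Figures copied from the pre- and post-processing summaries above.
pre = {"unique words": 778, "total words": 1789}
post = {"unique words": 489, "total words": 1108}

# Relative reduction, in percent, for each statistic.
reduction_pct = {key: 100 * (1 - post[key] / pre[key]) for key in pre}

for key, pct in reduction_pct.items():
    print(f"{key}: {pct:.1f}% fewer after processing")
```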

Visualize the entropy of the clustering technique pre- and post-processing

[67]:
fig = e.visualize_cluster_entropy([DATA_COLUMN, 'CleanDesc'])
plt.show()
[Figure: cluster entropy for the raw (CompletionDesc) and cleaned (CleanDesc) text]

Functionality 1.3: Frequency plot

[68]:
# Frequency plot on unprocessed data
fig = e.visualize_freqPlot(LBL_CAT='inverter', DATA_COLUMN=DATA_COLUMN)
plt.show()
[Figure: word frequency plot for 'inverter' tickets, unprocessed text]
[69]:
# Frequency plot on processed data
fig = e.visualize_freqPlot(LBL_CAT='inverter',
                            # Optional, kwargs into nltk's FreqDist
                            graph_aargs = {
                                'linewidth':4
                            }
                        )
plt.show()
[Figure: word frequency plot for 'inverter' tickets, processed text]
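
The counts behind a frequency plot are simple to compute by hand. The sketch below uses `collections.Counter` from the standard library as a stand-in for the frequency distribution and filters documents by label first. The function `word_frequencies` and the three example documents are hypothetical.

```python
from collections import Counter


def word_frequencies(documents, labels, label, top_n=5):
    """Count words across the documents that carry the given label."""
    counts = Counter()
    for doc, doc_label in zip(documents, labels):
        if doc_label == label:
            counts.update(doc.split())
    return counts.most_common(top_n)


# Hypothetical cleaned documents and their asset labels.
documents = [
    "inverter resolved techdispatched",
    "closed remotely techdispatched",
    "inverter faulted inverter",
]
labels = ["inverter", "facility", "inverter"]

top_inverter_words = word_frequencies(documents, labels, "inverter", top_n=10)
print(top_inverter_words)
```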

Hint: Use the code below to visualize frequency plots for all assets:

set_labels = list(set(e.df[e.LABEL_COLUMN].tolist()))
for lbl in set_labels:
    fig = e.visualize_freqPlot(LBL_CAT=lbl)
    plt.show()
[70]:
# Only supports two attributes
om_col_dict = {
    'attribute1_col': 'Asset',
    'attribute2_col': 'ImpactLevel'
}

fig, G = e.visualize_attribute_connectivity(
    om_col_dict,
    figsize=[10,5],
    graph_aargs = {'with_labels':True,
    'font_weight':'bold',
    'node_size': 1000,
    'font_size':10}
)
[Figure: connectivity graph between the Asset and ImpactLevel attributes]
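
A connectivity view between two attributes boils down to counting how often each pair of values occurs on the same ticket. The sketch below computes those pair counts, which could serve as edge weights in a graph. It is an assumption about the general idea rather than a description of `visualize_attribute_connectivity`; the function `attribute_edges` and the records are hypothetical.

```python
from collections import Counter


def attribute_edges(records, attribute1_col, attribute2_col):
    """Count co-occurrences of values from two attribute columns."""
    return Counter(
        (record[attribute1_col], record[attribute2_col]) for record in records
    )


# Hypothetical tickets; "Outage" is an invented ImpactLevel value for illustration.
records = [
    {"Asset": "inverter", "ImpactLevel": "Underperformance"},
    {"Asset": "inverter", "ImpactLevel": "Underperformance"},
    {"Asset": "facility", "ImpactLevel": "Underperformance"},
    {"Asset": "inverter", "ImpactLevel": "Outage"},
]

edges = attribute_edges(records, "Asset", "ImpactLevel")
for (asset, impact), weight in edges.most_common():
    print(f"{asset} -- {impact}: {weight}")
```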

Visualize Word Clusters

[76]:
fig = e.visualize_document_clusters(min_frequency=10, DATA_COLUMN='CleanDesc')
plt.show()
[Figure: document clusters for CleanDesc with min_frequency=10]

Given how frequently techdispatched appears in the clusters, consider adding it to the stopwords list.
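
One way to try this out is to extend the stopword set and filter the cleaned documents again. The sketch below is a minimal illustration in plain Python; `base_stopwords` is a hypothetical placeholder for whatever stopword list the pipeline uses, and the two documents are taken from the CleanDesc output above.

```python
# Hypothetical placeholder for the pipeline's existing stopword list.
base_stopwords = {"no", "was", "the"}

# Add the domain-specific term that dominates the clusters.
custom_stopwords = base_stopwords | {"techdispatched"}

clean_docs = ["closed remotely techdispatched", "resolved techdispatched"]
filtered_docs = [
    " ".join(word for word in doc.split() if word not in custom_stopwords)
    for doc in clean_docs
]
print(filtered_docs)
```

Removing a term that appears in nearly every ticket leaves the words that actually distinguish one ticket from another.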