Text Module Tutorial

import pandas as pd
import matplotlib.pyplot as plt

from pvops.text import utils
import text_class_example

Problem statements:

1. Text Preprocessing

Clean the raw ticket text into concise, machine-learning-ready documents, and extract any dates mentioned in the text.

2. Text Classification

Use the written tickets to infer a specified event descriptor (in this tutorial, the asset named in LABEL_COLUMN).

Text processing

Import text data

folder = 'example_data//'
filename = 'example_ML_ticket_data.csv'
df = pd.read_csv(folder+filename)
   Date_EventStart  Date_EventEnd    Asset     CompletionDesc                                     Cause                 ImpactLevel       randid
0  8/16/2018 9:00   8/22/2018 17:00  Combiner  cb 1.18 was found to have contactor issue woul...  0000 - Unknown.       Underperformance  38
1  9/17/2018 18:25  9/18/2018 9:50   Pad       self resolved. techdispatched: no                  004 - Under voltage.  Underperformance  46
2  8/26/2019 9:00   11/5/2019 17:00  Facility  all module rows washed, waiting for final repo...  0000 - Unknown        Underperformance  62

Establish settings

Specify column names which will be used in this pipeline.

DATA_COLUMN = "CompletionDesc"   # Contains document
LABEL_COLUMN = "Asset"           # Event descriptor that the classifiers will infer
DATE_COLUMN = 'Date_EventStart'  # Ticket date (start date, end date, or any representative date); the date extraction pipeline uses it to fill in date information missing from the ticket text

Step 0: If needed, map raw labels to a cleaner set of labels

asset_remap_filename = 'remappings_asset.csv'
remapping_df = pd.read_csv(folder+asset_remap_filename)
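# REMAPPING_COL_FROM and REMAPPING_COL_TO must hold the names of the "from" and
# "to" columns in the remapping CSV; define them before running this cell.
remapping_col_dict = {
    'attribute_col': LABEL_COLUMN,
    'remapping_col_from': REMAPPING_COL_FROM,
    'remapping_col_to': REMAPPING_COL_TO
}
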
df_remapped_assets = utils.remap_attributes(df.iloc[30:].copy(), remapping_df.iloc[20:].copy(), remapping_col_dict, allow_missing_mappings=True)

df = df_remapped_assets
df[LABEL_COLUMN].value_counts()
inverter                  26
facility                  24
tracker                    6
combiner                   4
substation                 2
other                      2
transformer                1
ground-mount pv system     1
energy storage             1
energy meter               1
met station                1
pyranometer                1
Name: count, dtype: int64
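
The remapping step can be pictured as a lookup from raw labels to cleaned labels. The sketch below is a minimal standard-library illustration, not the utils.remap_attributes implementation. The helper name remap_labels, the sample mapping, and the reading of allow_missing (keep unmapped labels when true, drop them otherwise) are assumptions made for this example.

```python
def remap_labels(labels, mapping, allow_missing=True):
    """Map raw labels to cleaned labels with a lookup dictionary.

    Hypothetical helper for illustration; not part of pvops.
    """
    remapped = []
    for label in labels:
        if label in mapping:
            remapped.append(mapping[label])
        elif allow_missing:
            # Assumption: labels without a mapping are kept unchanged.
            remapped.append(label)
        # Otherwise the unmapped label is dropped.
    return remapped


# Hypothetical mapping; in the tutorial the mapping comes from the remapping CSV.
raw_labels = ["Combiner", "Pad", "Facility", "Inverter 1"]
mapping = {"Combiner": "combiner", "Facility": "facility", "Inverter 1": "inverter"}

kept = remap_labels(raw_labels, mapping, allow_missing=True)
dropped = remap_labels(raw_labels, mapping, allow_missing=False)
print(kept)
print(dropped)
```

Keeping unmapped labels preserves every ticket, at the cost of leaving some raw label spellings in the cleaned column.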

Step 1: Establish example instance and render preliminary information about the tickets

# Establish the class object (found in text_class_example.py)

e = text_class_example.Example(df, LABEL_COLUMN)
inverter                  26
facility                  24
tracker                    6
combiner                   4
substation                 2
other                      2
transformer                1
ground-mount pv system     1
energy storage             1
energy meter               1
met station                1
pyranometer                1
Name: count, dtype: int64
  70 samples
  0 invalid documents
  29.16 words per sample on average
  Number of unique words 881
  2041.00 total words
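
Summary statistics like the ones printed above can be approximated in a few lines. This is a rough sketch that assumes whitespace tokenization and treats empty or non-string entries as invalid; the example class may count words differently, and summarize_documents is a hypothetical name.

```python
def summarize_documents(documents):
    """Return simple corpus statistics using whitespace tokenization."""
    # Assumption: a document is valid if it is a non-empty string.
    valid = [doc for doc in documents if isinstance(doc, str) and doc.strip()]
    tokens = [word for doc in valid for word in doc.split()]
    return {
        "samples": len(documents),
        "invalid_documents": len(documents) - len(valid),
        "words_per_sample": len(tokens) / len(valid) if valid else 0.0,
        "unique_words": len(set(tokens)),
        "total_words": len(tokens),
    }


docs = [
    "self resolved. techdispatched: no",
    "c4 closed remotely. techdispatched: no",
    None,  # stands in for a missing ticket description
]
stats = summarize_documents(docs)
for name, value in stats.items():
    print(f"  {name}: {value}")
```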

Visualize timeseries of ticket publications

fig = e.visualize_attribute_timeseries(DATE_COLUMN)

Functionality 1.1: Extract dates

# Extract dates from each ticket, if any. The extraction is heuristic and can return incorrect dates.
dates_df = e.extract_dates(DATA_COLUMN, DATE_COLUMN, SAVE_DATE_COLUMN='ExtractedDates')
     CompletionDesc                                     ExtractedDates
0    8/39/19 inverter was faulted with lp15 (low pr...  [2019-08-17 07:35:00]
1    11,july 2018 -upon arrival w-a6-2, inverter is...  [2018-07-11 18:55:00, 2018-06-02 18:55:00, 201...
2    arrived site checked into c4. i was able to pi...  [2020-05-26 14:45:00]
3    c4 closed site remotely. techdispatched: no        []
4    inspection troubleshooting malfunctioning trac...  []
...  ...                                                ...
65   cleared cleared alert however psi is -3 invert...  [2016-11-03 09:28:00]
66   c4 closed remotely. techdispatched: no             []
67   pure power fixed damaged source circuits did f...  [2019-04-16 09:00:00, 2019-03-16 15:15:00]
68   checked network connection to rm-1 didn't see ...  []
69   utility outage from 6/5 7am through 6/8 5:30pm...  [2017-06-05 07:17:00, 2017-06-08 17:30:00]

70 rows × 2 columns
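
To make the role of DATE_COLUMN concrete, here is a deliberately simple sketch of date extraction that borrows the year from a reference date when the text omits it. It is not the pipeline used above and will not reproduce the outputs in the table. The regular expression, the month-first assumption, and the helper name find_dates are all assumptions made for this example.

```python
import re
from datetime import datetime

# Matches M/D or M/D/YY(YY); assumes US-style, month-first dates.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})(?:/(\d{2}|\d{4}))?\b")


def find_dates(text, reference_date):
    """Find month/day[/year] dates, taking a missing year from reference_date."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        month, day = int(match.group(1)), int(match.group(2))
        year_text = match.group(3)
        if year_text is None:
            year = reference_date.year        # year not written in the ticket
        elif len(year_text) == 2:
            year = 2000 + int(year_text)      # assumption: two-digit years are 20xx
        else:
            year = int(year_text)
        try:
            found.append(datetime(year, month, day))
        except ValueError:
            # Impossible dates such as 8/39/19 are skipped in this sketch.
            continue
    return found


ticket_date = datetime(2017, 6, 5)
outage = find_dates("utility outage from 6/5 7am through 6/8 5:30pm", ticket_date)
invalid = find_dates("8/39/19 inverter was faulted", ticket_date)
print(outage)
print(invalid)
```

A pattern this narrow misses written-out months such as "11,july 2018" and ignores times of day, which is one reason extracted dates should be reviewed before use.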

Functionality 1.2: Preprocess data for the Machine Learning classification

preprocessed_df = e.prep_data_for_ML(DATA_COLUMN, DATE_COLUMN)
     CompletionDesc                                     CleanDesc
0    either reboot datalogger worked, issue resolve...  either reboot datalogger worked issue resolved...
1    . techdispatched: no                               techdispatched
2    inverter resolved. techdispatched: no              inverter resolved techdispatched
3    10/2/19 e-1, row 51, e1-3-51-1. tracker tracki...  row tracker tracking wrong lubed gear boxes tr...
4    confirmed that cb 1.1.6 was turned off. verifi...  confirmed cb turned verified voltage array tur...
...  ...                                                ...
59   c4 closed remotely. techdispatched: no             closed remotely techdispatched
60   switchgear breaker for 2.6 was tripped. breake...  switchgear breaker tripped breaker inverter tr...
61   . techdispatched: no                               techdispatched
62   resolved.. techdispatched: no                      resolved techdispatched
63   8/39/19 inverter was faulted with lp15 (low pr...  inverter faulted lp low pressure inverter show...

64 rows × 2 columns
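
The CleanDesc column suggests that cleaning lowercases the text, strips digits and punctuation, and removes stopwords and single-letter leftovers. The sketch below imitates that behavior under those assumptions. The stopword list is a small hypothetical one and clean_document is not a pvops function, so results on other tickets may differ from the table above.

```python
import re

# Small hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "the", "was", "to", "no", "that", "for", "is", "with", "off"}


def clean_document(text, stopwords=STOPWORDS):
    """Lowercase, keep letters only, then drop stopwords and one-letter tokens."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)  # digits and punctuation become spaces
    tokens = [
        token for token in letters_only.split()
        if token not in stopwords and len(token) > 1  # assumption: drop fragments like "c" from "c4"
    ]
    return " ".join(tokens)


examples = [
    "inverter resolved. techdispatched: no",
    ". techdispatched: no",
    "c4 closed remotely. techdispatched: no",
]
cleaned = [clean_document(text) for text in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```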

Results of text processing

print("Pre-text processing")

print("\nPost-text processing")
Pre-text processing
  64 samples
  0 invalid documents
  27.95 words per sample on average
  Number of unique words 778
  1789.00 total words

Post-text processing
  64 samples
  0 invalid documents
  17.31 words per sample on average
  Number of unique words 489
  1108.00 total words

Visualizing entropy of clustering technique pre- and post-processing

fig = e.visualize_cluster_entropy([DATA_COLUMN, 'CleanDesc'])

Functionality 1.3: Frequency plot

# Frequency plot on unprocessed data
fig = e.visualize_freqPlot(LBL_CAT='inverter', DATA_COLUMN=DATA_COLUMN)
# Frequency plot on processed data
fig = e.visualize_freqPlot(LBL_CAT='inverter',
                            # Optional, kwargs into nltk's FreqDist
                            graph_aargs = {

Hint: The code below draws a frequency plot for every asset label.

set_labels = list(set(e.df[e.LABEL_COLUMN].tolist()))
for lbl in set_labels:
    fig = e.visualize_freqPlot(LBL_CAT=lbl)
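
The counts behind a frequency plot can be inspected without plotting. This sketch uses collections.Counter from the standard library as a stand-in for nltk's FreqDist and groups word counts by label. The function name and the sample documents are hypothetical.

```python
from collections import Counter


def word_frequencies_by_label(documents, labels):
    """Count whitespace-separated words separately for each label."""
    counts = {}
    for doc, label in zip(documents, labels):
        counts.setdefault(label, Counter()).update(doc.split())
    return counts


# Hypothetical cleaned documents and their asset labels.
documents = [
    "inverter resolved techdispatched",
    "inverter faulted low pressure inverter",
    "closed remotely techdispatched",
]
labels = ["inverter", "inverter", "facility"]

freqs = word_frequencies_by_label(documents, labels)
top_inverter = freqs["inverter"].most_common(1)
print(top_inverter)
```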
# Only supports two attributes
om_col_dict = {
    'attribute1_col': 'Asset',
    'attribute2_col': 'ImpactLevel'
}

fig, G = e.visualize_attribute_connectivity(
    graph_aargs = {'with_labels':True,
    'node_size': 1000,
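
One way to think about connectivity between two attributes is as weighted edges between their values, where the weight is the number of tickets sharing both values. The sketch below computes those weights with the standard library. It is an assumption about what such a graph encodes, not a description of visualize_attribute_connectivity, and the "Outage" value is made up for the example.

```python
from collections import Counter


def attribute_edges(rows, attribute1, attribute2):
    """Count how often each pair of attribute values occurs together."""
    edges = Counter()
    for row in rows:
        edges[(row[attribute1], row[attribute2])] += 1
    return edges


# Hypothetical tickets; "Outage" is an invented impact level.
rows = [
    {"Asset": "inverter", "ImpactLevel": "Underperformance"},
    {"Asset": "inverter", "ImpactLevel": "Underperformance"},
    {"Asset": "facility", "ImpactLevel": "Underperformance"},
    {"Asset": "inverter", "ImpactLevel": "Outage"},
]

edges = attribute_edges(rows, "Asset", "ImpactLevel")
for (asset, impact), weight in sorted(edges.items()):
    print(f"{asset} -- {impact}: {weight}")
```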

Visualize Word Clusters

fig = e.visualize_document_clusters(min_frequency=10, DATA_COLUMN='CleanDesc')

Because techdispatched appears in so many tickets, consider adding it to the stopwords list.
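
A simple way to find such words is to measure the fraction of documents each token appears in and flag the most widespread ones. The sketch below does this with the standard library. The threshold, the helper name stopword_candidates, and the sample documents are assumptions for illustration.

```python
from collections import Counter


def stopword_candidates(documents, min_document_fraction=0.5):
    """Return tokens that occur in at least the given fraction of documents."""
    document_counts = Counter()
    for doc in documents:
        document_counts.update(set(doc.split()))  # count each token once per document
    total = len(documents)
    return sorted(
        token for token, count in document_counts.items()
        if count / total >= min_document_fraction
    )


# Hypothetical cleaned documents.
documents = [
    "inverter resolved techdispatched",
    "techdispatched",
    "closed remotely techdispatched",
    "switchgear breaker tripped breaker inverter",
]

candidates = stopword_candidates(documents, min_document_fraction=0.75)
filtered = [
    " ".join(token for token in doc.split() if token not in candidates)
    for doc in documents
]
print(candidates)
print(filtered)
```

Review the candidates by hand before removing them, since a frequent word can still carry signal for the classifier.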