Text Module Tutorial 1: Preprocessing

This tutorial demonstrates how the text module can be used to perform various preprocessing steps on text data. Plots and summary functions are then used to compare the processed dataset with the raw dataset.

[3]:

# basic manipulation and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# machine learning imports
from sklearn.cluster import KMeans
from gensim.models.doc2vec import Doc2Vec

# pvops functionality
from pvops.text import utils as text_utils
from pvops.text import nlp_utils as text_nlp_utils
from pvops.text import visualize as text_visualize
from pvops.text import preprocess as text_preprocess

Initial data cleaning and exploration

Here, we read in our example dataset.

[4]:

df = pd.read_csv('example_data/example_ML_ticket_data.csv')
df.head(5)

[4]:

	Date_EventStart	Date_EventEnd	Asset	CompletionDesc	Cause	ImpactLevel	randid
0	8/16/2018 9:00	8/22/2018 17:00	Combiner	cb 1.18 was found to have contactor issue woul...	0000 - Unknown.	Underperformance	38
1	9/17/2018 18:25	9/18/2018 9:50	Pad	self resolved. techdispatched: no	004 - Under voltage.	Underperformance	46
2	8/26/2019 9:00	11/5/2019 17:00	Facility	all module rows washed, waiting for final repo...	0000 - Unknown	Underperformance	62
3	11/14/2017 7:46	11/14/2017 8:35	Inverter	14 nov: we were alerted that e-c3-1 had faulte...	019 - Unplanned outage/derate.	Underperformance	54
4	4/27/2019 9:00	5/2/2019 17:00	Facility	assessed condition filters all inverters. litt...	.	NaN	45

Before performing text analysis, it may be necessary to clean up categorical labels to make the data easier to work with.

example_data/remappings_asset.csv has a helpful mapping for us. It consists of two columns: in, which is the input of our mapping, and out_, which is the corresponding output. pvops.text.utils offers a function remap_attributes to help us perform this remapping.

[5]:

remapping_df = pd.read_csv('example_data/remappings_asset.csv')

remapping_col_dict = {
    'attribute_col': 'Asset',
    'remapping_col_from': 'in',
    'remapping_col_to': 'out_'
}

df = text_utils.remap_attributes(df, remapping_df, remapping_col_dict, allow_missing_mappings=True)
df['Asset'].value_counts()

[5]:

Asset
facility          37
inverter          34
tracker            8
combiner           7
met station        3
substation         3
transformer        2
module             2
other              2
energy storage     1
energy meter       1
Name: count, dtype: int64
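As a rough illustration of what such a remapping amounts to, the sketch below applies an in-to-out lookup to a list of labels in plain Python. The helper remap_labels, the mapping, and the labels are all made up for this example, and this is not the pvops implementation. It assumes that, when missing mappings are allowed, unmapped labels pass through unchanged, and that they are dropped otherwise.

```python
def remap_labels(labels, mapping, allow_missing=True):
    """Map each label through `mapping` (hypothetical helper, not pvops code)."""
    remapped = []
    for label in labels:
        if label in mapping:
            remapped.append(mapping[label])
        elif allow_missing:
            # Assumption: unmapped labels pass through unchanged.
            remapped.append(label)
        # Otherwise the unmapped label is dropped.
    return remapped


# Made-up mapping and raw labels, for illustration only.
mapping = {"Combiner": "combiner", "Inverters": "inverter", "INV": "inverter"}
raw_labels = ["Combiner", "Inverters", "INV", "Tracker"]

kept = remap_labels(raw_labels, mapping, allow_missing=True)
strict = remap_labels(raw_labels, mapping, allow_missing=False)
print(kept)
print(strict)
```

With allow_missing=True the unmapped "Tracker" label survives as-is; with allow_missing=False it is removed.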

Let’s obtain a basic summary of the asset labels using pvops.text.nlp_utils.summarize_text_data. We can also plot the distribution of asset labels over time.

[6]:

text_nlp_utils.summarize_text_data(df, 'Asset')

df_to_plot = df.copy()
df_to_plot['Date_EventStart'] = pd.to_datetime(df['Date_EventStart'])
df_to_plot = df_to_plot.loc[df_to_plot['Date_EventStart'].notnull()]

_, ax = plt.subplots(figsize=(10,5))
text_visualize.visualize_attribute_timeseries(df_to_plot, dict(date='Date_EventStart', label='Asset'), ax=ax);

DETAILS
  100 samples
  0 invalid documents
  1.05 words per sample on average
  Number of unique words 13
  105.00 total words

[Figure: distribution of asset labels over time]

text.preprocess submodule

The pvops.text.preprocess submodule performs some initial preprocessing steps to prepare textual data for machine learning. It offers its primary functionality through the preprocessor() function, which takes a DataFrame as input. This dataframe must have, at a minimum, a column that stores text logs and a column that stores the associated datetimes. The function then performs three primary steps:

Removes from the text column any numbers which might be confused for dates using pvops.text.preprocess.text_remove_nondate_nums().
Extracts dates using pvops.text.preprocess.get_dates(), considering both the text column and the datetime column, and only extracting dates after performing some basic quality control checks.
Cleans the text data, removing numbers as well as any stopwords passed into the function’s lst_stopwords argument using pvops.text.preprocess.text_remove_numbers_stopwords().
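The third step can be approximated with a few lines of standard-library Python. This is only a sketch of the idea (lowercase, strip digits and punctuation, drop stopwords); the helper clean_text and the small stopword set are hypothetical, and the actual pvops functions may behave differently. The sample sentence is adapted from the opening words of row 3 of the example dataset.

```python
import re


def clean_text(text, stopwords):
    """Lowercase, drop digits and punctuation, then drop stopwords (illustrative only)."""
    # Replace every run of non-letter characters with a single space.
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    return " ".join(w for w in letters_only.split() if w not in stopwords)


# A tiny, made-up stopword set for this example.
stopwords = {"was", "to", "have", "the", "we", "were", "that", "had"}
cleaned = clean_text("14 nov: we were alerted that e-c3-1 had faulted", stopwords)
print(cleaned)  # nov alerted e c faulted
```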

In the example below, we leverage the preprocessor function to clean up our text data:

[7]:

# drop null rows
preprocessed_df = df.copy().dropna(subset=['Asset', 'CompletionDesc'])

# let's only keep asset labels which have more than one instance in the dataset
label_counts = preprocessed_df.value_counts('Asset')
labels_with_multiple_occurrences = label_counts.loc[label_counts > 1].index
preprocessed_df = preprocessed_df.loc[preprocessed_df['Asset'].isin(labels_with_multiple_occurrences)]

# build a custom set of stopwords
stopwords = text_nlp_utils.create_stopwords(lst_add_words=['dtype', 'say', 'length', 'object', 'u', 'ha', 'wa'])

# run our preprocessing function to clean up the text data and prepare it for ML
col_dict = dict(data='CompletionDesc',
                eventstart='Date_EventStart',
                save_data_column='CleanDesc',
                save_date_column='ExtractedDates')

preprocessed_df = text_preprocess.preprocessor(preprocessed_df, stopwords, col_dict)
preprocessed_df.head()[['CompletionDesc', 'CleanDesc']]

[7]:

	CompletionDesc	CleanDesc
0	cb 1.18 was found to have contactor issue woul...	cb found contactor issue would close contactor...
1	self resolved. techdispatched: no	self resolved techdispatched
2	all module rows washed, waiting for final repo...	module rows washed waiting final report sun power
3	14 nov: we were alerted that e-c3-1 had faulte...	nov alerted e c faulted upon investigation not...
4	assessed condition filters all inverters. litt...	assessed condition filters inverters little cl...

Sometimes a row is missing the date in the appropriate timestamp column even though dates appear in its textual columns. In a case like this, we can also use the preprocessor function. The date extraction is not perfect, but it should provide a good first pass. Passing the optional argument extract_dates_only=True tells the function to perform only the date extraction (and not spend time removing stopwords).

[8]:

dates_df = text_preprocess.preprocessor(df, None, col_dict, extract_dates_only=True)
dates_df.head()[['CompletionDesc', 'ExtractedDates']]

[8]:

	CompletionDesc	ExtractedDates
0	cb 1.18 was found to have contactor issue woul...	[]
1	self resolved. techdispatched: no	[]
2	all module rows washed, waiting for final repo...	[2019-09-01 09:00:00]
3	14 nov: we were alerted that e-c3-1 had faulte...	[2017-11-14 07:46:00, 2017-03-01 07:46:00]
4	assessed condition filters all inverters. litt...	[]
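To give a feel for how dates can be pulled out of free text, here is a minimal sketch that finds "DD mon" mentions and anchors them to the year and time of the event start. The helper extract_day_month_dates and its regular expression are assumptions made for illustration. They are not how pvops.text.preprocess.get_dates is implemented, and they cover far fewer formats.

```python
import re
from datetime import datetime

# Three-letter month abbreviations mapped to month numbers.
MONTHS = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def extract_day_month_dates(text, event_start):
    """Find 'DD mon' mentions and anchor them to the year and time of event_start."""
    found = []
    for day, mon in re.findall(r"\b(\d{1,2})\s+([a-z]{3})\b", text.lower()):
        if mon not in MONTHS:
            continue
        try:
            found.append(event_start.replace(month=MONTHS[mon], day=int(day)))
        except ValueError:
            # Skip impossible dates such as '31 feb'.
            continue
    return found


event_start = datetime(2017, 11, 14, 7, 46)
dates = extract_day_month_dates("14 nov: we were alerted", event_start)
print(dates)
```

Borrowing the year from the event timestamp is one simple quality-control choice: a bare "14 nov" is ambiguous on its own, so the sketch resolves it relative to a date that is already trusted.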

Let’s go back to our ML-prepped text, which we stored in the 'CleanDesc' column of preprocessed_df. pvOps provides a few different ways of exploring how these preprocessing steps have modified the dataset. To start, we can see a set of simple statistics pre- and post-preprocessing:

[9]:

print('Pre-text processing:')
text_nlp_utils.summarize_text_data(df, 'CompletionDesc')

print('\nPost-text processing')
text_nlp_utils.summarize_text_data(preprocessed_df, 'CleanDesc');

Pre-text processing:
DETAILS
  100 samples
  0 invalid documents
  33.52 words per sample on average
  Number of unique words 1213
  3352.00 total words

Post-text processing
DETAILS
  98 samples
  0 invalid documents
  20.72 words per sample on average
  Number of unique words 726
  2031.00 total words
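The statistics reported above are simple to compute by hand. The sketch below derives comparable numbers for a toy pair of raw and cleaned documents using whitespace tokenization. The helper summarize is hypothetical, and pvops may tokenize differently, so it should not be expected to reproduce the exact figures shown above.

```python
def summarize(documents):
    """Return basic word statistics for a list of strings (illustrative only)."""
    tokenized = [doc.split() for doc in documents]
    total_words = sum(len(tokens) for tokens in tokenized)
    unique_words = {word for tokens in tokenized for word in tokens}
    return {
        "samples": len(documents),
        "words_per_sample": total_words / len(documents) if documents else 0.0,
        "unique_words": len(unique_words),
        "total_words": total_words,
    }


# Toy documents, loosely based on the example rows.
raw = ["self resolved. techdispatched: no", "all module rows washed"]
clean = ["self resolved techdispatched", "module rows washed"]
raw_stats = summarize(raw)
clean_stats = summarize(clean)
print(raw_stats)
print(clean_stats)
```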

We can also consider the embeddings of our documents and quantify how distinct those embeddings are from each other. The function pvops.text.visualize.visualize_cluster_entropy will compute the Doc2Vec embedding of each row, cluster the embeddings using KMeans, and plot the sum of squared distances from the embeddings to their respective cluster centroids (known as “inertia” in the context of KMeans).

The resulting plot below demonstrates that the embeddings of the cleaned descriptions are much more “distinct”, in the sense that they are further apart in embedding space and hence easier to differentiate.

[10]:

doc2vec_model = Doc2Vec(vector_size=40, min_count=2)
k_vals = np.arange(5, 56, 10)
eval_kmeans = lambda X, k: KMeans(n_clusters=k, n_init=5).fit(X)

_, ax = plt.subplots(figsize=(6,4))
text_visualize.visualize_cluster_entropy(doc2vec_model, eval_kmeans, preprocessed_df, ['CompletionDesc', 'CleanDesc'], k_vals, cmap_name='brg', ax=ax);

[Figure: cluster inertia for the raw (CompletionDesc) and cleaned (CleanDesc) descriptions across the tested numbers of clusters]
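For readers unfamiliar with inertia, the quantity can be written out directly: for each point, take the squared Euclidean distance to its nearest centroid, then sum over all points. The sketch below does this in plain Python on made-up 2-D points. The function name inertia and the data are illustrative only, and the centroids are supplied by hand rather than fitted.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
        )
    return total


# Two tight clusters versus two more spread-out clusters (made-up data).
tight = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
spread = [(0.0, 0.0), (0.0, 4.0), (10.0, 0.0), (10.0, 4.0)]

tight_inertia = inertia(tight, [(0.0, 1.0), (10.0, 1.0)])
spread_inertia = inertia(spread, [(0.0, 2.0), (10.0, 2.0)])
print(tight_inertia, spread_inertia)  # 4.0 16.0
```

Points that sit farther from their centroids produce a larger inertia, which is the sense in which the plot compares the two sets of embeddings.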

We can also, for instance, extract the documents labeled with an “inverter” asset and plot the frequency of the most commonly occurring words, first in the raw descriptions and then in the cleaned ones. We see that the “clean” version has a much more intuitive selection of words:

[11]:

inverter_documents_raw = preprocessed_df.loc[preprocessed_df['Asset'] == 'inverter', 'CompletionDesc']
tokenized_raw = text_preprocess.regex_tokenize(' '.join(inverter_documents_raw))

inverter_documents_clean = preprocessed_df.loc[preprocessed_df['Asset'] == 'inverter', 'CleanDesc']
tokenized_clean = text_preprocess.regex_tokenize(' '.join(inverter_documents_clean))

_, axes = plt.subplots(ncols=2, figsize=(12,4))
text_visualize.visualize_word_frequency_plot(tokenized_raw, 'Raw description', ax=axes[0]);
text_visualize.visualize_word_frequency_plot(tokenized_clean, 'Clean description', ax=axes[1]);

[Figure: word frequency plots for the raw (left) and clean (right) inverter descriptions]
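The counts behind a word frequency plot can be sketched with collections.Counter. The helper top_words, its simple letters-only tokenizer, and the three documents below are made up for illustration; pvops' regex_tokenize may split text differently.

```python
import re
from collections import Counter


def top_words(documents, n=3):
    """Return the n most frequent lowercase words across all documents."""
    tokens = re.findall(r"[a-z]+", " ".join(documents).lower())
    return Counter(tokens).most_common(n)


# Made-up documents for illustration.
docs = ["inverter faulted", "inverter reset on site", "reset inverter fan"]
most_common = top_words(docs, n=2)
print(most_common)  # [('inverter', 3), ('reset', 2)]
```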

Finally, using the pvops.text.visualize.visualize_attribute_connectivity function, we can plot the correspondence between two columns in our dataset. Below, we can qualitatively see how often each asset type (the top row) is associated with each impact level (the bottom row).

[12]:

attributes_dict = dict(attribute1_col='Asset', attribute2_col='ImpactLevel')
filtered_df = preprocessed_df.dropna(subset=attributes_dict.values())

_, ax = plt.subplots(figsize=(14,4))
text_visualize.visualize_attribute_connectivity(filtered_df, attributes_dict, ax=ax, graph_aargs=dict(node_size=1000));

[Figure: connectivity between asset types (top row) and impact levels (bottom row)]
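One way to read a connectivity plot like this is as a picture of pair counts: how many rows carry each combination of asset and impact level. The sketch below tallies such pairs for a handful of made-up records, skipping rows where either value is missing. The records (including the "Offline" label) are invented for the example, and this is not a description of how pvops builds its graph.

```python
from collections import Counter

# Made-up (asset, impact level) records; None marks a missing value.
records = [
    ("inverter", "Underperformance"),
    ("inverter", "Offline"),
    ("facility", "Underperformance"),
    ("inverter", "Underperformance"),
    ("tracker", None),
]

# Count each (asset, impact) pair, ignoring rows with a missing value.
pair_counts = Counter(
    (asset, impact)
    for asset, impact in records
    if asset is not None and impact is not None
)

for (asset, impact), count in pair_counts.most_common():
    print(f"{asset} -> {impact}: {count}")
```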