text module
text.classify module
- pvops.text.classify.classification_deployer(X, y, n_splits, classifiers, search_space, pipeline_steps, scoring, greater_is_better=True, verbose=3)[source]
The classification deployer builds a classifier evaluator with a built-in grid search for hyperparameter tuning. The output of this function is a data frame showing the performance of each classifier under each hyperparameter configuration tested.
For an example of this method’s application, see tutorials//text_class_example.py
- Parameters:
X (list of str) – List of documents (str). The documents will be passed through the pipeline_steps, where they will be transformed into vectors.
y (list) – List of labels corresponding with the documents in X
n_splits (int) – Integer defining the number of splits in the cross validation split during training
classifiers (dict) – Dictionary with key as classifier identifier (str) and value as classifier instance following sklearn’s base model convention: sklearn_docs.
classifiers = { 'LinearSVC' : LinearSVC(), 'AdaBoostClassifier' : AdaBoostClassifier(), 'RidgeClassifier' : RidgeClassifier() }
See supervised_classifier_defs.py or unsupervised_classifier_defs.py for this package’s defaults.
search_space (dict) – Dictionary with classifier identifiers, as used in classifiers, mapped to its hyperparameters.
search_space = { 'LinearSVC' : { 'clf__C' : [1e-2,1e-1], 'clf__max_iter':[800,1000], }, 'AdaBoostClassifier' : { 'clf__n_estimators' : [50,100], 'clf__learning_rate':[1.,0.9,0.8], 'clf__algorithm' : ['SAMME.R'] }, 'RidgeClassifier' : { 'clf__alpha' : [0.,1e-3,1.], 'clf__normalize' : [False,True] } }
See supervised_classifier_defs.py or unsupervised_classifier_defs.py for this package’s defaults.
pipeline_steps (list of tuples) – Define embedding and machine learning pipeline. The last tuple must be ('clf', None) so that the output of the pipeline is a prediction. For supervised classifiers using a TFIDF embedding, one could specify
pipeline_steps = [('tfidf', TfidfVectorizer()), ('clf', None)]
For unsupervised clusterers using a TFIDF embedding, one could specify
pipeline_steps = [('tfidf', TfidfVectorizer()), ('to_dense', DataDensifier.DataDensifier()), ('clf', None)]
A densifier is required by some clusterers, which fail if sparse data is passed.
scoring (sklearn callable scorer, i.e., any statistic that summarizes predictions relative to observations) – Callable object that returns a scalar score, created using sklearn.metrics.make_scorer. Example scorers include f1_score, accuracy, etc. For supervised classifiers, one could specify
scoring = make_scorer(f1_score, average = 'weighted')
For unsupervised classifiers, one could specify
scoring = make_scorer(homogeneity_score)
greater_is_better (bool) – Whether a greater value of the scoring metric is better (e.g., accuracy) or not.
verbose (int) – Control the specificity of the prints. If greater than 1, a print out is shown when a new “best classifier” is found while iterating. Additionally, the verbosity during the grid search follows sklearn’s definitions. The frequency of the messages increases with the verbosity level.
- Returns:
DataFrame – Summarization of results from all of the classifiers
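The number of configurations evaluated grows multiplicatively with each list in search_space. The sketch below is not pvops code: it only assumes the dictionary layout of the search_space parameter, and the helpers expand_grid and count_configurations are hypothetical names introduced for illustration. It counts cross-validated fits per classifier and ignores any final refit a grid search may perform.

```python
from itertools import product

# Same layout as the search_space parameter: classifier id -> {param name: candidate values}.
search_space = {
    "LinearSVC": {"clf__C": [1e-2, 1e-1], "clf__max_iter": [800, 1000]},
    "AdaBoostClassifier": {
        "clf__n_estimators": [50, 100],
        "clf__learning_rate": [1.0, 0.9, 0.8],
    },
}


def expand_grid(param_grid):
    """Return every hyperparameter combination in a grid as a list of dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


def count_configurations(space, n_splits):
    """Hypothetical helper: fits per classifier (combinations x CV splits)."""
    return {name: len(expand_grid(grid)) * n_splits for name, grid in space.items()}


fits = count_configurations(search_space, n_splits=5)
print(fits)
```

A count like this can help when choosing n_splits and the size of each hyperparameter list before launching a long search.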
- pvops.text.classify.get_attributes_from_keywords(om_df, col_dict, reference_df, reference_col_dict)[source]
Find keywords of interest in specified column of dataframe, return as new column value.
If keywords of interest given in a reference dataframe are in the specified column of the dataframe, return the keyword category, or categories. For example, if the string ‘inverter’ is in the list of text, return [‘inverter’].
- Parameters:
om_df (pd.DataFrame) – Dataframe to search for keywords of interest, must include text_col.
col_dict (dict of {str : str}) – A dictionary that contains the column names needed:
data : string, should be assigned to associated column which stores the tokenized text logs
predicted_col : string, will be used to create keyword search label column
reference_df (DataFrame) – Holds columns that define the reference dictionary to search for keywords of interest. Note: This function can currently only handle single words; there is no n-gram functionality.
reference_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how referencing is going to be done
reference_col_from : string, should be assigned to the column name in reference_df that holds the possible input reference values. Example: pd.Series(['inverter', 'invert', 'inv'])
reference_col_to : string, should be assigned to the column name in reference_df that holds the output reference values of interest. Example: pd.Series(['inverter', 'inverter', 'inverter'])
- Returns:
om_df (pd.DataFrame) – Input df with new_col added, where each found keyword is its own row. This may result in duplicate rows if more than one keyword of interest was found in text_col.
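The one-row-per-keyword behaviour described in the return value can be pictured with plain Python. This is an illustrative sketch, not the pvops implementation: rows are dicts rather than a DataFrame, the reference columns are collapsed into a single dict, and the helper label_rows and the column names "tokens" and "keyword" are hypothetical. How rows without any match are treated is an assumption made here.

```python
def label_rows(rows, reference, data_col="tokens", predicted_col="keyword"):
    """Emit one output row per keyword category found in each row's tokens."""
    labeled = []
    for row in rows:
        # Map each token to its category, dropping duplicates but keeping order.
        found = list(dict.fromkeys(reference[t] for t in row[data_col] if t in reference))
        # Assumption: rows with no match are kept with a None label.
        for category in found or [None]:
            labeled.append({**row, predicted_col: category})
    return labeled


# reference_col_from -> reference_col_to, written as a dict for brevity.
reference = {"inverter": "inverter", "inv": "inverter", "module": "module"}
rows = [
    {"id": 1, "tokens": ["inv", "offline", "module", "cracked"]},
    {"id": 2, "tokens": ["site", "visit"]},
]
labeled = label_rows(rows, reference)
for row in labeled:
    print(row["id"], row["keyword"])
```

Row 1 appears twice in the result because two keyword categories were found in its tokens.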
text.defaults module
- pvops.text.defaults.supervised_classifier_defs(settings_flag)[source]
Establish supervised classifier definitions which are non-specific to the embedder and, therefore, non-specific to the natural language processing application
- Parameters:
settings_flag (str) – Either ‘light’, ‘normal’ or ‘detailed’; a setting which determines the number of hyperparameter combinations tested during the grid search. For instance, a dataset of 50 thousand samples may run for hours on the ‘normal’ setting but for days on ‘detailed’.
- Returns:
search_space (dict) – Hyperparameter instances for each classifier
classifiers (dict) – Contains sklearn classifier instances
- pvops.text.defaults.unsupervised_classifier_defs(setting_flag, n_clusters)[source]
Establish unsupervised classifier definitions which are non-specific to the embedder and, therefore, non-specific to the natural language processing application
- Parameters:
setting_flag (str) – Either ‘normal’ or ‘detailed’; a setting which determines the number of hyperparameter combinations tested during the grid search. For instance, a dataset of 50,000 samples may run for hours on the ‘normal’ setting but for days on ‘detailed’.
n_clusters (int) – Number of clusters to organize the text data into. Usually set to the number of unique categories within data.
- Returns:
search_space (dict) – Hyperparameter instances for each clusterer
clusterers (dict) – Contains sklearn cluster instances
text.nlp_utils module
- class pvops.text.nlp_utils.DataDensifier[source]
Bases: BaseEstimator
A data structure transformer which converts sparse data to dense data. This process is usually incorporated in this library when doing unsupervised machine learning. This class is built specifically to work inside a sklearn pipeline. Therefore, it uses the default transform, fit, fit_transform method structure.
- fit(X, y=None)[source]
Placeholder method to conform to the sklearn class structure.
- Parameters:
X (array) – Input data
y (Not utilized.)
- Returns:
DataDensifier object
- class pvops.text.nlp_utils.Doc2VecModel(vector_size=100, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, dv=None, dv_mapfile=None, comment=None, trim_rule=None, callbacks=(), window=5, epochs=10)[source]
Bases: BaseEstimator
Performs a gensim Doc2Vec transformation of the input documents to create embedded representations of the documents. See gensim’s Doc2Vec model for information regarding the hyperparameters.
- fit_transform(raw_documents, y=None)[source]
Utilizes the fit() and transform() methods in this class.
- set_fit_request(*, raw_documents: bool | None | str = '$UNCHANGED$') Doc2VecModel
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
raw_documents (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for raw_documents parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, raw_documents: bool | None | str = '$UNCHANGED$') Doc2VecModel
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
raw_documents (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for raw_documents parameter in transform.
- Returns:
self (object) – The updated object.
- pvops.text.nlp_utils.create_stopwords(lst_langs=['english'], lst_add_words=[], lst_keep_words=[])[source]
Concatenate a list of stopwords using both words grabbed from nltk and user-specified words.
- Parameters:
lst_langs (list) – List of strings designating the languages for a nltk.corpus.stopwords.words query. If empty list is passed, no stopwords will be queried from nltk.
lst_add_words (list) – List of words (e.g., “road” or “street”) to add to stopwords list. If these words are already included in the nltk query, a duplicate will not be added.
lst_keep_words (list) – List of words (e.g., “before” or “until”) to remove from stopwords list. This is usually used to modify default stop words that might be of interest to PV.
- Returns:
list – List of alphabetized stopwords
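The combination rule described by the three parameters (add words without duplicating, remove the words to keep, return the result alphabetized) can be sketched with set operations. This is an illustration only, not the pvops implementation: combine_stopwords is a hypothetical helper, and the base list is a hand-written stand-in for the result of an nltk stopwords query.

```python
def combine_stopwords(base_words, add_words=(), keep_words=()):
    """Merge stopword lists, drop the words to keep, and return them alphabetized."""
    combined = set(base_words) | set(add_words)  # set union avoids duplicates
    combined -= set(keep_words)                  # kept words are no longer stopwords
    return sorted(combined)


# Hypothetical stand-in for the words returned by an nltk stopwords query.
base = ["the", "before", "until", "a"]
stopwords = combine_stopwords(
    base, add_words=["road", "the"], keep_words=["before", "until"]
)
print(stopwords)  # ['a', 'road', 'the']
```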
- pvops.text.nlp_utils.summarize_text_data(om_df, colname)[source]
Display information about a set of documents located in a dataframe, including the number of samples, average number of words, vocabulary size, and number of words in total.
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the colname of interest
colname (str) – Column name of column with text
- Returns:
dict – dictionary containing printed summary data
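The four statistics named in the description can be computed with a few lines of standard-library Python. The sketch below is not the pvops implementation: summarize_documents and the dictionary keys are hypothetical names, and documents are split on whitespace, which is an assumption about tokenization.

```python
def summarize_documents(documents):
    """Compute simple corpus statistics from whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    n_words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "n_samples": len(documents),
        "n_words_total": n_words,
        "avg_words_per_sample": n_words / len(documents) if documents else 0.0,
        "vocabulary_size": len(vocabulary),
    }


summary = summarize_documents(["inverter offline", "inverter fan replaced today"])
print(summary)
```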
text.preprocess module
- pvops.text.preprocess.get_dates(document, om_df, ind, col_dict, print_info, infer_date_surrounding_rows=True)[source]
Extract dates from the input document.
This method is utilized within preprocessor.py. For an easy way to extract dates, utilize the preprocessor and set extract_dates_only = True.
- Parameters:
document (str) – String representation of a document
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the columns within col_dict.
ind (integer) – Designates the row of the dataframe which is currently being observed. This is required because if the current row does not have a valid date in the eventstart, then an iterative search is conducted by first starting at the nearest rows.
col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the get_dates fn
data : string, should be assigned to associated column which stores the text logs
eventstart : string, should be assigned to associated column which stores the log submission datetime
print_info (bool) – Flag indicating whether to print information about the preprocessing progress
infer_date_surrounding_rows (bool) – If True, utilizes iterative search in dataframe to infer the datetime from surrounding rows if the current row’s date value is nan. If False, does not utilize the base datetime; consequently, today’s date is used to replace the missing parts of the datetime. Recommendation: set True if you frequently publish documents and your dataframe is ordered chronologically.
- Returns:
list – List of dates found in text
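The iterative search over surrounding rows mentioned for ind and infer_date_surrounding_rows can be pictured as an outward scan from the current row. The following is a minimal sketch under stated assumptions, not the pvops implementation: nearest_valid_date is a hypothetical helper, missing dates are represented by None, and checking the earlier row before the later row at each distance is an arbitrary choice made here.

```python
from datetime import datetime


def nearest_valid_date(dates, ind):
    """Return the closest non-missing date to position ind, searching outward."""
    if dates[ind] is not None:
        return dates[ind]
    for offset in range(1, len(dates)):
        # Check the row before, then the row after, at increasing distance.
        for candidate in (ind - offset, ind + offset):
            if 0 <= candidate < len(dates) and dates[candidate] is not None:
                return dates[candidate]
    return None  # no row holds a usable date


eventstart = [datetime(2020, 1, 5), None, None, datetime(2020, 3, 1)]
print(nearest_valid_date(eventstart, 1))
print(nearest_valid_date(eventstart, 2))
```

A scan like this only gives sensible base dates when the rows are ordered chronologically, which is why the recommendation above ties the flag to an ordered dataframe.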
- pvops.text.preprocess.get_keywords_of_interest(document_tok, reference_df, reference_col_dict)[source]
Find keywords of interest in list of strings from reference dict.
If keywords of interest given in a reference dict are in the list of strings, return the keyword category, or categories. For example, if the string ‘inverter’ is in the list of text, return [‘inverter’].
- Parameters:
document_tok (list of str) – Tokenized text, functionally a list of string values.
reference_df (DataFrame) – Holds columns that define the reference dictionary to search for keywords of interest. Note: This function can currently only handle single words; there is no n-gram functionality.
reference_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how referencing is going to be done
reference_col_from : string, should be assigned to the column name in reference_df that holds the possible input reference values. Example: pd.Series(['inverter', 'invert', 'inv'])
reference_col_to : string, should be assigned to the column name in reference_df that holds the output reference values of interest. Example: pd.Series(['inverter', 'inverter', 'inverter'])
- Returns:
included_equipment (list of str) – List of keywords from reference_df found in document_tok; can contain more than one value.
- pvops.text.preprocess.preprocessor(om_df, lst_stopwords, col_dict, print_info=False, extract_dates_only=False)[source]
Preprocessing function which processes the raw text data into processed text data and extracts dates
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the columns within col_dict.
lst_stopwords (list) – List of stop words which will be filtered in final preprocessing step
col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the get_dates fn
data : string, should be assigned to associated column which stores the text logs
eventstart : string, should be assigned to associated column which stores the log submission datetime
save_data_column : string, should be assigned to associated column where the processed text should be stored
save_date_column : string, should be assigned to associated column where the extracted dates from the text should be stored
print_info (bool) – Flag indicating whether to print information about the preprocessing progress
extract_dates_only (bool) – If True, return after extracting dates in each ticket. If False, return with preprocessed text and extracted dates.
- Returns:
df (DataFrame) – Contains the original columns as well as the processed data, located in columns defined by the inputs
- pvops.text.preprocess.text_remove_nondate_nums(document, PRINT_INFO=False)[source]
Conduct initial text processing steps to prepare the text for date extractions. Function mostly uses regex-based text substitution to remove numerical structures within the text, which may be mistaken as a date by the date extractor.
- Parameters:
document (str) – String representation of a document
PRINT_INFO (bool) – Flag indicating whether to print information about the preprocessing progress
- Returns:
string – string of processed document
- pvops.text.preprocess.text_remove_numbers_stopwords(document, lst_stopwords)[source]
Conduct final processing steps after date extraction
- Parameters:
document (str) – String representation of a document
lst_stopwords (list) – List of stop words which will be filtered in final preprocessing step
- Returns:
string – string of processed document
text.utils module
- pvops.text.utils.remap_attributes(om_df, remapping_df, remapping_col_dict, allow_missing_mappings=False, print_info=False)[source]
A utility function which remaps the attributes of om_df using columns within remapping_df.
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M data, which needs to be remapped.
remapping_df (DataFrame) – Holds columns that define the remappings
remapping_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how remapping is going to be done
attribute_col : string, should be assigned to associated column name in om_df which will be remapped
remapping_col_from : string, should be assigned to associated column name in remapping_df that matches original attribute of interest in om_df
remapping_col_to : string, should be assigned to associated column name in remapping_df that contains the final mapped entries
allow_missing_mappings (bool) – If True, allow attributes without specified mappings to exist in the final dataframe. If False, only attributes specified in remapping_df will be in final dataframe.
print_info (bool) – If True, print information about remapping.
- Returns:
DataFrame – dataframe with remapped columns populated
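The effect of allow_missing_mappings can be illustrated on a plain list of attribute values. This sketch is not the pvops implementation: remap_values is a hypothetical helper, and the two remapping columns are collapsed into a single dict from original to mapped value.

```python
def remap_values(values, mapping, allow_missing_mappings=False):
    """Remap each value; unmapped values are kept or dropped depending on the flag."""
    remapped = []
    for value in values:
        if value in mapping:
            remapped.append(mapping[value])
        elif allow_missing_mappings:
            remapped.append(value)  # keep the original attribute unchanged
        # otherwise the entry is left out of the result
    return remapped


# remapping_col_from -> remapping_col_to, written as a dict for brevity.
mapping = {"inv": "inverter", "Inverter": "inverter", "mod": "module"}
assets = ["inv", "mod", "tracker", "Inverter"]
print(remap_values(assets, mapping))
print(remap_values(assets, mapping, allow_missing_mappings=True))
```

With the flag left at False, "tracker" disappears from the result because no mapping is defined for it; with True it passes through unchanged.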
- pvops.text.utils.remap_words_in_text(om_df, remapping_df, remapping_col_dict)[source]
A utility function which remaps a text column of om_df using columns within remapping_df.
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M note data
remapping_df (DataFrame) – Holds columns that define the remappings
remapping_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how remapping is going to be done
data : string, should be assigned to associated column name in om_df which will have its text tokenized and remapped
remapping_col_from : string, should be assigned to associated column name in remapping_df that matches original attribute of interest in om_df
remapping_col_to : string, should be assigned to associated column name in remapping_df that contains the final mapped entries
- Returns:
DataFrame – dataframe with remapped columns populated
text.visualize module
- pvops.text.visualize.visualize_attribute_connectivity(om_df, om_col_dict, figsize=(20, 10), attribute_colors=['lightgreen', 'cornflowerblue'], edge_width_scalar=10, graph_aargs={})[source]
Visualize a knowledge graph which shows the frequency of combinations between attributes ATTRIBUTE1_COL and ATTRIBUTE2_COL.
Now uses a bipartite layout. ATTRIBUTE2_COL is colored using a colormap.
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in om_col_dict
om_col_dict (dict of {str : str}) – A dictionary that contains the column names to be used in visualization:
{ 'attribute1_col' : string, 'attribute2_col' : string }
figsize (tuple) – Figure size, defaults to (20,10)
attribute_colors (list[str]) – List of two strings which designate the colors for Attribute 1 and Attribute 2, respectively.
edge_width_scalar (numeric) – Weight utilized to scale widths based on number of connections between Attribute 1 and Attribute 2. Larger values will produce larger widths, and smaller values will produce smaller widths.
graph_aargs (dict) – Optional, arguments passed to networkx graph drawer. Suggested attributes to pass:
with_labels=True
font_weight='bold'
node_size=19000
font_size=35
- Returns:
Matplotlib axis, networkx graph
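The edge widths in such a graph derive from how often each pair of attribute values occurs together. The sketch below shows one way to compute them and is not the pvops implementation: edge_weights is a hypothetical helper, records are plain dicts rather than DataFrame rows, and normalizing by the most frequent pair before applying edge_width_scalar is an assumption.

```python
from collections import Counter


def edge_weights(records, attribute1_col, attribute2_col, edge_width_scalar=10):
    """Count co-occurrences of attribute pairs and scale them into edge widths."""
    counts = Counter((r[attribute1_col], r[attribute2_col]) for r in records)
    max_count = max(counts.values())
    # Assumption: widths are normalized by the most frequent pair before scaling.
    return {pair: edge_width_scalar * n / max_count for pair, n in counts.items()}


records = [
    {"asset": "inverter", "issue": "offline"},
    {"asset": "inverter", "issue": "offline"},
    {"asset": "module", "issue": "cracked"},
]
widths = edge_weights(records, "asset", "issue")
print(widths)
```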
- pvops.text.visualize.visualize_attribute_timeseries(om_df, om_col_dict, date_structure='%Y-%m', figsize=(12, 6), cmap_name='brg')[source]
Visualize stacked bar chart of attribute frequency over time, where x-axis is time and y-axis is count, displaying separate bars for each label within the label column
- Parameters:
om_df (DataFrame) – A pandas dataframe of O&M data, which contains columns in om_col_dict
om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the visualization
label (string), should be assigned to associated column name for the label/attribute of interest in om_df
date (string), should be assigned to associated column name for the dates relating to the documents in om_df
date_structure (str) – Controls the resolution of the bar chart’s timeseries Default : “%Y-%m”. Can change to include finer resolutions (e.g., by including day, “%Y-%m-%d”) or coarser resolutions (e.g., by year, “%Y”)
figsize (tuple) – Optional, figure size
cmap_name (str) – Optional, color map name in matplotlib
- Returns:
Matplotlib figure instance
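The date_structure string is a strftime format, so changing it changes how dates are bucketed before counting. The sketch below illustrates that bucketing with the standard library only; count_by_period is a hypothetical helper and the records are hand-written, so this is not the pvops plotting code.

```python
from collections import Counter
from datetime import datetime


def count_by_period(records, date_structure="%Y-%m"):
    """Count (period, label) pairs, where period is the date formatted with date_structure."""
    return Counter((date.strftime(date_structure), label) for date, label in records)


records = [
    (datetime(2021, 1, 5), "inverter"),
    (datetime(2021, 1, 20), "inverter"),
    (datetime(2021, 2, 3), "module"),
]
monthly = count_by_period(records)        # one bucket per month
yearly = count_by_period(records, "%Y")   # coarser: one bucket per year
print(monthly)
print(yearly)
```

Each (period, label) count corresponds to the height of one segment in a stacked bar.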
- pvops.text.visualize.visualize_classification_confusion_matrix(om_df, col_dict, title='')[source]
Visualize confusion matrix comparing known categorical values, and predicted categorical values.
- Parameters:
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in col_dict
col_dict (dict of {str : str}) – A dictionary that contains the column names needed:
data : string, should be assigned to associated column which stores the tokenized text logs
attribute_col : string, will be assigned to attribute column and used to create new attribute_col
predicted_col : string, will be used to create keyword search label column
title (str) – Optional, title of plot
- Returns:
Matplotlib figure instance
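The counts behind a confusion matrix are simply the tallies of (known, predicted) label pairs. The following sketch computes them with the standard library; confusion_counts is a hypothetical helper, and the row and column orientation chosen here (rows are known labels, columns are predicted labels) is an assumption rather than a statement about the pvops plot.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Return sorted labels and a matrix of counts indexed [known][predicted]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    # Rows are known labels, columns are predicted labels.
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


actual = ["inverter", "inverter", "module", "module"]
predicted = ["inverter", "module", "module", "module"]
labels, matrix = confusion_counts(actual, predicted)
print(labels)
print(matrix)
```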
- pvops.text.visualize.visualize_cluster_entropy(doc2vec, eval_kmeans, om_df, data_cols, ks, cmap_name='brg')[source]
Visualize entropy of embedding space partition. Currently only supports doc2vec embedding.
- Parameters:
doc2vec (Doc2Vec model instance) – Instance of gensim.models.doc2vec.Doc2Vec
eval_kmeans (callable) – Callable cluster fit function. For instance,
def eval_kmeans(X, k):
    km = KMeans(n_clusters=k)
    km.fit(X)
    return km
om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in data_cols
data_cols (list) – List of column names (str) which have text data.
ks (list) – List of k parameters required for the clustering mechanic eval_kmeans
cmap_name (str) – Optional, color map name
- Returns:
Matplotlib figure instance
- pvops.text.visualize.visualize_document_clusters(cluster_tokens, min_frequency=20)[source]
Visualize words most frequently occurring in a cluster. Especially useful when visualizing the results of an unsupervised partitioning of documents.
- Parameters:
cluster_tokens (list) – List of tokenized documents
min_frequency (int) – Minimum number of occurrences that a word must have in a cluster for it to be visualized
- Returns:
Matplotlib figure instance
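The min_frequency threshold acts as a filter on per-cluster word counts. A minimal sketch of that filter is shown below; frequent_words is a hypothetical helper, and treating the threshold as inclusive (a word with exactly min_frequency occurrences is kept) is an assumption based on the parameter description.

```python
from collections import Counter


def frequent_words(cluster_tokens, min_frequency=20):
    """Count words across tokenized documents and keep those meeting the threshold."""
    counts = Counter(token for document in cluster_tokens for token in document)
    return {word: n for word, n in counts.items() if n >= min_frequency}


cluster_tokens = [
    ["inverter", "fault"],
    ["inverter", "reset"],
    ["fault", "inverter"],
]
print(frequent_words(cluster_tokens, min_frequency=2))
```

Raising min_frequency trims rare words such as "reset" here, which keeps the resulting plot readable for large clusters.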
- pvops.text.visualize.visualize_word_frequency_plot(tokenized_words, title='', font_size=16, graph_aargs={})[source]
Visualize the frequency distribution of words within a set of documents
- Parameters:
tokenized_words (list) – List of tokenized words
title (str) – Optional, title of plot
font_size (int) – Optional, font size
graph_aargs (dict) – Optional, other parameters passed to nltk.FreqDist.plot()
- Returns:
Matplotlib figure instance