text module

text.classify module

pvops.text.classify.classification_deployer(X, y, n_splits, classifiers, search_space, pipeline_steps, scoring, greater_is_better=True, verbose=3)[source]

The classification deployer builds a classifier evaluator with a built-in grid-search protocol for hyperparameter fine-tuning. The output of this function is a DataFrame showing the performance of each classifier under each hyperparameter configuration.

To see an example of this method’s application, see tutorials/text_class_example.py

Parameters:
  • X (list of str) – List of documents (str). The documents will be passed through the pipeline_steps, where they will be transformed into vectors.

  • y (list) – List of labels corresponding to the documents in X

  • n_splits (int) – Number of cross-validation splits used during training

  • classifiers (dict) – Dictionary with key as classifier identifier (str) and value as a classifier instance following sklearn’s base estimator convention.

    classifiers = {
        'LinearSVC' : LinearSVC(),
        'AdaBoostClassifier' : AdaBoostClassifier(),
        'RidgeClassifier' : RidgeClassifier()
    }
    

    See supervised_classifier_defs.py or unsupervised_classifier_defs.py for this package’s defaults.

  • search_space (dict) – Dictionary with classifier identifiers, as used in classifiers, mapped to their hyperparameter grids.

    search_space = {
        'LinearSVC' : {
            'clf__C' : [1e-2, 1e-1],
            'clf__max_iter' : [800, 1000],
        },
        'AdaBoostClassifier' : {
            'clf__n_estimators' : [50, 100],
            'clf__learning_rate' : [1., 0.9, 0.8],
            'clf__algorithm' : ['SAMME.R']
        },
        'RidgeClassifier' : {
            'clf__alpha' : [0., 1e-3, 1.],
            'clf__normalize' : [False, True]
        }
    }
    

    See supervised_classifier_defs.py or unsupervised_classifier_defs.py for this package’s defaults.

  • pipeline_steps (list of tuples) – Define embedding and machine learning pipeline. The last tuple must be ('clf', None) so that the output of the pipeline is a prediction. For supervised classifiers using a TFIDF embedding, one could specify

    pipeline_steps = [('tfidf', TfidfVectorizer()),
                      ('clf', None)]
    

    For unsupervised clusterers using a TFIDF embedding, one could specify

    pipeline_steps = [('tfidf', TfidfVectorizer()),
                      ('to_dense', DataDensifier.DataDensifier()),
                      ('clf', None)]
    

    A densifier is required by some clusterers, which fail if sparse data is passed.

  • scoring (sklearn callable scorer) – Callable object that returns a scalar score, created using sklearn.metrics.make_scorer; i.e., any statistic that summarizes predictions relative to observations. Example scorers include f1_score, accuracy, etc. For supervised classifiers, one could specify

    scoring = make_scorer(f1_score, average = 'weighted')
    

    For unsupervised classifiers, one could specify

    scoring = make_scorer(homogeneity_score)
    
  • greater_is_better (bool) – Whether the scoring parameter is better when greater (e.g., accuracy) or not.

  • verbose (int) – Controls the verbosity of the printed output. If greater than 1, a message is printed whenever a new “best classifier” is found while iterating. Additionally, the verbosity during the grid search follows sklearn’s definitions: the frequency of the messages increases with the verbosity level.

Returns:

DataFrame – Summarization of results from all of the classifiers
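
For illustration, a minimal invocation could look like the following sketch; the documents, labels, classifier, and grid values are placeholders rather than package defaults (see text.defaults for the packaged search spaces):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import f1_score, make_scorer
    from sklearn.svm import LinearSVC
    from pvops.text.classify import classification_deployer

    # Placeholder documents and labels
    X = ['inverter tripped offline', 'reset inverter this morning',
         'cleaned module', 'module soiling found']
    y = ['inverter', 'inverter', 'module', 'module']

    results = classification_deployer(
        X, y,
        n_splits=2,
        classifiers={'LinearSVC': LinearSVC()},
        search_space={'LinearSVC': {'clf__C': [1e-2, 1e-1]}},
        pipeline_steps=[('tfidf', TfidfVectorizer()), ('clf', None)],
        scoring=make_scorer(f1_score, average='weighted'),
    )
    print(results)  # one row per classifier/hyperparameter configuration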

pvops.text.classify.get_attributes_from_keywords(om_df, col_dict, reference_df, reference_col_dict)[source]

Find keywords of interest in specified column of dataframe, return as new column value.

If keywords of interest given in a reference dataframe are in the specified column of the dataframe, return the keyword category, or categories. For example, if the string ‘inverter’ is in the list of text, return [‘inverter’].

Parameters:
  • om_df (pd.DataFrame) – Dataframe to search for keywords of interest, must include text_col.

  • col_dict (dict of {str : str}) – A dictionary that contains the column names needed:

    • data : string, should be assigned to associated column which stores the tokenized text logs

    • predicted_col : string, will be used to create keyword search label column

  • reference_df (DataFrame) – Holds columns that define the reference dictionary to search for keywords of interest. Note: this function can currently only handle single words, with no n-gram functionality.

  • reference_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how referencing is going to be done

    • reference_col_from : string, should be assigned to the associated column name in reference_df that holds the possible input reference values. Example: pd.Series(['inverter', 'invert', 'inv'])

    • reference_col_to : string, should be assigned to the associated column name in reference_df that holds the output reference values of interest. Example: pd.Series(['inverter', 'inverter', 'inverter'])

Returns:

om_df (pd.DataFrame) – Input df with new_col added, where each found keyword is its own row; this may result in duplicate rows if more than one keyword of interest was found in text_col.
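
A small sketch of the expected inputs; the column names 'in', 'out', 'ProcessedText', and 'Keyword' are hypothetical:

    import pandas as pd
    from pvops.text.classify import get_attributes_from_keywords

    om_df = pd.DataFrame({'ProcessedText': ['replaced inv fuse',
                                            'cleaned module']})
    reference_df = pd.DataFrame({'in': ['inverter', 'invert', 'inv'],
                                 'out': ['inverter', 'inverter', 'inverter']})
    col_dict = {'data': 'ProcessedText', 'predicted_col': 'Keyword'}
    reference_col_dict = {'reference_col_from': 'in',
                          'reference_col_to': 'out'}

    om_df = get_attributes_from_keywords(om_df, col_dict,
                                         reference_df, reference_col_dict)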

text.defaults module

pvops.text.defaults.supervised_classifier_defs(settings_flag)[source]

Establish supervised classifier definitions which are not specific to the embeddor and are therefore not specific to the natural language processing application.

Parameters:

settings_flag (str) – Either ‘light’, ‘normal’ or ‘detailed’; a setting which determines the number of hyperparameter combinations tested during the grid search. For instance, a dataset of 50,000 samples may run for hours on the ‘normal’ setting but for days on ‘detailed’.

Returns:

  • search_space (dict) – Hyperparameter grids for each classifier

  • classifiers (dict) – Contains sklearn classifier instances

pvops.text.defaults.unsupervised_classifier_defs(setting_flag, n_clusters)[source]

Establish unsupervised clusterer definitions which are not specific to the embeddor and are therefore not specific to the natural language processing application.

Parameters:
  • setting_flag (str) – Either ‘normal’ or ‘detailed’; a setting which determines the number of hyperparameter combinations tested during the grid search. For instance, a dataset of 50,000 samples may run for hours on the ‘normal’ setting but for days on ‘detailed’.

  • n_clusters (int) – Number of clusters to organize the text data into. Usually set to the number of unique categories within the data.

Returns:

  • search_space (dict) – Hyperparameter instances for each clusterer

  • clusterers (dict) – Contains sklearn cluster instances
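
Both functions produce ready-made inputs for classification_deployer. A brief sketch, assuming the tuple order matches the Returns sections above:

    from pvops.text.defaults import (supervised_classifier_defs,
                                     unsupervised_classifier_defs)

    # Supervised: choose the breadth of the grid search.
    search_space, classifiers = supervised_classifier_defs('normal')

    # Unsupervised: additionally specify the number of clusters.
    search_space, clusterers = unsupervised_classifier_defs('normal',
                                                            n_clusters=5)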

text.nlp_utils module

class pvops.text.nlp_utils.DataDensifier[source]

Bases: BaseEstimator

A data structure transformer which converts sparse data to dense data. This process is usually incorporated in this library when doing unsupervised machine learning. This class is built specifically to work inside a sklearn pipeline; therefore, it uses the standard fit, transform, and fit_transform method structure.
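
A standalone sketch of the transformation, outside of a pipeline:

    import numpy as np
    from scipy.sparse import csr_matrix
    from pvops.text.nlp_utils import DataDensifier

    sparse_embeddings = csr_matrix(np.eye(3))  # sparse input
    dense = DataDensifier().fit_transform(sparse_embeddings)  # dense array out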

fit(X, y=None)[source]

Placeholder method to conform to the sklearn class structure.

Parameters:
  • X (array) – Input data

  • y (Not utilized.)

Returns:

DataDensifier object

fit_transform(X, y=None)[source]

Performs the same action as DataDensifier.transform(), which returns a dense array when the input is sparse.

Parameters:
  • X (array) – Input data

  • y (Not utilized.)

Returns:

dense array

transform(X, y=None)[source]

Return a dense array if the input array is sparse.

Parameters:

X (array) – Input data of numerical values. For this package, these values could represent embedded representations of documents.

Returns:

dense array

class pvops.text.nlp_utils.Doc2VecModel(vector_size=100, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, dv=None, dv_mapfile=None, comment=None, trim_rule=None, callbacks=(), window=5, epochs=10)[source]

Bases: BaseEstimator

Performs a gensim Doc2Vec transformation of the input documents to create embedded representations of the documents. See gensim’s Doc2Vec model for information regarding the hyperparameters.
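
Because the class follows sklearn’s estimator conventions, it can embed documents directly or act as a pipeline step. A minimal sketch, assuming the raw documents are plain strings (the documents and hyperparameter values are placeholders):

    from pvops.text.nlp_utils import Doc2VecModel

    docs = ['inverter fault detected', 'module cleaning performed']
    d2v = Doc2VecModel(vector_size=50, window=3, epochs=20)
    d2v.fit(docs)
    vectors = d2v.transform(docs)  # one embedding vector per document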

fit(raw_documents, y=None)[source]

Fits the Doc2Vec model.

fit_transform(raw_documents, y=None)[source]

Utilizes the fit() and transform() methods in this class.

set_fit_request(*, raw_documents: bool | None | str = '$UNCHANGED$') → Doc2VecModel

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

raw_documents (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for raw_documents parameter in fit.

Returns:

self (object) – The updated object.

set_transform_request(*, raw_documents: bool | None | str = '$UNCHANGED$') → Doc2VecModel

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

raw_documents (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for raw_documents parameter in transform.

Returns:

self (object) – The updated object.

transform(raw_documents)[source]

Transforms the documents into Doc2Vec vectors.

pvops.text.nlp_utils.create_stopwords(lst_langs=['english'], lst_add_words=[], lst_keep_words=[])[source]

Concatenate a list of stopwords using both words grabbed from nltk and user-specified words.

Parameters:
  • lst_langs (list) – List of strings designating the languages for a nltk.corpus.stopwords.words query. If empty list is passed, no stopwords will be queried from nltk.

  • lst_add_words (list) – List of words (e.g., “road” or “street”) to add to the stopwords list. If these words are already included in the nltk query, a duplicate will not be added.

  • lst_keep_words (list) – List of words (e.g., “before” or “until”) to remove from the stopwords list. This is usually used to retain default stopwords that may be of interest in PV applications.

Returns:

list – List of alphabetized stopwords
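
For example (the added and kept words are illustrative; the nltk stopwords corpus must be available, e.g., via nltk.download('stopwords')):

    from pvops.text.nlp_utils import create_stopwords

    stopwords = create_stopwords(lst_langs=['english'],
                                 lst_add_words=['site', 'technician'],
                                 lst_keep_words=['before', 'until'])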

pvops.text.nlp_utils.summarize_text_data(om_df, colname)[source]

Display information about a set of documents located in a dataframe, including the number of samples, average number of words, vocabulary size, and number of words in total.

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the colname of interest

  • colname (str) – Column name of column with text

Returns:

dict – dictionary containing printed summary data

text.preprocess module

pvops.text.preprocess.get_dates(document, om_df, ind, col_dict, print_info, infer_date_surrounding_rows=True)[source]

Extract dates from the input document.

This method is utilized within preprocessor.py. For an easy way to extract dates, utilize the preprocessor and set extract_dates_only = True.

Parameters:
  • document (str) – String representation of a document

  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the columns within col_dict.

  • ind (int) – Designates the row of the dataframe which is currently being observed. This is required because, if the current row does not have a valid date in eventstart, an iterative search is conducted starting at the nearest rows.

  • col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the get_dates fn

    • data : string, should be assigned to associated column which stores the text logs

    • eventstart : string, should be assigned to associated column which stores the log submission datetime

  • print_info (bool) – Flag indicating whether to print information about the preprocessing progress

  • infer_date_surrounding_rows (bool) – If True, utilizes an iterative search in the dataframe to infer the datetime from surrounding rows when the current row’s date value is NaN. If False, does not utilize the base datetime; consequently, today’s date is used to replace the missing parts of the datetime. Recommendation: set True if you frequently publish documents and your dataframe is ordered chronologically.

Returns:

list – List of dates found in text

pvops.text.preprocess.get_keywords_of_interest(document_tok, reference_df, reference_col_dict)[source]

Find keywords of interest in list of strings from reference dict.

If keywords of interest given in a reference dict are in the list of strings, return the keyword category, or categories. For example, if the string ‘inverter’ is in the list of text, return [‘inverter’].

Parameters:
  • document_tok (list of str) – Tokenized text, functionally a list of string values.

  • reference_df (DataFrame) – Holds columns that define the reference dictionary to search for keywords of interest. Note: this function can currently only handle single words, with no n-gram functionality.

  • reference_col_dict (dict of {str : str}) – A dictionary that contains the column names that describe how referencing is going to be done

    • reference_col_from : string, should be assigned to the associated column name in reference_df that holds the possible input reference values. Example: pd.Series(['inverter', 'invert', 'inv'])

    • reference_col_to : string, should be assigned to the associated column name in reference_df that holds the output reference values of interest. Example: pd.Series(['inverter', 'inverter', 'inverter'])

Returns:

included_equipment (list of str) – List of keywords from reference_df found in document_tok; can be more than one value.
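
A short sketch; the reference columns 'in' and 'out' are hypothetical names:

    import pandas as pd
    from pvops.text.preprocess import get_keywords_of_interest

    document_tok = ['replaced', 'inv', 'fuse']
    reference_df = pd.DataFrame({'in': ['inverter', 'invert', 'inv'],
                                 'out': ['inverter', 'inverter', 'inverter']})
    reference_col_dict = {'reference_col_from': 'in',
                          'reference_col_to': 'out'}

    keywords = get_keywords_of_interest(document_tok, reference_df,
                                        reference_col_dict)
    # e.g., ['inverter']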

pvops.text.preprocess.preprocessor(om_df, lst_stopwords, col_dict, print_info=False, extract_dates_only=False)[source]

Preprocessing function which transforms the raw text data into processed text data and extracts dates

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains at least the columns within col_dict.

  • lst_stopwords (list) – List of stop words which will be filtered in final preprocessing step

  • col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for preprocessing

    • data : string, should be assigned to associated column which stores the text logs

    • eventstart : string, should be assigned to associated column which stores the log submission datetime

    • save_data_column : string, should be assigned to associated column where the processed text should be stored

    • save_date_column : string, should be assigned to associated column where the extracted dates from the text should be stored

  • print_info (bool) – Flag indicating whether to print information about the preprocessing progress

  • extract_dates_only (bool) – If True, return after extracting dates from each ticket. If False, return with both the preprocessed text and the extracted dates.

Returns:

df (DataFrame) – Contains the original columns as well as the processed data, located in columns defined by the inputs
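
An end-to-end sketch with hypothetical column names:

    import pandas as pd
    from pvops.text.nlp_utils import create_stopwords
    from pvops.text.preprocess import preprocessor

    om_df = pd.DataFrame({
        'CompletionDesc': ['Inverter 2 tripped on 6/10/2020; reset on site.'],
        'Date_EventStart': pd.to_datetime(['2020-06-10']),
    })
    col_dict = {'data': 'CompletionDesc',
                'eventstart': 'Date_EventStart',
                'save_data_column': 'processed_text',
                'save_date_column': 'extracted_dates'}

    df = preprocessor(om_df, create_stopwords(), col_dict, print_info=False)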

pvops.text.preprocess.text_remove_nondate_nums(document, PRINT_INFO=False)[source]

Conduct initial text processing steps to prepare the text for date extractions. Function mostly uses regex-based text substitution to remove numerical structures within the text, which may be mistaken as a date by the date extractor.

Parameters:
  • document (str) – String representation of a document

  • PRINT_INFO (bool) – Flag indicating whether to print information about the preprocessing progress

Returns:

string – String of the processed document

pvops.text.preprocess.text_remove_numbers_stopwords(document, lst_stopwords)[source]

Conduct final processing steps after date extraction

Parameters:
  • document (str) – String representation of a document

  • lst_stopwords (list) – List of stop words which will be filtered in final preprocessing step

Returns:

string – String of the processed document
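
The two functions above are intended to bracket date extraction; chained directly, a sketch looks like (the document and stopword list are placeholders):

    from pvops.text.preprocess import (text_remove_nondate_nums,
                                       text_remove_numbers_stopwords)

    doc = 'Ticket 4532: inverter 2 offline at 13:05 on 6/10/2020'
    interim = text_remove_nondate_nums(doc)  # strip non-date numbers
    clean = text_remove_numbers_stopwords(interim, lst_stopwords=['at', 'on'])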

text.utils module

pvops.text.utils.remap_attributes(om_df, remapping_df, remapping_col_dict, allow_missing_mappings=False, print_info=False)[source]

A utility function which remaps the attributes of om_df using columns within remapping_df.

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which needs to be remapped.

  • remapping_df (DataFrame) – Holds columns that define the remappings

  • remapping_col_dict (dict of {str : str}) – A dictionary that contains the column names that describes how remapping is going to be done

    • attribute_col : string, should be assigned to associated column name in om_df which will be remapped

    • remapping_col_from : string, should be assigned to associated column name in remapping_df that matches original attribute of interest in om_df

    • remapping_col_to : string, should be assigned to associated column name in remapping_df that contains the final mapped entries

  • allow_missing_mappings (bool) – If True, allow attributes without specified mappings to exist in the final dataframe. If False, only attributes specified in remapping_df will be in final dataframe.

  • print_info (bool) – If True, print information about remapping.

Returns:

DataFrame – dataframe with remapped columns populated
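
A sketch with hypothetical column names, harmonizing raw labels into a controlled vocabulary:

    import pandas as pd
    from pvops.text.utils import remap_attributes

    om_df = pd.DataFrame({'Asset': ['Facility', 'facility', 'inverter']})
    remapping_df = pd.DataFrame({'from': ['Facility', 'facility'],
                                 'to': ['site', 'site']})
    remapping_col_dict = {'attribute_col': 'Asset',
                          'remapping_col_from': 'from',
                          'remapping_col_to': 'to'}

    om_df = remap_attributes(om_df, remapping_df, remapping_col_dict,
                             allow_missing_mappings=True)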

pvops.text.utils.remap_words_in_text(om_df, remapping_df, remapping_col_dict)[source]

A utility function which remaps a text column of om_df using columns within remapping_df.

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M note data

  • remapping_df (DataFrame) – Holds columns that define the remappings

  • remapping_col_dict (dict of {str : str}) – A dictionary that contains the column names that describes how remapping is going to be done

    • data : string, should be assigned to associated column name in om_df which will have its text tokenized and remapped

    • remapping_col_from : string, should be assigned to associated column name in remapping_df that matches original attribute of interest in om_df

    • remapping_col_to : string, should be assigned to associated column name in remapping_df that contains the final mapped entries

Returns:

DataFrame – dataframe with remapped columns populated

text.visualize module

pvops.text.visualize.visualize_attribute_connectivity(om_df, om_col_dict, figsize=(20, 10), attribute_colors=['lightgreen', 'cornflowerblue'], edge_width_scalar=10, graph_aargs={})[source]

Visualize a knowledge graph which shows the frequency of combinations between attributes ATTRIBUTE1_COL and ATTRIBUTE2_COL

The graph uses a bipartite layout; ATTRIBUTE2_COL is colored using a colormap.

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in om_col_dict

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names to be used in visualization:

    {
        'attribute1_col' : string,
        'attribute2_col' : string
    }
    
  • figsize (tuple) – Figure size, defaults to (20,10)

  • attribute_colors (list[str]) – List of two strings which designate the colors for Attribute 1 and Attribute 2, respectively.

  • edge_width_scalar (numeric) – Weight utilized to scale widths based on number of connections between Attribute 1 and Attribute 2. Larger values will produce larger widths, and smaller values will produce smaller widths.

  • graph_aargs (dict) – Optional, arguments passed to networkx graph drawer. Suggested attributes to pass:

    • with_labels=True

    • font_weight='bold'

    • node_size=19000

    • font_size=35

Returns:

  • Matplotlib axis,

  • networkx graph
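
A minimal sketch; the column names are illustrative:

    import pandas as pd
    from pvops.text.visualize import visualize_attribute_connectivity

    om_df = pd.DataFrame({'Equipment': ['inverter', 'inverter', 'module'],
                          'FailureMode': ['fuse', 'fan', 'soiling']})
    om_col_dict = {'attribute1_col': 'Equipment',
                   'attribute2_col': 'FailureMode'}

    ax, graph = visualize_attribute_connectivity(
        om_df, om_col_dict,
        graph_aargs={'with_labels': True, 'font_weight': 'bold'})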

pvops.text.visualize.visualize_attribute_timeseries(om_df, om_col_dict, date_structure='%Y-%m', figsize=(12, 6), cmap_name='brg')[source]

Visualize a stacked bar chart of attribute frequency over time, where the x-axis is time and the y-axis is count, displaying separate bars for each label within the label column

Parameters:
  • om_df (DataFrame) – A pandas dataframe of O&M data, which contains columns in om_col_dict

  • om_col_dict (dict of {str : str}) – A dictionary that contains the column names relevant for the visualization

    • label (string), should be assigned to associated column name for the label/attribute of interest in om_df

    • date (string), should be assigned to associated column name for the dates relating to the documents in om_df

  • date_structure (str) – Controls the resolution of the bar chart’s timeseries. Default: “%Y-%m”. Can be changed to finer resolutions (e.g., by day, “%Y-%m-%d”) or coarser resolutions (e.g., by year, “%Y”)

  • figsize (tuple) – Optional, figure size

  • cmap_name (str) – Optional, color map name in matplotlib

Returns:

Matplotlib figure instance
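
A brief sketch, again with illustrative column names:

    import pandas as pd
    from pvops.text.visualize import visualize_attribute_timeseries

    om_df = pd.DataFrame({
        'Label': ['inverter', 'module', 'inverter'],
        'EventDate': pd.to_datetime(['2020-01-05', '2020-01-20',
                                     '2020-02-11']),
    })
    om_col_dict = {'label': 'Label', 'date': 'EventDate'}

    fig = visualize_attribute_timeseries(om_df, om_col_dict,
                                         date_structure='%Y-%m')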

pvops.text.visualize.visualize_classification_confusion_matrix(om_df, col_dict, title='')[source]

Visualize a confusion matrix comparing known categorical values and predicted categorical values.

Parameters:
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in om_col_dict

  • col_dict (dict of {str : str}) – A dictionary that contains the column names needed:

    • data : string, should be assigned to associated column which stores the tokenized text logs

    • attribute_col : string, should be assigned to the associated column which stores the known attribute labels

    • predicted_col : string, will be used to create keyword search label column

  • title (str) – Optional, title of plot

Returns:

Matplotlib figure instance

pvops.text.visualize.visualize_cluster_entropy(doc2vec, eval_kmeans, om_df, data_cols, ks, cmap_name='brg')[source]

Visualize the entropy of the embedding space partition. Currently only supports doc2vec embeddings.

Parameters:
  • doc2vec (Doc2Vec model instance) – Instance of gensim.models.doc2vec.Doc2Vec

  • eval_kmeans (callable) – Callable cluster fit function. For instance,

    def eval_kmeans(X,k):
        km = KMeans(n_clusters=k)
        km.fit(X)
        return km
    
  • om_df (DataFrame) – A pandas dataframe containing O&M data, which contains columns specified in om_col_dict

  • data_cols (list) – List of column names (str) which have text data.

  • ks (list) – List of k parameters required for the clustering mechanic eval_kmeans

  • cmap_name (str) – Optional, color map name in matplotlib

Returns:

Matplotlib figure instance

pvops.text.visualize.visualize_document_clusters(cluster_tokens, min_frequency=20)[source]

Visualize words most frequently occurring in a cluster. Especially useful when visualizing the results of an unsupervised partitioning of documents.

Parameters:
  • cluster_tokens (list) – List of tokenized documents

  • min_frequency (int) – Minimum number of occurrences that a word must have in a cluster for it to be visualized

Returns:

Matplotlib figure instance

pvops.text.visualize.visualize_word_frequency_plot(tokenized_words, title='', font_size=16, graph_aargs={})[source]

Visualize the frequency distribution of words within a set of documents

Parameters:
  • tokenized_words (list) – List of tokenized words

  • title (str) – Optional, title of plot

  • font_size (int) – Optional, font size

  • graph_aargs (dict) – Optional, other parameters passed to nltk.FreqDist.plot()

Returns:

Matplotlib figure instance