autobot package

Submodules

autobot.baselines module

autobot.baselines.get_bert_base(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, model='bert-base')

autobot.baselines.get_doc2vec(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, classifier='LR')
autobot.baselines.get_lr_char_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_lr_word_char_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_lr_word_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_majority(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_svm_char_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_svm_word_char_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_svm_word_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
autobot.baselines.get_tpot_word_pipeline(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)
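
All baselines share the same call signature, so they can be swept uniformly. A minimal usage sketch (the toy data is illustrative, and the structure of the return value is not documented above, so it is left unspecified here):

    from autobot import baselines

    # Toy split; any lists of strings with matching label lists work.
    train_sequences = ["the plot was gripping", "a dull, lifeless film"]
    train_targets = [1, 0]
    dev_sequences = ["surprisingly engaging"]
    dev_targets = [1]

    # Every baseline above accepts exactly these arguments.
    result = baselines.get_lr_word_pipeline(
        train_sequences, dev_sequences,
        train_targets, dev_targets,
        time_constraint=1, num_cpu=1, max_features=1000)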

autobot.conceptnet_features module

class autobot.conceptnet_features.ConceptFeatures(max_features=10000, targets=None, knowledge_graph='../memory/conceptnet.txt.gz')

Bases: object

Core class implementing knowledge graph (ConceptNet)-based features.

concept_graph(document_space, graph_path)

If no prior knowledge graph is supplied, one is constructed.

Parameters
  • document_space – The list of input documents

  • graph_path – The path of the knowledge graph used.

Return grounded

Grounded relations.

fit(text_vector, refit=False)

Fit the model to a text vector.

Parameters

text_vector – Input list of documents.

fit_transform(text_vector, b=None)

A classic fit-transform method.

Parameters

text_vector – The input list of documents.

Return transformedObj

Transformed texts (to features).

get_feature_names()
get_propositionalized_rep(documents)

The method for constructing the representation.

Parameters

documents – The input list of documents.

transform(text_vector, use_conc_docs=False)

Transform the data into suitable form.
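
A minimal usage sketch (the knowledge graph path is an assumption; point it at your gzipped ConceptNet dump):

    from autobot.conceptnet_features import ConceptFeatures

    documents = ["cats are small mammals", "dogs chase cats"]

    cf = ConceptFeatures(max_features=1000,
                         knowledge_graph="memory/conceptnet.txt.gz")  # assumed path
    features = cf.fit_transform(documents)  # documents -> concept-based feature space
    print(cf.get_feature_names()[:10])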

autobot.data_utils module

class autobot.data_utils.DataProcessor

Bases: object

Base class for data converters for sequence classification data sets.

get_dev_examples(data_dir)

Gets a collection of `InputExample`s for the dev set.

get_labels()

Gets the list of labels for this data set.

get_train_examples(data_dir)

Gets a collection of `InputExample`s for the train set.

read_pandas_tsv(input_file)
autobot.data_utils.acc_and_f1(preds, labels, average=None)
autobot.data_utils.compute_metrics(task_name, preds, labels)
class autobot.data_utils.genericProcessor

Bases: autobot.data_utils.DataProcessor

get_dev_examples(data_dir)

See base class.

get_test_examples(data_dir)

See base class.

get_train_examples(data_dir)

See base class.

autobot.data_utils.pearson_and_spearman(preds, labels)
autobot.data_utils.simple_accuracy(preds, labels)
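
The metric helpers above follow the usual conventions; a sketch of those conventions (not necessarily the exact implementation, and the `average` handling is an assumption):

    import numpy as np
    from sklearn.metrics import f1_score

    def simple_accuracy_sketch(preds, labels):
        # Fraction of exact matches between predictions and gold labels.
        return (np.asarray(preds) == np.asarray(labels)).mean()

    def acc_and_f1_sketch(preds, labels, average="binary"):
        acc = simple_accuracy_sketch(preds, labels)
        f1 = f1_score(labels, preds, average=average)
        return {"acc": acc, "f1": f1, "acc_and_f1": (acc + f1) / 2}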

autobot.feature_constructors module

autoBOT, Skrlj et al. (2021)

class autobot.feature_constructors.FeaturePrunner(max_num_feat=2048)

Bases: object

Core class for pruning the feature space to at most max_num_feat features.

fit(input_data, y=None)
get_feature_names()
transform(input_data)
autobot.feature_constructors.build_dataframe(data_docs)

One of the core methods, responsible for constructing a dataframe object.

Parameters

data_docs – The input data documents

Return df_data

A dataframe corresponding to text representations

class autobot.feature_constructors.digit_col

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Dealing with numeric features

Parameters
  • BaseEstimator – Core estimator

  • TransformerMixin – Transformer object

Return object

Returns transformed (scaled) space

fit(x, y=None)
transform(hd_searches)
autobot.feature_constructors.fast_screening_sgd(training, targets)
autobot.feature_constructors.get_affix(text)

This method extracts affix information.

autobot.feature_constructors.get_autoBOT_manual(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, clf_type='LR')
autobot.feature_constructors.get_features(df_data, representation_type='neurosymbolic', targets=None, sparsity=0.1, embedding_dim=512, memory_location='memory/conceptnet.txt.gz', custom_pipeline=None, concept_features=True, combine_with_existing_representation=False)

Method that computes various TF-IDF-like features.
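
A usage sketch combining build_dataframe and get_features (the structure of the returned object is not documented above, so it is left unpacked; disabling concept features avoids needing a ConceptNet dump at memory_location):

    from autobot.feature_constructors import build_dataframe, get_features

    docs = ["this is a document", "and another, rather different one"]
    df_data = build_dataframe(docs)

    # Keyword arguments mirror the defaults in the signature above.
    features = get_features(df_data,
                            representation_type="neurosymbolic",
                            concept_features=False)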

autobot.feature_constructors.get_pos_tags(text)

This method yields POS tags.

Parameters

text – Input string of text

Return string

Space-delimited POS tags.
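
A sketch of the documented output format using NLTK (the actual tagger used by autoBOT may differ):

    import nltk
    # One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def pos_tags_sketch(text):
        # Tag each token, then join the tags into one space-delimited string.
        tokens = nltk.word_tokenize(text)
        return " ".join(tag for _, tag in nltk.pos_tag(tokens))

    # pos_tags_sketch("The cat sleeps") -> "DT NN VBZ"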

autobot.feature_constructors.get_simple_features(df_data, max_num_feat=10000)
autobot.feature_constructors.get_subset(indice_list, data_matrix, vectorizer)
autobot.feature_constructors.parallelize(data, method)

Helper method for parallelization

Parameters
  • data – Input data to be transformed

  • method – The method to parallelize

Return data

Returns the transformed data

autobot.feature_constructors.remove_hashtags(text, replace_token)

This method removes hashtags

Parameters
  • text – Input string of text

  • replace_token – The token to be replaced

Return string

A new text string

autobot.feature_constructors.remove_mentions(text, replace_token)

This method removes mentions (relevant for tweets)

Parameters
  • text – Input string of text

  • replace_token – A token to be replaced

Return string

A new text string

autobot.feature_constructors.remove_punctuation(text)

This method removes punctuation

autobot.feature_constructors.remove_stopwords(text)

This method removes stopwords

Parameters

text – Input string of text

Return string

Preprocessed text

autobot.feature_constructors.remove_url(text, replace_token)

Removal of URLs

Parameters
  • text – Input string of text

  • replace_token – The token to be replaced

Return string

A new text string
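
The three removal helpers above behave like token-level replacements; a regex sketch of that behavior (the actual patterns in autoBOT may differ):

    import re

    def remove_url_sketch(text, replace_token):
        # Replace anything that looks like a URL with the given token.
        return re.sub(r"https?://\S+|www\.\S+", replace_token, text)

    def remove_mentions_sketch(text, replace_token):
        return re.sub(r"@\w+", replace_token, text)

    def remove_hashtags_sketch(text, replace_token):
        return re.sub(r"#\w+", replace_token, text)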

class autobot.feature_constructors.text_col(key)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A helper processor class

Parameters
  • BaseEstimator – Core estimator

  • TransformerMixin – Transformer object

Return object

Returns particular text column

fit(x, y=None)
transform(data_dict)
autobot.feature_constructors.ttr(text)

Type-token ratio: the number of unique tokens relative to all tokens

Parameters

text – Input string of text

Return float

Ratio of unique to overall tokens
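
A sketch of the type-token ratio as documented (whitespace tokenization is an assumption):

    def ttr_sketch(text):
        # Unique tokens divided by all tokens; 0.0 for empty input.
        tokens = text.split()
        return len(set(tokens)) / len(tokens) if tokens else 0.0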

autobot.keyword_features module

class autobot.keyword_features.KeywordFeatures(max_features=10000, targets=None)

Bases: object

Core class implementing keyword-based features.

fit(text_vector, refit=False)

Fit the model to a text vector.

Parameters

text_vector – The input list of texts

fit_transform(text_vector, b=None)

A classic fit-transform method.

Parameters

text_vector – Input list of texts.

Return transformedObject

Transformed list of texts

get_feature_names()
transform(text_vector)

Transform the data into suitable form.

Parameters

text_vector – The input list of texts.

Return transformedObject

The transformed input texts (feature space)
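
A minimal usage sketch of the sklearn-like interface (toy data; passing targets is optional per the constructor above):

    from autobot.keyword_features import KeywordFeatures

    texts = ["graph based keyword extraction", "keywords describe documents"]

    kf = KeywordFeatures(max_features=1000)
    features = kf.fit_transform(texts)   # texts -> keyword feature space
    print(kf.get_feature_names()[:5])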

autobot.metrics module

autobot.metrics.get_metric_report(y_true, y_prediction)

A generic metric report, suitable for multi-objective experiments (not part of the core paper).

autobot.rakun module

RaKUn is an algorithm for graph-based keyword extraction.

class autobot.rakun.RakunDetector(hyperparameters, verbose=True)

Bases: object

calculate_edit_distance(key1, key2)
calculate_embedding_distance(key1, key2)
corpus_graph(language_file, limit_range=3000000, verbose=False, lemmatizer=None, stopwords=None, min_char=4, stemmer=None, input_type='file')
find_keywords(document, input_type='file', validate=False)
generate_hypervertices(G)

This method generates hypervertices.

hypervertex_prunning(graph, distance_threshold, pair_diff_max=2, distance_method='editdistance')
visualize_network(visualization_parameters=None, display=True)
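
A minimal usage sketch (the hyperparameter keys shown are illustrative assumptions; consult the corpus_graph defaults above, e.g. stopwords, min_char, stemmer, for the supported options):

    from autobot.rakun import RakunDetector

    hyperparameters = {"distance_threshold": 2,            # assumed key
                       "distance_method": "editdistance",  # matches hypervertex_prunning
                       "num_keywords": 10}                 # assumed key
    detector = RakunDetector(hyperparameters, verbose=False)
    keywords = detector.find_keywords("some_document.txt", input_type="file")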

autobot.sentence_embeddings module

class autobot.sentence_embeddings.documentEmbedder(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)

Bases: object

Core class describing the document embedding methodology employed here. The class functions as an sklearn-like object.

fit(text_vector, b=None, refit=False)

Fit the model to a text vector.

Parameters

text_vector – A list of texts

fit_transform(text_vector, a2=None)

A classic fit-transform method.

Parameters

text_vector – A text vector used to build and transform a corpus.

get_feature_names()
transform(text_vector)

Transform the data into suitable form.

Parameters

text_vector – The text vector to be transformed via a trained model
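
A minimal usage sketch of the sklearn-like interface (toy data):

    from autobot.sentence_embeddings import documentEmbedder

    texts = ["first document", "second document about something else"]

    de = documentEmbedder(num_cpu=2, dm=1, ndim=512)
    de.fit(texts)
    embeddings = de.transform(texts)  # one ndim-dimensional vector per document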

autobot.strategy_ga module

This is the main GA underlying the autoBOT approach. This file contains, without warranty, the code that performs the optimization. Made by Blaz Skrlj, Ljubljana 2020, Jozef Stefan Institute

class autobot.strategy_ga.GAlearner(train_sequences_raw, train_targets, time_constraint, num_cpu='all', task_name='update:', latent_dim=512, sparsity=0.1, hof_size=3, scoring_metric=None, top_k_importances=25, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', classifier=None, n_fold_cv=5, classifier_hyperparameters=None, custom_transformer_pipeline=None, combine_with_existing_representation=False, verbose=1)

Bases: object

The core GA class. It includes methods for the evolution of a learner assembly. Each instance of autoBOT must first be instantiated. In general, the workflow for this class is: 1) instantiate the class, 2) evolve, 3) predict, as in the sketch below.
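
A minimal instantiate-evolve-predict sketch (toy data; the unit of time_constraint is an assumption to be checked against the paper/source):

    from autobot.strategy_ga import GAlearner

    train_sequences = ["an amusing film", "a tedious bore"]  # toy corpus
    train_targets = [1, 0]

    learner = GAlearner(train_sequences,
                        train_targets,
                        time_constraint=1,  # assumed to be in hours
                        num_cpu=1,
                        representation_type="neurosymbolic")
    learner.evolve(nind=10, crossover_proba=0.4, mutpb=0.15)
    predictions = learner.predict(["a new text to classify"])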

apply_weights(parameters, custom_feature_space=False, custom_feature_matrix=None)

This method applies weights to individual parts of the feature space.

Parameters
  • parameters – a vector of real-valued parameters (a solution, i.e., an individual)

  • custom_feature_space – Custom feature space, relevant during making of predictions.

Return tmp_space

Temporary weighted space (individual)

compute_time_diff()

A method for approximate time monitoring.

cross_val_scores(tmp_feature_space, final_run=False, n_cpu=None)

Compute the learnability of the representation.

Parameters
  • tmp_feature_space – An individual’s solution space.

  • final_run – Last run is more extensive.

  • n_cpu – Number of CPUs to use.

Return f1_perf, clf

F1 performance and the learned classifier.

custom_initialization()

Custom initialization employs a random uniform prior. See the paper for more details.

evaluate_fitness(individual, max_num_feat=1000, return_clf_and_vec=False)

A helper method for evaluating an individual solution. Given a real-valued vector, this constructs the representations and evaluates a given learner.

Parameters
  • individual – an individual (solution)

  • max_num_feat – maximum number of features that are outputted

  • return_clf_and_vec – return classifier and vectorizer? This is useful for deployment.

Return score

The fitness score.

evolve(nind=10, crossover_proba=0.4, mutpb=0.15, stopping_interval=20, strategy='evolution', validation_type='cv')

The core evolution method. It first constrains the maximum number of features to be taken into account by lowering the bound w.r.t. performance; next, it evolves.

Parameters
  • nind – number of individuals (int)

  • crossover_proba – crossover probability (float)

  • mutpb – mutation probability (float)

  • stopping_interval – for how long no improvement is tolerated before a hard reset (int)

  • strategy – type of evolution (str)

  • validation_type – type of validation, either cv (cross-validation) or train_val (train-validation split)

feature_type_importances(solution_index=0)

A method which prints feature type importances as a pandas df.

Parameters

solution_index – The index of the individual (solution) to inspect.

Return feature_ranking

Final table of rankings

generate_and_update_stats(fits)

A helper method for generating stats.

Parameters

fits – fitness values of the current population

generate_id_intervals()

Generate independent intervals.

generate_random_initial_state(weights_importances)

The initialization method, capable of generating individuals.

get_feature_importance_report(individual, fitnesses)

Report feature importances.

Parameters
  • individual – an individual solution (a vector of floats)

  • fitnesses – fitness space (list of reals)

Return report

A printout of the current performance.

get_feature_space()

Extract final feature space considered for learning purposes.

instantiate_validation_env()

This method refreshes the feature space. This is needed to maximize efficiency.

mutReg(individual, p=1)

Custom mutation operator used for regularization optimization.

Parameters

individual – individual (vector of floats)

Return individual

An individual solution.

parallelize_dataframe(df, func)

A method for parallel traversal of a given dataframe.

Parameters
  • df – dataframe of text (Pandas object)

  • func – function to be executed (a function)

predict(instances)

Predict labels for new instances. Note that the prediction is a majority vote across the hall of fame.

Parameters

instances – New instances (texts) for which labels are predicted.

report_performance(fits, gen=0)

A helper method for performance reports.

Parameters
  • fits – fitness values (vector of floats)

  • gen – generation to be reported (int)

return_dataframe_from_text(text)

A helper method that builds a dataframe from a list of texts.

Parameters

text – list of texts.

Return parsed df

A parsed text (a DataFrame)

softmax(x)

Compute softmax values for each set of scores in x.

Parameters

x – A vector of floats
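
A standard numerically stable sketch of this operation:

    import numpy as np

    def softmax_sketch(x):
        # Subtract the max before exponentiating to avoid overflow.
        e_x = np.exp(np.asarray(x) - np.max(x))
        return e_x / e_x.sum()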

summarise_final_learners()
update_global_feature_importances()

Aggregate feature importances across top learners to obtain the final ranking.

update_intermediary_feature_space(custom_space=None)

Create the subset of the original feature space based on the starting_feature_numbers vector that gets evolved.

visualize_fitness(image_path='fitnessExample.png')

A method for visualizing fitness.

Parameters

image_path – Path to the output file; the extension determines the file type. If set to None, only a DataFrame of statistics is returned.

Return dfx

DataFrame of evolution evaluations

autobot.strategy_random_search module

autobot.word_relations module

class autobot.word_relations.relationExtractor(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=1, min_token='bigrams')

Bases: object

The main token relation extraction class. Works for arbitrary tokens.

compute_distance(pair, token_dict)

A core distance method for computing index-based differences.

Parameters
  • pair – the pair of tokens

  • token_dict – distance map

Return pair[0], pair[1], dist

The two tokens and the distance
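
A sketch of the documented return contract (assuming token_dict maps each token to a positional index):

    def compute_distance_sketch(pair, token_dict):
        # Absolute difference of the two tokens' indices.
        dist = abs(token_dict[pair[0]] - token_dict[pair[1]])
        return pair[0], pair[1], dist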

fit(text_vector, b=None)

Fit the model to a text vector.

Parameters

text_vector – The input list of texts.

fit_transform(text_vector, a2)

A classic fit-transform method.

Parameters

text_vector – Input list of texts.

get_feature_names()

Return exact feature names.

transform(text_vector)

Transform the data into suitable form.

Parameters

text_vector – The input list of texts.

witem_kernel(instance)

A simple kernel for traversing a given document.

Parameters

instance – a piece of text

Return global_distances

Distances between tokens

Module contents