autobot package¶
Submodules¶
autobot.baselines module¶
-
autobot.baselines.
get_bert_base
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, model='bert-base')¶
-
autobot.baselines.
get_doc2vec
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, classifier='LR')¶
-
autobot.baselines.
get_lr_char_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_lr_word_char_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_lr_word_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_majority
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_svm_char_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_svm_word_char_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_svm_word_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
-
autobot.baselines.
get_tpot_word_pipeline
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000)¶
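All baselines above share the same four leading arguments (training and development texts with their targets). As an illustration of the simplest one, the following is a minimal sketch of a majority-class predictor. It does not reproduce `get_majority`: the helper name `majority_baseline` and its return value (a list of predicted labels) are assumptions made for this example.

```python
from collections import Counter


def majority_baseline(train_targets, dev_sequences):
    """Predict the most frequent training label for every dev document.

    Hypothetical helper for illustration; not part of autobot.baselines.
    """
    # most_common(1) returns a one-element list holding a (label, count) pair.
    majority_label, _ = Counter(train_targets).most_common(1)[0]
    return [majority_label for _ in dev_sequences]


train_targets = ["pos", "neg", "pos", "pos"]
dev_sequences = ["a short text", "another text"]
predictions = majority_baseline(train_targets, dev_sequences)
```

Any learned baseline should score above this predictor on the development set; otherwise it has learned nothing beyond the class prior.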
autobot.conceptnet_features module¶
-
class
autobot.conceptnet_features.
ConceptFeatures
(max_features=10000, targets=None, knowledge_graph='../memory/conceptnet.txt.gz')¶ Bases:
object
Core class describing sentence embedding methodology employed here.
-
concept_graph
(document_space, graph_path)¶ If no prior knowledge graph is supplied, one is constructed.
- Parameters
document_space – The list of input documents.
graph_path – The path of the knowledge graph used.
- Return grounded
Grounded relations.
-
fit
(text_vector, refit=False)¶ Fit the model to a text vector.
- Parameters
text_vector – Input list of documents.
-
fit_transform
(text_vector, b=None)¶ A classic fit-transform method.
- Parameters
text_vector – The input list of documents.
- Return transformedObj
Transformed texts (to features).
-
get_feature_names
()¶
-
get_propositionalized_rep
(documents)¶ The method for constructing the representation.
- Parameters
documents – The input list of documents.
-
transform
(text_vector, use_conc_docs=False)¶ Transform the data into suitable form.
-
autobot.data_utils module¶
-
class
autobot.data_utils.
DataProcessor
¶ Bases:
object
Base class for data converters for sequence classification data sets.
-
get_labels
()¶ Gets the list of labels for this data set.
-
read_pandas_tsv
(input_file)¶
-
-
autobot.data_utils.
acc_and_f1
(preds, labels, average=None)¶
-
autobot.data_utils.
compute_metrics
(task_name, preds, labels)¶
-
class
autobot.data_utils.
genericProcessor
¶ Bases:
autobot.data_utils.DataProcessor
-
get_dev_examples
(data_dir)¶ See base class.
-
get_test_examples
(data_dir)¶ See base class.
-
get_train_examples
(data_dir)¶ See base class.
-
-
autobot.data_utils.
pearson_and_spearman
(preds, labels)¶
-
autobot.data_utils.
simple_accuracy
(preds, labels)¶
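The metric helpers above are listed without descriptions. Judging only by their names, `simple_accuracy` and `acc_and_f1` compare predictions against gold labels; the sketch below shows what such metrics compute in plain Python. The function names `simple_accuracy_sketch` and `binary_f1_sketch` are hypothetical, and the restriction to a single positive class is an assumption of this example, not a statement about the package.

```python
def simple_accuracy_sketch(preds, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for p, g in zip(preds, labels) if p == g)
    return correct / len(labels)


def binary_f1_sketch(preds, labels, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, g in pairs if p == positive and g == positive)
    fp = sum(1 for p, g in pairs if p == positive and g != positive)
    fn = sum(1 for p, g in pairs if p != positive and g == positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = [1, 0, 1, 1]
labels = [1, 0, 0, 1]
accuracy = simple_accuracy_sketch(preds, labels)
f1 = binary_f1_sketch(preds, labels)
```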
autobot.feature_constructors module¶
AutoBOT. Skrlj et al. 2021
-
class
autobot.feature_constructors.
FeaturePrunner
(max_num_feat=2048)¶ Bases:
object
Core class describing sentence embedding methodology employed here.
-
fit
(input_data, y=None)¶
-
get_feature_names
()¶
-
transform
(input_data)¶
-
-
autobot.feature_constructors.
build_dataframe
(data_docs)¶ One of the core methods responsible for construction of a dataframe object.
- Parameters
data_docs – The input data documents
- Return df_data
A dataframe corresponding to text representations
-
class
autobot.feature_constructors.
digit_col
¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Dealing with numeric features
- Parameters
BaseEstimator – Core estimator
TransformerMixin – Transformer object
- Return object
Returns transformed (scaled) space
-
fit
(x, y=None)¶
-
transform
(hd_searches)¶
-
autobot.feature_constructors.
fast_screening_sgd
(training, targets)¶
-
autobot.feature_constructors.
get_affix
(text)¶ This method gets the affix information
-
autobot.feature_constructors.
get_autoBOT_manual
(train_sequences, dev_sequences, train_targets, dev_targets, time_constraint=1, num_cpu=1, max_features=1000, clf_type='LR')¶
-
autobot.feature_constructors.
get_features
(df_data, representation_type='neurosymbolic', targets=None, sparsity=0.1, embedding_dim=512, memory_location='memory/conceptnet.txt.gz', custom_pipeline=None, concept_features=True, combine_with_existing_representation=False)¶ Computes various TF-IDF-like features.
This method yields POS tags
- Parameters
text – Input string of text
- Return string
Space-delimited POS tags.
-
autobot.feature_constructors.
get_simple_features
(df_data, max_num_feat=10000)¶
-
autobot.feature_constructors.
get_subset
(indice_list, data_matrix, vectorizer)¶
-
autobot.feature_constructors.
parallelize
(data, method)¶ Helper method for parallelization
- Parameters
data – Input data to be transformed
method – The method to parallelize
- Return data
Returns the transformed data
This method removes hashtags
- Parameters
text – Input string of text
replace_token – The token to be replaced
- Return string
A new text
-
autobot.feature_constructors.
remove_mentions
(text, replace_token)¶ This method removes mentions (relevant for tweets)
- Parameters
text – Input string of text
replace_token – A token to be replaced
- Return string
A new text string
-
autobot.feature_constructors.
remove_punctuation
(text)¶ This method removes punctuation
-
autobot.feature_constructors.
remove_stopwords
(text)¶ This method removes stopwords
- Parameters
text – Input string of text
- Return string
Preprocessed text
-
autobot.feature_constructors.
remove_url
(text, replace_token)¶ Removal of URLs
- Parameters
text – Input string of text
replace_token – The token to be replaced
- Return string
A new text
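The `remove_url` and `remove_mentions` helpers both take a text and a `replace_token`. A minimal sketch of this kind of token replacement with regular expressions is given below. The patterns and the function names `replace_urls` and `replace_mentions` are assumptions for illustration; the patterns used by the package may differ.

```python
import re

# Illustrative patterns only: http(s) URLs and @-prefixed user names.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def replace_urls(text, replace_token):
    """Replace every URL in the text with the given token."""
    return URL_PATTERN.sub(replace_token, text)


def replace_mentions(text, replace_token):
    """Replace every @mention in the text with the given token."""
    return MENTION_PATTERN.sub(replace_token, text)


tweet = "@alice see https://example.org/page now"
cleaned = replace_mentions(replace_urls(tweet, "URL"), "MENTION")
```

Replacing rather than deleting keeps a placeholder in the text, so a downstream vectorizer can still count how often URLs or mentions occur.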
-
class
autobot.feature_constructors.
text_col
(key)¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
A helper processor class
- Parameters
BaseEstimator – Core estimator
TransformerMixin – Transformer object
- Return object
Returns particular text column
-
fit
(x, y=None)¶
-
transform
(data_dict)¶
-
autobot.feature_constructors.
ttr
(text)¶ Number of unique tokens
- Parameters
text – Input string of text
- Return float
Ratio of the unique/overall tokens
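The ratio returned by `ttr` is commonly known as the type-token ratio. A minimal sketch follows; whitespace tokenization and the function name `type_token_ratio` are assumptions of this example, since the reference does not state how the text is tokenized.

```python
def type_token_ratio(text):
    """Number of unique tokens divided by the total number of tokens."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    if not tokens:
        # Avoid division by zero for empty input.
        return 0.0
    return len(set(tokens)) / len(tokens)


ratio = type_token_ratio("the cat saw the dog")
```

Here the text has five tokens, four of them distinct, so the ratio is 4/5.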
autobot.keyword_features module¶
-
class
autobot.keyword_features.
KeywordFeatures
(max_features=10000, targets=None)¶ Bases:
object
Core class describing sentence embedding methodology employed here.
-
fit
(text_vector, refit=False)¶ Fit the model to a text vector.
- Parameters
text_vector – The input list of texts
-
fit_transform
(text_vector, b=None)¶ A classic fit-transform method.
- Parameters
text_vector – Input list of texts.
- Return transformedObject
Transformed list of texts
-
get_feature_names
()¶
-
transform
(text_vector)¶ Transform the data into suitable form.
- Parameters
text_vector – The input list of texts.
- Return transformedObject
The transformed input texts (feature space)
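Several classes in this package (`ConceptFeatures`, `KeywordFeatures`, `documentEmbedder`, `relationExtractor`) expose the same `fit` / `transform` / `fit_transform` / `get_feature_names` interface. The sketch below shows that interface on a deliberately simple featurizer that counts the most frequent tokens. It is not the keyword-based method of `KeywordFeatures`; the class name `FrequentTokenFeatures` and its frequency ranking are stand-ins chosen for this example.

```python
from collections import Counter


class FrequentTokenFeatures:
    """Hypothetical sklearn-like featurizer; not the KeywordFeatures implementation."""

    def __init__(self, max_features=10000):
        self.max_features = max_features
        self.vocabulary = []

    def fit(self, text_vector):
        counts = Counter(
            token for text in text_vector for token in text.lower().split()
        )
        # Sort by descending count, then alphabetically, so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        self.vocabulary = [token for token, _ in ranked[: self.max_features]]
        return self

    def transform(self, text_vector):
        rows = []
        for text in text_vector:
            counts = Counter(text.lower().split())
            # One column per vocabulary token, holding its count in this text.
            rows.append([counts[token] for token in self.vocabulary])
        return rows

    def fit_transform(self, text_vector):
        return self.fit(text_vector).transform(text_vector)

    def get_feature_names(self):
        return list(self.vocabulary)


docs = ["graph keywords graph", "keywords matter"]
featurizer = FrequentTokenFeatures(max_features=2)
matrix = featurizer.fit_transform(docs)
```

Keeping `fit` and `transform` separate means the vocabulary learned on training texts can be reused unchanged on new texts.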
-
autobot.metrics module¶
-
autobot.metrics.
get_metric_report
(y_true, y_prediction)¶ A generic metric report; suitable for multiobjective experiments (not the core paper)
autobot.rakun module¶
RaKUn is an algorithm for graph-based keyword extraction.
-
class
autobot.rakun.
RakunDetector
(hyperparameters, verbose=True)¶ Bases:
object
-
calculate_edit_distance
(key1, key2)¶
-
calculate_embedding_distance
(key1, key2)¶
-
corpus_graph
(language_file, limit_range=3000000, verbose=False, lemmatizer=None, stopwords=None, min_char=4, stemmer=None, input_type='file')¶
-
find_keywords
(document, input_type='file', validate=False)¶
-
generate_hypervertices
(G)¶ This method generates hypervertices.
-
hypervertex_prunning
(graph, distance_threshold, pair_diff_max=2, distance_method='editdistance')¶
-
visualize_network
(visualization_parameters=None, display=True)¶
-
autobot.sentence_embeddings module¶
-
class
autobot.sentence_embeddings.
documentEmbedder
(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)¶ Bases:
object
Core class describing sentence embedding methodology employed here. The class functions as a sklearn-like object.
-
fit
(text_vector, b=None, refit=False)¶ Fit the model to a text vector.
- Parameters
text_vector – A list of texts.
-
fit_transform
(text_vector, a2=None)¶ A classic fit-transform method.
- Parameters
text_vector – A text vector used to build and transform a corpus.
-
get_feature_names
()¶
-
transform
(text_vector)¶ Transform the data into suitable form.
- Parameters
text_vector – The text vector to be transformed via a trained model.
-
autobot.strategy_ga module¶
This is the main GA underlying the autoBOT approach. This file contains, without warranty, the code that performs the optimization. Made by Blaz Skrlj, Ljubljana 2020, Jozef Stefan Institute
-
class
autobot.strategy_ga.
GAlearner
(train_sequences_raw, train_targets, time_constraint, num_cpu='all', task_name='update:', latent_dim=512, sparsity=0.1, hof_size=3, scoring_metric=None, top_k_importances=25, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', classifier=None, n_fold_cv=5, classifier_hyperparameters=None, custom_transformer_pipeline=None, combine_with_existing_representation=False, verbose=1)¶ Bases:
object
The core GA class. It includes methods for evolving an assembly of learners. The general workflow with this class is: 1.) Instantiate the class 2.) Evolve 3.) Predict
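The three-step workflow can be sketched as follows, using only the constructor, `evolve` and `predict` signatures documented in this section. The wrapper function `run_autobot_sketch` is hypothetical, the value passed for `time_constraint` is purely illustrative (its unit is not stated in this reference), and the import is deferred so the sketch can be read and loaded without the package installed.

```python
def run_autobot_sketch(train_texts, train_labels, new_texts):
    """Instantiate, evolve and predict, following the documented signatures."""
    # Deferred import: only needed when the function is actually called.
    from autobot.strategy_ga import GAlearner

    # 1.) Instantiate the class.
    learner = GAlearner(
        train_texts,        # train_sequences_raw
        train_labels,       # train_targets
        time_constraint=1,  # illustrative value; unit not stated here
    )
    # 2.) Evolve, here with the documented default settings spelled out.
    learner.evolve(nind=10, strategy="evolution", validation_type="cv")
    # 3.) Predict labels for new texts.
    return learner.predict(new_texts)
```

Whether `predict` returns a list, an array or another container is not specified in this reference, so the caller should not rely on a particular return type.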
-
apply_weights
(parameters, custom_feature_space=False, custom_feature_matrix=None)¶ This method applies weights to individual parts of the feature space.
- Parameters
parameters – A vector of real-valued parameters (a solution, i.e. an individual).
custom_feature_space – Custom feature space, relevant during making of predictions.
- Return tmp_space
Temporary weighted space (individual)
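One way to picture the weighting of "individual parts of the feature space" is a matrix whose columns are grouped into contiguous blocks, one per feature type, with one weight per block. The sketch below works under exactly that assumption; the function name `apply_weights_sketch`, the `intervals` argument and the list-of-lists matrix are all hypothetical and say nothing about how `apply_weights` stores its data.

```python
def apply_weights_sketch(feature_rows, intervals, weights):
    """Scale each contiguous block of columns by its own weight.

    intervals: (start, end) column ranges, end exclusive, one per feature type.
    weights:   one real-valued weight per interval (the individual).
    """
    weighted = []
    for row in feature_rows:
        new_row = list(row)  # copy so the input matrix stays untouched
        for (start, end), weight in zip(intervals, weights):
            for col in range(start, end):
                new_row[col] = row[col] * weight
        weighted.append(new_row)
    return weighted


rows = [[1.0, 2.0, 3.0, 4.0]]
intervals = [(0, 2), (2, 4)]  # two feature types, two columns each
weights = [0.5, 2.0]
tmp_space = apply_weights_sketch(rows, intervals, weights)
```

A weight near zero effectively switches a feature type off, which is what lets the evolved weight vector be read as a feature-type importance.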
-
compute_time_diff
()¶ A method for approximate time monitoring.
-
cross_val_scores
(tmp_feature_space, final_run=False, n_cpu=None)¶ Compute the learnability of the representation.
- Parameters
tmp_feature_space – An individual’s solution space.
final_run – Last run is more extensive.
n_cpu – Number of CPUs to use.
- Return f1_perf, clf
F1 performance and the learned classifier.
-
custom_initialization
()¶ Custom initialization employs random uniform prior. See the paper for more details.
-
evaluate_fitness
(individual, max_num_feat=1000, return_clf_and_vec=False)¶ A helper method for evaluating an individual solution. Given a real-valued vector, this constructs the representations and evaluates a given learner.
- Parameters
individual – an individual (solution)
max_num_feat – maximum number of features that are outputted
return_clf_and_vec – return classifier and vectorizer? This is useful for deployment.
- Return score
The fitness score.
-
evolve
(nind=10, crossover_proba=0.4, mutpb=0.15, stopping_interval=20, strategy='evolution', validation_type='cv')¶ The core evolution method. First, the maximum number of features taken into account is constrained by lowering the bound with respect to performance; next, the population is evolved.
- Parameters
nind – number of individuals (int)
crossover_proba – crossover probability (float)
mutpb – mutation probability (float)
stopping_interval – stopping interval: how long no improvement is tolerated before a hard reset (int)
strategy – type of evolution (str)
validation_type – type of validation, either train_val (train-validation split) or cv (cross-validation)
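To show how `nind`, `crossover_proba` and `mutpb` typically interact, here is a generic genetic-algorithm loop over real-valued individuals. It is a textbook sketch, not the autoBOT optimizer: tournament selection, one-point crossover, Gaussian mutation, elitism, the `generations` and `seed` arguments and the toy fitness function are all assumptions of this example, and the `stopping_interval` hard reset is left out.

```python
import random


def evolve_sketch(fitness, dim, nind=10, crossover_proba=0.4, mutpb=0.15,
                  generations=30, seed=0):
    """Maximize `fitness` over vectors in [0, 1]^dim with a simple GA."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(nind)]
    best = max(population, key=fitness)
    initial_best_score = fitness(best)
    for _ in range(generations):
        # Selection: binary tournaments, copying the winners.
        selected = [
            list(max(rng.sample(population, 2), key=fitness))
            for _ in range(nind)
        ]
        # Crossover: swap tails of consecutive pairs with some probability.
        for i in range(0, nind - 1, 2):
            if dim > 1 and rng.random() < crossover_proba:
                cut = rng.randrange(1, dim)
                a, b = selected[i], selected[i + 1]
                a[cut:], b[cut:] = b[cut:], a[cut:]
        # Mutation: perturb one coordinate, clipped to [0, 1].
        for individual in selected:
            if rng.random() < mutpb:
                idx = rng.randrange(dim)
                moved = individual[idx] + rng.gauss(0, 0.1)
                individual[idx] = min(1.0, max(0.0, moved))
        population = selected
        # Elitism: keep the best solution seen so far in the population.
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = list(candidate)
        else:
            population[0] = list(best)
    return best, initial_best_score


def toy_fitness(individual):
    """Toy objective: highest when every coordinate equals 0.5."""
    return -sum((x - 0.5) ** 2 for x in individual)


best, initial_best_score = evolve_sketch(toy_fitness, dim=4)
```

Because of the elitism step, the best fitness can never fall below that of the initial population, which makes the loop easy to sanity-check.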
-
feature_type_importances
(solution_index=0)¶ A method which prints feature type importances as a pandas df.
- Parameters
solution_index – Index of the individual (solution) to inspect.
- Return feature_ranking
Final table of rankings
-
generate_and_update_stats
(fits)¶ A helper method for generating stats.
- Parameters
fits – fitness values of the current population
-
generate_id_intervals
()¶ Generate independent intervals.
-
generate_random_initial_state
(weights_importances)¶ The initialization method, capable of generation of individuals.
-
get_feature_importance_report
(individual, fitnesses)¶ Report feature importances.
- Parameters
individual – an individual solution (a vector of floats)
fitnesses – fitness space (list of reals)
- Return report
A printout of current performance.
-
get_feature_space
()¶ Extract final feature space considered for learning purposes.
-
instantiate_validation_env
()¶ This method refreshes the feature space. This is needed to maximize efficiency.
-
mutReg
(individual, p=1)¶ Custom mutation operator used for regularization optimization.
- Parameters
individual – individual (vector of floats)
- Return individual
An individual solution.
-
parallelize_dataframe
(df, func)¶ A method for parallel traversal of a given dataframe.
- Parameters
df – dataframe of text (Pandas object)
func – function to be executed (a function)
-
predict
(instances)¶ Predict on new instances. Note that the prediction is a max vote (the most-voted label) across the hall of fame.
- Parameters
instances – New instances (texts) for which labels are predicted.
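A vote across several learners can be sketched as below: each learner predicts a label per instance, and the most frequent label wins. The function name `max_vote` and the list-of-lists input are assumptions for illustration; how the class breaks ties or stores per-learner predictions is not described in this reference.

```python
from collections import Counter


def max_vote(predictions_per_learner):
    """Return, for each instance, the label predicted by most learners.

    predictions_per_learner: one list of labels per learner, all equally long.
    """
    voted = []
    # zip(*...) regroups the predictions so that each tuple holds one instance.
    for instance_votes in zip(*predictions_per_learner):
        # On a tie, Counter keeps the label that was seen first.
        voted.append(Counter(instance_votes).most_common(1)[0][0])
    return voted


hall_of_fame_predictions = [
    ["a", "b", "a"],  # learner 1
    ["a", "a", "b"],  # learner 2
    ["b", "b", "a"],  # learner 3
]
final = max_vote(hall_of_fame_predictions)
```

Using an odd number of learners (such as the default `hof_size=3`) avoids ties in binary problems.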
-
report_performance
(fits, gen=0)¶ A helper method for performance reports.
- Parameters
fits – fitness values (vector of floats)
gen – generation to be reported (int)
-
return_dataframe_from_text
(text)¶ A helper method that returns a given dataframe from text.
- Parameters
text – list of texts.
- Return parsed df
A parsed text (a DataFrame)
-
softmax
(x)¶ Compute softmax values for each set of scores in x.
- Parameters
x – vector of floats
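For reference, softmax maps a vector of scores to positive values that sum to one. The sketch below uses the common max-subtraction trick for numerical stability; whether the method in this class does the same is an assumption, and the name `softmax_sketch` is hypothetical.

```python
import math


def softmax_sketch(x):
    """Exponentiate the scores and normalize them so they sum to one."""
    shift = max(x)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in x]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_sketch([1.0, 2.0, 3.0])
```

Subtracting the maximum does not change the result, because the common factor cancels between numerator and denominator.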
-
summarise_final_learners
()¶
-
update_global_feature_importances
()¶ Aggregate feature importances across top learners to obtain the final ranking.
-
update_intermediary_feature_space
(custom_space=None)¶ Create the subset of the original feature space based on the starting_feature_numbers vector that gets evolved.
-
visualize_fitness
(image_path='fitnessExample.png')¶ A method for visualizing fitness.
- Parameters
image_path – Path to file, ending denotes file type. If set to None, only DataFrame of statistics is returned.
- Return dfx
DataFrame of evolution evaluations
-
autobot.strategy_random_search module¶
autobot.word_relations module¶
-
class
autobot.word_relations.
relationExtractor
(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=1, min_token='bigrams')¶ Bases:
object
The main token relation extraction class. Works for arbitrary tokens.
-
compute_distance
(pair, token_dict)¶ A core distance for computing index-based differences.
- Parameters
pair – the pair of tokens
token_dict – distance map
- Return pair[0], pair[1], dist
The two tokens and the distance
-
fit
(text_vector, b=None)¶ Fit the model to a text vector.
- Parameters
text_vector – The input list of texts.
-
fit_transform
(text_vector, a2)¶ A classic fit-transform method.
- Parameters
text_vector – Input list of texts.
-
get_feature_names
()¶ Return exact feature names.
-
transform
(text_vector)¶ Transform the data into suitable form.
- Parameters
text_vector – The input list of texts.
-
witem_kernel
(instance)¶ A simple kernel for traversing a given document.
- Parameters
instance – a piece of text
- Return global_distances
Distances between tokens
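The descriptions of `compute_distance` and `witem_kernel` mention index-based distances between tokens in a document. One plausible reading is sketched below: each token is mapped to the position of its first occurrence, and the distance of a pair is the absolute difference of those positions. This reading, the whitespace tokenization and the function name `token_index_distances` are all assumptions; the class may define the distance differently.

```python
from itertools import combinations


def token_index_distances(text):
    """Map each unordered token pair to the gap between first occurrences."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    first_index = {}
    for position, token in enumerate(tokens):
        # setdefault keeps the position of the first occurrence only.
        first_index.setdefault(token, position)
    distances = {}
    # Sorting the tokens gives every pair a canonical (alphabetical) order.
    for left, right in combinations(sorted(first_index), 2):
        distances[(left, right)] = abs(first_index[left] - first_index[right])
    return distances


distances = token_index_distances("red fox jumps over red fence")
```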
-