skll Package¶
The most useful parts of our API are available at the package level in addition to the module level. They are documented in both places for convenience.
From data Package¶
- class skll.FeatureSet(name, ids, labels=None, features=None, vectorizer=None)[source]¶
  Bases: object
  Encapsulation of all of the features, values, and metadata about a given set of data. This class replaces ExamplesTuple from older versions.
  Warning
  Two FeatureSets can only be equal if the order of their instances is identical, because the instances are stored as lists/arrays.
  Parameters:
  - name (str) – The name of this feature set.
  - ids (np.array) – Example IDs for this set.
  - labels (np.array) – Labels for this set.
  - features (list of dict or array-like) – The features for each instance, represented either as a list of dictionaries or as an array-like (if vectorizer is also specified).
  - vectorizer (DictVectorizer or FeatureHasher) – Vectorizer that created the feature matrix.
  Note
  If ids, labels, and/or features are not None, the number of rows in each array must be equal.
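  As an illustration, here is a minimal sketch of building a small labeled FeatureSet from dictionary features; the names and values are made up for this example, and the dictionaries are assumed to be vectorized internally since no vectorizer is passed:

```python
import numpy as np
from skll import FeatureSet

# A tiny, made-up data set: three instances with dictionary features.
fs = FeatureSet(name='toy',
                ids=np.array(['EX1', 'EX2', 'EX3']),
                labels=np.array(['cat', 'dog', 'cat']),
                features=[{'whiskers': 1.0, 'barks': 0.0},
                          {'whiskers': 0.0, 'barks': 1.0},
                          {'whiskers': 1.0, 'barks': 0.0}])
```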
- filter(ids=None, labels=None, features=None, inverse=False)[source]¶
  Removes or keeps features and/or examples from the FeatureSet depending on the passed-in parameters (see the sketch after filtered_iter below).
  Parameters:
  - ids (list of str/float) – Examples to keep in the FeatureSet. If None, no ID filtering takes place.
  - labels (list of str/float) – Labels that we want to retain examples for. If None, no label filtering takes place.
  - features (list of str) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any feature in the FeatureSet that contains a = will be split on the first occurrence, and the prefix will be checked for membership in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
  - inverse (bool) – If True, remove the listed features and/or examples instead of keeping them.
- filtered_iter(ids=None, labels=None, features=None, inverse=False)[source]¶
  A version of __iter__ that retains only the specified features and/or examples in the output.
  Parameters:
  - ids (list of str/float) – Examples in the FeatureSet to keep. If None, no ID filtering takes place.
  - labels (list of str/float) – Labels that we want to retain examples for. If None, no label filtering takes place.
  - features (list of str) – Features in the FeatureSet to keep. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any feature in the FeatureSet that contains a = will be split on the first occurrence, and the prefix will be checked for membership in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
  - inverse (bool) – If True, remove the listed features and/or examples instead of keeping them.
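  As a sketch of the difference between the two methods (reusing the toy fs from the FeatureSet example above, and assuming filtered_iter yields the same (ID, label, feature dictionary) triples as __iter__):

```python
# filter() prunes fs in place: keep only the examples labeled 'cat'.
fs.filter(labels=['cat'])

# filtered_iter() leaves fs untouched and lazily yields matching instances;
# the triple unpacking below assumes the same output shape as __iter__.
for example_id, label, feat_dict in fs.filtered_iter(features=['whiskers']):
    print(example_id, label, feat_dict)
```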
- static from_data_frame(df, name, labels_column=None, vectorizer=None)[source]¶
  Helper function to create a FeatureSet object from a pandas.DataFrame. Raises an Exception if pandas is not installed in your environment. The FeatureSet IDs will be taken from the index of df.
  Parameters:
  - df (pandas.DataFrame) – The pandas.DataFrame object you’d like to use as a feature set.
  - name (str) – The name of this feature set.
  - labels_column (str or None) – The name of the column containing the labels (data to predict).
  - vectorizer (DictVectorizer or FeatureHasher) – Vectorizer that created the feature matrix.
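  A minimal sketch using a made-up frame (the column names and IDs are hypothetical); the index supplies the FeatureSet IDs:

```python
import pandas as pd
from skll import FeatureSet

df = pd.DataFrame({'f1': [1.0, 0.0], 'f2': [0.5, 1.5], 'y': ['a', 'b']},
                  index=['EX1', 'EX2'])

# 'y' holds the labels; the remaining columns become features.
fs = FeatureSet.from_data_frame(df, 'my_data', labels_column='y')
```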
- has_labels¶
  Returns: Whether or not this FeatureSet has any finite labels.
- class skll.Reader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
  Bases: object
  A small helper class for making picklable iterators out of example-dictionary generators.
  Parameters:
  - path_or_list (str or list of dict) – Path to a file or a list of example dictionaries.
  - quiet (bool) – Do not print the “Loading...” status message to stderr.
  - ids_to_floats (bool) – Convert IDs to floats to save memory. Raises an error if a non-numeric ID is encountered.
  - id_col (str) – Name of the column containing the instance IDs for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the IDs will be generated automatically.
  - label_col (str) – Name of the column containing the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
  - class_map (dict from str to str) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping is kept the same.
  - sparse (bool) – Whether to store the features in a scipy CSR matrix when using a DictVectorizer to vectorize the features.
  - feature_hasher (bool) – Whether a FeatureHasher should be used to vectorize the features.
  - num_features (int) – If using a FeatureHasher, the number of features the resulting matrix should have. You should set this to a power of 2 greater than the actual number of features to avoid collisions.
- classmethod for_path(path_or_list, **kwargs)[source]¶
  Parameters:
  - path_or_list (str or list of dict) – The path to the file to load the examples from, or a list of example dictionaries.
  - quiet (bool) – Do not print the “Loading...” status message to stderr.
  - sparse (bool) – Whether to store the features in a scipy CSR matrix.
  - id_col (str) – Name of the column containing the instance IDs for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the IDs will be generated automatically.
  - label_col (str) – Name of the column containing the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
  - ids_to_floats (bool) – Convert IDs to floats to save memory. Raises an error if a non-numeric ID is encountered.
  - class_map (dict from str to str) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple classes into a single class. Anything not in the mapping is kept the same.
  Returns: New instance of the Reader subclass that is appropriate for the given path, or DictListReader if given a list of dictionaries.
- read()[source]¶
  Loads examples in the .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv formats.
  Returns: FeatureSet representing the file we read in.
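  Putting for_path and read together, a minimal sketch (the file name and column names here are hypothetical):

```python
from skll import Reader

# for_path() selects the Reader subclass from the '.csv' suffix.
reader = Reader.for_path('train.csv', label_col='y', id_col='id')
train_fs = reader.read()  # returns a FeatureSet
```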
- class skll.Writer(path, feature_set, **kwargs)[source]¶
  Bases: object
  Helper class for writing out FeatureSets to files.
  Parameters:
  - path (str) – A path to the feature file we would like to create. The suffix of this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, then when the write() method is called, path is assumed to be the path to the directory to write the feature files to, plus an additional file extension specifying the file type, for example /foo/.csv.
  - feature_set (FeatureSet) – The FeatureSet to dump to a file.
  - quiet (bool) – Do not print the “Writing...” status message to stderr.
  - requires_binary (bool) – Whether the Writer must open the file in binary mode for writing with Python 2.
  - subsets (dict (str to list of str)) – A mapping from subset names to lists of feature names included in those subsets. If given, a feature file will be written for every subset (with the subset name appended to path as a suffix). Note that since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, only the portion before the = is used for matching, so you do not need to enumerate all of these boolean feature names in your mapping.
- classmethod for_path(path, feature_set, **kwargs)[source]¶
  Parameters:
  - path (str) – A path to the feature file we would like to create. The suffix of this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, then when the write() method is called, path is assumed to be the path to the directory to write the feature files to, plus an additional file extension specifying the file type, for example /foo/.csv.
  - feature_set (FeatureSet) – The FeatureSet to dump to a file.
  - kwargs (dict) – The keyword arguments for for_path are the same as those for the initializer of the desired Writer subclass.
  Returns: New instance of the Writer subclass that is appropriate for the given path.
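  A minimal sketch of the usual round trip (the output path is hypothetical, train_fs is assumed to be an existing FeatureSet such as the one read above, and write() is assumed here to take no arguments):

```python
from skll import Writer

# The '.csv' suffix selects the CSV writer subclass.
writer = Writer.for_path('train_copy.csv', train_fs)
writer.write()
```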
From experiments Module¶
- skll.run_configuration(config_file, local=False, overwrite=True, queue='all.q', hosts=None, write_summary=True, quiet=False, ablation=0, resume=False)[source]¶
  Takes a configuration file and runs the specified jobs on the grid.
  Parameters:
  - config_file (str) – Path to the configuration file we would like to use.
  - local (bool) – Should this be run locally instead of on the cluster?
  - overwrite (bool) – If the model files already exist, should we overwrite them instead of reusing them?
  - queue (str) – The DRMAA queue to use if we’re running on the cluster.
  - hosts (list of str) – If running on the cluster, these are the machines we should use.
  - write_summary (bool) – Write a TSV file with a summary of the results.
  - quiet (bool) – Suppress printing of “Loading...” messages.
  - ablation (int or None) – Number of features to remove when doing an ablation experiment. If positive, we will perform repeated ablation runs for all combinations of features, removing the specified number at a time. If None, we will use all combinations of all lengths. If 0, the default, no ablation is performed. If negative, a ValueError is raised.
  - resume (bool) – If result files already exist for an experiment, do not overwrite them. This is very useful when doing a large ablation experiment and part of it crashes.
  Returns: A list of paths to .json results files for each variation in the experiment.
  Return type: list of str
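  For example, a sketch of running an experiment locally (the configuration file name is hypothetical):

```python
from skll import run_configuration

# local=True bypasses the DRMAA grid and runs the jobs in-process.
result_json_paths = run_configuration('evaluate.cfg', local=True)
```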
From learner Module¶
- class skll.Learner(model_type, probability=False, feature_scaling='none', model_kwargs=None, pos_label_str=None, min_feature_count=1, sampler=None, sampler_kwargs=None, custom_learner_path=None)[source]¶
  Bases: object
  A simpler learner interface around many scikit-learn classification and regression estimators.
  Parameters:
  - model_type (str) – Type of estimator to create (e.g., LogisticRegression). See the skll package documentation for valid options.
  - probability (bool) – Should the learner return probabilities for all labels (instead of just the label with the highest probability)?
  - feature_scaling (str) – How to scale the features, if at all. Options are: 'with_std' (scale features using the standard deviation), 'with_mean' (center features using the mean), 'both' (both scale and center), and 'none' (neither scale nor center).
  - model_kwargs (dict) – A dictionary of keyword arguments to pass to the initializer for the specified model.
  - pos_label_str (str) – The string for the positive label in the binary classification setting. If unspecified, an arbitrary label is picked.
  - min_feature_count (int) – The minimum number of examples in which a feature must have a nonzero value to be included.
  - sampler (str) – The sampler to use for kernel approximation, if desired. Valid values are: 'AdditiveChi2Sampler', 'Nystroem', 'RBFSampler', and 'SkewedChi2Sampler'.
  - sampler_kwargs (dict) – A dictionary of keyword arguments to pass to the initializer for the specified sampler.
  - custom_learner_path (str) – Path to the module where a custom classifier is defined.
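  A minimal sketch of constructing a learner, using only the parameters documented above:

```python
from skll import Learner

# A probabilistic logistic-regression learner that scales features
# by their standard deviation before training.
learner = Learner('LogisticRegression',
                  probability=True,
                  feature_scaling='with_std')
```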
- cross_validate(examples, stratified=True, cv_folds=10, grid_search=False, grid_search_folds=3, grid_jobs=None, grid_objective='f1_score_micro', prediction_prefix=None, param_grid=None, shuffle=False, save_cv_folds=False)[source]¶
  Cross-validates a given model on the training examples.
  Parameters:
  - examples (FeatureSet) – The data to cross-validate learner performance on.
  - stratified (bool) – Should we stratify the folds to ensure an even distribution of labels for each fold?
  - cv_folds (int or dict) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.
  - grid_search (bool) – Should we do grid search when training each fold? Note: This will take much longer.
  - grid_search_folds (int) – The number of folds to use when doing the grid search (ignored if cv_folds is set to a dictionary mapping examples to folds).
  - grid_jobs (int) – The number of jobs to run in parallel when doing the grid search. If unspecified or 0, the number of grid search folds will be used.
  - grid_objective (function) – The objective function to use when doing the grid search.
  - param_grid (list of dicts mapping from strs to lists of parameter values) – The parameter grid to search through for grid search. If unspecified, a default parameter grid will be used.
  - prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by “.predictions”.
  - shuffle (bool) – Shuffle examples before splitting into folds for CV.
  - save_cv_folds (bool) – Whether or not to save the CV fold IDs.
  Returns: The confusion matrix, overall accuracy, per-label PRFs, and model parameters for each fold in one list, and another list with the grid search scores for each fold. Also returns a dictionary containing the test-fold number for each ID if save_cv_folds is True, otherwise None.
  Return type: (list of 4-tuples, list of float, dict)
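  A sketch of the call, unpacking the documented 3-tuple return value (train_fs is assumed to be a labeled FeatureSet, e.g. one produced by a Reader):

```python
# 5-fold CV with per-fold grid search.
fold_results, grid_scores, fold_ids = learner.cross_validate(
    train_fs,
    cv_folds=5,
    grid_search=True,
    grid_objective='f1_score_micro',
    save_cv_folds=True)  # fold_ids maps example IDs to test-fold numbers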
- evaluate(examples, prediction_prefix=None, append=False, grid_objective=None)[source]¶
  Evaluates a given model on a given dev or test example set.
  Parameters:
  - examples (FeatureSet) – The examples to evaluate the performance of the model on.
  - prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by “.predictions”.
  - append (bool) – Should we append the current predictions to the file if it exists?
  - grid_objective (function) – The objective function that was used when doing the grid search.
  Returns: The confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, and the grid search objective function score.
  Return type: 5-tuple
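  A sketch of unpacking the documented 5-tuple (test_fs is assumed to be a held-out labeled FeatureSet):

```python
(conf_matrix, accuracy, prf,
 model_params, objective_score) = learner.evaluate(test_fs)
```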
- classmethod from_file(learner_path)[source]¶
  Returns: New instance of Learner from the pickle at the specified path.
- load(learner_path)[source]¶
  Replace the current learner instance with a saved learner.
  Parameters: learner_path (str) – The path to the file to load.
- model¶
  The underlying scikit-learn model.
- model_kwargs¶
  A dictionary of the underlying scikit-learn model’s keyword arguments.
- model_params¶
  Model parameters (i.e., weights) for LinearModel (e.g., Ridge) regression and liblinear models.
  Returns: Labeled weights and (labeled, if more than one) intercept value(s).
  Return type: tuple of (weights, intercepts), where weights is a dict and intercepts is a dict
- model_type¶
  The model type (i.e., the class).
- predict(examples, prediction_prefix=None, append=False, class_labels=False)[source]¶
  Uses a given model to generate predictions on a given data set.
  Parameters:
  - examples (FeatureSet) – The examples to predict the labels for.
  - prediction_prefix (str) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by “.predictions”.
  - append (bool) – Should we append the current predictions to the file if it exists?
  - class_labels (bool) – For a classifier, should we convert class indices to their (str) labels?
  Returns: The predictions returned by the learner.
  Return type: array
- probability¶
  Should the learner return probabilities for all labels (instead of just the label with the highest probability)?
- save(learner_path)[source]¶
  Save the learner to a file.
  Parameters: learner_path (str) – The path to where you want to save the learner.
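  A sketch of the save/restore round trip using save and the from_file classmethod documented above (the file name is hypothetical; any writable path works):

```python
learner.save('toy.model')
restored = Learner.from_file('toy.model')
```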
- train(examples, param_grid=None, grid_search_folds=3, grid_search=True, grid_objective='f1_score_micro', grid_jobs=None, shuffle=False, create_label_dict=True)[source]¶
  Train a classification model on the given examples, setting up the model, feature vectorizer, scaler, label dictionary, and inverse label dictionary on this learner.
  Parameters:
  - examples (FeatureSet) – The examples to train the model on.
  - param_grid (list of dicts mapping from strs to lists of parameter values) – The parameter grid to search through for grid search. If unspecified, a default parameter grid will be used.
  - grid_search_folds (int or dict) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds.
  - grid_search (bool) – Should we do grid search?
  - grid_objective (function) – The objective function to use when doing the grid search.
  - grid_jobs (int) – The number of jobs to run in parallel when doing the grid search. If unspecified or 0, the number of grid search folds will be used.
  - shuffle (bool) – Shuffle examples (e.g., for grid search CV).
  - create_label_dict (bool) – Should we create the label dictionary? This dictionary is used to map between string labels and their corresponding numerical values. This should only be done once per experiment, so when cross_validate calls train, create_label_dict gets set to False.
  Returns: The best grid search objective function score, or 0 if we’re not doing grid search.
  Return type: float
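  A sketch of a typical training call (train_fs is assumed to be a labeled FeatureSet):

```python
# With grid search enabled, the return value is the best objective score.
best_score = learner.train(train_fs,
                           grid_search=True,
                           grid_objective='f1_score_micro')
```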
From metrics Module¶
- skll.f1_score_least_frequent(y_true, y_pred)[source]¶
  Calculate the F1 score of the least frequent label/class in y_true for y_pred.
  Parameters:
  - y_true (array-like of float) – The true/actual/gold labels for the data.
  - y_pred (array-like of float) – The predicted/observed labels for the data.
  Returns: F1 score of the least frequent label
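  For instance, with made-up labels:

```python
from skll import f1_score_least_frequent

# 1 is the least frequent label in y_true, so F1 is computed for label 1.
score = f1_score_least_frequent([0, 0, 0, 1], [0, 0, 1, 1])
```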
- skll.kappa(y_true, y_pred, weights=None, allow_off_by_one=False)[source]¶
  Calculates the kappa inter-rater agreement between the gold standard and the predicted ratings. Potential values range from -1 (representing complete disagreement) to 1 (representing complete agreement). A kappa value of 0 is expected if all agreement is due to chance.
  In the course of calculating kappa, all items in y_true and y_pred will first be converted to floats and then rounded to integers.
  It is assumed that y_true and y_pred contain the complete range of possible ratings.
  This function contains a combination of code from yorchopolis’s kappa-stats and Ben Hamner’s Metrics projects on GitHub.
  Parameters:
  - y_true (array-like of float) – The true/actual/gold labels for the data.
  - y_pred (array-like of float) – The predicted/observed labels for the data.
  - weights (str or numpy array) – Specifies the weight matrix for the calculation. Options are:
    - None = unweighted kappa
    - ‘quadratic’ = quadratic-weighted kappa
    - ‘linear’ = linear-weighted kappa
    - two-dimensional numpy array = a custom matrix of weights. Each weight corresponds to the \(w_{ij}\) values in the Wikipedia description of how to calculate weighted Cohen’s kappa.
  - allow_off_by_one (bool) – If True, ratings that are off by one are counted as equal, and all other differences are reduced by one. For example, 1 and 2 will be considered equal, whereas 1 and 3 will have a difference of 1 when building the weights matrix.
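  A sketch with made-up integer ratings, showing the unweighted and quadratic-weighted variants:

```python
from skll import kappa

gold = [1, 2, 3, 4, 4]
pred = [1, 2, 2, 4, 3]

plain = kappa(gold, pred)                     # unweighted kappa
qwk = kappa(gold, pred, weights='quadratic')  # quadratic-weighted kappa
```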
- skll.kendall_tau(y_true, y_pred)[source]¶
  Calculate Kendall’s tau between y_true and y_pred.
  Parameters:
  - y_true (array-like of float) – The true/actual/gold labels for the data.
  - y_pred (array-like of float) – The predicted/observed labels for the data.
  Returns: Kendall’s tau if well-defined, else 0
- skll.spearman(y_true, y_pred)[source]¶
  Calculate Spearman’s rank correlation coefficient between y_true and y_pred.
  Parameters:
  - y_true (array-like of float) – The true/actual/gold labels for the data.
  - y_pred (array-like of float) – The predicted/observed labels for the data.
  Returns: Spearman’s rank correlation coefficient if well-defined, else 0
- skll.pearson(y_true, y_pred)[source]¶
  Calculate the Pearson product-moment correlation coefficient between y_true and y_pred.
  Parameters:
  - y_true (array-like of float) – The true/actual/gold labels for the data.
  - y_pred (array-like of float) – The predicted/observed labels for the data.
  Returns: Pearson product-moment correlation coefficient if well-defined, else 0