data Package¶
data.featureset Module¶
Classes related to storing/merging feature sets.
author: Dan Blanchard (dblanchard@ets.org)
organization: ETS
class skll.data.featureset.FeatureSet(name, ids, labels=None, features=None, vectorizer=None)[source]¶
Bases: object
Encapsulation of all of the features, values, and metadata about a given set of data.
Warning
FeatureSets can only be equal if the order of the instances is identical, because these are stored as lists/arrays.
This class replaces ExamplesTuple from older versions.
Parameters:
- name (str) – The name of this feature set.
- ids (np.array) – Example IDs for this set.
- labels (np.array) – Labels for this set.
- features (list of dict or array-like) – The features for each instance, represented either as a list of dictionaries or as an array-like (if vectorizer is also specified).
- vectorizer (DictVectorizer or FeatureHasher) – Vectorizer that created the feature matrix.
Note
If ids, labels, and/or features are not None, the number of rows in each array must be equal.
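For illustration, a minimal sketch of building a FeatureSet from a list of feature dictionaries; the IDs, labels, and feature names below are made up. The later sketches in this section reuse this hypothetical fs object.

```python
import numpy as np

from skll.data.featureset import FeatureSet

# Hypothetical toy data: three labelled instances with dictionary features.
ids = np.array(['ex1', 'ex2', 'ex3'])
labels = np.array(['cat', 'dog', 'cat'])
features = [{'height': 1.0, 'weight': 0.5},
            {'height': 0.0, 'weight': 1.5},
            {'height': 2.0, 'weight': 0.0}]

# With no vectorizer specified, the features are passed as dictionaries
# and vectorized internally.
fs = FeatureSet('toy_set', ids, labels=labels, features=features)
```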
filter(ids=None, labels=None, features=None, inverse=False)[source]¶
Removes or keeps features and/or examples from the FeatureSet depending on the passed-in parameters (see the sketch after the parameter list).
Parameters:
- ids (list of str/float) – Examples to keep in the FeatureSet. If None, no ID filtering takes place.
- labels (list of str/float) – Labels for which we want to retain examples. If None, no label filtering takes place.
- features (list of str) – Features to keep in the FeatureSet. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any feature in the FeatureSet that contains a = will be split on the first occurrence, and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
- inverse (bool) – Instead of keeping the listed features and/or examples, remove them.
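A small sketch of in-place filtering, reusing the hypothetical fs from the FeatureSet example above:

```python
# Keep only the examples labelled 'cat'.
fs.filter(labels=['cat'])

# Remove (rather than keep) the made-up 'weight' feature.
fs.filter(features=['weight'], inverse=True)
```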
filtered_iter(ids=None, labels=None, features=None, inverse=False)[source]¶
A version of __iter__ that retains only the specified features and/or examples in its output.
Parameters:
- ids (list of str/float) – Examples in the FeatureSet to keep. If None, no ID filtering takes place.
- labels (list of str/float) – Labels for which we want to retain examples. If None, no label filtering takes place.
- features (list of str) – Features in the FeatureSet to keep. To help with filtering string-valued features that were converted to sequences of boolean features when read in, any feature in the FeatureSet that contains a = will be split on the first occurrence, and the prefix will be checked to see if it is in features. If None, no feature filtering takes place. Cannot be used if the FeatureSet uses a FeatureHasher for vectorization.
- inverse (bool) – Instead of keeping the listed features and/or examples, remove them.
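In contrast to filter(), filtered_iter() leaves the FeatureSet unchanged and only restricts what is yielded. A sketch, assuming each yielded item is an (ID, label, feature dictionary) tuple as with __iter__:

```python
# Iterate over only the examples labelled 'cat'; fs itself is not modified.
for example_id, label, feat_dict in fs.filtered_iter(labels=['cat']):
    print(example_id, label, feat_dict)
```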
static from_data_frame(df, name, labels_column=None, vectorizer=None)[source]¶
Helper function to create a FeatureSet object from a pandas.DataFrame. Raises an exception if pandas is not installed in your environment. The FeatureSet IDs will be the index of df.
Parameters:
- df (pandas.DataFrame) – The pandas.DataFrame object you’d like to use as a feature set.
- name (str) – The name of this feature set.
- labels_column (str or None) – The name of the column containing the labels (data to predict).
- vectorizer (DictVectorizer or FeatureHasher) – Vectorizer that created the feature matrix.
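A hedged sketch, assuming pandas is installed; the column names and index values below are made up:

```python
import pandas as pd

from skll.data.featureset import FeatureSet

# The DataFrame index supplies the example IDs.
df = pd.DataFrame({'height': [1.0, 0.0],
                   'weight': [0.5, 1.5],
                   'y': ['cat', 'dog']},
                  index=['ex1', 'ex2'])
fs_from_df = FeatureSet.from_data_frame(df, 'toy_set', labels_column='y')
```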
has_labels¶
Returns: Whether or not this FeatureSet has any finite labels.
data.readers Module¶
Handles loading data from various types of data files.
author: Dan Blanchard (dblanchard@ets.org)
author: Michael Heilman (mheilman@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
organization: ETS
class skll.data.readers.ARFFReader(path_or_list, **kwargs)[source]¶
Bases: skll.data.readers.DelimitedReader
Reader for creating a FeatureSet out of an ARFF file.
If you would like to include example/instance IDs in your files, they must be specified as an “id” column.
Also, there must be a column with the name specified by label_col if the data is labelled, and this column must be the final one (as it is in Weka).
class skll.data.readers.CSVReader(path_or_list, **kwargs)[source]¶
Bases: skll.data.readers.DelimitedReader
Reader for creating a FeatureSet out of a CSV file.
If you would like to include example/instance IDs in your files, they must be specified as an “id” column.
Also, there must be a column with the name specified by label_col if the data is labelled.
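A minimal usage sketch; the file name train.csv and the label column name 'y' are assumptions:

```python
from skll.data.readers import CSVReader

# train.csv is expected to have an 'id' column, a 'y' label column,
# and one column per feature.
reader = CSVReader('train.csv', label_col='y')
fs = reader.read()
```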
class skll.data.readers.DelimitedReader(path_or_list, **kwargs)[source]¶
Bases: skll.data.readers.Reader
Reader for creating a FeatureSet out of a delimited (CSV/TSV) file.
If you would like to include example/instance IDs in your files, they must be specified as an “id” column.
Also, for ARFF, CSV, and TSV files, there must be a column with the name specified by label_col if the data is labelled. For ARFF files, this column must also be the final one (as it is in Weka).
Parameters:
- dialect (str) – The dialect to pass on to the underlying CSV reader. Default: excel-tab
class skll.data.readers.DictListReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
Bases: skll.data.readers.Reader
This class facilitates programmatic use of predict() and other functions that take FeatureSet objects as input. It iterates over examples in the same way as other Reader classes, but uses a list of example dictionaries instead of a path to a file.
Parameters:
- path_or_list (Iterable of dict) – List of example dictionaries.
- quiet (bool) – Do not print “Loading...” status message to stderr.
- ids_to_floats (bool) – Convert IDs to float to save memory. Will raise an error if a non-numeric ID is encountered.
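A sketch of the programmatic use described above; the 'id', 'y', and 'x' keys are assumed here to hold each example's ID, label, and feature dictionary respectively:

```python
from skll.data.readers import DictListReader

# Hypothetical in-memory examples instead of a file on disk.
examples = [{'id': 'ex1', 'y': 'cat', 'x': {'height': 1.0}},
            {'id': 'ex2', 'y': 'dog', 'x': {'height': 0.0}}]

# Convert the list of dictionaries into a FeatureSet.
fs = DictListReader(examples).read()
```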
class skll.data.readers.LibSVMReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
Bases: skll.data.readers.Reader
Reader to create a FeatureSet out of a LibSVM/LibLinear/SVMLight file.
We use a specially formatted comment for storing example IDs, class names, and feature names, which are normally not supported by the format. The comment is not mandatory, but without it, your labels and features will not have names. The comment is structured as follows:
ExampleID | 1=FirstClass | 1=FirstFeature 2=SecondFeature
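A sketch of reading such a file; the file name is an assumption, and the sample line shown in the comment mirrors the comment structure given above:

```python
from skll.data.readers import LibSVMReader

# A hypothetical train.libsvm file might contain lines such as:
#   1 1:2.0 2:0.5 # ex1 | 1=cat | 1=height 2=weight
# where everything after '#' is the optional naming comment described above.
fs = LibSVMReader('train.libsvm').read()
```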
class skll.data.readers.MegaMReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
Bases: skll.data.readers.Reader
Reader to create a FeatureSet out of a MegaM -fvals file.
If you would like to include example/instance IDs in your files, they must be specified as a comment line directly preceding the line with feature values.
class skll.data.readers.NDJReader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
Bases: skll.data.readers.Reader
Reader to create a FeatureSet out of a .jsonlines/.ndj file.
If you would like to include example/instance IDs in your files, they must be specified as an “id” key in each JSON dictionary.
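A sketch of reading a .jsonlines file; the file name and the 'y'/'x' keys shown in the comment are assumptions for the label and feature dictionary:

```python
from skll.data.readers import NDJReader

# Each line of the hypothetical train.jsonlines file is a JSON object like:
#   {"id": "ex1", "y": "cat", "x": {"height": 1.0, "weight": 0.5}}
fs = NDJReader('train.jsonlines').read()
```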
class skll.data.readers.Reader(path_or_list, quiet=True, ids_to_floats=False, label_col='y', id_col='id', class_map=None, sparse=True, feature_hasher=False, num_features=None)[source]¶
Bases: object
A little helper class to make picklable iterators out of example dictionary generators.
Parameters:
- path_or_list (str or list of dict) – Path or a list of example dictionaries.
- quiet (bool) – Do not print “Loading...” status message to stderr.
- ids_to_floats (bool) – Convert IDs to float to save memory. Will raise an error if a non-numeric ID is encountered.
- id_col (str) – Name of the column which contains the instance IDs for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the IDs will be generated automatically.
- label_col (str) – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
- class_map (dict from str to str) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple labels into a single class. Anything not in the mapping will be kept the same.
- sparse (bool) – Whether or not to store the features in a scipy CSR matrix when using a DictVectorizer to vectorize the features.
- feature_hasher (bool) – Whether or not a FeatureHasher should be used to vectorize the features.
- num_features (int) – If using a FeatureHasher, how many features should the resulting matrix have? You should set this to a power of 2 greater than the actual number of features to avoid collisions.
classmethod for_path(path_or_list, **kwargs)[source]¶
Parameters:
- path_or_list (str or list of dict) – The path to the file to load the examples from, or a list of example dictionaries.
- quiet (bool) – Do not print “Loading...” status message to stderr.
- sparse (bool) – Whether or not to store the features in a scipy CSR matrix.
- id_col (str) – Name of the column which contains the instance IDs for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the IDs will be generated automatically.
- label_col (str) – Name of the column which contains the class labels for ARFF/CSV/TSV files. If no column with that name exists, or None is specified, the data is considered to be unlabelled.
- ids_to_floats (bool) – Convert IDs to float to save memory. Will raise an error if a non-numeric ID is encountered.
- class_map (dict from str to str) – Mapping from original class labels to new ones. This is mainly used for collapsing multiple classes into a single class. Anything not in the mapping will be kept the same.
Returns: New instance of the Reader subclass that is appropriate for the given path, or DictListReader if given a list of dictionaries.
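A sketch of the factory method; the file name is an assumption, and the concrete Reader subclass is chosen from the file extension:

```python
from skll.data.readers import Reader

# for_path returns a TSVReader here because of the .tsv extension;
# read() then produces the FeatureSet.
reader = Reader.for_path('train.tsv', label_col='y')
fs = reader.read()
```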
read()[source]¶
Loads examples in the .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv formats.
Returns: FeatureSet representing the file we read in.
class skll.data.readers.TSVReader(path_or_list, **kwargs)[source]¶
Bases: skll.data.readers.DelimitedReader
Reader for creating a FeatureSet out of a TSV file.
If you would like to include example/instance IDs in your files, they must be specified as an “id” column.
Also, there must be a column with the name specified by label_col if the data is labelled.
skll.data.readers.safe_float(text, replace_dict=None)[source]¶
Attempts to convert a string to an int, and then to a float; if neither is possible, returns the original string value.
Parameters:
- text (str) – The text to convert.
- replace_dict (dict from str to str) – Mapping from text to replacement text values. This is mainly used for collapsing multiple labels into a single class. Replacement happens before conversion to floats. Anything not in the mapping will be kept the same.
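A short sketch of the conversion behaviour described above:

```python
from skll.data.readers import safe_float

safe_float('3')       # int conversion succeeds -> 3
safe_float('3.5')     # int fails, float succeeds -> 3.5
safe_float('cat')     # neither succeeds -> 'cat' returned unchanged
safe_float('dog', replace_dict={'dog': 'canine'})  # replaced first, then conversion attempted
```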
data.writers Module¶
Handles writing data to various types of data files.
author: Dan Blanchard (dblanchard@ets.org)
author: Michael Heilman (mheilman@ets.org)
author: Nitin Madnani (nmadnani@ets.org)
organization: ETS
class skll.data.writers.ARFFWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.DelimitedFileWriter
Writer for writing out FeatureSets as ARFF files.
Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.arff.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- requires_binary (bool) – Whether or not the Writer must open the file in binary mode for writing with Python 2.
- quiet (bool) – Do not print “Writing...” status message to stderr.
- relation (str) – The name of the relation in the ARFF file. Default: 'skll_relation'
- regression (bool) – Is this an ARFF file to be used for regression? Default: False
class skll.data.writers.CSVWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.DelimitedFileWriter
Writer for writing out FeatureSets as CSV files.
Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.csv.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- quiet (bool) – Do not print “Writing...” status message to stderr.
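A minimal writing sketch, reusing the hypothetical fs from the FeatureSet example; the output path is an assumption:

```python
from skll.data.writers import CSVWriter

# Dump the FeatureSet to a single CSV file.
writer = CSVWriter('/tmp/toy_set.csv', fs)
writer.write()
```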
class skll.data.writers.DelimitedFileWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.Writer
Writer for writing out FeatureSets as TSV/CSV files.
Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.csv.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- quiet (bool) – Do not print “Writing...” status message to stderr.
- id_col (str) – Name of the column in which to store the instance IDs for ARFF, CSV, and TSV files.
- label_col (str) – Name of the column which contains the class labels for CSV/TSV files.
- dialect (str) – The dialect to use for the underlying csv.DictWriter. Default: 'excel-tab'
class skll.data.writers.LibSVMWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.Writer
Writer for writing out FeatureSets as LibSVM/SVMLight files.
class skll.data.writers.MegaMWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.Writer
Writer for writing out FeatureSets as MegaM files.
class skll.data.writers.NDJWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.Writer
Writer for writing out FeatureSets as .jsonlines/.ndj files.
class skll.data.writers.TSVWriter(path, feature_set, **kwargs)[source]¶
Bases: skll.data.writers.DelimitedFileWriter
Writer for writing out FeatureSets as TSV files.
Parameters:
- path (str) – A path to the feature file we would like to create. If subsets is not None, this is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.tsv.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- quiet (bool) – Do not print “Writing...” status message to stderr.
class skll.data.writers.Writer(path, feature_set, **kwargs)[source]¶
Bases: object
Helper class for writing out FeatureSets to files.
Parameters:
- path (str) – A path to the feature file we would like to create. The suffix of this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.csv.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- quiet (bool) – Do not print “Writing...” status message to stderr.
- requires_binary (bool) – Whether or not the Writer must open the file in binary mode for writing with Python 2.
- subsets (dict (str to list of str)) – A mapping from subset names to lists of feature names that are included in those sets. If given, a feature file will be written for every subset (with the subset name appended to path). Note that since string-valued features are automatically converted into boolean features with names of the form FEATURE_NAME=STRING_VALUE, only the portion before the = is used for matching when filtering; you therefore do not need to enumerate all of these boolean feature names in your mapping. See the sketch below.
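For illustration, a hedged sketch of the subsets behaviour described in the last parameter; the subset names, feature names, and directory below are made up, and they reuse the hypothetical fs from earlier:

```python
from skll.data.writers import NDJWriter

# One .ndj file per subset should be written under /tmp/features/,
# e.g. something like /tmp/features/size.ndj and /tmp/features/mass.ndj.
subsets = {'size': ['height'],
           'mass': ['weight']}
writer = NDJWriter('/tmp/features/.ndj', fs, subsets=subsets)
writer.write()
```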
classmethod for_path(path, feature_set, **kwargs)[source]¶
Parameters:
- path (str) – A path to the feature file we would like to create. The suffix of this filename must be .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, or .tsv. If subsets is not None, when calling the write() method, path is assumed to be a string containing the path to the directory in which to write the feature files, with an additional file extension specifying the file type. For example /foo/.csv.
- feature_set (FeatureSet) – The FeatureSet to dump to a file.
- kwargs (dict) – The keyword arguments for for_path are the same as those of the initializer for the desired Writer subclass.
Returns: New instance of the Writer subclass that is appropriate for the given path.
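A sketch of the factory method, parallel to Reader.for_path; the output path is an assumption:

```python
from skll.data.writers import Writer

# The .jsonlines suffix selects the NDJWriter subclass here.
writer = Writer.for_path('/tmp/toy_set.jsonlines', fs)
writer.write()
```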