Corpus
- class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]
Internal class for storing a corpus.
- property documents
Returns a list of strings representing documents, created by joining the selected text features.
- documents_from_features(feats)[source]
- Parameters:
feats (list) – A list of features to join.
Returns: a list of strings constructed by joining feats.
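The joining step can be illustrated with a small, library-free sketch. This is not the Orange implementation: the `join_features` helper, the dict-based rows, the single-space separator, and the skipping of empty values are all assumptions made for the example.

```python
def join_features(rows, feats, sep=" "):
    """Join the values of the selected features into one string per row.

    Hypothetical helper: `rows` is a list of dicts standing in for corpus
    rows, and `sep` is an assumed separator (not taken from the library).
    """
    documents = []
    for row in rows:
        # Keep the requested features in the given order, skipping empty values.
        parts = [str(row[f]) for f in feats if row.get(f)]
        documents.append(sep.join(parts))
    return documents


rows = [
    {"title": "Orange", "body": "A data mining toolkit.", "author": "A"},
    {"title": "Corpus", "body": "", "author": "B"},
]
docs = join_features(rows, ["title", "body"])
```

Here `docs` holds one string per row, built only from the `title` and `body` fields; the `author` field is ignored because it was not selected.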
- extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]
Append features to the corpus. If the feature_values argument is given, the features are Discrete; otherwise they are Continuous.
- Parameters:
X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append.
feature_names (list) – List of strings containing feature names.
feature_values (list) – A list of possible values for Discrete features.
compute_values (list) – Compute values for corresponding features.
var_attrs (dict) – Additional attributes appended to variable.attributes.
sparse (bool) – Whether the features should be marked as sparse.
rename_existing (bool) – When True and names are not unique, rename existing features; when False, rename new features.
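The effect of rename_existing on clashing names can be sketched in plain Python. The `resolve_name_clashes` helper and the " (1)" suffix scheme are assumptions for illustration only; the library's actual renaming scheme may differ. The sketch also assumes names are unique within each list.

```python
def resolve_name_clashes(existing, new, rename_existing=False):
    """Return (existing_names, new_names) with clashes resolved.

    Hypothetical helper: the side chosen by `rename_existing` gets a
    numeric suffix, while the other side keeps its names unchanged.
    """
    keep, change = (new, existing) if rename_existing else (existing, new)
    taken = set(keep) | set(change)
    renamed = []
    for name in change:
        candidate, i = name, 1
        if name in keep:
            # Try "name (1)", "name (2)", ... until the name is unused.
            while candidate in taken:
                candidate = f"{name} ({i})"
                i += 1
            taken.add(candidate)
        renamed.append(candidate)
    if rename_existing:
        return renamed, list(new)
    return list(existing), renamed


old_names = ["count", "length"]
new_names = ["count", "score"]
default_result = resolve_name_clashes(old_names, new_names)
flipped_result = resolve_name_clashes(old_names, new_names, rename_existing=True)
```

With the default setting the incoming "count" is the one renamed; with rename_existing=True the feature already in the corpus is renamed and the new one keeps its name.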
- static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None, language=None)[source]
Create corpus from documents.
- Parameters:
documents (list) – List of documents.
name (str) – Name of the corpus
attributes (list) – List of tuples (Variable, getter) for attributes.
class_vars (list) – List of tuples (Variable, getter) for class vars.
metas (list) – List of tuples (Variable, getter) for metas.
title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
language (str) – Resulting corpus’s language
- Returns:
Corpus.
- classmethod from_file(filename, sheet=None)[source]
Read a data table from a file. The path can be absolute or relative.
- Parameters:
filename (str) – File name
sheet (str) – Sheet in a file (optional)
- Returns:
a new data table
- Return type:
Orange.data.Table
- classmethod from_numpy(domain, X, Y=None, metas=None, W=None, attributes=None, ids=None, text_features=None, language=None)[source]
Construct a table from numpy arrays with the given domain. The number of variables in the domain must match the number of columns in the corresponding arrays. All arrays must have the same number of rows. Arrays may be of different numpy types, and may be dense or sparse.
- Parameters:
domain (Orange.data.Domain) – the domain for the new table
X (np.array) – array with attribute values
Y (np.array) – array with class values
metas (np.array) – array with meta attributes
W (np.array) – array with weights
- Returns:
- classmethod from_table(domain, source, row_indices=Ellipsis)[source]
Create a new table from selected columns and/or rows of an existing one. The columns are chosen using a domain. The domain may also include variables that do not appear in the source table; they are computed from source variables if possible.
The resulting data may be a view or a copy of the existing data.
- Parameters:
domain (Orange.data.Domain) – the domain for the new table
source (Orange.data.Table) – the source table
row_indices (a slice or a sequence) – indices of the rows to include
- Returns:
a new table
- Return type:
Orange.data.Table
- classmethod from_table_rows(source, row_indices)[source]
Construct a new table by selecting rows from the source table.
- Parameters:
source (Orange.data.Table) – an existing table
row_indices (a slice or a sequence) – indices of the rows to include
- Returns:
a new table
- Return type:
Orange.data.Table
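The row_indices argument accepts either a slice or a sequence of indices, and from_table additionally defaults to Ellipsis. A minimal sketch of how such an argument can be interpreted, using a plain list in place of a table, is given below. The `select_rows` helper is hypothetical, and treating Ellipsis as "all rows" is an assumption of the sketch.

```python
def select_rows(rows, row_indices):
    """Pick rows from a list using a slice, a sequence of indices, or Ellipsis.

    Hypothetical helper; `rows` is a plain list standing in for a table.
    """
    if row_indices is Ellipsis:
        # Assumed meaning of the Ellipsis default: keep every row.
        return list(rows)
    if isinstance(row_indices, slice):
        return rows[row_indices]
    # Any other value is treated as an iterable of integer positions.
    return [rows[i] for i in row_indices]


table_rows = ["a", "b", "c", "d"]
by_slice = select_rows(table_rows, slice(1, 3))
by_sequence = select_rows(table_rows, [3, 0])
everything = select_rows(table_rows, Ellipsis)
```

A sequence allows reordering and repetition of rows, which a slice cannot express.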
- property ngrams
Ngram representations of documents.
- Type:
generator
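To make the idea of a generator of n-gram representations concrete, here is a small stand-alone sketch. It does not use the library: the `iter_ngrams` function, the n-gram range of 1 to 2, and the space separator are assumptions chosen for the example.

```python
def iter_ngrams(tokens, n_min=1, n_max=2, sep=" "):
    """Lazily yield, for each document, the list of its n-grams.

    Hypothetical helper: `tokens` is a list of token lists, one per document.
    """
    for doc in tokens:
        yield [
            sep.join(doc[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(doc) - n + 1)
        ]


tokens = [["a", "b", "c"], ["d"]]
ngram_lists = list(iter_ngrams(tokens))
```

Because the function is a generator, n-grams for a document are only computed when that document is reached during iteration.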
- property pos_tags
A list of lists containing POS tags. If there are no POS tags available, return None.
- Type:
np.ndarray
- property pp_documents
Preprocessed documents (transformed).
- static retain_preprocessing(orig, new, key=Ellipsis)[source]
Set preprocessing of ‘new’ object to match the ‘orig’ object.
- set_text_features(feats: List[Variable] | None) -> None [source]
Select which meta-attributes to include when mining text.
- Parameters:
feats – List of text features to include. If None, infer them.
- set_title_variable(title_variable: StringVariable | str | None) -> None [source]
Set the title attribute. Only one column can be a title attribute.
- Parameters:
title_variable – Variable that needs to be set as the title variable. If it is None, no variable is set.
- property titles
Returns a list of titles.
- property tokens
A list of lists containing tokens. If tokens are not yet present, run the default preprocessor and return the resulting tokens.
- Type:
np.ndarray
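The "compute on first access" behaviour described for tokens can be sketched with a cached property. The `TinyCorpus` class below is a hypothetical stand-in, and its default preprocessing (lowercasing plus a simple word regex) is an assumption, not the library's default preprocessor.

```python
import re


class TinyCorpus:
    """Hypothetical stand-in for a corpus that tokenizes lazily."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None  # filled in on first access to `tokens`

    def _default_preprocess(self):
        # Assumed default: lowercase each document and split it into words.
        return [re.findall(r"\w+", doc.lower()) for doc in self.documents]

    @property
    def tokens(self):
        # Run the default preprocessing only if no tokens are stored yet.
        if self._tokens is None:
            self._tokens = self._default_preprocess()
        return self._tokens


corpus = TinyCorpus(["Hello, World!", "Text mining"])
first_access = corpus.tokens
second_access = corpus.tokens
```

The second access returns the stored result instead of preprocessing the documents again.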