Corpus
- class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]
Internal class for storing a corpus.
- __init__(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)[source]
- Parameters
domain (Orange.data.Domain) – the domain for this Corpus
X (numpy.ndarray) – attributes
Y (numpy.ndarray) – class variables
metas (numpy.ndarray) – meta attributes; e.g. text
W (numpy.ndarray) – instance weights
text_features (list) – meta attributes that are used for text mining. Infer them if None.
ids (numpy.ndarray) – Indices
- property dictionary
A token to id mapper.
- Type
corpora.Dictionary
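To make the idea of a token-to-id mapper concrete, here is a minimal plain-Python sketch. It is not the library's implementation: the type above is presumably gensim's corpora.Dictionary, whose id assignment may differ. The helper name build_token_ids and the first-appearance ordering are assumptions made for illustration.

```python
def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance.

    Illustrative stand-in only; the ordering is an assumption of this sketch.
    """
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                # The next free id equals the number of tokens seen so far.
                token_to_id[token] = len(token_to_id)
    return token_to_id


docs = [["human", "machine", "interface"], ["machine", "learning"]]
token_ids = build_token_ids(docs)
```

Repeated tokens such as "machine" keep the id they received on first sight, so the mapping stays one-to-one.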
- property documents
Returns a list of strings representing documents, created by joining the selected text features.
- documents_from_features(feats)[source]
- Parameters
feats (list) – A list of features to join.
Returns: a list of strings constructed by joining feats.
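The joining step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the method's actual code: rows are modelled as dicts, join_features is a hypothetical helper, and the single-space separator is a guess at how values are combined.

```python
def join_features(rows, feats, sep=" "):
    """Build one document string per row by joining the values of the selected features.

    `rows` is a list of dicts standing in for corpus rows (an assumption);
    `sep` is an assumed separator.
    """
    return [sep.join(str(row[f]) for f in feats) for row in rows]


rows = [
    {"title": "Alice", "body": "Down the rabbit hole", "year": 1865},
    {"title": "Emma", "body": "Handsome, clever, and rich", "year": 1815},
]
# Only the selected features contribute; "year" is left out.
docs = join_features(rows, ["title", "body"])
```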
- extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]
Append features to the corpus. If the feature_values argument is given, the features are Discrete; otherwise they are Continuous.
- Parameters
X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append
feature_names (list) – List of strings containing feature names
feature_values (list) – A list of possible values for Discrete features.
compute_values (list) – Compute values for corresponding features.
var_attrs (dict) – Additional attributes appended to variable.attributes.
sparse (bool) – Whether the features should be marked as sparse.
rename_existing (bool) – When true and names are not unique, rename existing features; when false, rename the new features
- extend_corpus(metadata, Y)[source]
Append documents to corpus.
- Parameters
metadata (numpy.ndarray) – Meta data
Y (numpy.ndarray) – Class variables
- static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)[source]
Create corpus from documents.
- Parameters
documents (list) – List of documents.
name (str) – Name of the corpus
attributes (list) – List of tuples (Variable, getter) for attributes.
class_vars (list) – List of tuples (Variable, getter) for class vars.
metas (list) – List of tuples (Variable, getter) for metas.
title_indices (list) – List of indices into domain corresponding to features which will be used as titles.
- Returns
Corpus.
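The (Variable, getter) pairs can be read as "a column, plus a function that extracts that column's value from one document". The sketch below shows that pattern in plain Python under assumptions: column names are plain strings standing in for Variable objects, documents are dicts, and build_columns is a hypothetical helper rather than part of the library.

```python
def build_columns(documents, specs):
    """Apply each (name, getter) pair to every document.

    Returns a mapping from column name to the list of extracted values.
    Names are plain strings here, standing in for Variable objects.
    """
    return {name: [getter(doc) for doc in documents] for name, getter in specs}


documents = [
    {"headline": "Rain expected", "text": "Showers all week.", "section": "weather"},
    {"headline": "Team wins", "text": "A late goal decided it.", "section": "sport"},
]

# One getter per column; each receives a single document.
metas = [("Title", lambda d: d["headline"]), ("Content", lambda d: d["text"])]
class_vars = [("Section", lambda d: d["section"])]

meta_columns = build_columns(documents, metas)
class_columns = build_columns(documents, class_vars)
```

Keeping the getter next to its column means the document objects can have any shape, as long as each getter knows how to read them.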
- property ngrams
Ngram representations of documents.
- Type
generator
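As a rough illustration of what a generator of n-gram representations yields, the following plain-Python sketch produces one list of n-grams per tokenized document. The function names, the tuple representation, and the default range of 1 to 2 are assumptions; the property's actual output format may differ.

```python
def ngrams(tokens, n):
    """Yield every contiguous run of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def corpus_ngrams(tokenized_docs, n_min=1, n_max=2):
    """Lazily yield, for each document, its n-grams for n in [n_min, n_max]."""
    for tokens in tokenized_docs:
        yield [gram for n in range(n_min, n_max + 1) for gram in ngrams(tokens, n)]


tokenized = [["a", "b", "c"], ["d"]]
result = list(corpus_ngrams(tokenized))
```

Because corpus_ngrams is a generator, documents are processed one at a time and only when the caller iterates.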
- property pos_tags
A list of lists containing POS tags. If there are no POS tags available, return None.
- Type
np.ndarray
- property pp_documents
Preprocessed documents (transformed).
- static retain_preprocessing(orig, new, key=Ellipsis)[source]
Set preprocessing of ‘new’ object to match the ‘orig’ object.
- set_text_features(feats: Optional[List[Orange.data.variable.Variable]]) → None [source]
Select which meta-attributes to include when mining text.
- Parameters
feats – List of text features to include. If None, they are inferred.
- set_title_variable(title_variable: Optional[Union[Orange.data.variable.StringVariable, str]]) → None [source]
Set the title attribute. Only one column can be a title attribute.
- Parameters
title_variable – Variable that needs to be set as the title variable. If it is None, no title variable is set.
- store_tokens(tokens, dictionary=None)[source]
- Parameters
tokens (list) – List of lists containing tokens.
- property titles
Returns a list of titles.
- property tokens
A list of lists containing tokens. If tokens are not yet present, run default preprocessor and return tokens.
- Type
np.ndarray
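The "compute on first access" behaviour of tokens can be sketched with a small stand-in class. Everything here is hypothetical: LazyTokens is not the Corpus class, and lowercasing plus whitespace splitting is only an assumed placeholder for the default preprocessor.

```python
class LazyTokens:
    """Stand-in showing tokens that are computed only when first requested."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None  # nothing stored until needed

    def store_tokens(self, tokens):
        """Keep an externally supplied list of token lists."""
        self._tokens = tokens

    @property
    def tokens(self):
        if self._tokens is None:
            # Assumed default preprocessing: lowercase and split on whitespace.
            self.store_tokens([doc.lower().split() for doc in self.documents])
        return self._tokens


lazy = LazyTokens(["Hello World", "Text mining"])
default_tokens = lazy.tokens

lazy.store_tokens([["custom"], ["tokens"]])
stored_tokens = lazy.tokens
```

Tokens stored explicitly take precedence, so the fallback runs only when nothing has been stored yet.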