Corpus

class orangecontrib.text.corpus.Corpus(*args, **kwargs)[source]

Internal class for storing a corpus.

__init__(domain=None, X=None, Y=None, metas=None, W=None, text_features=None, ids=None)[source]
Parameters
  • domain (Orange.data.Domain) – the domain for this Corpus

  • X (numpy.ndarray) – attributes

  • Y (numpy.ndarray) – class variables

  • metas (numpy.ndarray) – meta attributes; e.g. text

  • W (numpy.ndarray) – instance weights

  • text_features (list) – meta attributes that are used for text mining. Infer them if None.

  • ids (numpy.ndarray) – Indices

copy()[source]

Return a copy of the table.

property dictionary

A token to id mapper.

Type

corpora.Dictionary
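
As an illustration of what a token-to-id mapper provides, here is a minimal pure-Python sketch. It is not the corpora.Dictionary implementation: build_token_ids is a hypothetical helper, and assigning ids in order of first appearance is an assumption made only for this example.

```python
def build_token_ids(tokenized_docs):
    """Map each distinct token to an integer id (order of first appearance).

    Hypothetical helper for illustration; the real mapper may assign
    ids in a different order.
    """
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


docs = [["human", "machine", "interface"], ["machine", "learning"]]
token_ids = build_token_ids(docs)
```

Each token receives exactly one id, so repeated tokens such as "machine" map to the same integer in every document.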

property documents

Returns a list of strings representing documents, created by joining the selected text features.

documents_from_features(feats)[source]
Parameters

feats (list) – A list of features to join.

Returns: a list of strings constructed by joining feats.
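
The joining step can be pictured with a small stand-alone sketch. It does not call the library: join_features, the row layout, and the single-space separator are all assumptions chosen for illustration.

```python
def join_features(rows, feature_names, feats, sep=" "):
    """Join the values of the selected features for every row.

    Hypothetical helper; the separator used by the real method is
    not stated in this reference and is assumed here.
    """
    indices = [feature_names.index(name) for name in feats]
    return [sep.join(str(row[i]) for i in indices) for row in rows]


feature_names = ["title", "author", "content"]
rows = [
    ["Moby Dick", "Melville", "Call me Ishmael."],
    ["Emma", "Austen", "Emma Woodhouse, handsome, clever, and rich."],
]
documents = join_features(rows, feature_names, ["title", "content"])
```

The result contains one string per row, built only from the features named in feats and in the order they were given.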

extend_attributes(X, feature_names, feature_values=None, compute_values=None, var_attrs=None, sparse=False, rename_existing=False)[source]

Append features to corpus. If the feature_values argument is given, the features will be Discrete; otherwise they will be Continuous.

Parameters
  • X (numpy.ndarray or scipy.sparse.csr_matrix) – Feature values to append

  • feature_names (list) – List of strings containing feature names

  • feature_values (list) – A list of possible values for Discrete features.

  • compute_values (list) – Compute values for corresponding features.

  • var_attrs (dict) – Additional attributes appended to variable.attributes.

  • sparse (bool) – Whether the features should be marked as sparse.

  • rename_existing (bool) – When true and names are not unique, rename existing features; if false, rename new features

extend_corpus(metadata, Y)[source]

Append documents to corpus.

Parameters
  • metadata (numpy.ndarray) – Meta data

  • Y (numpy.ndarray) – Class variables

static from_documents(documents, name, attributes=None, class_vars=None, metas=None, title_indices=None)[source]

Create corpus from documents.

Parameters
  • documents (list) – List of documents.

  • name (str) – Name of the corpus

  • attributes (list) – List of tuples (Variable, getter) for attributes.

  • class_vars (list) – List of tuples (Variable, getter) for class vars.

  • metas (list) – List of tuples (Variable, getter) for metas.

  • title_indices (list) – List of indices into domain corresponding to features which will be used as titles.

Returns

Corpus.
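
The (Variable, getter) tuples can be read as "column description plus a function that extracts the column value from one document". The sketch below shows that pattern in plain Python under stated assumptions: strings stand in for Orange Variable instances, documents are assumed to be dicts, and build_columns is a hypothetical helper, not part of the library.

```python
documents = [
    {"title": "Doc A", "text": "first text", "year": 2001},
    {"title": "Doc B", "text": "second text", "year": 2015},
]

# Stand-ins for (Variable, getter) tuples: a plain string names the column
# where the real API expects an Orange Variable instance.
metas = [
    ("title", lambda doc: doc["title"]),
    ("text", lambda doc: doc["text"]),
]
attributes = [("year", lambda doc: doc["year"])]


def build_columns(documents, spec):
    """Apply every getter in spec to every document (hypothetical helper)."""
    return [[getter(doc) for _, getter in spec] for doc in documents]


meta_rows = build_columns(documents, metas)
attr_rows = build_columns(documents, attributes)
```

Keeping the getter next to its variable means each column is defined in one place, regardless of how the source documents are structured.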

has_tokens()[source]

Return whether corpus is preprocessed or not.

property ngrams

Ngram representations of documents.

Type

generator
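
For readers unfamiliar with n-grams, a minimal generator-based sketch follows. It is independent of the library: the function names are hypothetical, and representing each n-gram as a tuple is an assumption, since this reference does not state how the property formats its output.

```python
def ngrams(tokens, n):
    """Yield consecutive n-token windows from one token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def corpus_ngrams(tokenized_docs, n=2):
    """Lazily yield the list of n-grams for each document in turn."""
    for doc in tokenized_docs:
        yield list(ngrams(doc, n))


docs = [["the", "quick", "brown", "fox"], ["hello"]]
bigrams = list(corpus_ngrams(docs, n=2))
```

A document shorter than n produces an empty list, and because the outer function is a generator, nothing is computed until it is iterated.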

property pos_tags

A list of lists containing POS tags. If there are no POS tags available, return None.

Type

np.ndarray

property pp_documents

Preprocessed documents (transformed).

static retain_preprocessing(orig, new, key=Ellipsis)[source]

Set preprocessing of ‘new’ object to match the ‘orig’ object.

set_text_features(feats: Optional[List[Orange.data.variable.Variable]]) → None[source]

Select which meta-attributes to include when mining text.

Parameters

feats – List of text features to include. If None, infer them.

set_title_variable(title_variable: Optional[Union[Orange.data.variable.StringVariable, str]]) → None[source]

Set the title attribute. Only one column can be a title attribute.

Parameters

title_variable – Variable that needs to be set as the title variable. If it is None, no title variable is set.

store_tokens(tokens, dictionary=None)[source]
Parameters

tokens (list) – List of lists containing tokens.

property titles

Returns a list of titles.

property tokens

A list of lists containing tokens. If tokens are not yet present, run the default preprocessor and return the resulting tokens.

Type

np.ndarray
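
The "compute on first access" behaviour described for tokens can be sketched with a small stand-in class. This is not the Corpus implementation: LazyTokens is hypothetical, a lowercase whitespace split stands in for the default preprocessor, and caching the result after the first access is an assumption.

```python
class LazyTokens:
    """Hypothetical stand-in showing lazy tokenization; not the Corpus class."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None

    def has_tokens(self):
        """Return whether tokens have been stored."""
        return self._tokens is not None

    def store_tokens(self, tokens):
        """Store a list of token lists, one per document."""
        self._tokens = tokens

    @property
    def tokens(self):
        if self._tokens is None:
            # Assumed fallback: a simple split replaces the real default preprocessor.
            self.store_tokens([doc.lower().split() for doc in self.documents])
        return self._tokens


corpus = LazyTokens(["Hello world", "Lazy tokens"])
before = corpus.has_tokens()
tokens = corpus.tokens
after = corpus.has_tokens()
```

In this sketch, has_tokens() is false until tokens is read or store_tokens() is called, which mirrors the relationship between the three members as documented above.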