SpacyPreprocessor

class tidyX.spacy_preprocessor.SpacyPreprocessor
is_component_registered() → bool

Check if a spaCy pipeline component with the given name is already registered.

Args:

name (str): The name of the spaCy pipeline component.

Returns:

bool: True if the component is already registered, False otherwise.

register_component()

Conditionally register the custom_lemmatizer component.

static spacy_pipeline(documents: List[str], custom_lemmatizer: bool = False, pipeline: List[str] = ['tokenize', 'lemmatizer'], stopwords_language: str = 'Spanish', model: str = 'es_core_news_sm', num_strings: int = 0) → Union[List[List[str]], Tuple[List[List[str]], List[Tuple[str, int]]]]

Processes documents through the spaCy pipeline, performing lemmatization and stopword removal, and optionally utilizing a custom lemmatizer for Spanish.

For further information on the custom lemmatizer, refer to: https://github.com/pablodms/spacy-spanish-lemmatizer

Note: Ensure the relevant spaCy model is downloaded using `python -m spacy download <model_name>`, where <model_name> can be "es_core_news_sm", "es_core_news_md", "es_core_news_lg", or "es_dep_news_trf".

Args:

documents (List[str]): A list of texts to be processed.
custom_lemmatizer (bool, optional): If True, a custom Spanish rule-based lemmatizer is added to the pipeline. Defaults to False.
pipeline (List[str], optional): A list of spaCy pipeline components for processing the documents. Defaults to ['tokenize', 'lemmatizer'].
stopwords_language (str, optional): Language for the nltk stopwords list. Defaults to 'Spanish'.
model (str, optional): spaCy model to be used. Defaults to 'es_core_news_sm'.
num_strings (int, optional): Number of most common strings to return. If 0, only processed documents are returned. Defaults to 0.

Returns:

Union[List[List[str]], Tuple[List[List[str]], List[Tuple[str, int]]]]: The processed documents, each as a list of strings. If num_strings > 0, a tuple of the processed documents and a list of (string, count) pairs for the most common strings in the documents.

Raises:

ValueError: If the documents list is empty.
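
As an illustration of the return shape described above, here is a minimal sketch. It is not the library's implementation. The helper `with_common_strings` is hypothetical and takes documents that are already tokenized, and it assumes that string frequencies are counted across all documents together.

```python
from collections import Counter
from typing import List, Tuple, Union

TokenDocs = List[List[str]]


def with_common_strings(
    token_docs: TokenDocs, num_strings: int = 0
) -> Union[TokenDocs, Tuple[TokenDocs, List[Tuple[str, int]]]]:
    """Hypothetical helper mirroring the documented return contract."""
    # Mirror the documented error for an empty input list.
    if not token_docs:
        raise ValueError("The documents list is empty.")
    # With num_strings == 0, only the processed documents are returned.
    if num_strings <= 0:
        return token_docs
    # Assumption: counts are pooled over every document.
    counts = Counter(token for doc in token_docs for token in doc)
    return token_docs, counts.most_common(num_strings)


docs = [["gato", "comer", "pescado"], ["perro", "comer", "carne"]]
only_docs = with_common_strings(docs)
docs_out, common = with_common_strings(docs, num_strings=1)
```

Callers that pass num_strings > 0 need to unpack a tuple, while the default call returns the list of documents directly.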

static spanish_lemmatizer(token: str, model: Spanish) → str

Lemmatizes a given token using spaCy's Spanish language model.

Note: Before using this function, download a spaCy model for Spanish with `python -m spacy download <model_name>`. Available models: "es_core_news_sm", "es_core_news_md", "es_core_news_lg", "es_dep_news_trf". For more information, visit https://spacy.io/models/es

Args:

token (str): The token to be lemmatized.
model (spacy.lang.es.Spanish): A spaCy language model object.

Returns:

str: The lemmatized version of the token, with accents removed.
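
Accent removal of the kind mentioned in the return value can be sketched with the standard library alone. The helper `strip_accents` is hypothetical and not part of tidyX, and the Unicode-decomposition approach is an assumption about how accents might be stripped, not a description of the library's code.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining accent marks from a string."""
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


lemma = strip_accents("canción")
```

Under this approach "ñ" is also reduced to "n", which may or may not match the behavior of spanish_lemmatizer.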