TextPreprocessor
- class tidyX.text_preprocessor.TextPreprocessor
- static create_bol(lemmas: ndarray, verbose: bool = True) -> DataFrame
Groups lemmas based on Levenshtein distance to handle misspelled words in social media data.
This method clusters lemmas that are similar to each other based on their Levenshtein distance. The aim is to group together possibly misspelled versions of the same lemma.
- Args:
- lemmas (np.ndarray):
An array containing lemmas to be grouped.
- verbose (bool, optional):
If set to True, progress will be printed at every 5% increment. Defaults to True.
- Returns:
- pd.DataFrame: A DataFrame with columns:
“bow_id”: An ID for the bag of lemmas (int).
“bow_name”: The representative name for the bag of lemmas (str).
“lemma”: The original lemma (str).
“similarity”: The similarity score based on fuzz.ratio (int).
“threshold”: The similarity threshold used (int).
- Notes:
The method utilizes the fuzz.ratio function to determine similarity between lemmas. The threshold for similarity depends on the length of the lemma being compared, to accommodate the sensitivity of fuzz.ratio towards shorter words.
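The grouping idea described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `difflib.SequenceMatcher` stands in for `fuzz.ratio`, and the helper names and the threshold values (90 for lemmas of up to four characters, 80 otherwise) are assumptions chosen only to show a length-dependent cutoff.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in for fuzz.ratio: a 0-100 similarity score computed with difflib."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def threshold_for(lemma: str) -> int:
    """Hypothetical cutoff: short lemmas must match more closely than long ones."""
    return 90 if len(lemma) <= 4 else 80


def group_lemmas(lemmas):
    """Greedily put each lemma in the first bag whose representative is similar enough."""
    bags = []  # list of (representative, members)
    for lemma in lemmas:
        for representative, members in bags:
            if similarity(representative, lemma) >= threshold_for(lemma):
                members.append(lemma)
                break
        else:
            # No existing bag is close enough, so the lemma starts its own bag.
            bags.append((lemma, [lemma]))
    return bags


bags = group_lemmas(["gobierno", "govierno", "gobienro", "casa", "cosa"])
for representative, members in bags:
    print(representative, members)
```

With these assumed thresholds, the two misspellings join the "gobierno" bag, while "casa" and "cosa" stay apart because short words face the stricter cutoff.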
- static get_most_common_strings(texts: List[str | List[str]], num_strings: int) -> List[Tuple[str, int]]
Retrieves the most common strings in a list of texts.
This method serves primarily to validate preprocessing steps or to provide descriptive information about a collection of texts. It can handle both flat lists of strings and lists of lists of strings.
- Args:
- texts (List[Union[str, List[str]]]):
A list of texts, each text can be a string or a list of strings.
- num_strings (int):
The number of most common strings to be returned.
- Returns:
- List[Tuple[str, int]]:
A list of tuples where each tuple contains a string and its occurrence count, representing the most common strings in the given texts.
- Example:
>>> TextPreprocessor.get_most_common_strings(["apple orange", "apple banana"], 1)
[('apple', 2)]
>>> TextPreprocessor.get_most_common_strings([["apple", "orange"], ["apple", "banana"]], 1)
[('apple', 2)]
- Raises:
ValueError: If the provided num_strings is non-positive or if texts is an empty list.
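The counting behaviour described above can be approximated with `collections.Counter`. The sketch below is an assumption about how such a function might work, not the library's code: the name `most_common_strings` is hypothetical, and splitting plain strings on whitespace is inferred from the documented example.

```python
from collections import Counter
from typing import List, Tuple, Union


def most_common_strings(
    texts: List[Union[str, List[str]]], num_strings: int
) -> List[Tuple[str, int]]:
    """Count tokens across all texts and return the most frequent ones."""
    if num_strings <= 0 or not texts:
        raise ValueError("num_strings must be positive and texts must be non-empty")
    counts = Counter()
    for text in texts:
        # Assumption: strings are split on whitespace, lists are used as given.
        tokens = text.split() if isinstance(text, str) else text
        counts.update(tokens)
    return counts.most_common(num_strings)


print(most_common_strings(["apple orange", "apple banana"], 1))
```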
- static load_stopwords(language='spanish')
Load and cache stopwords for a given language.
- Notes:
To utilize this function, the nltk library must be installed and the stopwords dataset downloaded:
- To install nltk: `pip install nltk`
- To download the stopwords dataset: `import nltk; nltk.download('stopwords')`
- static preprocess(string: str, delete_emojis: bool = True, extract: bool = False, exceptions: List[str] = ['r', 'l', 'n', 'c', 'a', 'e', 'o'], allow_numbers: bool = False, remove_stopwords: bool = False, language_stopwords: str = 'spanish') -> str | Tuple[str, List[str]]
Preprocesses a string, typically a tweet, by applying a series of cleaning functions. The function performs the following steps:
Removes the ‘RT’ prefix from retweeted tweets.
Converts the entire string to lowercase.
Removes all accents and, if specified, emojis.
Optionally extracts and/or removes all mentions (e.g., @elonmusk).
Removes URLs.
Removes hashtags.
Removes special characters such as !, ?, -, ;, etc. while optionally preserving numbers.
Removes stopwords if indicated.
Removes extra spaces between words.
Reduces consecutive repeated characters, with exceptions defined in the exceptions parameter.
Separates consecutive emojis.
- Args:
- string (str):
The raw text.
- delete_emojis (bool):
If True, removes emojis from the string. Default is True.
- extract (bool):
If True, extracts and returns a list of all mentioned accounts in the text. Default is False.
- exceptions (list):
Characters that are allowed to be repeated consecutively. Defaults to [‘r’, ‘l’, ‘n’, ‘c’, ‘a’, ‘e’, ‘o’].
- allow_numbers (bool):
If True, numbers are preserved in the string. Default is False.
- remove_stopwords (bool):
If True, stopwords are removed based on the specified language. Default is False.
- language_stopwords (str):
The language for which stopwords should be removed. Defaults to “spanish”.
- Returns:
- str:
The cleaned text.
- mentions (list):
If extract is True, the accounts mentioned in the original text are also returned, so the result is a tuple of the cleaned text and this list.
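To make the order of operations concrete, here is a reduced sketch of such a pipeline using only `re` and `unicodedata`. It is not the library's implementation: the function name `preprocess_sketch` and every regular expression are assumptions, and emoji handling, stopword removal, and repetition reduction are left out to keep the example short.

```python
import re
import unicodedata


def preprocess_sketch(text: str) -> str:
    """Simplified cleaning pipeline that follows the documented order of steps."""
    text = re.sub(r"^RT\s+", "", text)      # drop the retweet prefix
    text = text.lower()                      # lowercase everything
    text = "".join(                          # strip accent marks after decomposition
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )
    text = re.sub(r"@\w+", "", text)         # remove mentions
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"#\w+", "", text)         # remove hashtags
    text = re.sub(r"[^a-z ]", "", text)      # keep lowercase letters and spaces only
    return " ".join(text.split())            # collapse extra whitespace


cleaned = preprocess_sketch("RT @usuario: ¡Qué día! https://t.co/abc #lunes")
print(cleaned)
```

Order matters here: mentions, URLs, and hashtags are removed before special characters, because stripping `@`, `#`, or `/` first would leave their text behind as ordinary words.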
- static remove_RT(string: str) -> str
Removes the “RT” prefix from tweets.
This function removes the “RT” prefix that usually appears at the beginning of retweets. It accounts for the possibility of varying white-space after “RT”.
- Args:
- string (str):
The tweet text to be processed.
- Returns:
- str:
The processed tweet text with the “RT” prefix removed if it appears at the beginning.
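A prefix-only removal of this kind can be sketched with an anchored regular expression. The function name `strip_rt` and the pattern are assumptions for illustration, not the library's code.

```python
import re


def strip_rt(text: str) -> str:
    """Remove a leading 'RT' followed by any amount of whitespace."""
    return re.sub(r"^RT\s+", "", text)


print(strip_rt("RT   @user hola"))
```

Because the pattern is anchored with `^`, an "RT" that appears later in the text is left untouched.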
- static remove_accents(string: str, delete_emojis=True) -> str
Removes accents and optionally emojis from a string.
This function removes accent marks from characters in a given string. If specified, it can also remove emojis.
- Args:
- string (str):
The input string potentially containing accented characters and/or emojis.
- delete_emojis (bool, optional):
If True, removes emojis from the string. Default is True.
- Returns:
- str:
The string with accented characters and optionally emojis removed.
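One common way to strip accents is Unicode decomposition followed by removal of combining marks. The sketch below assumes that approach; it is not necessarily what the library does. The name `strip_accents` is hypothetical, and the emoji pattern covers only a few common code point ranges.

```python
import re
import unicodedata

# Assumption: a small set of common emoji ranges, not a complete list.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_accents(text: str, delete_emojis: bool = True) -> str:
    """Drop combining accent marks and, optionally, emojis in the assumed ranges."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" holds the combining marks split off by NFD (this also turns ñ into n).
    result = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    if delete_emojis:
        result = EMOJI_PATTERN.sub("", result)
    return result


print(strip_accents("canción 🎉"))
```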
- static remove_extra_spaces(string: str) -> str
Removes extra spaces within and surrounding a given string.
This function trims leading and trailing spaces and replaces any occurrence of consecutive spaces between words with a single space.
- Args:
- string (str):
The text that may contain extra spaces.
- Returns:
- str:
The processed text with extra spaces removed.
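The trimming and collapsing described above might look like the following sketch. The function name `collapse_spaces` is hypothetical, and only the space character is handled, as in the description.

```python
import re


def collapse_spaces(text: str) -> str:
    """Trim surrounding spaces and squeeze runs of spaces into one."""
    return re.sub(r" {2,}", " ", text.strip(" "))


print(collapse_spaces("  hola   mundo  "))
```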
- static remove_hashtags(string: str) -> str
Removes hashtags from a given string.
This function scans the string and removes any text that starts with a ‘#’ and is followed by alphanumeric characters.
- Args:
- string (str):
The text that may contain hashtags.
- Returns:
- str:
The processed text with hashtags removed.
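A minimal sketch of hashtag removal, assuming a hashtag is `#` followed by word characters. The function name `strip_hashtags` is hypothetical.

```python
import re


def strip_hashtags(text: str) -> str:
    """Remove '#' together with the alphanumeric run that follows it."""
    return re.sub(r"#\w+", "", text)


print(repr(strip_hashtags("gol #mundial2022 hoy")))
```

The gap left by the hashtag stays in the string as a double space, which is why a whitespace cleanup step usually follows.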
- static remove_last_repetition(string: str) -> str
Removes the repetition of the last character in each word of a given string.
In Spanish, no word ends with a repeated character. However, in social media, it is common to emphasize words by repeating the last character. This function cleans the text to remove such repetitions.
For example, the input “Holaaaa amigooo” would be transformed to “Hola amigo”.
- Args:
- string (str):
The text to be processed.
- Returns:
- str:
The processed text with the last character of each word de-duplicated.
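The end-of-word rule can be expressed with a backreference and a word boundary. This is a sketch under that assumption, with a hypothetical function name, not the library's code.

```python
import re


def strip_last_repetition(text: str) -> str:
    """Collapse a repeated character only when the run ends a word."""
    # (\w)\1+ matches a character repeated at least once; \b requires the run to end the word.
    return re.sub(r"(\w)\1+\b", r"\1", text)


print(strip_last_repetition("Holaaaa amigooo"))
```

Repetitions inside a word, such as the "ll" in "llama", are left alone because the word boundary does not follow them.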
- static remove_mentions(string: str, extract=True)
Removes mentions (e.g., @username) from a given tweet string.
This function scans the string and removes any text that starts with ‘@’ followed by the username. Optionally, it can also return a list of unique mentions.
- Args:
- string (str):
The tweet text that may contain mentions.
- extract (bool, optional):
If True, returns a list of unique mentions. Defaults to True.
- Returns:
- str:
The processed tweet text with mentions removed.
- list:
If extract is True, returns a list of unique mentioned accounts in the tweet.
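The following sketch shows one way to remove mentions while collecting the unique ones. The function name `strip_mentions`, the pattern, and the choice to keep the leading `@` in the returned list are all assumptions, not confirmed library behaviour.

```python
import re
from typing import List, Tuple


def strip_mentions(text: str) -> Tuple[str, List[str]]:
    """Return the text without mentions plus the unique mentions in first-seen order."""
    found = re.findall(r"@\w+", text)
    unique = list(dict.fromkeys(found))  # dict keys preserve insertion order
    cleaned = re.sub(r"@\w+", "", text)
    return cleaned, unique


cleaned, mentions = strip_mentions("@ana hola @luis y @ana")
print(repr(cleaned), mentions)
```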
- static remove_repetitions(string: str, exceptions=['r', 'l', 'n', 'c', 'a', 'e', 'o']) -> str
Removes consecutive repeated characters in a given string, with some optional exceptions.
For example, when no characters are exempted, the string ‘coooroosooo’ would be transformed to ‘coroso’.
- Args:
- string (str):
The text to be processed.
- exceptions (list, optional):
A list of characters that can be repeated once consecutively without being removed. Defaults to [‘r’, ‘l’, ‘n’, ‘c’, ‘a’, ‘e’, ‘o’].
- Returns:
- str:
The processed text with consecutive repetitions removed, except for characters in the exceptions list.
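The interaction between repetitions and exceptions can be sketched with a single substitution and a callback. The function name `squeeze_repetitions` is hypothetical, and this sketch defaults to no exceptions, unlike the documented method. Here 'coooroosooo' collapses to 'coroso' only when 'o' is not exempted.

```python
import re
from typing import Optional, Sequence


def squeeze_repetitions(text: str, exceptions: Optional[Sequence[str]] = None) -> str:
    """Collapse runs of a repeated character; exempt characters may stay doubled."""
    exempt = set(exceptions or [])

    def collapse(match):
        char = match.group(1)
        # Exempt characters keep one repetition; all others collapse to a single copy.
        return char * 2 if char in exempt else char

    return re.sub(r"(.)\1+", collapse, text)


print(squeeze_repetitions("coooroosooo"))
print(squeeze_repetitions("carrrro", exceptions=["r"]))
```

Allowing one repetition for letters such as "r" or "l" keeps legitimate Spanish digraphs like "rr" and "ll" intact.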
- static remove_special_characters(string: str, allow_numbers: bool = False) -> str
Removes all characters from a string except for lowercase letters and spaces.
This function scans the string and removes any character that is not a lowercase letter or a space. Optionally, numbers can be retained. As a result, punctuation marks, exclamation marks, special characters, and uppercase letters are eliminated.
- Args:
- string (str):
The text that may contain special characters.
- allow_numbers (bool):
Whether to allow numbers in the string. Default is False.
- Returns:
- str:
The processed text with special characters removed.
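A whitelist-style filter like the one described can be sketched with a negated character class. The function name `keep_letters` is hypothetical, and deleting the unwanted characters (rather than replacing them with spaces) is an assumption.

```python
import re


def keep_letters(text: str, allow_numbers: bool = False) -> str:
    """Keep lowercase ASCII letters and spaces, plus digits when allow_numbers is True."""
    pattern = r"[^a-z0-9 ]" if allow_numbers else r"[^a-z ]"
    return re.sub(pattern, "", text)


print(repr(keep_letters("hola, mundo 2024!")))
print(repr(keep_letters("hola, mundo 2024!", allow_numbers=True)))
```

Since uppercase and accented letters fall outside `a-z`, a filter like this is best applied after lowercasing and accent removal.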
- static remove_urls(string: str) -> str
Removes all URLs that start with “http” from a given string.
This function scans the entire string and removes any sequence of characters that starts with “http” and continues until a space or end of line is encountered.
- Args:
- string (str):
The text to be processed.
- Returns:
- str:
The processed text with URLs removed.
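The "starts with http and runs until whitespace" rule maps directly onto a short pattern. This sketch uses a hypothetical function name and is not the library's code.

```python
import re


def strip_urls(text: str) -> str:
    """Remove every run of non-space characters that begins with 'http'."""
    return re.sub(r"http\S+", "", text)


print(repr(strip_urls("mira https://t.co/abc ahora")))
```

Links written without a scheme, such as "www.example.com", would not be matched by this pattern.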
- static remove_words(string: str, bag_of_words: list | None = None, remove_stopwords: bool = False, language: str = 'spanish') -> str
Removes specified words and optionally stopwords from a string.
- Args:
- string (str):
The input string from which words are to be removed.
- bag_of_words (list, optional):
A list of words that should be removed from the string. Defaults to None.
- remove_stopwords (bool, optional):
If True, removes predefined stopwords from the string based on the specified language. Defaults to False.
- language (str, optional):
Language of the stopwords that will be removed if remove_stopwords is set to True. Defaults to ‘spanish’.
- Returns:
- str:
A string with the specified words removed.
- Notes:
To utilize this function, the nltk library must be installed and the stopwords dataset downloaded:
- To install nltk: `pip install nltk`
- To download the stopwords dataset: `import nltk; nltk.download('stopwords')`
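The word-filtering logic can be illustrated without nltk. In the sketch below, `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for the nltk stopword list, and `drop_words` is an assumed name; whole-token matching on whitespace-split words is also an assumption.

```python
from typing import Iterable, Optional

# Hypothetical stand-in for the nltk Spanish stopword list.
SAMPLE_STOPWORDS = {"el", "la", "de", "y", "en"}


def drop_words(
    text: str,
    bag_of_words: Optional[Iterable[str]] = None,
    remove_stopwords: bool = False,
) -> str:
    """Remove listed words, and the sample stopwords when requested."""
    blocked = set(bag_of_words or [])
    if remove_stopwords:
        blocked |= SAMPLE_STOPWORDS
    return " ".join(word for word in text.split() if word not in blocked)


print(drop_words("el gato y la casa", ["gato"]))
print(drop_words("el gato y la casa", ["gato"], remove_stopwords=True))
```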
- static space_between_emojis(string: str) -> str
Inserts spaces around emojis within a string.
This function adds a space before and after each emoji character in the given string to ensure that emojis are separated from other text or emojis. Extra spaces are then removed.
- Args:
- string (str):
The text that may contain emojis.
- Returns:
- str:
The processed text with spaces inserted around each emoji.
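Padding each emoji and then collapsing whitespace can be sketched as follows. The function name `space_emojis` is hypothetical, and the pattern covers only a few common emoji ranges, so multi-code-point emojis (skin tones, joined sequences) would be split apart by this simplified version.

```python
import re

# Assumption: a small set of common emoji ranges, not a complete list.
EMOJI = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def space_emojis(text: str) -> str:
    """Surround each matched emoji with spaces, then collapse extra whitespace."""
    spaced = EMOJI.sub(r" \1 ", text)
    return " ".join(spaced.split())


print(space_emojis("gol⚽⚽vamos"))
```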
- static unnest_tokens(df: DataFrame, input_column: str, id_col: str | None = None, unique: bool = False) -> DataFrame
Unnests or flattens a DataFrame by tokenizing a specified column.
Given a pandas DataFrame and a column name, this function splits the specified column on spaces, turning each token into a separate row in the resulting DataFrame.
- Args:
- df (pd.DataFrame):
The input DataFrame. Each row is expected to represent a document.
- input_column (str):
The name of the column to tokenize.
- id_col (str, optional):
The name of the column that uniquely identifies each document. If None, an “id” column is added based on the DataFrame’s index. Defaults to None.
- unique (bool, optional):
If True, tokens are deduplicated and the IDs of the documents in which each token appears are concatenated. If False, every token will have a corresponding row. Defaults to False.
- Returns:
- pd.DataFrame:
A DataFrame where each row corresponds to a token from the input column.
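The reshaping itself can be shown without pandas. The plain-Python sketch below works on `(id, text)` pairs instead of a DataFrame; the function names are hypothetical, and collecting IDs in a list for the deduplicated case is an assumption, since the documentation does not state how IDs are concatenated.

```python
from typing import Dict, List, Tuple


def unnest(docs: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Produce one (id, token) row per whitespace-separated token."""
    return [(doc_id, token) for doc_id, text in docs for token in text.split()]


def unnest_unique(docs: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Produce one entry per distinct token with the ids of documents containing it."""
    ids_by_token: Dict[str, List[int]] = {}
    for doc_id, token in unnest(docs):
        ids = ids_by_token.setdefault(token, [])
        if doc_id not in ids:
            ids.append(doc_id)
    return ids_by_token


docs = [(0, "hola mundo"), (1, "hola")]
print(unnest(docs))
print(unnest_unique(docs))
```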