TextPreprocessor
- class tidyX.text_preprocessor.TextPreprocessor
- static create_bol(lemmas: ndarray, verbose: bool = True) -> DataFrame
Groups lemmas based on Levenshtein distance to handle misspelled words in social media data.
This method clusters lemmas whose Levenshtein distance to one another is small, so that likely misspellings of the same lemma end up in one group.
- Args:
- lemmas (np.ndarray):
An array containing lemmas to be grouped.
- verbose (bool, optional):
If set to True, progress will be printed at every 5% increment. Defaults to True.
- Returns:
- pd.DataFrame: A DataFrame with columns:
“bow_id”: An ID for the bag of lemmas (int).
“bow_name”: The representative name for the bag of lemmas (str).
“lemma”: The original lemma (str).
“similarity”: The similarity score based on fuzz.ratio (int).
“threshold”: The similarity threshold used (int).
- Notes:
The method uses the fuzz.ratio function to score similarity between lemmas. The similarity threshold depends on the length of the lemma being compared, because a single-character difference changes fuzz.ratio more for shorter words.
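The length-dependent threshold described in the note can be illustrated with a minimal sketch. It is not the tidyX implementation: difflib.SequenceMatcher from the standard library stands in for fuzz.ratio, and both the cut-offs in threshold_for and the greedy grouping in group_lemmas are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (a stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def threshold_for(lemma: str) -> int:
    """Hypothetical cut-offs: shorter lemmas must match more strictly."""
    if len(lemma) <= 4:
        return 95
    if len(lemma) <= 7:
        return 85
    return 80


def group_lemmas(lemmas):
    """Greedily put each lemma in the first bag whose name it resembles."""
    bags = []  # list of (bag_name, members) pairs
    for lemma in lemmas:
        for name, members in bags:
            if similarity(name, lemma) >= threshold_for(lemma):
                members.append(lemma)
                break
        else:
            # No existing bag is close enough, so the lemma starts a new one.
            bags.append((lemma, [lemma]))
    return bags


for name, members in group_lemmas(["gobierno", "govierno", "gato", "pato"]):
    print(name, members)
```

In this sketch "govierno" joins the "gobierno" bag, while "gato" and "pato" stay apart because short lemmas face the stricter cut-off.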
- static get_most_common_strings(texts: List[Union[str, List[str]]], num_strings: int) -> List[Tuple[str, int]]
Retrieves the most common strings in a list of texts.
This method serves primarily to validate preprocessing steps or to provide descriptive information about a collection of texts. It can handle both flat lists of strings and lists of lists of strings.
- Args:
- texts (List[Union[str, List[str]]]):
A list of texts; each text can be a string or a list of strings.
- num_strings (int):
The number of most common strings to be returned.
- Returns:
- List[Tuple[str, int]]:
A list of tuples where each tuple contains a string and its occurrence count, representing the most common strings in the given texts.
- Example:
>>> TextPreprocessor.get_most_common_strings(["apple orange", "apple banana"], 1)
[('apple', 2)]
>>> TextPreprocessor.get_most_common_strings([["apple", "orange"], ["apple", "banana"]], 1)
[('apple', 2)]
- Raises:
ValueError: If the provided num_strings is non-positive or if texts is an empty list.
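A counting routine with this behaviour could look like the sketch below. The function name most_common_strings is hypothetical, and splitting plain strings on whitespace is an assumption inferred from the documented example rather than taken from the tidyX source.

```python
from collections import Counter
from typing import List, Tuple, Union


def most_common_strings(
    texts: List[Union[str, List[str]]], num_strings: int
) -> List[Tuple[str, int]]:
    """Count tokens across texts and return the num_strings most frequent."""
    if num_strings <= 0 or not texts:
        raise ValueError("num_strings must be positive and texts non-empty")
    tokens = []
    for text in texts:
        # Assumption: a plain string is split on whitespace; a list is used as-is.
        tokens.extend(text.split() if isinstance(text, str) else text)
    return Counter(tokens).most_common(num_strings)


print(most_common_strings(["apple orange", "apple banana"], 1))
```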
- static preprocess(string: str, delete_emojis=True, extract=True, exceptions=['r', 'l', 'n', 'c', 'a', 'e', 'o'], allow_numbers: bool = False)
Preprocesses tweets by applying a series of cleaning functions. The function performs the following steps:
Removes the ‘RT’ prefix from retweeted tweets. (remove_RT)
Converts the entire string to lowercase. (.lower)
Removes all accents and optionally emojis. (remove_accents)
Extracts and/or removes all mentions (e.g., @elonmusk). (remove_mentions)
Removes URLs. (remove_urls)
Removes hashtags. (remove_hashtags)
Removes special characters such as !, ?, -, ;, etc. (remove_special_characters)
Removes extra spaces between words. (remove_extra_spaces)
Removes consecutive repeated characters, with exceptions defined in the exceptions parameter. (remove_repetitions and remove_last_repetition)
- Args:
- string (str):
The raw tweet text.
- delete_emojis (bool):
Whether to remove emojis from the string. Default is True.
- extract (bool):
If True, also returns a list of all accounts mentioned in the tweet. Default is True.
- exceptions (list):
List of characters allowed to be repeated. Default is ['r', 'l', 'n', 'c', 'a', 'e', 'o'].
- allow_numbers (bool):
Whether to allow numbers in the string. Default is False.
- Returns:
- str:
The cleaned tweet text.
- mentions (list):
If extract is True, a list of mentioned accounts is returned.
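The order of the cleaning steps matters: for example, lowercasing must happen before special characters are filtered, and mentions must be collected before punctuation is stripped. The sketch below chains simplified regular-expression stand-ins in the documented order. It is an illustration under assumptions, not the tidyX code: preprocess_sketch is a hypothetical name, emoji handling and the exceptions parameter are omitted, and only trailing repetitions are collapsed.

```python
import re
import unicodedata


def preprocess_sketch(tweet: str):
    """Apply simplified versions of the documented steps, in order."""
    text = re.sub(r"^RT\s+", "", tweet)           # 1. drop the retweet prefix
    text = text.lower()                           # 2. lowercase
    text = "".join(                               # 3. strip accent marks
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    mentions = re.findall(r"@\w+", text)          # 4. collect mentions ...
    text = re.sub(r"@\w+", "", text)              #    ... then remove them
    text = re.sub(r"http\S+", "", text)           # 5. remove URLs
    text = re.sub(r"#\w+", "", text)              # 6. remove hashtags
    text = re.sub(r"[^a-z ]", "", text)           # 7. keep lowercase letters and spaces
    text = re.sub(r" +", " ", text).strip()       # 8. collapse extra spaces
    text = re.sub(r"(\w)\1+\b", r"\1", text)      # 9. collapse trailing repetitions
    return text, mentions


print(preprocess_sketch("RT @Juan: Holaaaa a todos!!! https://t.co/abc #Saludos"))
```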
- static remove_RT(string: str) -> str
Removes the “RT” prefix from tweets.
This function removes the “RT” prefix that usually appears at the beginning of retweets. It accounts for a varying amount of whitespace after “RT”.
- Args:
- string (str):
The tweet text to be processed.
- Returns:
- str:
The processed tweet text with the “RT” prefix removed if it appears at the beginning.
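A prefix removal of this kind can be expressed as a single anchored regular expression. The following is a minimal sketch with a hypothetical function name; the exact pattern used by tidyX may differ.

```python
import re


def strip_rt_prefix(tweet: str) -> str:
    """Remove a leading 'RT' and the whitespace that follows it."""
    # ^RT matches only at the very start; \s+ absorbs any run of whitespace.
    return re.sub(r"^RT\s+", "", tweet)


print(strip_rt_prefix("RT   @user hola"))
```

Because the pattern requires whitespace after "RT", words that merely begin with those letters are left untouched.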
- static remove_accents(string: str, delete_emojis=True) -> str
Removes accents and optionally emojis from a string.
This function removes accent marks from characters in a given string. If specified, it can also remove emojis.
- Args:
- string (str):
The input string potentially containing accented characters and/or emojis.
- delete_emojis (bool, optional):
If True, removes emojis from the string. Default is True.
- Returns:
- str:
The string with accented characters and optionally emojis removed.
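One common way to strip accents is Unicode decomposition: each accented character is split into a base letter plus a combining mark, and the marks are then dropped. The sketch below assumes that approach. The function name is hypothetical, and treating the Unicode category "So" (Symbol, other) as "emoji" is a rough approximation that also removes symbols such as the copyright sign; it is not how tidyX is known to detect emojis.

```python
import unicodedata


def strip_accents(text: str, delete_emojis: bool = True) -> str:
    """Drop combining accent marks and, optionally, 'Symbol, other' characters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # an accent mark split off from its base letter
        if delete_emojis and unicodedata.category(ch) == "So":
            continue  # rough emoji approximation (assumption)
        kept.append(ch)
    return "".join(kept)


print(strip_accents("canción"))
```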
- static remove_extra_spaces(string: str) -> str
Removes extra spaces within and surrounding a given string.
This function trims leading and trailing spaces and replaces any occurrence of consecutive spaces between words with a single space.
- Args:
- string (str):
The text that may contain extra spaces.
- Returns:
- str:
The processed text with extra spaces removed.
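The two operations described above (collapse runs of spaces, then trim the ends) can be sketched as follows; the function name is hypothetical.

```python
import re


def squeeze_spaces(text: str) -> str:
    """Collapse runs of two or more spaces, then trim both ends."""
    return re.sub(r" {2,}", " ", text).strip()


print(squeeze_spaces("  hola   mundo  "))
```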
- static remove_hashtags(string: str) -> str
Removes hashtags from a given string.
This function scans the string and removes any text that starts with a ‘#’ and is followed by alphanumeric characters.
- Args:
- string (str):
The text that may contain hashtags.
- Returns:
- str:
The processed text with hashtags removed.
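As a sketch of the matching rule, the pattern below removes "#" together with the word characters that follow it. The function name is hypothetical, and \w (which also matches underscores) is an assumption about what counts as part of a hashtag.

```python
import re


def strip_hashtags(text: str) -> str:
    """Remove '#' plus the run of word characters that follows it."""
    return re.sub(r"#\w+", "", text)


print(repr(strip_hashtags("me gusta #Python3 mucho")))
```

The surrounding spaces are left in place, which is why a space-collapsing step usually follows.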
- static remove_last_repetition(string: str) -> str
Removes the repetition of the last character in each word of a given string.
In Spanish, no word ends with a repeated character. However, in social media, it is common to emphasize words by repeating the last character. This function cleans the text to remove such repetitions.
For example, the input “Holaaaa amigooo” would be transformed to “Hola amigo”.
- Args:
- string (str):
The text to be processed.
- Returns:
- str:
The processed text with the last character of each word de-duplicated.
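A backreference anchored at a word boundary is one way to express this rule. The sketch below is an illustration with a hypothetical function name, not the tidyX implementation.

```python
import re


def collapse_trailing_repeats(text: str) -> str:
    """Reduce a repeated final character of each word to a single one."""
    # (\w)\1+ is a character followed by copies of itself; \b requires the
    # run to end at a word boundary, so repeats inside a word are untouched.
    return re.sub(r"(\w)\1+\b", r"\1", text)


print(collapse_trailing_repeats("Holaaaa amigooo"))
```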
- static remove_mentions(string: str, extract=True)
Removes mentions (e.g., @username) from a given tweet string.
This function scans the string and removes any text that starts with ‘@’ followed by the username. Optionally, it can also return a list of unique mentions.
- Args:
- string (str):
The tweet text that may contain mentions.
- extract (bool, optional):
If True, returns a list of unique mentions. Defaults to True.
- Returns:
- str:
The processed tweet text with mentions removed.
- list:
If extract is True, returns a list of unique mentioned accounts in the tweet.
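The two return shapes can be sketched as below. The function name is hypothetical, and keeping the leading "@" in the extracted mentions, as well as preserving first-seen order, are assumptions made for the example.

```python
import re


def strip_mentions(tweet: str, extract: bool = True):
    """Remove @mentions; optionally also return the unique mentions found."""
    mentions = re.findall(r"@\w+", tweet)
    cleaned = re.sub(r"@\w+", "", tweet)
    if extract:
        # dict.fromkeys de-duplicates while keeping first-seen order.
        return cleaned, list(dict.fromkeys(mentions))
    return cleaned


cleaned, found = strip_mentions("@ana hola @luis y @ana")
print(repr(cleaned), found)
```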
- static remove_repetitions(string: str, exceptions=['r', 'l', 'n', 'c', 'a', 'e', 'o']) -> str
Removes consecutive repeated characters in a given string, with some optional exceptions.
For example, when no characters are exempted, the string ‘coooroosooo’ would be transformed to ‘coroso’.
- Args:
- string (str):
The text to be processed.
- exceptions (list, optional):
A list of characters that may appear twice in a row (that is, repeated once) without being removed. Defaults to ['r', 'l', 'n', 'c', 'a', 'e', 'o'].
- Returns:
- str:
The processed text with consecutive repetitions removed, except for characters in the exceptions list.
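The effect of the exceptions list is easiest to see side by side. The sketch below uses a hypothetical function name and assumes that exempt characters are capped at two in a row while all others are reduced to one; it illustrates the documented behaviour and is not the tidyX source.

```python
import re


def collapse_repetitions(text: str, exceptions=("r", "l", "n", "c", "a", "e", "o")) -> str:
    """Collapse runs of a repeated character, allowing exempt ones to double."""
    def shrink(match):
        char = match.group(1)
        # Exempt characters keep one repetition (two in a row); others keep none.
        return char * 2 if char in exceptions else char

    return re.sub(r"(.)\1+", shrink, text)


print(collapse_repetitions("coooroosooo", exceptions=()))
print(collapse_repetitions("coooroosooo"))
```

With the default exceptions, legitimate Spanish doubles such as "rr" in "carro" or "ll" in "llama" survive.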
- static remove_special_characters(string: str, allow_numbers: bool = False) -> str
Removes all characters from a string except for lowercase letters and spaces.
This function scans the string and removes any character that is not a lowercase letter or a space. Optionally, numbers can be retained. As a result, punctuation marks, exclamation marks, special characters, and uppercase letters are eliminated.
- Args:
- string (str):
The text that may contain special characters.
- allow_numbers (bool):
Whether to allow numbers in the string. Default is False.
- Returns:
- str:
The processed text with special characters removed.
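A negated character class captures this filter compactly. In the sketch below the function name is hypothetical, and restricting "lowercase letters" to ASCII a-z is an assumption; the library may treat characters such as "ñ" differently.

```python
import re


def keep_letters(text: str, allow_numbers: bool = False) -> str:
    """Keep only lowercase ASCII letters and spaces (plus digits if allowed)."""
    pattern = r"[^a-z0-9 ]" if allow_numbers else r"[^a-z ]"
    return re.sub(pattern, "", text)


print(repr(keep_letters("hola, mundo! 2024")))
print(repr(keep_letters("hola, mundo! 2024", allow_numbers=True)))
```

Since uppercase letters are removed rather than converted, the text should be lowercased before this step.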
- static remove_urls(string: str) -> str
Removes all URLs that start with “http” from a given string.
This function scans the entire string and removes any sequence of characters that starts with “http” and continues until a space or end of line is encountered.
- Args:
- string (str):
The text to be processed.
- Returns:
- str:
The processed text with URLs removed.
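The rule "starts with http and runs until whitespace or end of line" maps directly onto a short pattern. This is a minimal sketch with a hypothetical function name.

```python
import re


def strip_urls(text: str) -> str:
    """Remove every run of non-space characters that starts with 'http'."""
    return re.sub(r"http\S+", "", text)


print(repr(strip_urls("mira https://t.co/abc ahora")))
```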
- static remove_words(string: str, bag_of_words) -> str
Removes all occurrences of words listed in bag_of_words from the string.
This function is particularly useful for removing stopwords. Choose the entries of bag_of_words with care: matching is exact, so variations of a word (for example, plurals or misspellings) are removed only if they are themselves listed in bag_of_words.
- Args:
- string (str):
The input string containing unwanted words.
- bag_of_words (list):
List of words to be removed from the string.
- Returns:
- str:
The string with unwanted words removed.
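The exact-match behaviour can be seen in a token-based sketch. The function name is hypothetical, and splitting on whitespace before comparing is an assumption; tidyX may implement the match differently.

```python
def drop_words(text: str, bag_of_words) -> str:
    """Remove tokens that exactly match an entry of bag_of_words."""
    unwanted = set(bag_of_words)
    return " ".join(token for token in text.split() if token not in unwanted)


print(drop_words("la casa y las casas", ["la", "y", "casa"]))
```

Here "casas" and "las" survive because only "casa" and "la" are listed.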
- static space_between_emojis(string: str) -> str
Inserts spaces around emojis within a string.
This function adds a space before and after each emoji character in the given string to ensure that emojis are separated from other text or emojis. Extra spaces are then removed.
- Args:
- string (str):
The text that may contain emojis.
- Returns:
- str:
The processed text with spaces inserted around each emoji.
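The pad-then-collapse approach described above can be sketched as follows. The function name is hypothetical, and the two code-point ranges used to recognise emojis are a rough approximation chosen for the example; they do not cover every emoji and are not taken from tidyX.

```python
import re

# Assumption: a coarse set of emoji ranges (pictographs plus older symbols).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def space_out_emojis(text: str) -> str:
    """Surround each matched emoji with spaces, then collapse extra spaces."""
    padded = EMOJI_PATTERN.sub(lambda m: " " + m.group(0) + " ", text)
    return re.sub(r" {2,}", " ", padded).strip()


print(space_out_emojis("hola\U0001F600\U0001F600mundo"))
```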
- static unnest_tokens(df: DataFrame, input_column: str, create_id: bool = True) -> DataFrame
Flattens a DataFrame by tokenizing a specified column.
This function takes a pandas DataFrame and a column name to tokenize. Each token becomes a row in the resulting DataFrame. Tokens are separated by spaces.
- Args:
- df (DataFrame):
The input DataFrame to be flattened.
- input_column (str):
The name of the column to tokenize.
- create_id (bool, optional):
If True, adds an “id” column based on the DataFrame’s index. Defaults to True.
- Returns:
- DataFrame:
A DataFrame where each row corresponds to a token.
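The one-row-per-token shape can be illustrated without pandas. In the sketch below a list of dictionaries stands in for the DataFrame rows and the list position stands in for the index; the function name and this representation are assumptions made to keep the example self-contained.

```python
def unnest_rows(rows, input_column: str, create_id: bool = True):
    """Expand each row into one row per space-separated token."""
    out = []
    for index, row in enumerate(rows):
        for token in row[input_column].split(" "):
            new_row = dict(row)            # copy the other columns unchanged
            new_row[input_column] = token  # replace the text with one token
            if create_id:
                new_row["id"] = index      # remember the originating row
            out.append(new_row)
    return out


rows = [{"text": "hola mundo", "user": "ana"}, {"text": "adios", "user": "luis"}]
for row in unnest_rows(rows, "text"):
    print(row)
```

The "id" value lets tokens be grouped back to the document they came from.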