Welcome to tidyX’s documentation!

tidyX is a dedicated Python package designed for cleaning and preprocessing raw tweets and other text. It prepares them for Machine Learning (ML) applications, especially in Natural Language Processing (NLP) tasks like sentiment analysis, topic modeling, and text classification.

Features

  • Remove unwanted characters:

    This feature strips tweets of unnecessary clutter such as special characters (remove_special_characters()), emojis and/or accents (remove_accents(string, delete_emojis=True)), and URLs (remove_urls()), making the text more suitable for ML models. Under the hood, these functions rely on regex patterns, so they are particularly helpful for beginners who would otherwise write those patterns themselves, as well as for experienced users looking for a ready-made pipeline (preprocess()) to save time. A short sketch of these cleaning functions appears after this list.

  • Stemming and Lemmatizing:

    One of the most crucial steps in preparing text for NLP applications is homogenizing words to a common root. This process can be particularly time-consuming for the Spanish language due to the lack of dedicated Python libraries. In this documentation, we’ve compiled the best models for lemmatization (spanish_lemmatizer()) and stemming (spanish_stemmer()).

  • Dependency Parsing Visualization:

    Incorporates visualization tools that display dependency parses, facilitating linguistic analysis and feature engineering (dependency_parse_visualizer_text()).

  • Other utilities:

    For example, we developed a function to group similar words, addressing the common issue of misspelled words in social media texts. The aim is to build bags of terms that cluster related variants of the same word, which is achievable using the create_bol() function. Furthermore, to avoid repeating preprocessing operations like stemming and lemmatizing on identical words across different texts, we introduced unnest_tokens(). This function generates a dataframe mapping each unique term to an ID for each text. Our tutorial provides guidance on leveraging this function efficiently.
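
As a quick illustration of the cleaning helpers mentioned above, here is a minimal sketch that chains a few of them on a single string. It assumes remove_urls() and remove_special_characters() are static methods of TextPreprocessor that take and return a string, mirroring remove_accents() shown earlier; check the API reference for the exact signatures.

from tidyX import TextPreprocessor as tp

text = "Visit https://example.com for más información 😃"

# Assumed usage: each helper takes a string and returns the cleaned string
text = tp.remove_urls(text)                         # drop the URL
text = tp.remove_accents(text, delete_emojis=True)  # strip accents and emojis
text = tp.remove_special_characters(text)           # strip remaining special characters

print(text)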

For a more comprehensive explanation of all functions and a list of examples, visit the “HOW TO USE THE PACKAGE” section on our website.

Installation

Install the package using pip:

pip install tidyX

Ensure that you have the necessary dependencies, such as spaCy and the corresponding language models:

python -m spacy download en_core_web_sm  # or another language model
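
If you plan to use the Spanish-specific helpers such as spanish_lemmatizer(), you will likely also need a Spanish model, for example:

python -m spacy download es_core_news_sm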

Usage

Here are some basic examples demonstrating how to use tidyX:

Text Preprocessing

The preprocess() function is a powerful tool that performs a comprehensive cleaning and preprocessing of tweet texts to prepare them for further analysis or machine learning tasks. It is designed to be highly configurable, allowing you to choose which preprocessing steps to apply.

Here’s how to use the preprocess method:

from tidyX import TextPreprocessor as tp

# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"

# Applying the preprocess method
cleaned_text, mentions = tp.preprocess(raw_tweet)

# Printing the cleaned text and extracted mentions
print("Cleaned Text:", cleaned_text)
print("Mentions:", mentions)

In this example, the preprocess method is applied to a raw tweet. The method:

  • Removes the ‘RT’ prefix from retweeted tweets
  • Converts the text to lowercase
  • Removes accents and emojis (configurable)
  • Extracts and removes mentions, returning them as a list if required
  • Removes URLs, hashtags, and special characters
  • Eliminates extra spaces and consecutive repeated characters (with configurable exceptions)

This versatile method allows for a tailored preprocessing approach, adapting to the specific needs of your text analysis or NLP tasks.

Bag of Lemmas (BoL) Creation

The create_bol() function creates “bags” that group similar lemmas by calculating the Levenshtein distance between words, so that likely variations or misspellings of the same word end up in the same bag. This is especially useful for handling the misspelled words often found in social media data.

import numpy as np
import pandas as pd
from tidyX import TextPreprocessor as tp

# Example lemmas
lemmas = np.array(["running", "runing", "jogging", "joging"])

# Creating bags of lemmas
bol_df = tp.create_bol(lemmas)

print(bol_df)

Spanish Lemmatizer

The spanish_lemmatizer() static method lemmatizes Spanish tokens using a spaCy language model, returning a cleaned and lemmatized version of the input token.
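
As a minimal sketch, the call below assumes the method takes a single token together with a loaded spaCy Spanish model; the exact signature may differ, so consult the API reference.

import spacy
from tidyX import TextPreprocessor as tp

# Requires a Spanish model, e.g.: python -m spacy download es_core_news_sm
model = spacy.load("es_core_news_sm")

# Assumed call: lemmatize a single Spanish token using the loaded model
lemma = tp.spanish_lemmatizer("corriendo", model)
print(lemma)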

Text Visualization

The TextVisualizer class is integrated within the tidyX package, offering visualization capabilities for dependency parsing and named entities in textual data, leveraging the spaCy library’s displacy visualizer.

Dependency Parse Visualizer

The dependency_parse_visualizer_text method visualizes the syntactic dependency parse or named entities within a given document, producing a graphical representation that aids in understanding the linguistic annotations and structure of the text.
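
As a minimal sketch, the call below assumes the method accepts a raw document string and a loaded spaCy model and renders the displacy visualization (for example inside a Jupyter notebook); parameter names are illustrative, so consult the API reference for the exact signature.

import spacy
from tidyX import TextVisualizer as tv

model = spacy.load("en_core_web_sm")
document = "tidyX helps clean tweets before running NLP models."

# Assumed call: display the syntactic dependency parse of the document
tv.dependency_parse_visualizer_text(document, model)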

Tutorial

In the tutorial below, you will find examples for using each function within our package. Additionally, there’s a tutorial on Topic Modelling utilizing this package.