Metadata-Version: 2.4
Name: datascrub
Version: 2.0.1
Summary: A Python package for cleaning and preprocessing data in pandas DataFrames
Home-page: https://github.com/samuelshine/cleanmydata
Author: Alex Benjamin
Author-email: samuelshine112003@gmail.com
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: emoji
Requires-Dist: googletrans==3.1.0a0
Requires-Dist: scikit-learn==1.8.0
Requires-Dist: pyarrow>=12.0.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# DataScrub (v2.0)

DataScrub is a data cleaning and feature engineering toolkit for pandas DataFrames. Designed for speed and memory efficiency, it covers tasks ranging from basic string formatting to out-of-core pipelines for larger-than-memory data, and it integrates with scikit-learn.

## Key Features

1. **Massive Data Optimization**: 
   - **PyArrow Backend**: Enable the C++ PyArrow backend (`use_pyarrow=True`) to process string columns significantly faster.
   - **Out-of-Core Processing**: Process files larger than your RAM. Passing a file path with `chunksize=100000` creates a generator that cleans data in batches and writes directly to disk.
   - **Memory Downcasting**: Use `.downcast()` to automatically shrink `float64`/`int64` columns to `float32`/`int8` where possible, drastically reducing memory usage.
2. **Scikit-Learn Integration**: `DataClean` extends `BaseEstimator` and `TransformerMixin`, so it acts as a standard transformer. You can drop it directly into a `sklearn.pipeline.Pipeline()`.
3. **Advanced Feature Engineering**: 
   - Regex-based text cleaning (removes HTML, URLs, and punctuation).
   - Date feature extraction (splits `YYYY-MM-DD` into Year, Month, Day, and Is_Weekend columns).
   - Built-in categorical encodings (One-Hot, Label, and Target encoding).
4. **Data Profiling**: Run `.summary()` to quickly print out sparsity, memory usage, variance, and skewness for all columns.

## Installation

DataScrub natively supports Python 3.10+ and integrates seamlessly with `pandas` 2.0+ and `scikit-learn`.

```shell
pip install datascrub
```

## Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the `DataClean` class:

```python
from datascrub.cleaner import DataClean
import pandas as pd

# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds

# Execute a massive pipeline combining imputation, outlier clipping, and feature engineering
cleaned_data = cleaner.prep(
    clean='all', 
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'}, 
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# View telemetry
cleaner.summary()
```

## Scikit-Learn Pipeline Usage

```python
from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Set up the cleaner with your chosen parameters
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Include it in your pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit and predict as usual
# pipe.fit(X_train, y_train)
```
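
For readers unfamiliar with the transformer contract that makes this possible, here is a minimal sketch of a class built on `BaseEstimator` and `TransformerMixin`. The `TrimLowercase` class is a hypothetical stand-in written for illustration, not DataScrub's `DataClean`, and its `columns` parameter is an assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class TrimLowercase(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: trims and lowercases the given string columns."""

    def __init__(self, columns):
        # scikit-learn expects __init__ to only store its arguments.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn for this stateless cleaning step.
        return self

    def transform(self, X):
        out = X.copy()
        for col in self.columns:
            out[col] = out[col].str.strip().str.lower()
        return out


X = pd.DataFrame({"Status": [" Active", "INACTIVE ", "active"]})
pipe = Pipeline([("scrubber", TrimLowercase(columns=["Status"]))])
cleaned = pipe.fit_transform(X)
print(cleaned["Status"].tolist())
```

Any object that follows this `fit`/`transform` pattern can be placed ahead of an estimator in a `Pipeline`, which is the property the example above relies on.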

## API & Configuration Guide

Below is a quick reference for the arguments you can pass to the `DataClean` constructor and the `prep()` method.

### Initialization & Memory Handlers
When creating a `DataClean` instance, you can pass the following I/O and memory arguments:
- `obj` *(pd.DataFrame, str)*: Provide a Pandas DataFrame in memory, or a file path (string) pointing to a `.csv` or `.xlsx` file. To use out-of-core chunking, you must pass a file path instead of a DataFrame.
- `use_pyarrow` *(bool)*: Set to `True` to convert the underlying datatypes to PyArrow, which speeds up string operations and reduces the memory footprint. Defaults to `False`.
- `chunksize` *(int)*: Pass an integer to process large files out-of-core. DataScrub reads the file in chunks of this size, processes them independently, and appends the results to a `*_cleaned.csv` file on disk. Chunking is enabled automatically when the file is larger than 100 MB.
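
The chunked read-clean-append pattern described for `chunksize` can be sketched with plain pandas. This is an illustration of the general technique, not DataScrub's internal code: the helper names `clean_in_chunks` and `strip_and_lower`, the file names, and the sample data are all assumptions.

```python
import os
import tempfile

import pandas as pd


def clean_in_chunks(src_path, dst_path, chunksize, clean_fn):
    """Read src_path in chunks, clean each chunk, and append it to dst_path."""
    rows_written = 0
    reader = pd.read_csv(src_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        cleaned = clean_fn(chunk)
        # Write the header with the first chunk only, then append.
        cleaned.to_csv(
            dst_path,
            mode="w" if i == 0 else "a",
            header=(i == 0),
            index=False,
        )
        rows_written += len(cleaned)
    return rows_written


def strip_and_lower(chunk):
    """Hypothetical per-chunk cleaner: trim and lowercase the 'name' column."""
    out = chunk.copy()
    out["name"] = out["name"].str.strip().str.lower()
    return out


# Build a small demo file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "data.csv")
dst = os.path.join(tmp_dir, "data_cleaned.csv")
pd.DataFrame(
    {"name": [" Ann ", "BOB", " Cy", "Dee ", "EVE"], "age": [31, 42, 27, 55, 38]}
).to_csv(src, index=False)

total_rows = clean_in_chunks(src, dst, chunksize=2, clean_fn=strip_and_lower)
result = pd.read_csv(dst)
print(total_rows, result["name"].tolist())
```

Because only one chunk is held in memory at a time, peak memory depends on `chunksize` rather than on the file size. The trade-off is that each chunk is cleaned independently, so statistics such as a column median are computed per chunk unless they are gathered in a separate pass.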

### Telemetry & Inspection
- `.summary()`: Prints a profile of your DataFrame. Shows Memory Usage (MB), dataset shape, sparsity (% of NaN values), variance, and skewness.
- `.downcast()`: Scans the numeric columns and downcasts them to the smallest possible type (e.g., changing `float64` to `float32`), saving RAM.
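
To show what numeric downcasting does, here is a minimal sketch using `pandas.to_numeric` with its `downcast` argument. It illustrates the general technique and is not necessarily how `.downcast()` is implemented; the `downcast_numeric` helper and the sample columns are assumptions.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with integer and float columns shrunk to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


df = pd.DataFrame(
    {
        "count": np.array([1, 2, 3], dtype="int64"),
        "price": np.array([0.5, 1.25, 2.0], dtype="float64"),
    }
)
small = downcast_numeric(df)

print(small.dtypes)
print(df.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```

Note that `float32` carries roughly seven significant decimal digits, so downcasting floats can lose precision on values that need more.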

### `prep()` Arguments
The `prep()` method handles the actual data cleaning and transformation pipeline. You can provide the following arguments:

#### General Cleaning
- `clean` *(str, list)*: Defaults to `'all'`. Cleans up string columns by trimming whitespace, casting to lowercase, and handling emojis.

#### Missing Values
- `missing_values` *(dict)*: Dictionary mapping column names to your chosen imputation technique.
   - Options include: `'drop'`, `'fill with mean'`, `'fill with median'`, `'fill with mode'`, `'fill with backward fill along columns/rows'`.
   - **ML Strategies**: Use `'knn-imputer'` (utilizes sklearn's KNNImputer with 5 neighbors) or `'iterative-imputer'` (sklearn's IterativeImputer) for feature-aware filling.
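
The difference between a simple fill and a feature-aware fill can be sketched with pandas and scikit-learn directly. This is an illustration under assumptions, not DataScrub's internals: the sample data is invented, and `n_neighbors=2` is chosen only because the demo frame is tiny (the description above states that DataScrub uses 5 neighbors).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame(
    {
        "Age": [25.0, np.nan, 35.0, 40.0],
        "Salary": [50000.0, 52000.0, np.nan, 80000.0],
    }
)

# Roughly what 'fill with median' means: one constant per column.
age_filled = df["Age"].fillna(df["Age"].median())

# Roughly what 'knn-imputer' means: each gap is filled from the rows
# that are closest on the other, observed features.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(age_filled.tolist())
print(imputed)
```

A median fill ignores the other columns, while the KNN fill uses them to pick similar rows, which is why it is described as feature-aware.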

#### Transformations
- `parse_date` *(list)*: List of columns to cast to `datetime64[ns]` (expects the format `YYYY-MM-DD`).
- `extract_datetime` *(list)*: List of date columns to explode into integer features: `[Col]_Year`, `[Col]_Month`, `[Col]_Day`, and `[Col]_Is_Weekend`.
- `explode` *(dict)*: Dict for splitting delimited string values into separate rows. Example: `{'Tags': ','}` turns a single `'A,B'` row into two rows: `'A'` and `'B'`.
- `translate_column_names` *(dict)*: Maps columns to boolean values for translation to English via `googletrans`. Set `True` to overwrite the existing column, or `False` to create a new `[Col]_translated` column.
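
The date and `explode` transformations above correspond to standard pandas operations. The following sketch reproduces them by hand on invented data; the column names `Signup` and `Tags` are assumptions, and the integer encoding of the weekend flag is a guess based on the "integer features" wording.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Signup": ["2024-01-06", "2024-03-11"],  # a Saturday and a Monday
        "Tags": ["A,B", "C"],
    }
)

# Equivalent of parse_date: cast YYYY-MM-DD strings to datetimes.
df["Signup"] = pd.to_datetime(df["Signup"], format="%Y-%m-%d")

# Equivalent of extract_datetime: derive integer features from the date.
df["Signup_Year"] = df["Signup"].dt.year
df["Signup_Month"] = df["Signup"].dt.month
df["Signup_Day"] = df["Signup"].dt.day
# dayofweek is 0 for Monday through 6 for Sunday.
df["Signup_Is_Weekend"] = (df["Signup"].dt.dayofweek >= 5).astype(int)

# Equivalent of explode={'Tags': ','}: one output row per delimited value.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
    .explode("Tags")
    .reset_index(drop=True)
)

print(df[["Signup_Year", "Signup_Month", "Signup_Day", "Signup_Is_Weekend"]])
print(exploded[["Tags", "Signup_Is_Weekend"]])
```

Exploding duplicates the other columns for every new row, so row counts and any per-row aggregates change after this step.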

#### Feature Engineering & Outliers
- `noise_columns` *(list)*: List of columns where you want to strip out HTML tags, URLs, and arbitrary punctuation via Regex.
- `outliers_method` *(str)*: Technique used to cap extreme outliers. Options are `'iqr'` and `'z-score'`.
- `outlier_columns` *(list)*: Columns to check for outliers. Defaults to `'all'` numeric columns.
- `perform_scaling_normalization_bool` *(bool)*: If `True`, applies a Box-Cox transformation to normalise numeric distributions. Note that Box-Cox is defined only for strictly positive values.
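
As a rough illustration of two of the steps above, the sketch below strips noise from text with regular expressions and caps outliers with the IQR rule. The patterns, the helper names `strip_noise` and `clip_iqr`, and the conventional fence multiplier `k=1.5` are assumptions; DataScrub's own patterns and thresholds may differ.

```python
import re

import pandas as pd

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT = re.compile(r"[^\w\s]")


def strip_noise(text):
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = PUNCT.sub("", text)
    return " ".join(text.split())


def clip_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the fence values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = strip_noise("<p>Hello, world! See https://example.com/page now.</p>")
clipped = clip_iqr(pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0]))

print(bio)
print(clipped.tolist())
```

Capping keeps every row and only pulls extreme values back to the fence, which is the behaviour the word "cap" in the `outliers_method` description suggests, as opposed to dropping the rows.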

#### Encodings
- `encoding_method` *(str)*: Strategy for transforming strings/categories into numeric features via `scikit-learn`.
   - `'one-hot'`: Creates dummy variables (drops the first column to avoid collinearity).
   - `'label'`: Assigns an integer ID to each unique text value.
   - `'target'`: Uses Target Encoding based on the target column's distribution.
- `encoding_columns` *(list)*: Columns to encode. Defaults to `'all'`.
- `target_col` *(str)*: Required if you are utilizing `'target'` encoding.
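
To make the three strategies concrete, here is a sketch that produces each encoding with plain pandas. It swaps in pandas equivalents for illustration rather than the scikit-learn encoders mentioned above, and the target encoding shown is a naive per-category mean with no smoothing or cross-fitting. The sample columns `City` and `Bought` are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"City": ["Paris", "Lima", "Paris", "Oslo"], "Bought": [1, 0, 1, 0]}
)

# One-hot: one indicator column per category, dropping the first
# category (alphabetically "Lima") to avoid collinearity.
one_hot = pd.get_dummies(df["City"], prefix="City", drop_first=True, dtype=int)

# Label: an integer ID per unique value, assigned in sorted order.
codes, uniques = pd.factorize(df["City"], sort=True)

# Target (naive): replace each category with the mean of the target column.
target_means = df.groupby("City")["Bought"].mean()
target_encoded = df["City"].map(target_means)

print(list(one_hot.columns))
print(codes.tolist(), list(uniques))
print(target_encoded.tolist())
```

The naive target mean shown here is computed on the same rows it encodes, so it leaks the target into the features. Production-grade target encoders typically add smoothing and cross-fitting to limit that.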

## Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the [GitHub repository](https://github.com/samuelshine/datascrub).

## License

This project is licensed under the MIT License. See the [LICENSE](https://github.com/samuelshine/CleanMyData/blob/main/LICENSE.txt) file for more information.
