Metadata-Version: 2.4
Name: datascrub
Version: 2.0.0
Summary: A Python package for cleaning and preprocessing data in pandas DataFrames
Home-page: https://github.com/samuelshine/cleanmydata
Author: Alex Benjamin
Author-email: samuelshine112003@gmail.com
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: emoji
Requires-Dist: googletrans==3.1.0a0
Requires-Dist: scikit-learn==1.8.0
Requires-Dist: pyarrow>=12.0.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

# DataScrub (v2.0)

DataScrub is a Python package for data cleaning, feature engineering, and memory-optimized preprocessing of pandas DataFrames. It covers everything from basic string formatting to out-of-core pipelines for files larger than memory, and it integrates with scikit-learn.

## Key Features

1. **Massive Data Optimization**: 
   - **PyArrow Backend**: Opt in to the PyArrow (C++) backend with `use_pyarrow=True` to accelerate string processing by 3x-4x.
   - **Out-of-Core Chunking**: Pass the path of a file larger than RAM. `DataClean('10GB_file.csv', chunksize=100000)` creates a generator pipeline that cleans the file chunk by chunk and writes the result to disk instead of loading everything into memory.
   - **Memory Downcasting**: Automatically detects strictly bounded `float64`/`int64` arrays and shrinks them to `float32`/`int8` via `.downcast()`.
2. **Scikit-Learn Integration**: `DataClean` extends `BaseEstimator` and `TransformerMixin`. It cleanly passes `fit_transform()` validations, meaning you can place DataScrub directly inside an `sklearn.pipeline.Pipeline()`.
3. **Advanced Feature Engineering**: 
   - Regex-based noise stripping (HTML, URLs, punctuation).
   - Datetime unpacking (`YYYY-MM-DD` to Year, Month, Day, Is_Weekend features).
   - Categorical encodings for machine learning (one-hot, label, and target encoding).
4. **Data Profiling**: Generate sparsity, variance, skewness, and memory usage reports using `.summary()`.

## Installation

DataScrub supports Python 3.10+ and works with `pandas` 2.0+ and `scikit-learn`.

```shell
pip install datascrub
```

## Basic Usage

To use DataScrub in your Python projects, import the package and create an instance of the `DataClean` class:

```python
from datascrub.cleaner import DataClean
import pandas as pd

# Load standard in-memory DataFrame (or pass the string path directly for chunking!)
df = pd.read_csv("data.csv")
cleaner = DataClean(df, use_pyarrow=True) # Opt-in for C++ speeds

# Run one pipeline that combines imputation, outlier clipping, noise removal, and encoding
cleaned_data = cleaner.prep(
    clean='all', 
    missing_values={'Age': 'fill with median', 'Salary': 'knn-imputer'}, 
    outliers_method='iqr',
    noise_columns=['User_Bio'],
    encoding_method='one-hot',
    encoding_columns=['City']
)

# View telemetry
cleaner.summary()
```

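## Scikit-Learn Pipeline Usage

Because `DataClean` extends `BaseEstimator` and `TransformerMixin`, it can be used as a step in an `sklearn.pipeline.Pipeline()`: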
```python
from sklearn.pipeline import Pipeline
from datascrub.cleaner import DataClean
from sklearn.ensemble import RandomForestClassifier

# Configure DataClean with keyword arguments
cleaner = DataClean(clean='all', encoding_method='label', encoding_columns=['Status'])

# Bind it directly into an ML pipeline
pipe = Pipeline([
    ('scrubber', cleaner),
    ('classifier', RandomForestClassifier())
])

# Fit on your own training data (X_train and y_train are not defined in this snippet)
# pipe.fit(X_train, y_train)
```
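
To make the transformer contract concrete, here is a minimal sketch of a custom pipeline step. `LabelEncodeColumn` is a hypothetical stand-in written for this example, not DataScrub's `DataClean`; it only assumes the standard scikit-learn `fit`/`transform` interface, and the toy data is made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class LabelEncodeColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: normalizes one text column and label-encodes it."""

    def __init__(self, column="Status"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the category -> integer mapping from the training data only.
        cleaned = X[self.column].str.strip().str.lower()
        self.mapping_ = {value: i for i, value in enumerate(sorted(cleaned.unique()))}
        return self

    def transform(self, X):
        out = X.copy()
        cleaned = out[self.column].str.strip().str.lower()
        # Categories unseen during fit are mapped to -1.
        out[self.column] = cleaned.map(self.mapping_).fillna(-1).astype(int)
        return out


X_train = pd.DataFrame({
    "Status": [" Active", "inactive ", "ACTIVE", "Inactive"],
    "Visits": [9, 1, 8, 2],
})
y_train = [1, 0, 1, 0]

pipe = Pipeline([
    ("scrubber", LabelEncodeColumn(column="Status")),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

predictions = pipe.predict(X_train)
encoded = pipe.named_steps["scrubber"].transform(X_train)
```

Learning the mapping in `fit` and applying it in `transform` keeps test-time data from influencing the encoding, which is the main reason to place cleaning inside the pipeline rather than before it.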

## Comprehensive API Guide

Below is a reference to the options you can invoke through the `prep()` orchestrator or by passing keyword arguments to `DataClean(obj, **kwargs)`.

### Memory & File Handling Configurations
When initializing `DataClean()`, the following IO options are available:
- `obj` *(pd.DataFrame, str)*: Pass an in-memory pandas DataFrame, or provide a string path to a local `.csv` or `.xlsx` file. A string file path is required to use chunking.
- `use_pyarrow` *(bool)*: Set to `True` to switch to the PyArrow (C++) backend, which speeds up string operations and lowers memory usage. Defaults to `False`.
- `chunksize` *(int)*: If provided, DataScrub works as an out-of-core generator: it processes the target `.csv` file in batches of `chunksize` rows and writes the cleaned output to a single `*_cleaned.csv` file on disk. Chunking is enabled automatically if the dataset is larger than 100 MB.
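
The chunked workflow can be approximated with plain pandas. The sketch below is an illustration of the idea, not DataScrub's internals: `clean_csv_in_chunks` is a hypothetical helper, the cleaning step is a simple trim-and-lowercase placeholder, and the file names are made up.

```python
import os
import tempfile

import pandas as pd


def clean_csv_in_chunks(src_path, dst_path, chunksize):
    """Clean a CSV chunk by chunk so the whole file never sits in memory."""
    rows_written = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for i, chunk in enumerate(pd.read_csv(src_path, chunksize=chunksize)):
        # Placeholder cleaning step: trim and lowercase every string column.
        for col in chunk.columns:
            if pd.api.types.is_string_dtype(chunk[col]):
                chunk[col] = chunk[col].str.strip().str.lower()
        # Write the header with the first chunk, then append the rest.
        chunk.to_csv(dst_path, mode="w" if i == 0 else "a", header=(i == 0), index=False)
        rows_written += len(chunk)
    return rows_written


# Tiny demo file standing in for a CSV that is too large for RAM.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.csv")
dst = os.path.join(workdir, "data_cleaned.csv")
pd.DataFrame(
    {"name": [" Ann ", "BOB", " Cy", "Dee "], "age": [31, 45, 27, 52]}
).to_csv(src, index=False)

total = clean_csv_in_chunks(src, dst, chunksize=2)
result = pd.read_csv(dst)
```

Peak memory is bounded by one chunk rather than by the file size, at the cost that steps needing global statistics (a column median, for example) require an extra pass or an approximation.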

### Telemetry Profiling
- `.summary()`: Prints a DataFrame profile: memory usage (in MB), dataset shape, sparsity (percentage of NaNs), variance, and skewness across all columns.
- `.downcast()`: Scans the DataFrame and compresses `float64` / `int64` columns to the smallest dtype that can hold their values (e.g., `int8`, `float32`).
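
As a rough picture of what downcasting does, the following sketch uses `pandas.to_numeric` with its `downcast` argument. `downcast_numeric` is a hypothetical helper written for this example; DataScrub's own `.downcast()` may choose dtypes differently.

```python
import pandas as pd

df = pd.DataFrame({
    "small_int": [1, 2, 3, 100],       # values fit in int8
    "big_int": [1, 2, 3, 100_000],     # too large for int16, needs int32
    "ratio": [0.5, 1.5, 2.5, 3.5],     # float64 by default
})


def downcast_numeric(frame):
    """Return a copy with each numeric column stored in the smallest fitting dtype."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


small = downcast_numeric(df)

# Compare the memory footprint before and after, in MB.
before_mb = df.memory_usage(deep=True).sum() / 1024**2
after_mb = small.memory_usage(deep=True).sum() / 1024**2
print(small.dtypes)
print(f"{before_mb:.6f} MB -> {after_mb:.6f} MB")
```

Note that `float32` keeps only about seven significant decimal digits, so downcasting floats trades precision for memory.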

### `prep()` Parameters
The `prep()` method runs the steps of your cleaning pipeline in a defined order. It accepts the following arguments:

#### Standard Cleaning
- `clean` *(str, list)*: Defaults to `'all'`. Cleans string columns by stripping leading/trailing whitespace, converting text to lowercase, and replacing emojis with their text names (demojizing).

#### Imputation & Missing Values
- `missing_values` *(dict)*: Provide a dictionary where the key is the column name and the value is the imputation strategy.
   - Example Options: `'drop'`, `'fill with mean'`, `'fill with median'`, `'fill with mode'`, `'fill with backward fill along columns/rows'`.
   - **Machine Learning Strategies**: Provide `'knn-imputer'` (uses 5 nearest neighbors via scikit-learn) or `'iterative-imputer'` (uses MICE via scikit-learn) for multivariate, feature-aware imputation.
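
The simple strategies map onto ordinary pandas calls. The sketch below shows one plausible reading of the dictionary format; `impute` is a hypothetical helper, only four of the strategy strings listed above are handled, and the sample data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "City": ["Pune", "Pune", None, "Goa"],
    "Salary": [100.0, 200.0, np.nan, 400.0],
})


def impute(frame, strategies):
    """Apply one imputation strategy per column, in dictionary order."""
    out = frame.copy()
    for col, strategy in strategies.items():
        if strategy == "drop":
            # Remove rows where this column is missing.
            out = out.dropna(subset=[col])
        elif strategy == "fill with mean":
            out[col] = out[col].fillna(out[col].mean())
        elif strategy == "fill with median":
            out[col] = out[col].fillna(out[col].median())
        elif strategy == "fill with mode":
            # mode() can return several values; take the first.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
        else:
            raise ValueError(f"unsupported strategy: {strategy}")
    return out


filled = impute(df, {
    "Age": "fill with median",
    "City": "fill with mode",
    "Salary": "drop",
})
```

Here the missing `Age` becomes the median (40.0), the missing `City` becomes the most frequent value, and the row with a missing `Salary` is removed.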

#### Analytics Extrapolations
- `parse_date` *(list)*: Provide a list of columns to cast to `datetime64[ns]`, formatted as `YYYY-MM-DD`.
- `extract_datetime` *(list)*: Expands a date-string column into four integer features for machine learning: `[Col]_Year`, `[Col]_Month`, `[Col]_Day`, and `[Col]_Is_Weekend`.
- `explode` *(dict)*: Splits string values on a delimiter and places each piece on its own row. Example: `{'Tags': ','}` splits `'A,B'` into row 1 `'A'` and row 2 `'B'`.
- `translate_column_names` *(dict)*: Uses the `googletrans` API to asynchronously translate string records to English. Map a column to `True` to overwrite the existing column, or to `False` to generate a dedicated `[Col]_translated` feature.
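
The datetime extraction and row explosion described above can be pictured with plain pandas. This is a sketch of the resulting shape, not DataScrub's code; the `Signup` and `Tags` columns are made-up sample data, and the output column names follow the `[Col]_...` pattern listed above.

```python
import pandas as pd

df = pd.DataFrame({
    "Signup": ["2024-03-01", "2024-03-02"],   # a Friday and a Saturday
    "Tags": ["A,B", "C"],
})

# Parse the date strings, then unpack them into integer features.
dates = pd.to_datetime(df["Signup"], format="%Y-%m-%d")
df["Signup_Year"] = dates.dt.year
df["Signup_Month"] = dates.dt.month
df["Signup_Day"] = dates.dt.day
# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
df["Signup_Is_Weekend"] = (dates.dt.dayofweek >= 5).astype(int)

# Split on the delimiter, then give every piece its own row.
exploded = (
    df.assign(Tags=df["Tags"].str.split(","))
      .explode("Tags")
      .reset_index(drop=True)
)
```

Exploding duplicates the other columns for each new row, so the two-row input becomes three rows with `Tags` equal to `A`, `B`, and `C`.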

#### Feature Engineering & Mitigations
- `noise_columns` *(list)*: A list of columns from which to strip HTML tags (e.g., `<p>foo</p>`), URLs, and punctuation using regex patterns.
- `outliers_method` *(str)*: Mitigates extreme outliers by capping (clipping) values at the boundaries instead of dropping rows, so the row count is preserved. Options are `'iqr'` and `'z-score'`.
- `outlier_columns` *(list)*: The columns to check for outliers. Defaults to `'all'` numeric columns.
- `perform_scaling_normalization_bool` *(bool)*: Applies Box-Cox transformations to numeric columns. Note: Box-Cox requires strictly positive input, so columns containing non-positive values are first shifted to `min+1`.
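
To illustrate noise stripping and IQR capping, here is a minimal sketch. The regex patterns, the `strip_noise` and `clip_iqr` helpers, and the conventional 1.5 multiplier are assumptions made for this example; DataScrub's actual patterns and thresholds may differ.

```python
import re

import pandas as pd

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCTUATION = re.compile(r"[^\w\s]")


def strip_noise(text):
    """Remove HTML tags, URLs, and punctuation, then collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)      # URLs go before punctuation so they match whole
    text = PUNCTUATION.sub("", text)
    return " ".join(text.split())


def clip_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


bio = strip_noise("<p>Hello, world!</p> See https://example.com/page now.")

salary = pd.Series([10.0, 12.0, 11.0, 13.0, 1000.0])
capped = clip_iqr(salary)   # Q1 = 11, Q3 = 13, so the upper fence is 16
```

In this example `bio` becomes `Hello world See now`, and the outlier `1000.0` is capped at `16.0` while the series keeps all five rows.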

#### Machine Learning Encoding
- `encoding_method` *(str)*: Converts string/categorical columns into ML-compatible features, relying on `scikit-learn`. Options:
   - `'one-hot'`: Expands each column into dummy indicator columns (`drop_first=True` enabled).
   - `'label'`: Assigns an integer ID to each unique text value.
   - `'target'`: Target encoding. Replaces each category with a value derived from the target column.
- `encoding_columns` *(list)*: Columns to encode. Defaults to `'all'`.
- `target_col` *(str)*: Required only when using `'target'` encoding.
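
The three encodings can be compared side by side on a toy column. This sketch uses plain pandas for brevity, whereas the package relies on scikit-learn, so details such as smoothing in target encoding may differ; the `City` and `Churned` columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Goa", "Pune", "Goa", "Delhi"],
    "Churned": [1, 0, 0, 1],   # hypothetical binary target column
})

# One-hot: one indicator column per category; drop_first removes the
# first category (Delhi) to avoid a redundant column.
one_hot = pd.get_dummies(df["City"], prefix="City", drop_first=True)

# Label: integer IDs in sorted category order (Delhi=0, Goa=1, Pune=2).
label_ids = df["City"].astype("category").cat.codes

# Target: replace each category with the mean of the target for that
# category (unsmoothed, for illustration only).
target_means = df.groupby("City")["Churned"].mean()
target_encoded = df["City"].map(target_means)
```

Unsmoothed target means leak the label when computed on the same rows that are later used for training, which is why production target encoders typically add smoothing or cross-fitting.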

## Contributing

Contributions to DataScrub are welcome! If you encounter any bugs, have suggestions for improvements, or would like to add new features, please open an issue or submit a pull request on the [GitHub repository](https://github.com/samuelshine/datascrub).

## License

This project is licensed under the MIT License. See the [LICENSE](https://github.com/samuelshine/CleanMyData/blob/main/LICENSE.txt) file for more information.
