Metadata-Version: 2.4
Name: autosweep-preprocessing
Version: 0.1.1
Summary: Flexible tabular data preprocessing utility with a single AutoSweep API
Author: Harsh Kakadiya
License-Expression: MIT
Project-URL: Homepage, https://github.com/harsh-kakadiya1/Autosweep
Project-URL: Repository, https://github.com/harsh-kakadiya1/Autosweep
Project-URL: Issues, https://github.com/harsh-kakadiya1/Autosweep/issues
Keywords: preprocessing,machine-learning,feature-engineering,data-cleaning
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Requires-Dist: scikit-learn>=1.2
Requires-Dist: openpyxl>=3.1
Dynamic: license-file

# autosweep-preprocessing

A lightweight preprocessing library built around a single flexible API: `AutoSweep`.

## Usage

```python
from autosweep_preprocessing import AutoSweep

result = AutoSweep(
    file_path="data.csv",
    target_column="target",
    encode_categorical="onehot",
    remove_correlated=True,
    structured_output=True,
)

X = result["X"]
y = result["y"]
info = result["info"]
```

## Features

`AutoSweep` supports:

- CSV/Excel loading
- Missing value handling and imputation
- Numeric scaling (`standard`, `minmax`, `robust`)
- Categorical encoding (`onehot`, `ordinal`, `label`)
- Optional datetime feature extraction
- Optional outlier handling (`iqr`, `zscore`)
- Optional correlation and low-variance filtering
- Structured output for pipeline diagnostics

## AutoSweep Arguments Guide

### Required / Core

- `file_path` (required)
    - What it does: Path to the input dataset (`.csv` or Excel file).
    - Use case: Point to your raw training file before preprocessing.
    - Example: `file_path="data/train.csv"`

- `target_column` (default: `None`)
    - What it does: Separates the target variable from the features and returns it as `y`.
    - Use case: Set this when you want to train/evaluate models after preprocessing.
    - Example: `target_column="price"`

### Column cleaning

- `drop_columns` (default: `None`)
    - What it does: Drops specific columns by name.
    - Use case: Remove IDs, leakage columns, or metadata fields.
    - Example: `drop_columns=["id", "created_at"]`

- `drop_threshold` (default: `1.0`)
    - What it does: Drops columns whose missing-value fraction is greater than this threshold.
    - Use case: Use a value such as `0.4` or `0.5` to remove heavily incomplete columns. The default `1.0` drops nothing, because a missing fraction cannot exceed 1.
    - Example: `drop_threshold=0.5`
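
The `drop_threshold` rule can be sketched in plain Python. This is an illustration of the rule as documented above, not AutoSweep's internal code; the helper name `columns_to_drop` and the sample table are hypothetical.

```python
# Illustrative only: a plain-Python sketch of the documented drop_threshold rule.
# `columns_to_drop` and the sample data are hypothetical, not part of the library.

def columns_to_drop(table, drop_threshold=1.0):
    """Return names of columns whose missing fraction exceeds the threshold."""
    dropped = []
    for name, values in table.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        # Strictly greater than, following "greater than this threshold".
        if missing_fraction > drop_threshold:
            dropped.append(name)
    return dropped


table = {
    "age": [34, None, 51, 29],           # 25% missing
    "notes": [None, None, None, "x"],    # 75% missing
    "empty": [None, None, None, None],   # 100% missing
}

print(columns_to_drop(table))                      # default 1.0: nothing exceeds it
print(columns_to_drop(table, drop_threshold=0.5))  # "notes" and "empty" exceed 0.5
```

Under this strict reading, even a fully empty column survives the default threshold, so lower the value if you want such columns removed.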

### Missing values

- `impute_strategy_num` (default: `'mean'`)
    - What it does: Numeric imputation strategy.
    - Allowed: `'mean'`, `'median'`, `'most_frequent'`, `'constant'`, `'knn'`, `'mode'`.
    - Use case: Use `'median'` for skewed numeric data, `'knn'` for richer local patterns.
    - Example: `impute_strategy_num="median"`

- `impute_strategy_cat` (default: `'most_frequent'`)
    - What it does: Categorical imputation strategy.
    - Allowed: any strategy that scikit-learn's `SimpleImputer` accepts for categorical data (commonly `'most_frequent'` or `'constant'`).
    - Use case: Use `'most_frequent'` for stable categories.
    - Example: `impute_strategy_cat="most_frequent"`
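
To see why `'median'` is suggested for skewed numeric data, here is a minimal sketch that fills one missing value with different strategies. It uses only the standard library and is not how AutoSweep is implemented; the `impute` helper is hypothetical and covers only three of the strategies listed above.

```python
import statistics

# Illustrative only: `impute` is a hypothetical helper, not the library's API.
# It handles 'mean', 'median' and 'most_frequent'; 'constant' and 'knn' are omitted.

def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unsupported strategy in this sketch: {strategy!r}")
    return [fill if v is None else v for v in values]


incomes = [30, 32, 35, None, 1000]  # one extreme value skews the mean

print(impute(incomes, "mean"))    # missing entry becomes 274.25
print(impute(incomes, "median"))  # missing entry becomes 33.5
```

The mean is pulled toward the single extreme value, while the median stays close to the bulk of the data.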

### Scaling and encoding

- `scaler` (default: `'standard'`)
    - What it does: Scales numeric features.
    - Allowed: `'standard'`, `'minmax'`, `'robust'`; any other value leaves numeric features unscaled (passthrough).
    - Use case: Use `'robust'` when outliers are present.
    - Example: `scaler="robust"`

- `encode_categorical` (default: `None`)
    - What it does: Encodes categorical columns.
    - Allowed: `None`, `'none'`, `'passthrough'`, `'onehot'`, `'ordinal'`, `'label'`.
    - Use case: Use `'onehot'` for linear/tree models; `'label'` for compact numeric conversion.
    - Example: `encode_categorical="onehot"`
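
The three scaler names correspond to well-known formulas. The sketch below writes them out in plain Python, assuming the usual definitions used by scikit-learn's `StandardScaler`, `MinMaxScaler` and `RobustScaler`; whether AutoSweep delegates to those classes is an assumption, and the function names here are hypothetical.

```python
import statistics

# Illustrative only: textbook scaling formulas, not AutoSweep's internal code.

def standard_scale(xs):
    """(x - mean) / population standard deviation."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def minmax_scale(xs):
    """(x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """(x - median) / interquartile range, using linear-interpolated quartiles."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier

print(minmax_scale(values))  # the outlier squeezes the other values toward 0
print(robust_scale(values))  # the bulk of the data keeps a usable spread
```

With the outlier present, min-max scaling compresses the first four values into a narrow band near zero, which is the reason `'robust'` is recommended in that situation.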

### Feature selection

- `remove_low_variance` (default: `False`)
    - What it does: Removes low-variance numeric features after preprocessing.
    - Use case: Enable when many near-constant numeric features exist.
    - Example: `remove_low_variance=True`

- `variance_thresh` (default: `0.0`)
    - What it does: Variance cutoff used by low-variance filtering.
    - Use case: Increase (e.g., `0.01`) to remove weak/noisy features.
    - Example: `variance_thresh=0.01`

- `remove_correlated` (default: `False`)
    - What it does: Drops highly correlated numeric features.
    - Use case: Reduce multicollinearity and redundant columns.
    - Example: `remove_correlated=True`

- `corr_threshold` (default: `0.95`)
    - What it does: Absolute correlation threshold for dropping features.
    - Use case: Use a value between `0.85` and `0.95`, depending on how aggressively you want to prune features.
    - Example: `corr_threshold=0.9`
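
Correlation filtering can be sketched as follows. The sketch assumes Pearson correlation and keeps the first column of each highly correlated pair; which column AutoSweep actually keeps is not stated above, so treat that choice, along with the helper names and sample data, as assumptions.

```python
import math

# Illustrative only: a greedy correlation filter in plain Python.
# Keeping the earlier column of a correlated pair is an assumption.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither column is constant (otherwise this divides by zero).
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(table, corr_threshold=0.95):
    """Return the column names kept after removing highly correlated ones."""
    kept = []
    for name in table:
        # Keep a column only if it is not too correlated with any kept column.
        if all(abs(pearson(table[name], table[k])) <= corr_threshold for k in kept):
            kept.append(name)
    return kept


table = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # near duplicate of height_cm
    "age": [41.0, 23.0, 35.0, 52.0, 29.0],
}

print(drop_correlated(table))  # "height_in" is dropped as redundant
```

Lowering `corr_threshold` makes the filter drop more columns, since weaker relationships start to count as redundant.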

### Outlier handling

- `outlier_method` (default: `None`)
    - What it does: Enables outlier detection.
    - Allowed: `None`, `'iqr'`, `'zscore'` (also `'z-score'`, `'z_score'`).
    - Use case: Use `'iqr'` for non-normal data; `'zscore'` for roughly normal distributions.
    - Example: `outlier_method="iqr"`

- `outlier_threshold` (default: `1.5`)
    - What it does: Threshold used by the selected outlier method.
    - Use case: Increase to flag fewer points as outliers, decrease to be stricter.
    - Example: `outlier_threshold=3.0` (common for z-score)

- `cap_outliers` (default: `False`)
    - What it does: Caps outliers to bounds instead of dropping rows.
    - Use case: Set `True` when you want to preserve dataset size.
    - Example: `cap_outliers=True`
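
The difference between dropping and capping can be shown with the standard IQR rule. This sketch assumes `outlier_threshold` acts as the IQR multiplier (the conventional meaning of `1.5`) and that capping clips values to the computed bounds; both are assumptions about the library, and the function names are hypothetical.

```python
import statistics

# Illustrative only: the conventional IQR rule, not AutoSweep's internal code.

def iqr_bounds(xs, outlier_threshold=1.5):
    """Lower and upper bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - outlier_threshold * iqr, q3 + outlier_threshold * iqr


def handle_outliers(xs, outlier_threshold=1.5, cap_outliers=False):
    """Drop values outside the bounds, or clip them when cap_outliers is True."""
    low, high = iqr_bounds(xs, outlier_threshold)
    if cap_outliers:
        return [min(max(x, low), high) for x in xs]
    return [x for x in xs if low <= x <= high]


values = [10.0, 20.0, 30.0, 40.0, 500.0]

print(iqr_bounds(values))                          # (-10.0, 70.0)
print(handle_outliers(values))                     # 500.0 is dropped
print(handle_outliers(values, cap_outliers=True))  # 500.0 is clipped to 70.0
```

Capping keeps the number of rows unchanged, which matters when the rows must stay aligned with a target column.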

### Datetime features

- `extract_datetime` (default: `False`)
    - What it does: Parses datetime-like columns and extracts year/month/day/weekday/hour.
    - Use case: Enable when date fields carry predictive signal.
    - Example: `extract_datetime=True`

- `drop_datetime_original` (default: `False`)
    - What it does: Drops original datetime columns after extraction.
    - Use case: Keep only engineered datetime parts to simplify model input.
    - Example: `drop_datetime_original=True`
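
Datetime extraction amounts to parsing a value and splitting it into parts. The sketch below does this for a single row with the standard library. The output column names (`signup_year` and so on) and the helper are hypothetical; the names AutoSweep generates may differ.

```python
from datetime import datetime

# Illustrative only: the "<column>_<part>" naming scheme is an assumption.

def extract_datetime_parts(row, column, drop_original=False):
    """Add year/month/day/weekday/hour fields derived from an ISO datetime string."""
    parsed = datetime.fromisoformat(row[column])
    out = dict(row)
    out[f"{column}_year"] = parsed.year
    out[f"{column}_month"] = parsed.month
    out[f"{column}_day"] = parsed.day
    out[f"{column}_weekday"] = parsed.weekday()  # Monday is 0, Sunday is 6
    out[f"{column}_hour"] = parsed.hour
    if drop_original:
        del out[column]
    return out


row = {"signup": "2024-03-15 14:30:00", "plan": "pro"}

print(extract_datetime_parts(row, "signup"))
print(extract_datetime_parts(row, "signup", drop_original=True))
```

Dropping the original column leaves only numeric parts, which most estimators can consume directly.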

### Target encoding and output format

- `target_encode` (default: `False`)
    - What it does: Applies mean target encoding to categorical features.
    - Use case: Helpful for high-cardinality categorical variables.
    - Important: Requires `target_column`; avoid leakage by fitting only on training data in production workflows.
    - Example: `target_encode=True`

- `structured_output` (default: `True`)
    - What it does: Controls the return format.
    - If `True`: returns a dict with the keys `'X'`, `'y'`, `'feature_names'`, and `'info'`.
    - If `False`: returns a tuple, either `(X, y, feature_names)` or `(X, feature_names)` when there is no target.
    - Use case: Keep `True` for debugging and pipeline introspection.
    - Example: `structured_output=False`

- `verbose` (default: `True`)
    - What it does: Prints detailed preprocessing diagnostics.
    - Use case: Set `False` for cleaner logs in training pipelines.
    - Example: `verbose=False`
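
The leakage warning for `target_encode` is easier to follow with a small sketch of mean target encoding: the category-to-mean mapping is learned from training rows only and then applied to unseen rows. This is a generic illustration, not AutoSweep's code; the helper names are hypothetical, and falling back to the global training mean for unseen categories is an assumption.

```python
from collections import defaultdict

# Illustrative only: generic mean target encoding with a train-only fit.

def fit_target_encoding(categories, targets):
    """Learn the mean target per category, plus the global mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, fallback):
    """Replace each category with its learned mean; unseen ones get the fallback."""
    return [mapping.get(category, fallback) for category in categories]


train_city = ["A", "A", "B", "B", "B"]
train_price = [100.0, 200.0, 300.0, 300.0, 600.0]
test_city = ["A", "B", "C"]  # "C" never appears in training

mapping, global_mean = fit_target_encoding(train_city, train_price)
print(apply_target_encoding(test_city, mapping, global_mean))  # [150.0, 400.0, 300.0]
```

Because the mapping never sees test targets, the encoded test features carry no information about the values being predicted.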

## Notes

- If you use Excel input, keep `openpyxl` installed.
- If `target_encode=True`, provide a valid `target_column`.
