Metadata-Version: 2.1
Name: ds-eval-kit
Version: 1.1.0
Summary: Automated end-to-end ML pipeline: clean → EDA → feature selection → train 12/13 models → HTML reports + CSV snapshots
Author-email: Parasuraman E <g63909731@gmail.com>
License: MIT
Keywords: machine-learning,automated-ml,data-science,pipeline,feature-selection,eda,classification,regression
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: catboost >=1.2.0
Requires-Dist: lightgbm >=4.0.0
Requires-Dist: matplotlib >=3.7.0
Requires-Dist: numpy >=1.26.0
Requires-Dist: openpyxl >=3.1.0
Requires-Dist: pandas >=2.0.0
Requires-Dist: plotly >=5.15.0
Requires-Dist: pyarrow >=14.0.0
Requires-Dist: scikit-learn >=1.3.0
Requires-Dist: scipy >=1.11.0
Requires-Dist: seaborn >=0.12.0
Requires-Dist: statsmodels >=0.14.0
Requires-Dist: xgboost >=2.0.0
Provides-Extra: dev
Requires-Dist: black ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: pytest >=7.0 ; extra == 'dev'
Requires-Dist: ruff ; extra == 'dev'

# ds-eval-kit

**Automated end-to-end Machine Learning Pipeline — one import, one call.**

`ds-eval-kit` takes a raw dataset and produces cleaned data, interactive HTML reports, trained-model comparisons, and a CSV snapshot of every pipeline stage, with no boilerplate code on your part.

---

## ✨ Features

| Stage | What it does | Output |
|---|---|---|
| **Load** | CSV · Excel · JSON · Parquet | — |
| **Clean** | Null imputation · duplicate removal · type coercion | `02_cleaned_data.csv` |
| **EDA** | Distribution · correlation · missing-value charts | `eda_report.html` |
| **Encode** | One-hot / label / ordinal (auto-detected) | — |
| **Outliers** | IQR / Z-score clipping | — |
| **Scale** | Standard · MinMax · Robust | `03_encoded_scaled_data.csv` |
| **Feature selection** | Correlation + RFE + importance (union) | `04_selected_features.csv` · `feature_selection_report.html` |
| **Split** | Stratified train/test | `05_train_set.csv` · `06_test_set.csv` |
| **Train** | 12 classifiers **or** 13 regressors with cross-validation | `07_model_results.csv` |
| **Report** | Side-by-side model comparison with confusion matrices / residuals | `model_accuracy_report.html` |

---

## 📦 Installation

```bash
pip install ds-eval-kit
```

### Requirements

Python ≥ 3.12 and the following packages (all installed automatically):

```
pandas>=2.0   numpy>=1.26   scikit-learn>=1.3   matplotlib>=3.7
seaborn>=0.12  plotly>=5.15  scipy>=1.11  statsmodels>=0.14
xgboost>=2.0  lightgbm>=4.0  catboost>=1.2  pyarrow>=14.0  openpyxl>=3.1
```

---

## 🚀 Quickstart

### Interactive mode (zero code)

```python
from ds_eval_kit import ml_process

pipe = ml_process(output_dir="ml_output")
pipe.get_ml()
# → prompts: dataset path, target column, classification or regression
```

### Programmatic mode

```python
from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="ml_output",   # all files saved here
    export_csv=True,           # save 7 CSV snapshots (default: True)
    scaling_method="standard",
    handle_outliers=True,
)

result = pipe.run(
    dataset_path="used_cars.csv",
    target="price",
    problem_type="regression",   # or "classification" or None (auto-detect)
)

# HTML reports
print(result["eda_report"])       # → ml_output/eda_report.html
print(result["feature_report"])   # → ml_output/feature_selection_report.html
print(result["model_report"])     # → ml_output/model_accuracy_report.html

# CSV snapshots
for name, path in result["csv_files"].items():
    print(f"{name}  →  {path}")

# Intermediate DataFrames (in memory)
raw_df       = result["dataframes"]["raw"]
clean_df     = result["dataframes"]["cleaned"]
processed_df = result["dataframes"]["processed"]
selected_df  = result["dataframes"]["selected"]
train_df     = result["dataframes"]["train"]
test_df      = result["dataframes"]["test"]

# Model metrics table
import pandas as pd
metrics = pd.DataFrame(result["results"])
print(metrics.sort_values("R2", ascending=False))
```

---

## 📂 Output Files

After the pipeline runs, `output_dir` contains the following files (the numbered CSV snapshots are written only when `export_csv=True`):

```
ml_output/
├── 01_raw_data.csv                  ← original loaded dataset
├── 02_cleaned_data.csv              ← after null/duplicate/type fixes
├── 03_encoded_scaled_data.csv       ← after encoding + outlier clip + scaling
├── 04_selected_features.csv         ← after feature selection
├── 05_train_set.csv                 ← X_train + y (target) column
├── 06_test_set.csv                  ← X_test  + y (target) column
├── 07_model_results.csv             ← all model metrics in one table
├── eda_report.html                  ← interactive EDA charts
├── feature_selection_report.html    ← feature importance & VIF table
└── model_accuracy_report.html       ← model comparison dashboard
```

### CSV snapshot reference

| File | Stage | Key columns |
|---|---|---|
| `01_raw_data.csv` | Raw load | original columns |
| `02_cleaned_data.csv` | After cleaning | original columns, nulls filled |
| `03_encoded_scaled_data.csv` | After preprocessing | numeric columns only |
| `04_selected_features.csv` | After feature selection | selected columns + target |
| `05_train_set.csv` | Train split | features + target |
| `06_test_set.csv` | Test split | features + target |
| `07_model_results.csv` | Model results | Model, CV Score, Test metrics… |

---

## ⚙️ Configuration Reference

```python
ml_process(
    test_size               = 0.2,          # fraction held out for testing
    random_state            = 42,           # global seed
    cv_folds                = 5,            # cross-validation folds
    handle_outliers         = True,         # IQR clipping
    scaling_method          = "standard",   # "standard" | "minmax" | "robust"
    encoding_method         = "auto",       # "auto" | "onehot" | "label" | "ordinal"
    feature_selection_method= "all",        # "all" | "correlation" | "rfe" | "importance"
    generate_plots          = True,         # include charts in HTML reports
    output_dir              = ".",          # where to save all output files
    export_csv              = True,         # save CSV snapshots at every stage
)
```
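
For readers unfamiliar with the IQR clipping that `handle_outliers` refers to, the usual technique can be sketched as follows. This is a minimal NumPy illustration, not the package's own code: the helper name `clip_iqr` and the 1.5 × IQR fence multiplier are assumptions.

```python
import numpy as np

def clip_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return np.clip(arr, q1 - k * iqr, q3 + k * iqr)

# Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7:
# the outlier 100 is pulled down to the upper fence, 7.0.
clipped = clip_iqr([1, 2, 3, 4, 100])
print(clipped)
```

Clipping keeps every row and only caps extreme values, which is why it is often preferred over dropping outlier rows on small datasets.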

### `scaling_method`

| Value | Algorithm | Best for |
|---|---|---|
| `"standard"` | StandardScaler (z-score) | Most cases, SVM, linear models |
| `"minmax"` | MinMaxScaler (0–1) | Neural networks, KNN |
| `"robust"` | RobustScaler (median/IQR) | Data with many outliers |
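
The three options differ only in which statistics they subtract and divide by. The sketch below writes those formulas out in plain NumPy for a single column; it is an illustration of what the scikit-learn scalers named in the table compute with their default settings, and the `scale` helper is hypothetical rather than part of `ds-eval-kit`.

```python
import numpy as np

def scale(values, method="standard"):
    """Scale one numeric column; assumes the column is not constant."""
    arr = np.asarray(values, dtype=float)
    if method == "standard":
        # z-score: zero mean, unit (population) standard deviation
        return (arr - arr.mean()) / arr.std()
    if method == "minmax":
        # rescale linearly to the range [0, 1]
        return (arr - arr.min()) / (arr.max() - arr.min())
    if method == "robust":
        # centre on the median, divide by the interquartile range
        q1, median, q3 = np.percentile(arr, [25, 50, 75])
        return (arr - median) / (q3 - q1)
    raise ValueError(f"unknown scaling method: {method!r}")

column = [1, 2, 3, 4, 100]
for name in ("standard", "minmax", "robust"):
    print(name, np.round(scale(column, name), 3))
```

With the outlier `100` in the column, min-max scaling squeezes the other four values close to 0, while robust scaling keeps them spread between -1 and 0.5. That is the practical reason for choosing `"robust"` on outlier-heavy data.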

### `encoding_method`

| Value | Behaviour |
|---|---|
| `"auto"` | One-hot for ≤ 10 categories, label encoding otherwise |
| `"onehot"` | Always one-hot |
| `"label"` | Always label encoding |
| `"ordinal"` | Ordinal encoding (preserves order) |
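
The `"auto"` rule in the table can be pictured with a short pandas sketch. This is an assumption about how such a rule might be written, not the package's implementation: `encode_auto` and `max_onehot_categories` are hypothetical names, and integer category codes stand in for label encoding.

```python
import pandas as pd

def encode_auto(df, max_onehot_categories=10):
    """One-hot encode low-cardinality columns, integer-code the rest."""
    out = df.copy()
    # Treat every non-numeric column as categorical (an assumption).
    cat_cols = [c for c in out.columns
                if not pd.api.types.is_numeric_dtype(out[c])]
    for col in cat_cols:
        if out[col].nunique() <= max_onehot_categories:
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out.drop(columns=col), dummies], axis=1)
        else:
            # One integer per distinct value, similar to label encoding
            out[col] = out[col].astype("category").cat.codes
    return out

df = pd.DataFrame({
    "color": ["red", "green", "blue"] * 4,    # 3 categories  -> one-hot
    "city": [f"c{i}" for i in range(12)],     # 12 categories -> integer codes
    "x": range(12),
})
encoded = encode_auto(df)
print(encoded.columns.tolist())
```

The threshold exists because one-hot encoding adds one column per category, which becomes unwieldy for high-cardinality features.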

### `feature_selection_method`

| Value | Behaviour |
|---|---|
| `"all"` | Runs correlation + importance, unions results, then applies RFE |
| `"correlation"` | Drops features with pairwise correlation > 0.90 |
| `"rfe"` | Recursive Feature Elimination with a RandomForest estimator |
| `"importance"` | Keeps the top-k features by RandomForest importance score |

---

## 🤖 Models Trained

### Classification (12 models)

| Model | Library |
|---|---|
| Logistic Regression | scikit-learn |
| Decision Tree | scikit-learn |
| Random Forest | scikit-learn |
| Gradient Boosting | scikit-learn |
| AdaBoost | scikit-learn |
| Extra Trees | scikit-learn |
| SVM (RBF kernel) | scikit-learn |
| K-Nearest Neighbours | scikit-learn |
| Gaussian Naïve Bayes | scikit-learn |
| XGBoost | xgboost |
| LightGBM | lightgbm |
| CatBoost | catboost |

**Metrics:** Accuracy · Precision · Recall · F1 · CV Score  
**Extras:** Confusion matrix per model
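
For reference, the four test metrics follow directly from the confusion-matrix counts. The sketch below uses the standard binary definitions in plain Python. The helper `binary_metrics` is hypothetical, and how the package averages these metrics for multi-class targets is not specified here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1}

# tp = 2, fp = 1, fn = 1, tn = 2
scores = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(scores)
```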

### Regression (13 models)

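The regressor counterparts of the classifiers above, excluding Logistic Regression and Naïve Bayes (neither has a regression equivalent), **+** Linear Regression · Ridge · Lasso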

**Metrics:** R² · MAE · RMSE · CV Score  
**Extras:** Actual vs Predicted scatter per model
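
The regression metrics have simple closed forms, written out below in NumPy using the standard textbook definitions. The `regression_metrics` helper is hypothetical and is not taken from the package.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R², MAE and RMSE; assumes y_true is not constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return {
        "R2": float(1 - ss_res / ss_tot),
        "MAE": float(np.mean(np.abs(residuals))),
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
    }

# Residuals are 1, 0, -1, 0: ss_res = 2, ss_tot = 20
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
print(metrics)   # R2 = 0.9, MAE = 0.5, RMSE ≈ 0.707
```

RMSE penalises large errors more heavily than MAE, so a model with a few big misses can rank well on MAE and poorly on RMSE.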

---

## 📊 Accessing Results Programmatically

```python
import pandas as pd
from ds_eval_kit import ml_process

pipe = ml_process(output_dir="ml_output")
result = pipe.run("data.csv", "target")

# Best classification model by test accuracy
df_res = pd.DataFrame(result["results"])
best = df_res.sort_values("Accuracy", ascending=False).iloc[0]
print(f"Best model: {best['Model']}  accuracy={best['Accuracy']:.4f}")

# Load the cleaned CSV for further work
clean = pd.read_csv(result["csv_files"]["02_cleaned_data.csv"])

# Use the train DataFrame directly (no disk I/O)
train_df = result["dataframes"]["train"]
```

### Disabling CSV export

```python
pipe = ml_process(export_csv=False)   # HTML reports only, no CSVs
```

---

## 📁 Supported Dataset Formats

| Extension | Format |
|---|---|
| `.csv` | Comma-separated values |
| `.xlsx` / `.xls` | Microsoft Excel |
| `.json` | JSON (records or columns orientation) |
| `.parquet` | Apache Parquet |
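
To load the same formats yourself, for example to inspect a file before handing it to the pipeline, a dispatch on the file extension is enough. The sketch below uses the standard pandas readers. `load_dataset` and `READERS` are hypothetical names, not part of the `ds-eval-kit` API.

```python
from pathlib import Path

import pandas as pd

# Extension -> pandas reader (Excel needs openpyxl, Parquet needs pyarrow)
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,
}

def load_dataset(path):
    """Load a dataset, choosing the reader from the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported dataset format: {suffix!r}") from None
    return reader(path)
```

For example, `load_dataset("used_cars.csv")` returns the same DataFrame as `pd.read_csv("used_cars.csv")`, while an unknown extension such as `.txt` raises `ValueError`.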

---

## 🧪 Running Tests

```bash
pip install "ds-eval-kit[dev]"   # quotes keep shells such as zsh from expanding the brackets
pytest ds_eval_kit/tests/ -v
```

---

## 📖 Examples

### Titanic (classification)

```python
from ds_eval_kit import ml_process

pipe = ml_process(output_dir="titanic_output", export_csv=True)
result = pipe.run("titanic.csv", target="Survived", problem_type="classification")
```

### House prices (regression)

```python
from ds_eval_kit import ml_process

pipe = ml_process(
    output_dir="house_output",
    scaling_method="robust",
    feature_selection_method="importance",
)
result = pipe.run("house_prices.csv", target="SalePrice", problem_type="regression")
```

### Custom split & folds

```python
from ds_eval_kit import ml_process

pipe = ml_process(test_size=0.25, cv_folds=10, random_state=0)
result = pipe.run("data.csv", "label")
```
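
To make `test_size` and `random_state` concrete, here is a pure-Python sketch of a stratified split, which holds out the same fraction of every class. It illustrates the idea only. The package's actual splitting logic is not shown in this README, and the `stratified_split` helper and its rounding rule are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, random_state=42):
    """Return (train_idx, test_idx) with test_size of each class held out."""
    rng = random.Random(random_state)   # seeded, so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25, random_state=0)
print(len(train_idx), len(test_idx))   # 9 3
```

Here two of the eight `"a"` rows and one of the four `"b"` rows land in the test set, so both splits keep the original 2:1 class ratio.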

---

## 📝 License

MIT — see [LICENSE](LICENSE).

---

## 🙌 Contributing

Pull requests are welcome. Please run `ruff check .` and `black .` before submitting.
