Metadata-Version: 2.4
Name: mrpravin
Version: 0.1.0
Summary: One-line Data Analyst + AutoML library for tabular datasets — clean, train, and predict with minimal code
Author-email: Pravin MR <mrpravin000@gmail.com>
Maintainer-email: Pravin MR <mrpravin000@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Pravin MR
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/mr-pravin/mrpravin
Project-URL: Repository, https://github.com/mr-pravin/mrpravin
Project-URL: Documentation, https://github.com/mr-pravin/mrpravin#readme
Project-URL: Bug Tracker, https://github.com/mr-pravin/mrpravin/issues
Project-URL: Author, https://mrpravin000.vercel.app/
Keywords: automl,machine-learning,data-cleaning,feature-engineering,tabular-data,scikit-learn,pandas,pravinDA,pravinDS,pravinML
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5
Requires-Dist: numpy>=1.23
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: openpyxl>=3.1
Provides-Extra: full
Requires-Dist: xgboost>=1.7; extra == "full"
Requires-Dist: lightgbm>=4.0; extra == "full"
Requires-Dist: chardet>=5.0; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

<div align="center">

<h1>mrpravin</h1>

<p><strong>One-line Data Analyst + AutoML for tabular datasets</strong></p>

<p>
  <a href="https://pypi.org/project/mrpravin"><img src="https://img.shields.io/pypi/v/mrpravin?color=blue&label=PyPI" alt="PyPI"></a>
  <a href="https://pypi.org/project/mrpravin"><img src="https://img.shields.io/pypi/pyversions/mrpravin" alt="Python"></a>
  <a href="https://github.com/mr-pravin/mrpravin/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="License"></a>
  <a href="https://github.com/mr-pravin/mrpravin"><img src="https://img.shields.io/github/stars/mr-pravin/mrpravin?style=social" alt="Stars"></a>
</p>

<p>
  Built by <strong><a href="https://mrpravin000.vercel.app/">Pravin MR</a></strong> · Chennai, India<br>
  <a href="mailto:mrpravin000@gmail.com">mrpravin000@gmail.com</a> ·
  <a href="https://linkedin.com/in/mr-pravin">LinkedIn</a> ·
  <a href="https://github.com/mr-pravin">GitHub</a>
</p>

</div>

---

## What is mrpravin?

`mrpravin` is a Python library that automates the entire ML pipeline for tabular data, from a raw CSV to a trained, production-ready model, in as few as **3 lines of code**.

```python
import mrpravin as mr

df    = mr.pravinDA("data.csv")                  # clean, encode, ready
model = mr.pravinDS(df, target="loan_status")    # AutoML → best model
model.summary()                                  # full model card
```

No manual preprocessing. No encoder fitting. No scaler setup. No model selection loop.

---

## Install

```bash
# Core (pandas, numpy, scikit-learn, scipy, openpyxl)
pip install mrpravin

# Full — adds XGBoost, LightGBM, encoding detection
pip install "mrpravin[full]"
```

---

## The 3 Functions

| Function | What it does |
|----------|-------------|
| `mr.pravinDA(source)` | Loads, cleans, encodes, and returns a ready DataFrame |
| `mr.pravinDS(df, target)` | Full AutoML — selects and tunes the best model |
| `mr.pravinML` | Production inference layer — predict, validate, explain, benchmark |

---

## Quick Start

### Data Analyst Mode
```python
import mrpravin as mr

# Works with CSV, Excel, JSON, or a DataFrame
df = mr.pravinDA("data.csv")

print(df.head())   # fully cleaned, encoded, human-readable
print(df.shape)
```

**What happens automatically:**
- Duplicate rows removed
- Missing values filled (median for numeric, mode for categorical)
- Outliers winsorized
- Boolean columns → `0 / 1`
- Categorical columns → one-hot encoded
- High cardinality columns → frequency encoded
- Datetime columns → year / month / day / dayofweek features
- ID columns → dropped
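
For readers who want a feel for what these steps involve, here is a minimal pandas sketch of most of them. It is not mrpravin's actual implementation: the function name `clean_sketch`, the `max_ohe_levels` threshold, and the 1.5 × IQR fences are assumptions made for illustration, and datetime expansion and ID-column detection are left out.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame, max_ohe_levels: int = 10) -> pd.DataFrame:
    """Illustrative cleaning pass (hypothetical; not the library's code)."""
    out = df.drop_duplicates().reset_index(drop=True)    # remove duplicate rows

    for col in list(out.columns):
        s = out[col]
        if pd.api.types.is_bool_dtype(s):
            out[col] = s.astype(int)                      # boolean -> 0 / 1
        elif pd.api.types.is_numeric_dtype(s):
            s = s.fillna(s.median())                      # median imputation
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            # Winsorize: clip values to the (assumed) 1.5 * IQR fences.
            out[col] = s.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        else:
            s = s.fillna(s.mode().iloc[0])                # mode imputation
            if s.nunique() > max_ohe_levels:
                # High cardinality: replace each category by its relative frequency.
                out[col] = s.map(s.value_counts(normalize=True))
            else:
                # Low cardinality: one-hot encode.
                dummies = pd.get_dummies(s, prefix=col, dtype=int)
                out = pd.concat([out.drop(columns=col), dummies], axis=1)
    return out


raw = pd.DataFrame({
    "age":   [25.0, 30.0, None, 30.0, 200.0, 30.0],
    "owner": [True, False, True, False, True, False],
    "city":  ["A", "B", None, "B", "A", "B"],
})
cleaned = clean_sketch(raw)
print(cleaned)
```

The sketch assumes every categorical column has at least one non-missing value; a real cleaner would need to handle all-missing columns as well.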

---

### Data Scientist Mode — AutoML
```python
import mrpravin as mr

df    = mr.pravinDA("data.csv")
model = mr.pravinDS(df, target="price")

model.summary()
```

**What happens automatically:**
- Train / test split, with all preprocessing fit on the training split only (no leakage)
- Runs: `LinearRegression / LogisticRegression`, `RandomForest`, `GradientBoosting` (+ XGBoost, LightGBM if installed)
- Cross-validated hyperparameter tuning
- Picks the best model
- Evaluates on held-out test set
- Returns a `pravinML` object ready for production
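
The selection loop described above can be approximated directly with scikit-learn. The sketch below is an assumption about the general shape of such a loop, not the library's code: the synthetic dataset, the two candidates, and their search spaces are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a cleaned, encoded dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate models with small, illustrative search spaces (assumed values).
candidates = {
    "LogisticRegression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "RandomForest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    ),
}

best_name, best_search = None, None
for name, (estimator, space) in candidates.items():
    search = RandomizedSearchCV(estimator, space, n_iter=4, cv=5, random_state=42)
    search.fit(X_train, y_train)  # cross-validation sees the training split only
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

# The held-out test set is touched once, after the winner is chosen.
test_accuracy = best_search.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Keeping the test split out of the search is what makes the final score an honest estimate: the winner is picked on cross-validated training scores alone.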

---

### Production Inference — pravinML
```python
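import pandas as pd
import mrpravin as mr

# `model` is the pravinML object returned by mr.pravinDS() in the previous section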

# Predict on new raw data
new_data = pd.DataFrame({
    "feature_1": [8, 3, 6],
    "feature_2": [95, 55, 75],
    "category":  ["Yes", "No", "Yes"],
})

predictions = model.predict(new_data)       # auto-cleans internally
probabilities = model.predict_proba(new_data)  # classification only

# Validate schema before predicting
report = model.validate(new_data)
print(report.summary())

# Feature importance
for feature, pct in model.explain().items():
    print(f"  {feature}: {pct:.1f}%")

# Benchmark inference speed
bench = model.benchmark(new_data, n_runs=100)
print(f"p50: {bench['p50_ms']} ms | throughput: {bench['throughput_rows_per_sec']} rows/s")

# Save and load
model.save("model.pkl")
model = mr.pravinML.load("model.pkl")
```
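
To make the benchmark numbers concrete, here is one way latency percentiles and throughput could be computed with the standard library alone. This is a sketch under assumptions, not how `model.benchmark()` is implemented: `benchmark_sketch` is a hypothetical helper, and only the `p50_ms` and `throughput_rows_per_sec` keys appear in the example above (`p95_ms` and `p99_ms` are assumed names).

```python
import statistics
import time


def benchmark_sketch(predict_fn, batch, n_runs=100):
    """Time predict_fn(batch) n_runs times and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    mean_s = statistics.fmean(latencies_ms) / 1000.0
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],   # assumed key name
        "p99_ms": cuts[98],   # assumed key name
        "throughput_rows_per_sec": len(batch) / mean_s if mean_s > 0 else float("inf"),
    }


# Any callable works; a trivial row-sum stands in for a model here.
rows = [[1, 2], [3, 4], [5, 6]]
stats = benchmark_sketch(lambda batch: [sum(r) for r in batch], rows, n_runs=50)
print(stats)
```

Percentiles tend to be more informative than a mean for latency, because a few slow runs can dominate the average.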

---

## Real Results

| Dataset | Rows | Problem | Best Model | Score |
|---------|------|---------|-----------|-------|
| Student Performance | 10,000 | Regression | LinearRegression | R² = 0.988 |
| Loan Default Prediction | 45,000 | Classification | GradientBoosting | Accuracy = 93.4%, ROC-AUC = 97.8% |

Both results were obtained with the same 3-line workflow.

---

## Configuration

```python
import mrpravin as mr
from mrpravin import MrPravinConfig

cfg = MrPravinConfig(
    random_seed=42,
    cv_folds=5,
    n_iter_search=20,        # hyperparameter search iterations
    use_xgboost=True,
    use_lightgbm=True,
    outlier_method="iqr",    # "iqr" | "zscore" | "mad"
    verbose=True,
)

model = mr.pravinDS("data.csv", target="label", cfg=cfg)
```

Save and reuse config:
```python
cfg.to_json("my_config.json")
cfg = MrPravinConfig.from_json("my_config.json")
```
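
A JSON round trip like this is straightforward to build on a dataclass. The sketch below uses a hypothetical `SketchConfig` class with a few of the fields from the example above; how `MrPravinConfig` is actually implemented is not shown in this README.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for MrPravinConfig (subset of fields)."""
    random_seed: int = 42
    cv_folds: int = 5
    n_iter_search: int = 20
    outlier_method: str = "iqr"

    def to_json(self, path: str) -> None:
        # Serialize all fields as a flat JSON object.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "SketchConfig":
        # Rebuild the config from the stored field values.
        with open(path, encoding="utf-8") as fh:
            return cls(**json.load(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_config.json")
    SketchConfig(cv_folds=10).to_json(path)
    restored = SketchConfig.from_json(path)

print(restored)
```

Storing the config next to a trained model makes a run easier to reproduce, since the seed and search budget travel with it.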

---

## Architecture

```
mrpravin/
├── mrpravin/
│   ├── __init__.py          ← public API
│   ├── config.py            ← MrPravinConfig
│   ├── pipeline.py          ← pravinDA() and pravinDS()
│   ├── ml.py                ← pravinML (inference layer)
│   ├── core/
│   │   ├── loader.py        ← CSV / Excel / JSON loading
│   │   ├── profiler.py      ← column type detection
│   │   ├── cleaner.py       ← dedup, imputation, outliers
│   │   ├── encoder.py       ← OHE, frequency, boolean encoding
│   │   ├── scaler.py        ← StandardScaler / RobustScaler
│   │   └── report.py        ← report builder + JSON/HTML export
│   └── automl/
│       ├── model_selector.py ← problem detection + candidates
│       ├── tuner.py          ← RandomizedSearchCV
│       └── evaluator.py      ← metrics + feature importance
└── tests/
    └── test_mrpravin.py     ← 37 tests
```

### Full pipeline flow

```
CSV / Excel / JSON / DataFrame
        ↓
   pravinDA()
        ├── load
        ├── detect column types  (7 types)
        ├── clean                (dedup, impute, outliers, text)
        ├── encode               (OHE / frequency / boolean / datetime)
        └── returns DataFrame    ← human-readable, ML-ready
        ↓
   pravinDS()
        ├── dedup full dataset before split  (prevents X/y misalignment)
        ├── train / test split               (no leakage)
        ├── clean + encode + scale           (fit on train only)
        ├── model selection + CV tuning
        ├── evaluate on test set
        └── returns pravinML object
        ↓
   pravinML.predict()
        ├── validate schema
        ├── clean + encode + scale           (transform only, no re-fit)
        └── predict
```

**Zero data leakage by design.** Encoders and scalers are always fit on training data only, then applied via `.transform()` at inference time.
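
The fit-on-train, transform-at-inference pattern can be illustrated with a tiny frequency encoder. `FrequencyEncoderSketch` is a hypothetical class written for this example, and mapping unseen categories to `0.0` is an assumed policy rather than documented library behaviour.

```python
from collections import Counter


class FrequencyEncoderSketch:
    """Hypothetical encoder: learns frequencies in fit(), reuses them in transform()."""

    def fit(self, values):
        counts = Counter(values)
        total = sum(counts.values())
        self.freq_ = {category: n / total for category, n in counts.items()}
        return self

    def transform(self, values):
        # Categories never seen during fit map to 0.0; nothing is re-learned here.
        return [self.freq_.get(v, 0.0) for v in values]


train_cities = ["Chennai", "Chennai", "Mumbai", "Delhi"]
new_cities = ["Chennai", "Pune"]  # "Pune" was not in the training data

encoder = FrequencyEncoderSketch().fit(train_cities)  # fit on training data only
encoded_new = encoder.transform(new_cities)           # transform only at inference
print(encoded_new)  # [0.5, 0.0]
```

Because `transform()` never updates `freq_`, statistics from test or production data cannot influence the encoding the model was trained on.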

---

## pravinML — Full API Reference

```python
model.predict(X)              # predict labels / values
model.predict_proba(X)        # predict class probabilities
model.validate(X)             # schema + drift check → ValidationReport
model.evaluate(X, y)          # metrics on any labelled dataset
model.explain(top_n=20)       # feature importance as % contribution
model.summary()               # full model card printout
model.benchmark(X, n_runs=100) # inference latency p50/p95/p99
model.save("model.pkl")       # persist with integrity checksum
mr.pravinML.load("model.pkl") # load with checksum verification
model.metrics                 # raw metrics dict
model.feature_names           # training feature list
model.problem_type            # 'regression' | 'binary_classification' | ...
model.schema                  # InputSchema — raw feature ranges
model.model_name              # winning algorithm name
```
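
As a rough picture of what "persist with integrity checksum" can mean, the sketch below pickles an object and stores a SHA-256 digest in a sidecar file, then verifies it before loading. The helper names and the `.sha256` sidecar layout are assumptions; the on-disk format that `model.save()` actually uses is not described in this README.

```python
import hashlib
import os
import pickle
import tempfile


def save_with_checksum(obj, path):
    """Pickle obj to path and write its SHA-256 digest to path + '.sha256'."""
    payload = pickle.dumps(obj)
    with open(path, "wb") as fh:
        fh.write(payload)
    with open(path + ".sha256", "w", encoding="utf-8") as fh:
        fh.write(hashlib.sha256(payload).hexdigest())


def load_with_checksum(path):
    """Verify the digest before unpickling; raise ValueError on mismatch."""
    with open(path, "rb") as fh:
        payload = fh.read()
    with open(path + ".sha256", encoding="utf-8") as fh:
        expected = fh.read().strip()
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"checksum mismatch for {path}")
    return pickle.loads(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    save_with_checksum({"coef": [0.1, 0.2]}, path)
    restored = load_with_checksum(path)

    with open(path, "ab") as fh:  # simulate file corruption
        fh.write(b"x")
    try:
        load_with_checksum(path)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(restored, tamper_detected)
```

A checksum of this kind catches accidental corruption. It does not make unpickling untrusted files safe, so model files should still come only from sources you trust.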

---

## Supported Formats

| Format | Extension |
|--------|-----------|
| CSV | `.csv`, `.tsv`, `.txt` |
| Excel | `.xlsx`, `.xls`, `.xlsm` |
| JSON | `.json` (records or lines) |
| DataFrame | Pass directly |
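
A loader that accepts all of these inputs can dispatch on the file extension. The sketch below shows one possible approach with pandas; `load_table_sketch` is a hypothetical function, and the delimiter sniffing for `.txt` and the records-then-lines fallback for JSON are assumptions about reasonable behaviour, not a description of `core/loader.py`.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def load_table_sketch(source):
    """Hypothetical loader: dispatch on file extension, pass DataFrames through."""
    if isinstance(source, pd.DataFrame):
        return source.copy()

    path = Path(source)
    ext = path.suffix.lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext == ".tsv":
        return pd.read_csv(path, sep="\t")
    if ext == ".txt":
        return pd.read_csv(path, sep=None, engine="python")  # sniff the delimiter
    if ext in {".xlsx", ".xls", ".xlsm"}:
        return pd.read_excel(path)  # needs openpyxl (.xlsx/.xlsm) or xlrd (.xls)
    if ext == ".json":
        try:
            return pd.read_json(path)              # array of records
        except ValueError:
            return pd.read_json(path, lines=True)  # one JSON object per line
    raise ValueError(f"unsupported file extension: {ext!r}")


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_csv(csv_path, index=False)
    loaded = load_table_sketch(csv_path)

print(loaded)
```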

---

## Requirements

- Python ≥ 3.9
- pandas ≥ 1.5
- numpy ≥ 1.23
- scikit-learn ≥ 1.3
- scipy ≥ 1.10
- openpyxl ≥ 3.1 (Excel support)

Optional:
- xgboost ≥ 1.7
- lightgbm ≥ 4.0
- chardet ≥ 5.0 (non-UTF-8 CSV encoding detection)

---

## Run Tests

```bash
cd mrpravin
pip install -e ".[dev]"
pytest tests/ -v
```

The suite contains 37 tests covering the profiler, cleaner, encoder, scaler, `pravinDA`, `pravinDS`, and `pravinML` (predict, validate, evaluate, explain, benchmark, save/load).

---

## Roadmap

- [x] Phase 1 — `pravinDA` · `pravinDS` · `pravinML`
- [ ] Phase 2 — `pravinAI` — static pipeline compiler and anti-pattern detector
- [ ] Phase 3 — `pravinDL` · `pravinNLP` — deep learning and NLP extensions

---

## Author

**Pravin MR** — Data Engineer & ML Systems Builder, Chennai, India

- Website: [mrpravin000.vercel.app](https://mrpravin000.vercel.app/)
- Email: [mrpravin000@gmail.com](mailto:mrpravin000@gmail.com)
- LinkedIn: [linkedin.com/in/mr-pravin](https://linkedin.com/in/mr-pravin)
- GitHub: [github.com/mr-pravin](https://github.com/mr-pravin)

---

## License

MIT © 2026 Pravin MR — see [LICENSE](LICENSE) for full text.
