Metadata-Version: 2.4
Name: saara-ai
Version: 0.1.0
Summary: End-to-end ML pipeline: PDF→JSON datasets, synthetic data, fine-tuning, evaluation, and edge deployment with Gradio GUI
Author-email: Kilani Sai Nikhil <nikhil49023@gmail.com>
Maintainer-email: Kilani Sai Nikhil <nikhil49023@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/nikhil49023/saara-ai
Project-URL: Repository, https://github.com/nikhil49023/saara-ai
Project-URL: Documentation, https://saara-ai.readthedocs.io
Keywords: saara,ml,llm,fine-tuning,synthetic-data,pdf,vllm,gradio
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: unsloth>=2024.1
Requires-Dist: trl>=0.7.0
Requires-Dist: peft>=0.7.0
Requires-Dist: bitsandbytes>=0.41.0
Requires-Dist: vllm>=0.2.0
Requires-Dist: ollama>=0.1.0
Requires-Dist: llama-cpp-python>=0.2.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: autoawq>=0.2.0
Requires-Dist: auto-gptq>=0.5.0
Requires-Dist: onnx>=1.15.0
Requires-Dist: gguf>=0.1.0
Requires-Dist: lm-eval>=0.4.0
Requires-Dist: lighteval>=0.4.0
Requires-Dist: pynvml>=11.5.0
Requires-Dist: codecarbon>=2.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: gradio>=4.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: rich>=13.0.0
Requires-Dist: click>=8.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-book-theme>=1.0.0; extra == "docs"
Requires-Dist: mkdocs>=1.5.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.0.0; extra == "docs"
Provides-Extra: edge
Requires-Dist: tensorrt>=8.6.0; extra == "edge"
Requires-Dist: onnxruntime-gpu>=1.16.0; extra == "edge"
Provides-Extra: all
Requires-Dist: saara-ai[dev,docs,edge]; extra == "all"
Dynamic: license-file

# SAARA

**End-to-end ML pipeline for dataset creation, fine-tuning, evaluation, and edge deployment with Gradio GUI**

## Features

- 📄 **PDF to JSON Dataset Creation** - Extract text from PDFs and generate structured training data using local LLMs (vLLM, Ollama, llama.cpp)
- 🤖 **Synthetic Data Generation** - Create high-quality training data in multiple formats (Factual, Reasoning, Conversational, Instruction, Code, Creative)
- 🎯 **Fine-Tuning** - QLoRA/LoRA fine-tuning with Unsloth for fast, memory-efficient training
- 📊 **Comprehensive Evaluation** - Teacher-student comparison, standard benchmarks (MMLU, GSM8K, HumanEval), performance metrics, power consumption tracking
- 📦 **Model Export & Quantization** - Export to multiple formats (GGUF, AWQ, GPTQ, ONNX, TensorRT, Safetensors) with 2/3/4/8-bit quantization
- 🖥️ **Gradio GUI** - Visual interface for the entire pipeline with auto-generation from Python scripts

## Installation

```bash
# Clone repository
git clone https://github.com/nikhil49023/saara-ai.git
cd saara-ai

# Install package
pip install -e .

# For development
pip install -e ".[dev]"

# For edge deployment
pip install -e ".[edge]"
```

## Quick Start

### 1. Launch GUI

```bash
saara gui
```

Or in Python:
```python
from saara import SaaraDashboard

dashboard = SaaraDashboard()
dashboard.launch()
```

### 2. Create Dataset from PDF

```python
from saara import DatasetBuilder
from saara.dataset.types import DataType
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

# Setup provider
config = ProviderConfig(model="mistral", base_url="http://localhost:11434")
provider = OllamaProvider(config)

# Create dataset
builder = DatasetBuilder(provider)
samples = builder.from_pdf(
    "document.pdf",
    data_types=[DataType.INSTRUCTION, DataType.FACTUAL],
    pairs_per_type=5,
)

# Save
builder.save(samples, "dataset.jsonl")
```

### 3. Fine-Tune Model

```python
from saara import FineTuner
from saara.training.config import TrainingConfig

config = TrainingConfig(
    model_name="mistralai/Mistral-7B-v0.1",
    num_train_epochs=3,
    use_lora=True,
)

finetuner = FineTuner(config)
finetuner.train("dataset.jsonl")
finetuner.save("./output/models/my-finetune")
```

### 4. Evaluate with Teacher Comparison

```python
from saara import ModelEvaluator
from saara.providers.ollama_provider import OllamaProvider, ProviderConfig

student = OllamaProvider(ProviderConfig(model="mistral-7b-finetuned"))
teacher = OllamaProvider(ProviderConfig(model="llama-3-70b"))

evaluator = ModelEvaluator(student, teacher)
metrics = evaluator.evaluate(
    "test.jsonl",
    run_benchmarks=True,
    benchmark_names=["mmlu", "gsm8k"],
)

print(metrics.summary())
```

### 5. Export & Quantize

```python
from saara import ModelExporter
from saara.export.formats import ExportFormat
from saara.export.quantization import QuantizationConfig

config = QuantizationConfig(bits=4)
exporter = ModelExporter("./output/models/my-finetune", config)

results = exporter.export(
    "./output/exports",
    formats=[ExportFormat.GGUF, ExportFormat.AWQ, ExportFormat.ONNX],
    quantize=True,
)
```

## CLI Commands

```bash
# Dataset creation
saara dataset from-pdf document.pdf -o ./output --data-types instruction factual
saara dataset from-text text.txt -o ./output

# Training
saara train finetune dataset.jsonl --model mistralai/Mistral-7B-v0.1 --epochs 3

# Evaluation
saara eval model test.jsonl --model mistral --teacher llama-3-70b --benchmarks mmlu gsm8k

# Export
saara export model ./models/final --formats gguf awq --quantize --bits 4

# GUI
saara gui --port 7860
```

## Examples

See the `examples/` directory for complete workflows:

- `01_pdf_to_dataset.py` - Extract and create dataset from PDF
- `02_synthetic_data.py` - Generate synthetic training data
- `03_finetune.py` - Fine-tune a model with QLoRA
- `04_evaluate.py` - Evaluate with teacher comparison
- `05_export.py` - Export to multiple formats
- `06_complete_pipeline.py` - End-to-end workflow
- `07_gui.py` - Launch Gradio GUI

## Architecture

```
saara/
├── providers/      # Model providers (vLLM, Ollama, llama.cpp)
├── dataset/        # Dataset creation (PDF extraction, synthetic generation)
├── training/       # Fine-tuning pipelines (QLoRA, LoRA)
├── evaluation/     # Evaluation (benchmarks, teacher-student, power)
├── export/         # Export & quantization (GGUF, AWQ, GPTQ, ONNX, TensorRT)
├── gui/            # Gradio components and dashboard
├── cli/            # Command-line interface
└── utils/          # Utilities (I/O, logging, memory)
```

## Supported Formats

### Dataset Output
- Alpaca
- ChatML
- ShareGPT
- DPO
- Completion
- JSONL
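
These schemas differ mainly in field names and nesting. As an illustration (field names below follow common community conventions for Alpaca and ChatML; SAARA's exact output keys may differ), here is the same training pair in both styles:

```python
import json

# The same QA pair expressed in two common dataset schemas.
# Key names are illustrative, following community conventions.
alpaca = {
    "instruction": "Summarize the key finding of the report.",
    "input": "",
    "output": "Revenue grew 12% year over year.",
}

chatml = {
    "messages": [
        {"role": "user", "content": "Summarize the key finding of the report."},
        {"role": "assistant", "content": "Revenue grew 12% year over year."},
    ]
}

# JSONL is simply one JSON object per line, whatever the schema.
jsonl_line = json.dumps(alpaca)
print(jsonl_line)
```

DPO datasets add a `chosen`/`rejected` pair per prompt on top of these shapes.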

### Model Export
- **Safetensors** - Hugging Face-compatible weights
- **GGUF** - llama.cpp CPU/Metal inference
- **AWQ** - Activation-aware 4-bit quantization for GPU inference
- **GPTQ** - Quantized GPU inference
- **ONNX** - Cross-platform deployment
- **TensorRT** - NVIDIA GPU and Jetson edge deployment
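
As a rough back-of-the-envelope check when picking a bit width (a sketch, not SAARA's actual exporter logic), the on-disk size of a quantized model scales with bits per weight plus a small overhead for scales and metadata:

```python
def quantized_size_gb(num_params: float, bits: int, overhead: float = 1.1) -> float:
    """Estimate on-disk size: params * bits/8 bytes, plus ~10% for
    quantization scales, zero-points, and metadata. Illustrative only."""
    return num_params * bits / 8 * overhead / 1e9

# A 7B-parameter model at common bit widths:
for bits in (16, 8, 4, 2):
    print(f"{bits:2d}-bit: ~{quantized_size_gb(7e9, bits):.1f} GB")
```

This is why 4-bit quantization lets a 7B model fit comfortably in 8 GB of VRAM, while 16-bit weights alone would not.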

## Benchmarks

Standard benchmarks supported:
- MMLU (Massive Multitask Language Understanding)
- GSM8K (Grade School Math)
- HumanEval (Code Generation)
- BoolQ (Boolean Questions)
- HellaSwag (Commonsense NLI)
- TruthfulQA
- WinoGrande
- ARC Easy/Challenge

## Metrics Tracked

- **Accuracy** - Task performance
- **Perplexity** - Language modeling quality
- **Speed** - Tokens/sec, latency
- **Memory** - VRAM usage
- **Power** - Watts, energy, carbon footprint
- **Teacher Agreement** - Student-teacher alignment
- **Hallucination Rate** - Frequency of factually unsupported outputs
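
The speed metrics are simple derived quantities. A minimal sketch of how they can be measured around any generation call (the helper below is illustrative, not the `ModelEvaluator` API):

```python
import time

def measure_throughput(generate, prompt: str) -> dict:
    """Time one generation and derive speed metrics.
    `generate` is any callable returning (text, num_new_tokens)."""
    start = time.perf_counter()
    text, num_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return {
        "latency_s": elapsed,
        "tokens_per_sec": num_tokens / elapsed if elapsed > 0 else 0.0,
    }

# Stub generator standing in for a real provider call:
metrics = measure_throughput(lambda p: ("hello world", 2), "Hi")
print(metrics)
```

In practice latency is averaged over many prompts, and VRAM/power readings (via pynvml and codecarbon) are sampled alongside the same timed window.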

## Requirements

- Python 3.10+
- CUDA 11.8+ (for GPU features)
- 8GB+ VRAM recommended for fine-tuning
- 16GB+ VRAM for larger models

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Format code
black saara/
ruff check saara/

# Type checking
mypy saara/
```

## License

MIT License - see [LICENSE](LICENSE)

## Contributing

Contributions welcome! Please read our contributing guidelines before submitting PRs.

## Citation

```bibtex
@software{saara2024,
  title = {SAARA: End-to-End ML Pipeline},
  author = {Kilani Sai Nikhil},
  year = {2024},
  url = {https://github.com/nikhil49023/saara-ai},
}
```
