Metadata-Version: 2.4
Name: semantic-model-diff
Version: 0.3.2
Summary: A powerful 'git diff' tool for language models, providing deep behavioral and structural analysis between fine-tuned and base models.
License: Apache-2.0
Project-URL: Homepage, https://github.com/pratikbhande/Semantic-Diff
Project-URL: Repository, https://github.com/pratikbhande/Semantic-Diff
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: full
Requires-Dist: transformers>=4.40; extra == "full"
Requires-Dist: torch; extra == "full"
Requires-Dist: peft; extra == "full"
Requires-Dist: bitsandbytes; extra == "full"
Requires-Dist: sentence-transformers; extra == "full"
Requires-Dist: faiss-cpu; extra == "full"
Requires-Dist: safetensors; extra == "full"
Requires-Dist: huggingface_hub; extra == "full"
Requires-Dist: numpy; extra == "full"
Requires-Dist: structlog; extra == "full"
Requires-Dist: rich; extra == "full"
Requires-Dist: click; extra == "full"
Requires-Dist: pydantic>=2; extra == "full"
Requires-Dist: pydantic-settings; extra == "full"
Requires-Dist: jinja2; extra == "full"
Requires-Dist: sqlalchemy; extra == "full"
Requires-Dist: python-dotenv; extra == "full"
Requires-Dist: psutil; extra == "full"
Requires-Dist: scipy; extra == "full"
Provides-Extra: ui
Requires-Dist: gradio>=4.0; extra == "ui"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-asyncio; extra == "dev"
Requires-Dist: pytest-timeout; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: httpx; extra == "dev"

# Semantic Model Diff

[![PyPI version](https://badge.fury.io/py/semantic-model-diff.svg)](https://badge.fury.io/py/semantic-model-diff)

A powerful, production-grade **"git diff" for language models**. `semantic-model-diff` performs deep behavioral and structural comparisons between a base language model and its fine-tuned counterpart, producing human-readable capability diff reports.

It goes beyond aggregate benchmark scores to show which capabilities changed after fine-tuning, and by how much.

## What is it?

When you fine-tune an LLM (e.g., with LoRA, QLoRA, or full fine-tuning), you change its weights and, with them, its capabilities. Validation loss tells only part of that story. `semantic-model-diff` runs both models side by side through a suite of built-in benchmarks to measure shifts in reasoning, instruction following, creativity, and more.

## Key Features

- **10 Core Dimensions:** Evaluates models across ten capabilities:
  - Instruction Following
  - Mathematical Reasoning
  - Creative Variance
  - Reasoning Depth
  - Code Quality
  - Factual Recall
  - Context Retention
  - Safety Adherence
  - Response Conciseness
  - Structured Output
- **Layer-by-Layer Weight Analysis:** Identifies exactly where in the network your adapter or fine-tuning made changes.
- **Statistical Significance:** Robust 95% confidence intervals via bootstrap resampling.
- **Local & Fast:** Runs entirely on your own hardware. With 4-bit quantization, an analysis can fit in 16 GB of CPU RAM.
- **Rich Reporting:** Generates terminal reports, Markdown files, HTML pages, and machine-readable JSON reports.
- **Gradio UI & Docker Integration:** Full Gradio web UI and Docker Compose ready out of the box.
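
The confidence intervals in the feature list can be illustrated with a percentile bootstrap. The sketch below is not the package's implementation. It assumes each prompt yields one numeric score per model, and the function name `bootstrap_mean_diff_ci` and its parameters are hypothetical.

```python
import random
from statistics import fmean


def bootstrap_mean_diff_ci(base_scores, finetuned_scores,
                           n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap CI for the mean paired score difference.

    Hypothetical helper: scores are assumed to be one number per prompt,
    in the same prompt order for both models.
    """
    if len(base_scores) != len(finetuned_scores) or len(base_scores) == 0:
        raise ValueError("expected two non-empty score lists of equal length")

    # One paired difference per prompt (fine-tuned minus base).
    deltas = [t - b for b, t in zip(base_scores, finetuned_scores)]
    n = len(deltas)
    rng = random.Random(seed)

    # Resample the prompts with replacement and record each resample's mean.
    means = sorted(fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples))

    # Read the interval bounds off the sorted resampled means.
    alpha = 1.0 - confidence
    low = means[round((alpha / 2) * (n_resamples - 1))]
    high = means[round((1 - alpha / 2) * (n_resamples - 1))]
    return fmean(deltas), low, high


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    base = [0.40, 0.55, 0.35, 0.60, 0.50, 0.45]
    tuned = [0.70, 0.65, 0.50, 0.80, 0.55, 0.75]
    mean, low, high = bootstrap_mean_diff_ci(base, tuned)
    print(f"mean shift {mean:+.3f}, 95% CI [{low:+.3f}, {high:+.3f}]")
```

An interval that excludes zero suggests the shift in that dimension is unlikely to be resampling noise. One that straddles zero suggests the prompt set is too small or the change too slight to call.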

## Installation

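Install the package from PyPI; it requires Python 3.10 or later. The `full` extra pulls in the dependencies for local analysis and reporting. The Gradio web UI is a separate `ui` extra, so request `semantic-model-diff[full,ui]` if you want it as well. To install with the `full` extra: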

```bash
pip install "semantic-model-diff[full]"
```

## Quick Start

### Command Line Interface

You can run the analysis from the CLI. Point the tool at your Hugging Face base model and your local or remote fine-tuned adapter/model.

```bash
semantic-diff analyze \
  --base Qwen/Qwen2.5-3B \
  --finetuned Qwen/Qwen2.5-3B-Instruct \
  --dimensions instruction_following,mathematical_reasoning,creative_variance,reasoning_depth \
  --tier quick \
  --device cuda \
  --format terminal
```

**Options:**
- `--base`: Base model ID on Hugging Face (e.g., `google/gemma-2-2b`).
- `--finetuned`: Path to local adapter/model, or Hugging Face model ID.
- `--dimensions`: Comma-separated list of capabilities to test.
- `--tier`: Determines the depth of the test (`quick`, `standard`, `comprehensive`).
- `--device`: Target compute device (`cuda`, `cpu`, `mps`).
- `--format`: Output format (`terminal`, `html`, `json`, `markdown`).
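
When scripting several runs, it can help to assemble the command from validated pieces. The following is a minimal sketch that uses only the options listed above. The helper `build_analyze_command` is hypothetical and not part of the package, and it always passes all six options because this README does not state their defaults.

```python
import shlex

# Allowed values as documented in the options list above.
TIERS = ("quick", "standard", "comprehensive")
DEVICES = ("cuda", "cpu", "mps")
FORMATS = ("terminal", "html", "json", "markdown")


def build_analyze_command(base, finetuned, dimensions, tier, device, fmt):
    """Return the argv list for one `semantic-diff analyze` run (hypothetical helper)."""
    if not dimensions:
        raise ValueError("at least one dimension is required")
    if tier not in TIERS:
        raise ValueError(f"tier must be one of {TIERS}")
    if device not in DEVICES:
        raise ValueError(f"device must be one of {DEVICES}")
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    return [
        "semantic-diff", "analyze",
        "--base", base,
        "--finetuned", finetuned,
        "--dimensions", ",".join(dimensions),  # the CLI takes a comma-separated list
        "--tier", tier,
        "--device", device,
        "--format", fmt,
    ]


if __name__ == "__main__":
    cmd = build_analyze_command(
        base="Qwen/Qwen2.5-3B",
        finetuned="Qwen/Qwen2.5-3B-Instruct",
        dimensions=["instruction_following", "mathematical_reasoning"],
        tier="quick",
        device="cpu",
        fmt="json",
    )
    # Print a shell-safe version; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems when a model path contains spaces.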

### Python API

You can also use the library programmatically in your own scripts or Jupyter notebooks:

```python
from semantic_diff import analyze_models

report = analyze_models(
    base_model="Qwen/Qwen2.5-3B",
    finetuned_model="Qwen/Qwen2.5-3B-Instruct",
    dimensions=["instruction_following", "mathematical_reasoning"],
    device="cuda"
)

print(report.summary)
```
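
The same call can be repeated to compare several checkpoints against one base model. This is a sketch under stated assumptions: `compare_checkpoints` is a hypothetical helper, the checkpoint paths are placeholders, and the only parts of the library it relies on are the keyword arguments and the `report.summary` attribute shown above. The analysis function is passed in as an argument so the helper itself does not import the package.

```python
def compare_checkpoints(analyze, base_model, finetuned_models, dimensions, device="cpu"):
    """Run one base-vs-fine-tuned comparison per checkpoint and collect the summaries.

    `analyze` is expected to accept the keyword arguments shown in the
    example above and return an object with a `summary` attribute.
    """
    summaries = {}
    for finetuned in finetuned_models:
        report = analyze(
            base_model=base_model,
            finetuned_model=finetuned,
            dimensions=list(dimensions),
            device=device,
        )
        summaries[finetuned] = report.summary
    return summaries


# Example usage (requires the package, model downloads, and real checkpoint paths):
#
# from semantic_diff import analyze_models
#
# summaries = compare_checkpoints(
#     analyze_models,
#     base_model="Qwen/Qwen2.5-3B",
#     finetuned_models=["./checkpoints/step-500", "./checkpoints/step-1000"],  # placeholders
#     dimensions=["instruction_following", "mathematical_reasoning"],
#     device="cuda",
# )
# for name, summary in summaries.items():
#     print(name, summary, sep="\n")
```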

## Web UI

If you prefer a graphical interface, install the `ui` extra (which provides Gradio) and launch the web UI:

```bash
python -m semantic_diff.ui.app
```

Or start it with Docker Compose, using the compose file from the repository:

```bash
docker-compose up ui
```

## Creating Custom Dimensions (Plugin System)

`semantic-model-diff` is extensible: you can build and register your own evaluation dimensions. See the `examples/custom_dimension.py` script in the repository for a complete example of how to define and load your own rules and evaluators.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request on [GitHub](https://github.com/pratikbhande/Semantic-Diff).
