Metadata-Version: 2.4
Name: sotis
Version: 1.0.2
Summary: Watches your LLM agent in real time and intercepts meltdowns before they spiral.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.26.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: python-dateutil>=2.9.0
Requires-Dist: openai>=1.25.0
Requires-Dist: anthropic>=0.25.0
Requires-Dist: rich>=13.7.0
Provides-Extra: dev
Requires-Dist: pytest>=8.2.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
Requires-Dist: langchain>=0.2.0; extra == "dev"
Requires-Dist: langgraph>=0.0.60; extra == "dev"
Provides-Extra: langgraph
Requires-Dist: langchain>=0.2.0; extra == "langgraph"
Requires-Dist: langgraph>=0.0.60; extra == "langgraph"
Provides-Extra: obs
Requires-Dist: streamlit>=1.35.0; extra == "obs"

# Sotis

**Sotis watches your LLM agent and catches it before it spirals.**

[![PyPI version](https://badge.fury.io/py/sotis.svg)](https://pypi.org/project/sotis/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)]()
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)]()

```bash
pip install sotis
```

Long-running agents fail in predictable ways — they loop on the same tool calls, flood their context with error traces, and spiral until the task collapses. Sotis detects these failure patterns in real time and transparently resets execution before they take hold.

*Based on ["Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents"](https://arxiv.org/abs/2603.29231) (arXiv:2603.29231, April 2026)*

---

## The Problem

Current AI agents fail predictably under long-horizon execution. As tasks grow longer, agents accumulate error and drift into terminal failure modes:

- **Infinite Loops**: repeating the same tool calls with identical arguments
- **Semantic Spirals**: rephrasing failed queries in the hope of a different outcome
- **Context Poisoning**: flooding history with massive error traces and linter warnings
- **Edit Storms**: making rapid, uncoordinated file edits that never change the result

Frontier models do not fail because they are simple. They fail because long-horizon execution decays their reliability envelope until strategy collapse occurs. Sotis acts as an active runtime stabilizer — monitoring execution, detecting behavioral meltdowns, and transparently resetting context to restore forward progress.
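
To make the first failure mode concrete, here is a minimal sketch of exact loop detection. It is an illustration only, not the Sotis implementation: `call_fingerprint`, `ExactLoopDetector`, and the `window` and `max_repeats` defaults are hypothetical names and values chosen for this example.

```python
import json
from collections import deque


def call_fingerprint(tool_name: str, args: dict) -> str:
    """Canonical key for a tool call: same name and same arguments give the same key."""
    # sort_keys makes the key independent of dict insertion order.
    return f"{tool_name}:{json.dumps(args, sort_keys=True, default=str)}"


class ExactLoopDetector:
    """Flags a loop when the latest call already appears in recent history.

    Hypothetical helper for illustration; not part of the Sotis API.
    """

    def __init__(self, window: int = 5, max_repeats: int = 2):
        self.recent = deque(maxlen=window)  # only the last `window` calls are kept
        self.max_repeats = max_repeats

    def observe(self, tool_name: str, args: dict) -> bool:
        self.recent.append(call_fingerprint(tool_name, args))
        # Count how often the newest fingerprint occurs inside the window.
        return self.recent.count(self.recent[-1]) >= self.max_repeats


detector = ExactLoopDetector()
calls = [
    ("write_file", {"path": "src/main.py", "content": "import math"}),
    ("run_tests", {"cmd": "pytest"}),
    ("write_file", {"content": "import math", "path": "src/main.py"}),
]
flags = [detector.observe(name, args) for name, args in calls]
```

The third call is flagged even though its argument keys arrive in a different order, because the fingerprint is built from a key-sorted JSON dump.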

---

## Usage

```python
from sotis import SotisGuard

guard = SotisGuard()

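# `agent`, `tools`, and `max_steps` are placeholders for your own agent loop.
for step in range(max_steps):
    action = agent.decide()
    result = tools.execute(action)

    # Report every step; a truthy return value signals a detected meltdown.
    meltdown = guard.watch(action.name, action.args, result.summary)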

    if meltdown:
        guard.reset()  # rolls back files, distills context, resumes cleanly
```

### What it looks like in practice

```
[Step 22] write_file -> {"path": "src/main.py", "content": "import math"} | SUCCESS
[Step 23] run_tests  -> {"cmd": "pytest"} | FAIL (ImportError)
[Step 24] write_file -> {"path": "src/main.py", "content": "import math"} | SUCCESS
[Step 25] run_tests  -> {"cmd": "pytest"} | FAIL (ImportError)

[WARNING]   Anomaly detected: Workspace edit storm and exact argument loops
[INTERCEPT] Sotis Meltdown Interception Triggered!
[RECOVER]   Restored workspace files to stable baseline (step 22 diff)
[RECOVER]   Distilled session context history (78% token savings)
[RESUME]    Injecting resumption briefing into agent context...

[Step 26] grep_search -> {"query": "math"} | Execution resumed cleanly
```

---

## Active Stabilization, Not Passive Tracing

Tools like LangSmith, Langfuse, and Helicone log what happened after your agent has already spent $20 looping in production.

Sotis intervenes *during* execution. It intercepts spiraling tool calls, rolls back uncommitted file edits, distills conversation history, and redirects the model's reasoning loop — before the damage accumulates.

---

## Capabilities

| Capability | Description |
|---|---|
| **Meltdown Detection** | Sliding-window Shannon entropy (w=5, H=1.5) + exact loop detection |
| **Workspace Density Guard** | Detects repeated edit cycles on the same file |
| **Transparent Reset** | Git-diff checkpointing + distilled context rebuild (≥60% token savings) |
| **Graceful Degradation** | GDS scoring preserves partial progress across resets |
| **LangGraph Integration** | Native guard node — intercepts state, rolls back files |
| **Document Processing** | PDF, XLSX, Word, CSV support + Jaccard semantic loop detection |
| **LLM Support** | OpenAI, Anthropic, DeepSeek, Google Gemini |
| **Observability** | Streamlit dashboard + structured JSON session logs |
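
The Jaccard-based semantic loop detection listed in the table can be pictured with a short sketch. This is an assumption about how such a check might work, not the code Sotis ships: the tokenizer, the function names, and the `threshold=0.6` default are all hypothetical.

```python
import re


def token_set(text: str) -> set[str]:
    """Lowercased word tokens; a deliberately simple tokenizer for illustration."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_semantic_repeat(query: str, history: list[str], threshold: float = 0.6) -> bool:
    """True when the query overlaps heavily with any earlier query."""
    return any(jaccard(query, past) >= threshold for past in history)


history = ["find import error in main"]
rephrased = is_semantic_repeat("find the import error in main", history)
unrelated = is_semantic_repeat("list files", history)
```

Here the rephrased query shares five of six distinct tokens with the earlier one, so it counts as a repeat, while the unrelated query shares none.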

---

## The Science

Sotis operationalizes the formal reliability framework from *["Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents"](https://arxiv.org/abs/2603.29231)* (arXiv:2603.29231, April 2026).

Sotis directly addresses four key findings from the paper:

**Meltdown Onset Point (MOP)** — the paper quantifies the transition from coherent planning to chaotic looping via sliding-window Shannon entropy. Sotis implements this as a live runtime monitor with a calibrated threshold of H=1.5 bits over a 5-step window.
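
As a rough sketch of the idea, the monitor below computes Shannon entropy over the tool names in the last five steps. The class and function names are hypothetical, and the comparison direction is an assumption made for this example: it treats entropy below the threshold as a sign of looping, since a window dominated by one or two repeated tools has low entropy. The paper and the Sotis source define the actual criterion.

```python
import math
from collections import Counter, deque


def shannon_entropy(actions) -> float:
    """Shannon entropy in bits of the empirical distribution over action labels."""
    counts = Counter(actions)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


class EntropyMonitor:
    """Hypothetical sliding-window monitor; not the Sotis implementation."""

    def __init__(self, window: int = 5, threshold: float = 1.5):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, action: str) -> bool:
        self.window.append(action)
        if len(self.window) < self.window.maxlen:
            return False  # not enough history to judge yet
        # ASSUMPTION: low entropy (few distinct actions) indicates a loop.
        return shannon_entropy(self.window) < self.threshold


monitor = EntropyMonitor()
steps = ["write_file", "run_tests", "write_file", "run_tests", "write_file"]
flags = [monitor.observe(s) for s in steps]
```

With two tools alternating over five steps the window entropy is about 0.97 bits, so the fifth step is flagged. Five distinct tools would give log2(5), roughly 2.32 bits, and pass.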

**Super-linear reliability decay**: agent success rates decay faster than they would if per-step errors were independent, because errors are positively correlated across steps. A confused agent stays confused. Sotis acts as a circuit breaker: restarting from a verified checkpoint breaks the correlation between earlier errors and later steps.

**Episodic memory failures** — the paper demonstrates that naive memory scaffolds universally degrade long-horizon performance by accumulating context overhead. Sotis uses controlled checkpointed resets instead of continuous memory accumulation.

**Graceful Degradation Score (GDS)** — rather than binary pass/fail, Sotis scores partial task completion using weighted subtask graphs, preserving measured progress across reset boundaries.
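
One simple way to picture partial-credit scoring is a weighted fraction of completed subtasks. The sketch below is an assumption for illustration, not the paper's definition or the Sotis implementation: the `Subtask` fields, the `gds` function, and the rule that a subtask earns credit only when its direct prerequisites are done are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    weight: float
    done: bool = False
    depends_on: tuple[str, ...] = ()  # names of direct prerequisite subtasks


def gds(subtasks: list[Subtask]) -> float:
    """Weighted share of credited subtasks, in [0, 1] for non-negative weights."""
    by_name = {s.name: s for s in subtasks}
    total = sum(s.weight for s in subtasks)
    if total == 0:
        return 0.0

    def credited(s: Subtask) -> bool:
        # ASSUMPTION: credit requires the subtask and its direct prerequisites to be done.
        return s.done and all(by_name[d].done for d in s.depends_on)

    return sum(s.weight for s in subtasks if credited(s)) / total


plan = [
    Subtask("parse", weight=1.0, done=True),
    Subtask("fix", weight=2.0, done=True, depends_on=("parse",)),
    Subtask("test", weight=1.0, done=False, depends_on=("fix",)),
]
score = gds(plan)
```

In this example three of four weight units are credited, so the score is 0.75 rather than a flat failure, and that value could be carried across a reset.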

---

## Performance

| Metric | Result |
|---|---|
| Entropy + loop detection latency | < 0.2ms per step |
| Context distillation token reduction | 86.14% |
| Test suite | 127 tests, 88% coverage |
| Live recovery | Verified on circular import and AST recursive loop traps |

Full empirical ledger: [`performance_metrics.txt`](https://github.com/Shaurya-34/Sotis/blob/main/performance_metrics.txt)

---

## Project Structure

```
sotis/
  core/     # Entropy, loop detection, checkpoint, decomposition, GDS
  lib/      # ReAct runtime, LangGraph integration, LLM adapters
  obs/      # Streamlit dashboard + structured JSON logger
  bench/    # Benchmark harness and task generators
```

---

## License

MIT
