Metadata-Version: 2.4
Name: athanor-ai
Version: 0.2.0
Summary: Formal verification as agentic training signal — CLI + self-hosted runner
Project-URL: Homepage, https://athanor-ai.com
Project-URL: Repository, https://github.com/athanor-ai/athanor-sdk
Author-email: Athanor AI <aidan@athanor-ai.com>
License: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/athanor-ai/athanor-website/main/logo.svg" width="64" height="64" alt="Athanor">
</p>

<h1 align="center">athanor-ai</h1>

<p align="center">
  <strong>Lean 4 proof verification as agentic training signal.</strong><br>
  Turn formal proofs into reward functions. Score agent output with compilers, not judges.<br><br>
  <a href="https://athanor-ai.com">athanor-ai.com</a>
</p>

---

Your agent writes code. Then it writes a proof that the code is correct. The Lean 4 compiler checks the proof. The result is a training signal with no ambiguity.

```python
import athanor

# Verify a Lean 4 proof
result = athanor.verify_proof("""
theorem add_comm (a b : Nat) : a + b = b + a := by
  omega
""")

print(result.compiles)    # True
print(result.has_sorry)   # False
print(result.score)       # 1.0
```

## Install

```bash
pip install athanor-ai
```

## What this solves

You have domain expertise. You know what correct code looks like. You want an AI agent to produce verified solutions, not guesses.

The problem: LLM judges are noisy. Unit tests are brittle. Benchmarks don't produce training signal.

The solution: Lean 4 formal proofs are deterministic, machine-checked, and produce a graded reward signal (full proof = 1.0, partial = 0.35, broken = 0.25).
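
For instance (a quick sketch using `score_proof`, shown in full below; the example theorem is illustrative), the same statement with and without a `sorry` lands on different reward levels:

```python
import athanor

complete = """
theorem two_mul (n : Nat) : 2 * n = n + n := by
  omega
"""

incomplete = """
theorem two_mul (n : Nat) : 2 * n = n + n := by
  sorry
"""

athanor.score_proof(complete)    # 1.0  (compiles, no sorry)
athanor.score_proof(incomplete)  # 0.35 (compiles, but contains sorry)
```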

## Verify proofs

Check if a Lean 4 proof compiles. Detect `sorry` placeholders. Catch banned constructs (`axiom`, `import Mathlib`, `unsafe`).

```python
from athanor import verify_proof, check_sorry, score_proof

# Full verification with detailed result
result = verify_proof(proof_code)
result.compiles      # did it compile?
result.has_sorry     # any incomplete proof markers?
result.sorry_count   # how many sorry placeholders?
result.score         # 0.0 - 1.0
result.status        # "full_proof" | "partial_proof" | "compile_error" | "banned"
result.errors        # compiler error messages

# Quick score (just the float)
score = score_proof(proof_code)  # 1.0, 0.35, 0.25, or 0.0

# Check for sorry without full compilation
has_sorry, count = check_sorry(proof_code)
```

Works with a local Lean 4 installation or via Docker (`ghcr.io/leanprover/lean4`).

## Score agent output

Pair code with a proof. Score both. Use the result as reward.

```python
import athanor

env = athanor.make("my-environment", task="my-task")
env.reset()

result = env.score({
    "kernel.py": agent_code,
    "proof.lean": agent_proof,
})

# Scoring layers:
# 1. Does the code work? (verifier checks)
# 2. Does the proof compile? (Lean compiler)
# 3. Is the proof complete? (no sorry)
print(result.score)        # combined score
print(result.lean_status)  # proof status
```

## Agent retry with verifier feedback

The agent gets the scoring output and tries again. No human in the loop. The verifier feedback is the teacher.

```python
results = env.run(
    model="anthropic/claude-sonnet-4-6",
    api_key="...",
    max_retries=3,
    target_score=0.95,
)
# Attempt 1: 0.35 (code correct, proof has sorry)
# Attempt 2: 0.72 (proof compiles, 2 sorry remaining)
# Attempt 3: 0.98 (full proof, verified)
```
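
The same loop can be hand-rolled around `verify_proof`, feeding compiler errors back into the prompt until the score clears the target. A sketch; `ask_agent` is a placeholder for your own model call, not part of athanor:

```python
import athanor

def prove_with_feedback(task_prompt, max_retries=3, target_score=0.95):
    # `ask_agent` is a placeholder for your own model call.
    proof = ask_agent(task_prompt)
    result = athanor.verify_proof(proof)
    for _ in range(max_retries):
        if result.score >= target_score:
            break
        # The compiler errors are the feedback: hand them straight back to the agent.
        feedback = "\n".join(result.errors)
        proof = ask_agent(f"{task_prompt}\n\nCompiler feedback:\n{feedback}")
        result = athanor.verify_proof(proof)
    return proof, result
```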

## RL training

Use proof scores as reward signal in any RL framework.

```python
import athanor
from trl import PPOTrainer

env = athanor.make("my-environment")
trainer = PPOTrainer(...)  # your usual TRL setup (config, model, tokenizer, ...)

# Each PPO step: score the generated proofs with the Lean verifier and
# hand the scores back to TRL as rewards. The query/response tensors and
# completions come from your generation loop.
rewards = env.reward_fn(completions)
trainer.step(query_tensors, response_tensors, rewards)
```

Compatible with TRL, veRL, NeMo-RL, or any custom training loop.
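
In a custom loop, the same idea reduces to a plain reward function over completions; a minimal sketch using `score_proof`, assuming each completion is the Lean source to check:

```python
import athanor

def reward_fn(completions):
    # One score per completion: 1.0 full proof, 0.35 partial, 0.25 broken, 0.0 banned.
    return [athanor.score_proof(c) for c in completions]
```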

## Proof scoring

```
proof_multiplier:
  1.00  full proof (compiles, no sorry)
  0.35  partial proof (compiles with sorry)
  0.25  broken proof (does not compile)
  0.15  no proof submitted
  0.00  banned construct (axiom, Mathlib, unsafe)
```

Partial proofs produce gradient. An agent that proves 4 of 7 theorems scores higher than one that proves 0. This is the training signal.
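
As a back-of-the-envelope illustration, assuming per-theorem multipliers are simply averaged (an assumption for illustration, not the documented aggregation):

```python
# Hypothetical aggregation: mean of per-theorem proof multipliers.
four_of_seven = (4 * 1.00 + 3 * 0.25) / 7   # ~0.68
zero_of_seven = (7 * 0.25) / 7              # 0.25
```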

## Getting environments

The `verify_proof` and `score_proof` functions work standalone with any Lean 4 code. For full environment scoring (code + proof + property tests), contact [athanor-ai.com](https://athanor-ai.com).

## Requirements

- Python >= 3.9
- Lean 4 or Docker (for proof verification)

## License

Apache-2.0
