Metadata-Version: 2.4
Name: prolog-reasoner
Version: 0.2.0
Summary: LLM-powered logical reasoning with Prolog - Calculator for logic
Project-URL: Homepage, https://github.com/rikarazome/prolog-reasoner
Project-URL: Repository, https://github.com/rikarazome/prolog-reasoner
Project-URL: Issues, https://github.com/rikarazome/prolog-reasoner/issues
Author-email: rikarazome <rikimarh@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: llm,logic,mcp,neuro-symbolic,prolog,reasoning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: fastmcp>=3.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: all
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

# prolog-reasoner

[![PyPI version](https://img.shields.io/pypi/v/prolog-reasoner.svg)](https://pypi.org/project/prolog-reasoner/)
[![Python versions](https://img.shields.io/pypi/pyversions/prolog-reasoner.svg)](https://pypi.org/project/prolog-reasoner/)
[![CI](https://github.com/rikarazome/prolog-reasoner/actions/workflows/test.yml/badge.svg)](https://github.com/rikarazome/prolog-reasoner/actions/workflows/test.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

SWI-Prolog as a "logic calculator" for LLMs — available as an MCP server and a Python library. Eliminate the black box from LLM logical reasoning.

LLMs excel at natural language but struggle with formal logic. Prolog excels at logical reasoning but can't process natural language. **prolog-reasoner** bridges this gap by exposing SWI-Prolog execution to LLMs.

## Does it help?

On the built-in 30-problem logic benchmark:

| Pipeline | Accuracy |
|----------|----------|
| LLM-only (`claude-sonnet-4-6`) | 22/30 (73.3%) |
| **LLM + prolog-reasoner** | **27/30 (90.0%)** |

The gap concentrates in constraint satisfaction and multi-step reasoning — the combinatorial territory LLMs are weak on and Prolog is strong on. [Full breakdown below.](#benchmark)

## Why it works

LLMs pattern-match; Prolog actually searches and solves. When the LLM writes its problem down as Prolog, two things happen at once:

- Prolog handles the combinatorial work LLMs are weak on — constraint satisfaction, multi-step inference, exhaustive search.
- The reasoning exists as code you can read, re-run, and debug. When it goes wrong, you see the exact Prolog that failed and why.

## Two ways to use it

- **MCP server** — Claude (or any MCP client) calls it as a logic solver during conversation. **Rule bases** let the LLM save stable domain rules once and reference them by name per call.
- **Python library** — full NL→Prolog pipeline with self-correction. Requires OpenAI or Anthropic.

## Features

- **MCP tools**: `execute_prolog` for arbitrary SWI-Prolog execution, plus `list_rule_bases` / `get_rule_base` / `save_rule_base` / `delete_rule_base` for reusable named rule bases (v14)
- **Rule bases**: save stable Prolog rules once (e.g. chess move rules, legal axioms) and reference them by name from `execute_prolog` so the LLM only writes the situation-specific facts per call
- **Transparent intermediate representation**: the Prolog code is the audit trail — inspect, modify, or verify before execution
- **CLP(FD) support**: constraint logic programming for scheduling and optimization
- **Negation-as-failure, recursion, all standard SWI-Prolog features**
- **Library mode**: NL→Prolog translation with self-correction loop (OpenAI / Anthropic)

## Requirements

- Python ≥ 3.10
- [SWI-Prolog](https://www.swi-prolog.org/download/stable) installed and on PATH (≥ 9.0)
- API key for OpenAI or Anthropic — **only for library mode**, not for the MCP server

## Installation

```bash
# MCP server only (no LLM dependencies)
pip install prolog-reasoner

# Library with OpenAI
pip install prolog-reasoner[openai]

# Library with Anthropic
pip install prolog-reasoner[anthropic]

# Both providers
pip install prolog-reasoner[all]
```

## MCP Server Setup

The MCP server exposes five tools — `execute_prolog` runs Prolog code written by the connected LLM, and four rule-base tools manage named, reusable Prolog modules. It does **not** call any external LLM API, so no API key is required.

### Claude Desktop / Claude Code

```json
{
  "mcpServers": {
    "prolog-reasoner": {
      "command": "uvx",
      "args": ["prolog-reasoner"]
    }
  }
}
```

Or, if `prolog-reasoner` is installed directly:

```json
{
  "mcpServers": {
    "prolog-reasoner": {
      "command": "prolog-reasoner"
    }
  }
}
```

### Docker (SWI-Prolog bundled)

Use Docker if you don't want to install SWI-Prolog locally:

```bash
docker build -f docker/Dockerfile -t prolog-reasoner .
```

```json
{
  "mcpServers": {
    "prolog-reasoner": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "prolog-reasoner"]
    }
  }
}
```

### Tool reference

**`execute_prolog(prolog_code, query, rule_bases=None, max_results=100, trace=False)`**
- `prolog_code` — Prolog facts and rules (string)
- `query` — Prolog query to run, e.g. `"mortal(X)"` (string)
- `rule_bases` — optional list of saved rule base names to prepend to `prolog_code` (in order). Use this to reuse stable domain rules across calls without re-sending them
- `max_results` — cap the number of solutions returned (default 100)
- `trace` — when `True`, attach a structured proof tree per solution to `metadata.proof_trace`. Opt-in sub-feature; has performance overhead and does not support CLP(FD), higher-order predicates, or assert/retract.

Returns a JSON object with `success`, `output`, `query`, `error`, and `metadata`.

On success, `metadata` includes `execution_time_ms`, `result_count`, `truncated`, and `rule_bases_used`. When rule bases were requested, `rule_base_load_ms` is also attached (disk I/O timing). On failure, `metadata` also includes `error_category` (one of `syntax_error`, `undefined_predicate`, `unbound_variable`, `type_error`, `domain_error`, `evaluation_error`, `permission_error`, `timeout`, `trace_mechanism_error`, `unknown`) and `error_explanation` — a natural-language hint for the connected LLM (or human) to decide how to fix the Prolog code.
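
As a rough illustration of how a client might act on this response shape, the sketch below branches on `success` and reads the documented `metadata` fields. The helper name `summarize_result` and the sample payload are hypothetical, made up for this example; only the field names and error categories come from the description above.

```python
import json

# Error categories as listed in the tool reference above.
ERROR_CATEGORIES = {
    "syntax_error", "undefined_predicate", "unbound_variable", "type_error",
    "domain_error", "evaluation_error", "permission_error", "timeout",
    "trace_mechanism_error", "unknown",
}

def summarize_result(raw: str) -> str:
    """Turn an execute_prolog JSON response into a one-line summary (hypothetical helper)."""
    result = json.loads(raw)
    meta = result.get("metadata") or {}
    if result.get("success"):
        suffix = " (truncated)" if meta.get("truncated") else ""
        return f"ok: {meta.get('result_count')} result(s){suffix}"
    # Fall back to "unknown" if the category is missing or unrecognized.
    category = meta.get("error_category", "unknown")
    if category not in ERROR_CATEGORIES:
        category = "unknown"
    return f"failed [{category}]: {meta.get('error_explanation', '')}"

# Hypothetical failure payload, for illustration only.
sample = json.dumps({
    "success": False,
    "output": "",
    "query": "mortal(X)",
    "error": "...",
    "metadata": {
        "error_category": "undefined_predicate",
        "error_explanation": "Define human/1 before querying.",
    },
})
print(summarize_result(sample))
```

A connected LLM would typically feed `error_explanation` back into its next attempt; the summary string here is just one way to surface it.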

**Rule base tools** — manage named, reusable Prolog modules under `PROLOG_REASONER_RULES_DIR` (defaults to `~/.prolog-reasoner/rules/`). Names are restricted to `[a-z0-9_-]`, length 1–64.

- **`save_rule_base(name, content)`** — write or overwrite a rule base. Content is syntax-validated (parse-only) before the write; failures surface as `RULEBASE_003`. Returns `{"success": true, "name": ..., "created": bool}` where `created` is `true` on first write, `false` on overwrite. Files over `max_rule_size` are rejected with `RULEBASE_005`.
- **`list_rule_bases()`** — return all saved rule bases with `name`, `description`, and `tags`. Metadata is extracted from leading `% description:` / `% tags:` comments in each file.
- **`get_rule_base(name)`** — return the raw Prolog source of a saved rule base.
- **`delete_rule_base(name)`** — remove a saved rule base.

For name, existence, syntax, and size errors, the tools return `{"success": false, "error": "...", "error_code": "RULEBASE_001"|"RULEBASE_002"|"RULEBASE_003"|"RULEBASE_005"}` rather than raising. I/O failures (`RULEBASE_004`) are propagated as infrastructure errors.
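
A client can pre-check a name and payload before calling `save_rule_base`. The sketch below mirrors the documented constraints (`[a-z0-9_-]`, length 1–64, and the 1 MiB `MAX_RULE_SIZE` default). It is not the server's own validation code: both function names are hypothetical, and measuring size in UTF-8 bytes is an assumption.

```python
import re

# Documented restriction: characters [a-z0-9_-], length 1 to 64.
NAME_RE = re.compile(r"[a-z0-9_-]{1,64}")

# Documented default for MAX_RULE_SIZE (1 MiB).
DEFAULT_MAX_RULE_SIZE = 1_048_576

def is_valid_rule_base_name(name: str) -> bool:
    """True if the whole name matches the documented pattern."""
    return NAME_RE.fullmatch(name) is not None

def fits_size_cap(content: str, max_rule_size: int = DEFAULT_MAX_RULE_SIZE) -> bool:
    """True if content is within the cap (assumption: size is counted in UTF-8 bytes)."""
    return len(content.encode("utf-8")) <= max_rule_size

print(is_valid_rule_base_name("chess_moves"))   # valid: lowercase letters and underscore
print(is_valid_rule_base_name("Chess Moves"))   # invalid: uppercase and space
print(fits_size_cap("piece(knight)."))
```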

**Rule base conventions** — start each rule base file with leading comments that double as `list_rule_bases` metadata:

```prolog
% description: Chess piece movement rules
% tags: chess, games

piece_move(knight, (X1,Y1), (X2,Y2)) :- ...
```

Then reference it by name from `execute_prolog` (this example assumes the rule base was saved as `chess_moves`):

```json
{
  "rule_bases": ["chess_moves"],
  "prolog_code": "position(knight, (4,4)).",
  "query": "piece_move(knight, (4,4), Target)"
}
```

Rule bases also serve as the foundation for domain-specialized forks: ship a curated set (legal axioms, game rules, tax scenarios, etc.) bundled via `BUNDLED_RULES_DIR` as a ready-to-use reasoning package.

## Library Usage

The library exposes `PrologExecutor` (Prolog-only, no LLM) and `PrologReasoner` (NL→Prolog pipeline, needs an LLM API key).

### Execute Prolog directly (no LLM)

```python
import asyncio
from prolog_reasoner.config import Settings
from prolog_reasoner.executor import PrologExecutor

async def main():
    settings = Settings()  # no API key needed
    executor = PrologExecutor(settings)
    result = await executor.execute(
        prolog_code="human(socrates). mortal(X) :- human(X).",
        query="mortal(X)",
    )
    print(result.output)  # mortal(socrates)

asyncio.run(main())
```

### Full NL→Prolog pipeline (requires LLM API key)

```python
import asyncio
from prolog_reasoner import PrologReasoner, TranslationRequest, ExecutionRequest
from prolog_reasoner.config import Settings
from prolog_reasoner.executor import PrologExecutor
from prolog_reasoner.translator import PrologTranslator
from prolog_reasoner.llm_client import LLMClient

async def main():
    settings = Settings(llm_api_key="sk-...")  # from env or explicit
    llm = LLMClient(
        provider=settings.llm_provider,
        api_key=settings.llm_api_key,
        model=settings.llm_model,
        timeout_seconds=settings.llm_timeout_seconds,
    )
    reasoner = PrologReasoner(
        translator=PrologTranslator(llm, settings),
        executor=PrologExecutor(settings),
    )
    translation = await reasoner.translate(
        TranslationRequest(query="Socrates is human. All humans are mortal. Is Socrates mortal?")
    )
    print(translation.prolog_code)
    result = await reasoner.execute(
        ExecutionRequest(prolog_code=translation.prolog_code, query=translation.suggested_query)
    )
    print(result.output)

asyncio.run(main())
```

## Configuration

All settings are read from environment variables with the prefix `PROLOG_REASONER_`. For example, `LLM_API_KEY` below is set as `PROLOG_REASONER_LLM_API_KEY`:

| Variable | Default | Required for |
|----------|---------|--------------|
| `LLM_PROVIDER` | `openai` | library (`openai` or `anthropic`) |
| `LLM_API_KEY` | `""` | library only — leave unset for MCP |
| `LLM_MODEL` | `gpt-5.4-mini` | library |
| `LLM_TEMPERATURE` | `0.0` | library |
| `LLM_TIMEOUT_SECONDS` | `30.0` | library |
| `SWIPL_PATH` | `swipl` | both |
| `EXECUTION_TIMEOUT_SECONDS` | `10.0` | both |
| `RULES_DIR` | `~/.prolog-reasoner/rules` | both (where user-saved rule bases live) |
| `BUNDLED_RULES_DIR` | unset | both (optional — synced into `RULES_DIR` on first startup for shipping default rules with a fork) |
| `MAX_RULE_SIZE` | `1048576` (1 MiB) | both (per-file save cap; `save_rule_base` rejects larger content with `RULEBASE_005`) |
| `MAX_RULE_PROMPT_BYTES` | `65536` (64 KiB) | library only (total budget for the "Available rule bases" prompt section; truncated with a marker when exceeded) |
| `LOG_LEVEL` | `INFO` | both |
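
The prefix rule can be illustrated with a small stdlib-only sketch. The package itself depends on `pydantic-settings`, so this is not how it loads configuration; `collect_settings` and the sample environment (including the `/tmp/rules` path) are hypothetical and only demonstrate the naming scheme.

```python
import os

PREFIX = "PROLOG_REASONER_"

def collect_settings(environ=os.environ) -> dict:
    """Return {SETTING_NAME: raw string value} for every prefixed variable."""
    return {
        name[len(PREFIX):]: value
        for name, value in environ.items()
        if name.startswith(PREFIX)
    }

# Hypothetical environment, for illustration only.
fake_env = {
    "PROLOG_REASONER_EXECUTION_TIMEOUT_SECONDS": "20.0",
    "PROLOG_REASONER_RULES_DIR": "/tmp/rules",
    "PATH": "/usr/bin",
}
overrides = collect_settings(fake_env)

# Fall back to the documented default (10.0) when the variable is unset.
timeout = float(overrides.get("EXECUTION_TIMEOUT_SECONDS", "10.0"))
print(overrides, timeout)
```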

## Benchmark

`benchmarks/` contains 30 logic problems across 5 categories (deduction, transitive, constraint, contradiction, multi-step) to compare LLM-only reasoning vs LLM+Prolog reasoning. The benchmark exercises the **library** path (translator + executor), since it requires the NL→Prolog step.

### Results

Measured on `anthropic/claude-sonnet-4-6`, single run over 30 problems:

| Pipeline | Accuracy | Avg latency |
|----------|----------|-------------|
| LLM-only | 22/30 (73.3%) | 1.7s |
| **LLM + Prolog** | **27/30 (90.0%)** | 3.8s |

Per-category breakdown:

| Category | LLM-only | LLM + Prolog |
|----------|----------|--------------|
| deduction | 6/6 | 6/6 |
| transitive | 6/6 | 5/6 |
| constraint | 3/7 | **6/7** |
| contradiction | 4/4 | 3/4 |
| multi-step | 3/7 | **7/7** |

The gap is concentrated in **constraint** (SEND+MORE, 6-queens, knapsack, K4 coloring, Einstein-lite) and **multi-step** (Nim game theory, 3-person knights-and-knaves, TSP-4, zebra puzzle), exactly the combinatorial, search-heavy territory where symbolic solvers outperform pattern completion. On deduction, transitive, and contradiction questions the LLM is already strong: Prolog adds latency without accuracy gains, and in this run it lost one problem each in transitive and contradiction.

All 3 LLM+Prolog failures were Prolog execution errors from malformed LLM-generated code (missing predicate definitions, unbound CLP(FD) variables) rather than reasoning errors — addressable via prompt tuning. Notably, every failure is inspectable: you can see the exact Prolog that failed and why, rather than a wrong natural-language answer with no explanation.
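
The headline numbers can be re-derived from the per-category table. The snippet below only redoes that arithmetic with the figures copied from the table; it does not run the benchmark.

```python
# (LLM-only correct, LLM + Prolog correct, problems in category), copied from the table.
results = {
    "deduction": (6, 6, 6),
    "transitive": (6, 5, 6),
    "constraint": (3, 6, 7),
    "contradiction": (4, 3, 4),
    "multi-step": (3, 7, 7),
}

total = sum(n for _, _, n in results.values())
llm_only = sum(a for a, _, _ in results.values())
with_prolog = sum(b for _, b, _ in results.values())

print(f"LLM-only:     {llm_only}/{total} ({llm_only / total:.1%})")
print(f"LLM + Prolog: {with_prolog}/{total} ({with_prolog / total:.1%})")

# Per-category change when Prolog is added (positive means more problems solved).
gains = {category: b - a for category, (a, b, _) in results.items()}
print(gains)
```

The net change of five problems comes from +3 in constraint and +4 in multi-step, offset by -1 each in transitive and contradiction.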

### Running it yourself

```bash
# Uses the prolog-reasoner-dev image built in the Development section
docker run --rm -e PROLOG_REASONER_LLM_API_KEY=sk-... \
    prolog-reasoner-dev python benchmarks/run_benchmark.py
```

Results are saved to `benchmarks/results.json`.

## Comparison with other Prolog MCPs

Several Prolog MCP servers exist, each with different design choices. **prolog-reasoner** is intentionally stateless and spot-use — Prolog is a calculator you call when logic matters, not the backbone of your agent's memory.

| | prolog-reasoner | Stateful Prolog MCPs |
|---|---|---|
| Prolog's role | Per-call reasoning tool | Project-wide knowledge base |
| State | Stateless execution (each call independent); optional named **rule bases** for reusable static rules, no inter-call session memory | Persistent sessions / layered KBs |
| Reproducibility | Same input (incl. same rule bases) → same output, always | Depends on accumulated state |
| Integration effort | Use where logic matters, skip where it doesn't | Architectural commitment |
| A/B testable vs LLM-only | Yes (each call is a controlled experiment) | Structurally not comparable |

This is also why accuracy benchmarks are published here and not elsewhere: statelessness is what makes a side-by-side comparison possible.

If you need persistent agent memory, hallucination-safeguarded fact storage, or a full neuro-symbolic substrate, other projects may fit better:

- [adamrybinski/prolog-mcp](https://github.com/adamrybinski/prolog-mcp) — Trealla WASM with save/load sessions
- [umuro/prolog-mcp](https://github.com/umuro/prolog-mcp) — layered KB with file-backed persistence
- [vpursuit/model-context-lab](https://github.com/vpursuit/model-context-lab) — SWI-Prolog with security sandboxing
- [dr3d/prolog-reasoning](https://github.com/dr3d/prolog-reasoning) — neuro-symbolic memory with write-path safety

We're the spot-use option.

## Development

```bash
# Build dev image
docker build -f docker/Dockerfile -t prolog-reasoner-dev .

# Run tests (no API key needed — LLM calls are mocked)
docker run --rm prolog-reasoner-dev

# With coverage
docker run --rm prolog-reasoner-dev pytest tests/ -v --cov=prolog_reasoner

# Or via docker compose
docker compose -f docker/docker-compose.yml run --rm test
```

## License

MIT
