Metadata-Version: 2.4
Name: touchstone-prover
Version: 0.2.2
Summary: An SMT-based verifier for Python with a machine-checked trust base.
Author: CharlesCNorton
License: MIT
Project-URL: Repository, https://github.com/CharlesCNorton/touchstone
Keywords: verification,smt,z3,cvc5,formal-methods,type-inference,contracts
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: z3-solver==4.16.0
Requires-Dist: cvc5==1.3.4
Dynamic: license-file

# Touchstone

An SMT-based verifier for Python. Touchstone takes a function and a property and returns
**PROVED** (it holds for all inputs), **REFUTED** (with a counterexample), or **UNKNOWN**
(with a reason), by translating the code to Z3 rather than running it.

```python
import touchstone as t

# state the property in Python, over the parameters and `result`
t.prove("def f(x):\n    return x + x\n", "result == 2 * x").status        # 'PROVED'

# or write the contract as decorators on the function itself
t.verify_contracts('''
@require("n >= 0")
@ensure("result == n")
def count(n):
    i = 0
    while i < n:
        i = i + 1
    return i
''').status                                                              # 'PROVED'

# or check two implementations agree on every input
t.verify_equiv("double", "f", "def f(a):\n    return a + a\n",
               "def g(a):\n    return 2 * a\n", {}).status                # 'PROVED'
```

## Command line

The same verbs run from the shell, with the process exit status mirroring the verdict
(0 PROVED, 1 REFUTED, 2 UNKNOWN) so they compose in CI:

```sh
touchstone verify count.py                       # the @require / @ensure contracts written in a file
touchstone prove  f.py --ensures 'result == x'   # a postcondition over the parameters and `result`
touchstone equiv  impl.py spec.py --func f       # two implementations agree on every input
touchstone check  d.py                           # trap freedom (and any asserts) for all inputs
touchstone infer  m.py                           # sound over-approximate types of a return and its locals
```

A refutation comes back with the counterexample and the path it took:

```
$ touchstone prove f.py --ensures 'result == x'
REFUTED  [property via verified VC generator (Rocq-extracted wpg)]
  counterexample: x=0
  trace:
    line 2: return x + 1    [x=0]
    => returns 1
```

## What it covers

Functional equivalence and predicates; whole-function and interprocedural reasoning over
control flow with multiple loops, arbitrary nesting, break and continue, statements after a
loop, and any step direction; self-recursion, mutual recursion, and recursion over lists;
deductive and synthesized loop invariants; abstract interpretation (interval, zone, octagon,
Karr, polyhedra, machine-integer); IEEE-754 floating point total over every double, with Inf
and NaN as first-class inputs, exact floor division and modulo, and sound over-approximations
of sin, cos, exp, and log; arrays with quantified specifications; termination of counted
loops, iteration over containers, and data-dependent loops, and cost; exceptions;
rely-guarantee concurrency for all schedules and all depths; and separation logic with the
frame rule, the magic wand, and inductive heap predicates.

Values carry their real types through the symbolic core; the heap models object identity,
aliasing, and mutation; and index-out-of-bounds, key errors, None in arithmetic, type
mismatches, and division by zero are traps that refute a totality claim. Fixed-width
wraparound is checked alongside every integer proof. A property can be stated in Python over
the parameters and `result` (`prove`), written as `@require` / `@ensure` decorators on the
function (`verify_contracts`, in the style of the contracts and icontract packages), mined from
the code's own assertions (`check`), or written directly as a Z3 predicate; a counterexample
comes back with the execution trace and the path taken (`explain`).

[SUBSET.md](SUBSET.md) specifies the modeled subset precisely, in three tiers of trust, and enumerates the
constructs that return UNKNOWN (async, metaclasses, non-identity decorators, dynamic reflection, and the
rest), so it is clear in advance where a verdict will be withheld.

## Soundness

Every construct is either encoded soundly or returned as UNKNOWN with a reason, so an
unsupported feature is never silently skipped or assumed away. A PROVED is confirmed by a
second independent solver (cvc5) and withheld unless both agree, runs under a deterministic
resource bound so identical input yields an identical verdict on every machine, and carries a
reproducibility certificate. `proof_bundle` exports that certificate as a re-checkable bundle -- the
discharged SMT-LIB refutation queries, the solver versions, the deterministic configuration, and a content
hash -- and `recheck_bundle` (or any SMT solver) re-verifies it independently, trusting the re-run solver
rather than the engine that produced it.

`prove`, `verify`, and `check` establish partial correctness: a PROVED means that on every input
meeting the precondition the function returns without a trap and the postcondition holds, not that it
terminates -- termination is a separate obligation (`verify_total`, or `check --total` for trap freedom
plus termination). When a true property exceeds the default deterministic bound and comes back UNKNOWN,
a larger budget is available (`touchstone ... --budget high`), so that incompleteness is not invisible.

The trust base is machine-checked in Rocq (`proofs/`): the operational semantics of the
modeled subset; the verification-condition generator over it, proven sound and complete for
straight-line assignment and conditionals and sound for while-loops carrying a syntactic
invariant, trap-aware so a reachable division or modulo by zero leaves the condition
undischarged; a fixed-width two's-complement integer model proven to agree with unbounded
arithmetic exactly when no operation overflows; the SMT-LIB division and modulo encoding
proven to refine the theory for every conforming solver; the abstract-domain transfers
(extracted to OCaml and run as the engine's operators); the translation as a
semantics-preserving functor; and the end-to-end theorem that a discharged verification
condition implies the property under the program's semantics. SMTCoq re-checks each integer
obligation's certificate inside Coq's kernel.

Both the straight-line and the loop verification-condition generators are extracted from that
proof to OCaml and transcribed in the engine, and a differential audit holds each transcription
byte-for-byte equal to its extraction on a random corpus, so the generator code that runs is the
one proven correct in Rocq, not a separate symbolic execution. The engine discharges the
loop-free integer fragment through the straight-line generator directly, trap-aware so a
reachable division by zero leaves the condition undischarged. With that core verified, the
random differential checks against CPython are a completeness regression that measures precision
rather than the barrier against an unsound verdict.

## Type inference

The same symbolic core infers the type of every function return, parameter, and variable in
code, in two modes. `infer_types` is over-approximating and sound: the reported set of type
names is guaranteed to contain the value's runtime type (trusting a parameter's annotation as
a contract where one is given), or the location is left UNKNOWN when no such bound can be
established, so a stated type is never narrower than the truth. `emit_facts` is best-effort exact in the TypeEvalPy schema and discovers its own
targets: it walks the module and emits a fact at every return, parameter, and binding it
finds, typing each by carrying an argument's type across the call boundary into the parameter
it reaches, following a value through reassignment and the narrowing of `is None` /
`isinstance` guards, and resolving container element types and dict keys through the call
graph, attributes set in a constructor, decorators, and generators.

Scored by TypeEvalPy's exact matcher at full source position, including column offset and with
the analysis discovering each location, name, and kind on its own, the emitted facts are matched
against the runtime-observed ground truth. As an independent check, the inference is run over
pure-Python standard-library modules and compared against the type CPython produces at runtime:
the sound mode's reported set is confirmed to contain the runtime type at every observed location,
and the exact mode is held to the same emit-and-match standard, discovering the locations itself.

Recall is the fraction of ground-truth facts an emitted fact matches; precision the fraction of
emitted facts that match one. The CPython cross-check knows a runtime type only at the locations its
corpus reached, so its precision is measured over the emitted facts at those locations.
`python -m touchstone.typeeval` runs the CPython cross-check; `python -m touchstone.typeeval PATH`
scores a TypeEvalPy directory.

The commit rate is the fraction of ground-truth locations where the heuristic emits any type rather than
abstaining (the refusal rate is its complement), and the accuracy where committed is the fraction of those
commitments that match. Recall is the product of the two, so it rises either by committing on more
locations or by being right more often where it commits; tracking them apart shows which lever is short.

| Evaluation | Recall | Precision | Commit | Accuracy (committed) |
| --- | --- | --- | --- | --- |
| TypeEvalPy micro-benchmark (emit-and-match) | 812 / 868 (93.5%) | 36.6% | 96.0% | 97.5% |
| TypeEvalPy autogen suite (emit-and-match) | 71,653 / 77,268 (92.7%) | 48.7% | 94.1% | 98.5% |
| CPython standard library (emit-and-match) | 1042 / 1218 (85.6%) | 85.1% | 92.9% | 92.1% |

The sound mode reports a guaranteed bound rather than a single prediction, so it is measured differently:
the commit rate is the fraction of locations where it gives a bound at all, the soundness the fraction of
those whose bound contains the runtime type, and the exact rate where the bound equals the observed type.
`python -m touchstone.typeeval --sound` reproduces it.

| Sound mode (infer_types) | Commit rate | Soundness | Exact |
| --- | --- | --- | --- |
| TypeEvalPy micro-benchmark | 101 / 868 (11.6%) | 100% | 99.0% |
| TypeEvalPy autogen suite | 13,829 / 77,268 (17.9%) | 100% | 98.5% |
| CPython standard library | 54 / 1218 (4.4%) | 100% | 83.3% |

The sound mode commits only where the type is fixed independent of the inputs (a literal, a fixed-return
builtin, an operation over already-bounded values), so its reach is narrower than the heuristic's; but a
committed bound is a proven over-approximation, and the runtime type fell inside every one.

The TypeEvalPy figures are scored against commit `3719de1` of
[TypeEvalPy](https://github.com/secure-software-engineering/TypeEvalPy), and the CPython cross-check ran
on CPython 3.13.2; both the counts and the rates move with the benchmark commit and the standard-library
version, so the tables are snapshots `python -m touchstone.typeeval` reproduces rather than fixed scores.
The autogen suite is generated from templates the heuristic was tuned against and the micro suite is
hand-written, whereas the CPython standard library is independent of both: it is the held-out
measurement, and its lower accuracy is the out-of-distribution result the benchmark numbers do not show
on their own.

## Run

```sh
pip install touchstone-prover  # z3-solver and cvc5, pinned in pyproject.toml
python -m touchstone.ci      # self-tests, soundness audits, completeness regressions -> "CI OK"
python -m touchstone.examples  # one runnable example per capability, each verdict asserted
python -m touchstone.typeeval  # type-inference recall + precision (CPython cross-check)
python -m touchstone         # a demonstration
```

The machine-checked proofs run under the Rocq 9.0 opam switch; `verify_coq.sh` also invokes
the SMTCoq certificate check when its separate toolchain (see `proofs/toolchain.lock`) is
present, and skips it cleanly otherwise:

```sh
eval "$(opam env --switch=rocq9)" && cd proofs && bash verify_coq.sh
```

## Layout

```
touchstone/      package: core, domains, engines, theories, vcgen, audit, ci, examples (_impl is the engine)
proofs/          Rocq + SMTCoq proofs, the extracted VC generators + interval operators, verify_coq.sh
.github/         continuous integration: the audits and the proof gate on every change
pyproject.toml   package metadata and pinned Python dependencies
SUBSET.md        the modeled subset of Python, in three tiers of trust, and what returns UNKNOWN
```

## License

MIT. See [LICENSE](LICENSE).
