Metadata-Version: 2.4
Name: astcd
Version: 0.1.0
Summary: AST-driven constrained decoding: generate random, always-valid Python programs
Project-URL: Homepage, https://github.com/PatrickHua/astcd
Project-URL: Repository, https://github.com/PatrickHua/astcd
Author: Tianyu Hua
License: MIT
License-File: LICENSE
Keywords: ast,code-generation,constrained-decoding,fuzzing,grammar,program-synthesis,property-based-testing
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Compilers
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == 'dev'
Provides-Extra: gif
Requires-Dist: pillow>=9; extra == 'gif'
Description-Content-Type: text/markdown

# AST&#8209;CD

**AST-driven constrained decoding — generate random, syntactically valid Python programs.**

![A program with a for-loop assembling itself one legal token at a time](https://raw.githubusercontent.com/PatrickHua/astcd/main/assets/demo.gif)

`astcd` samples programs from a restricted Python grammar by deciding, at every
step, the exact set of tokens that are legal next — so every program it produces
is syntactically valid, scope-correct (no undefined variables), and bounded in
size **by construction**. There's no generate-then-reject loop and no Python
source parser: the token stream *is* a preorder serialization of the AST.

```python
import random
from astcd.vocab import Vocabulary
from astcd.types import DecodeConfig, Start
from astcd.decoder import sample_tokens, weighted_policy, PolicyConfig
from astcd.codec import tokens_to_ast, ast_to_source
from astcd.exec import run

cfg = DecodeConfig()
tok = Vocabulary(cfg)
rng = random.Random(2)

toks = sample_tokens(Start.FUNCTION, rng, cfg, tok, max_len=30,
                     policy=weighted_policy(PolicyConfig(), tok, rng))
fn = tokens_to_ast(toks, tok)
print(ast_to_source(fn))
print("output:", run(fn, (3, 4)).value)
```

```python
def func(a, b):
    d = a
    return ((b * (d + d)) * a)
output: 72
```

## Why

- **Constrained decoding for LLM Python AST code generation.** `Decoder` exposes
  the exact legal-token mask at each step, so a model can generate only programs
  inside the supported AST subset.
- **Random Python code generation for data augmentation.** The same grammar,
  scope, and budget constraints can sample many small, well-formed programs for
  training data or evaluation corpora.
- **No generate-then-reject loop.** Every sampled token stream decodes to
  syntactically valid, scope-safe restricted Python by construction.
- **Safe + deterministic by construction.** A small direct interpreter (no
  `exec`/`compile`) runs programs under integer/step caps and returns a structured
  outcome. Overflow and step-limit cases are reported as data; nothing can import
  or do I/O.

## Install

```bash
pip install astcd            # core (pure standard library, zero dependencies)
pip install "astcd[gif]"     # + Pillow, to export the decode animation as a GIF
```

## The constrained-decoding API

The core object is a `Decoder`: ask it what's legal, advance with any of those
tokens, repeat. That seam is where a random policy, a weighted prior, or a neural
model all plug in identically.

```python
from astcd.decoder import Decoder

d = Decoder(Start.FUNCTION, cfg, tok, max_len=30)
while not d.is_complete():
    legal = d.next_legal_tokens()     # the mask: grammar ∩ scope ∩ depth/size guards
    d.advance(my_policy(legal))       # pick any legal token (rng / weighted / model)
fn = d.function()
```

`next_legal_tokens()` is intersected from four sources: the grammar's FIRST set,
the in-scope variable names (definite assignment), the depth/length guards, and a
min-cost-to-complete check that guarantees the program closes within `max_len`.
It is never empty until the program is complete.

## What it generates

A small, bounded subset of Python over integers:

```
def func(a, b):           # arity 1..N
    d = a * b + 1         # assignments to fresh / existing locals
    if d < 0:             # if / else
        return 0          # return is a statement (early returns OK)
    return d              # ...or fall through to None
```

Optional features (off by default) via `DecodeConfig`: `for v in range(e):` loops
(`allow_loops`), and the size/shape knobs `arities`, `max_block_stmts`,
`max_expr_depth`, `max_control_depth`, and the integer `value_range` /
`const_range`.

## Watch it decode

The most fun way to understand it — animate a program assembling itself one legal
token at a time, holes (`<expr>`, `<stmt>`, ...) filling in:

```bash
python -m astcd.viz                     # animate a random function
python -m astcd.viz --seed 7 --delay 0.3
python -m astcd.viz --uniform           # flat policy (more complex programs)
python -m astcd.viz --loops --max-control-depth 3
python -m astcd.viz --play              # interactive: YOU pick each token from the menu
python -m astcd.viz --gif decode.gif    # export the animation
```

## Status

Early and experimental, but the core is complete and tested (`pytest`):
grammar / vocabulary / scope / interpreter / codec round-trip / decoder soundness /
sampling policies / visualization.

## License

MIT — see [LICENSE](LICENSE).
