Metadata-Version: 2.4
Name: emx-regex-cgen
Version: 0.2.0
Summary: Regex to C code generator
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

**emmtrix Regex-to-C Code Generator (emx-regex-cgen)** compiles regular expressions into portable,
static C code for embedded and performance-critical applications.

[![CI](https://github.com/emmtrix/emx-regex-cgen/actions/workflows/ci.yml/badge.svg)](https://github.com/emmtrix/emx-regex-cgen/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

## Key Characteristics

- **Two backend engines** — choose between a table-driven **DFA** and a
  bit-parallel **NFA** (`--engine dfa`, `--engine bitnfa`, or the default
  `--engine auto` which tries DFA first and falls back to bitnfa when the
  DFA state limit is exceeded).
- **Table-driven DFA** — every regex is compiled to a minimised
  deterministic finite automaton; the generated C code performs a single
  linear scan over the input.
- **Bit-parallel NFA** — the regex is compiled to a Thompson NFA and
  simulated with precomputed bitmasks; the hot loop is fully unrolled
  with no loops over words. The variant (`uint8`, `uint16`, `uint32`,
  or `uint32_t[N]`) is selected automatically based on the number of
  NFA positions.
- **No dynamic memory allocation** — all data is `static const`; no
  `malloc`, no `free`.
- **Branch-free inner loop** — the matching loop contains a single table
  lookup per byte (DFA) or unrolled bitwise operations (bitnfa).
- **re2 feature set** — supports the same subset of regular expressions
  as Google's [re2](https://github.com/google/re2) library (no
  back-references, no look-around).
- **Fullmatch semantics** — the generated function checks whether the
  *entire* input matches the pattern.

## Installation

```bash
pip install emx-regex-cgen
```

The distribution name is `emx-regex-cgen`. The Python import path is
`emx_regex_cgen`.

After installation, you can use the package either as a CLI tool via
`emx-regex-cgen` or as a Python library via
`from emx_regex_cgen import generate`.

## Quick Start

### Python Library

`generate()` returns a `GeneratedCode` object with four parts.
Call `.render()` to get the combined C source string, or access individual
fields to embed them in an existing code-generation pipeline.

```python
from emx_regex_cgen import generate

# generate() returns a GeneratedCode object
result = generate(r"\d{4}-\d{2}-\d{2}")

# .render() assembles the complete C source (same output as before)
print(result.render())

# Access individual parts:
result.includes        # ['stddef.h', 'stdbool.h', 'stdint.h']
result.globals         # static const transition table (and optional maps)
result.match_function  # bool regex_match(const char *input, size_t len) { … }
result.main_function   # int main(…) { … }  — None when emit_main=False

# Include a main() for standalone testing
result = generate(r"[a-z]+", emit_main=True)

# Bytes mode: '.' matches any single byte, classes work on raw byte values
result = generate(r"[\x80-\xff]+", encoding="bytes")

# Use the bit-parallel NFA backend instead of DFA
result = generate(r"hello", engine="bitnfa")
```

#### `generate()` options

```python
generate(
    pattern,
    flags="",
    *,
    emit_main=False,
    prefix="regex",
    encoding="utf8",
    engine="auto",
    row_dedup="auto",
    alphabet_compression="auto",
    size_threshold=8192,
    early_exit=False,
)
```

Library arguments map directly to the generator settings:

| Argument | Type | Default | Meaning |
|----------|------|---------|---------|
| `pattern` | `str` | - | Regular expression to compile. |
| `flags` | `str` | `""` | Regex flags: `i` (ignore case), `s` (dot-all), `m` (multiline), `x` (verbose syntax accepted for compatibility). |
| `emit_main` | `bool` | `False` | Include a standalone `main()` function in the generated C code. |
| `prefix` | `str` | `"regex"` | Prefix for generated C identifiers such as `regex_match`. |
| `encoding` | `str` | `"utf8"` | `"utf8"` for Unicode-aware UTF-8 input, or `"bytes"` for raw byte semantics. |
| `engine` | `str` | `"auto"` | Backend engine: `"auto"` (DFA first, bitnfa fallback), `"dfa"`, or `"bitnfa"`. |
| `row_dedup` | `str` | `"auto"` | DFA only: `"yes"`, `"no"`, or `"auto"` for transition-row deduplication. |
| `alphabet_compression` | `str` | `"auto"` | DFA only: `"yes"`, `"no"`, or `"auto"` for byte equivalence-class compression. |
| `size_threshold` | `int` | `8192` | DFA only: threshold used by `"auto"` for `row_dedup` and `alphabet_compression`. |
| `early_exit` | `bool` | `False` | DFA only: break out of the loop once the dead state is reached. |

Example with multiple options:

```python
from emx_regex_cgen import generate

result = generate(
    r"[a-z]+\d+",
    flags="i",
    prefix="user_id",
    emit_main=True,
    engine="dfa",
    encoding="utf8",
    row_dedup="yes",
    alphabet_compression="auto",
    size_threshold=4096,
    early_exit=True,
)

print(result.render())
```

### CLI

```bash
# Write generated C code to stdout
emx-regex-cgen '[a-z]+\d+'

# Write to a file with a main() function
emx-regex-cgen '[a-z]+\d+' --emit-main -o matcher.c

# Compile and test
gcc -O2 -o matcher matcher.c
./matcher "hello42"   # exit 0 (match)
./matcher "HELLO"     # exit 1 (no match)

# Bytes mode: match any sequence of high bytes
emx-regex-cgen --encoding bytes '[\x80-\xff]+' --emit-main -o byte_matcher.c

# Use the bit-parallel NFA backend
emx-regex-cgen --engine bitnfa 'hello' --emit-main -o bitnfa_matcher.c
```

## CLI Reference

```
usage: emx-regex-cgen [-h] [-o OUTPUT] [--emit-main] [--prefix PREFIX]
                  [--flags FLAGS] [--encoding {utf8,bytes}]
                  [--engine {auto,dfa,bitnfa}]
                  [--row-dedup {yes,no,auto}]
                  [--alphabet-compression {yes,no,auto}]
                  [--size-threshold SIZE_THRESHOLD]
                  [--early-exit {yes,no}] pattern

Generate C code that performs a fullmatch for a regular expression.

positional arguments:
  pattern               Regular expression pattern

options:
  -o, --output          Output file (default: stdout)
  --emit-main           Also emit a main() function (exit 0=match, 1=no match, 2=error)
  --prefix PREFIX       Prefix for all generated C identifiers (default: regex)
  --flags FLAGS         Regex flags: i (case-insensitive), s (dot-all), m (multiline)
  --encoding {utf8,bytes}
                        Input encoding: utf8 (default, Unicode-aware) or bytes
                        (raw byte semantics)
  --engine {auto,dfa,bitnfa}
                        Backend engine: auto (try DFA first, fall back to bitnfa; default),
                        dfa (table-driven minimised DFA),
                        bitnfa (bit-parallel NFA)
  --row-dedup {yes,no,auto}
                        Transition-row deduplication: yes (always), no (never),
                        auto (when table exceeds --size-threshold; default). DFA only.
  --alphabet-compression {yes,no,auto}
                        Alphabet compression into equivalence classes: yes (always),
                        no (never), auto (when table exceeds --size-threshold; default).
                        DFA only.
  --size-threshold N    Table-size threshold (cells = states × 256) for auto mode
                        (default: 8192). DFA only.
  --early-exit {yes,no}
                        Emit early-exit check in DFA loop: yes (break when dead state
                        is reached), no (always process full input; default). DFA only.
```

## Supported Features

### Regex Features

| Feature | Example pattern | DFA | bitnfa |
|---------|-----------------|-----|--------|
| Literal string | `hello` | [literal.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/literal.c) | [literal\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/literal_bitnfa.c) |
| Character class | `[a-z0-9_]+` | [char\_class.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/char_class.c) | [char\_class\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/char_class_bitnfa.c) |
| Negated class | `[^aeiou]+` | [negated\_class.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/negated_class.c) | [negated\_class\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/negated_class_bitnfa.c) |
| Dot (any char except `\n`) | `.+` | [dot.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/dot.c) | [dot\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/dot_bitnfa.c) |
| Alternation | `cat\|dog\|fish` | [alternation.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/alternation.c) | [alternation\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/alternation_bitnfa.c) |
| Star quantifier `*` | `ab*c` | [quantifier\_star.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_star.c) | [quantifier\_star\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_star_bitnfa.c) |
| Plus quantifier `+` | `ab+c` | [quantifier\_plus.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_plus.c) | [quantifier\_plus\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_plus_bitnfa.c) |
| Optional quantifier `?` | `colou?r` | [quantifier\_optional.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_optional.c) | [quantifier\_optional\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_optional_bitnfa.c) |
| Bounded repeat `{m,n}` | `a{2,4}` | [quantifier\_repeat.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_repeat.c) | [quantifier\_repeat\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/quantifier_repeat_bitnfa.c) |
| Digit escape `\d` | `\d{4}-\d{2}-\d{2}` | [escape\_digit.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_digit.c) | [escape\_digit\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_digit_bitnfa.c) |
| Word escape `\w` | `\w+` | [escape\_word.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_word.c) | [escape\_word\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_word_bitnfa.c) |
| Space escape `\s` | `\s+` | [escape\_space.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_space.c) | [escape\_space\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/escape_space_bitnfa.c) |
| Unicode / UTF-8 | `\x{00e9}+` | [unicode.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/unicode.c) | [unicode\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/unicode_bitnfa.c) |
| Anchors `^` / `$` | `^start.*end$` | [anchors.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/anchors.c) | [anchors\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/anchors_bitnfa.c) |
| Word boundary `\b` / `\B` | `\bword\b` | [word\_boundary.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/word_boundary.c) | [word\_boundary\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/word_boundary_bitnfa.c) |
| Unicode property `\p{…}` / `\P{…}` | `\p{Nd}+` | [unicode\_property.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/unicode_property.c) | [unicode\_property\_bitnfa.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/unicode_property_bitnfa.c) |

### CLI Options

| Option | Effect | Golden reference |
|--------|--------|------------------|
| `--flags i` | Case-insensitive matching | [flag\_ignorecase.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_ignorecase.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_ignorecase_bitnfa.c) |
| `--flags s` | Dot matches `\n` (dot-all) | [flag\_dotall.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_dotall.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_dotall_bitnfa.c) |
| `--flags m` | Multiline anchors | [flag\_multiline.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_multiline.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_multiline_bitnfa.c) |
| `--flags x` | Verbose / free-spacing mode | [flag\_verbose.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_verbose.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/flag_verbose_bitnfa.c) |
| `--encoding bytes` | Raw byte semantics (no UTF-8) | [encoding\_bytes.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/encoding_bytes.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/encoding_bytes_bitnfa.c) |
| `--prefix NAME` | Custom identifier prefix | [prefix.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/prefix.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/prefix_bitnfa.c) |
| `--emit-main` | Include standalone `main()` | [emit\_main.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/emit_main.c) · [bitnfa](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/emit_main_bitnfa.c) |
| `--alphabet-compression yes` | Byte equivalence-class compression | [alphabet\_compression.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/alphabet_compression.c) |
| `--row-dedup yes` | Transition-row deduplication | [row\_dedup.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/row_dedup.c) |
| `--early-exit yes` | Break DFA loop on dead state (early exit) | [early\_exit.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/early_exit.c) |

### Bit-NFA Variants

The `bitnfa` engine automatically selects the narrowest integer type
that can hold all NFA positions:

| Variant | NFA positions | Golden reference |
|---------|---------------|------------------|
| `uint8_t` | ≤ 8 | [bitnfa\_uint8.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/bitnfa_uint8.c) |
| `uint16_t` | ≤ 16 | [bitnfa\_uint16.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/bitnfa_uint16.c) |
| `uint32_t` | ≤ 32 | [bitnfa\_uint32.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/bitnfa_uint32.c) |
| `uint32_t[N]` | > 32 | [bitnfa\_uint32\_array.c](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/tests/golden/bitnfa_uint32_array.c) |

## Generated Code Structure

### DFA Engine (default)

The generator produces:

1. An optional **alphabet map** (`static const uint8_t regex_alphabet[256]`)
   mapping each byte to its equivalence class (emitted when alphabet
   compression is active).
2. A **transition table** (`static const uint8_t regex_transitions[M][C]`)
   mapping `(row, column) → next_state`.  Dimensions depend on active
   optimisations: *M* is the number of unique rows (with row dedup) or
   states; *C* is the number of equivalence classes (with alphabet
   compression) or 256.
3. An optional **row map** (`static const uint8_t regex_row_map[N]`)
   mapping `state → row index` (emitted when row deduplication is active).
4. A **match function** with the signature:
   ```c
   bool regex_match(const char *input, size_t len);
   ```
5. Optionally a **`main()` function** for standalone executables.

#### Example Output (pattern `hello`, `--alphabet-compression yes --row-dedup yes`)

```c
static const uint8_t regex_alphabet[256] = { /* byte → class */ };

static const uint8_t regex_transitions[6][5] = {
    /* states 0, 6 */ { 0 },
    /* state 1 */     { [2] = 4 },
    ...
};

static const uint8_t regex_row_map[7] = { 0, 1, 2, 3, 4, 5, 0 };

bool regex_match(const char *input, size_t len) {
    uint8_t state = 1;
    for (size_t i = 0; i < len; i++) {
        state = regex_transitions[regex_row_map[state]][regex_alphabet[(unsigned char)input[i]]];
    }
    return state >= 6;
}
```

### Bit-NFA Engine (`--engine bitnfa`)

The generator produces:

1. A **transition mask table**
   (`static const uint16_t regex_trans[P][256]`) where *P* is the number
   of NFA positions.  Each entry is a bitmask of destination positions
   for the given (source position, input byte) pair.  The integer width
   (`uint8_t` / `uint16_t` / `uint32_t` / `uint32_t[N]`) is chosen
   automatically.
2. A **match function** with a fully unrolled inner loop — one
   `if (state & bit) next |= table[pos][b];` per active position.
3. Optionally a **`main()` function** for standalone executables.

#### Example Output (pattern `hello`, `--engine bitnfa`)

```c
static const uint16_t regex_trans[10][256] = {
    /* position 0 */ { ['h'] = 0x0006u },
    /* position 1 */ { 0 },
    /* position 2 */ { ['e'] = 0x0018u },
    ...
};

bool regex_match(const char *input, size_t len) {
    uint16_t state = 0x0001u;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)input[i];
        uint16_t next = 0;
        if (state & 0x0001u) next |= regex_trans[0][b];
        if (state & 0x0004u) next |= regex_trans[2][b];
        if (state & 0x0010u) next |= regex_trans[4][b];
        if (state & 0x0040u) next |= regex_trans[6][b];
        if (state & 0x0100u) next |= regex_trans[8][b];
        state = next;
    }
    return (state & 0x0200u) != 0;
}
```

## Testing

Tests are parameterised from `re2_compat_results.json`, which contains
2 500+ patterns extracted from the PCRE2 test suite and validated against
Google re2.

```bash
# Run all tests (parallel)
pytest -n auto -q

# Run linter
ruff check src/ tests/
```

### Test Strategy

1. **Generate** C code from the regex pattern.
2. **Compile** with `gcc -O2`.
3. **Execute** the binary with each test subject as `argv[1]`.
4. **Compare** the exit code against the expected match/no-match result.

## Development

```bash
# Clone
git clone --recurse-submodules https://github.com/emmtrix/emx-regex-cgen.git
cd emx-regex-cgen

# Install
pip install emx-regex-cgen

# Or, for local development
pip install -e ".[dev]"

# Test
pytest -n auto -q

# Lint
ruff check src/ tests/
```

## License

[MIT](https://github.com/emmtrix/emx-regex-cgen/blob/v0.2.0/LICENSE) — Copyright (c) 2026 emmtrix Technologies GmbH

## Maintained by

[emmtrix Technologies GmbH](https://emmtrix.com)
