Metadata-Version: 2.4
Name: esrapy
Version: 1.1.post1
Summary: A parsing library for Python
Author-email: Simon <simonfunk@gmail.com>
Description-Content-Type: text/markdown
Project-URL: Homepage, https://sifter.org/~simon/esrapy/

# Esrapy — Parsing in Python

*esrapy is yparse backwards*

Esrapy is a parsing library written entirely in Python. It can parse pretty
much any context-free grammar, including left-recursive ones; can compile
patterns (grammars) from textual or procedural descriptions; unifies parsing
and tokenizing so tokenization can be contextual; has support for precedence
(ambiguity resolution) and attributes (limited context-sensitivity); can
return "first match" or a forest of all possible parsings (for an ambiguous
grammar); is very easy to use and results in very readable applications; and
is only about 600 lines of code (not counting comments) — or only 450 if you
don't need the textual compiler.

Esrapy works as a simple translator from a raw source text to a friendly
parse tree, with emphasis on making that tree very easy to traverse.

Esrapy is not terribly fast and may be a bit of a memory hog while running.
It's probably most useful for language prototyping or small interactive
applications where generality and ease of use are more important than speed.
The parsing method is somewhere between chart parsing and packrat recursive
descent.

See also: [pyparsing](https://pyparsing-docs.readthedocs.io/) and
[spark](http://pages.cpsc.ucalgary.ca/~aycock/spark/).


## Quick example

Here is a complete program using esrapy that implements a simple infix
calculator with operator precedence:

```python
import esrapy

# Here's a Pattern that parses expressions like "a = b*5+(2^c)":
pattern = esrapy.compile(r"""
SPACE = <[ \t]*>
expr  = name:<[a-zA-Z_]+> op:'=' b:expr
        |1; a:expr@1 op:<[+-]> b:expr@2
        |2; a:expr@2 op:<[*/]> b:expr@3
        |3; a:expr@3 op:'^'    b:expr@4
        |4; name:<[a-zA-Z_]+> op:\var
        |    val:<[0-9]+>     op:\int
        | '(' _:expr ')'
""")

vars = {}
def eval_pt(pt):
    if pt.op == 'int': return int(pt.val)
    if pt.op ==   '+': return eval_pt(pt.a)  + eval_pt(pt.b)
    if pt.op ==   '-': return eval_pt(pt.a)  - eval_pt(pt.b)
    if pt.op ==   '*': return eval_pt(pt.a)  * eval_pt(pt.b)
    if pt.op ==   '/': return eval_pt(pt.a)  / eval_pt(pt.b)
    if pt.op ==   '^': return eval_pt(pt.a) ** eval_pt(pt.b)
    if pt.op == 'var': return vars[pt.name]
    if pt.op ==   '=':
        val = eval_pt(pt.b)
        vars[pt.name] = val
        return val

while True:
    try:
        s = input("Expr: ")
    except EOFError:
        break
    try:
        pt = pattern.match(s, compact=True)
    except SyntaxError:
        print("Sorry, couldn't parse that.")
        continue
    print("Value:", eval_pt(pt))
```

See the `examples/` directory for more, including `calc.py` which steps
through the evolution of a pattern description, and `maxicalc.py` which
demonstrates the visitor interface.


## Getting started

See the examples directory for an implicit tutorial. `calc.py` in particular
steps through the evolution of a Pattern description.

From within Python, you can also `import esrapy` and run `help(esrapy)` to
get an overview. The sub-modules (`esrapyCore`, `esrapyPatterns`, etc.) also
have `help()` content if you want to dig into the internals or use the
procedural interface.


## Pattern syntax

Patterns (grammars) are specified textually and compiled with:

```python
pattern = esrapy.compile(textSpec)
```

Note `textSpec` is the definition of a pattern, not the text you will
eventually parse with it. To compile and immediately parse:

```python
parsed = esrapy.compile(textSpec).match(sourceText, compact=True)
```

The text spec is a series of definitions of the form `name = pattern` where
`name` is any alphanumeric name (underscores ok), and `pattern` is any of
those defined below. The name must be at the beginning of the line — no
indentation is allowed. Indented lines are treated as continuations of the
previous definition (no `\` needed or allowed). Blank lines are ignored, as
is anything after a `#`.

See `examples/esrapy.pat` for a sample pattern specification, which also
happens to describe the syntax of patterns itself (i.e., `esrapy.pat` parses
itself). See also `calc.py` and `calc*.pat` for a series of progressively
richer ways of defining the same pattern.

A pattern can be one of:

- **A named pattern reference**, as in `foo`, where `foo` is defined
  elsewhere in the pattern file.

- **A literal string** enclosed in `''` or `""`, as in `"foo"` or `'foo'`.

- **A regular expression** (see the `re` module) enclosed in `<>` or
  `` ` ` ``, as in `<[a-z]+>` or `` `[a-z]+` ``. A group may be selected
  as the value with the `@` operator, as in `<un-([a-z]+)>@1`. The default
  group (if no `@` is given) is 0, i.e. the entire matched string.

- **A sequence** of patterns, as in `foo bar baz` or `'foo' bar <[baz]>`.
  The returned value is a simple array of the values of each pattern.
  Items may be labeled: `a:foo b:bar c:<[baz]>`. If only a single item is
  labeled and that label is `_`, that item is returned directly as the value
  of the sequence (passthrough) with all other items discarded. If multiple
  items are labeled (or a single item labeled with any name other than `_`),
  the return value is a `MetaDict` of those labeled items (see `esrapyMisc.py`),
  which exposes all entries as attributes so `foo["bar"]` can be written
  `foo.bar`. See `minicalc.py` for an example.

- **A set of alternates** separated by `|`, as in `foo | bar | baz`. The
  returned value is the value of whichever pattern succeeds. Alternates are
  tried in order, so the first takes priority, but due to backtracking even
  the first match may not be the one ultimately selected. Items may be
  labeled using `;` as in `a;foo|bar|c;baz`. If a labeled item is selected,
  the return value is a 2-tuple `(label, value)`. An integer precedence may
  also be specified, applying to all successive items:

  ```
  set    = 1,a; foo | 2; bar | 3,c; baz | d; zorp
  subset = set@3    # Equivalent to: c; baz | d; zorp
  inUse  = "yo" set@2 <[a-z]+>
  ```

  This is most useful for binary operator precedence. See `minicalc.py`.

- **A repetition** specified with `*`, `+`, `?`, `{n}`, or `{n,m}`, where
  `{n,m}` means from n to m copies, `{n}` means n or more, and `*`, `+`,
  `?` correspond to `{0}`, `{1}`, `{0,1}`. The return value is always an
  array. A separating pattern may be specified after a `/`, as in `foo/bar+`,
  which matches `foo`, `foo bar foo`, `foo bar foo bar foo`, etc. The
  separator is omitted from the return value. Useful for comma-separated
  lists: `params = varName/","*`.

- **A keyword** inserted into the return stream with no effect on parsing,
  specified with a leading `\`, as in `\key`. For example:
  ```
  time = hour:int ":" minute:int | "noon" hour:\12 minute:\00
  ```

- **An indented block** of equally-indented lines, specified as
  `>> pattern`, where `pattern` begins with non-whitespace and ends at
  end of line or end of file. Most useful for the top level of
  indentation-sensitive syntaxes.

- **A header line followed by an indented block**, specified as
  `head >> pattern`. `head` matches something like the head of an `if` or
  `while` block. Note that the default `SPACE` pattern is incompatible with
  block/indent because it causes `pattern` to start with whitespace.

- **Grouping** with `()`, as in `foo (bar | baz) zorp`.

### Summary by example

```
foo bar baz    a sequence of patterns, returns a list
foo|bar|baz    a set of alternates, returns one item
foo*           zero or more, returns a list
foo+           one or more, returns a list
foo?           zero or one (optional), returns a list
foo{2,5}       2 to 5 copies, returns a list
foo/bar+       one or more foo's separated by bar; returns foos
foo/bar{5}     5 or more foo's separated by bar; returns foos
\foo           matches empty string, returns the literal "foo"
"foo"          a literal, returns itself
'foo'          ditto
<foo>          a regular expression, returns a string
`foo`          ditto
<f(oo)>@1      matches regex, returns the group "oo"
a:foo b:bar    sequence returning MetaDict {a:foo, b:bar}
_:foo bar      sequence returning only foo (passthrough)
a;foo|b;bar    set returning ("a", foo) or ("b", bar)
1;foo|2;bar    set annotated with precedence
foo@5          subset of foo with precedence >= 5
(foo|bar)baz   equivalent to: foo baz | bar baz
>> foo         a block of equally indented lines matching foo
foo >> bar     line matching foo followed by indented block of bar
```


## Visitor interface

The traditional way to use esrapy is to call `pattern.match()` and work
with the result — a nested structure of strings, lists, and `MetaDict`s.
This works well for simple cases, but for non-trivial grammars a visitor
is cleaner for building ASTs or evaluating parse trees.

Leave `compact=False` (the default) in `match()` to get `Match` objects
directly, then walk the tree with a visitor whose methods are dispatched
by pattern name:

```python
m = pattern.match(sourceText)
result = m.visit(myVisitor)
```

`visit()` dispatches to a method named `on_<patternname>` on the visitor.
If no matching `on_*` method exists, `visit()` applies default structural
behavior for the matched pattern type and returns the result directly.

### The `parts` parameter

If a handler declares a parameter named `parts`, `visit()` automatically
calls the default structural decomposition and passes it as `parts=`,
saving the handler from doing so explicitly. The type of `parts` depends
on the pattern:

- **Sequence with multiple named children:** `parts` is a `MetaDict`:
  ```python
  def on_binop(self, m, parts):
      return parts.a + parts.b   # parts.a, parts.b already visited
  ```
- **Sequence with sole `_` child:** `parts` is the single visited child value.
- **Regex (terminal):** `parts` is the matched string:
  ```python
  def on_number(self, m, parts):
      return int(parts)
  ```
- **NtoM (repetition):** `parts` is a list of visited children.
- **Block:** `parts` is a list of visited statements.
- **Indent:** `parts` is `MetaDict(head=..., body=...)`.

In all cases, child matches within `parts` have already been visited
recursively before the handler is called, so handlers are typically
just one or two lines:

```python
def on_assign(self, m, parts):
    vars[parts.name] = parts.val
    return parts.val

def on_binop(self, m, parts):
    op = parts.op
    if op == '+': return parts.a + parts.b
    if op == '-': return parts.a - parts.b
    if op == '*': return parts.a * parts.b
    if op == '/': return parts.a / parts.b
    if op == '^': return parts.a ** parts.b

def on_number(self, m, parts):
    return int(parts)

def on_var(self, m, parts):
    return vars[parts]
```

Omit `parts` when you need to control or defer the default recursion:

```python
def on_with_stmt(self, m):
    self.context.push(...)
    parts = m.visit_parts(self)
    self.context.pop()
    return parts.body
```

### Default behavior by pattern type

When no handler exists for a named pattern, or when `visit_parts()` is
called explicitly:

| Pattern type | Default behavior |
|---|---|
| Regex | Returns the matched string (`m.value`) |
| Sequence (sole `_` child) | Returns that child's visited value (passthrough) |
| Sequence (multiple named children) | Returns `MetaDict` of visited named children; unnamed children discarded |
| Sequence (no named children) | Returns list of all visited children |
| OneOfSet | If the winning alternative has a string label, dispatches to that handler; otherwise recurses into the winning alternative |
| NtoM | Returns list of visited children |
| Block | Returns list of visited children |
| Indent | Returns `MetaDict(head=visited_head, body=visited_body)` |

### Labeled alternatives and OneOfSet

When alternatives are labeled with string names in the grammar:

```
expr = 1,binop; a:expr@1 op:<[+-]> b:expr@2
     | 2,binop; a:expr@2 op:<[*/]> b:expr@3
     | var
     | number
```

`visit()` detects which labeled alternative matched and dispatches to
`on_binop`, `on_var`, `on_number` etc. directly. A label can be shared
across multiple alternatives — the same handler is called regardless of
which precedence level matched. Numeric precedence and string label can be
combined as `N,name;`.

### Navigating children directly: `m.children`

For handlers that need to navigate the `Match` tree directly:

```python
def on_binop(self, m):
    c = m.children
    a  = c.a.visit(self)
    op = c.op.value
    b  = c.b.visit(self)
    return ...
```

`m.children` returns a `MatchView` with attribute-style access to named
children, recursing transparently through anonymous intermediate nodes.
Use `c.get('name', default)` for optional children. Iterating `m.children`
yields all named children in order.

`m.value` returns the raw first parsing value — for terminal `Regex` matches
this is the matched string.

### Tips for clean visitor dispatch

1. Give every semantically meaningful alternative its own named rule or a
   string label within a set.
2. Label all sequence items you need to access by name. Unlabeled items are
   treated as punctuation and discarded in the `parts` `MetaDict`.
3. Use `_` as the label for a sole passthrough item in a sequence:
   `paren = '(' _:expr ')'`
4. For if/elif/else chains, fold the chain while handling the enclosing
   block rather than in a separate post-processing pass.

See `examples/maxicalc.py` for a complete working example of a visitor-based
expression evaluator.


## How it works

Esrapy is essentially a recursive-descent parser with two key additions:

- **Chart parsing:** If esrapy finds itself looking for something it's
  already looking for (such as when a pattern is recursively composed of
  itself), it doesn't start looking again — it just remembers the context
  and notifies it if the pattern is later found. This handles left-recursive
  grammars.

- **Packrat parsing:** If esrapy finds itself looking for something it has
  already found, it uses its cached result rather than re-parsing. This
  prevents exponential blowup on ambiguous grammars.

That's pretty much the whole of it.


## Changes

**Version 1.1 (2026-04-01)**
- Added `custom` param to the esrapy Pattern language compiler to allow
  injection of custom Pattern objects/classes into otherwise compiled
  Patterns.

**Version 1.0 (2026-03-23)**
- Removed `compact()` code (default visitor provides ~identical functionality).
- Defaulting `compact=False` in `match()` (*** API CHANGE ***)

**Version 0.8.3 (2026-03-23)**
- `Match.visit()` now accepts `None` for the visitor, falling back to the
  same behavior as `compact(all=False)`.

**Version 0.8.2 (2026-03-21)**
- Misc tweaks.

**Version 0.8.1 (2026-03-16)**
- Revised Sequence single-label passthrough: `_` is now required to trigger
  passthrough (previously any single label triggered it).
- Revised `Match.visit()` to apply default structural behavior per pattern
  type when no handler exists, rather than calling `on_default()`.
- Added `Match.dispatch()`, `Match.visit_parts()`, and `visit_parts()` to
  all Pattern subclasses.
- Labeled `OneOfSet` alternatives now require an explicit handler.

**Version 0.8 (2026-03-15)**
- Added visitor pattern support. Call `pattern.match(compact=False)` to get
  `Match` objects, then use `m.visit(visitor)` to dispatch to handler methods
  by pattern name.
- Added `Match.children` (`MatchView` with attribute-style access to named
  children).
- Added `Match.value` property for terminal match strings.
- Extended `OneOfSet` alternative labels to support string names combinable
  with integer precedences as `N,name;`.

**Version 0.7 (2023-03-20)**
- Added `firstLine` parameter to `match()`/`parse()` for better error messages.
- Incorporated `esrapyIndent` for indentation-sensitive syntaxes; added `>>`
  operator for indented blocks.

**Version 0.6.1 (2006-03-25)**
- Added catch of infinite repetitions of nil patterns in `NtoM`.
- Syntax errors now raise `SyntaxError` instead of `ValueError`.

**Version 0.6 (2006-03-16)**
- Added `SPACE` pattern for implicit whitespace.
- Added skip expression to `Regex` Pattern.

**Version 0.5 (2006-03-15)**
- Initial release.

