Metadata-Version: 2.4
Name: parsely-dip
Version: 0.0.1
Summary: PARSELY-DIP: Deterministic Intent Parser — RegEx and NLP pipeline for intent recognition
Author-email: George Butiri <george@iseestudios.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/gbutiri/parsely-dip
Keywords: nlp,intent,parser,deterministic,regex
Classifier: Development Status :: 1 - Planning
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: stanza>=1.5
Requires-Dist: requests>=2.28
Requires-Dist: python-dotenv>=1.0
Requires-Dist: flask>=3.0

# PARSELY-DIP

**Parsing And RegEx Syntactic Engine with Linguistic Yield — Deterministic Intent Parser**

*Parsely dip for silicon chips.*

A deterministic intent recognition engine that processes natural language through a cascading pipeline — RegEx first, then constituency and dependency parsing via Stanza, then LLM fallback. Each layer only fires if the one above didn't match. The cheapest, fastest layer runs first. The LLM is the last resort, not the default.

Your LLM is expensive, slow, and unpredictable. When a user says "what time is it" or "move the card to done," there is zero ambiguity. A regex handles it in microseconds. An LLM spends tokens guessing what you already know. PARSELY-DIP intercepts deterministic commands before they reach the LLM, executes them directly, and returns the result.

## What It Does

```python
from parsely_dip import parse

result = parse("what time is it")
# result = "14:32"

result = parse("what is the weather like")
# result = "It's 36°F and broken clouds in Cleveland."

result = parse("tell me about quantum physics")
# result = None  (no match — pass to LLM)
```

One call. One input. Response string or None.

## Install

```bash
pip install parsely-dip
```

From source:

```bash
git clone https://github.com/gbutiri/parsely-dip.git
cd parsely-dip
pip install -e .
```

### NLP Layer Setup (Optional)

The RegEx layer works out of the box. The NLP layer requires Stanza and a running parse service.

**1. Download the Stanza English model (~526MB):**

```bash
python -c "import stanza; stanza.download('en')"
```

**2. (Recommended) Download the accurate model with transformer support:**

```bash
python -c "import stanza; stanza.download('en', package='default_accurate')"
pip install transformers sentencepiece
```

The `default_accurate` model uses PEFT fine-tuned transformers (Google Electra Large). The biggest accuracy improvement is in constituency parsing — the core of NLP intent matching. Requires ~1-2GB extra VRAM on a dedicated GPU.

**3. (Recommended) Install PyTorch with GPU support:**

```bash
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
```

Without this, Stanza runs on CPU. With a dedicated GPU (RTX 3060+), parsing is near-instant.

**4. Start the NLP service:**

```bash
python -m parsely_dip.engine.stanza_service
```

The service loads once and stays running. PARSELY-DIP calls it via HTTP on port 5013 for each query that passes the RegEx layer. The service auto-detects the best available model (`default_accurate` > `default`) and reports GPU status on startup.

---

## Three-Tier Pipeline

```
User Input
    |
    v
[RegEx Layer]  — Pattern matching, microseconds, zero dependencies
    |  match? --> handler executes, returns response
    |  no match? --> continue
    v
[NLP Layer]    — Stanza constituency + dependency parsing via HTTP service
    |  match? --> handler executes, returns response
    |  no match? --> continue
    v
[LLM Fallback] — parse() returns None, caller decides what to do
```

### Layer 1: RegEx

Patterns stored in flat `.patterns` text files. One pattern per line. No JSON escaping nightmares.

```
# Format: (regex) => intent_name
# handler: intents/time.py
(what('s|\s+is)\s+the\s+time|what\s+time\s+is\s+it)\?? => tell_time

# handler: intents/weather.py
((what|how)('s|\s+is)\s+the\s+weather(\s+like)?)\?? => tell_weather

# handler: intents/scrum.py
show(\s+me|\s+us)?\s+the\s+(current|active)(\s+scrum)?\s+cards?[.!]? => show_current_card
```

**Pattern convention:** `\s+` goes BEFORE the word it separates, not after.

```
CORRECT: (what('s|\s+is)\s+the\s+time)
WRONG:   (what('s|is\s+)the\s+time\s+)
```

The whitespace attaches to the word that follows it, never trails off the word before it.
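The flat format is simple enough to load in a few lines. A minimal sketch of how such a loader might work (function names mirror the ones listed under `engine/regex.py`, but the matching semantics shown here, full match and case-insensitive, are assumptions):

```python
import re

# Two patterns copied from the examples above, in the flat "(regex) => intent" format.
PATTERNS_TEXT = r"""
# Format: (regex) => intent_name
(what('s|\s+is)\s+the\s+time|what\s+time\s+is\s+it)\?? => tell_time
((what|how)('s|\s+is)\s+the\s+weather(\s+like)?)\?? => tell_weather
"""

def load_patterns(text):
    """Parse the flat pattern format: one '(regex) => intent' pair per line."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        regex, _, intent = line.rpartition("=>")
        patterns.append((re.compile(regex.strip(), re.IGNORECASE), intent.strip()))
    return patterns

def check_regex(prompt, patterns):
    """Return the first intent whose pattern matches the whole prompt, else None."""
    for regex, intent in patterns:
        if regex.fullmatch(prompt.strip()):
            return intent
    return None
```

With this sketch, `check_regex("What's the time?", load_patterns(PATTERNS_TEXT))` resolves to `tell_time`, and an unmatched prompt returns `None` so the pipeline can continue.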

### Layer 2: NLP

Patterns stored in `.json` files. Each pattern defines a grammatical structure using sentence type, POS tags, dependency relations, and head words. Matches on linguistic features, not exact strings — so "what time is it, please?" and "hey, what's the time right now?" both match without needing separate regex patterns.

```json
[
  {
    "intent": "tell_time",
    "nlp": {
      "sentence_type": ["SBARQ", "SQ", "WHNP"],
      "words": [
        {"word": "what", "pos": "DET", "dep": "det", "required": true},
        {"lemma": "time", "pos": "NOUN", "required": true},
        {"lemma": "be", "pos": "AUX", "dep": "cop", "required": true},
        {"word": "it", "pos": "PRON", "dep": "nsubj", "required": true}
      ]
    }
  }
]
```

The NLP layer requires the Stanza service running on port 5013. If the service is not running, the NLP layer is silently skipped and the pipeline falls through to the LLM fallback.
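That fall-through can be implemented as an ordinary timeout-guarded HTTP call. A sketch of the idea (the request and response shapes here are assumptions, not the service's documented wire format):

```python
import json
import urllib.request

STANZA_URL = "http://127.0.0.1:5013/process_syntactic_parsing"

def fetch_parse(prompt, timeout=5):
    """POST the prompt to the local Stanza service and return its parse JSON.
    Returns None if the service is unreachable, so the NLP layer is skipped
    silently and the pipeline continues to the LLM fallback."""
    req = urllib.request.Request(
        STANZA_URL,
        data=json.dumps({"text": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except OSError:  # covers connection refused, timeouts, and URL errors
        return None
```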

### Layer 3: LLM Fallback

`parse()` returns `None`. The caller decides what to do — send to an LLM, show an error, or ignore. PARSELY-DIP does not call any LLM itself.
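On the caller's side the fallback is a plain `None` check. A sketch using stubs (both functions below are placeholders standing in for `parsely_dip.parse` and the caller's own LLM client, not real APIs):

```python
def parse(prompt):
    # Stub standing in for parsely_dip.parse: one hard-coded deterministic match.
    return "14:32" if prompt == "what time is it" else None

def call_llm(prompt):
    # Stand-in for the caller's own LLM client; parsely-dip never calls one itself.
    return f"[LLM answers: {prompt}]"

def answer(prompt):
    result = parse(prompt)          # deterministic pipeline first
    if result is None:              # no match at any layer
        result = call_llm(prompt)   # last resort, the caller's choice
    return result
```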

---

## Intent Handlers

Self-registering via the `@intent` decorator. Import the module, the decorator registers the handler. No config files, no setup step.

```python
from parsely_dip.engine.registry import intent

@intent('tell_time')
def tell_time():
    from datetime import datetime
    now = datetime.now()
    return f"{now.hour:02d}:{now.minute:02d}"
```
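The registry behind the decorator can be as small as a dict. A minimal sketch of the idea (the real `registry.py` may differ; `dispatch` appears in the project structure below, but its signature here is an assumption):

```python
_HANDLERS = {}

def intent(name):
    """Decorator: register a handler function under an intent name."""
    def register(fn):
        _HANDLERS[name] = fn
        return fn
    return register

def dispatch(name, **kwargs):
    """Look up and run the handler for a matched intent; None if unknown."""
    handler = _HANDLERS.get(name)
    return handler(**kwargs) if handler else None

@intent("tell_time")
def tell_time():
    from datetime import datetime
    now = datetime.now()
    return f"{now.hour:02d}:{now.minute:02d}"
```

Because registration happens at import time, merely importing an intent module is enough to make its handlers dispatchable, which is why no config files or setup steps are needed.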

### Built-in Intents

| Intent | File | What It Does |
|--------|------|-------------|
| `tell_time` | `intents/time.py` | Returns current time in 24-hour format |
| `tell_weather` | `intents/weather.py` | Returns weather via OpenWeatherMap API (requires `WEATHER_API_KEY` in `.env`) |
| `show_current_card` | `intents/scrum.py` | Shows active scrum cards from SQLite database |
| `read_current_card` | `intents/scrum.py` | Same data as show, but intended for the LLM to summarize |

### Adding New Intents

1. Create a new file in `intents/` (e.g., `intents/greeting.py`)
2. Write a handler function with the `@intent` decorator
3. Add regex patterns to `patterns/base.patterns`
4. (Optional) Add NLP patterns to `patterns/base_nlp.json`
5. Import the module in `__init__.py`

---

## Project Structure

```
parsely-dip/
  pyproject.toml           — Package config, dependencies
  README.md                — This file
  env_parselydip/          — Virtual environment
  db/                      — Database files (if needed by intents)
  logs/                    — Log files
  tests/                   — Test suite
  src/parsely_dip/
    __init__.py            — parse(prompt) single entry point
    engine/
      registry.py          — @intent decorator, handler registry, dispatch()
      regex.py             — load_patterns(), check_regex()
      nlp.py               — load_nlp_patterns(), check_nlp(), match_nlp_pattern()
      splitter.py          — Sentence splitting (future expansion)
      stanza_service.py    — Stanza NLP Flask service (port 5013)
    intents/
      __init__.py           — Auto-imports all intent modules
      time.py               — tell_time handler
      weather.py            — tell_weather handler (OpenWeatherMap API)
      scrum.py              — show_current_card, read_current_card handlers
    patterns/
      base.patterns         — RegEx patterns (flat text, one per line)
      base_nlp.json         — NLP patterns (structured JSON)
    cli/
      __init__.py           — CLI entry point (future)
```

---

## Hook Integration

PARSELY-DIP is designed to run as a Claude Code `UserPromptSubmit` hook. The hook intercepts the user's message, runs it through the pipeline, and either handles it deterministically or lets the LLM process it.

### Hook Script

```bash
#!/bin/bash
PROJECT_DIR="${CLAUDE_PROJECT_DIR:-.}"
VENV_PY="$PROJECT_DIR/env_parselydip/Scripts/python.exe"
[ ! -f "$VENV_PY" ] && exit 0

"$VENV_PY" -c "
import sys, json
from parsely_dip import parse
data = json.load(sys.stdin)
prompt = data.get('prompt', '')
if prompt:
    r = parse(prompt)
    if r:
        print('=== PARSELY-DIP ===')
        print('Relay this to the user EXACTLY as written, nothing else:')
        print(r)
        print('=== END PARSELY-DIP ===')
" 2>/dev/null
exit 0
```

### How It Works

1. Hook reads the user's prompt from stdin (JSON with `prompt` field)
2. Calls `parsely_dip.parse(prompt)`
3. If result: prints it to stdout (shown to LLM as context, LLM relays verbatim)
4. If None: no output, LLM processes the prompt normally

### Known Limitation

Claude Code's `UserPromptSubmit` hooks cannot display text directly to the user without the LLM firing. The documented `decision: "block"` + `reason` field blocks the prompt but does not render the reason in the VS Code extension (confirmed bug). The current approach uses plain text stdout with exit 0 — the LLM sees the result and relays it.

---

## Stanza NLP Service

The NLP service is a Flask app that wraps Stanford's Stanza NLP library. It runs as a background service on port 5013, loads the model once at startup, and handles parse requests via HTTP.

### Starting the Service

```bash
python -m parsely_dip.engine.stanza_service
```

### What Happens at Startup

1. Tries to load `default_accurate` (transformer-based, best accuracy)
2. If that fails (missing packages), prompts the user to install or continue with standard
3. Falls back to `default` (CharLM-based, solid accuracy)
4. If no model found, prints install instructions and exits
5. Reports GPU status (name of GPU if available, install command if not)

### Service Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/process_syntactic_parsing` | POST | Parse text, return words with POS/dependency/constituency |
| `/debug_parse` | POST | Raw parse data for debugging sentence structure |

### Interactive Mode

```bash
python -m parsely_dip.engine.stanza_service --chat
```

Opens an interactive prompt where you can type sentences and see their constituency trees and dependency relations. Useful for building new NLP patterns.
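For orientation, a Penn-Treebank-style constituency tree for "what time is it" looks roughly like this (the service's exact output format may differ):

```
(ROOT
  (SBARQ
    (WHNP (WDT What) (NN time))
    (SQ (VBZ is) (NP (PRP it)))))
```

The phrase labels (`SBARQ`, `WHNP`, `SQ`) are what the `sentence_type` field in NLP patterns matches against.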

### Security

- Localhost only (127.0.0.1) — rejects non-local requests
- Optional token auth via `STANZA_API_TOKEN` environment variable — enforced if set, skipped if not

---

## NLP Pattern Specification

NLP patterns define grammatical structures that map to intents. Unlike regex (exact string matching), NLP patterns match on linguistic features extracted by Stanza.

### Pattern Structure

```json
{
  "intent": "intent_name",
  "nlp": {
    "sentence_type": "SBARQ",
    "words": [
      {
        "word": "exact_word",
        "lemma": "base_form",
        "pos": "NOUN",
        "dep": "nsubj",
        "head_lemma": "parent_word",
        "required": true
      }
    ]
  }
}
```

### Matching Modes

- **Exact Word Match** — `word` specified: match that exact word in that grammatical position
- **Structural Match (Slot)** — `word` empty: match ANY word with specified POS + dependency features
- **Optional Words** — `required: false`: pattern matches with or without this word
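All three modes fall out naturally from a matcher that only checks the fields a spec actually sets. An illustrative sketch (field names follow the pattern structure above; the token shape and the real `match_nlp_pattern` logic are assumptions):

```python
def word_matches(spec, token):
    """One pattern word spec vs one parsed token: check only the set fields."""
    for key in ("word", "lemma", "pos", "dep", "head_lemma"):
        want = spec.get(key)
        if want and want.lower() != str(token.get(key, "")).lower():
            return False  # a set field disagrees
    return True  # empty fields act as wildcards (structural slots)

def pattern_matches(word_specs, tokens):
    """Every required spec must match some token; optional specs may be absent."""
    for spec in word_specs:
        if spec.get("required", True) and not any(
            word_matches(spec, t) for t in tokens
        ):
            return False
    return True

# Tokens as Stanza might emit them for "what time is it" (shape assumed).
TOKENS = [
    {"word": "what", "lemma": "what", "pos": "DET",  "dep": "det"},
    {"word": "time", "lemma": "time", "pos": "NOUN", "dep": "root"},
    {"word": "is",   "lemma": "be",   "pos": "AUX",  "dep": "cop"},
    {"word": "it",   "lemma": "it",   "pos": "PRON", "dep": "nsubj"},
]

# The tell_time word specs from the example pattern earlier in this README.
TELL_TIME_WORDS = [
    {"word": "what", "pos": "DET", "dep": "det", "required": True},
    {"lemma": "time", "pos": "NOUN", "required": True},
    {"lemma": "be", "pos": "AUX", "dep": "cop", "required": True},
    {"word": "it", "pos": "PRON", "dep": "nsubj", "required": True},
]
```

A spec with no `word` or `lemma` set degenerates to a pure structural slot, and a spec with `required: false` simply never vetoes a match.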

### Supported Values

**Sentence Types:** S, SBARQ, SQ, SINV, FRAG (+ 20 more constituency labels)

**POS Tags (17 Universal):** NOUN, VERB, AUX, ADJ, ADV, PRON, DET, ADP, NUM, PART, CCONJ, SCONJ, INTJ, PROPN, PUNCT, SYM, X

**Dependency Relations (37+):** nsubj, obj, root, det, cop, aux, mark, case, advmod, amod, compound, conj, cc, xcomp, ccomp, advcl, acl, nmod, obl, nummod, appos, dep, fixed, flat, list, parataxis, orphan, goeswith, reparandum, punct, clf, discourse, dislocated, expl, iobj, vocative, csubj

### Specificity Rule

**A loose pattern that matches incorrectly is WORSE than no pattern (LLM fallback).**

Every NLP pattern must be maximally specific. Include all words that disambiguate the intent — articles, pronouns, structural words. If removing a word would cause false positives, that word is required.

---

## Configuration

### .env

```
WEATHER_API_KEY=your_openweathermap_key
STANZA_API_TOKEN=optional_security_token
```

### pyproject.toml Dependencies

```toml
dependencies = [
    "stanza>=1.5",
    "requests>=2.28",
    "python-dotenv>=1.0",
    "flask>=3.0",
]
```

Optional (for `default_accurate` model):
```
pip install transformers sentencepiece
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
```

---

## Requirements

- Python 3.9+
- Stanza 1.5+ (for NLP layer)
- Flask 3.0+ (for NLP service)
- A dedicated GPU is recommended but not required (RTX 3060+ for transformer models)
- The RegEx layer works with zero dependencies beyond the base package

## Target Audience

Linguists and NLP researchers who understand constituency trees, dependency relations, and POS tags. You can run commands and follow instructions, but you should not have to debug import errors or port conflicts. PARSELY-DIP tells you what's wrong and how to fix it.

## Status

v0.0.1 — Core engine built. RegEx pipeline working with time, weather, and scrum card intents. NLP layer ported from Uni with Stanza service (default_accurate with Electra Large transformer, GPU accelerated). Hook integration tested with Claude Code. CLI available via `parsely` command.

## License

Source-available. Personal and development use permitted.

## Author

George Butiri — [george@iseestudios.com](mailto:george@iseestudios.com)
