Metadata-Version: 2.4
Name: gotoken
Version: 0.1.1
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Summary: Syntax-aware Bash tokenizer — Rust core, Python bindings
Keywords: tokenizer,bash,nlp,llm
Home-Page: https://github.com/ThingAI/gotoken
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# gotoken

Syntax-aware tokenizer for Bash and formal languages, written in Rust with Python bindings.

## Why gotoken?

Standard BPE tokenizers (tiktoken, Hugging Face) fragment Bash constructs such as `2>&1`, `&&`, and `--help` into 4-5 separate tokens, which wastes context-window space and model parameters.

gotoken protects 130+ shell operators, coreutils commands and flags as **atomic single tokens**, then falls back to byte-level encoding for everything else.

## Features

- `grep`, `chmod`, `find`, `2>&1`, `&&`, `||`, `-rf` → single ID, always
- Zero OOV: every byte maps to a fallback ID in `[1000..1255]`
- Perfect round-trip: `decode(encode(s)) == s` guaranteed
- `VOCAB_SIZE = 32768` (power-of-two, Tensor Core aligned)
- Rayon parallel batch encoding, GIL released during tokenization
- Python 3.9+ via PyO3, installable with `pip install gotoken`
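
The first three bullets (atomic tokens, byte fallback, round-trip) can be illustrated with a toy model in pure Python. This is only a sketch under stated assumptions: the atom table, the atom IDs, and the greedy longest-match strategy are hypothetical and are not taken from gotoken's implementation. Only the fallback range `[1000..1255]` comes from the list above.

```python
# Toy illustration only: the atom table, its IDs, and the greedy
# longest-match strategy are assumptions, not gotoken's real internals.
BYTE_BASE = 1000  # fallback range [1000..1255], as stated in the feature list

# Hypothetical protected atoms with made-up IDs below BYTE_BASE.
ATOMS = ["2>&1", "&&", "||", "-rf", "grep", "chmod", "find"]
ATOM_TO_ID = {atom.encode(): i for i, atom in enumerate(ATOMS)}
ID_TO_ATOM = {i: a for a, i in ATOM_TO_ID.items()}
MAX_ATOM_LEN = max(len(a) for a in ATOM_TO_ID)


def toy_encode(text: str) -> list[int]:
    data = text.encode("utf-8")
    ids, pos = [], 0
    while pos < len(data):
        # Try the longest protected atom that starts at this position.
        for length in range(min(MAX_ATOM_LEN, len(data) - pos), 0, -1):
            atom_id = ATOM_TO_ID.get(data[pos:pos + length])
            if atom_id is not None:
                ids.append(atom_id)
                pos += length
                break
        else:
            # No atom matched: emit one fallback ID for this single byte.
            ids.append(BYTE_BASE + data[pos])
            pos += 1
    return ids


def toy_decode(ids: list[int]) -> str:
    out = bytearray()
    for i in ids:
        if BYTE_BASE <= i < BYTE_BASE + 256:
            out.append(i - BYTE_BASE)  # fallback ID back to its byte
        else:
            out += ID_TO_ATOM[i]       # atom ID back to its bytes
    return out.decode("utf-8")


sample = "grep -rf /tmp 2>&1"
assert toy_decode(toy_encode(sample)) == sample
```

Because every byte has a fallback ID, no input is out of vocabulary, and because each ID maps back to exactly one byte sequence, decoding reverses encoding. A syntax-aware tokenizer would presumably also check token boundaries (the toy version would match `grep` inside `egrep`), which this sketch does not attempt.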

## Install

```bash
pip install gotoken       # Python
cargo add gotoken         # Rust
```

## Usage (Python)

```python
from gotoken import GoToken

tok = GoToken()

ids  = tok.encode("grep -rf /tmp 2>&1")
text = tok.decode(ids)
assert text == "grep -rf /tmp 2>&1"

# Parallel batch: the GIL is released while Rayon spreads the work across cores
results = tok.encode_batch(["find /var -name '*.log'", "chmod 755 /bin/app"])
```

## Usage (Rust)

```rust
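use gotoken::encoder::Encoder;

// Wrapped in `main` so the `?` operator has a `Result` to return into.
// Assumes the crate's error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let enc = Encoder::new();
    let ids = enc.encode_str("grep -rf /tmp 2>&1", false)?;
    let text = enc.decode(&ids)?;
    assert_eq!(text, "grep -rf /tmp 2>&1");
    Ok(())
}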
```

## Compression vs tiktoken

| Command | tiktoken tokens | gotoken tokens |
|---------|----------------|----------------|
| `grep -r 'TODO' . 2>&1` | 12 | 7 |
| `chmod 755 /var/www/html` | 10 | 6 |
| `find /home -name '*.log' \| wc -l` | 14 | 8 |
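
To make the table easier to read as a compression figure, the short calculation below turns each row into a relative token reduction. It uses only the counts from the table. The `reduction` helper is a name introduced here for illustration and is not part of gotoken.

```python
# Token counts copied from the table above: (tiktoken, gotoken).
counts = {
    "grep -r 'TODO' . 2>&1": (12, 7),
    "chmod 755 /var/www/html": (10, 6),
    "find /home -name '*.log' | wc -l": (14, 8),
}


def reduction(baseline: int, new: int) -> float:
    """Fraction of tokens saved relative to the baseline count."""
    return 1 - new / baseline


reductions = {cmd: reduction(b, n) for cmd, (b, n) in counts.items()}
for cmd, r in reductions.items():
    print(f"{cmd:<34} {r:.1%}")

# Aggregate over all three commands: 36 baseline tokens vs 21.
total_baseline = sum(b for b, _ in counts.values())
total_new = sum(n for _, n in counts.values())
overall = reduction(total_baseline, total_new)
print(f"overall: {overall:.1%}")
```

On these three commands the per-row reduction falls between 40% and 43%, and the aggregate is about 41.7% (21 tokens instead of 36). Three rows are too few to generalize from, so treat these as illustrative figures rather than a benchmark.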

## License

MIT

