Metadata-Version: 2.4
Name: jtoken
Version: 0.2.0
Summary: Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement
Project-URL: Homepage, https://github.com/hermannsamimi/jtoken
Project-URL: Repository, https://github.com/hermannsamimi/jtoken
Project-URL: Issues, https://github.com/hermannsamimi/jtoken/issues
Author-email: Hermann Samimi <hermannsamimi@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: encoding,format,key-value,llm,serialization,text,tokens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.8
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: tiktoken>=0.5; extra == 'dev'
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5; extra == 'tiktoken'
Description-Content-Type: text/markdown

# jtoken

**jtoken** compresses JSON-shaped documents for LLM prompts: same information, fewer tokens, and a lossless round-trip for dicts of supported scalar types.

It is a small Python library and CLI for turning verbose JSON into a compact line-oriented representation, measuring token savings, and working with real-world document dialects such as Elasticsearch hits and MongoDB JSON.

## Why use jtoken

- **Lower prompt cost:** strip JSON punctuation and collapse repeated `true`, `false`, and `null` fields into summary lines.
- **Readable for models:** the output stays human-readable key-value text instead of dense JSON.
- **Lossless for supported data:** nested dicts of supported scalars round-trip unchanged through `encode()` and `decode()`.
- **Production-shaped inputs:** normalize Elasticsearch hits, MongoDB Extended JSON, and Mongo shell literals before encoding.
- **No required runtime dependencies:** the core package is stdlib-only; `tiktoken` is optional for accurate token counts.

## Installation

```bash
# Core package (stdlib only)
pip install jtoken

# Optional: adds tiktoken for accurate token counts
pip install "jtoken[tiktoken]"
```

Python 3.8+ is supported.

## Quick start

```python
import jtoken

data = {
    "user": "alice",
    "age": 30,
    "premium": True,
    "verified": True,
    "is_remote": False,
    "trial": False,
    "score": 9.5,
    "referral": None,
    "last_login": None,
}

text = jtoken.encode(data)
restored = jtoken.decode(text)
assert restored == data
```

`dumps` and `loads` are aliases for `encode` and `decode`, respectively.

## Format overview

**JSON example**

```json
{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
```

**jtoken example**

```text
name: Alice
age: 30
trues: active
falses: verified
nulls: ref
```

### Encoding rules

- Nested dicts are flattened with dot notation.
- Boolean `true` values are collected into a `trues:` line.
- Boolean `false` values are collected into a `falses:` line.
- `null` values are collected into a `nulls:` line.
- Ambiguous strings such as `"90210"` or `"true"` keep quotes so types survive decode.
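
The rules above can be sketched in a few lines of plain Python. This is an illustrative approximation, not jtoken's implementation: `flatten`, `looks_ambiguous`, and `sketch_encode` are hypothetical names, and the comma-separated layout for multiple keys on a summary line is an assumption.

```python
def flatten(data, prefix=""):
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def looks_ambiguous(text):
    """True if a bare string could be mistaken for a number, bool, or null."""
    if text in ("true", "false", "null"):
        return True
    try:
        float(text)
        return True
    except ValueError:
        return False


def sketch_encode(data):
    """Approximate the documented rules; NOT the library's real encoder."""
    lines, trues, falses, nulls = [], [], [], []
    for key, value in flatten(data).items():
        if value is True:
            trues.append(key)
        elif value is False:
            falses.append(key)
        elif value is None:
            nulls.append(key)
        elif isinstance(value, str) and looks_ambiguous(value):
            lines.append(f'{key}: "{value}"')  # keep quotes so the type survives
        else:
            lines.append(f"{key}: {value}")
    # Assumption: multiple keys on a summary line are comma-separated.
    for label, keys in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if keys:
            lines.append(f"{label}: " + ", ".join(keys))
    return "\n".join(lines)


print(sketch_encode({"name": "Alice", "age": 30, "active": True,
                     "verified": False, "ref": None}))
```

For the JSON example above, this sketch prints the same five lines shown in the jtoken example.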

### Supported scalar types

Values may be `str`, `int`, `float`, `bool`, or `None`; a `dict` may nest further dicts of these types.

### Current limitations

- Keys cannot contain `.` (used to join nested paths) or the key-value separator `": "` (a colon followed by a space).
- Reserved top-level keys: `nulls`, `trues`, `falses`.
- Lists are not encoded directly by the core codec; they are normalized into nested dicts with numeric keys before encoding.
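
As a rough illustration of the list handling described above, a list can be rewritten as a dict keyed by position. This is a sketch, not jtoken's normalizer: `lists_to_dicts` is a hypothetical helper, and using string keys such as `"0"` is an assumption about what "numeric keys" means here.

```python
def lists_to_dicts(value):
    """Recursively replace lists with dicts keyed by element position."""
    if isinstance(value, list):
        # Assumption: positions become string keys "0", "1", ...
        return {str(i): lists_to_dicts(item) for i, item in enumerate(value)}
    if isinstance(value, dict):
        return {key: lists_to_dicts(item) for key, item in value.items()}
    return value


doc = {"user": "alice", "tags": ["admin", "beta"]}
print(lists_to_dicts(doc))
```

After this step the document contains only nested dicts and scalars, which is the shape the core codec accepts. Restoring the original lists requires remembering which dicts were lists, which is what the normalization context is for.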

## Normalization and denormalization

Use normalization when the source document is not already a plain JSON object of scalar values.

Supported input dialects:

| `source` | Use when |
|---|---|
| `auto` | Let jtoken detect the input family |
| `json` | Standard JSON |
| `python` | JSON-compatible Python values |
| `mongo_extended` | Extended JSON wrappers such as `$oid` and `$date` |
| `mongo_shell` | Shell literals such as `ObjectId(...)` and `ISODate(...)` |
| `elastic_hit` | Full Elasticsearch hit with `_source` and `fields` |
| `elastic_source` | A document shaped like `_source` only |

Supported output dialects:

| `target` | Result |
|---|---|
| `python` | Python data structures |
| `json` | Standard JSON text |
| `mongo_extended` | Extended JSON wrappers restored from context |
| `mongo_shell` | Shell-style literals restored from context |
| `elastic_hit` | Elasticsearch hit envelope restored from context |
| `elastic_source` | `_source` document only |

### Sidecar context

Mongo shell types, Elasticsearch envelopes, and list positions are stored in a separate normalization context. Keep that sidecar with the encoded text when you need a lossless decode back into the original dialect.

```python
import jtoken

raw_hit = {...}  # placeholder for a full Elasticsearch hit dict (contents elided)
normalized, context = jtoken.normalize(raw_hit, source="elastic_hit")
text = jtoken.encode(normalized)
restored = jtoken.denormalize(
    jtoken.decode(text),
    target="elastic_hit",
    context=context,
)
```

Convenience helpers:

```python
text, context = jtoken.encode_document(raw_hit, source="elastic_hit")
restored = jtoken.decode_document(text, target="elastic_hit", context=context)
```
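
To make the sidecar idea concrete, here is a minimal sketch of how type wrappers could be moved out of a document and restored later. It is not jtoken's context format: `unwrap`, `rewrap`, and the path-to-wrapper mapping are hypothetical, and only the `$oid` and `$date` wrappers are handled.

```python
WRAPPERS = ("$oid", "$date")


def unwrap(value, path="", context=None):
    """Replace {"$oid": "..."}-style wrappers with plain strings.

    Returns (plain_value, context), where context maps dotted paths to the
    wrapper name that was removed at that path.
    """
    context = {} if context is None else context
    if isinstance(value, dict):
        if len(value) == 1:
            (key, inner), = value.items()
            if key in WRAPPERS and isinstance(inner, str):
                context[path] = key
                return inner, context
        plain = {}
        for key, item in value.items():
            child = f"{path}.{key}" if path else key
            plain[key], _ = unwrap(item, child, context)
        return plain, context
    return value, context


def rewrap(value, context, path=""):
    """Inverse of unwrap: restore wrappers recorded in the context."""
    if isinstance(value, dict):
        return {
            key: rewrap(item, context, f"{path}.{key}" if path else key)
            for key, item in value.items()
        }
    if path in context:
        return {context[path]: value}
    return value


doc = {"_id": {"$oid": "507f1f77bcf86cd799439011"}, "name": "alice"}
plain, ctx = unwrap(doc)
assert rewrap(plain, ctx) == doc
```

The plain document is what gets encoded as text; the context travels alongside it. Without the context, the decoded value is still valid but the wrapper information is gone.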

## CLI

```bash
jtoken encode --input-format elastic_hit -f hit.json --context-out hit.ctx.json
jtoken decode --output-format mongo_shell -f hit.jtoken --context-in hit.ctx.json
jtoken stats --input-format json -f document.json
jtoken count --input-format json -f document.json --backend estimate
```

Common flags:

- `-f/--file`: read from a file instead of stdin
- `--input-format`: document dialect for `encode`, `stats`, and `count`
- `--output-format`: document dialect for `decode`
- `--context-out` / `--context-in`: normalization sidecar files
- `--model` and `--backend`: token counting options for `stats` and `count`

## Token measurement

```python
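import jtoken

# `data` is any dict accepted by encode(), such as the one in Quick start.
stats = jtoken.token_savings(data)
print(stats)
# Example output:
# jtoken: 22 tokens | json: 36 tokens | saved: 14 (38.9%)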

count = jtoken.count_tokens(data, backend="estimate")
```

Backends:

| backend | behavior |
|---|---|
| `auto` | use `tiktoken` when installed, otherwise estimate |
| `tiktoken` | require `tiktoken` |
| `estimate` | use a simple character heuristic |
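
For intuition, the arithmetic behind a savings summary and a character-based estimate might look like the sketch below. Both helpers are hypothetical: the four-characters-per-token ratio is a common rule of thumb assumed here, not necessarily the heuristic jtoken uses, and `savings_line` only mirrors the format of the example output above.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate. Assumption: about 4 characters per token."""
    return math.ceil(len(text) / chars_per_token)


def savings_line(jtoken_tokens, json_tokens):
    """Format a summary in the same style as the example output above."""
    saved = json_tokens - jtoken_tokens
    percent = (saved / json_tokens * 100) if json_tokens else 0.0
    return (
        f"jtoken: {jtoken_tokens} tokens | json: {json_tokens} tokens | "
        f"saved: {saved} ({percent:.1f}%)"
    )


print(savings_line(22, 36))
# jtoken: 22 tokens | json: 36 tokens | saved: 14 (38.9%)
```

The percentage is measured against the JSON token count, so 14 tokens saved out of 36 gives 38.9%.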

Install accurate counting with:

```bash
pip install "jtoken[tiktoken]"
```

## API surface

Core codec:

- `encode(data: dict) -> str`
- `decode(text: str) -> dict`

Normalization:

- `parse_input(text, *, source="auto")`
- `normalize(data, *, source="auto", context=None)`
- `denormalize(data, *, target="python", context)`
- `render_output(value, *, target="python")`
- `encode_document(raw, *, source="auto", context=None)`
- `decode_document(text, *, target="python", context)`

Token helpers:

- `count_tokens(data, *, model="cl100k_base", backend="auto")`
- `token_savings(data, *, model="cl100k_base", backend="auto")`

## Exceptions

```text
JPackError
├── JPackEncodeError
├── JPackDecodeError
├── NormalizationError
├── DenormalizationError
└── TokenCountError
```

## Links

- Homepage: https://github.com/hermannsamimi/jtoken
- Issues: https://github.com/hermannsamimi/jtoken/issues

## License

MIT
