Metadata-Version: 2.4
Name: jtoken
Version: 0.3.3
Summary: Compress JSON-shaped documents for LLM prompts with normalization, CLI, and token measurement
Project-URL: Homepage, https://github.com/hermannsamimi/jtoken
Project-URL: Repository, https://github.com/hermannsamimi/jtoken
Project-URL: Issues, https://github.com/hermannsamimi/jtoken/issues
Author-email: Hermann Samimi <hermannsamimi@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: encoding,format,key-value,llm,serialization,text,tokens
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.8
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: tiktoken>=0.5; extra == 'dev'
Provides-Extra: tiktoken
Requires-Dist: tiktoken>=0.5; extra == 'tiktoken'
Description-Content-Type: text/markdown

# jtoken

**Author:** Hermann Samimi

**jtoken** compresses JSON-shaped documents for LLM prompts: fewer tokens, readable line-oriented output, lossless round-trip. Pass the contents of a file as a string, or pass a dict or list directly; the input format is detected automatically.

Python 3.8+. No extra runtime dependencies.

## Installation

```bash
pip install jtoken
pip install "jtoken[tiktoken]"   # for OpenAI-compatible token counting
```

## Quick start

```python
import jtoken

# From a file: read it as text and pass the string directly
with open("data.json", encoding="utf-8") as f:
    raw = f.read()
encoded = jtoken.encode(raw)
print(encoded)

# From a Python dict
data = {"user": "alice", "age": 30, "active": True, "ref": None}
encoded = jtoken.encode(data)
decoded = jtoken.decode(encoded)
assert decoded == data
```

Aliases: `jtoken.dumps` = `encode`, `jtoken.loads` = `decode`.

## Format overview

**JSON**

```json
{"name": "Alice", "age": 30, "active": true, "verified": false, "ref": null}
```

**jtoken**

```
name: Alice
age: 30
trues: active
falses: verified
nulls: ref
```

### Encoding rules

- Nested dicts flatten with dot notation.
- `True`, `False`, and `None` collapse into `trues:`, `falses:`, and `nulls:` summary lines.
- Ambiguous strings (strings that would otherwise be read back as a different type) keep their quotes when encoded.
- Keys containing `.` are escaped during normalization and restored from the `NormalizationContext` (see its `dotted_keys` field).
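
The first two rules can be illustrated with a minimal standalone sketch. It is not jtoken's implementation: `flatten` and `sketch_encode` are hypothetical helpers, lists and string quoting are ignored, and joining several paths on one summary line with commas is an assumption.

```python
from typing import Any, Dict, List


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted paths (lists are not handled here)."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def sketch_encode(obj: Dict[str, Any]) -> str:
    """Emit `key: value` lines, then summary lines for True/False/None."""
    lines: List[str] = []
    trues: List[str] = []
    falses: List[str] = []
    nulls: List[str] = []
    for path, value in flatten(obj).items():
        # Identity checks keep 1 and 0 from being mistaken for True and False.
        if value is True:
            trues.append(path)
        elif value is False:
            falses.append(path)
        elif value is None:
            nulls.append(path)
        else:
            lines.append(f"{path}: {value}")
    # Assumption: multiple paths on one summary line are comma-separated.
    for label, paths in (("trues", trues), ("falses", falses), ("nulls", nulls)):
        if paths:
            lines.append(f"{label}: {', '.join(paths)}")
    return "\n".join(lines)


doc = {"name": "Alice", "age": 30, "active": True, "verified": False, "ref": None}
print(sketch_encode(doc))
```

For the document from the format overview, this prints the same five lines shown in the jtoken block there.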

## What jtoken accepts

`encode` accepts a **string** (file content) or a **dict/list**. When given a string, it auto-detects the format:

- Standard JSON objects and arrays
- Multiple bare JSON objects in a single string (no array wrapper needed)
- MongoDB shell format (`ObjectId(...)`, `ISODate(...)`, `NumberInt(...)`)
- MongoDB Extended JSON (`$oid`, `$date`, `$numberInt`, …)
- Elasticsearch search hits (with `_source`)

No format flag required — just pass the text.
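
To make the "multiple bare JSON objects" case concrete, here is one way such input could be split using only the standard library. This is a sketch of the general technique, not necessarily how jtoken parses it; `parse_concatenated_json` is a hypothetical helper name.

```python
import json
from typing import Any, List


def parse_concatenated_json(text: str) -> List[Any]:
    """Parse one or more JSON values that follow each other in a string."""
    decoder = json.JSONDecoder()
    values: List[Any] = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


text = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}'
print(parse_concatenated_json(text))
```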

## Normalization and denormalization

For lossless round-trips back into MongoDB shell or Elasticsearch hit format, use `encode_document` / `decode_document`:

```python
import jtoken

with open("hit.json", encoding="utf-8") as f:
    raw = f.read()
text, context = jtoken.encode_document(raw)
restored = jtoken.decode_document(text, target="mongo_shell", context=context)
```

```bash
jtoken encode -f doc.json --context-out doc.ctx.json
jtoken decode --output-format mongo_shell -f doc.jtoken --context-in doc.ctx.json
```

### Input and output formats

For input, `auto` (the default) detects the format from the content. Set `source=` or `target=` explicitly only when needed.

| Input format | Description |
|---|---|
| `auto` | detect from content (default) |
| `json` | standard JSON |
| `mongo_shell` | MongoDB shell (`ObjectId`, `ISODate`, …) |
| `mongo_extended` | MongoDB Extended JSON |
| `elastic_hit` | Elasticsearch hit with `_source` |
| `elastic_source` | `_source` wrapper only |

| Output format | Description |
|---|---|
| `json` | pretty-printed JSON (CLI default) |
| `python` | Python `repr` (Python API default) |
| `mongo_shell` | MongoDB shell document |
| `mongo_extended` | MongoDB Extended JSON |
| `elastic_hit` | full Elasticsearch hit envelope |
| `elastic_source` | `_source` wrapper |

## Public API reference

### Core codec

| function | description |
|---|---|
| `encode(data) -> str` | compress string, dict, or list to jtoken |
| `decode(text: str) -> dict` | reconstruct the nested dict |
| `dumps` / `loads` | json-style aliases |

### Normalization

| function | description |
|---|---|
| `encode_document(raw, *, source="auto", context=None)` | return `(jtoken_text, NormalizationContext)` |
| `decode_document(text, *, target="json", context=None)` | decode and denormalize |
| `normalize(data, *, source="auto", context=None)` | return `(normalized_dict, NormalizationContext)` |
| `denormalize(data, *, target="python", context)` | restore lists, typed values, and dialect |
| `parse_input(text, *, source="auto")` | parse foreign text into Python data |
| `render_output(value, *, target="python") -> str` | render denormalized data as text |

### Token measurement

| function | description |
|---|---|
| `count_tokens(data, *, model, backend) -> int` | token count for dict or jtoken string |
| `count_text_tokens(text, *, model, backend) -> int` | token count for raw text |
| `token_savings(data, *, model, backend, json_indent=2)` | compare jtoken vs pretty JSON |

### `TokenSavings` properties

| property | type | description |
|---|---|---|
| `jtoken_tokens` | `int` | tokens in jtoken representation |
| `json_tokens` | `int` | tokens in JSON baseline |
| `saved` | `int` | `json_tokens - jtoken_tokens` |
| `percent` | `float` | percent saved |

### `NormalizationContext` fields

| field | description |
|---|---|
| `source_format` | detected input dialect |
| `target_format` | optional output hint |
| `typed_values` | BSON-like type markers per path |
| `lists` | paths that were lists before flattening |
| `dotted_keys` | paths with escaped `.` keys |
| `elastic` | Elasticsearch envelope metadata |

Methods: `to_dict()`, `from_dict(data)`.
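
The role of a field such as `typed_values` can be pictured with a small standalone model. The sketch below is an assumption about the general idea (unwrap typed values, remember the wrapper per dotted path, re-wrap on the way back); it does not use or reproduce jtoken's `NormalizationContext`. The helper names are hypothetical, and only the `$oid` and `$date` wrappers are handled.

```python
from typing import Any, Dict, Tuple

# Assumption: only these two Extended JSON wrappers are handled in this sketch.
WRAPPERS = ("$oid", "$date")


def sketch_normalize(
    doc: Dict[str, Any], prefix: str = ""
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Unwrap {"$oid": ...}-style values and record the wrapper per dotted path."""
    plain: Dict[str, Any] = {}
    typed_values: Dict[str, str] = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and len(value) == 1 and next(iter(value)) in WRAPPERS:
            marker = next(iter(value))
            plain[key] = value[marker]
            typed_values[path] = marker
        elif isinstance(value, dict):
            plain[key], nested = sketch_normalize(value, path)
            typed_values.update(nested)
        else:
            plain[key] = value
    return plain, typed_values


def sketch_denormalize(
    plain: Dict[str, Any], typed_values: Dict[str, str], prefix: str = ""
) -> Dict[str, Any]:
    """Re-wrap values whose dotted path has a recorded type marker."""
    restored: Dict[str, Any] = {}
    for key, value in plain.items():
        path = f"{prefix}.{key}" if prefix else key
        if path in typed_values:
            restored[key] = {typed_values[path]: value}
        elif isinstance(value, dict):
            restored[key] = sketch_denormalize(value, typed_values, path)
        else:
            restored[key] = value
    return restored


doc = {
    "_id": {"$oid": "65f1c0ffee0000000000abcd"},
    "meta": {"created": {"$date": "2024-01-01T00:00:00Z"}},
    "n": 1,
}
plain, typed = sketch_normalize(doc)
print(typed)
assert sketch_denormalize(plain, typed) == doc
```

The sketch would confuse a key that itself contains `.` with a nested path, which is the kind of ambiguity a separate record of dotted keys can resolve.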

### Exceptions

| exception | when raised |
|---|---|
| `JPackEncodeError` | encoding fails |
| `JPackDecodeError` | decoding fails |
| `NormalizationError` | normalization fails |
| `DenormalizationError` | denormalization fails |
| `TokenCountError` | token counting fails |

## Token counting

```python
import jtoken
data = {"user": "alice", "age": 30, "active": True, "ref": None}
stats = jtoken.token_savings(data, model="gpt-4o", backend="tiktoken", json_indent=2)
print(stats.jtoken_tokens, stats.json_tokens, stats.saved, stats.percent)
```

| `backend` | behavior |
|---|---|
| `auto` | use `tiktoken` when installed, otherwise estimate |
| `tiktoken` | require `tiktoken` |
| `estimate` | character heuristic |
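
The fallback behavior in the table could look roughly like the sketch below. It is illustrative only: `estimate_tokens` and `sketch_count` are hypothetical names, the ratio of four characters per token is an assumed rule of thumb rather than jtoken's actual heuristic, and the error raised when `tiktoken` is required but missing will differ from the library's `TokenCountError`.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Character heuristic: assume roughly `chars_per_token` characters per token."""
    if not text:
        return 0
    return max(1, math.ceil(len(text) / chars_per_token))


def sketch_count(text: str, model: str = "gpt-4o", backend: str = "auto") -> int:
    """Pick a counter following the three backend modes in the table."""
    if backend not in ("auto", "tiktoken", "estimate"):
        raise ValueError(f"unknown backend: {backend}")
    if backend != "estimate":
        try:
            import tiktoken  # optional dependency
        except ImportError:
            if backend == "tiktoken":
                raise  # the caller required tiktoken, so do not fall back
        else:
            return len(tiktoken.encoding_for_model(model).encode(text))
    # Reached for "estimate", or for "auto" when tiktoken is not installed.
    return estimate_tokens(text)


print(sketch_count("hello world", backend="estimate"))
```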

### Representative token counts

| Document type | JSON | jtoken |
|---|---:|---:|
| ELK hit | 1537 | 583 |
| Mongo shell | 770 | 508 |
| PostgreSQL structured document | 831 | 685 |
| Standard JSON | 617 | 503 |

![Token count by representation](https://raw.githubusercontent.com/hermannsamimi/jtoken/main/docs/token-savings-bar-chart.svg)
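
As a worked example, the savings implied by the table can be computed directly. The `saved` formula matches the `TokenSavings` table; treating `percent` as the saving relative to the JSON baseline is an assumption, and `savings` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Counts copied from the table above: (JSON tokens, jtoken tokens).
counts: Dict[str, Tuple[int, int]] = {
    "ELK hit": (1537, 583),
    "Mongo shell": (770, 508),
    "PostgreSQL structured document": (831, 685),
    "Standard JSON": (617, 503),
}


def savings(json_tokens: int, jtoken_tokens: int) -> Tuple[int, float]:
    """Return (saved, percent), with percent relative to the JSON baseline."""
    saved = json_tokens - jtoken_tokens
    percent = 100.0 * saved / json_tokens if json_tokens else 0.0
    return saved, percent


for name, (json_tokens, jtoken_tokens) in counts.items():
    saved, percent = savings(json_tokens, jtoken_tokens)
    print(f"{name}: saved {saved} tokens ({percent:.1f}%)")

# ELK hit: saved 954 tokens (62.1%)
# Mongo shell: saved 262 tokens (34.0%)
# PostgreSQL structured document: saved 146 tokens (17.6%)
# Standard JSON: saved 114 tokens (18.5%)
```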

## CLI

```bash
cat data.json | jtoken encode
cat data.jtoken | jtoken decode
jtoken encode -f data.json
jtoken stats -f data.json --model gpt-4o --backend tiktoken
jtoken count -f data.json
python -m jtoken encode
```

## Links

- Homepage: https://github.com/hermannsamimi/jtoken
- Issues: https://github.com/hermannsamimi/jtoken/issues

## License

MIT — Copyright (c) 2026 Hermann Samimi
