Metadata-Version: 2.4
Name: llmclean
Version: 0.2.0
Summary: Utilities for cleaning and normalizing raw LLM output
License: MIT License
        
        Copyright (c) 2026 Tushar Jaju
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/Tushar-9802/llmclean
Keywords: llm,cleaning,json,text,ai,output,normalization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# llmclean

Small Python library for cleaning the noise out of raw LLM output. Strips markdown fences, repairs malformed JSON, trims runaway repetitions. Zero runtime dependencies — pure standard library.

I built this because my other projects ([Sakhi](https://github.com/Tushar-9802/Sakhi), [Resume-parser](https://github.com/Tushar-9802/Resume-parser)) kept reinventing the same five or six regex passes against the same recurring failure modes. The [0.2.0 changelog](CHANGELOG.md) documents what production traffic on those projects taught me to fix here.

## Install

```bash
pip install llmclean
```

## What it does

```python
from llmclean import strip_fences, enforce_json, trim_repetition

# ```lang ... ``` wrappers, including tilde fences and CRLF line endings
strip_fences('```json\n{"name": "Alice"}\n```')
# → '{"name": "Alice"}'

# JSON buried in prose, with trailing comma + Python literals
enforce_json('Here you go: {"ok": True, "items": [1,2,3,]}')
# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'

# Model looped on the same sentence
trim_repetition("The answer is 42. This is final. This is final.")
# → 'The answer is 42. This is final.'
```

`enforce_json` runs a pipeline of strategies in order and stops at the first one that produces parseable JSON. Strategies cover: existing valid JSON, fences, prose around the JSON, BOM at position 0, doubled-quote overruns like `""value""`, trailing commas, Python `True`/`False`/`None`, single-quoted strings, unquoted keys, and unclosed brackets. Full set in [USAGE.md](USAGE.md).
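
The stop-at-first-success loop described above can be sketched roughly as follows. This is a minimal illustration, not llmclean's actual code: `first_parseable`, `STRATEGIES`, and the three strategy functions are hypothetical names, and the repairs are deliberately naive stand-ins for the fuller set listed in USAGE.md.

```python
import json
import re
from typing import Callable, List


def _as_is(text: str) -> str:
    # Strategy 1 (hypothetical): the input may already be valid JSON.
    return text.strip()


def _extract_brackets(text: str) -> str:
    # Strategy 2 (hypothetical): keep the span from the first opening
    # bracket to the last closing bracket, dropping surrounding prose.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    start = min(starts, default=-1)
    end = max(text.rfind("}"), text.rfind("]"))
    return text[start:end + 1] if start != -1 and end > start else text


def _drop_trailing_commas(text: str) -> str:
    # Strategy 3 (hypothetical): remove a comma that sits directly before
    # a closing bracket. Naive: it does not skip commas inside strings.
    return re.sub(r",\s*([}\]])", r"\1", _extract_brackets(text))


STRATEGIES: List[Callable[[str], str]] = [
    _as_is,
    _extract_brackets,
    _drop_trailing_commas,
]


def first_parseable(text: str) -> str:
    """Return the first candidate that json.loads accepts, else the input."""
    for strategy in STRATEGIES:
        candidate = strategy(text)
        try:
            json.loads(candidate)
        except (ValueError, TypeError):
            continue  # this repair was not enough; try the next one
        return candidate
    return text  # nothing worked: hand back the original input unchanged


print(first_parseable('Here: {"a": [1, 2,]}'))
```

Ordering the cheapest, least invasive strategies first means well-formed input is returned untouched and the more aggressive repairs only run when needed.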

## What it doesn't do (and the thing to use instead)

- Validate JSON against a schema — use `jsonschema` or `pydantic`
- Re-prompt the model to fix its output — use `instructor`
- Constrain the model at generation time so it can't produce broken output — use `outlines`

These are different problems with different tools. llmclean handles the post-hoc cleanup pass; compose it with the above if you need more.

## Design choices

Three constraints I held to while iterating:

The library should never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.
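
Because a failed cleanup returns the input rather than raising, a caller that needs to know whether cleanup succeeded has to check the result itself. A minimal sketch of one way to do that, assuming the cleaner is any `str -> str` function such as `enforce_json`; `clean_and_parse` is a hypothetical helper, not part of llmclean, and the example uses `str.strip` as a stand-in cleaner so it runs without the library installed:

```python
import json
from typing import Any, Callable, Tuple


def clean_and_parse(raw: str, cleaner: Callable[[str], str]) -> Tuple[bool, Any]:
    """Run a never-raising cleaner, then report whether the result parses.

    Returns (True, parsed_value) on success and (False, raw) otherwise.
    A tuple is used because valid JSON `null` also parses to None.
    """
    cleaned = cleaner(raw)
    try:
        return True, json.loads(cleaned)
    except (ValueError, TypeError):
        return False, raw


# Stand-in cleaner for illustration; in practice pass llmclean.enforce_json.
ok, value = clean_and_parse('  {"ok": true}  ', str.strip)
print(ok, value)

ok, value = clean_and_parse("sorry, no JSON here", str.strip)
print(ok, value)
```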

Stay zero-dep. The standard library is sufficient for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.

Be predictable. Same input always produces the same output. No external state, no model calls, no fuzzy heuristics that change behaviour silently across versions.

## Known limitations

Some inputs land llmclean in known false-positive territory. Two worth flagging:

`strip_fences` will remove a single language name if it's the only content inside a fence — so if your model literally emits `` ```\njson\n``` `` as a one-word answer, that content disappears. The aggressive language-tag cleanup catches stray tags from real-world fence variants, and the trade-off is documented in the test `test_lone_language_word_as_content_gets_stripped`.
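
If one-word fenced answers matter in your pipeline, one option is to detect that shape before calling `strip_fences` and handle it separately. The sketch below is an assumption about what such a guard could look like; `lone_fenced_word` and its regex are hypothetical and not part of llmclean.

```python
import re
from typing import Optional

# A fence (backticks or tildes) with no info string, exactly one word on the
# only content line, and a matching closing fence. LF or CRLF line endings.
_LONE_WORD_FENCE = re.compile(
    r"^\s*(`{3,}|~{3,})[ \t]*\r?\n[ \t]*(\w+)[ \t]*\r?\n\1\s*$"
)


def lone_fenced_word(text: str) -> Optional[str]:
    """Return the single word inside a bare fence, or None if not that shape."""
    match = _LONE_WORD_FENCE.match(text)
    return match.group(2) if match else None


print(lone_fenced_word("```\njson\n```"))
print(lone_fenced_word('```json\n{"a": 1}\n```'))
```

A caller could use the returned word directly when it is not `None` and fall through to `strip_fences` otherwise.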

`enforce_json`'s double-quote collapse only handles the symmetric form `""text""`. The asymmetric variants Sakhi's pipeline also handles (`: ""x` and `x""`) corrupt legitimate empty-string values, so they're deliberately omitted here.
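
To see why the asymmetric forms are risky, the sketch below contrasts the two approaches on an empty-string value. Both regexes are illustrative assumptions, not the patterns llmclean or Sakhi actually use: the symmetric one only fires on a doubled quote pair wrapping non-empty text in value position, while the asymmetric one is one plausible reading of the `: ""x` variant.

```python
import json
import re

# Symmetric (hypothetical pattern): `: ""text""` followed by a JSON terminator.
SYMMETRIC = re.compile(r'(:\s*)""([^"]+)""(?=\s*[,}\]])')

# Asymmetric leading form (hypothetical pattern): any `: ""` becomes `: "`.
ASYMMETRIC_LEADING = re.compile(r':\s*""')


def collapse_symmetric(text: str) -> str:
    return SYMMETRIC.sub(r'\1"\2"', text)


def collapse_asymmetric(text: str) -> str:
    return ASYMMETRIC_LEADING.sub(': "', text)


def parses(text: str) -> bool:
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


overrun = '{"name": ""Alice""}'
empty = '{"note": ""}'

print(collapse_symmetric(overrun))   # the doubled quotes are collapsed
print(collapse_symmetric(empty))     # the empty string is left alone
print(collapse_asymmetric(empty))    # the empty string loses a quote
print(parses(collapse_asymmetric(empty)))
```

The asymmetric rule cannot tell a legitimate `""` from the start of an overrun, which is the corruption described above.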

## Tests

```bash
pip install "llmclean[dev]"
pytest -v
```

78 tests across the three modules at v0.2.0. Includes characterization tests for known trade-offs (empty-string preservation, lone-language-tag strip) so future changes can't silently regress them.

## License

MIT.
