Metadata-Version: 2.4
Name: darn_it
Version: 1.3.1
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
License-File: LICENSE
Summary: An opinionated chunker for markdown formatted text
Author-email: John C Ll Stokes <johnclstokes@hotmail.com>
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# darn-it
![piccy](docs/mathematically_optimised_chunking.jpg)
(`darn` - *Welsh*, 'därn', meaning 'piece' or more favourably, 'chunk')

Darn is a Python tool for producing optimal 'chunks' from markdown-formatted string data. The tool uses a novel 'punishment' based approach to determining chunk boundaries, which in practice improves on traditional "semantic" chunking methods. To avoid adding unexpected behaviours (such as hidden text cleaning), `darn` assumes [ASCII](https://www.ascii-code.com/) compliance. Users of the package should pre-clean their input to remove non-ASCII characters ahead of time to avoid errors in processing.
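
As a minimal sketch of the pre-cleaning step, the standard library can strip non-ASCII characters before text is handed to `darn`. The helper name `to_ascii` is hypothetical and not part of the package, and transliterating accents before dropping the remaining characters is only one possible policy:

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Return `text` with every non-ASCII character removed.

    Hypothetical helper, not part of darn_it.
    """
    # NFKD splits an accented letter into a base letter plus a combining
    # mark, so the base letter survives the ASCII encode step below.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently drops anything that still has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("därn")
```

Characters with no ASCII decomposition are dropped outright, so it is worth checking the cleaned text if symbols carry meaning in your data.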

For more on the method that underpins `darn`, see the accompanying [blog post](https://cashewe.github.io/blog/optimal-chunking-strats/). For detailed examples, see the [documentation]().

## Setup

To use `darn`, you must first install it, ideally into a Python virtual environment:

```
pip install uv

uv venv --python 3.13
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

uv pip install darn_it
```

From here, you can import it into your Python session for use:

```
from darn_it import (
    Chunker,
    PyNodeType,
    Rule,  # used below; assumed to be exported alongside Chunker
)

chunker = Chunker(
    [
        Rule(
            on_punishment="const",
            on_scale=10,
            off_punishment="const",
            off_scale=0,
            nodetype=PyNodeType.Sentence
        ),  # ... add custom rules here. if none are provided, darn will fall back on its curated list of defaults
    ]
)

with open("file.md", "r") as f:
    text = f.read()

chunks = chunker.get_chunks(
    text=text, 
    chunk_size=500,
    # granularity="tokens", # either "tokens" or "characters"
    # model="gpt-4o", # only consumed if `granularity` is set to `tokens`
    # overlap=50, # respects `granularity` variable
    # return_vectors=True, # turn on to include the full punishment vectors per rule in the output object
)
```

Rules can be applied to any structure from the [mdast](https://github.com/syntax-tree/mdast) specification, along with the additional `Sentence` and `Word` node types.
The currently supported punishment functions for rules are listed below:

| punishment | meaning |
|------------|---------|
| const | a static punishment value |
| linear | a punishment value that increments by 1 per character, starting at the provided input |
| reverse_linear | a punishment that decrements by 1 per character, starting at the provided input and stopping at 0 |
| triangular | a punishment which peaks at the provided value in the middle of the structure, starting and ending at 0 |
| inverse_triangular | a punishment which peaks at the start and end of the structure at the provided value, and hits 0 in the middle |
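
To make the shapes in the table concrete, here is a rough pure-Python sketch of each function over a structure of `length` characters. It is an illustration of the descriptions above and not `darn`'s actual implementation: the function signatures, the linear slopes of the triangular variants, and the handling of very short structures are all assumptions.

```python
def const(scale: float, length: int) -> list[float]:
    # The same punishment at every character.
    return [scale] * length


def linear(scale: float, length: int) -> list[float]:
    # Starts at `scale` and increments by 1 per character.
    return [scale + i for i in range(length)]


def reverse_linear(scale: float, length: int) -> list[float]:
    # Starts at `scale`, decrements by 1 per character, and stops at 0.
    return [max(scale - i, 0) for i in range(length)]


def triangular(scale: float, length: int) -> list[float]:
    # 0 at both ends, peaking at `scale` in the middle.
    if length < 2:
        return [0.0] * length  # assumption: too short to form a peak
    mid = (length - 1) / 2
    return [scale * (1 - abs(i - mid) / mid) for i in range(length)]


def inverse_triangular(scale: float, length: int) -> list[float]:
    # `scale` at both ends, dropping to 0 in the middle.
    if length < 2:
        return [float(scale)] * length  # assumption: a lone end point
    mid = (length - 1) / 2
    return [scale * abs(i - mid) / mid for i in range(length)]


tri = triangular(10, 5)
inv = inverse_triangular(10, 5)
```

With a scale of 10 over five characters, `tri` rises to 10 at the centre and `inv` is its mirror image, which is a quick way to sanity-check which variant suits a given structure.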

When settling on rules, you may find the `punishments` and `punishment_breakdown` attributes of the object returned by `get_chunks()` useful. To access them, set the `return_vectors` argument to `True`. The `punishment_breakdown` attribute shows the punishments attributable to each rule, in order, whilst the `punishments` attribute gives the summed punishment across all rules. These are intended for analytical use and should not be left on in production.
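
The relationship between the two attributes can be sketched in plain Python. The data layout here is an assumption made for illustration (one equal-length vector per rule), and `total_punishment` is a hypothetical helper, not a function provided by `darn`:

```python
def total_punishment(breakdown: list[list[float]]) -> list[float]:
    """Sum per-rule punishment vectors position by position."""
    # zip(*breakdown) groups the values that every rule assigns
    # to the same character position.
    return [sum(values) for values in zip(*breakdown)]


# Hypothetical per-rule vectors for a four-character input.
breakdown = [
    [10, 10, 0, 0],  # e.g. a "const" rule covering the first two characters
    [0, 1, 2, 3],    # e.g. a "linear" rule starting at 0
]
total = total_punishment(breakdown)
```

Comparing a single rule's vector against the total shows how much that rule contributes at each position, which helps when tuning `on_scale` and `off_scale`.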
