Metadata-Version: 2.4
Name: sembr
Version: 0.4.2
Summary: A semantic linebreaker powered by transformers
Author: admk
License-Expression: MIT
Project-URL: Homepage, https://github.com/admk/sembr
Project-URL: Issues, https://github.com/admk/sembr/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Utilities
Classifier: Environment :: Console
Requires-Python: <3.15,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastmcp
Requires-Dist: flask
Requires-Dist: magika
Requires-Dist: mcp[cli]
Requires-Dist: pydantic
Requires-Dist: requests
Requires-Dist: tqdm
Requires-Dist: tree-sitter>=0.25.0
Requires-Dist: tree-sitter-markdown>=0.3.2
Provides-Extra: cuda
Requires-Dist: accelerate; extra == "cuda"
Requires-Dist: bitsandbytes; extra == "cuda"
Requires-Dist: torch; extra == "cuda"
Requires-Dist: transformers==5.9.0; extra == "cuda"
Provides-Extra: mlx
Requires-Dist: huggingface-hub; extra == "mlx"
Requires-Dist: mlx; extra == "mlx"
Requires-Dist: numpy; extra == "mlx"
Requires-Dist: tokenizers; extra == "mlx"
Provides-Extra: cpu
Requires-Dist: torch; extra == "cpu"
Requires-Dist: transformers==5.9.0; extra == "cpu"
Provides-Extra: train
Requires-Dist: accelerate; extra == "train"
Requires-Dist: datasets; extra == "train"
Requires-Dist: evaluate; extra == "train"
Requires-Dist: numpy; extra == "train"
Requires-Dist: torch; extra == "train"
Requires-Dist: transformers==5.9.0; extra == "train"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: torch; extra == "test"
Requires-Dist: transformers==5.9.0; extra == "test"
Dynamic: license-file

# ⚡️ Semantic Line Breaker (SemBr)

[![GitHub](https://img.shields.io/github/license/admk/sembr)](LICENSE)
[![python](https://img.shields.io/badge/Python-3.11--3.14-3776AB.svg?style=flat&logo=python&logoColor=white)](https://www.python.org)
[![pytorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg?style=flat&logo=pytorch)](https://pytorch.org)
[![PyPI](https://badge.fury.io/py/sembr.svg)](https://pypi.org/project/sembr)

```
> When writing text
> with a compatible markup language,
> add a line break
> after each substantial unit of thought.
```


## What is SemBr?

SemBr is a command-line tool
powered by [Transformer][transformers1] [models][transformers2]
that performs [semantic line breaking](#what-are-semantic-line-breaks):
it breaks lines in a text file at semantic boundaries.
It supports multiple file types
including LaTeX, Markdown, and plain text,
with automatic file type detection.

**[20 Jun 2026]** :rocket: It now supports MLX + NVFP4 on macOS,
which is incredibly fast:
**it now takes under 6 seconds to process 100k words**
on an old M2 MacBook Pro.

### Installation

SemBr is available as a [Python package on PyPI][pypi].

#### macOS on Apple Silicon with MLX

On Apple Silicon Macs,
SemBr can use the MLX backend with NVFP4 quantization,
which is **\~30x faster than torch+MPS**!
Install the MLX extra:
```shell
uv tool install "sembr[mlx]"
```

#### Linux/Windows with CUDA support

For CUDA on Linux or Windows,
install the CUDA extra:
```shell
uv tool install "sembr[cuda]"
```

#### CPU (Linux/Windows) or MPS (macOS) only

Install with [`uv`][uv]:
```shell
uv tool install "sembr[cpu]"
```

#### From GitHub (Latest Development Version)

To install the latest development version directly from GitHub:

```shell
# Install from GitHub main branch
uv tool install git+https://github.com/admk/sembr.git

# Run directly without installing
uvx --from git+https://github.com/admk/sembr.git sembr
```

Note that the development version
may include experimental features
and could be less stable than the PyPI release.

#### Development

To develop this project,
clone and install in development mode:

```shell
git clone https://github.com/admk/sembr.git
cd sembr
SEMBR_VERSION_SUFFIX=.dev0 \
  uv tool install --editable . --force --refresh-package sembr
```


### Supported Platforms

SemBr is supported on Linux, macOS and Windows
(well-tested on macOS).
On machines with CUDA devices,
or on Apple Silicon Macs,
SemBr will use the GPU
to accelerate inference.

### Usage

#### Command Line Interface

To use SemBr,
run the following command in your terminal:
```shell
sembr -i <input_file> -o <output_file>
```
where `<input_file>` and `<output_file>`
are the paths to the input and output files respectively.

On the first run,
it will download the SemBr model
and cache it in `~/.cache/huggingface`.
Subsequent runs will check for updates
and use the cached model if it is up-to-date.

Alternatively,
you can pipe the input into `sembr`,
and the output can also be printed to the terminal:
```shell
cat <input_file> | sembr
```
This is especially useful if you want to use SemBr
with clipboard managers, for instance, on a Mac:
```shell
pbpaste | sembr | pbcopy
```
Or on Linux:
```shell
xclip -o | sembr | xclip -i
```
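The same pipe can also be driven from a script.
The sketch below is only an illustration and not part of SemBr:
`sembr_filter` is a hypothetical helper
that assumes the `sembr` executable is on your `PATH`
and feeds it text on standard input,
as in the pipe examples above.
```python
import subprocess


def sembr_filter(text, command=("sembr",)):
    """Pipe `text` through a command that reads stdin and writes stdout.

    `command` defaults to the `sembr` executable (assumed to be on PATH).
    This helper is hypothetical and not part of the SemBr package.
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```
Calling `sembr_filter(paragraph)` should then return
what `sembr` prints to the terminal for the same input.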

You can also specify the following command-line options:

* `-l`, `--listen`:
  Serves the SemBr API on a local server.
  - Each `sembr` invocation
    detects whether the API is accessible,
    and if it is not, runs the model on its own.
  - This option is useful
    to avoid the time taken to initialize the model
    by keeping it in memory in a separate process.
* `--file-type <type>`:
  File type (`plaintext`, `latex`, `markdown`, etc.).
  Auto-detected using [Magika][magika] if not provided.
* `--mcp`:
  Start MCP server mode instead of processing text.

#### Configurations

Additionally,
you can configure SemBr by creating
`$XDG_CONFIG_HOME/sembr/config.toml`.
If `XDG_CONFIG_HOME` is not set,
SemBr reads `~/.config/sembr/config.toml`.
The complete commented defaults
are stored in [`sembr/default.toml`](sembr/default.toml).
Copy that file to your config path
and edit only the values you want to change.
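
The lookup rule above can be sketched in a few lines.
`config_path` is a hypothetical helper,
not SemBr's actual code,
and it assumes that an empty `XDG_CONFIG_HOME`
is treated the same as an unset one.
```python
import os
from pathlib import Path


def config_path(environ=None):
    """Resolve the config file location described above (illustrative only)."""
    environ = os.environ if environ is None else environ
    base = environ.get("XDG_CONFIG_HOME")
    # Assumption: an empty value falls back to ~/.config, like an unset one.
    root = Path(base) if base else Path.home() / ".config"
    return root / "sembr" / "config.toml"
```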

To use it offline,
you can download the model from Hugging Face
and set `model.name` to the model directory,
or prepend `TRANSFORMERS_OFFLINE=1` to the command
to use the cached model.

You can override config values for a single run
with `-c` or `--config`:

```shell
sembr \
  -c model.name=/path/to/model \
  -c optimize.algorithm=balanced_linebreaks \
  -c optimize.preferred_min_tokens_per_line=8 \
  -c optimize.preferred_max_tokens_per_line=10 \
  -c optimize.line_length_penalty_weight=0.05
```
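Conceptually,
each `-c section.key=value` pair
replaces one entry of the loaded configuration.
The sketch below shows one way such overrides could be merged;
`apply_overrides` is a hypothetical function,
and it keeps every value as a string,
whereas the real command presumably converts numbers.
```python
def apply_overrides(config, overrides):
    """Apply `section.key=value` strings to a nested dict and return a copy.

    Hypothetical sketch: values are kept as strings, with no type conversion.
    """
    merged = {section: dict(values) for section, values in config.items()}
    for item in overrides:
        dotted, sep, value = item.partition("=")
        section, dot, key = dotted.partition(".")
        if not sep or not dot or not key:
            raise ValueError(f"expected SECTION.KEY=VALUE, got {item!r}")
        merged.setdefault(section, {})[key] = value
    return merged
```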

The supported config keys are:

* `model.name`:
  The name of the Hugging Face model to use.
* `model.backend`:
  Inference backend to use.
  `torch` is the default.
  `cuda` uses the torch backend
  and requires a CUDA-capable torch install.
  Choose `mlx` on Apple Silicon.
* `model.bits`:
  Quantization bits for model weights (`4` or `8`).
  Requires CUDA. Not supported on MPS.
* `model.dtype`:
  Data type for model weights (e.g. `float16`, `bfloat16`).
  Default is `float32`.
* `model.quantization`:
  MLX weight quantization mode.
  Set `model.backend=mlx`
  and `model.quantization=nvfp4`
  to use MLX NVFP4 quantized linear layers.
  The default is `none`.
* `inference.batch_size`:
  The number of lines to process in a batch.
  Default is `8`.
* `inference.overlap_divisor`:
  The overlap divisor for tiled inference.
  Default is `8`.
* `optimize.algorithm`:
  The prediction function to use.
  Options are `argmax`, `logit_adjustment`, `greedy_linebreaks`,
  and `balanced_linebreaks`.
  Default is `balanced_linebreaks`.
* `optimize.preferred_min_tokens_per_line`:
  Preferred lower line length target.
  Default is `8`.
* `optimize.preferred_max_tokens_per_line`:
  Preferred upper line length target.
  Default is `10`.
* `optimize.line_length_penalty_weight`:
  Penalty weight for line lengths outside the preferred range.
  The default is `0.05`.
* `format.num_spaces`:
  Number of spaces represented by one indentation level,
  or `auto` to detect `2`, `4`, or `8` from the input.
  The default is `auto`.
* `format.indent_type`:
  Indentation unit to emit.
  Options are `space`, `tab`, and `auto`.
  `auto` detects space or tab indentation from the input.
  The default is `space`.
* `listen.host`:
  The host address of the SemBr API server.
  The default is `127.0.0.1`.
* `listen.port`:
  The port for the SemBr API server.
  The default is `8384`.

#### Balanced line breaks

The `balanced_linebreaks` algorithm
optimizes line breaks with dynamic programming
over each parsed paragraph.

It precomputes token costs
from the model log probabilities.
A no-break token costs `-log P(off)`,
and a break token costs `-log P(breaks)`.
After choosing break positions,
it uses the highest-scoring indent level
at each chosen position to recover the break type.
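
A minimal sketch of the cost computation,
assuming a single break probability per token
whose complement is the no-break probability.
The function name and the `eps` guard are illustrative
and not taken from the SemBr source.
```python
import math


def token_costs(break_probs, eps=1e-12):
    """Return (no_break_costs, break_costs) as negative log probabilities.

    Assumption: break_probs[i] is the probability of a line break at token i,
    and the no-break probability is 1 - break_probs[i].
    """
    no_break = [-math.log(max(1.0 - p, eps)) for p in break_probs]
    do_break = [-math.log(max(p, eps)) for p in break_probs]
    return no_break, do_break
```
A confident break prediction makes breaking cheap
and not breaking expensive,
so the optimizer only overrides the model
when the length penalty outweighs that difference.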

The objective also adds a quadratic penalty
when the token count falls outside
`optimize.preferred_min_tokens_per_line`
and `optimize.preferred_max_tokens_per_line`.
Larger `optimize.line_length_penalty_weight` values
make the algorithm favor the preferred range more strongly.
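
One plausible form of this penalty
is the squared distance to the nearest bound,
scaled by the weight.
The exact formula used by SemBr is not stated here,
so treat the sketch below as an assumption;
its default arguments mirror the documented config defaults.
```python
def line_length_penalty(num_tokens, min_tokens=8, max_tokens=10, weight=0.05):
    """Quadratic penalty for a line whose token count is outside [min, max].

    Assumed form: weight * (distance to the nearest bound) ** 2.
    """
    if num_tokens < min_tokens:
        excess = min_tokens - num_tokens
    elif num_tokens > max_tokens:
        excess = num_tokens - max_tokens
    else:
        excess = 0
    return weight * excess ** 2
```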

For a paragraph with `n` tokens,
the implementation uses prefix sums,
a monotonic queue for the no-penalty range,
and a Li Chao tree for long-line penalties.
The optimization complexity is
`O(n * l + n log n)`,
where `l` is `optimize.preferred_min_tokens_per_line`.
Memory usage is `O(n)` per paragraph.
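
For intuition,
here is a naive `O(n^2)` version of the same kind of dynamic program.
It is a simplified sketch, not SemBr's implementation:
it omits the monotonic queue and the Li Chao tree,
assumes the quadratic penalty form
`weight * distance_to_range ** 2`,
and all names are hypothetical.
```python
def balanced_breaks(no_break_cost, break_cost,
                    min_tokens=8, max_tokens=10, weight=0.05):
    """Naive O(n^2) DP; returns the token indices that start a new line.

    no_break_cost[i] / break_cost[i] are the costs of not breaking / breaking
    before token i. Token 0 always starts the first line and is not charged.
    """
    n = len(no_break_cost)
    # prefix[i] is the total no-break cost of tokens 0..i-1.
    prefix = [0.0]
    for cost in no_break_cost:
        prefix.append(prefix[-1] + cost)

    def penalty(length):
        excess = max(min_tokens - length, length - max_tokens, 0)
        return weight * excess ** 2

    # best[j] is the minimum cost of laying out tokens 0..j-1 in whole lines.
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # Candidate line covers tokens i..j-1: token i starts the line,
            # tokens i+1..j-1 are charged the no-break cost.
            start = break_cost[i] if i > 0 else 0.0
            inner = prefix[j] - prefix[i + 1]
            cost = best[i] + start + inner + penalty(j - i)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk back through the chosen line starts.
    breaks, j = [], n
    while j > 0:
        j = prev[j]
        if j > 0:
            breaks.append(j)
    return breaks[::-1]
```
With uniform costs,
for example `balanced_breaks([0.7] * 20, [0.7] * 20)`,
only the length penalty matters,
and the 20 tokens are split into two lines of 10.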

#### MCP Server

Alternatively,
you can run `sembr` as an [MCP server][mcp].
Simply add the following configuration
to your MCP server configuration:
```json
"mcpServers": {
  "sembr": {
    "type": "stdio",
    "command": "uvx",
    "args": [
      "sembr",
      "--mcp"
    ]
  }
}
```

The server also supports the formatting options described above.
It will expose a `wrap_text` tool
for the MCP client to use.

## What are Semantic Line Breaks?

[Semantic Line Breaks][sembr]
or [Semantic Linefeeds][semlf]
describe a set of conventions
for using insensitive vertical whitespace
to structure prose along semantic boundaries.


## Why use Semantic Line Breaks?

Semantic line breaks have the following advantages:

* Breaking lines by splitting clauses
  reflects the logical, grammatical and semantic structure
  of the text.

* It enhances the ease of editing and version control
  for a text file.
  Merge conflicts are less likely to occur
  when small changes are made,
  and the changes are easier to identify.

* Documents written with semantic line breaks
  are easier to navigate and edit
  with Vim and other text editors
  that use Vim keybindings.

* Semantic line breaks
  are invisible to readers.
  The final rendered output
  is unaffected by them.


## Why SemBr?

Converting existing text
that was not written with semantic line breaks
takes a long time to do manually,
and it is surprisingly difficult
to do automatically with rule-based methods.

### Challenges of rule-based methods

Rule-based heuristics do not work well
with the actual semantic structure of the text,
often leading to incorrect semantic boundaries.
Moreover,
these boundaries are hierarchical and nested,
and a rule-based approach
cannot capture this structure.
A semantic line break
may occur after a dependent clause,
but where to break clauses into lines
is challenging to determine
without syntactic and semantic reasoning capabilities.
For example:

* A rule that breaks lines at punctuation marks
  will not work well with sentences
  that contain periods
  in abbreviations or mathematical expressions.

* Syntactic or semantic structures
  are not always easy to determine.
  "I like to eat apples and oranges
  because they are healthy."
  should be broken into lines as follows:
  ```
  > I like to eat apples and oranges
  > because they are healthy.
  ```
  rather than:
  ```
  > I like to eat apples
  > and oranges because they are healthy.
  ```

For this reason,
I have created SemBr,
which uses finetuned Transformer models
to predict line breaks at semantic boundaries.


## How does SemBr work?

SemBr uses a Transformer model
to predict line breaks at semantic boundaries.

A small dataset of text with semantic line breaks
was created from my existing LaTeX documents.
The dataset was split into training
(46,295 lines, 170,681 words and 1,492,952 characters)
and test
(2,187 lines, 7,564 words and 72,231 characters)
datasets.

The data was prepared
by extracting line breaks and indent levels
from the files,
and then converting the result
into strings of paragraphs with line breaks removed.
The data can then be tokenized using the tokenizer
and converted into a dataset with tokens,
where each token has a label
denoting whether there is a line break before it,
and the indent level of the token.
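
The labeling step can be sketched as follows.
This is a simplified illustration, not the actual data processor:
it splits on whitespace instead of using the model tokenizer,
and it assumes one indent level equals two spaces.
```python
def lines_to_labeled_tokens(lines, spaces_per_level=2):
    """Turn line-broken text into (token, break_before, indent_level) triples.

    Simplifications: tokens are whitespace-separated words,
    and one indent level is assumed to be `spaces_per_level` spaces.
    """
    labeled = []
    for line_no, line in enumerate(lines):
        stripped = line.lstrip(" ")
        indent = (len(line) - len(stripped)) // spaces_per_level
        for word_no, word in enumerate(stripped.split()):
            # Only the first word of a non-initial line follows a line break.
            break_before = word_no == 0 and line_no > 0
            labeled.append((word, break_before, indent))
    return labeled
```
Joining the tokens with spaces gives back the paragraph
with its line breaks removed,
while the labels keep the information needed to restore them.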

For LaTeX documents,
there are two types of line breaks:
one with a normal line break
that adds implicit spacing (e.g. `line a⏎line b`)
and one with no spacing (e.g. `line a%⏎line b`).
The data processor
also tries to preserve the LaTeX syntax of the text
by adding and removing comment symbols (`%`),
if necessary.
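
The two break types can be illustrated
with a small function that removes line breaks
while keeping the spacing LaTeX would produce.
This is a hedged sketch with a hypothetical name:
it only handles a bare trailing `%`
and ignores comments that carry text.
```python
def join_latex_lines(lines):
    """Remove line breaks while preserving LaTeX's implicit spacing.

    Simplification: a trailing unescaped `%` is treated as a no-space break;
    comments with content after `%` are not handled.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("%") and not text.endswith("\\%"):
            # `%` comments out the newline: drop the marker and add no space.
            text = text[:-1] + line
        elif text:
            # A normal line break behaves like a single space.
            text = text + " " + line
        else:
            text = line
    return text
```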

The pretrained masked language model
is then finetuned as a token classifier
on the training dataset
to predict the labels of the tokens.
We save the model with the best F1 score
on correctly predicting the existence of a line break
on the test set.
The finetuning logs for the following models
can be found in this [WandB][wandb] report:

* `distilbert-base-uncased`
  [[Pretrained]][distilbert-bu]
  [[Finetuned]][sembr-distilbert-bu]
* `distilbert-base-cased`
  [[Pretrained]][distilbert-bc]
  [[Finetuned]][sembr-distilbert-bc]
* `distilbert-base-uncased-finetuned-sst-2-english`
  [[Pretrained]][distilbert-bufs2e]
  [[Finetuned]][sembr-distilbert-bufs2e]
* `prajjwal1/bert-tiny`
  [[Pretrained]][bert-tiny]
  [[Finetuned]][sembr-bert-tiny]
* `prajjwal1/bert-mini`
  [[Pretrained]][bert-mini]
  [[Finetuned]][sembr-bert-mini]
* `prajjwal1/bert-small`
  [[Pretrained]][bert-small]
  [[Finetuned]][sembr-bert-small]


## Performance

We now ship an MLX NVFP4 variant
that runs at about **26k words per second**,
with a **much faster model load time (4 seconds)**
and only about **130 MB** of memory usage!
Inference speed for the old torch+MPS backend
on an M2 MacBook Pro
is about 850 words per second
on `bert-small` with the default options,
with about 1.70 GB of memory usage.

The line breaking accuracy is difficult to measure,
and the locations of line breaks
can also be subjective.
On the test set,
the per-token line break accuracy
of the models is >95%,
with ~80% F1 scores.
Because line breaks are sparse,
accuracy is not a good metric
for measuring the performance of the model,
so I used the F1 score instead
to select the best models.
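
A toy example shows why accuracy is misleading here.
The labels below are synthetic,
not SemBr results:
one token in ten is a break,
and the "model" never predicts a break at all.
```python
def accuracy_and_f1(true_labels, predicted_labels):
    """Per-token accuracy and F1 for binary break / no-break labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    correct = sum(1 for t, p in pairs if bool(t) == bool(p))
    accuracy = correct / len(pairs)
    # F1 is defined as 1.0 here when there are no breaks to find or predict.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return accuracy, f1


# Synthetic data: 1 break in every 10 tokens, and a predictor that never breaks.
truth = ([0] * 9 + [1]) * 10
never_break = [0] * 100
print(accuracy_and_f1(truth, never_break))  # (0.9, 0.0)
```
The useless predictor still reaches 90% accuracy,
while its F1 score is 0.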


## Improvements and TODOs

- Features:
  - Natural language support:
    - [ ] Support natural languages other than English.
  - Typesetting languages support:
    - [x] ~~Markdown.~~
    - [ ] Typst.
    - [ ] LaTeX.
  - Usability:
    - [ ] Inference queue.
    - [ ] Daemon with model unloading.
  - Editor integration:
    - [x] ~~NeoVim plugin.~~
    - [x] ~~VSCode extension.~~
    - [x] MCP server.
  - [x] ~~Use the [Hugging Face API][hfapi] for inference.~~
- Accuracy:
  - Some lines are too short or too long:
    - [x] Long lines can be penalized greedily
          by breaking lines with token counts
          more than `optimize.preferred_max_tokens_per_line`.
    - [ ] Support `optimize.preferred_(min|max)_words_per_line`.
    - [x] Improve the algorithm to penalize short and long lines
          with a more sophisticated method.
  - [ ] Improve indent level prediction.
  - [ ] Performance and accuracy benchmarking,
        and comparisons with related works.
- Performance:
  - [x] Improve inference speed.
  - [x] Reduce memory usage.


## Related Projects and References

Sentence splitting:
* https://code.google.com/archive/p/splitta/
* https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
* https://github.com/nipunsadvilkar/pySBD
* https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html

Semantic line breaking:
* https://github.com/sembr/specification
* https://github.com/waldyrious/semantic-linebreaker
* https://github.com/bobheadxi/readable ([blog post][readable-blog-post])
* https://github.com/chrisgrieser/obsidian-sembr
* https://github.com/cllns/semantic_linefeeds


[transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4
[transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/

[pypi]: https://pypi.org/project/sembr
[uv]: https://github.com/astral-sh/uv
[mcp]: https://modelcontextprotocol.io/overview
[magika]: https://github.com/google/magika

[sembr]: https://sembr.org
[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line

[wandb]: https://api.wandb.ai/links/admk/efvui9f4

[distilbert-bu]: https://huggingface.co/distilbert-base-uncased
[distilbert-bc]: https://huggingface.co/distilbert-base-cased
[distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
[bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny
[bert-mini]: https://huggingface.co/prajjwal1/bert-mini
[bert-small]: https://huggingface.co/prajjwal1/bert-small
[sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased
[sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased
[sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english
[sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny
[sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini
[sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small

[hfapi]: https://huggingface.co/docs/api-inference/detailed_parameters#token-classification-task

[readable-blog-post]: https://bobheadxi.dev/semantic-line-breaks
