Metadata-Version: 2.4
Name: combo-seg
Version: 0.1.3
Summary: Character-level segmentation model
Author: Michał Ulewicz, Alina Wróblewska
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg
Project-URL: Documentation, https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg
Project-URL: Repository, https://gitlab.clarin-pl.eu/syntactic-tools/combo-seg
Keywords: nlp,natural-language-processing,segmentation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.30
Requires-Dist: huggingface-hub>=0.26
Requires-Dist: jinja2>=3.0
Requires-Dist: tqdm
Requires-Dist: pyyaml>=6.0
Requires-Dist: dacite>=1.8
Requires-Dist: wandb>=0.16
Requires-Dist: numpy>=1.24.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# COMBO-SEG

**Character-level segmentation model for natural language text.**

COMBO-SEG segments raw text into turns, sentences, tokens, and words using a transformer-based character-level classifier. It supports 80+ languages.

Output hierarchy: `Document → Turn[] → Sentence[] → Token[]` (matches LAMBO). Turns are split first, using regular-expression separators (by default a double newline `\n\n` and `<turn>`); the neural model then splits each turn into sentences and tokens.
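The turn-splitting step can be pictured with a few lines of standard-library Python. This is only a minimal sketch of the idea, not COMBO-SEG's actual implementation: the names `TURN_SEPARATOR` and `split_turns` are hypothetical, and the pattern simply assumes the two default separators mentioned above.

```python
import re

# Hypothetical pattern covering the two default separators:
# a double newline and the literal marker "<turn>".
TURN_SEPARATOR = re.compile(r"\n\n|<turn>")


def split_turns(text: str) -> list[str]:
    """Split raw text into turn strings, dropping empty pieces."""
    pieces = TURN_SEPARATOR.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


turns = split_turns("Ala ma kota. Kot ma Alę.\n\nDrugi akapit.<turn>Trzeci akapit.")
for turn in turns:
    print(turn)
```

Each resulting string would then be handed to the neural model, which is responsible for the sentence and token boundaries inside that turn.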

## Installation

```bash
pip install combo-seg
```

## Usage

```python
from combo_seg import ComboSeg, Language, SplitLevel

segmenter = ComboSeg(language=Language.POLISH)
doc = segmenter("Ala ma kota. Kot ma Alę.\n\nDrugi akapit.")

for turn in doc.turns:
    for sentence in turn.sentences:
        print(sentence.text)
        for token in sentence.tokens:
            print(f"  {token.text}")

# Pass split_level=SplitLevel.TURN to collapse each turn into one Sentence.
```

## License

GPL-3.0-or-later
