Metadata-Version: 2.4
Name: combo-nlp
Version: 4.0.0
Summary: COMBO-NLP - A library for Morphosyntactic Tagging and Dependency Parsing.
Author: Maja Jablonska, Michał Ulewicz
License-Expression: GPL-3.0
Project-URL: Homepage, https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
Project-URL: Documentation, https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
Project-URL: Repository, https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
Keywords: nlp,natural-language-processing,dependency-parsing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.35.0
Requires-Dist: peft>=0.6.0
Requires-Dist: wandb>=0.16.0
Requires-Dist: dacite>=1.8.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: sacremoses>=0.1.1
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.9.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.6.0; extra == "dev"
Provides-Extra: lambo
Requires-Dist: lambo>=2.3.1; extra == "lambo"

# COMBO-NLP

COMBO-NLP - A library for Morphosyntactic Tagging and Dependency Parsing.

## Features

- **Two training modes**: full fine-tuning and LoRA (Low-Rank Adaptation) for parameter-efficient training
- **Multi-task learning**: morphosyntactic tagging, lemmatization, dependency parsing
- **Combined label encoding**: UPOS + XPOS + FEATS as a single label for efficient classification
- **Biaffine attention** for dependency parsing
- **Character-level seq2seq** lemmatization
- **CoNLL 2018 metrics**: UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX
- **Multi-treebank training**: optionally combine multiple treebanks for the same language
- **Device support**: NVIDIA CUDA and Apple MPS
- **WandB integration** for experiment tracking
- **Checkpoints** after each epoch with best model selection
- **Full pipeline**: train → export → upload to HuggingFace Hub in one command

## Installation

### Install the uv package manager (if not already installed)
macOS / Linux
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Windows
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

### Automatic installation
Create a new virtual environment.
```bash
uv venv
source .venv/bin/activate
```
Install COMBO-NLP.
```bash
uv pip install combo-nlp
```

### LAMBO segmenter (optional)

A segmenter is only needed when passing raw text strings to COMBO. If you provide pre-tokenized input (`list[str]` or `list[list[str]]`), no segmenter is required.

When you initialize COMBO with a language name (e.g. `COMBO("Polish")`), it automatically loads a [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo) segmenter. If LAMBO is not installed, an `ImportError` is raised. LAMBO is hosted on a custom PyPI index and must be installed separately:

```bash
uv pip install --index-url https://pypi.clarin-pl.eu/ lambo
```

Alternatively, add the custom index to your project's `pyproject.toml` so that `lambo` resolves automatically:

```toml
[[tool.uv.index]]
url = "https://pypi.clarin-pl.eu/"
```

### Source installation

Clone the repository and install in a virtual environment.

```bash
git clone https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp.git
cd combo-nlp
uv venv
source .venv/bin/activate
uv sync
```

## Basic usage

Once `combo-nlp` is installed, use COMBO directly in Python:

```python
from combo import COMBO

# Load by HuggingFace model ID:
nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")
result = nlp("Ala ma kota.")

# Or load by language name:
nlp = COMBO("Polish")
result = nlp("Ala ma kota.")

# Access results:
for sentence in result:
    for token in sentence:
        print(token.form, token.upos, token.head, token.deprel, token.lemma)
```

### Pre-tokenized input

```python
from combo import COMBO

nlp = COMBO.from_pretrained("clarin-pl/combo-nlp-xlm-roberta-base-polish-pbd-ud2.17")

# Single sentence:
result = nlp(["Ala", "ma", "kota", "."])

# Multiple sentences:
result = nlp([["Ala", "ma", "kota", "."], ["Pies", "je", "."]])

# To parse multiple raw text sentences, join them into a single string:
sentences = ["Ala ma kota.", "Pies je."]
result = nlp("\n".join(sentences))
```

## Environment Setup

Copy the example environment file and fill in your API keys:

```bash
cp .env.example .env
```

Edit `.env` with your credentials:

- **`WANDB_API_KEY`** — get it from https://wandb.ai/authorize (required for experiment tracking)
- **`HF_TOKEN`** — create a token with write access at https://huggingface.co/settings/tokens (required for uploading models)

The `.env` file is loaded automatically by all scripts. It is git-ignored and should never be committed.
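The automatic loading is provided by `python-dotenv`, a declared dependency. As a rough illustration of what it does, here is a minimal stdlib-only sketch (the function name and parsing rules are simplified for illustration, not the library's actual implementation):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: parse KEY=VALUE lines into os.environ,
    skipping blanks and comments, without overriding variables
    that are already set in the environment."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))

# After load_env(), scripts can read os.environ["WANDB_API_KEY"]
# and os.environ["HF_TOKEN"] as usual.
```

Because existing environment variables win, you can still override `.env` values per-run from the shell.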

## Quick Start

### Training

```bash
# Train with task-specific overrides:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Resume training from checkpoint:
combo-nlp-train --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --resume /path/to/checkpoint_latest.pt
```

### Evaluation

All evaluation uses the official CoNLL 2018 evaluation script (`conll18_ud_eval.py`).

The test file can be a **`.conllu`** file (aligned evaluation with gold tokenization) or a **`.txt`** file (full-text evaluation with automatic segmentation via LAMBO). The model can be loaded from **HuggingFace Hub** (`--model`), a **local exported directory** (`--model-dir`), or a **training checkpoint** (`--task-config` + `--checkpoint`).

#### Option 1: Evaluate a model from HuggingFace Hub or local directory

```bash
# Evaluate using a HuggingFace model on a CoNLL-U file (aligned evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu

# Evaluate using a HuggingFace model on a plain text file (full-text evaluation):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# Or use a local exported model directory:
combo-nlp-evaluate --model-dir /path/to/model \
    --test-file /path/to/test.conllu

# Save predictions to CoNLL-U format while evaluating:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --save-predictions

# Save results to a custom directory:
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --output-dir ./results/
```

#### Option 2: Evaluate a training checkpoint

```bash
# Evaluate with task config (uses best checkpoint and auto-detects test file from UD treebank):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Evaluate a specific checkpoint on a custom test file:
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint /path/to/checkpoint.pt --test-file /path/to/custom_test.conllu

# Full-text evaluation from checkpoint (language auto-detected from task config):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text
```

#### Option 3: Full-text evaluation (end-to-end with segmentation)

Full-text evaluation segments raw text with LAMBO before parsing and measures end-to-end performance including tokenization and segmentation quality.

```bash
# From HuggingFace model with a .conllu file (uses adjacent .txt for raw input):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.conllu --full-text --language Polish

# From HuggingFace model with a .txt file (auto-resolves matching .conllu for gold scoring):
combo-nlp-evaluate --model clarin-pl/combo-nlp-polish \
    --test-file /path/to/test.txt --full-text --language Polish

# From task config (language auto-detected):
combo-nlp-evaluate --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --full-text
```

Full-text evaluation uses the `.txt` file next to the gold `.conllu` file (standard in UD treebanks) as raw input for LAMBO segmentation, falling back to the `# text =` metadata if no `.txt` file exists. It measures both segmentation (Tokens, Sentences, Words) and parsing quality.
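The `# text =` fallback relies on the sentence-text comments that CoNLL-U files carry. A minimal sketch of recovering raw text from that metadata (the function name is illustrative, not the library's API):

```python
def extract_raw_text(conllu: str) -> str:
    """Collect the '# text = ...' comment of each sentence in a
    CoNLL-U document into raw text, one sentence per line."""
    sentences = []
    for line in conllu.splitlines():
        if line.startswith("# text ="):
            sentences.append(line.split("=", 1)[1].strip())
    return "\n".join(sentences)

sample = (
    "# text = Ala ma kota.\n"
    "1\tAla\tAla\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "\n"
    "# text = Pies je.\n"
    "1\tPies\tpies\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
)
print(extract_raw_text(sample))  # prints each sentence on its own line
```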

#### Option 4: Compare two CoNLL-U files directly

```bash
# Compare gold vs predictions (no model needed):
combo-nlp-evaluate \
    --gold-file /path/to/ud-treebanks/UD_Polish-LFG/pl_lfg-ud-test.conllu \
    --predictions-file outputs/Polish/results/predictions.conllu
```

### Prediction

```bash
# Parse a single sentence using an exported model:
combo-nlp-predict --model-dir /path/to/model \
    --text "Ala ma kota ."

# Parse a CoNLL-U file:
combo-nlp-predict --model-dir /path/to/model \
    --input input.conllu --output output.conllu

# Interactive mode:
combo-nlp-predict --model-dir /path/to/model

# Or use task-config + checkpoint (during training):
combo-nlp-predict --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml \
    --checkpoint data/output/combo-nlp-herbert-base-cased-polish-pdb-ud2.17/checkpoints/checkpoint_best.pt \
    --text "Ala ma kota ."
```

## Configuration

Configuration uses a two-level system:

- **`config/base.yaml`** — shared defaults (architecture, hyperparameters, paths, training settings)
- **`config/tasks/<task>.yaml`** — task-specific overrides (language, treebanks, base model)

Task configs are deep-merged on top of the base config — only specify fields that differ.
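The deep merge described above can be sketched as follows (the config keys shown are made-up examples, not the actual schema of `base.yaml`):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override onto base: nested dicts are merged
    key by key, while any other value in override replaces base's."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"training": {"epochs": 30, "lr": 2e-5}, "embedder": "xlm-roberta-base"}
task = {"training": {"epochs": 10}, "language": "Polish"}
merged = deep_merge(base, task)
# The task config overrides only what it specifies; base fills in the rest:
assert merged == {"training": {"epochs": 10, "lr": 2e-5},
                  "embedder": "xlm-roberta-base", "language": "Polish"}
```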

```
config/
├── base.yaml
└── tasks/
    ├── english.yaml
    ├── german.yaml
    └── ...
```

Training and the full pipeline require `--task-config`. Evaluation and prediction can also use `--model-dir` to load from an exported model directory.

## Model Export

Export a trained model to a local directory (mirroring the HuggingFace repo structure) or push it directly to HuggingFace Hub.

```bash
# Export to local directory:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml

# Export and push to HuggingFace Hub:
combo-nlp-export --task-config config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml --push-to-hub
```

The exported directory contains everything needed to load the model:

```
data/export/<model_name>/
├── config.json          # Model configuration
├── pytorch_model.bin    # Weights (optimizer state stripped)
├── README.md            # Model card (auto-generated from eval results)
└── encoders/
    ├── morpho_encoder.json
    ├── deprel_encoder.json
    └── char_vocab.json
```

The model card (`README.md`) is generated automatically if `results_best.json` is found in the training output directory.
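Generating a model card from an eval-results file boils down to templating the metrics into markdown (the library declares `jinja2` for this; the sketch below uses plain f-strings, and the JSON schema shown is illustrative, not the actual `results_best.json` layout):

```python
import json

def render_model_card(results_path: str) -> str:
    """Render a minimal markdown model card from a JSON file that
    maps metric names to F1 scores."""
    with open(results_path) as fh:
        results = json.load(fh)
    rows = "\n".join(f"| {name} | {score:.2f} |" for name, score in results.items())
    return f"# Model card\n\n| Metric | F1 |\n|---|---|\n{rows}\n"
```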

## Full Pipeline

Run the complete workflow — train, export, and upload to HuggingFace Hub — in a single command:

```bash
# Full pipeline:
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml

# Dry run (preview commands without executing):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --dry-run

# Resume from a specific step (e.g. training already done):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --start-from export

# Run only training (no export/upload):
combo-nlp-pipeline --task-config config/tasks/combo-nlp-test-xlm-roberta-base-polish-pdb-ud2.17.yaml --stop-after train
```

The pipeline stops immediately if any step fails. Steps:

1. **Train** — fine-tune the model with per-epoch evaluation (aligned + full-text) and best model selection
2. **Export** — package the best model for distribution (strip optimizer state, generate model card)
3. **Upload** — push the exported model to HuggingFace Hub

## Adding a New Language

To add a new language, create a task config in `config/tasks/` and train. See existing configs (e.g., `config/tasks/combo-nlp-herbert-base-cased-polish-pdb-ud2.17.yaml`) for reference.

## Multi-Treebank Training

When multiple treebanks exist for a language (e.g., Polish has LFG, PDB, PUD, MPDT), the pipeline automatically:

1. **Combines training data** from all specified treebanks
2. **Combines dev data** for validation
3. **Evaluates on combined test data**

This provides more training data and better generalization.
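Because each CoNLL-U sentence is a self-contained block separated by a blank line, combining treebank splits amounts to concatenation. A rough stdlib sketch, assuming plain concatenation is all that is needed (the actual pipeline's file handling may differ):

```python
from pathlib import Path

def combine_conllu(paths: list[str], out_path: str) -> None:
    """Concatenate CoNLL-U files, normalizing so that exactly one blank
    line separates the last sentence of one file from the first
    sentence of the next."""
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    Path(out_path).write_text("\n\n".join(parts) + "\n\n", encoding="utf-8")
```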

**Supported devices**:
- NVIDIA CUDA GPUs
- Apple Silicon (MPS)
- CPU (slow, not recommended)

## WandB Integration

Training metrics are logged to Weights & Biases:

- **train/loss**, **train/arc_loss**, **train/rel_loss**, **train/lemma_loss**, etc.: Per-step losses
- **train/lr_encoder**, **train/lr_head**: Learning rates
- **dev_eval/{metric}**: Dev set F1 metrics (aligned evaluation, gold tokenization)
- **dev_fulltext/{metric}**: Dev set F1 metrics (full-text evaluation, LAMBO segmentation)
- **test_eval/...**, **test_fulltext/...**: Same for test set
- **train_eval/...**: Training subset metrics (aligned)

Metrics include all CoNLL 2018 F1 scores: Tokens, Sentences, Words, UPOS, XPOS, UFeats, AllTags, Lemmas, UAS, LAS, CLAS, MLAS, BLEX. Aligned evaluation uses gold tokenization (Tokens/Sentences/Words always 100%); full-text evaluation measures end-to-end performance including segmentation.
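All of these scores are F1 values over aligned units: precision is correct units over system units, recall is correct over gold. As a quick illustration of the arithmetic only (the official `conll18_ud_eval.py` additionally performs token alignment between system and gold files):

```python
def f1(correct: int, system: int, gold: int) -> float:
    """CoNLL 2018-style F1: harmonic mean of precision (correct/system)
    and recall (correct/gold)."""
    if correct == 0 or system == 0 or gold == 0:
        return 0.0
    precision = correct / system
    recall = correct / gold
    return 2 * precision * recall / (precision + recall)

# e.g. 95 correctly attached words out of 100 predicted and 98 gold words:
print(round(100 * f1(95, 100, 98), 2))  # → 95.96
```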

To disable WandB:
```yaml
wandb:
  enabled: false
```

## CLI Commands

After `pip install`, the following commands are available:

| Command | Description |
|---|---|
| `combo-nlp-train` | Train a model |
| `combo-nlp-evaluate` | Evaluate a model |
| `combo-nlp-predict` | Run predictions |
| `combo-nlp-export` | Export a trained model |
| `combo-nlp-pipeline` | Run the full train → export → upload pipeline |

For development without installing, use `python scripts/<name>.py` instead (e.g. `python scripts/train.py`).

## Testing

Install dev dependencies:
```bash
uv sync --extra dev
```

Run all tests:
```bash
PYTHONPATH=src pytest test/ -v
```

## License

This project is licensed under GPL-3.0.

## Citations
