Metadata-Version: 2.4
Name: alta-models-sft
Version: 1.1.1
Summary: ALTAModel SFT — instruction-tuned Kinyarwanda language models from YaliLabs.
Project-URL: Homepage, https://github.com/yalilabs/alta-models-sft
Project-URL: Repository, https://github.com/yalilabs/alta-models-sft
Project-URL: Issues, https://github.com/yalilabs/alta-models-sft/issues
Project-URL: Model Hub, https://huggingface.co/yalilabs
Author-email: YaliLabs <info@yalilabs.com>
License-File: LICENSE
Keywords: ALTA,Rwanda,african-nlp,instruction-tuning,kinyarwanda,language-model,llm,sft
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: huggingface-hub>=0.23
Requires-Dist: safetensors>=0.4
Requires-Dist: torch>=2.2
Requires-Dist: transformers>=4.40
Provides-Extra: all
Requires-Dist: build>=1.2; extra == 'all'
Requires-Dist: datasets>=2.18; extra == 'all'
Requires-Dist: fastapi>=0.110; extra == 'all'
Requires-Dist: psutil>=5.9; extra == 'all'
Requires-Dist: pydantic>=2.6; extra == 'all'
Requires-Dist: pynvml>=11.5; extra == 'all'
Requires-Dist: pytest-cov>=5; extra == 'all'
Requires-Dist: pytest>=8; extra == 'all'
Requires-Dist: ruff>=0.4; extra == 'all'
Requires-Dist: tensorboard>=2.15; extra == 'all'
Requires-Dist: tqdm>=4.66; extra == 'all'
Requires-Dist: twine>=5; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'all'
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: twine>=5; extra == 'dev'
Provides-Extra: serve
Requires-Dist: fastapi>=0.110; extra == 'serve'
Requires-Dist: pydantic>=2.6; extra == 'serve'
Requires-Dist: uvicorn[standard]>=0.29; extra == 'serve'
Provides-Extra: train
Requires-Dist: datasets>=2.18; extra == 'train'
Requires-Dist: psutil>=5.9; extra == 'train'
Requires-Dist: pynvml>=11.5; extra == 'train'
Requires-Dist: tensorboard>=2.15; extra == 'train'
Requires-Dist: tqdm>=4.66; extra == 'train'
Description-Content-Type: text/markdown

<div align="center">

# ALTA Models — SFT

**Instruction-tuned Kinyarwanda language models from [YaliLabs](https://yalilabs.com)**

[![PyPI version](https://img.shields.io/pypi/v/alta-models-sft.svg?color=blue)](https://pypi.org/project/alta-models-sft/)
[![Python](https://img.shields.io/pypi/pyversions/alta-models-sft.svg)](https://pypi.org/project/alta-models-sft/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
[![Hugging Face](https://img.shields.io/badge/🤗-Models-yellow)](https://huggingface.co/yalilabs)

</div>

---

ALTA is a family of language models built **Kinyarwanda-first** — the tokenizer, training data, and inference are optimized for Kinyarwanda rather than treated as an afterthought to English. This package gives you a clean, dependency-light runtime for chatting with ALTA models in Python or from the command line.

## Installation

```bash
pip install alta-models-sft
```

That's it. The package pulls in `torch`, `transformers`, `huggingface_hub`, and `safetensors` — nothing else by default.

For the optional FastAPI server (`alta-sft serve`):

```bash
pip install "alta-models-sft[serve]"
```

## Quick start

```python
from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
print(chat.chat("Mwiriwe! Ushobora kumbwira amateka y'u Rwanda?"))
```

Or from the terminal:

```bash
alta-sft chat --model yalilabs/alta-base-sft --stream
```

That's the whole thing. Below is everything you'd want to do with it.

## Available models

| Model | Parameters | Context | Description |
|-------|-----------:|--------:|-------------|
| [`yalilabs/alta-base-sft`](https://huggingface.co/yalilabs/alta-base-sft) | ~110M | 4,096 | Base instruction-tuned model |

See [huggingface.co/yalilabs](https://huggingface.co/yalilabs) for the full list. In production, **pin to a specific revision**:

```python
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")
```

## Inference cookbook

Everything below uses the same `ALTAChat` class. Copy-paste any block to try it.

### 1. Basic chat (single turn)

```python
from alta_models_sft import ALTAChat

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft")
response = chat.chat("Sobanura ubumenyi bw'ikoranabuhanga.")
print(response)
```

### 2. Multi-turn conversation (with memory)

The model remembers prior turns. Just keep calling `chat()`:

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    use_memory=True,
    max_history_turns=8,
)

chat.chat("Mwiriwe! Nitwa Schadrack.")
chat.chat("Witwa nde?")                # uses the previous turn as context
chat.chat("Wansubize mu magambo make.")

chat.reset()                           # clear history
chat.set_memory(False)                 # disable memory entirely
```

### 3. GPU + bfloat16 for speed

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda",
    dtype="bfloat16",                  # "float32" | "bfloat16" | "float16"
)
```

### 4. Streaming output (token-by-token)

```python
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")

# Tokens print to stdout as they're generated.
# The full response is also returned at the end.
response = chat.chat(
    "Sobanura amateka y'u Rwanda mu magambo make.",
    stream=True,
)
```

### 5. Tuning the sampler

```python
# More focused / factual
response = chat.chat(
    "Ni iki Kigali?",
    temperature=0.3, top_p=0.85, top_k=40,
)

# More creative
response = chat.chat(
    "Andika inkuru ngufi y'amateka.",
    temperature=0.8, top_p=0.95, top_k=50,
)

# Longer outputs
response = chat.chat(
    "Sobanura uburezi mu Rwanda.",
    max_new_tokens=1024,
    repetition_penalty=1.05,
)
```

| Parameter | Default | What it does |
|-----------|--------:|--------------|
| `temperature` | `0.5` | Lower = focused, higher = creative |
| `top_p` | `0.85` | Nucleus sampling threshold (`1.0` disables) |
| `top_k` | `40` | Keep only top-k candidates (`0` disables) |
| `repetition_penalty` | `1.05` | Penalize repeated tokens (`1.0` disables) |
| `max_new_tokens` | `512` | Maximum tokens to generate |
| `stream` | `False` | Print tokens as they're generated |

### 6. Loading from a local directory

`from_pretrained` accepts any local path — useful if you've downloaded weights manually:

```python
# Relative path
chat = ALTAChat.from_pretrained("./my_local_model")

# Absolute path
chat = ALTAChat.from_pretrained("/opt/models/alta-base-sft")

# Home directory
chat = ALTAChat.from_pretrained("~/models/alta")
```

The same code works for both local paths and Hub repos — no branching required.

### 7. Private repos (authentication)

```python
import os
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model")

# Or pass the token directly
chat = ALTAChat.from_pretrained("yalilabs/alta-private-model", token="hf_...")
```

### 8. Batch inference (process many prompts)

`ALTAChat` is single-conversation. For independent prompts, reset between calls:

```python
prompts = [
    "Mwiriwe!",
    "Bite, witwa nde?",
    "Sobanura izuba.",
    "Kuki amazi ari ingenzi?",
]

chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", device="cuda")
results = []
for p in prompts:
    chat.reset()                       # so prompts don't influence each other
    results.append(chat.chat(p, max_new_tokens=128))

for prompt, response in zip(prompts, results):
    print(f"Q: {prompt}\nA: {response}\n")
```

### 9. Custom system prompt

By default, the model uses a Kinyarwanda assistant persona. To override:

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    system_prompt="Uri umwarimu w'amateka. Subiza nk'umwarimu.",
)
```

### 10. Debugging: disable token masking

The model masks out non-Kinyarwanda Unicode (CJK, Arabic, etc.) by default. To see raw model output:

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    mask_non_kinyarwanda=False,        # not recommended for production
)
```

## Command-line interface

The package installs an `alta-sft` command. Three subcommands cover most needs.

### Interactive chat

```bash
alta-sft chat --model yalilabs/alta-base-sft --stream
```

In-session: `/reset` clears memory, `/quit` exits.

### One-shot generation

```bash
alta-sft generate "Sobanura ubumenyi bw'ikoranabuhanga" \
    --model yalilabs/alta-base-sft \
    --temperature 0.5 \
    --max_new_tokens 256 \
    --stream
```

### HTTP server (FastAPI)

```bash
pip install "alta-models-sft[serve]"
alta-sft serve --model yalilabs/alta-base-sft --host 0.0.0.0 --port 8000
```

```bash
# Health check
curl http://localhost:8000/health

# Chat
curl -X POST http://localhost:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "Mwiriwe!", "temperature": 0.5, "max_new_tokens": 128}'
```

Interactive API docs are at `http://localhost:8000/docs`.

### Common CLI flags

```
--model REPO_OR_PATH    Hub repo or local directory (required)
--revision REV          Pin to a Hub tag / branch / SHA
--device DEVICE         cpu | cuda | cuda:N
--dtype DTYPE           float32 | bfloat16 | float16
--temperature FLOAT     Sampling temperature
--top_p FLOAT           Nucleus sampling
--top_k INT             Top-k filtering
--max_new_tokens INT    Max tokens to generate
--no_memory             Disable multi-turn memory
--stream                Token-by-token output
```

Run `alta-sft --help` or `alta-sft chat --help` for the full list.

## Production deployment

### Docker

```dockerfile
FROM python:3.11-slim
RUN pip install --no-cache-dir "alta-models-sft[serve]"
ENV ALTA_MODEL=yalilabs/alta-base-sft \
    ALTA_REVISION=v1.0 \
    ALTA_DEVICE=cpu \
    ALTA_DTYPE=float32
# Pre-download weights at build time → fast cold-start
RUN python -c "from alta_models_sft import ALTAChat; \
    ALTAChat.from_pretrained('${ALTA_MODEL}', revision='${ALTA_REVISION}')"
EXPOSE 8000
CMD ["uvicorn", "alta_models_sft.server:app", "--host", "0.0.0.0", "--port", "8000"]
```

### Version pinning

The runtime and the model version independently. Pin both:

```bash
pip install "alta-models-sft==0.1.0"
```

```python
chat = ALTAChat.from_pretrained("yalilabs/alta-base-sft", revision="v1.0")
```

Every published model carries a `model_format_version`. The runtime refuses to load incompatible formats with a clear error — so a user pinning `alta-models-sft==0.1.0` can never accidentally load a checkpoint that needs a newer runtime.

## Troubleshooting

<details>
<summary><b>Model produces non-Kinyarwanda characters (CJK / Arabic)</b></summary>

Token masking is on by default and should prevent this. Make sure you haven't passed `mask_non_kinyarwanda=False` or the `--no_mask` CLI flag.
</details>

<details>
<summary><b>"Could not load tokenizer"</b></summary>

Pass it explicitly:

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    tokenizer_name="yalilabs/alta-tokenizer",
)
```
</details>

<details>
<summary><b>ModelFormatError on load</b></summary>

Your installed `alta-models-sft` is older than the model's format. Upgrade:

```bash
pip install -U alta-models-sft
```

Or pin to a model revision compatible with your installed runtime.
</details>

<details>
<summary><b>Out of memory on GPU</b></summary>

Use bfloat16:

```python
chat = ALTAChat.from_pretrained(
    "yalilabs/alta-base-sft",
    device="cuda", dtype="bfloat16",
)
```
</details>

<details>
<summary><b>Slow first generation</b></summary>

The first call always pays a one-time cost (CUDA kernel autotuning, tokenizer warm-up). Subsequent calls are much faster. The FastAPI server pre-warms on startup to avoid this on first request.
</details>

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — free for commercial and non-commercial use.

## Citation

```bibtex
@software{alta_models_sft_2026,
  author  = {YaliLabs},
  title   = {ALTA Models — SFT: Instruction-tuned Kinyarwanda Language Models},
  year    = {2026},
  url     = {https://pypi.org/project/alta-models-sft/},
  version = {0.1.0},
}
```

---

<div align="center">

Built by [YaliLabs](https://yalilabs.com) for Kinyarwanda speakers worldwide

[Website](https://yalilabs.com) · [Models on 🤗](https://huggingface.co/yalilabs)

</div>