Metadata-Version: 2.4
Name: trigram-llm
Version: 0.1.0
Summary: Fast C-backed trigram language model for word prediction and sentence completion
Author: Raghottam Girish Nadgoudar
License-Expression: MIT
Project-URL: Homepage, https://github.com/ROHITH-KUMAR-L/Trigrams
Project-URL: Repository, https://github.com/ROHITH-KUMAR-L/Trigrams
Project-URL: Bug Tracker, https://github.com/ROHITH-KUMAR-L/Trigrams/issues
Keywords: nlp,language-model,trigram,autocomplete,prediction,text
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: C
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file
Dynamic: requires-python

# trigram-llm 🧠

A **fast Python library** (currently alpha) for next-word prediction and sentence completion, powered by a hand-written C engine that uses a prefix trie, a DJB2 hash map, and Stupid Backoff smoothing.

> Sub-millisecond predictions · Zero dependencies · Pure ctypes · Thread-safe

---

## Features

| Feature | Description |
|---|---|
| `train_from_text(text)` | Train from any Python string |
| `train_from_file(path)` | Train from a text file (incremental) |
| `train_from_list(words)` | Train from a pre-tokenised word list |
| `predict_next(w1, w2)` | Greedy single-word prediction (< 1ms) |
| `predict_top_n(w1, w2, n, temperature)` | Top-N predictions with probabilities |
| `complete_sentence(prompt, num_words, beam_width)` | Beam search sentence generation |
| `greedy_generate(prompt, num_words)` | Fastest sentence completion |
| `perplexity(text)` | Evaluate model quality on held-out text |
| `vocabulary()` | Returns all known words as a Python `set` |
| `get_stats()` | Dict with trigram count, vocab size, etc. |
| `save(path)` / `TrigramModel.load(path)` | Binary model persistence |
| `reset()` | Clear all training data so the model can be retrained from scratch |
| `"the quick" in model` | Check if a bigram context was seen |
| `len(model)` | Total number of stored trigrams |
| Thread-safe | All predictions guarded by a `threading.Lock` |
| Context manager | `with TrigramModel.load(path) as m:` |

---

## Installation

### Prerequisites
- Python 3.8+
- GCC (macOS: `xcode-select --install`, Ubuntu: `sudo apt install gcc`)

### Install (one command)

```bash
cd /path/to/Trigrams
pip install -e .
```

This compiles the C engine into a shared library (`trigram/_trigram_c.dylib` on macOS, `.so` on Linux) and installs the package in editable mode.

---

## Quickstart

```python
from trigram import TrigramModel

# 1. Create and train
model = TrigramModel()
model.train_from_text("""
    The quick brown fox jumps over the lazy dog.
    The quick brown fox was nimble and swift.
    The lazy dog slept peacefully under the old oak tree.
""")

# 2. Predict next word (greedy)
word = model.predict_next("the", "quick")
print(word)  # → "brown"

# 3. Top-N predictions with probabilities
preds = model.predict_top_n("the", "quick", n=3, temperature=1.0)
# A list of dicts sorted by probability, e.g.
# [{"word": "brown", "probability": ..., "count": 2}]
# (in this corpus "the quick" is followed only by "brown", seen twice)

# 4. Sentence completion (beam search)
completions = model.complete_sentence("the quick", num_words=4, beam_width=3)
# [{"sentence": "the quick brown fox jumps", "probability": 0.012}, ...]

# 5. Greedy generation (fastest)
sentence = model.greedy_generate("the quick", num_words=3)
# "the quick brown fox"

# 6. Evaluate quality
ppl = model.perplexity("the quick brown fox")
print(f"Perplexity: {ppl:.2f}")

# 7. Inspect model
print(len(model))          # → total trigrams
print("the quick" in model)  # → True
print(model.vocabulary())  # → {"the", "quick", "brown", ...}
print(model.get_stats())   # → {"total_trigrams": 7, "unique_first_words": 3, ...}
```

---

## Training from a File

```python
from trigram import TrigramModel

model = TrigramModel()
model.train_from_file("path/to/my_corpus.txt")

# Incremental training — add more data later
model.train_from_file("path/to/more_data.txt")
```

---

## Saving and Loading Models

```python
# Save
model.save("my_model.bin")

# Load (class method)
model2 = TrigramModel.load("my_model.bin")

# Context manager (auto-frees on exit)
with TrigramModel.load("my_model.bin") as m:
    print(m.predict_next("the", "quick"))
```

---

## Temperature Sampling

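The `temperature` parameter reshapes the probability distribution over candidate words: low values concentrate probability on the most common word, `1.0` leaves the distribution unchanged, and higher values spread probability across more candidates.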

```python
# Near-deterministic: probability mass concentrates on the most common word
model.predict_top_n("the", "quick", temperature=0.1)

# Standard probability distribution
model.predict_top_n("the", "quick", temperature=1.0)

# More diverse / creative
model.predict_top_n("the", "quick", temperature=2.0)
```
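
To make the effect concrete, here is a minimal pure-Python sketch of one common way to apply temperature to raw counts. It assumes each count is raised to the power `1 / temperature` and the results are renormalised; the C engine may use a different formula. The `apply_temperature` helper and the counts are hypothetical.

```python
def apply_temperature(counts, temperature=1.0):
    """Turn raw next-word counts into probabilities reshaped by temperature.

    Assumption: count ** (1 / temperature), then renormalise.
    This is an illustration, not the library's C implementation.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {word: count ** (1.0 / temperature) for word, count in counts.items()}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}


# Hypothetical counts for a single two-word context
counts = {"brown": 3, "red": 1}
for t in (0.5, 1.0, 2.0):
    print(t, apply_temperature(counts, t))
```

With these counts, `temperature=1.0` gives plain relative frequencies, lower values push more probability onto `"brown"`, and higher values move the two words closer together.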

---

## Advanced Usage

### Train from a word list (custom tokenisation)

```python
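# Requires the third-party nltk package and its tokenizer data:
# pip install nltk, then nltk.download("punkt") ("punkt_tab" on newer releases)
import nltk
from trigram import TrigramModel

tokens = nltk.word_tokenize("The quick brown fox")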
tokens = [t.lower() for t in tokens if t.isalpha()]

model = TrigramModel()
model.train_from_list(tokens)
```
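
If you would rather not add a third-party tokeniser, a regular expression from the standard library is often enough. The `tokenise` helper below is a hypothetical example, not part of the library; it lowercases the text and keeps only runs of ASCII letters.

```python
import re


def tokenise(text):
    """Lowercase the text and return every run of ASCII letters as a token."""
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenise("The quick brown fox, the LAZY dog!")
print(tokens)  # ['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog']
```

The resulting list can be passed to `model.train_from_list(tokens)`. Note that this pattern splits contractions, so "don't" becomes `don` and `t`.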

### Thread-safe batch prediction

```python
import threading

from trigram import TrigramModel


def worker(model, results, idx):
    # Each thread writes to its own slot, so the results list needs no extra locking
    results[idx] = model.predict_top_n("the", "quick", n=5)


model = TrigramModel.load("model.bin")
results = [None] * 10
threads = [threading.Thread(target=worker, args=(model, results, i)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

### Check if a context exists before predicting

```python
if "the quick" in model:
    result = model.predict_next("the", "quick")
```
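
For contexts that were never seen, Stupid Backoff (the smoothing scheme this library is built on) falls back to shorter contexts and multiplies the score by a fixed factor at each step. The sketch below is a pure-Python illustration of that scoring rule, not the C engine's code. The factor `0.4` is the value suggested in the original Stupid Backoff paper; the value used by this library is not stated here, and all function names are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor (Brants et al., 2007); the library's value may differ


def build_counts(tokens):
    """Count unigrams, bigrams, and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def stupid_backoff_score(w1, w2, w3, uni, bi, tri):
    """Score w3 after (w1, w2). Scores are relative, not normalised probabilities."""
    if tri[(w1, w2, w3)] > 0:
        # Trigram seen: plain relative frequency
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        # Back off to the bigram, discounted once
        return ALPHA * bi[(w2, w3)] / uni[w2]
    # Back off to the unigram, discounted twice
    total = sum(uni.values())
    return ALPHA * ALPHA * uni[w3] / total if total else 0.0


tokens = "the quick brown fox jumps over the lazy dog".split()
uni, bi, tri = build_counts(tokens)
print(stupid_backoff_score("the", "quick", "brown", uni, bi, tri))  # seen trigram
print(stupid_backoff_score("a", "lazy", "dog", uni, bi, tri))       # bigram fallback
print(stupid_backoff_score("a", "b", "the", uni, bi, tri))          # unigram fallback
```

Because the scores are not renormalised, they do not sum to one across the vocabulary; that is the trade-off Stupid Backoff makes for speed.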

---

## API Reference

### `TrigramModel()`
Creates a new empty model.

### `train_from_text(text: str) → int`
Train on a raw text string. Returns the number of trigrams inserted.

### `train_from_file(path) → int`
Train from a text file. Returns the number of trigrams inserted.

### `train_from_list(words: list) → int`
Train from a pre-tokenised word list. Returns the number of trigrams inserted.

### `predict_next(w1, w2) → str | None`
Return the single most likely next word, or `None` if the model has no prediction for this context.

### `predict_top_n(w1, w2, n=5, temperature=1.0) → list[dict]`
Return up to N predictions sorted by probability descending.
Each dict: `{"word": str, "probability": float, "count": int}`.

### `complete_sentence(prompt, num_words=5, beam_width=3) → list[dict]`
Generate sentence completions via beam search.
Each dict: `{"sentence": str, "probability": float}`.
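
As a rough illustration of what beam search does, the sketch below keeps the `beam_width` most probable partial sentences at every step. It runs over a hypothetical hand-written lookup table instead of a trained model, and it is not the library's C implementation; `TOY_MODEL` and `beam_search` are made-up names.

```python
import math

# Hypothetical next-word table: (w1, w2) -> {next_word: probability}
TOY_MODEL = {
    ("the", "quick"): {"brown": 0.75, "red": 0.25},
    ("quick", "brown"): {"fox": 1.0},
    ("quick", "red"): {"fox": 0.5, "car": 0.5},
}


def beam_search(prompt, num_words, beam_width, model=TOY_MODEL):
    """Return up to beam_width completions as (sentence, probability), best first."""
    beams = [(prompt.lower().split(), 0.0)]  # (tokens, log-probability)
    for _ in range(num_words):
        candidates = []
        for tokens, logp in beams:
            options = model.get(tuple(tokens[-2:]), {})
            if not options:
                # Dead end: carry the beam forward unchanged
                candidates.append((tokens, logp))
                continue
            for word, prob in options.items():
                candidates.append((tokens + [word], logp + math.log(prob)))
        # Keep only the highest-scoring partial sentences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(" ".join(tokens), math.exp(logp)) for tokens, logp in beams]


for sentence, prob in beam_search("the quick", num_words=2, beam_width=2):
    print(f"{prob:.3f}  {sentence}")
```

Setting `beam_width=1` in this sketch reduces it to greedy decoding, which is why greedy generation is the faster of the two approaches.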

### `greedy_generate(prompt, num_words=5) → str`
Fastest sentence completion using greedy decoding.

### `perplexity(text) → float`
Compute per-token perplexity on held-out text. Lower = better.
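
For reference, per-token perplexity is the exponential of the average negative log-probability of the tokens. The sketch below shows that arithmetic on a list of already-computed probabilities; how the library assigns probabilities to unseen words is not covered here, and `perplexity_from_probs` is a hypothetical helper.

```python
import math


def perplexity_from_probs(token_probs):
    """Per-token perplexity: exp of the mean negative log-probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0 for p in token_probs):
        # A zero-probability token makes perplexity unbounded
        return math.inf
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 words, so its perplexity is (approximately) 4.
print(perplexity_from_probs([0.25, 0.25, 0.25]))
```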

### `vocabulary() → set[str]`
All words seen in the first-word position of training trigrams.

### `get_stats() → dict`
`{"total_trigrams": int, "unique_first_words": int, "vocabulary_size": int}`.

### `save(path) → None`
Save model to binary file. Compatible with the C CLI tool.

### `TrigramModel.load(path) → TrigramModel` (classmethod)
Load a pre-trained binary model. Supports context manager protocol.

### `reset() → None`
Clear all training data.

### `len(model)` → int
Total stored trigrams.

### `"w1 w2" in model` / `("w1", "w2") in model` → bool
Check if a bigram context exists.

### `repr(model)`
Returns a short summary string, for example `TrigramModel(trigrams=11,062,203, vocab=97,277)`.

---

## Performance

| Operation | Latency |
|---|---|
| Single word prediction | < 1ms |
| Top-5 predictions | 1–2ms |
| Beam search (5 words, width 3) | 5–10ms |
| Training (1M words) | ~30s |

---

## Running Tests

```bash
pip install -e ".[dev]"   # installs pytest and pytest-cov via the "dev" extra
pytest tests/ -v
```

---

## Project Structure

```
Trigrams/
├── trigram/                  # Python library
│   ├── __init__.py
│   ├── _lib.py               # ctypes bindings
│   ├── model.py              # TrigramModel class
│   ├── utils.py              # Text preprocessing
│   └── _trigram_c.dylib      # Compiled C engine (auto-generated)
├── trigram_llm/
│   ├── src/                  # C source files
│   └── include/              # C headers
├── tests/                    # pytest test suite
├── setup.py                  # Build script
└── pyproject.toml
```

---

## License

MIT License — feel free to use, modify, and distribute.
