Metadata-Version: 2.4
Name: langidentify
Version: 1.0.0
Summary: Fast, high-accuracy language detection for Python. Uses ngram classification augmented with a topwords signal for improved short-text accuracy. Supports 80+ languages.
Author-email: Jeremy Lilley <jeremy@jlilley.net>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/jlpka/langidentify
Project-URL: Repository, https://github.com/jlpka/langidentify
Keywords: language-detection,nlp,ngram,text-classification
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cjclassifier>=1.0.3
Provides-Extra: full
Requires-Dist: langidentify-full-model>=1.0.0; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# LangIdentify

A fast, lightweight language detection library for Python. LangIdentify detects
the language of text using a combination of ngram frequency analysis and
whole-word ("topwords") frequency signals, both trained on the Wikipedia corpus.
It supports 80+ languages across Latin, Cyrillic, Arabic, CJK, and many other
scripts, and runs entirely offline with no network calls.

Most language detection libraries rely solely on character ngram models. While
ngrams are an excellent primary signal, they struggle with short or ambiguous
text. LangIdentify augments ngram scoring with a topwords signal that identifies
common whole words from each language. This improves accuracy on short
sentences compared with ngram-only scoring, even on two-word phrases.
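
To make the idea concrete, here is a toy sketch of combining a character-trigram
score with a whole-word score. The tables, the floor value, the weight, and the
function names are all invented for illustration; this is not LangIdentify's
actual model data or scoring formula.

```python
# Hypothetical, hand-made log-probability tables (not real model data).
NGRAM_LOGP = {
    "en": {"the": -2.0, "he ": -2.5, " th": -2.2},
    "fr": {"le ": -2.1, " le": -2.3, "de ": -2.4},
}
TOPWORD_LOGP = {
    "en": {"the": -1.5, "of": -2.0},
    "fr": {"le": -1.6, "de": -1.8},
}
FLOOR = -12.0  # score assigned to anything missing from a table


def trigrams(text):
    # Pad with spaces so word boundaries become part of the ngrams.
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def score(text, lang, topword_weight=1.0):
    # Sum trigram log-probabilities, then add the weighted whole-word signal.
    ngram_score = sum(NGRAM_LOGP[lang].get(g, FLOOR) for g in trigrams(text))
    word_score = sum(TOPWORD_LOGP[lang].get(w, FLOOR) for w in text.lower().split())
    return ngram_score + topword_weight * word_score


def toy_detect(text):
    # Pick the language with the highest combined score.
    return max(NGRAM_LOGP, key=lambda lang: score(text, lang))
```

On a two-word input such as `"le chat"`, the single known word `le` contributes
a large share of the total score, which is why a whole-word signal matters most
when there are few ngrams to go on.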

## Quick start

### Install

```bash
pip install langidentify
```

For the full (higher-accuracy) model:

```bash
pip install "langidentify[full]"
```

### Basic usage

```python
from langidentify import Detector, Model, Language

# Load the model for the languages you care about.
languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)

# Create a detector (lightweight, not thread-safe -- use one per thread).
detector = Detector(model)

# Detect.
lang = detector.detect("Bonjour le monde")
print(lang)            # Language.FRENCH
print(lang.iso_code)   # fr
```

### Inspecting results

After detection, `detector.results` provides scoring details:

```python
detector.detect("The quick brown fox")
results = detector.results
print(results.result)  # Language.ENGLISH
print(results.gap)     # confidence gap (0.0 = close, 1.0 = decisive)
```
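
One way to use the gap is to accept a detection only when it clears a
threshold. The helper below is a hypothetical wrapper, not part of LangIdentify;
it relies only on the `result` and `gap` attributes shown above, and the default
of 0.2 is an arbitrary placeholder to tune against your own data.

```python
def confident_result(results, min_gap=0.2):
    """Return results.result if the gap is at least min_gap, else None."""
    if results.gap >= min_gap:
        return results.result
    return None
```

After a call to `detector.detect(...)`, this would be used as
`lang = confident_result(detector.results)`, with `None` meaning "too close to
call".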

### Incremental detection

For streaming or multi-part text:

```python
detector.clear_scores()
detector.add_text("Bonjour")
detector.add_text(" le monde")
result = detector.compute_result()  # Language.FRENCH
```
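
The three calls above can be wrapped in a small helper for any iterable of text
chunks. `detect_chunks` is a hypothetical name, not a LangIdentify function. The
sketch assumes chunks break at word boundaries, as in the example above, since
this README does not describe how a word split across two `add_text` calls is
handled.

```python
def detect_chunks(detector, chunks):
    """Feed an iterable of text chunks to a detector and return the result."""
    detector.clear_scores()  # start from a clean state
    for chunk in chunks:
        if chunk:  # skip empty chunks
            detector.add_text(chunk)
    return detector.compute_result()
```

Because a text file opened in Python iterates line by line, an open file object
can be passed directly as `chunks`.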

### Language boosts

When you have prior context (e.g. an HTTP Accept-Language header), you can bias
detection toward expected languages:

```python
boosts = model.build_boost_array({Language.FRENCH: 0.08})
lang = detector.detect("message", boosts)  # FRENCH
# Without the boost, "message" is ambiguous between English and French.
```
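
As a rough sketch of where such boosts might come from, the helper below turns
an `Accept-Language` header into per-language weights keyed by primary language
subtag. The function name, the 0.08 cap (borrowed from the example above), and
the scaling by q-value are illustrative assumptions. Mapping the resulting codes
to `Language` members before calling `build_boost_array` is left to the caller.

```python
def accept_language_boosts(header, max_boost=0.08):
    """Map primary language subtags to boosts scaled by their q-values."""
    boosts = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        tag = pieces[0].lower()
        if not tag or tag == "*":
            continue  # skip empty entries and the wildcard
        q = 1.0  # q defaults to 1 when omitted
        for param in pieces[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not wanted"
        code = tag.split("-")[0]  # "fr-CH" -> "fr"
        boost = max_boost * max(0.0, min(q, 1.0))
        # Keep the strongest boost per language; q=0 entries are dropped.
        if boost > boosts.get(code, 0.0):
            boosts[code] = boost
    return boosts
```

Keeping the boosts small means they act as a tie-breaker for ambiguous input
rather than overriding a clear detection.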

### Loading from a filesystem path

If you prefer to point directly at model data files instead of using the
bundled package data:

```python
model = Model.load_from_path("/path/to/models/lite", languages)
```

## Choosing languages

**Configure only the languages you actually need.** Each additional language
increases loading time and memory usage. Closely related languages can
cross-detect on very short phrases -- for example, adding Luxembourgish when
you only need German may cause short German phrases to be misidentified.

Group aliases are supported for convenience:

| Alias | Languages |
|-------|-----------|
| `efigs` | English, French, Italian, German, Spanish |
| `efigsnp` | EFIGS + Dutch, Portuguese |
| `europe_west_common` | EFIGSNP + Nordic languages |
| `europe_common` | Western + Eastern European + Cyrillic |
| `cjk` | Chinese (Simplified/Traditional), Japanese, Korean |
| `latin_alphabet` | All Latin-script languages |
| `unique_alphabet` | Languages where the script implies the language (e.g. Thai, Greek) |

```python
languages = Language.from_comma_separated("europe_west_common,cjk")
```

## Lite vs. full model

Both models are trained on the same Wikipedia data but cropped at different
log-probability floors. The full model's lower floor keeps more low-probability
entries, which is why it is larger:

| | Lite | Full |
|---|---|---|
| Log-probability floor | -12 | -15 |
| Disk size (all languages) | ~17 MB | ~89 MB |
| Best for | Most use cases | Maximum accuracy when memory is not a concern |
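
As an illustration of what a floor means, the sketch below drops every entry
whose log-probability falls below a cutoff. The entries are made up, and natural
logarithms are an assumption: this README does not state the log base or the
exact cropping rule used in training.

```python
import math


def crop(table, floor):
    """Keep only entries whose log-probability is at or above the floor."""
    return {key: logp for key, logp in table.items() if logp >= floor}


# Made-up ngram probabilities, converted to log-probabilities.
table = {
    "the": math.log(2e-2),  # about -3.9
    "qua": math.log(1e-4),  # about -9.2
    "zyx": math.log(1e-6),  # about -13.8
    "qqq": math.log(1e-8),  # about -18.4
}

lite = crop(table, -12)  # drops "zyx" and "qqq"
full = crop(table, -15)  # drops only "qqq"
```

A lower floor retains rarer entries, trading size for coverage.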

By default, `Model.load()` auto-discovers which model variant is available,
preferring the full model. To force a variant:

```python
model = Model.load_lite(languages)   # smaller, sufficient for most use cases
model = Model.load_full(languages)   # higher accuracy, more memory
```

### Getting the full model

The lite model is sufficient for most use cases. If you want the full model
for maximum accuracy, install the companion package:

```bash
pip install "langidentify[full]"
```

This installs the `langidentify-full-model` package, which provides the full
model data. Once installed, `Model.load()` will automatically prefer the full
model, or you can request it explicitly:

```python
model = Model.load_full(languages)
```

## CJK detection

Chinese/Japanese disambiguation is handled by the
[cjclassifier](https://pypi.org/project/cjclassifier/) package, which is
installed automatically as a dependency. Korean uses the distinct Hangul script
and is identified by alphabet alone.
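
The idea of identifying a language from its script alone can be sketched with
the standard library. This is an illustration only, not how LangIdentify or
cjclassifier is implemented; it leans on the convention that the first word of a
character's Unicode name is usually its script.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script name among the letters in text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "HANGUL SYLLABLE AN" -> "HANGUL"
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Hangul text maps to `"HANGUL"`, which is enough to conclude Korean. Han
ideographs map to `"CJK"`, a script shared by Chinese and Japanese, which is why
that pair needs a statistical classifier rather than a script check.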

## Thread safety

`Model` caches loaded data in a module-level dict protected by a lock, so a
loaded model can be shared across threads. `Detector` is lightweight to
construct and intentionally **not thread-safe**. For concurrent detection, use
a separate `Detector` instance per thread:

```python
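import threading

from langidentify import Detector, Language, Model

languages = Language.from_comma_separated("en,fr,de,es,it")
model = Model.load(languages)  # shared, thread-safe

local = threading.local()

def get_detector():
    # Lazily create one Detector per thread.
    if not hasattr(local, "detector"):
        local.detector = Detector(model)
    return local.detector

def detect_language(text):
    # Call from any thread; each thread uses its own Detector.
    return get_detector().detect(text)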
```
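
The same thread-local pattern works with a thread pool. In the sketch below,
`detect_all` and `make_detector` are hypothetical names: `make_detector` is a
zero-argument factory (in practice something like `lambda: Detector(model)`),
passed in so the helper does not depend on how the detector is built.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def detect_all(texts, make_detector, max_workers=4):
    """Detect each text in a pool, creating one detector per worker thread."""
    local = threading.local()

    def detect_one(text):
        # Each worker thread builds its own detector on first use.
        if not hasattr(local, "detector"):
            local.detector = make_detector()
        return local.detector.detect(text)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(detect_one, texts))
```

Each worker constructs at most one detector, so construction cost is bounded by
`max_workers` rather than by the number of texts.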

## Requirements

- Python 3.9+
- [cjclassifier](https://pypi.org/project/cjclassifier/) (installed automatically)

## License

Apache License 2.0 -- see [LICENSE](LICENSE).

The bundled models contain statistical parameters derived from Wikipedia text.
The models do not contain or reproduce Wikipedia text.
