Metadata-Version: 2.4
Name: sanskrit-tokenizer
Version: 0.1.0
Summary: Sanskrit tokenizer with sandhi splitting for Information Retrieval.
Author: Hemanth HM
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown

# sanskrit-tokenizer

Tokenize Sanskrit text with sandhi splitting for Information Retrieval.

```bash
pip install .
```

## Quick start

```python
from sanskrit_tokenizer import tokenize

tokenize("devālaya")
# ['deva', 'ālaya']

tokenize("धर्म योग")
# ['dharma', 'yoga']

tokenize("dharmakṣetre kurukṣetre")
# ['dharmakṣa', 'itre', 'kurukṣa', 'itre']
```

`tokenize()` normalizes the input to IAST, splits it on whitespace and punctuation, then applies reverse sandhi rules to each word. It accepts both Devanagari and IAST input.
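
The three stages can be pictured as a simple composition. The sketch below is illustrative only and is not the package's implementation: `sketch_tokenize`, its callable parameters, and `TOY_SPLITS` are hypothetical names, and the stand-ins passed in at the bottom exist only so the example runs without the package installed.

```python
from typing import Callable, Dict, List


def sketch_tokenize(
    text: str,
    to_iast: Callable[[str], str],
    split_words: Callable[[str], List[str]],
    split_junction: Callable[[str], List[str]],
) -> List[str]:
    """Hypothetical pipeline: normalize, split into words, undo sandhi."""
    tokens: List[str] = []
    for word in split_words(to_iast(text)):
        # Each word may yield one token (no junction) or several.
        tokens.extend(split_junction(word))
    return tokens


# Toy stand-ins (assumptions, not the real rules).
TOY_SPLITS: Dict[str, List[str]] = {"devālaya": ["deva", "ālaya"]}

demo_tokens = sketch_tokenize(
    "devālaya dharma",
    to_iast=lambda s: s,                 # input is already IAST here
    split_words=str.split,               # whitespace only, for brevity
    split_junction=lambda w: TOY_SPLITS.get(w, [w]),
)
print(demo_tokens)  # ['deva', 'ālaya', 'dharma']
```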

## Sandhi splitting

```python
from sanskrit_tokenizer.sandhi import split_sandhi

split_sandhi("devālaya")   # savarṇa-dīrgha: ā → a + ā
# ['deva', 'ālaya']

split_sandhi("dharma")     # no junction found
# ['dharma']
```

The rule-based engine covers vowel sandhi (savarṇa-dīrgha, guṇa, vṛddhi, yaṇ, ayādi), consonant sandhi (voicing, nasals, t-combinations), and visarga sandhi. When a word can be split in more than one way, it uses a longest-match heuristic to choose.
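
To make the idea of a reverse rule concrete, here is a minimal sketch of undoing savarṇa-dīrgha alone. It rests on two assumptions that the package may not share: candidate splits are validated against a small word list (`lexicon`, hypothetical), and "longest match" is read as preferring the candidate with the longest left part. The input is expected in IAST.

```python
import unicodedata
from typing import List, Set, Tuple

# A long vowel can arise from any short/long pairing of the same vowel.
LONG_TO_SHORT = {"ā": "a", "ī": "i", "ū": "u"}


def sketch_split_savarna_dirgha(word: str, lexicon: Set[str]) -> List[str]:
    """Hypothetical reverse savarṇa-dīrgha rule; not the package's engine."""
    # NFC keeps each long vowel as a single code point.
    word = unicodedata.normalize("NFC", word)
    candidates: List[Tuple[str, str]] = []
    for i, ch in enumerate(word):
        short = LONG_TO_SHORT.get(ch)
        if short is None:
            continue
        # Try a+a, a+ā, ā+a, ā+ā (and likewise for i/ī, u/ū).
        for left_vowel in (short, ch):
            for right_vowel in (short, ch):
                left = word[:i] + left_vowel
                right = right_vowel + word[i + 1:]
                if left in lexicon and right in lexicon:
                    candidates.append((left, right))
    if not candidates:
        return [word]  # no junction found
    # Assumed reading of "longest match": longest left part wins.
    left, right = max(candidates, key=lambda pair: len(pair[0]))
    return [left, right]


toy_lexicon = {"deva", "ālaya", "dharma"}
print(sketch_split_savarna_dirgha("devālaya", toy_lexicon))  # ['deva', 'ālaya']
print(sketch_split_savarna_dirgha("dharma", toy_lexicon))    # ['dharma']
```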

## Transliteration

```python
from sanskrit_tokenizer.transliterate import (
    devanagari_to_iast,
    iast_to_devanagari,
    is_devanagari,
)

devanagari_to_iast("भगवद्गीता")
# 'bhagavadgītā'

iast_to_devanagari("rāmāyaṇam")
# 'रामायणम्'

is_devanagari("धर्म")
# True
```
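
For readers unfamiliar with the script, the sketch below shows the core of Devanagari to IAST conversion: every consonant carries an inherent `a` unless a vowel sign or virama follows. It is a simplified illustration, not the package's code; the `sketch_*` names are hypothetical and the tables leave out rarer letters, digits, and special signs.

```python
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "ङ": "ṅ",
    "च": "c", "छ": "ch", "ज": "j", "झ": "jh", "ञ": "ñ",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh", "ण": "ṇ",
    "त": "t", "थ": "th", "द": "d", "ध": "dh", "न": "n",
    "प": "p", "फ": "ph", "ब": "b", "भ": "bh", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "ś", "ष": "ṣ", "स": "s", "ह": "h",
}
# Independent vowels plus anusvara (U+0902) and visarga (U+0903).
INDEPENDENT = {
    "अ": "a", "आ": "ā", "इ": "i", "ई": "ī", "उ": "u", "ऊ": "ū",
    "ऋ": "ṛ", "ए": "e", "ऐ": "ai", "ओ": "o", "औ": "au",
    "\u0902": "ṃ", "\u0903": "ḥ",
}
# Dependent vowel signs (matras), written as escapes for clarity.
MATRAS = {
    "\u093e": "ā", "\u093f": "i", "\u0940": "ī", "\u0941": "u",
    "\u0942": "ū", "\u0943": "ṛ", "\u0947": "e", "\u0948": "ai",
    "\u094b": "o", "\u094c": "au",
}
VIRAMA = "\u094d"


def sketch_is_devanagari(text: str) -> bool:
    """True if any character lies in the Devanagari block (U+0900..U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def sketch_devanagari_to_iast(text: str) -> str:
    out = []
    pending_a = False  # the previous consonant still owes its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in MATRAS:
            out.append(MATRAS[ch])  # replaces the inherent vowel
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False       # suppresses the inherent vowel
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEPENDENT.get(ch, ch))  # pass unknowns through
    if pending_a:
        out.append("a")
    return "".join(out)


print(sketch_devanagari_to_iast("धर्म योग"))  # dharma yoga
```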

## Word-level tokenization

```python
from sanskrit_tokenizer.tokenizer import tokenize_words

tokenize_words("devālaya namaḥ")
# ['devālaya', 'namaḥ']
```

`tokenize_words()` splits on whitespace and punctuation only; it performs no sandhi splitting.
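
A word-level splitter of this kind can be approximated with a single regular expression. The sketch below is an assumption about the behaviour, not the package's code: `sketch_tokenize_words` is a hypothetical name, and the separator set (whitespace, common ASCII punctuation, and the danda signs `।` and `॥`) is a guess at what counts as punctuation.

```python
import re
from typing import List

# Whitespace, a few ASCII punctuation marks, danda (U+0964), double danda (U+0965).
_SEPARATORS = re.compile(r"[\s.,;:!?|\u0964\u0965]+")


def sketch_tokenize_words(text: str) -> List[str]:
    """Split on separators and drop the empty strings left at the edges."""
    return [word for word in _SEPARATORS.split(text) if word]


print(sketch_tokenize_words("devālaya namaḥ"))      # ['devālaya', 'namaḥ']
print(sketch_tokenize_words("dharmaḥ । yogaḥ ॥"))   # ['dharmaḥ', 'yogaḥ']
```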

## CLI

```bash
sanskrit-tokenize "devālaya"
# deva
# ālaya

echo "धर्म योग" | sanskrit-tokenize
# dharma
# yoga

sanskrit-tokenize --no-sandhi "devālaya"
# devālaya

sanskrit-tokenize -s " " "dharma yoga"
# dharma yoga
```

- `--no-sandhi`: word-level only, skip sandhi splitting
- `-s SEP`, `--separator SEP`: output separator (default: newline)
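
For orientation, an interface with these options could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the package's actual entry point: `build_parser` is a hypothetical name, and the optional positional argument reflects the examples above, where text comes either from the command line or from stdin.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="sanskrit-tokenize")
    parser.add_argument(
        "text",
        nargs="?",  # when omitted, a real CLI would read from stdin
        help="text to tokenize",
    )
    parser.add_argument(
        "--no-sandhi",
        action="store_true",
        help="word-level only, skip sandhi splitting",
    )
    parser.add_argument(
        "-s", "--separator",
        default="\n",
        help="output separator (default: newline)",
    )
    return parser


args = build_parser().parse_args(["-s", " ", "dharma yoga"])
print(repr(args.separator), args.no_sandhi, args.text)
```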

## License

MIT © [Hemanth.HM](https://h3manth.com)
