Metadata-Version: 2.4
Name: modest_bauwenst
Version: 2026.3.1
Summary: MoDeST: a Morphological Decomposition & Segmentation Trove.
Author-email: Thomas Bauwens <thomas.bauwens@kuleuven.be>
Maintainer-email: Thomas Bauwens <thomas.bauwens@kuleuven.be>
Project-URL: Source, https://github.com/bauwenst/MoDeST
Project-URL: Issues, https://github.com/bauwenst/MoDeST/issues
Keywords: NLP,segmentation,natural language,decomposition,morphology,datasets
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets
Requires-Dist: bs4
Requires-Dist: selenium
Requires-Dist: webdriver_manager
Requires-Dist: langcodes>=3.5.0
Provides-Extra: all
Requires-Dist: tktkt[all]; extra == "all"
Provides-Extra: github
Dynamic: license-file

<img src="https://raw.githubusercontent.com/bauwenst/MoDeST/v2026.03.01/doc/logo.png" alt="MoDeST">

# MoDeST: a Morphological Decomposition & Segmentation Trove
The point of MoDeST is two-fold:
1. Provide a general object-oriented Python interface to access morphological decompositions and segmentations;
2. Host morphological datasets generated by smaller research groups that would otherwise have a hard time being found.

*Morphological decomposition* is the task of recognising which building blocks a word was originally constructed from. These building blocks are its *morphemes*.
As an example, the Dutch derivation `isometrisch` ("isometric") can be decomposed into the morphemes `iso`, `meter` and `isch`.

*Morphological segmentation* is the task of isolating the substrings of a word that correspond to its morphemes. These substrings are called *morphs*.
In the above example, the segmentation would be `iso/metr/isch`.
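
The difference between the two tasks can be made concrete with a minimal sketch that does not use MoDeST itself: morphs are substrings that concatenate back to the word, whereas morphemes are canonical forms and generally do not. The helper name `is_valid_segmentation` is made up for illustration.

```python
def is_valid_segmentation(word: str, morphs: list[str]) -> bool:
    """A segmentation is valid when its morphs concatenate back to the word."""
    return "".join(morphs) == word


word      = "isometrisch"
morphemes = ["iso", "meter", "isch"]  # decomposition: canonical building blocks
morphs    = ["iso", "metr", "isch"]   # segmentation: substrings of the word

print(is_valid_segmentation(word, morphs))     # True
print(is_valid_segmentation(word, morphemes))  # False: "meter" surfaces as "metr"
```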

```shell
pip install "modest_bauwenst[all]"
```

## Languages and Datasets
Each supported language has its own module under `modest.languages`, so the list is not reproduced here.
The list of datasets roughly coincides with the downloaders under `modest.datasets`. Currently, the package supports:
- CELEX
- MorphyNet
- MorphoChallenge2010
- CompoundPiece

## Example
Yes, it really is this easy, with full type checking and autocompletion by your IDE:
```python
from modest.languages.english import English_Celex, English_MorphoChallenge2010, English_MorphyNet_Inflections

for item in English_Celex().generate():
    print(item.word, "should be segmented as", item.segment(), "which derives from", item.decompose())
```
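
Because every dataset is accessed through the same calls, downstream code can be written once and reused across datasets. The sketch below computes the average number of morphs per word. To stay runnable without downloading anything, it uses hypothetical stand-in classes `ToyItem` and `ToyDataset` that only mimic the `generate()`, `word` and `segment()` calls from the example above. It assumes that `segment()` returns a sequence of morph strings, which may differ from the real return type.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class ToyItem:
    # Hypothetical stand-in for a MoDeST dataset item.
    word: str
    morphs: Tuple[str, ...]

    def segment(self) -> Tuple[str, ...]:
        return self.morphs


class ToyDataset:
    # Hypothetical stand-in for a dataset class such as English_Celex.
    def generate(self) -> Iterator[ToyItem]:
        yield ToyItem("unhelpful", ("un", "help", "ful"))
        yield ToyItem("cats", ("cat", "s"))
        yield ToyItem("tree", ("tree",))


def average_morphs_per_word(dataset) -> float:
    """Average morph count over every item yielded by dataset.generate()."""
    total_morphs = 0
    total_words = 0
    for item in dataset.generate():
        total_morphs += len(item.segment())
        total_words += 1
    return total_morphs / total_words if total_words else 0.0


print(average_morphs_per_word(ToyDataset()))  # 2.0
```

If the assumption about `segment()` holds, passing `English_Celex()` instead of `ToyDataset()` should work unchanged.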

## Repo layout
Currently, the repo looks as follows:
```
data/                ---> Datasets hosted specifically by MoDeST on GitHub. Will NOT be downloaded when you install the package.
src/modest/          ---> All source code for the Python package that will be installed in your interpreter.
    languages/       ---> Per-language definitions of the dataset classes users will interact with.
    datasets/        ---> Support code for pulling in remote data, and reading raw examples from the resulting files.
    formats/         ---> Support code for turning raw examples into objects. Mainly used for parsing tags, whose format is independent of how they are stored (TSV, XML, JSON, ...).
    interfaces/      ---> Declarations of the interfaces users will interact with.
    transformations/ ---> Operations performed on datasets.
```

Currently, every language has its own file under `languages/`. The assumption is that the datasets pertaining to one language
are sufficiently encapsulated that this does not clutter the imports from such a file. There are two arguments in favour of
moving from `languages/{language}.py` to `languages/{language}/{dataset}.py`:
1. Autocompletion after the last `.` of the `import` would suggest exactly the datasets available for that language;
2. You would not need to install every package required to download or build all datasets for a language if you
   only need one. (Realistically, however, since MoDeST is meant for finished datasets rather than for building datasets,
   the code for pulling them in should not be that complicated.)
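
One possible way to get the benefit of the second argument within the current single-file layout is to defer heavy imports until a dataset is actually used. The sketch below shows that general Python pattern, not MoDeST's own code: `require` and `download_with_browser` are made-up names, and the module names passed to `require` are only examples.

```python
import importlib
from types import ModuleType


def require(module_name: str, needed_for: str) -> ModuleType:
    """Import a module only when it is needed, with a readable error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"'{module_name}' must be installed to use {needed_for}."
        ) from e


def download_with_browser(url: str) -> None:
    # Hypothetical downloader: selenium is imported here, not at module level,
    # so users of other datasets never need it installed.
    selenium = require("selenium", needed_for="browser-based downloads")
    ...  # The actual download logic is omitted from this sketch.


# A standard-library module is always importable, so this call succeeds.
json = require("json", needed_for="this demonstration")
print(json.dumps({"ok": True}))
```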
