Metadata-Version: 2.4
Name: lang-limes
Version: 0.1.0
Summary: Linguistic Identification of Morphosyntactic and Expressive Snags
Project-URL: documentation, https://deepsight.codeberg.page/LIMES/
Project-URL: source, https://codeberg.org/deepsight/LIMES/
Author-email: deepsight GmbH <hi@deepsight.de>
Maintainer-email: Jannik Schmitt <jannik.schmitt@deepsight.de>
License-Expression: Apache-2.0
License-File: LICENSES/Apache-2.0.txt
Requires-Python: >=3.12
Requires-Dist: numpy>=2.2.5
Requires-Dist: pydantic>=2.11.3
Requires-Dist: spacy>=3.8.5
Description-Content-Type: text/markdown

# LIMES: Linguistic Identification of Morphosyntactic and Expressive Snags
LIMES is a library for performing linguistic analyses on provided texts
regarding their complexity. The goal of this project is to create a tool that
provides actionable insights on how to make written texts easier to comprehend.

> Refer to [the project documentation](https://deepsight.codeberg.page/LIMES/)
> for in-depth information about concepts, API, and more.

Please note that the actual logic for identifying language barriers is
completely language-specific. Because it is a lot of work to develop these
heuristics, the library currently only ships with implemented analyzers for
**German** texts. However, we encourage you to **build your own analyzers**
based on the provided class templates, either for your own use or to
contribute to the project.

## Installation
Once the package is published, you will be able to install it via pip by
running:

```bash
pip install lang-limes
```

> The package is not published yet, so installing via pip does not work at the
> moment. If you can't wait to try it out for yourself, try building from
> source using [uv](https://docs.astral.sh/uv/).

### Additional Dependencies
The library requires a parser. Currently, we only ship a parser based on
[spaCy's](https://spacy.io/) excellent NLP pipeline. This means that you need to
[install a spaCy model](https://spacy.io/usage/models/) that supports the
language you are working with.
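
Because spaCy models are installed as regular Python packages, you can check
whether a model is available before constructing a parser. The following is a
minimal sketch using only the standard library; the helper name
`is_model_installed` is our own and not part of LIMES or spaCy.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a Python package with this name can be imported."""
    # spaCy models are distributed as packages, so a missing package
    # means the model has not been installed in this environment.
    return importlib.util.find_spec(model_name) is not None


model_name = "de_core_news_sm"
if is_model_installed(model_name):
    print(f"{model_name} is installed.")
else:
    print(f"{model_name} is missing. Try: python -m spacy download {model_name}")
```

This check only confirms that the package exists; it does not verify that the
model version is compatible with your installed spaCy version.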

## Example Usage
You must wrap the text you want to analyze in a string container. As our
analyses work at the sentence level, you can either sentencize manually and
create separate [Sentence](https://deepsight.codeberg.page/LIMES/api/sentence.html)
objects, or just pass your whole text to a
[Text](https://deepsight.codeberg.page/LIMES/api/text.html) object that takes
care of sentencization for you.

We will do the latter for the purpose of this example.

```python
from limes import Text, SpacyParser, GermanAnalyzer

analyzer = GermanAnalyzer()

# You can also pass a spacy NLP object instead of the model name.
# Make sure the model you want to use is installed.
parser = SpacyParser(model="de_core_news_sm")

text = Text(
    raw="Das hier ist ein Text. Dieser Text hat mehrere Sätze.",
    analyzer=analyzer,
    parser=parser,
)
```

### Identifying Barriers
Barriers are detected lazily, and results are cached to avoid redundant
computations. Barriers themselves are a `property` of the Text object.

```python
# You can iterate over all barriers in the entire text if you want.
for barrier in text.barriers:
    print(barrier.title)
    # Print the tokens affected by the barrier.
    print(barrier.affected_tokens)
    # Print the positions of these tokens in the source text.
    print([token.i for token in barrier.affected_tokens])

# You can also iterate over each sentence.
for sentence in text:
    print(sentence.barriers)

# Alternatively, you can also inspect a specific sentence by index.
print(text[1].barriers)
```
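
To illustrate what "detected lazily and cached" means in practice, here is a
minimal, self-contained sketch built on `functools.cached_property`. The class
`LazyText` and its toy detection rule are hypothetical stand-ins for
illustration only; they do not show how LIMES implements its `Text` class.

```python
from functools import cached_property


class LazyText:
    """Hypothetical stand-in for a text container (not the LIMES Text class)."""

    def __init__(self, raw: str) -> None:
        self.raw = raw
        self.detection_runs = 0  # counts how often the expensive step runs

    @cached_property
    def barriers(self) -> list[str]:
        # Runs on first access only; the result is then stored on the instance.
        self.detection_runs += 1
        # Toy rule for illustration: flag words longer than 12 characters.
        return [word for word in self.raw.split() if len(word) > 12]


demo = LazyText("Die Donaudampfschifffahrtsgesellschaft ist ein langes Wort.")
print(demo.detection_runs)  # nothing has been computed yet
print(demo.barriers)        # first access triggers the detection
print(demo.barriers)        # second access reuses the cached result
print(demo.detection_runs)  # the detection ran exactly once
```

The practical consequence is that accessing barriers repeatedly is cheap, while
the first access pays the full analysis cost.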

Please note that barriers are also language-specific, because languages differ
in what makes a text "difficult" to comprehend.

### Calculating Complexities
There are multiple ways in which you can try to approximate language complexity
(see our documentation for more information).

```python
from limes import ComplexityAlgorithm

# Get the average complexity of the text. You can manually set the heuristic.
avg_complexity = text.average_complexity(
    heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
)
print(avg_complexity)

# Alternatively, you can get phrase-level complexities.
# These are also lazily computed and cached.
for phrase, complexity in text.local_complexities:
    print(phrase)
    print(complexity)

# You could also iterate over all sentences in the text and get each sentence's
# global complexity.
for sentence in text:
    complexity = sentence.global_complexity(
        heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
    )
    print(sentence)
    print(complexity)
```
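
Phrase-level results are often most useful when ranked, so that the hardest
passages can be revised first. The sketch below assumes that complexities are
numeric and come as `(phrase, complexity)` pairs, as in the loop above. It uses
plain tuples with made-up scores as stand-in data, and the helper
`hardest_phrases` is our own illustration, not part of the LIMES API.

```python
# Stand-in data: (phrase, complexity) pairs with made-up scores.
local_complexities = [
    ("Das hier", 0.2),
    ("ist ein Text", 0.5),
    ("Dieser Text", 0.3),
    ("hat mehrere Sätze", 0.8),
]


def hardest_phrases(pairs, top_n=2):
    """Return the top_n pairs with the highest complexity, hardest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top_n]


for phrase, complexity in hardest_phrases(local_complexities):
    print(f"{complexity:.2f}  {phrase}")
```

With real data, you would pass `text.local_complexities` instead of the
stand-in list, provided the scores can be compared numerically.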

## Next Steps
A good place to start is to
[get an overview of the concepts](https://deepsight.codeberg.page/LIMES/concepts/overview.html)
used to build and configure the whole processing pipeline.

## Currently Supported Languages
|Language|Contributors|
|--------|------------|
|DE|Katja Grosch & Susanne Wagner (IFTO GmbH), Jannik Schmitt (deepsight GmbH)|

## Additional Resources
### Word Frequency Lists
#### German
The frequency list for German words was kindly provided by [Projekt Deutscher Wortschatz](https://wortschatz-leipzig.de/de)
of the Universität Leipzig. The unprocessed list included in this repository
(`data/deu_words_2024.txt`) is based on [1]. Please note that it is not based on
the publicly available "Normgrößenkorpora" but was provided on request by the
Leipzig Corpora team under a **CC BY 4.0 license**.

## References
<a id="1">[1]</a>
Leipzig Corpora Collection (2024).
*German news corpus based on material from 2024.*
Leipzig Corpora Collection. Dataset.
https://corpora.uni-leipzig.de/en?corpusId=deu_news_2024
