Metadata-Version: 2.4
Name: lang-limes
Version: 0.1.3
Summary: Linguistic Identification of Morphosyntactic and Expressive Snags
Project-URL: documentation, https://deepsight.codeberg.page/LIMES/
Project-URL: source, https://codeberg.org/deepsight/LIMES/
Author-email: deepsight GmbH <hi@deepsight.de>
Maintainer-email: Jannik Schmitt <jannik.schmitt@deepsight.de>
License-Expression: Apache-2.0
License-File: LICENSES/Apache-2.0.txt
Requires-Python: >=3.12
Requires-Dist: marisa-trie>=1.3.1
Requires-Dist: numpy>=2.4.1
Requires-Dist: pydantic>=2.12.5
Requires-Dist: spacy>=3.8.11
Description-Content-Type: text/markdown

# LIMES: Linguistic Identification of Morphosyntactic and Expressive Snags
LIMES is a library for performing linguistic analyses on provided texts
regarding their complexity. The goal of this project is to create a tool that
provides actionable insights on how to make written texts easier to comprehend.

> Refer to [the project documentation](https://deepsight.codeberg.page/LIMES/)
> for in-depth information about concepts, API, and more.

Please note that the actual logic for identifying language barriers is
completely language-specific. Because it is a lot of work to develop these
heuristics, the library currently only ships with implemented analyzers for
**German** texts. However, we encourage you to **build your own analyzers**
based on the provided class templates, either for your own use or to
contribute to the project.

## Installation
You can install this package via pip by running:

```bash
pip install lang-limes
```

### Additional Dependencies
The library requires use of a Parser. Currently, we only ship a parser based on
[spaCy's](https://spacy.io/) excellent NLP pipeline. This means that you need to
[install a spaCy model](https://spacy.io/usage/models/) that supports the
language you are working with.

## Example Usage
You must use a string container to wrap the text you want to analyze. As our
analysis work on a sentence level, you can either manually sentencize and create
separate [Sentence](https://deepsight.codeberg.page/LIMES/api/sentence.html)
objects or just throw your whole text into a
[Text](https://deepsight.codeberg.page/LIMES/api/text.html) object that takes
care of sentencization for you.

We will do the latter for the purpose of this example.

```python
from limes import Text
from limes.parsers.spacy_parser import SpacyParser
from limes.analyzers.de import GermanAnalyzer

analyzer = GermanAnalyzer()

# You can also pass a spacy NLP object instead of the model name.
# Make sure the model you want to use is installed.
parser = SpacyParser(model="de_core_news_sm")

text = Text(
    raw="Das hier ist ein Text. Dieser Text hat mehrere Sätze.",
    analyzer=analyzer,
    parser=parser,
)
```

For parsing, we recommend using the most powerful model that your system can
reasonably run, as barrier detection is very sensitive to errors during morpho-
syntactic analysis. We've used the small model in the example above but if you
can, try using the transformer-based `de-dep-news-trf` for improved results at
the cost of compute time.

### Identifying Barriers
Barriers are detected lazily, and results are cached to avoid redundant
computations. Barriers themselves are a `property` of the Text object.

```python
# You can iterate over the all barriers in the entire text if you want.
for barrier in text.barriers:
    print(barrier.title)
    # Print the actual string of the token.
    print(barrier.affected_tokens)
    # Print the position of the token in the source text.
    if barrier.affected_tokens is not None:
        print([token.i for token in barrier.affected_tokens])

# You can also iterate over each sentence.
for sentence in text:
    print(sentence.barriers)

# Alternatively, you can also inspect a specific sentence by index.
print(text[1].barriers)
```

Please note that barriers are also language-specific (because different
languages also differ in how they make comprehension "difficult").

### Calculating Complexities
There are multiple ways in which you can try to approximate language complexity
(see our documentation for more information).

```python
from limes import ComplexityAlgorithm

# Get the average complexity of the text. You can manually set the heuristic.
avg_complexity = text.average_complexity(
    heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
)
print(avg_complexity)

# Alternatively, you can get phrase-level complexities.
# These are also lazily computed and cached.
for phrase, complexity in text.local_complexities:
    print(phrase)
    print(complexity)

# You could also iterate over all sentences in the text and get each sentence's
# global complexity.
for sentence in text:
    complexity = sentence.global_complexity(
        heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
    )
    print(sentence)
    print(complexity)
```

## Next Steps
A good place to start is to
[get an overview of the concepts](https://deepsight.codeberg.page/LIMES/concepts/overview.html)
used to build and configure the whole processing pipeline.

## Currently Supported Languages
|Language|Contributors|
|--------|------------|
|DE|Katja Grosch & Susanne Wagner (IFTO GmbH), Jannik Schmitt (deepsight GmbH)|

## Additional Resources
### Word Frequency Lists
#### German
The frequency list for German words was kindly provided by [Projekt Deutscher Wortschatz](https://wortschatz-leipzig.de/de)
of the Universität Leipzig. The unprocessed list included in this repository
(`data/deu_words_2024.txt`) is based on [1]. Please note that it is not based on
the publicly available "Normgrößenkorpora" but was provided on request by the
Leipzig Corpora team under a **CC BY 4.0 license**.

## References
<a id="1">[1]</a>
Leipzig Corpora Collection (2024).
*German news corpus based on material from 2024.*
Leipzig Corpora Collection. Dataset.
https://corpora.uni-leipzig.de/en?corpusId=deu_news_2024
