Metadata-Version: 2.4
Name: lang-limes
Version: 0.1.4
Summary: Linguistic Identification of Morphosyntactic and Expressive Snags
Project-URL: documentation, https://deepsight.codeberg.page/LIMES/
Project-URL: source, https://codeberg.org/deepsight/LIMES/
Author-email: deepsight GmbH <hi@deepsight.de>
Maintainer-email: Jannik Schmitt <jannik.schmitt@deepsight.de>
License-Expression: Apache-2.0
License-File: LICENSES/Apache-2.0.txt
Requires-Python: >=3.12
Requires-Dist: marisa-trie>=1.3.1
Requires-Dist: numpy>=2.4.1
Requires-Dist: pydantic>=2.12.5
Requires-Dist: spacy>=3.8.11
Description-Content-Type: text/markdown

# LIMES: Linguistic Identification of Morphosyntactic and Expressive Snags
LIMES is a library for analyzing the linguistic complexity of texts. The goal
of this project is to create a tool that provides actionable insights on how to
make written texts easier to comprehend.

> Refer to [the project documentation](https://deepsight.codeberg.page/LIMES/)
> for in-depth information about concepts, API, and more.

Please note that the actual logic for identifying language barriers is
completely language-specific. Because it is a lot of work to develop these
heuristics, the library currently only ships with implemented analyzers for
**German** texts. However, we encourage you to **build your own analyzers**
based on the provided class templates, either for your own use or to
contribute to the project.

## Installation
You can install this package via pip by running:

```bash
pip install lang-limes
```

### Additional Dependencies
The library requires a parser. Currently, we only ship a parser based on
[spaCy's](https://spacy.io/) excellent NLP pipeline. This means that you need to
[install a spaCy model](https://spacy.io/usage/models/) that supports the
language you are working with.
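
spaCy models are distributed as regular Python packages, so one way to check up
front whether a model is available is to look for an importable package of that
name. The helper below is a minimal sketch using only the standard library;
`is_model_installed` is a hypothetical name and not part of LIMES or spaCy.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package called `model_name` is importable."""
    # Models installed via pip or `python -m spacy download <name>` show up
    # as top-level packages, e.g. `de_core_news_sm`.
    return importlib.util.find_spec(model_name) is not None


if not is_model_installed("de_core_news_sm"):
    print("Model missing. Try: python -m spacy download de_core_news_sm")
```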

## Example Usage
You must use a string container to wrap the text you want to analyze. As our
analyses work on a sentence level, you can either manually sentencize and create
separate [Sentence](https://deepsight.codeberg.page/LIMES/api/sentence.html)
objects or just throw your whole text into a
[Text](https://deepsight.codeberg.page/LIMES/api/text.html) object that takes
care of sentencization for you.

We will do the latter for the purpose of this example.

```python
from limes import Text
from limes.parsers import SpacyParser
from limes.analyzers.de import GermanAnalyzer

analyzer = GermanAnalyzer()

# You can also pass a spacy NLP object instead of the model name.
# Make sure the model you want to use is installed.
parser = SpacyParser(model="de_core_news_sm")

text = Text(
    raw="Das hier ist ein Text. Dieser Text hat mehrere Sätze.",
    analyzer=analyzer,
    parser=parser,
)
```

For parsing, we recommend using the most powerful model that your system can
reasonably run, as barrier detection is very sensitive to errors during
morphosyntactic analysis. We've used the small model in the example above, but
if you can, try using the transformer-based `de_dep_news_trf` for improved
results at the cost of compute time.

### Identifying Barriers
Barriers are detected lazily, and results are cached to avoid redundant
computations. Barriers themselves are a `property` of the Text object.

```python
# You can iterate over all barriers in the entire text if you want.
for barrier in text.barriers:
    print(barrier.title)
    # Print the actual string of the token.
    print(barrier.affected_tokens)
    # Print the position of the token in the source text.
    if barrier.affected_tokens is not None:
        print([token.i for token in barrier.affected_tokens])

# You can also iterate over each sentence.
for sentence in text:
    print(sentence.barriers)

# Alternatively, you can also inspect a specific sentence by index.
print(text[1].barriers)
```

Please note that barriers are also language-specific, because languages differ
in what makes comprehension "difficult".
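
To illustrate what "lazy and cached" means in practice, here is a minimal sketch
built on Python's `functools.cached_property`. It is not LIMES' actual
implementation; `LazyText`, `_detect_barriers`, and the toy long-word rule are
placeholders invented for this example.

```python
from functools import cached_property


class LazyText:
    """Toy container whose barriers are computed on first access only."""

    def __init__(self, raw: str) -> None:
        self.raw = raw
        self.detection_runs = 0  # Counts how often detection actually ran.

    def _detect_barriers(self) -> list[str]:
        # Placeholder heuristic: flag words longer than 12 characters.
        self.detection_runs += 1
        return [word for word in self.raw.split() if len(word) > 12]

    @cached_property
    def barriers(self) -> list[str]:
        # Runs on first access; later accesses return the stored list.
        return self._detect_barriers()


demo = LazyText("Die Donaudampfschifffahrt ist schön")
print(demo.detection_runs)  # Nothing has been computed yet.
print(demo.barriers)        # First access triggers detection.
print(demo.barriers)        # Second access reuses the cached result.
print(demo.detection_runs)  # Detection ran exactly once.
```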

### Calculating Complexities
There are multiple ways in which you can try to approximate language complexity
(see our documentation for more information).

```python
from limes import ComplexityAlgorithm

# Get the average complexity of the text. You can manually set the heuristic.
avg_complexity = text.average_complexity(
    heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
)
print(avg_complexity)

# Alternatively, you can get phrase-level complexities.
# These are also lazily computed and cached.
for phrase, complexity in text.local_complexities:
    print(phrase)
    print(complexity)

# You could also iterate over all sentences in the text and get each sentence's
# global complexity.
for sentence in text:
    complexity = sentence.global_complexity(
        heuristic=ComplexityAlgorithm.AGGREGATED_LOCAL,
    )
    print(sentence)
    print(complexity)
```

### Finding Words
If you want to find specific tokens in a `Text` or `Sentence` (e.g. if you have a list of words that are allowlisted or denylisted for purposes of language simplification), you can use LIMES to perform a more refined search than simple string matching.
You can search for tokens by providing the word and its Part-of-Speech tag.
This allows more robust word identification, as we can normalize capitalization and decrease the risk of matching homographs.

The example below shows a case where LIMES locates words more reliably than naive string matching.

```python
text = Text(
    raw="ich esse gerne essen.",
    analyzer=analyzer,
    parser=parser,
)

results = text.find("Essen", pos="NOUN")
print(results) # Will print reference to last token in sequence.

results = text.find("essen", pos="VERB")
print(results)  # Will print reference to second token in sequence ("esse").
```

Please note that incorrectly cased input texts usually decrease the accuracy of automated text parsing tools, meaning that even with LIMES' word matching, all-lowercase texts may still yield inaccurate results.
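
The difference from plain string matching can be sketched without LIMES at all.
The snippet below assumes each token carries a lemma and a Part-of-Speech tag
(written out by hand here, not produced by a parser) and compares a naive string
search with a lookup on the case-normalized lemma plus POS tag. `Token` and
`find_tokens` are illustrative names and do not reflect the library's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    i: int      # Position in the sentence.
    text: str   # Surface form as written.
    lemma: str  # Dictionary form (assumed annotation).
    pos: str    # Part-of-Speech tag (assumed annotation).


# Hand-annotated tokens for "ich esse gerne essen.".
tokens = [
    Token(0, "ich", "ich", "PRON"),
    Token(1, "esse", "essen", "VERB"),
    Token(2, "gerne", "gerne", "ADV"),
    Token(3, "essen", "Essen", "NOUN"),
    Token(4, ".", ".", "PUNCT"),
]


def find_tokens(tokens: list[Token], word: str, pos: str) -> list[Token]:
    """Match on the lowercased lemma and the POS tag."""
    return [
        t for t in tokens
        if t.lemma.lower() == word.lower() and t.pos == pos
    ]


# Exact string comparison misses the lowercased noun entirely.
naive = [t.i for t in tokens if t.text == "Essen"]
# Lemma plus POS separates the noun from the inflected verb.
nouns = [t.i for t in find_tokens(tokens, "Essen", "NOUN")]
verbs = [t.i for t in find_tokens(tokens, "essen", "VERB")]
print(naive, nouns, verbs)
```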

## Next Steps
A good place to start is to
[get an overview of the concepts](https://deepsight.codeberg.page/LIMES/concepts/overview.html)
used to build and configure the whole processing pipeline.

## Currently Supported Languages
|Language|Contributors|
|--------|------------|
|DE|Katja Grosch & Susanne Wagner (IFTO GmbH), Jannik Schmitt (deepsight GmbH)|

## Additional Resources
### Word Frequency Lists
#### German
The frequency list for German words was kindly provided by [Projekt Deutscher Wortschatz](https://wortschatz-leipzig.de/de)
of the Universität Leipzig. The unprocessed list included in this repository
(`data/deu_words_2024.txt`) is based on [1]. Please note that it is not based on
the publicly available "Normgrößenkorpora" but was provided on request by the
Leipzig Corpora team under a **CC BY 4.0 license**.

## References
<a id="1">[1]</a>
Leipzig Corpora Collection (2024).
*German news corpus based on material from 2024.*
Leipzig Corpora Collection. Dataset.
https://corpora.uni-leipzig.de/en?corpusId=deu_news_2024
