Metadata-Version: 2.4
Name: text-processing-utils
Version: 0.0.0
Summary: Utility functions for natural language processing (NLP)
Author: Jan Göpfert
License-Expression: MIT
Project-URL: Homepage, https://github.com/FZJ-IEK3-VSA/text-processing-utils
Project-URL: Issues, https://github.com/FZJ-IEK3-VSA/text-processing-utils/issues
Keywords: char offsets,BIO tags,tagging,sequence labelling,sequence labeling,NER,named entity recognition,information extraction,nlp,utilities
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Natural Language :: English
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools
Requires-Dist: pip
Requires-Dist: spacy
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Dynamic: license-file

<a href="https://www.fz-juelich.de/en/ice/ice-2"><img src="https://github.com/FZJ-IEK3-VSA/README_assets/blob/main/JSA-Header.svg?raw=True" alt="Forschungszentrum Juelich Logo" width="175px"></a>

# Text processing utils

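Small helper functions for natural language processing (NLP) tasks such as batching, character-offset handling, BIO tag conversion, and sentence processing.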


## Installation

Create and activate a virtual environment.<br>
Then, install the package via pip and download the spaCy pipeline.

```bash
pip install text-processing-utils
python3 -m spacy download en_core_web_md
```


## Contents
* batches
    * get_batches_of_strict_size_with_remainder
    * get_n_batches
    * get_batches_of_roughly_equal_size
* boolean_checks
    * is_gibberish
    * is_plural
    * an_vs_a
* char_offsets
    * is_inside
    * get_span_distance_sorted
    * get_span_distance
    * remove_whitespace_from_annotation
    * merge_annotation_offsets
* bio_tags
    * bio_tags_to_spans
    * remove_overlapping_bio_tags
    * transform_into_char_offsets_and_readable_tag
    * token_spans_to_char_annotations
* locate
    * get_sent_idx
    * locate_span_in_context
* highlight_context
    * enclose_with_special_symbol
* sentences
    * lower_first_letter_if_sent_start
    * correct_sentence_boundary_detection
* regex
    * make_named_group_unique
* types
    * Offset
    * Annotations
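
To give a feel for the kind of task the `bio_tags` helpers address, here is a minimal, self-contained sketch of converting BIO tags into token spans. It does not import this package, and it is not the package's implementation: the function name `bio_to_spans_sketch`, its signature, and the `(start, end, label)` return format are assumptions made purely for illustration. Check the source code for the actual API of `bio_tags_to_spans`.

```python
from typing import List, Tuple


def bio_to_spans_sketch(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end, label) token spans (end exclusive).

    Hypothetical helper for illustration only; not part of this package.
    """
    spans: List[Tuple[int, int, str]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            # Close the span that was open, if any.
            if label is not None:
                spans.append((start, i, label))
                start, label = None, None
            # Open a new span; a stray I- tag is treated like a B- tag.
            if tag.startswith(("B-", "I-")):
                start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


print(bio_to_spans_sketch(["O", "B-PER", "I-PER", "O", "B-LOC"]))
# prints [(1, 3, 'PER'), (4, 5, 'LOC')]
```

Treating a stray `I-` tag as the start of a new span is one common convention; stricter variants discard such tags instead.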


## About Us

<a href="https://www.fz-juelich.de/en/ice/ice-2"><img src="https://github.com/FZJ-IEK3-VSA/README_assets/blob/main/iek3-square.png?raw=True" alt="Institute image ICE-2" width="280" align="right" style="margin:0px 10px"/></a>

We are the <a href="https://www.fz-juelich.de/en/ice/ice-2">Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis</a> at the <a href="https://www.fz-juelich.de/en">Forschungszentrum Jülich</a>. Our interdisciplinary department's research focuses on energy-related process and systems analyses. We use data searches and system simulations to determine energy and mass balances and to evaluate the performance, emissions, and costs of energy systems. The results feed into comparative assessment studies of the various systems. Our current priorities include developing energy strategies in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

## Acknowledgements

The authors would like to thank the German Federal Government, the German state governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".

<p float="left">
    <a href="https://nfdi4ing.de/"><img src="https://nfdi4ing.de/wp-content/uploads/2018/09/logo.svg" alt="NFDI4Ing Logo" width="130px"></a>&emsp;<a href="https://www.helmholtz.de/en/"><img src="https://www.helmholtz.de/fileadmin/user_upload/05_aktuelles/Marke_Design/logos/HG_LOGO_S_ENG_RGB.jpg" alt="Helmholtz Logo" width="200px"></a>
</p>
