Metadata-Version: 2.4
Name: pelican_nlp
Version: 0.3.24
Summary: Preprocessing and Extraction of Linguistic Information for Computational Analysis
Author-email: Yves Pauli <yves.pauli@gmail.com>
License-Expression: CC-BY-NC-4.0
Project-URL: Homepage, https://github.com/ypauli/pelican_nlp
Project-URL: Repository, https://github.com/ypauli/pelican_nlp
Project-URL: Documentation, https://github.com/ypauli/pelican_nlp#readme
Project-URL: Bug Tracker, https://github.com/ypauli/pelican_nlp/issues
Keywords: nlp,linguistics,preprocessing,language-processing,text-analysis
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Natural Language :: English
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/x-rst
License-File: LICENSE
Requires-Dist: numpy==2.0.1
Requires-Dist: pandas==2.2.3
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: torch==2.5.1
Requires-Dist: spacy==3.8.2
Requires-Dist: transformers==4.49.0
Requires-Dist: docx2txt>=0.9
Requires-Dist: striprtf>=0.0.28
Requires-Dist: chardet>=4.0.0
Requires-Dist: scikit_learn>=1.6.1
Requires-Dist: scipy==1.15.2
Requires-Dist: fasttext-wheel==0.9.2
Requires-Dist: matplotlib>=3.10.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: accelerate==1.4.0
Requires-Dist: editdistance>=0.8.1
Requires-Dist: psutil>=6.1.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: pytest>=8.3.4
Requires-Dist: statsmodels>=0.14.4
Requires-Dist: datasets==3.3.2
Requires-Dist: huggingface_hub==0.29.2
Requires-Dist: audiofile>=1.5.1
Requires-Dist: soundfile>=0.13.1
Requires-Dist: opensmile>=2.6.0
Requires-Dist: praat-parselmouth>=0.4.6
Requires-Dist: librosa>=0.10.0
Requires-Dist: pydub>=0.25.1
Requires-Dist: torchaudio>=2.5.1
Requires-Dist: pyannote.audio>=3.1.0
Requires-Dist: uroman>=0.1.0
Requires-Dist: PyPDF2>=3.0.1
Requires-Dist: ortools>=9.10.0
Requires-Dist: sentence-transformers>=2.7.0
Dynamic: license-file

====================================
pelican_nlp
====================================

.. |logo| image:: https://raw.githubusercontent.com/ypauli/pelican_nlp/main/docs/images/pelican_logo.png
    :alt: pelican_nlp Logo
    :width: 200px

+------------+-------------------------------------------------------------------+
| |logo|     | pelican_nlp stands for "Preprocessing and Extraction of Linguistic|
|            | Information for Computational Analysis - Natural Language         |
|            | Processing". This package enables the creation of standardized and|
|            | reproducible language processing pipelines, extracting linguistic |
|            | features from various tasks like discourse, fluency, and image    |
|            | descriptions.                                                     |
+------------+-------------------------------------------------------------------+

.. image:: https://img.shields.io/pypi/v/pelican_nlp.svg
    :target: https://pypi.org/project/pelican_nlp/
    :alt: PyPI version

.. image:: https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg
    :target: https://github.com/ypauli/pelican_nlp/blob/main/LICENSE
    :alt: License CC BY-NC 4.0

.. image:: https://img.shields.io/pypi/pyversions/pelican_nlp.svg
    :target: https://pypi.org/project/pelican_nlp/
    :alt: Supported Python Versions

.. image:: https://img.shields.io/badge/Contributions-Welcome-brightgreen.svg
    :target: https://github.com/ypauli/pelican_nlp/blob/main/CONTRIBUTING.md
    :alt: Contributions Welcome

Installation
============

Create a conda environment:

.. code-block:: bash

    conda create --name pelican-nlp --channel defaults python=3.10

Activate the environment:

.. code-block:: bash

    conda activate pelican-nlp

Install the package using pip:

.. code-block:: bash

    pip install pelican-nlp

Usage
=====

To run ``pelican_nlp``, you need a ``configuration.yml`` file in your main project directory. This file defines the settings and parameters used for your project.

Sample configuration files are available here:
`https://github.com/ypauli/pelican_nlp/tree/main/examples <https://github.com/ypauli/pelican_nlp/tree/main/examples>`_

1. Adapt a sample configuration to your needs.
2. Save your personalized ``configuration.yml`` in the root of your project directory.

Running pelican_nlp
-------------------

You can run ``pelican_nlp`` via the command line or a Python script.

**From the command line**:

Navigate to your project directory (must contain your ``participants/`` folder and ``configuration.yml``), then run:

.. code-block:: bash

    conda activate pelican-nlp
    pelican-run

For best performance, close other programs and avoid running other GPU workloads while language processing is in progress.

Data Format Requirements: LPDS
------------------------------

For reliable operation, your data must follow the *Language Processing Data Structure (LPDS)*, inspired by brain imaging data structures like BIDS.

Main Concepts (Quick Guide)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Project Root**: Contains a ``participants/`` folder plus optional files like ``participants.tsv``, ``dataset_description.json``, and ``README``.
- **Participants**: Each participant has a folder named ``part-<ID>`` (e.g., ``part-01``).
- **Sessions (Optional)**: For longitudinal studies, use ``ses-<ID>`` subfolders inside each participant folder.
- **Tasks/Contexts**: Each session folder (or the participant folder itself, for non-longitudinal studies) contains subfolders for specific tasks (e.g., ``interview``, ``fluency``, ``image-description``).
- **Data Files**: Named with structured metadata, e.g.:
  ``part-01_ses-01_task-fluency_cat-semantic_acq-baseline_transcript.txt``

Filename Structure
~~~~~~~~~~~~~~~~~~

Filenames follow this format::

    part-<id>[_ses-<id>]_task-<label>[_<key>-<value>...][_suffix].<extension>

- **Required Entities**: ``part``, ``task``
- **Optional Entities Examples**: ``ses``, ``cat``, ``acq``, ``proc``, ``metric``, ``model``, ``run``, ``group``, ``param``
- **Suffix Examples**: ``transcript``, ``audio``, ``embeddings``, ``logits``, ``annotations``
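The entity scheme above is regular enough to parse mechanically. The sketch below is a hypothetical helper (not part of the ``pelican_nlp`` API) that splits an LPDS-style filename into its entities, suffix, and extension:

.. code-block:: python

    import re

    # Illustrative pattern for the LPDS filename format:
    # part-<id>[_ses-<id>]_task-<label>[_<key>-<value>...]_<suffix>.<extension>
    LPDS_PATTERN = re.compile(
        r"^part-(?P<part>[^_]+)"           # required participant entity
        r"(?:_ses-(?P<ses>[^_]+))?"        # optional session entity
        r"_task-(?P<task>[^_]+)"           # required task entity
        r"(?P<extras>(?:_[a-z]+-[^_]+)*)"  # optional key-value entities
        r"_(?P<suffix>[^_.]+)"             # suffix, e.g. transcript, audio
        r"\.(?P<ext>\w+)$"                 # file extension
    )

    def parse_lpds_filename(name: str) -> dict:
        """Return a dict of entities for an LPDS filename, or raise ValueError."""
        m = LPDS_PATTERN.match(name)
        if m is None:
            raise ValueError(f"not a valid LPDS filename: {name!r}")
        info = {k: v for k, v in m.groupdict().items() if v and k != "extras"}
        # Split optional pairs such as "_cat-semantic_acq-baseline" into entries.
        for pair in filter(None, m.group("extras").split("_")):
            key, _, value = pair.partition("-")
            info[key] = value
        return info

    print(parse_lpds_filename(
        "part-01_ses-01_task-fluency_cat-semantic_acq-baseline_transcript.txt"
    ))

Running the example prints a dictionary containing ``part``, ``ses``, ``task``, ``cat``, ``acq``, ``suffix``, and ``ext`` entries for the filename shown earlier.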

Example Project Structure
~~~~~~~~~~~~~~~~~~~~~~~~~

::

    my_project/
    ├── participants/
    │   ├── part-01/
    │   │   └── ses-01/
    │   │       └── interview/
    │   │           └── part-01_ses-01_task-interview_transcript.txt
    │   └── part-02/
    │       └── fluency/
    │           └── part-02_task-fluency_audio.wav
    ├── configuration.yml
    ├── dataset_description.json
    ├── participants.tsv
    └── README.md
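Because every data file sits somewhere below ``participants/`` and its name starts with ``part-``, the tree can be traversed with a plain recursive glob. The sketch below (an illustration, not part of ``pelican_nlp``) recreates the example structure in a temporary directory and collects the data files:

.. code-block:: python

    from pathlib import Path
    import tempfile

    # Build the example LPDS tree from above in a temporary directory.
    root = Path(tempfile.mkdtemp()) / "my_project"
    files = [
        "participants/part-01/ses-01/interview/part-01_ses-01_task-interview_transcript.txt",
        "participants/part-02/fluency/part-02_task-fluency_audio.wav",
    ]
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    (root / "configuration.yml").touch()

    # Data files are exactly the files under participants/ whose names
    # start with "part-" and carry an extension.
    data_files = sorted(p.name for p in (root / "participants").rglob("part-*.*"))
    print(data_files)

The glob pattern ``part-*.*`` skips the ``part-01``/``part-02`` directories themselves (they contain no dot) and matches only the data files, regardless of whether a ``ses-<ID>`` level is present.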


Features
========

- **Feature 1: Text File Cleaning**
    - Handles whitespace, timestamps, punctuation, special characters, and case normalization.

- **Feature 2: Linguistic Feature Extraction**
    - Extracts semantic embeddings, logits, distance from optimality, perplexity, and semantic similarity.

- **Feature 3: Acoustic Feature Extraction**
    - Extracts prosogram and openSMILE features.

Examples
========

You can find example setups in the `examples <https://github.com/ypauli/pelican_nlp/tree/main/examples>`_ folder of the GitHub repository.

Contributing
============

Contributions are welcome! Please check out the `contributing guide <https://github.com/ypauli/pelican_nlp/blob/main/CONTRIBUTING.md>`_.

License
=======

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See the `LICENSE <https://github.com/ypauli/pelican_nlp/blob/main/LICENSE>`_ file for details.

Citation
========

If you use this project, please cite:

Pauli Y, Marsman J-B, Rabe F, et al. Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis. arXiv preprint arXiv:2511.15512 [cs.CL], 2025.
