Metadata-Version: 2.4
Name: pseudocare
Version: 0.1.4
Summary: Pseudonymization of medical reports using a named entity recognition model and the Faker library
Author-email: Youssouf Anis DAHLOUK <ydahlouk@chu-reims.fr>, Rudy MERIEUX <rmerieux@chu-reims.fr>, Seydou KANE <skane@chu-reims.fr>
License: Software PseudoCare - Copyright By Institut de l'Inteligence Artificielle en Santé - License BSD-3-Clause - YEAR 2024-2025
        
        Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
        
        1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
        
        2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
        
        3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
        
        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Project-URL: Homepage, https://github.com/IIAS-Research/PseudoCare
Project-URL: Issues, https://github.com/IIAS-Research/PseudoCare/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: edsnlp[ml]>=0.15.0
Requires-Dist: faker>=35.2.0
Requires-Dist: pylint>=3.3.4
Requires-Dist: tqdm>=4.67.1
Requires-Dist: unidecode>=1.3.8
Dynamic: license-file
Dynamic: requires-python

# PseudoCare: A Python Library for consistent and realistic Pseudonymization of clinical text entities
## Description
This Python package enables the automatic pseudonymization of medical reports and records using a deep learning-based Named Entity Recognition (NER) NLP model. It leverages the NER model's results to identify and replace sensitive information with synthetic data generated using the Faker library. \
The targeted entities are:
| Label          | Description                                                                                   |
| ------         | ------------                                                                                  |
| ADRESSE        | Postal address, e.g., 33 Boulevard de la Paix                                   |
| DATE           | Any absolute date other than a birth date                                                      |
| DATE_NAISSANCE | Patient's birth date                                                                      |
| HOPITAL        | Hospital name, e.g., Hôpital Robert Debré                                                        |
| IPP            | Permanent patient identifier, a number assigned during the patient's first hospital visit     |
| PRENOM         | Any first name (patients, doctors, etc.)                                                       |
| NOM            | Any last name (patients, doctors, etc.)                                                        |
| SECU           | Social security number                                                                        |
| TEL            | Any phone number                                                                              |
| VILLE          | Any city                                                                                      |
| ZIP            | Any postal code                                                           |

## Functionalities
* Detection of sensitive entities using a Named Entity Recognition (NER) model.
* Automatic pseudonymization of medical reports by replacing detected entities with fictitious data generated by Faker.
* Customization of generated data with two custom Faker providers:
  * A dedicated provider handles date formats frequently found in medical reports, such as janv.12, 13 05.2015, mars2020, or mi-mai. The pseudonymization of dates is performed through offsetting, ensuring that for the same IPP, the dates across different documents are pseudonymized consistently. The user can define the maximum offset value via the **Pseudonymization** class constructor. By default, birth dates are shifted by a random number of days between 1 and 30, while other dates are shifted by a random value between 1 and 100. These parameters can be customized according to specific needs.
  * A provider dedicated to handling email addresses ensures pseudonymization while preserving the format used by the CHU de Reims (e.g., example@chu-reims.fr).
* Extensibility: the user can add custom Faker providers as well as their own NER models for entity detection.
* Default model used: the package utilizes the **eds-pseudo** model from AP-HP, specifically trained on medical documents, including reports.
* Generation of a results file: After executing this package, a **results.html** file is generated, allowing the user to view both the original predicted document and the pseudonymized document. \
Here is an example of execution on a fictitious medical report:
<img src= "./tests/test_pseudo.PNG" alt="pseudo-test" style="border-radius:5px;">
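
The date offsetting described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the names `offset_for` and `shift_date`, the idea of hashing the IPP together with a seed, and the forward direction of the shift are all hypothetical. Only the default ranges (1 to 30 days for birth dates, 1 to 100 days for other dates) come from the description.

```python
import hashlib
import random
from datetime import date, timedelta


def offset_for(ipp: str, seed: int, max_offset: int) -> int:
    """Derive a reproducible day offset in [1, max_offset] for one patient."""
    # Hash the patient identifier with the seed so that the same (ipp, seed)
    # pair always yields the same offset, whatever the document or process.
    digest = hashlib.sha256(f"{seed}:{ipp}".encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))
    return rng.randint(1, max_offset)


def shift_date(d: date, ipp: str, seed: int, is_birth_date: bool = False) -> date:
    """Shift a date by the patient's offset (hypothetical helper)."""
    # Assumed defaults taken from the description: 30 days max for birth
    # dates, 100 days max for any other date.
    max_offset = 30 if is_birth_date else 100
    return d + timedelta(days=offset_for(ipp, seed, max_offset))


# Two dates from different documents of the same patient get the same offset,
# so the interval between them is preserved after pseudonymization.
first = shift_date(date(2018, 7, 1), ipp="12845673", seed=214)
second = shift_date(date(2019, 9, 1), ipp="12845673", seed=214)
print(second - first == date(2019, 9, 1) - date(2018, 7, 1))
```

Deriving the offset from the identifier rather than drawing it once per document is what would keep dates aligned across several reports of the same patient.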

## Structure of files
* scripts/ : Contains Python script files
* scripts/providers/ : Contains all customised providers
* tests/ : Contains test notebooks
* Results/ : Contains the results (html and txt files)

## Launch 
### Using the GitHub repo

First, clone the project locally. Our package relies on the edsnlp model for entity detection, which is hosted on Hugging Face. You therefore need to create a Hugging Face access token (https://huggingface.co/settings/tokens?new_token=true) and register it on your machine. This step only needs to be done once, by running the following script:

```
import huggingface_hub

# Replace YOUR_TOKEN with the access token string created on Hugging Face.
huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)
```
Once completed, you'll be able to use the model.

Next, install uv, an ultra-fast tool for managing virtual environments and Python dependencies. It's compatible with pip, venv, setuptools, and poetry, but significantly faster. Once uv is installed, run the following from the project root directory:

```
uv sync
```
This command creates a virtual environment and installs all the dependencies required by the project.

Finally, launch the main pipeline. If you have several documents for the same patient, run this command from the project root directory:\
 `uv run python -m tests.pseudo_test --input "your/folder/path" --seed "seed" --is_folder`

* `--input` is the path of the folder containing the medical reports (CRs).
* `--seed` is a seed value for the patient; it is used to create the seed for this patient.
* `--is_folder` indicates that `--input` is a folder and not plain text.

Alternatively, if your report is available as plain text or as a .txt file, run one of these commands:\
 `uv run python -m tests.pseudo_test --input "your CR" --seed "seed"`
 
  OR

 `uv run python -m tests.pseudo_test --input "your/txt file/path" --seed "seed"`

However, before that, if you wish to use your own providers, make sure to add them in the **providers** folder and then include them in the **main.py** file during the initialization of the pseudonymization class. Each added provider must include a function named **pseudonymize_{entity_type}**. If no provider is added, the default providers will be used.
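
The `pseudonymize_{entity_type}` naming convention lends itself to lookup by entity label. The sketch below shows one way such a lookup could work. It is an illustration only: `ExampleZipProvider`, `pseudonymize_zip`, `dispatch`, and the lowercasing of the label are hypothetical and not taken from PseudoCare, and a real provider would typically be a Faker provider class rather than a plain class.

```python
import random


class ExampleZipProvider:
    """Hypothetical provider for the ZIP entity (not part of PseudoCare)."""

    def __init__(self, seed: int) -> None:
        self._rng = random.Random(seed)

    def pseudonymize_zip(self, value: str) -> str:
        # Replace each digit with a random digit, keeping length and layout.
        return "".join(
            str(self._rng.randint(0, 9)) if ch.isdigit() else ch for ch in value
        )


def dispatch(provider: object, entity_type: str, value: str) -> str:
    """Call provider.pseudonymize_<entity_type>, or return the value unchanged."""
    # Assumption: the method suffix is the entity label in lower case.
    method = getattr(provider, f"pseudonymize_{entity_type.lower()}", None)
    return method(value) if callable(method) else value


provider = ExampleZipProvider(seed=214)
print(dispatch(provider, "ZIP", "51100"))  # five pseudo-random digits
print(dispatch(provider, "TEL", "0410"))   # no matching method, value unchanged
```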

The user must specify the data and seed, and they have the option to test on one or more documents (a list of .txt files).

A **quick_start.py** file has been added, demonstrating how to easily use the package in just two lines of code if you prefer not to go through the command line.

### Using pip
You can install Pseudocare locally using `pip`. We recommend using `uv`, an ultra-fast tool for managing virtual environments and Python dependencies.
Follow the steps below to set up the package:
1. Create a local folder for your project:
```
mkdir your_folder
cd your_folder
```
2. Create a virtual environment with Python 3.10 or higher:
```
uv venv --python 3.10
```
3. Activate the virtual environment:
```
source .venv/bin/activate
```
4. (Optional) If `pip` is not available in your environment, install it:
```
uv run python -m ensurepip --upgrade
```
5. Install Pseudocare:
```
uv run python -m pip install -i https://test.pypi.org/simple/ pseudocare --extra-index-url https://pypi.org/simple/
```
Once the installation is complete, you can start using Pseudocare to pseudonymize your files.
Here's a quick example to get started:

```
from pseudocare.providers.custom_mail_provider import CustomMailProvider
from pseudocare.model.pseudo_faker import Pseudonymization

if __name__ == "__main__":
    # Import user providers
    custom_providers = {
         'MAIL': CustomMailProvider,
    }

    DOC = "Docteur BERNARD François, Tel: 04.10.14.10.14 Tel: 04.10.14.10.14, \
          Mail: fbernard@test.fr,\
          ipp: 12845673, \
          iep: 147085237, \
          Fait le mercredi 06/01/2025Aujourd'hui, le 6 janvier 2025, j'ai eu l'opportunité de recevoir en \
          consultation Monsieur Jean Dupont né le 15/12/1922, un patient résidant à Paris. Monsieur Dupont \
          est venu pour une consultation médicale afin de discuter de son état de santé général. Après un \
          entretien approfondi, nous avons examiné ses antécédents médicaux ainsi que ses préoccupations actuelles.\
          Le patient a été opéré en 07/2018 pour des problèmes cardiaques, puis à nouveau en sept-2019. En juin 1996,\
          il a subi une intervention chirurgicale pour une pathologie pulmonaire liée au tabagisme, et en sept.22, pour\
          une pathologie intestinale. La consultation du 06/01 a permis d'évaluer plusieurs aspects de son bien-être,\
          notamment en ce qui concerne ses habitudes de vie et ses symptômes. Nous avons convenu de plusieurs\
          recommandations pour améliorer sa santé et avons programmé une consultation pour mi-mai."
    # Instantiate the package
    pseudo_faker = Pseudonymization(custom_providers=custom_providers)
    # Run the pseudonymization process
    pseudo_document = pseudo_faker.run(DOC, 214)
    print(f"{pseudo_document = }")

```
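
To process several .txt files for the same patient from Python, a small helper along these lines could be used. This is a sketch under assumptions: `pseudonymize_folder` is a hypothetical name, and the only thing taken from the example above is that `run` is called as `run(text, seed)`. The type of value returned by `run` is not assumed.

```python
from pathlib import Path
from typing import Callable


def pseudonymize_folder(folder: str, seed: int,
                        run: Callable[[str, int], object]) -> dict[str, object]:
    """Apply `run(text, seed)` to every .txt file found directly in `folder`."""
    results: dict[str, object] = {}
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # The same seed is passed for every file of the patient.
        results[path.name] = run(text, seed)
    return results


# Possible usage with the object created in the example above:
# outputs = pseudonymize_folder("your/folder/path", 214, pseudo_faker.run)
```

Passing `run` as a parameter keeps the helper independent of how the `Pseudonymization` object was configured.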

## Credits
Youcef Anis DAHLOUK
