Metadata-Version: 2.4
Name: pucktrick
Version: 0.6.0.1
Summary: A python library for error generation in dataset for machine learning
Author-email: Andrea Maurino <andrea.maurino@unimib.it>
License: CC BY-NC 4.0
Project-URL: Homepage, https://github.com/andreamaurino/pucktrick
Classifier: Programming Language :: Python :: 3
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: numpy
Requires-Dist: pandas
Dynamic: license-file

# Pucktrick

Pucktrick is a Python library that provides utility functions for introducing errors into your dataframe.
The library is named after Puck, the elf in William Shakespeare's “A Midsummer Night’s Dream”, who is famous for causing trouble and playing tricks on mortals and fairies alike.
  

## Features
Pucktrick is organized into modules, one per error type. Each module exposes a main function (or an injector class) that takes the dataset to modify, the `strategy` dictionary, and, when `mode="extended"`, the original dataset. Each function returns two values: an error code (0 for success, 1 for failure or no modifications) and the generated dataset.
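
The calling convention described above can be sketched as follows. This is only an illustration: `inject_sketch` is a hypothetical stand-in written for this example, not one of pucktrick's functions, and the two failure checks are assumptions about what a "failure" could look like.

```python
import pandas as pd


def inject_sketch(df, strategy, original_df=None):
    """Hypothetical stand-in for a module's main function (not pucktrick's API)."""
    # Assumption: extended mode cannot work without the original dataset.
    if strategy.get("mode") == "extended" and original_df is None:
        return 1, df
    # Assumption: every affected feature must exist in the dataset.
    if any(col not in df.columns for col in strategy["affected_features"]):
        return 1, df
    corrupted = df.copy()  # a real injector would modify this copy
    return 0, corrupted


df = pd.DataFrame({"age": [25, 40, 31]})
code, result = inject_sketch(df, {"affected_features": ["age"], "mode": "new"})
if code == 0:
    print("dataset generated:", result.shape)
else:
    print("no modifications applied")
```

Checking the error code before using the returned dataset keeps a failed injection from being mistaken for a corrupted dataset.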

## The Strategy Configuration
The core of Pucktrick is the `strategy` configuration, which is passed as a JSON object or a Python dictionary. It allows you to precisely define the error model.

### Base Parameters
```json
{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {
    "sampling": "random"
  }
}
```
- **`affected_features`**: A list of strings specifying the columns to be corrupted.
- **`selection_criteria`**: A predicate (e.g., `"age > 30"`) to target specific rows, or `"all"` to target the entire dataset.
- **`percentage`**: A float (0.0 to 1.0) indicating the proportion of targeted rows to corrupt.
- **`mode`**: 
  - `"new"`: Applies errors to a clean dataset. 
  - `"extended"`: Incrementally adds errors to a previously corrupted dataset, reading the `original_df` to avoid double-corrupting rows.
- **`perturbate_data`**: A dictionary containing the noise injection logic.
  - **`sampling`**: How rows are chosen (`"random"`, `"uniform"`, `"normal"`, `"exponential"`).
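
To make the interplay of `selection_criteria` and `percentage` concrete, here is a minimal sketch that loads the JSON above and picks the rows to corrupt. The helper `pick_rows` is hypothetical, and using `DataFrame.query` to evaluate the predicate is an assumption; the library may evaluate it differently.

```python
import json

import numpy as np
import pandas as pd

# The base strategy from above, parsed into a Python dictionary.
strategy = json.loads("""
{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {"sampling": "random"}
}
""")


def pick_rows(df, strategy, seed=0):
    """Hypothetical helper: choose the index labels of the rows to corrupt."""
    criteria = strategy["selection_criteria"]
    # Assumption: the predicate is a pandas query expression such as "age > 30".
    candidates = df if criteria == "all" else df.query(criteria)
    n = int(round(strategy["percentage"] * len(candidates)))
    rng = np.random.default_rng(seed)
    return rng.choice(candidates.index.to_numpy(), size=n, replace=False)


df = pd.DataFrame({"column1": range(10), "column2": list("abcdefghij")})
rows = pick_rows(df, strategy)
print(len(rows), "of", len(df), "rows selected")
```

Note that the percentage applies to the rows matched by the predicate, not to the whole dataset.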

### Modules & Specific Configurations

#### 1. Missing (`missing.py`)
Replaces values with `NaN`.
*Specifics*: No special parameters required in `perturbate_data`.
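
As a rough illustration of the effect, the sketch below blanks out one column for a given set of rows using plain pandas. The helper `blank_out` is hypothetical and is not how `missing.py` is necessarily implemented.

```python
import pandas as pd


def blank_out(df, columns, rows):
    """Hypothetical helper: return a copy of df with NaN in `columns` for `rows`."""
    out = df.copy()
    hit = out.index.isin(rows)
    for col in columns:
        # Series.mask replaces entries where the condition is True with NaN,
        # upcasting integer columns to float as needed.
        out[col] = out[col].mask(hit)
    return out


df = pd.DataFrame({"age": [25, 40, 31, 58], "city": ["Rome", "Milan", "Turin", "Bari"]})
corrupted = blank_out(df, ["age"], rows=[1, 3])
print(corrupted)
```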

#### 2. Outliers (`outliers.py`)
Injects outliers using a 3-sigma rule for continuous numeric data, domain expansion for categorical integers, or specific string tokens for text.
*Specifics*: No special parameters required in `perturbate_data`.
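
For the continuous numeric case, a 3-sigma outlier is a value more than three standard deviations away from the column mean. The sketch below shows one way to produce such values; the helper `make_outliers`, the factor `k=4.0`, and the choice to always shift upward are assumptions made for illustration, not the library's actual logic.

```python
import pandas as pd


def make_outliers(series, rows, k=4.0):
    """Hypothetical helper: set `rows` to a value k standard deviations above the mean."""
    mean, std = series.mean(), series.std()
    out = series.astype("float64")  # astype returns a new Series
    out.loc[rows] = mean + k * std  # k > 3 places the value outside the 3-sigma band
    return out


ages = pd.Series([25, 40, 31, 58, 36, 44])
noisy = make_outliers(ages, rows=[0, 2])
print(noisy)
```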

#### 3. Duplicated (`duplicated.py`)
Duplicates existing rows and optionally applies text transformations.
*Specifics*: Set `"function"` in the main strategy to apply text transformations like `"shuffle_words"`, `"abbreviate_text"`, `"replace_punctuation"`, `"remove_replace"`, or `"upper_lower"`.
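
The following sketch illustrates the idea of duplicating rows and perturbing a text column in the copies. Both functions are hypothetical re-implementations written for this example; in particular, this `shuffle_words` only mirrors the name of the library's option and may behave differently from it.

```python
import random

import pandas as pd


def shuffle_words(text, rng):
    """Return the words of `text` in a random order."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def duplicate_rows(df, column, fraction, seed=0):
    """Hypothetical helper: append shuffled-text copies of a fraction of the rows."""
    rng = random.Random(seed)
    n = int(round(fraction * len(df)))
    dupes = df.sample(n=n, random_state=seed).copy()
    dupes[column] = dupes[column].map(lambda t: shuffle_words(t, rng))
    return pd.concat([df, dupes], ignore_index=True)


df = pd.DataFrame({"name": ["Mario Rossi Srl", "Anna Bianchi Spa",
                            "Luca Verdi Snc", "Sara Neri Sas"]})
extended = duplicate_rows(df, "name", fraction=0.5)
print(extended)
```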

#### 4. Noisy (`noisy.py`)
Adds random noise or a systematic shift to data (numeric, string, or datetime).
*Specifics*: In `perturbate_data`, set `"distribution": "shift"` to apply systematic shifting. You must provide a `"param"` dictionary:
- `"shift_value"`: Numeric value to add (or days for dates).
- `"shift_unit"`: `"absolute"` or `"std"` (standard deviations).
- `"shift_sign"`: `"positive"`, `"negative"`, or `"random"`.

*(Use `"distribution": "random"` for standard uniform noise).*
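
A shift strategy could look like the dictionary below, followed by a sketch of how the three parameters might combine into a single offset. Several points are assumptions: that `"param"` sits inside `perturbate_data`, that `"std"` refers to the sample standard deviation of the column, and that a `"random"` sign is drawn once rather than per value. The helper `shift_offset` is hypothetical.

```python
import random
import statistics

strategy = {
    "affected_features": ["income"],
    "selection_criteria": "all",
    "percentage": 0.2,
    "mode": "new",
    "perturbate_data": {
        "sampling": "random",
        "distribution": "shift",
        # Assumption: "param" is nested inside "perturbate_data".
        "param": {"shift_value": 2, "shift_unit": "std", "shift_sign": "negative"},
    },
}


def shift_offset(values, param, rng=random):
    """Hypothetical helper: compute the offset to add to each selected value."""
    magnitude = float(param["shift_value"])
    if param["shift_unit"] == "std":
        # Interpret shift_value as a number of standard deviations.
        magnitude *= statistics.stdev(values)
    sign = {"positive": 1, "negative": -1}.get(param["shift_sign"])
    if sign is None:  # "random": draw the direction once
        sign = rng.choice([1, -1])
    return sign * magnitude


incomes = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
offset = shift_offset(incomes, strategy["perturbate_data"]["param"])
print([value + offset for value in incomes[:2]])
```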

#### 5. Labels (`labels.py`)
Flips labels for binary or multi-class classification.
*Specifics*: For multi-class labels in `perturbate_data`, set `"noise_model"` to:
- `"NCAR"` (Noise Completely At Random): Uniform random flip.
- `"NAR"` (Noise At Random): Class-dependent flip. Provide `"flip_distribution"` in `param`.
- `"NNAR"` (Nearest Neighbor At Random): Flips labels of instances close to decision boundaries. Provide `"features_for_similarity"` in `param`.
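
The simplest of the three models, NCAR, can be sketched in a few lines: a fixed share of the labels is chosen uniformly at random and each one is replaced with a different class, also chosen uniformly. The helper `flip_ncar` is a hypothetical illustration that works on plain lists; it is not the code in `labels.py`.

```python
import random


def flip_ncar(labels, percentage, seed=0):
    """Hypothetical helper: flip a share of labels uniformly at random."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    n = int(round(percentage * len(labels)))
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n):
        # Pick any class except the current one, each with equal probability.
        alternatives = [c for c in classes if c != labels[i]]
        flipped[i] = rng.choice(alternatives)
    return flipped


labels = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog", "bird", "cat"]
noisy_labels = flip_ncar(labels, percentage=0.3)
print(noisy_labels)
```

NAR differs only in how the replacement is drawn: instead of a uniform choice, the new class would depend on the original class.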

## Version
**version 0.6.0.1**
- Codebase fully refactored using Object-Oriented Programming (OOP) with the Template Method Pattern.
- Added systematic shift (`"distribution": "shift"`) error type to the `noisy` module.
- Standardized the `strategy` interface and improved the `extended` mode logic across all modules.

**version 0.5.1**
- Added multiclass definition.

**version 0.5**
- Added the `strategy`, a JSON file that defines an error model by specifying the affected features (one or more), a selection criterion (a Boolean predicate that selects the subset of rows to corrupt), the mode, the percentage, and the distribution function used for error injection.

**version 0.4**
- Error type added: missing values.

**version 0.3**
- Error type added: duplicated.

**version 0.2**
- Error type added: outliers.

**version 0.1**
- Error types added: noisy errors and inconsistent labels.


## Installation

You can install pucktrick using pip:

`pip install pucktrick`


## Contributing
We welcome contributions from the community. To contribute:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature/your-feature`).
3. Commit your changes (`git commit -am 'Add new feature'`).
4. Push to the branch (`git push origin feature/your-feature`).
5. Create a new Pull Request.

Please ensure your code adheres to our coding standards and includes appropriate tests.

## License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license; see the LICENCE file for details.

## Acknowledgements
Thanks to the contributors and open-source community for their support.
