Metadata-Version: 2.4
Name: pucktrick
Version: 1.0.0
Summary: A Python library for error generation in machine learning datasets
Author-email: Andrea Maurino <andrea.maurino@unimib.it>
License: CC BY-NC 4.0
Project-URL: Homepage, https://github.com/andreamaurino/pucktrick
Classifier: Programming Language :: Python :: 3
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: numpy
Requires-Dist: pandas
Dynamic: license-file

# Pucktrick

Pucktrick is a Python library that provides utility functions for injecting errors into your DataFrame.
The library is named after Puck, the mischievous elf in William Shakespeare's *A Midsummer Night's Dream*, famous for causing trouble and playing tricks on mortals and fairies alike.

## Features

Pucktrick is organized into modules, one per error type. Each module provides a main function (or an injector class) that takes the dataset to modify, the `strategy` dictionary, and, when `mode` is `"extended"` or `"composed"`, the original dataset. Each function returns two values: an error code (0 for success, 1 for failure or no modifications) and the generated dataset.

## The Strategy Configuration

The core of Pucktrick is the `strategy` configuration, which is passed as a JSON object or a Python dictionary. It allows you to precisely define the error model.

### Base Parameters

```json
{
  "affected_features": ["column1", "column2"],
  "selection_criteria": "all",
  "percentage": 0.2,
  "mode": "new",
  "perturbate_data": {
    "sampling": "random"
  }
}
```

- **`affected_features`**: A list of strings specifying the columns to be corrupted.
- **`selection_criteria`**: A predicate (e.g., `"age > 30"`) to target specific rows, or `"all"` to target the entire dataset.
- **`percentage`**: A float (0.0 to 1.0) indicating the proportion of targeted rows to corrupt.
- **`mode`**:
  - `"new"`: Applies errors independently to the clean baseline dataset `D0`. Each call is stateless.
  - `"extended"`: Incrementally adds errors to a previously corrupted dataset. Reads `original_df` (the clean `D0`) to identify rows already modified and adds corruption only to unmodified rows, up to the cumulative `percentage` target. No row is corrupted twice.
  - `"composed"`: Applies errors exclusively to rows that have **already been modified** by a previous operator, using `original_df` (the clean `D0`) to identify them via a row-level, NaN-aware comparison across all columns. The `percentage` parameter controls what fraction of the already-modified set to corrupt. This enables cross-type corruption pipelines where heterogeneous errors are layered on the same row subset.
- **`perturbate_data`**: A dictionary containing the noise injection logic.
  - **`sampling`**: How rows are chosen (`"random"`, `"uniform"`, `"normal"`, `"exponential"`).
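
To make the interplay between `selection_criteria` and `percentage` concrete, here is a minimal sketch of how the target rows could be picked. It is not Pucktrick's own code: `pick_rows` is a hypothetical helper, and it assumes the predicate is a string that pandas' `DataFrame.query` can evaluate.

```python
import pandas as pd

def pick_rows(df, strategy, seed=0):
    """Hypothetical helper (not part of Pucktrick): index labels of the rows to corrupt."""
    criteria = strategy["selection_criteria"]
    # "all" keeps every row; any other string is treated as a pandas query predicate.
    eligible = df if criteria == "all" else df.query(criteria)
    # `percentage` is a fraction of the targeted rows, not of the whole dataset.
    n_target = round(len(eligible) * strategy["percentage"])
    return eligible.sample(n=n_target, random_state=seed).index

df = pd.DataFrame({"age": [25, 31, 47, 52, 38, 29, 61, 44, 33, 70]})
strategy = {
    "affected_features": ["age"],
    "selection_criteria": "age > 30",
    "percentage": 0.5,
    "mode": "new",
    "perturbate_data": {"sampling": "random"},
}
chosen = pick_rows(df, strategy)
print(len(chosen))  # 4: half of the 8 rows with age > 30
```

Only the rows matching the predicate are candidates, so a `percentage` of 0.5 here corrupts 4 rows out of 10, not 5.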

### Accumulation Modes: Summary

| Mode | Eligible rows | `percentage` applies to | Requires `original_df` |
|---|---|---|---|
| `new` | All rows | Full eligible set | No |
| `extended` | Rows not yet modified | Full eligible set (cumulative) | Yes |
| `composed` | Rows already modified in any column | Already-modified set | Yes |
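
The eligibility rules in the table can be sketched with a NaN-aware row comparison against the clean baseline. The helpers `modified_rows` and `eligible_rows` below are hypothetical names used for illustration only; the library's internal implementation may differ.

```python
import numpy as np
import pandas as pd

def modified_rows(current, original):
    """Hypothetical helper: True where a row differs from the clean baseline."""
    # Two cells match if they are equal, or if both are NaN.
    same = (current == original) | (current.isna() & original.isna())
    return ~same.all(axis=1)

def eligible_rows(current, original, mode):
    """Hypothetical helper: boolean mask of rows each mode may corrupt."""
    changed = modified_rows(current, original)
    if mode == "new":
        return pd.Series(True, index=current.index)
    if mode == "extended":
        return ~changed   # only rows that are still clean
    if mode == "composed":
        return changed    # only rows already touched by a previous step
    raise ValueError(f"unknown mode: {mode}")

d0 = pd.DataFrame({"c1": [1.0, np.nan, 3.0, 4.0], "c2": [10, 20, 30, 40]})
d1 = d0.copy()
d1.loc[2, "c1"] = np.nan   # row 2 was corrupted by an earlier step

print(eligible_rows(d1, d0, "composed").tolist())
print(eligible_rows(d1, d0, "extended").tolist())
```

Row 1 already contains `NaN` in the clean data, so a plain `==` comparison would wrongly flag it as modified; the extra `isna()` term avoids that.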

### Example: Composed Pipeline

```python
from pucktrick.missing import missing
from pucktrick.outliers import outlier

# df is the clean baseline DataFrame (D0); it must contain the columns c1 and c2.

# Step 1 — inject missing values on c1 (20% of rows), mode="new"
strategy_s1 = {
    "affected_features": ["c1"],
    "selection_criteria": "c1 == c1",
    "percentage": 0.20,
    "mode": "new",
    "perturbate_data": {"sampling": "random"}
}
err1, D1 = missing(df, strategy_s1)

# Step 2 — inject outliers on c2, mode="composed"
# Acts exclusively on the rows already modified by Step 1
strategy_s2 = {
    "affected_features": ["c2"],
    "selection_criteria": "c2 == c2",
    "percentage": 1.0,
    "mode": "composed",
    "perturbate_data": {"sampling": "random"}
}
err2, D2 = outlier(D1, strategy_s2, original_df=df)
# D2: rows with NaN in c1 coincide exactly with rows with outliers in c2
```

---

## Modules & Specific Configurations

### Error Injection Modules

#### 1. Missing (`missing.py`)

Replaces values with `NaN`.
*Specifics*: No special parameters required in `perturbate_data`.

#### 2. Outliers (`outliers.py`)

Injects outliers using a 3-sigma rule for continuous numeric data, domain expansion for categorical integers, or specific string tokens for text.
*Specifics*: No special parameters required in `perturbate_data`.
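
As an illustration of the 3-sigma idea for continuous data, the sketch below draws a value that lies more than three standard deviations from the column mean. The helper `three_sigma_outlier` and the choice of the 3 to 4 sigma range are assumptions made for this example, not the library's actual generator.

```python
import random
import statistics

def three_sigma_outlier(values, rng):
    """Hypothetical helper (not the library's code): one value beyond mean +/- 3 std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    sign = rng.choice([-1, 1])
    # Land between 3 and 4 standard deviations away from the mean.
    return mean + sign * std * rng.uniform(3.0, 4.0)

values = [10.0, 12.0, 11.0, 13.0, 9.0, 10.5, 11.5]
outlier_value = three_sigma_outlier(values, random.Random(0))
print(outlier_value)
```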

#### 3. Duplicated (`duplicated.py`)

Duplicates existing rows and optionally applies text transformations.
*Specifics*: Set `"function"` in the main strategy to apply text transformations like `"shuffle_words"`, `"abbreviate_text"`, `"replace_punctuation"`, `"remove_replace"`, or `"upper_lower"`.

#### 4. Noisy (`noisy.py`)

Adds random noise or a systematic shift to data (numeric, string, or datetime).
*Specifics*: In `perturbate_data`, set `"distribution": "shift"` to apply systematic shifting. You must provide a `"param"` dictionary:
- `"shift_value"`: Numeric value to add (or days for dates).
- `"shift_unit"`: `"absolute"` or `"std"` (standard deviations).
- `"shift_sign"`: `"positive"`, `"negative"`, or `"random"`.

*(Use `"distribution": "random"` for standard uniform noise).*
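
The three shift parameters combine as sketched below. This is a plain-Python illustration, not the module's code: `apply_shift` is a hypothetical helper, the nesting of `"param"` inside `perturbate_data` is an assumption, and so is drawing a fresh sign per value when `shift_sign` is `"random"`.

```python
import random
import statistics

def apply_shift(values, param, seed=0):
    """Hypothetical helper (not the library's code): shift every value by a fixed amount."""
    rng = random.Random(seed)
    amount = param["shift_value"]
    if param["shift_unit"] == "std":
        # Interpret shift_value as a number of standard deviations of the column.
        amount = amount * statistics.pstdev(values)
    signs = {"positive": 1, "negative": -1}
    shifted = []
    for v in values:
        if param["shift_sign"] == "random":
            # Assumption: the sign is drawn independently for each value.
            sign = rng.choice([-1, 1])
        else:
            sign = signs[param["shift_sign"]]
        shifted.append(v + sign * amount)
    return shifted

# Assumption: "param" is nested inside "perturbate_data".
perturbate_data = {
    "distribution": "shift",
    "param": {"shift_value": 0.5, "shift_unit": "absolute", "shift_sign": "negative"},
}
shifted = apply_shift([1.0, 2.0, 3.0], perturbate_data["param"])
print(shifted)  # [0.5, 1.5, 2.5]
```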

#### 5. Labels (`labels.py`)

Flips labels for binary or multi-class classification.
*Specifics*: For multi-class labels in `perturbate_data`, set `"noise_model"` to:
- `"NCAR"` (Noise Completely At Random): Uniform random flip.
- `"NAR"` (Noise At Random): Class-dependent flip. Provide `"flip_distribution"` in `param`.
- `"NNAR"` (Nearest Neighbor At Random): Flips labels of instances close to decision boundaries. Provide `"features_for_similarity"` in `param`.
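
For intuition, the uniform random flip described for `"NCAR"` can be sketched in a few lines. `flip_ncar` is a hypothetical helper written for this README, not the function exposed by `labels.py`; it assumes at least two distinct classes.

```python
import random

def flip_ncar(labels, percentage, seed=0):
    """Hypothetical helper: flip a fraction of labels, uniformly at random."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    n_flip = round(len(labels) * percentage)
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        # Pick uniformly among the other classes so the label always changes.
        flipped[i] = rng.choice([c for c in classes if c != labels[i]])
    return flipped

labels = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog", "bird", "cat"]
noisy = flip_ncar(labels, 0.3)
print(sum(a != b for a, b in zip(labels, noisy)))  # 3
```

The flip probability does not depend on the class or on the features, which is what distinguishes this model from `"NAR"` and `"NNAR"`.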

---

### Drift Simulation Modules

Pucktrick supports the simulation of three structurally distinct forms of dataset drift, modelled as temporal corruption policies applied to dataset segments. Each drift module exposes a `run_drift_pipeline` function that accepts a `strategy` dictionary defining the drift configuration per segment.

```python
strategy = {
    "strategy": {
        "percentage": 0.35,
        "chunks": {
            "0": None,           # baseline segment, no drift
            "1": None,
            "2": { ... },        # drift configuration for segment 2
            "3": { ... },
        }
    }
}
# run_drift_pipeline is exposed by each drift module; import it from the one you need.
drift_df, change_points, ranked_features = run_drift_pipeline(
    df=df,
    target_col="target",
    strategy=strategy,
    n_chunks=4,
    random_state=42
)
```
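
Because the changelog states that drift modules accept either a `strategy` dictionary or a `strategy_path` pointing to a JSON file, it can help to see how the dictionary maps to JSON. The sketch below uses only the standard library and assumes the file simply mirrors the dictionary layout; the file name is arbitrary.

```python
import json
import os
import tempfile

strategy = {
    "strategy": {
        "percentage": 0.35,
        "chunks": {
            "0": None,   # baseline segment, no drift
            "1": None,
            "2": {
                "drift_type": "covariate_noise",
                "features": ["temp", "humidity"],
                "noise_mode": "relative",
                "noise_std": 0.08,
                "shape": "segment",
            },
        },
    }
}

path = os.path.join(tempfile.mkdtemp(), "drift_strategy.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(strategy, fh, indent=2)   # None is written as JSON null

with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)              # null is read back as None

print(loaded == strategy)  # True
```

Chunk identifiers are written as strings (`"0"`, `"1"`, ...) because JSON object keys must be strings; using string keys in the Python dictionary keeps both forms identical.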

#### Drift Types Summary

| Drift type | Distribution affected | Module | `drift_type` value |
|---|---|---|---|
| Data drift (covariate noise) | $P(X)$ changes | `covariate_noise_drift` | `"covariate_noise"` |
| Data drift (offset) | $P(X)$ changes | `covariate_offset_drift` | `"covariate_offset"` |
| Concept drift (target offset) | $P(Y\|X)$ changes | `offset_drift` | `"concept"` |
| Concept drift (feature rotation) | $P(Y\|X)$ changes | `concept_drift` | `"concept_rotation"` |
| Label drift (prior shift) | $P(Y)$ changes | `prior_multinomial_drift` | `"prior_multinomial"` |
| Target scaling | $P(Y)$ changes | `target_scaling_drift` | `"target_scaling"` |
| Generic (all types) | configurable | `drift_generic` | any |

#### 6. Covariate Noise Drift (`covariate_noise_drift.py`)

Adds progressive Gaussian noise to selected features, simulating data drift where $P(X)$ shifts over time.

```python
"2": {
    "drift_type": "covariate_noise",
    "features": ["temp", "humidity"],
    "noise_mode": "relative",
    "noise_std": 0.08,
    "shape": "segment"
}
```

- `"noise_mode"`: `"relative"` (noise proportional to feature std) or `"absolute"`
- `"noise_std"`: magnitude of Gaussian noise
- `"shape"`: `"segment"` (this chunk only) or `"step"` (persists in subsequent chunks)
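
The difference between the two noise modes can be sketched as follows. `add_gaussian_noise` is a hypothetical helper, not the module's implementation; it takes "relative" to mean that `noise_std` is multiplied by the feature's standard deviation, as described above.

```python
import random
import statistics

def add_gaussian_noise(values, noise_std, noise_mode, seed=0):
    """Hypothetical helper: add zero-mean Gaussian noise to one feature."""
    rng = random.Random(seed)
    sigma = noise_std
    if noise_mode == "relative":
        # Scale the noise by the spread of the feature itself.
        sigma = noise_std * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sigma) for v in values]

temps = [18.0, 21.5, 19.0, 25.0, 23.5, 20.0]
noisy = add_gaussian_noise(temps, 0.08, "relative")
print(noisy)
```

With `"relative"`, the same `noise_std` produces comparable distortion on features with very different scales, such as temperature and humidity.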

#### 7. Covariate Offset Drift (`covariate_offset_drift.py`)

Applies a systematic directional offset to selected features, simulating sensor calibration drift.

```python
"2": {
    "drift_type": "covariate_offset",
    "features": ["temp", "humidity"],
    "offset_mode": "relative",
    "offset_scale": 0.20,
    "direction": "up",
    "shape": "step"
}
```

- `"offset_mode"`: `"relative"` or `"absolute"`
- `"offset_scale"`: magnitude of the offset
- `"direction"`: `"up"`, `"down"`, or `"random"`
- `"shape"`: `"segment"` or `"step"`
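
The `shape` and `direction` options can be illustrated with two small hypothetical helpers. The sketch covers only an absolute offset, since the exact meaning of `"relative"` is not specified here, and it assumes chunks are numbered from 0 to `n_chunks - 1`.

```python
import random

def chunks_affected(start, n_chunks, shape):
    """Hypothetical helper: chunk ids touched by a drift configured on chunk `start`."""
    if shape == "segment":
        return [start]                        # this chunk only
    if shape == "step":
        return list(range(start, n_chunks))   # this chunk and all later ones
    raise ValueError(f"unsupported shape: {shape}")

def offset_absolute(values, offset_scale, direction, seed=0):
    """Hypothetical helper: move every value by the same absolute amount."""
    if direction == "random":
        # Assumption: one sign is drawn for the whole feature.
        sign = random.Random(seed).choice([-1, 1])
    else:
        sign = {"up": 1, "down": -1}[direction]
    return [v + sign * offset_scale for v in values]

print(chunks_affected(2, 4, "step"))             # [2, 3]
print(offset_absolute([1.0, 2.0], 0.5, "down"))  # [0.5, 1.5]
```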

#### 8. Concept Drift — Target Offset (`offset_drift.py`)

Shifts the target variable using a percentage offset, simulating concept drift where $P(Y \mid X)$ changes.

```python
"2": {
    "drift_type": "concept",
    "features": ["<TARGET>"],
    "offset_perc": 0.50,
    "offset_mode": "add",
    "base": "mean",
    "shape": "step",
    "direction": "up"
}
```

- `"offset_perc"`: fractional offset applied to the base value
- `"offset_mode"`: `"add"` or `"mul"` (multiplicative)
- `"base"`: `"mean"`, `"median"`, `"std"`, or `"quantile"`
- `"shape"`: `"step"`, `"ramp"`, `"spike"`, or `"sin"`
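
One plausible reading of the additive case is that the offset equals `offset_perc` times the chosen base statistic of the target. The sketch below follows that reading for a constant (`"step"`) shift only; `additive_target_offset` is a hypothetical helper, and the `"mul"` mode, the `"quantile"` base, and the other shapes are deliberately left out because their exact behaviour is not documented here.

```python
import statistics

# Base statistics covered by this sketch.
BASES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "std": statistics.pstdev,
}

def additive_target_offset(y, offset_perc, base="mean", direction="up"):
    """Hypothetical helper: add offset_perc * base(y) to every target value."""
    sign = 1 if direction == "up" else -1
    delta = sign * offset_perc * BASES[base](y)
    return [v + delta for v in y]

y = [10.0, 20.0, 30.0]
shifted = additive_target_offset(y, 0.50)   # mean is 20, so every value moves by +10
print(shifted)  # [20.0, 30.0, 40.0]
```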

#### 9. Concept Drift — Feature Rotation (`concept_drift.py`)

Permutes or cycles feature values across instances, breaking the feature-label relationship without altering marginal distributions.

```python
"2": {
    "drift_type": "concept_rotation",
    "severity": 0.65,
    "rotation_mode": "cycle",
    "shape": "step"
}
```

- `"severity"`: fraction of features involved (0.0–1.0)
- `"rotation_mode"`: `"cycle"` or `"permute"`
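
The claim that rotation leaves marginal distributions untouched is easy to check on a single column. In the sketch below, `rotate_feature` and `pick_features` are hypothetical helpers; treating `"cycle"` as a one-position roll and `"permute"` as a random shuffle is an assumption about the module's behaviour.

```python
import random

def rotate_feature(column, rotation_mode, seed=0):
    """Hypothetical helper: reorder one feature's values across instances."""
    values = list(column)
    if rotation_mode == "cycle":
        return values[-1:] + values[:-1]        # roll by one position
    if rotation_mode == "permute":
        random.Random(seed).shuffle(values)     # random reordering
        return values
    raise ValueError(f"unsupported rotation_mode: {rotation_mode}")

def pick_features(features, severity, seed=0):
    """Hypothetical helper: choose the fraction of features to rotate."""
    k = round(len(features) * severity)
    return random.Random(seed).sample(features, k)

column = [1, 2, 3, 4]
print(rotate_feature(column, "cycle"))                              # [4, 1, 2, 3]
print(sorted(rotate_feature(column, "permute")) == sorted(column))  # True
```

The set of values in the column is unchanged, so its histogram is identical; only the pairing between each value and its row (and therefore its label) is broken.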

#### 10. Label Drift — Prior Multinomial (`prior_multinomial_drift.py`)

Resamples the class distribution according to a user-specified probability vector, simulating prior probability shift $P(Y)$.

```python
"2": {
    "drift_type": "prior_multinomial",
    "features": ["<TARGET>"],
    "bins": 3,
    "class_probs_list": [0.05, 0.15, 0.80],
    "temperature": 0.6
}
```

- `"bins"`: number of bins for numeric columns
- `"class_probs_list"`: probability vector for each bin/class
- `"temperature"`: sharpens (`< 1.0`) or flattens (`> 1.0`) the distribution
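
A common way to implement such a temperature is to raise each probability to the power `1 / temperature` and renormalize. The sketch below uses that standard formulation to show the sharpening and flattening effect; whether the module applies exactly this formula is an assumption, and `apply_temperature` is a hypothetical helper.

```python
def apply_temperature(probs, temperature):
    """Hypothetical helper: temperature-scale a probability vector."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.05, 0.15, 0.80]
sharp = apply_temperature(probs, 0.6)   # below 1.0: mass concentrates on the top class
flat = apply_temperature(probs, 2.0)    # above 1.0: classes move closer together
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```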

#### 11. Target Scaling (`target_scaling_drift.py`)

Applies a multiplicative scaling factor to the numeric target variable.

```python
"2": {
    "drift_type": "target_scaling",
    "scale_perc": 0.10,
    "shape": "segment"
}
```

- `"scale_perc"`: fractional increase (e.g., `0.10` multiplies target by 1.10)
- `"scale_factor"`: direct multiplicative factor (alternative to `scale_perc`)
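
The relationship between the two parameters can be written out explicitly. `resolve_scale_factor` and `scale_target` are hypothetical helpers; giving `scale_factor` precedence when both keys are present is an assumption made for this sketch.

```python
def resolve_scale_factor(config):
    """Hypothetical helper: turn a chunk config into one multiplicative factor."""
    if "scale_factor" in config:
        # Assumption: an explicit factor wins over scale_perc.
        return config["scale_factor"]
    return 1.0 + config["scale_perc"]

def scale_target(y, config):
    """Hypothetical helper: multiply every target value by the resolved factor."""
    factor = resolve_scale_factor(config)
    return [v * factor for v in y]

print(resolve_scale_factor({"scale_perc": 0.10}))           # approximately 1.1
print(scale_target([100.0, 200.0], {"scale_factor": 2.0}))  # [200.0, 400.0]
```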

#### 12. Generic Drift (`drift_generic.py`)

A unified module supporting all drift types above plus additional specialized types (`conditional`, `offset_time`, `seasonal_shift`, `prior_bool`, `concept_ord_shift`, and others).

---

## Version

**version 1.0.0**
- Added drift simulation modules: `covariate_noise_drift`, `covariate_offset_drift`, `offset_drift`, `concept_drift`, `prior_multinomial_drift`, `target_scaling_drift`, `drift_generic`. All modules support both `strategy_path` (JSON file) and `strategy` (Python dict) as input.
- Added unified `drift.py` wrapper exposing a `drift()` function compatible with the Pucktrick strategy interface.
- All drift modules integrated and tested with synthetic datasets.

**version 0.6.1.1**
- Added `composed` mode to all modules.
- Added `_is_row_modified` method to `BaseErrorInjector` for row-level modification tracking.
- Fixed `_get_modifiable_mask` in `MissingErrorInjector` and `LabelErrorInjector`.
- Fixed type normalization in NAR label flip for integer target columns.

**version 0.6.0.1**
- Codebase fully refactored using Object-Oriented Programming with the Template Method Pattern.
- Added systematic shift (`"distribution": "shift"`) to the `noisy` module.
- Standardized the `strategy` interface and improved `extended` mode logic across all modules.

**version 0.5.1**
- Added multiclass definition.

**version 0.5**
- Added strategy JSON configuration.

**version 0.4**
- Added error type: missing values.

**version 0.3**
- Added error type: duplicated.

**version 0.2**
- Added error type: outliers.

**version 0.1**
- Added error types: noisy errors and inconsistent labels.

---

## Installation

```bash
pip install pucktrick
```

---

## Contributing

1. Fork the repository
2. Create a new branch (`git checkout -b feature/your-feature`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/your-feature`)
5. Create a new Pull Request

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). See the LICENCE file for details.

## Acknowledgements

Thanks to the contributors and open-source community for their support.
