Metadata-Version: 2.4
Name: lecrapaud
Version: 0.22.0
Summary: Framework for machine and deep learning, with regression, classification and time series analysis
License: Apache License
License-File: LICENSE
Author: Pierre H. Gallet
Requires-Python: ==3.12.*
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: catboost (>=1.2.8)
Requires-Dist: category-encoders (>=2.8.1)
Requires-Dist: celery (>=5.5.3)
Requires-Dist: celery-redbeat (>=2.3.2)
Requires-Dist: ftfy (>=6.3.1)
Requires-Dist: hyperopt (>=0.2.7)
Requires-Dist: joblib (>=1.5.1)
Requires-Dist: keras (>=3.10.0)
Requires-Dist: keras-tcn (>=3.5.6)
Requires-Dist: lightgbm (>=4.6.0)
Requires-Dist: matplotlib (>=3.10.3)
Requires-Dist: mlxtend (>=0.23.4)
Requires-Dist: numpy (>=2.1.3)
Requires-Dist: openai (>=1.88.0)
Requires-Dist: pandas (>=2.3.0)
Requires-Dist: pydantic (>=2.9.2)
Requires-Dist: python-dotenv (>=1.1.0)
Requires-Dist: scikit-learn (>=1.6.1)
Requires-Dist: scipy (<1.14.0)
Requires-Dist: seaborn (>=0.13.2)
Requires-Dist: sqlalchemy (>=2.0.41)
Requires-Dist: tensorboardx (>=2.6.4)
Requires-Dist: tensorflow (>=2.19.0)
Requires-Dist: tiktoken (>=0.9.0)
Requires-Dist: tqdm (>=4.67.1)
Requires-Dist: xgboost (>=3.0.2)
Description-Content-Type: text/markdown

<div align="center">

<img src="https://s3.amazonaws.com/pix.iemoji.com/images/emoji/apple/ios-12/256/frog-face.png" width=120 alt="crapaud"/>

## Welcome to LeCrapaud

**An all-in-one machine learning framework**

[![GitHub stars](https://img.shields.io/github/stars/pierregallet/lecrapaud.svg?style=flat&logo=github&colorB=blue&label=stars)](https://github.com/pierregallet/lecrapaud/stargazers)
[![PyPI version](https://badge.fury.io/py/lecrapaud.svg)](https://badge.fury.io/py/lecrapaud)
[![Python versions](https://img.shields.io/pypi/pyversions/lecrapaud.svg)](https://pypi.org/project/lecrapaud)
[![License](https://img.shields.io/github/license/pierregallet/lecrapaud.svg)](https://github.com/pierregallet/lecrapaud/blob/main/LICENSE)
[![codecov](https://codecov.io/gh/pierregallet/lecrapaud/branch/main/graph/badge.svg)](https://codecov.io/gh/pierregallet/lecrapaud)

</div>

## 🚀 Introduction

LeCrapaud is a high-level Python library for end-to-end machine learning workflows on tabular data, with a focus on financial and stock datasets. It provides a simple API to handle feature engineering, model selection, training, and prediction, all in a reproducible and modular way.

## ✨ Key Features

- 🧩 Modular pipeline: Feature engineering, preprocessing, selection, and modeling as independent steps
- 🤖 Automated model selection and hyperparameter optimization
- 📊 Easy integration with pandas DataFrames
- 🔬 Supports both regression and classification tasks
- 🛠️ Simple API for both full pipeline and step-by-step usage
- 📦 Ready for production and research workflows

## ⚡ Quick Start


### Install the package

```sh
pip install lecrapaud
```

### How it works

This package provides a high-level API to manage experiments for feature engineering, model selection, and prediction on tabular data (e.g. stock data).

### Typical workflow

```python
from lecrapaud import LeCrapaud

# Create a new experiment with data
experiment = LeCrapaud(
    data=your_dataframe,
    target_numbers=[1, 2],
    target_clf=[2],  # TARGET_2 is classification
    columns_drop=[...],
    columns_date=[...],
    # ... other config options
)

# Train the model
experiment.fit(your_dataframe)

# Make predictions
predictions, reg_scores, clf_scores = experiment.predict(new_data)

# Load existing experiment by ID
experiment = LeCrapaud(id=123)

# Or get best experiment by name
best_exp = LeCrapaud.get_best_experiment_by_name('my_experiment')
```

### Database Configuration (Required)

LeCrapaud requires access to a MySQL database to store experiments and results. You can configure the database by:

- Passing a valid MySQL URI to the constructor:
  ```python
  experiment = LeCrapaud(uri="mysql+pymysql://user:password@host:port/dbname", data=df, ...)
  ```
- **OR** setting environment variables:
  - `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`
  - Or set `DB_URI` directly with your full connection string.

If neither is provided, database operations will not work.
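
For example, the environment-variable route might look like this in your shell profile or `.env` file (all values below are placeholders, not real credentials):

```sh
# Option 1: a full connection string
export DB_URI="mysql+pymysql://user:password@host:3306/dbname"

# Option 2: individual components
export DB_USER=user
export DB_PASSWORD=password
export DB_HOST=host
export DB_PORT=3306
export DB_NAME=dbname
```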

### Using OpenAI Embeddings (Optional)

If you want to use the `columns_pca` embedding feature (for advanced feature engineering), you must set the `OPENAI_API_KEY` environment variable with your OpenAI API key:

```sh
export OPENAI_API_KEY=sk-...
```

If this variable is not set, features relying on OpenAI embeddings will not be available.

### Experiment Context Arguments

The experiment context is a dictionary containing all configuration parameters for your ML pipeline. Parameters are stored in the experiment's database record and automatically retrieved when loading an existing experiment.

#### Required Parameters

| Parameter         | Type      | Description                                          | Example                |
|-------------------|-----------|------------------------------------------------------|------------------------|
| `data`           | DataFrame | Input dataset (required for new experiments only)    | `pd.DataFrame(...)`    |
| `experiment_name`| str       | Unique name for the experiment                      | `'stock_prediction'`   |
| `date_column`    | str       | Name of the date column (required for time series)  | `'DATE'`              |
| `group_column`   | str       | Name of the group column (required for panel data)  | `'STOCK'`             |

#### Feature Engineering Parameters

| Parameter             | Type  | Default | Description                                                              |
|-----------------------|-------|---------|--------------------------------------------------------------------------|
| `columns_drop`        | list  | `[]`    | Columns to drop during feature engineering                              |
| `columns_boolean`     | list  | `[]`    | Columns to convert to boolean features                                  |
| `columns_date`        | list  | `[]`    | Date columns for cyclic encoding                                        |
| `columns_te_groupby`  | list  | `[]`    | Groupby columns for target encoding                                     |
| `columns_te_target`   | list  | `[]`    | Target columns for target encoding                                      |

#### Preprocessing Parameters

| Parameter               | Type  | Default | Description                                                           |
|-------------------------|-------|---------|-----------------------------------------------------------------------|
| `time_series`           | bool  | `False` | Whether data is time series                                          |
| `val_size`              | float | `0.2`   | Validation set size (fraction)                                       |
| `test_size`             | float | `0.2`   | Test set size (fraction)                                             |
| `columns_pca`           | list  | `[]`    | Columns for PCA transformation                                       |
| `pca_temporal`          | list  | `[]`    | Temporal PCA config (e.g., lag features)                            |
| `pca_cross_sectional`   | list  | `[]`    | Cross-sectional PCA config (e.g., market regime)                    |
| `columns_onehot`        | list  | `[]`    | Columns for one-hot encoding                                         |
| `columns_binary`        | list  | `[]`    | Columns for binary encoding                                          |
| `columns_ordinal`       | list  | `[]`    | Columns for ordinal encoding                                         |
| `columns_frequency`     | list  | `[]`    | Columns for frequency encoding                                       |

#### Feature Selection Parameters

| Parameter                   | Type  | Default | Description                                                      |
|-----------------------------|-------|---------|------------------------------------------------------------------|
| `percentile`                | float | `20`    | Percentage of features to keep per selection method             |
| `corr_threshold`            | float | `80`    | Maximum correlation threshold (%) between features              |
| `max_features`              | int   | `50`    | Maximum number of final features                                |
| `max_p_value_categorical`   | float | `0.05`  | Maximum p-value for categorical feature selection (Chi2)        |

#### Model Selection Parameters

| Parameter              | Type  | Default | Description                                                           |
|------------------------|-------|---------|-----------------------------------------------------------------------|
| `target_numbers`       | list  | `[]`    | List of target indices to predict                                     |
| `target_clf`           | list  | `[]`    | Classification target indices                                         |
| `models_idx`           | list  | `[]`    | Model indices or names to use (e.g., `[1, 'xgb', 'lgb']`)           |
| `max_timesteps`        | int   | `120`   | Maximum timesteps for recurrent models                               |
| `perform_hyperopt`     | bool  | `True`  | Whether to perform hyperparameter optimization                       |
| `number_of_trials`     | int   | `20`    | Number of hyperopt trials                                            |
| `perform_crossval`     | bool  | `False` | Whether to use cross-validation during hyperopt                      |
| `plot`                 | bool  | `True`  | Whether to generate plots                                            |
| `preserve_model`       | bool  | `True`  | Whether to save the best model                                       |
| `target_clf_thresholds`| dict  | `{}`    | Classification thresholds per target                                 |

#### Example Context Configuration

```python
context = {
    # Required parameters
    "experiment_name": f"stock_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    "date_column": "DATE",
    "group_column": "STOCK",
    
    # Feature selection
    "corr_threshold": 80,
    "max_features": 20,
    "percentile": 20,
    "max_p_value_categorical": 0.05,
    
    # Feature engineering
    "columns_drop": ["SECURITY", "ISIN", "ID"],
    "columns_boolean": [],
    "columns_date": ["DATE"],
    "columns_te_groupby": [["SECTOR", "DATE"]],
    "columns_te_target": ["RET", "VOLUME"],
    
    # Preprocessing
    "time_series": True,
    "val_size": 0.2,
    "test_size": 0.2,
    "pca_temporal": [
        # Old format (still supported)
        # {"name": "LAST_20_RET", "columns": [f"RET_-{i}" for i in range(1, 21)]},
        # New simplified format - automatically creates lag columns
        {"name": "LAST_20_RET", "column": "RET", "lags": 20},
        {"name": "LAST_10_VOL", "column": "VOLUME", "lags": 10},
    ],
    "pca_cross_sectional": [
        {
            "name": "MARKET_REGIME",
            "index": "DATE",
            "columns": "STOCK",
            "value": "RET",
        }
    ],
    "columns_onehot": ["BUY_SIGNAL"],
    "columns_binary": ["SECTOR", "LOCATION"],
    "columns_ordinal": ["STOCK"],
    
    # Model selection
    "target_numbers": [1, 2, 3],
    "target_clf": [1],
    "models_idx": ["xgb", "lgb", "catboost"],
    "max_timesteps": 120,
    "perform_hyperopt": True,
    "number_of_trials": 50,
    "perform_crossval": True,
    "plot": True,
    "preserve_model": True,
    "target_clf_thresholds": {1: {"precision": 0.80}},
}

# Create experiment with the new unified API
experiment = LeCrapaud(data=your_dataframe, **context)
```

#### Important Notes

1. **Context Persistence**: All context parameters are saved in the database when creating an experiment and automatically restored when loading it.

2. **Parameter Precedence**: When loading an existing experiment, the stored context takes precedence over any parameters passed to the constructor.

3. **PCA Time Series**: 
   - For time series data, both `pca_cross_sectional` and `pca_temporal` automatically use an expanding window approach with periodic refresh (default: every 90 days) to prevent data leakage.
   - The system fits PCA only on historical data (lookback window of 365 days by default) and avoids look-ahead bias.
   - For panel data (e.g., multiple stocks), lag features are created per group when using the simplified `pca_temporal` format.
   - Missing PCA values are handled with forward-fill followed by zero-fill to ensure compatibility with downstream models.

4. **PCA Temporal Simplified Format**: 
   - Instead of manually listing lag columns: `{"name": "LAST_20_RET", "columns": ["RET_-1", "RET_-2", ..., "RET_-20"]}`
   - Use the simplified format: `{"name": "LAST_20_RET", "column": "RET", "lags": 20}`
   - The system automatically creates the lag columns, handling panel data correctly with `group_column`.

5. **OpenAI Embeddings**: If using `columns_pca` with text columns, ensure `OPENAI_API_KEY` is set as an environment variable.

6. **Model Indices**: The `models_idx` parameter accepts both integer indices and string names (e.g., `'xgb'`, `'lgb'`, `'catboost'`).
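As an illustration of note 4 above, the per-group lag expansion can be sketched with plain pandas. The `STOCK`/`RET` column names are just examples, and this is only a sketch of the behavior described (the library's internal implementation may differ):

```python
import pandas as pd

# Toy panel dataset: two stocks, three dates each.
df = pd.DataFrame({
    "STOCK": ["A", "A", "A", "B", "B", "B"],
    "DATE": pd.date_range("2024-01-01", periods=3).tolist() * 2,
    "RET": [0.01, 0.02, 0.03, -0.01, -0.02, -0.03],
})

# {"name": "LAST_3_RET", "column": "RET", "lags": 3} would expand to
# RET_-1 .. RET_-3, shifted within each group so that one stock's
# history never leaks into another's.
for lag in range(1, 4):
    df[f"RET_-{lag}"] = df.groupby("STOCK")["RET"].shift(lag)
```

The first rows of each group get `NaN` lags rather than values borrowed from the previous group, which is the point of shifting within `group_column`.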



### Modular usage with sklearn-compatible components

You can also use individual pipeline components:

```python
from lecrapaud import FeatureEngineering, FeaturePreprocessor, FeatureSelector

# Create components with experiment context
feature_eng = FeatureEngineering(experiment=experiment)
feature_prep = FeaturePreprocessor(experiment=experiment)
feature_sel = FeatureSelector(experiment=experiment, target_number=1)

# Use sklearn fit/transform pattern
feature_eng.fit(data)
data_eng = feature_eng.get_data()

feature_prep.fit(data_eng)
data_preprocessed = feature_prep.transform(data_eng)

feature_sel.fit(data_preprocessed)

# Or use in sklearn Pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('feature_eng', FeatureEngineering(experiment=experiment)),
    ('feature_prep', FeaturePreprocessor(experiment=experiment))
])
```

## ⚠️ Using Alembic in Your Project (Important for Integrators)

If you use Alembic for migrations in your own project and share a database with LeCrapaud, you must ensure that Alembic does **not** attempt to drop or modify LeCrapaud's tables (those whose names start with the configured LeCrapaud table prefix).

By default, Alembic's autogenerate feature will propose to drop any table that exists in the database but is not present in your project's models. To prevent this, add the following filter to your `env.py`:

```python
# Set this to the table prefix configured for your LeCrapaud installation.
LECRAPAUD_TABLE_PREFIX = "..."

def include_object(object, name, type_, reflected, compare_to):
    if type_ == "table" and name.startswith(f"{LECRAPAUD_TABLE_PREFIX}_"):
        return False  # Ignore LeCrapaud tables
    return True

context.configure(
    # ... other options ...
    include_object=include_object,
)
```

This will ensure that Alembic ignores all tables created by LeCrapaud when generating migrations for your own project.

---

## 🤝 Contributing

### Reminders for GitHub usage

1. Create a GitHub repository

```sh
$ brew install gh
$ gh auth login
$ gh repo create
```

2. Initialize git and push the first commit to the remote repository

```sh
$ git init
$ git add .
$ git commit -m 'first commit'
$ git remote add origin <YOUR_REPO_URL>
$ git push -u origin master
```

3. Use conventional commits  
https://www.conventionalcommits.org/en/v1.0.0/#summary

4. Create a virtual environment (`venv` ships with Python, so no extra install is needed)

```sh
$ python -m venv .venv
$ source .venv/bin/activate
```

5. Install dependencies

```sh
$ make install
```

6. Deactivate virtualenv (if needed)

```sh
$ deactivate
```

---

Pierre Gallet © 2025
