Metadata-Version: 2.1
Name: tipeft
Version: 0.0.3
Summary: Tabular-Infused Parameter Efficient Finetuning (tipeft)
Author: Charles Alba
Author-email: alba@wustl.edu
Keywords: Parameter Efficient Finetuning,PEFT,AI in Medicine,AI in Healthcare,Postoperative Risk Prediction,IA3,LORA
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: license.txt
Requires-Dist: numpy>=2.0.2
Requires-Dist: pandas>=2.2.2
Requires-Dist: scikit-learn>=1.5
Requires-Dist: tqdm>=4.67
Requires-Dist: torch==2.8.0
Requires-Dist: transformers==4.57.0
Requires-Dist: peft==0.17.1
Requires-Dist: accelerate==1.10.1
Requires-Dist: evaluate==0.4.2
Requires-Dist: datasets==2.21.0



# tipeft



**T**abular-**i**nfused **P**arameter **E**fficient **F**ine**t**uning (tipeft) is a novel method that infuses tabular features into the initialization of re-parameterization parameter-efficient finetuning (PEFT) methods. This gives the newly introduced PEFT parameters, which are usually initialized independently, a well-informed starting point and added representational capacity.



![Overview of tipeft framework](https://raw.githubusercontent.com/cja5553/peft_postoperative_risk_prediction/main/Figure_1.jpg)



It is designed specifically for postoperative prediction in clinical care, where valuable, predictive pre-operative tabular features are often under-utilized in language model finetuning. It currently supports `LoRA` and `IA3`.





## Requirements  

### Dependencies





The following Python packages are required for `tipeft`:

- `torch`
- `transformers`
- `peft`
- `accelerate`
- `evaluate`
- `datasets`
- `numpy`
- `pandas`
- `scikit-learn`
- `tqdm`

Install dependencies with:

```bash
pip install torch transformers peft accelerate evaluate datasets numpy pandas scikit-learn tqdm
```



#### Note on PyTorch installation

Because PyTorch wheels vary by CUDA version and hardware, it is recommended to install PyTorch manually following the instructions at:

https://pytorch.org/ 
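
After installing, it can help to confirm which required packages are actually present in the active environment. The sketch below uses only the standard library; the names in `REQUIRED` are copied from the package's declared requirements, and `installed_versions` is a hypothetical helper, not part of `tipeft`.

```python
from importlib import metadata

# Distribution names copied from the package's declared requirements.
REQUIRED = [
    "torch", "transformers", "peft", "accelerate", "evaluate",
    "datasets", "numpy", "pandas", "scikit-learn", "tqdm",
]

def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name:14s} {version or 'MISSING'}")
```

Any package reported as `MISSING` needs to be installed before importing `tipeft`.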



### System Requirements



`tipeft` has been tested and verified on the following configuration:



<table>
  <tr>
    <th>Component</th>
    <th>Tested Version</th>
  </tr>
  <tr>
    <td>OS</td>
    <td>Windows 10</td>
  </tr>
  <tr>
    <td>Python</td>
    <td>3.9.19</td>
  </tr>
  <tr>
    <td>CUDA</td>
    <td>12.6</td>
  </tr>
</table>



#### Important Notes



- **Environment**: Run `tipeft` in a Jupyter notebook. Running it as a standalone Python script may cause multiprocessing issues.

- **CPU cores**: At least 10 CPU cores recommended (uses `Pool(processes=10)` internally).

- **GPU**: CUDA-compatible GPU required.

- **OS**: Tested on Windows. Linux/Mac compatibility not yet verified.
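
The notes above can be turned into a quick preflight check before training. This is a minimal sketch using only the standard library; `preflight_warnings` is a hypothetical helper, and the Jupyter detection (checking whether `ipykernel` has been imported) is a common heuristic rather than anything `tipeft` provides.

```python
import os
import sys

MIN_CPU_CORES = 10  # mirrors the Pool(processes=10) mentioned in the notes

def preflight_warnings(min_cores=MIN_CPU_CORES):
    """Return a list of human-readable warnings about the current environment."""
    warnings = []

    cores = os.cpu_count()  # can be None when the count cannot be determined
    if cores is None:
        warnings.append("Could not determine the number of CPU cores.")
    elif cores < min_cores:
        warnings.append(f"Only {cores} CPU cores detected; {min_cores} recommended.")

    # Heuristic (assumption): a Jupyter kernel has ipykernel loaded.
    if "ipykernel" not in sys.modules:
        warnings.append("Not running inside a Jupyter kernel.")

    if not sys.platform.startswith("win"):
        warnings.append("Non-Windows platform: compatibility not yet verified.")

    return warnings

for message in preflight_warnings():
    print("WARNING:", message)
```

An empty result means none of the conditions listed in the notes were triggered; it does not check for a GPU.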



#### Known Compatibility Limitations



1. **Jupyter only** - Uses `tqdm.notebook`, which may not display correctly outside Jupyter.

2. **Multiprocessing** - May behave differently on Linux/Mac due to different multiprocessing backends.



If you encounter issues on a different setup, please open an issue with your system info.
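
To gather the system info for an issue report, something like the following may be enough. It is a sketch built on the standard library; `system_info` and `package_version` are hypothetical helper names, and the choice of fields is an assumption about what is useful to report. It does not capture the CUDA version, so add that by hand.

```python
import json
import platform
from importlib import metadata

def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def system_info():
    """Collect basic OS, Python, and package version details."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "tipeft": package_version("tipeft"),
        "torch": package_version("torch"),
    }

# Paste this output into the issue.
print(json.dumps(system_info(), indent=2))
```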



#### GPU requirements



`tipeft` is designed for GPU acceleration.

- At least one CUDA-compatible GPU is required

- Suggested minimum: 16GB VRAM 

- Memory usage depends on:

    - sequence length

    - model size

    - batch size

    - peft configuration
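
To see how the available hardware compares with the suggested minimum, the total memory of each visible GPU can be queried through PyTorch. This is a sketch, not part of `tipeft`: `gpu_memory_gb` is a hypothetical helper, and it returns an empty list when PyTorch is not installed or no CUDA device is visible.

```python
SUGGESTED_MIN_GB = 16  # suggested minimum from the list above

def gpu_memory_gb():
    """Return total memory in GB (10**9 bytes) for each visible CUDA device."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / 1e9
        for i in range(torch.cuda.device_count())
    ]

memory = gpu_memory_gb()
if not memory:
    print("No CUDA device detected.")
for index, gb in enumerate(memory):
    status = "ok" if gb >= SUGGESTED_MIN_GB else "below suggested minimum"
    print(f"GPU {index}: {gb:.1f} GB ({status})")
```

Total memory is only an upper bound; actual headroom still depends on the factors listed above.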







## Installation

Install `tipeft` with pip:

```bash

pip install tipeft

```





## Usage



### `train_tabular_infused_IA3`



Trains a tabular-infused IA3 model for binary classification. 



```python
from tipeft import train_tabular_infused_IA3

# train_df and val_df are pandas DataFrames that contain the text column,
# the boolean label column, and the tabular feature columns named below.
model, tokenizer = train_tabular_infused_IA3(
    train=train_df,
    val=val_df,
    pretrained_model_name="emilyalsentzer/Bio_ClinicalBERT",
    label_col="in_hospital_mortality",
    text_col="clinical_notes",
    columns_unique_labels_of_tabular_features={
        "gender": 2,          # categorical: number of unique values
        "insurance": 3,
        "marital_status": 4,
        "anchor_age": 1,      # continuous: always 1
        "anchor_year": 1
    },
    lr=0.001,
    num_epochs=5,
    lr_of_tabular_infused_features=0.0001
)
```



#### Parameters



<table>
  <tr>
    <th>Parameter</th>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><code>train</code></td>
    <td>pandas.DataFrame</td>
    <td>Training dataframe containing text, label, and tabular feature columns</td>
  </tr>
  <tr>
    <td><code>val</code></td>
    <td>pandas.DataFrame</td>
    <td>Validation dataframe with same structure as train</td>
  </tr>
  <tr>
    <td><code>pretrained_model_name</code></td>
    <td>str</td>
    <td>Base model to fine-tune. Currently supports: <code>"emilyalsentzer/Bio_ClinicalBERT"</code> or <code>"microsoft/biogpt"</code></td>
  </tr>
  <tr>
    <td><code>label_col</code></td>
    <td>str</td>
    <td>Column name of the binary outcome label (must contain <code>True</code>/<code>False</code> values)</td>
  </tr>
  <tr>
    <td><code>text_col</code></td>
    <td>str</td>
    <td>Column name containing the clinical text</td>
  </tr>
  <tr>
    <td><code>columns_unique_labels_of_tabular_features</code></td>
    <td>dict</td>
    <td>Dictionary mapping tabular feature names to their number of unique values. Use <code>1</code> for continuous features, <code>&gt;1</code> for categorical features</td>
  </tr>
  <tr>
    <td><code>lr</code></td>
    <td>float</td>
    <td>Learning rate for final model training (default: <code>0.001</code>)</td>
  </tr>
  <tr>
    <td><code>num_epochs</code></td>
    <td>int</td>
    <td>Number of training epochs for final model (default: <code>5</code>)</td>
  </tr>
  <tr>
    <td><code>lr_of_tabular_infused_features</code></td>
    <td>float</td>
    <td>Learning rate for tabular feature pre-training (default: <code>0.0001</code>)</td>
  </tr>
</table>
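
The `columns_unique_labels_of_tabular_features` dictionary can be derived from the training dataframe instead of being counted by hand. The sketch below assumes you already know which columns are categorical and which are continuous; `build_feature_spec` is a hypothetical helper, and the toy dataframe is made up for illustration.

```python
import pandas as pd

def build_feature_spec(df, categorical_cols, continuous_cols):
    """Map categorical columns to their unique-value count and continuous ones to 1."""
    spec = {}
    for col in categorical_cols:
        n_unique = int(df[col].nunique())  # NaN values are not counted
        if n_unique < 2:
            raise ValueError(
                f"{col!r} has {n_unique} unique value(s); "
                "categorical features need more than one"
            )
        spec[col] = n_unique
    for col in continuous_cols:
        spec[col] = 1
    return spec

# Toy data for illustration only.
toy_df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "insurance": ["Medicare", "Private", "Medicaid"],
    "anchor_age": [64, 71, 58],
})

spec = build_feature_spec(toy_df, ["gender", "insurance"], ["anchor_age"])
print(spec)  # {'gender': 2, 'insurance': 3, 'anchor_age': 1}
```

The resulting dictionary can be passed directly as `columns_unique_labels_of_tabular_features`.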



#### Returns



<table>
  <tr>
    <th>Return</th>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><code>model</code></td>
    <td>PeftModel</td>
    <td>The trained IA3 model</td>
  </tr>
  <tr>
    <td><code>tokenizer</code></td>
    <td>AutoTokenizer</td>
    <td>The tokenizer for the model</td>
  </tr>
</table>





#### Notes



- The `label_col` must contain boolean values (`True`/`False`)

- Categorical features should have `>1` unique labels in `columns_unique_labels_of_tabular_features`

- Continuous/numerical features should have `1` as their value in `columns_unique_labels_of_tabular_features`

- Ensure all unique values in categorical columns appear in both train and val sets

- The trained model is saved to `trained_models/IA3_{pretrained_model_name}_{label_col}`
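
Several of these notes can be checked programmatically before calling the training function. The following is a sketch under stated assumptions: `check_inputs` is a hypothetical helper (not part of `tipeft`), and it only covers the boolean-label and categorical-value notes.

```python
import pandas as pd

def check_inputs(train, val, label_col, feature_spec):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # The label column must hold boolean values in both splits.
    for split_name, df in (("train", train), ("val", val)):
        if not pd.api.types.is_bool_dtype(df[label_col]):
            problems.append(f"{split_name}: {label_col!r} is not boolean")

    for col, n_unique in feature_spec.items():
        if n_unique == 1:
            continue  # continuous feature, nothing to compare
        train_values = set(train[col].dropna().unique())
        val_values = set(val[col].dropna().unique())
        if len(train_values) != n_unique:
            problems.append(
                f"{col!r}: expected {n_unique} unique values, "
                f"found {len(train_values)} in train"
            )
        not_shared = train_values ^ val_values  # present in only one split
        if not_shared:
            problems.append(
                f"{col!r}: values not present in both splits: "
                f"{sorted(map(str, not_shared))}"
            )
    return problems

# Toy data for illustration only.
toy_train = pd.DataFrame({"died": [True, False], "gender": ["F", "M"]})
toy_val = pd.DataFrame({"died": [False, True], "gender": ["F", "F"]})

for problem in check_inputs(toy_train, toy_val, "died", {"gender": 2}):
    print(problem)
```

If the label column holds `0`/`1` integers, `df[label_col].astype(bool)` converts it to the expected boolean type.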





## Questions?



Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)
