Metadata-Version: 2.4
Name: plicat_model
Version: 0.1.0
Summary: A custom PLiCat model for lipid-binding protein prediction
Home-page: https://huggingface.co/Noora68/PLiCat-0.4B
Author: FeitongDong
Author-email: 12031011@mail.sustech.edu.cn
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.7.0
Requires-Dist: transformers>=4.32.0
Requires-Dist: esm>=0.5.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

---
license: mit
language:
- en
base_model:
- EvolutionaryScale/esmc-300m-2024-12
- google-bert/bert-base-uncased
new_version: Noora68/PLiCat-0.4B
tags:
- biology
- protein
- protein classification
- lipid binding
- lipid binding site
- recognition
---

---

# PLiCat (Protein–Lipid interaction Categorization tool)

We present PLiCat (Protein–Lipid interaction Categorization tool), a robust
tool for predicting which lipid categories a protein interacts with, using
the protein sequence as the only input. By fusing the ESM C and BERT models
into a combined architecture, our method gives accurate and interpretable
predictions that distinguish lipid-binding signatures among the 8 major
lipid categories defined by LIPID MAPS. PLiCat is intended to facilitate
the exploration of lipid-binding specificity and rational protein design.

---
-   **Paper**: [https://...](https://....)
-   **GitHub Repository**: [https://github.com/Noora68/PLiCat](https://github.com/Noora68/PLiCat)
-   **Online Demo**: [Google Colab notebook](https://colab.research.google.com/drive/1wGSZsy7KyYoJf2PiXzP4SVLXonl-cWb9?usp=sharing)

---

## Model Details
- **Architecture**: ESM Cambrian + BERT + classification head  
- **Task**: Multi-label protein-lipid binding prediction  
- **Fine-tuned from**: `ESMC_300m` + `bert-base-uncased`  
- **Developed by**: Noora68  
- **Framework**: PyTorch + HuggingFace Transformers  

---

**Model usage workflow:**

1. Load the model and tokenizer
2. Process the input sequence (tokenize → batch → pad → mask)
3. Run inference to obtain logits → probabilities
4. Output the results and mark high-confidence categories
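
Step 2 can be illustrated without loading the model. The following is a minimal sketch in plain Python, assuming a hypothetical padding id of `0` and made-up token ids (neither comes from `EsmSequenceTokenizer`); the helper name `pad_and_mask` is also only illustrative. It shows what "pad → mask" produces for a batch of two sequences of different lengths.

```python
# Toy illustration of padding and masking.
# PAD_ID and the token ids below are made up for this sketch; the real values
# come from the tokenizer used in the Usage section.
PAD_ID = 0  # hypothetical padding id


def pad_and_mask(batch, pad_id=PAD_ID):
    """Right-pad every id list to the longest one and build a 0/1 mask."""
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return padded, mask


padded, mask = pad_and_mask([[5, 9, 7, 3], [5, 8]])
print(padded)  # [[5, 9, 7, 3], [5, 8, 0, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The mask here is derived from the original lengths. Comparing the padded ids against the padding id gives the same result as long as the padding id never appears as a real token.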

---

## Usage

```python
from plicat_model import PLiCat
import torch
from torch.nn.utils.rnn import pad_sequence
from esm.tokenization import EsmSequenceTokenizer

# Set device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = EsmSequenceTokenizer()

# Default lipid type dictionary
default_dict = {
    "0": "NotLipidType",
    "1": "Fatty Acyl (FA)",
    "2": "Prenol Lipid (PR)",
    "3": "Glycerophospholipid (GP)",
    "4": "Sterol Lipid (ST)",
    "5": "Polyketide (PK)",
    "6": "Glycerolipid (GL)",
    "7": "Sphingolipid (SP)",
    "8": "Saccharolipid (SL)"
}

# Load pretrained PLiCat model
model = PLiCat.from_pretrained("Noora68/PLiCat-0.4B").to(device)

# Example protein sequence
sequence = "MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"

# Tokenize the sequence -> input_ids
input_ids = torch.tensor(tokenizer.encode(sequence))

# Add batch dimension: (batch_size=1, length)
input_ids = input_ids.unsqueeze(0)

# Pad to the longest sequence in the batch
input_ids_padded = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)

# Build attention mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids_padded != tokenizer.pad_token_id).long()

# Move tensors to the same device as model
input_ids_padded = input_ids_padded.to(device)
attention_mask = attention_mask.to(device)

# Forward pass (no gradient needed during inference)
with torch.no_grad():
    outputs = model(input_ids_padded, attention_mask)

# Convert logits to probabilities using sigmoid
probs = torch.sigmoid(outputs['logits'])

# Convert to CPU and numpy array
probs = probs.squeeze().detach().cpu().numpy()

# Print results: add a check mark if probability > 0.6
for i, p in enumerate(probs):
    mark = " √" if p > 0.6 else ""
    print(f"{default_dict[str(i)]:<25}: {p:.4f}{mark}")

```

## Example Output
```
NotLipidType             : 0.0007
Fatty Acyl (FA)          : 0.1092
Prenol Lipid (PR)        : 0.9178 √
Glycerophospholipid (GP) : 0.6059 √
Sterol Lipid (ST)        : 0.0083
Polyketide (PK)          : 0.0026
Glycerolipid (GL)        : 0.0771
Sphingolipid (SP)        : 0.0002
Saccharolipid (SL)       : 0.0000
```
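
Because the task is multi-label, each category gets its own sigmoid probability and more than one category can pass the cut-off. As a rough sketch, the probabilities printed above can be reduced to a list of predicted categories as follows. The helper name `select_categories` is hypothetical and not part of `plicat_model`; the 0.6 threshold is the one used for the check mark in the example.

```python
# Category names in the same order as default_dict in the Usage section.
LABELS = [
    "NotLipidType",
    "Fatty Acyl (FA)",
    "Prenol Lipid (PR)",
    "Glycerophospholipid (GP)",
    "Sterol Lipid (ST)",
    "Polyketide (PK)",
    "Glycerolipid (GL)",
    "Sphingolipid (SP)",
    "Saccharolipid (SL)",
]


def select_categories(probs, labels=LABELS, threshold=0.6):
    """Return (label, probability) pairs whose probability exceeds threshold."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return [(label, p) for label, p in zip(labels, probs) if p > threshold]


# Probabilities copied from the example output above.
example_probs = [0.0007, 0.1092, 0.9178, 0.6059, 0.0083, 0.0026, 0.0771, 0.0002, 0.0000]
print(select_categories(example_probs))
# [('Prenol Lipid (PR)', 0.9178), ('Glycerophospholipid (GP)', 0.6059)]
```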
---

## Limitations

* The model was trained only on lipid-binding protein data and may not generalize to other protein functions.
* Model performance is best with sequence lengths under 500.
* Dataset size is limited compared to large-scale protein corpora.
* Model may reflect biases present in training data (e.g., under-representation of certain lipid types).
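
The length limitation can be checked before running inference. The sketch below assumes that "length" means the number of residues in the input string, which the list above does not state explicitly; the helper name `length_note` is hypothetical and not part of `plicat_model`.

```python
def length_note(sequence, max_len=500):
    """Return a warning string if the sequence is not under max_len, else None.

    Assumption: length is counted in residues (characters of the input
    string), not in tokens after tokenization.
    """
    n = len(sequence.strip())
    if n >= max_len:
        return f"sequence length {n} is not under {max_len}; predictions may be less reliable"
    return None


# The example sequence from the Usage section is well under the limit.
print(length_note("MDSNFLKYLSTAPVLFTVWLSFTASFIIEANRFFPDMLYFPM"))  # None
```

A caller could log the returned note and still run the prediction, since the limitation describes where performance is best rather than a hard input limit.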

---


## Citation

If you use this model, please cite:

```bibtex
@article{your2025paper,
  title={Deciphering the code of lipid binding by large language model},
  author={Feitong Dong,},
  journal={Bioinformatics},
  year={2025}
}
```

---

## License

MIT License

---
