Metadata-Version: 2.4
Name: GeoBPE
Version: 0.1.0
Summary: Protein Structure Tokenization via Geometric Byte Pair Encoding (GeoBPE)
Author-email: Michael Sun <msun415@mit.edu>
License: MIT License
        
        Copyright (c) 2026 Michael Sun
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/shiningsunnyday/PT-BPE
Project-URL: Repository, https://github.com/shiningsunnyday/PT-BPE
Project-URL: Issues, https://github.com/shiningsunnyday/PT-BPE/issues
Keywords: protein,structure,tokenization,geometry,machine-learning
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: imageio
Requires-Dist: matplotlib
Requires-Dist: torch
Requires-Dist: biopython
Requires-Dist: biotite
Requires-Dist: tqdm
Requires-Dist: joblib
Requires-Dist: esm
Requires-Dist: sortedcontainers
Dynamic: license-file

# Protein Geometric Byte Pair Encoding
[![Preprint](https://img.shields.io/badge/Arxiv-red)](https://arxiv.org/abs/2511.11758)
[![OpenReview](https://img.shields.io/badge/OpenReview-blue)](https://openreview.net/forum?id=55e5f3GVFc)

![GeoBPE](https://raw.githubusercontent.com/shiningsunnyday/PT-BPE/main/data/assets/fig1.png)

This repo contains our implementation of Protein Structure Tokenization via Geometric Byte Pair Encoding (ICLR 2026).

## Overview

Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. We introduce GeoBPE, a geometry-grounded protein structure tokenizer (PST) that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints.

Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an SE(3) end-frame loss.
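Steps (i) and (ii) can be illustrated with a minimal sketch. This is not the GeoBPE implementation or its API: it assumes each Geo-Pair occurrence has already been summarized as a fixed-length feature vector, uses plain Euclidean distance, and runs a toy alternating k-medoids loop. The names `k_medoids`, `quantize`, and `occurrences` are hypothetical, and step (iii), the glue-angle optimization, is omitted entirely.

```python
import numpy as np


def pairwise_dist(x):
    # Euclidean distance matrix between all rows of x.
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def k_medoids(x, k, n_iter=50, seed=0):
    """Toy alternating k-medoids; returns the row indices of the medoids."""
    rng = np.random.default_rng(seed)
    dist = pairwise_dist(x)
    medoids = rng.choice(len(x), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every point to its nearest current medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def quantize(x, prototypes):
    """Map each row of x to the index of its nearest prototype."""
    diff = x[:, None, :] - prototypes[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).argmin(axis=1)


# Hypothetical stand-in for Geo-Pair descriptors: two noisy groups in 3-D.
rng = np.random.default_rng(0)
occurrences = np.concatenate([
    rng.normal(loc=0.0, scale=0.05, size=(20, 3)),
    rng.normal(loc=1.0, scale=0.05, size=(20, 3)),
])

medoid_idx = k_medoids(occurrences, k=2)   # step (i): cluster
prototypes = occurrences[medoid_idx]       # medoids are actual data points
tokens = quantize(occurrences, prototypes) # step (ii): nearest-medoid ids
```

In this sketch, `k` plays the role of the resolution knob: a larger `k` gives more prototypes and therefore a finer quantization of the same occurrences.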

![GeoBPE](https://raw.githubusercontent.com/shiningsunnyday/PT-BPE/main/data/assets/geobpe.jpg)

## Run GeoBPE

GeoBPE supports two sub-commands: `encode` and `induce`. Run `geobpe --help` for a description.

Run `geobpe encode --help` and `geobpe induce --help` to see detailed arguments.

We include the following resources to make it easy to use GeoBPE:
- [**GeoBPE API and Usage Guidelines Doc**](https://raw.githubusercontent.com/shiningsunnyday/PT-BPE/main/docs/hparam_guide.md) -- descriptions, intuitions, and guidelines on how to use GeoBPE effectively and efficiently.
- [**Experiment Logs**](https://raw.githubusercontent.com/shiningsunnyday/PT-BPE/main/docs/GeoBPE-logged-runs.pdf) -- a collection of past experiments with varying hyperparameter settings; look up settings and performance quickly to save iteration time.

## Citation

If you use GeoBPE in your research, please cite our paper:
```bibtex
@inproceedings{sun2025protein,
  title={Protein Structure Tokenization via Geometric Byte Pair Encoding},
  author={Sun, Michael and Yuan, Weize and Liu, Gang and Matusik, Wojciech and Zitnik, Marinka},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={https://arxiv.org/abs/2511.11758}
}
```

## Contact

Please contact msun415@mit.edu if you have any questions.
