Metadata-Version: 2.4
Name: pdb1topdb
Version: 1.0.0
Summary: Convert a protein 3D structure BioUnit file ('.pdb1') to a standard 3D structure file ('.pdb').
Author-email: Matsvei Tsishyn <matsvei.tsishyn@protonmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/MatsveiTsishyn/pdb1topdb
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C++
Classifier: Programming Language :: C
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file


# pdb1topdb

[![PyPi Version](https://img.shields.io/pypi/v/pdb1topdb.svg)](https://pypi.org/project/pdb1topdb/) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

A **Python package** to **convert** a protein 3D structure PDB-**BioUnit** file (such as `.pdb1` or `.pdb12`) to a standard 3D structure **PDB** file (`.pdb`).


### Installation
Install the `pdb1topdb` package with `pip`:
```bash
pip install pdb1topdb
```


### CLI usage
Use as a command-line interface (CLI):
```bash
pdb1topdb --help
pdb1topdb ./test_data/5ceg.pdb1 -o ./test_data/5ceg_biounit1.pdb
pdb1topdb ./test_data/5ceg.pdb1 # you can omit output path (will be derived)
```

Or convert all BioUnit files in a directory with a single command:
```bash
pdb1topdb-folder --help
pdb1topdb-folder ./test_data/ -o ./test_data/
```


### Python usage
Use directly in Python:
```python
# Import
from pdb1topdb import pdb1topdb, read_chains_mapping

# Convert .pdb1 -> .pdb
pdb_str, metadata = pdb1topdb(
    pdb1_path="./test_data/5ceg.pdb1",
    pdb_path="./test_data/5ceg_biounit1.pdb",
    remove_initial_biounit_file=False, # optional
    metadata_path=None, # optional
    verbose=True, # optional
)

# Read chains mapping
mapping = read_chains_mapping(pdb_path="./test_data/5ceg_biounit1.pdb")
```


### Description

**Why?**
Protein Data Bank entries obtained by X-ray crystallography describe the asymmetric unit (the smallest portion of the crystal from which the full unit cell can be generated by symmetry operations) as `.pdb` or `.cif` files, while the biological assembly (the biologically relevant complex of chains) is provided as a BioUnit file such as `.pdb1` or `.pdb12`. BioUnit files can be more biologically relevant, but many structural bioinformatics tools expect a standard `.pdb` file, so conversion is useful.

**How?**
`pdb1topdb` maps every chain from every model in the BioUnit to a unique chain ID in the output PDB. For example, if the BioUnit contains:

- Model 1: chains `A`, `B`
- Model 2: chains `A`, `B`

then `pdb1topdb` uses the following mapping:

- (Model 1, `A`) → `A`, (Model 1, `B`) → `B`
- (Model 2, `A`) → `C`, (Model 2, `B`) → `D`

It merges coordinates into a single model and updates chain-related metadata lines such as `SEQRES`, `HELIX`, and `LINK`. The chain mapping is injected in the header as `REMARK   0`.
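
To make the merging step concrete, here is a minimal sketch of how coordinate records from several models could be combined into one model while rewriting chain IDs. It is an illustration only, not the package's actual implementation: the function name `merge_models` and the `mapping` argument (a dictionary keyed by `(model, chain)`) are hypothetical. It relies only on the fixed-column PDB layout, where the chain ID sits in column 22 of `ATOM`, `HETATM`, `ANISOU`, and `TER` records. It does not renumber atom serials or update `SEQRES`, `HELIX`, or `LINK` records.

```python
# Hypothetical helper, for illustration only (not part of the pdb1topdb API).
COORDINATE_RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def merge_models(pdb1_lines, mapping):
    """Merge all models into one, rewriting chain IDs via `mapping`.

    `mapping` maps (model_number, old_chain_id) -> new_chain_id.
    """
    model = 1  # files without MODEL records are treated as a single model
    merged = []
    for line in pdb1_lines:
        record = line[:6].strip()
        if record == "MODEL":
            # Remember the current model number and drop the MODEL line.
            model = int(line[5:].split()[0])
        elif record == "ENDMDL":
            # Drop model terminators so the output is a single model.
            continue
        elif record in COORDINATE_RECORDS and len(line) > 21:
            # Column 22 (index 21) holds the chain ID.
            new_id = mapping[(model, line[21])]
            merged.append(line[:21] + new_id + line[22:])
        else:
            # Header and other records pass through unchanged in this sketch.
            merged.append(line)
    return merged
```

Keying the mapping on the `(model, chain)` pair is what lets two chains that are both named `A` in different models end up with distinct IDs in the merged output.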


### Technical Notes

A PDB chain ID is a single character, so the number of chains in the output is limited by the size of the available alphabet.

The default alphabet is:
`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`

Mapping rules:
- preserve a chain ID if it is in the alphabet and not already used
- otherwise assign the next unused character from the alphabet

This means the default conversion can handle up to 62 unique chain IDs. You can override the default alphabet with the `chains_alphabet` argument, but nonstandard chain IDs may cause compatibility issues with other software.
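
The mapping rules can be sketched in a few lines of Python. This is a simplified illustration under the assumption that chains are processed in file order in a single pass; the function name `assign_chain_ids` is hypothetical and the package's internal logic may differ.

```python
import string

# Default alphabet as listed above: upper case, lower case, then digits.
DEFAULT_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def assign_chain_ids(model_chains, alphabet=DEFAULT_ALPHABET):
    """Map (model, chain) pairs, given in file order, to unique chain IDs."""
    mapping = {}
    used = set()
    for key in model_chains:
        _, chain = key
        if len(chain) == 1 and chain in alphabet and chain not in used:
            # Rule 1: preserve the chain ID if it is valid and still free.
            new_id = chain
        else:
            # Rule 2: otherwise take the next unused character.
            new_id = next((c for c in alphabet if c not in used), None)
            if new_id is None:
                raise ValueError("more chains than characters in the alphabet")
        used.add(new_id)
        mapping[key] = new_id
    return mapping


# The two-model example from the description:
print(assign_chain_ids([(1, "A"), (1, "B"), (2, "A"), (2, "B")]))
# {(1, 'A'): 'A', (1, 'B'): 'B', (2, 'A'): 'C', (2, 'B'): 'D'}
```

With the default alphabet, a 63rd chain leaves no unused character, which is why the sketch raises an error at that point.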

