Metadata-Version: 2.4
Name: domain-finger-print
Version: 0.1.0a1
Summary: Build local CATH domain databases and generate structure-based domain fingerprints.
Author-email: Shuaiyuchen <shuaiyuchen4@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Phoenix-Chen-Git/protein_domain_fingerprint/tree/release/package-cleanup
Project-URL: Repository, https://github.com/Phoenix-Chen-Git/protein_domain_fingerprint/tree/release/package-cleanup
Keywords: protein,structure,domain,foldseek,cath,fingerprint
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: tqdm>=4.66
Dynamic: license-file

# domain-finger-print

`domain-finger-print` is a Python package for:

- building a local CATH domain database from official domain-only structures
- generating structure-only fingerprints against that database with Foldseek

The package scope is intentionally narrow: it builds databases and writes fingerprint `npz` files. Visualization is not part of the package API.

## What It Does

- Downloads official CATH classification files
- Downloads an official nonredundant CATH domain-only PDB archive (`S20` or `S40`)
- Extracts one PDB file per domain into a local `structures/` directory
- Excludes whole-chain entries (domain IDs ending in `00`) by default, so the local database contains only chopped domains
- Builds a SQLite metadata index for downstream search
- Generates fixed-width fingerprints over the full CATH superfamily vocabulary
- Stores `qTM` and `tTM` (TM-scores normalized by query length and by target length, respectively) as separate matrices, plus a stacked `[qTM || tTM]` matrix, in one compressed `npz`

## Install

```bash
pip install -e . --no-build-isolation
```

## Build a CATH Database

```bash
dfp build-db cath --out-dir ./cath_s20_db --redundancy 20
```

Useful flags:

- `--redundancy 20|40`
- `--version latest-release`
- `--keep-archive`
- `--include-whole-chain`
- `--force`

## Generate Fingerprints

Single query:

```bash
dfp fingerprint \
  --db ./cath_s20_db \
  --query ./queries/my_protein.pdb \
  --out ./results/my_protein_fingerprint.npz
```

Directory of queries:

```bash
dfp fingerprint \
  --db ./cath_s20_db \
  --query-dir ./queries \
  --glob "*.pdb" \
  --out ./results/fingerprints_full.npz \
  --workers 96 \
  --prefilter-max-seqs 100
```

Useful flags:

- `--foldseek tools/foldseek/bin/foldseek`
- `--foldseek-db ./foldseek_db/cath_s20`
- `--foldseek-gpu`
- `--workers 96`
- `--prefilter-max-seqs 100`
- `--foldseek-sensitivity 9.5`
- `--min-domain-length 0`
- `--min-aligned-length 60`
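
When batch runs are launched from a Python script, it can help to assemble the command line in one place. The sketch below only builds the argument list from flags shown above; the helper name `build_fingerprint_command` is hypothetical and is not part of the package API.

```python
import shlex


def build_fingerprint_command(db, query_dir, out, pattern="*.pdb",
                              workers=None, prefilter_max_seqs=None):
    """Build a `dfp fingerprint` argument list for a directory of queries.

    Hypothetical helper: it only uses flags that appear in this README.
    """
    cmd = [
        "dfp", "fingerprint",
        "--db", str(db),
        "--query-dir", str(query_dir),
        "--glob", pattern,
        "--out", str(out),
    ]
    # Optional flags are appended only when a value is given.
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if prefilter_max_seqs is not None:
        cmd += ["--prefilter-max-seqs", str(prefilter_max_seqs)]
    return cmd


cmd = build_fingerprint_command(
    "./cath_s20_db", "./queries", "./results/fingerprints_full.npz",
    workers=8, prefilter_max_seqs=100,
)
# Print a shell-quoted version for logging; nothing is executed here.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `dfp` is installed and on `PATH`.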

## Output Format

The package writes one compressed `npz` containing:

- `query_labels`
- `q_feature_labels`
- `t_feature_labels`
- `stacked_feature_labels`
- `q_matrix`
- `t_matrix`
- `stacked_matrix`
- `metadata_json`

The feature space is fixed by the CATH superfamily vocabulary, so every run against the same database produces fingerprints with the same dimensionality.
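
A minimal sketch of reading such an archive with NumPy follows. The key names come from the list above; the array shapes (one row per query, one column per superfamily), the storage of `metadata_json` as a single JSON string, and the toy label values are assumptions made for illustration, and `docs/fingerprint_npz_schema.md` remains the authoritative description. The example writes a synthetic archive first so that it runs without a real database.

```python
import json
import os
import tempfile

import numpy as np


def summarize_fingerprints(npz_path):
    """Return basic shape information for a fingerprint archive."""
    with np.load(npz_path, allow_pickle=False) as data:
        # Assumption: metadata_json is a JSON document stored as one string.
        metadata = json.loads(data["metadata_json"].item())
        return {
            "n_queries": len(data["query_labels"]),
            "q_shape": data["q_matrix"].shape,
            "t_shape": data["t_matrix"].shape,
            "stacked_shape": data["stacked_matrix"].shape,
            "metadata_keys": sorted(metadata),
        }


def write_synthetic_archive(path, n_queries=2, n_superfamilies=3):
    """Write a toy archive with the documented keys (all values made up)."""
    rng = np.random.default_rng(0)
    q = rng.random((n_queries, n_superfamilies))
    t = rng.random((n_queries, n_superfamilies))
    names = [f"sf{i}" for i in range(n_superfamilies)]  # placeholder labels
    q_labels = [f"q:{name}" for name in names]
    t_labels = [f"t:{name}" for name in names]
    np.savez_compressed(
        path,
        query_labels=np.array([f"query{i}" for i in range(n_queries)]),
        q_feature_labels=np.array(q_labels),
        t_feature_labels=np.array(t_labels),
        stacked_feature_labels=np.array(q_labels + t_labels),
        q_matrix=q,
        t_matrix=t,
        stacked_matrix=np.hstack([q, t]),
        metadata_json=np.array(json.dumps({"note": "synthetic"})),
    )


demo_path = os.path.join(tempfile.mkdtemp(), "demo_fingerprint.npz")
write_synthetic_archive(demo_path)
print(summarize_fingerprints(demo_path))
```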

The stacked fingerprint is:

```text
[qTM over all superfamilies || tTM over all superfamilies]
```
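
The concatenation above can be sketched with NumPy as follows. This illustrates the layout only and is not the package's internal code; the function names are hypothetical, and the sketch assumes that `qTM` and `tTM` share the same superfamily ordering and therefore the same width.

```python
import numpy as np


def stack_fingerprint(q_tm, t_tm):
    """Concatenate qTM and tTM along the feature axis: [qTM || tTM]."""
    q_tm = np.asarray(q_tm, dtype=float)
    t_tm = np.asarray(t_tm, dtype=float)
    if q_tm.shape != t_tm.shape:
        raise ValueError("qTM and tTM must cover the same superfamilies")
    return np.concatenate([q_tm, t_tm], axis=-1)


def split_fingerprint(stacked):
    """Recover the qTM and tTM halves of a stacked fingerprint."""
    stacked = np.asarray(stacked)
    if stacked.shape[-1] % 2 != 0:
        raise ValueError("stacked width must be even")
    half = stacked.shape[-1] // 2
    return stacked[..., :half], stacked[..., half:]


# One query scored against three superfamilies (made-up values).
q = np.array([[0.1, 0.0, 0.7]])
t = np.array([[0.2, 0.0, 0.5]])
stacked = stack_fingerprint(q, t)
print(stacked.shape)
```

With `N` superfamilies the stacked fingerprint has width `2 * N`, so the first `N` columns can be read back as `qTM` and the last `N` as `tTM`.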

Schema details are documented in [`docs/fingerprint_npz_schema.md`](docs/fingerprint_npz_schema.md).

## Output Layout

`dfp build-db` writes the database directory with this layout:

```text
cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
```
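
Before running a long fingerprint job, a quick sanity check of the database directory can be useful. The sketch below checks for the entries shown in the tree above and lists the tables in `metadata.sqlite` without assuming anything about its schema. The helper names are hypothetical, and the demo runs against a throwaway directory rather than a real database.

```python
import sqlite3
import tempfile
from pathlib import Path

# Entries taken from the layout shown above.
EXPECTED_ENTRIES = ("db_info.json", "downloads", "metadata.sqlite", "structures")


def missing_entries(db_dir):
    """Return the expected entries that are absent from db_dir."""
    root = Path(db_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


def list_tables(sqlite_path):
    """List table names in a SQLite file, opened read-only."""
    uri = Path(sqlite_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [row[0] for row in rows]


# Demo on a temporary directory that deliberately lacks metadata.sqlite.
demo_root = Path(tempfile.mkdtemp()) / "cath_s20_db"
(demo_root / "structures").mkdir(parents=True)
(demo_root / "downloads").mkdir()
(demo_root / "db_info.json").write_text("{}")
print(missing_entries(demo_root))
```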

## Notes

- CATH `latest-release` provides nonredundant domain-only PDB archives for `S20` and `S40`.
- By default the builder removes CATH entries whose domain number is `00`, because those represent whole-chain entries without domain chopping.
- Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
- Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
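
The whole-chain rule in the notes above can be sketched as a simple filter on CATH domain IDs, whose last two characters are the domain number. This is an illustration only: the builder's actual implementation may differ, the function names are hypothetical, and the IDs below are made-up examples in CATH's format.

```python
def is_whole_chain(domain_id):
    """True when the domain number (last two characters) is '00'."""
    return domain_id.endswith("00")


def select_domains(domain_ids, include_whole_chain=False):
    """Drop whole-chain entries unless include_whole_chain is set.

    The parameter mirrors the `--include-whole-chain` CLI flag by name only.
    """
    if include_whole_chain:
        return list(domain_ids)
    return [d for d in domain_ids if not is_whole_chain(d)]


# Made-up IDs: 4-character PDB code, chain letter, 2-digit domain number.
ids = ["1abcA00", "1abcA01", "2xyzB02"]
print(select_domains(ids))
```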
