Metadata-Version: 2.4
Name: domain-finger-print
Version: 0.1.0a4
Summary: Build local CATH domain databases and generate structure-based domain fingerprints.
Author-email: Shuaiyuchen <shuaiyuchen4@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Phoenix-Chen-Git/protein_domain_fingerprint/tree/release/package-cleanup
Project-URL: Repository, https://github.com/Phoenix-Chen-Git/protein_domain_fingerprint/tree/release/package-cleanup
Keywords: protein,structure,domain,foldseek,cath,fingerprint
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: tqdm>=4.66
Dynamic: license-file

# domain-finger-print

`domain-finger-print` is a Python package for:

- building a local CATH domain database from official domain-only structures
- generating structure-only fingerprints against that database with Foldseek

The package scope is intentionally narrow: it builds databases and writes fingerprint `npz` files. Visualization is not part of the package API.

## What It Does

- Downloads official CATH classification files
- Downloads an official nonredundant CATH domain-only PDB archive (`S20` or `S40`)
- Extracts one PDB file per domain into a local `structures/` directory
- Excludes whole-chain entries (domain IDs ending in `00`) by default, so the local database focuses on chopped domains
- Builds a SQLite metadata index for downstream search
- Generates fixed-width fingerprints over the full target CATH domain vocabulary
- Stores one compact fingerprint matrix in which each hit score is `(qTM + tTM) / 2`, the mean of the query-normalized and target-normalized TM-scores
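
The default whole-chain exclusion can be pictured as a simple suffix test. The sketch below assumes standard seven-character CATH domain IDs (four-character PDB code, chain character, two-digit domain number). The function names and example IDs are illustrative only and are not part of the package API.

```python
def is_whole_chain(domain_id: str) -> bool:
    """Return True when a CATH domain ID carries the `00` domain number."""
    return domain_id[-2:] == "00"


def select_domains(domain_ids, include_whole_chain=False):
    """Keep chopped domains; keep `00` entries only when explicitly requested."""
    return [
        d for d in domain_ids
        if include_whole_chain or not is_whole_chain(d)
    ]


# Made-up IDs in CATH format, used purely for illustration.
ids = ["1abcA00", "1abcA01", "2xyzB02"]
print(select_domains(ids))
```

Passing `include_whole_chain=True` mirrors the intent of the `--include-whole-chain` flag, under the same assumptions.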

## Install

```bash
pip install domain-finger-print
```

Foldseek must be installed separately and available as `foldseek` on `PATH`, or passed with `--foldseek`.
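
To check ahead of time whether Foldseek can be found, a lookup along these lines may help. This is a minimal sketch using only the standard library; `resolve_foldseek` is a hypothetical helper, not the package's own lookup logic.

```python
import shutil
from pathlib import Path


def resolve_foldseek(explicit=None):
    """Return a path to a Foldseek executable, or None if none is found.

    `explicit` stands in for a user-supplied `--foldseek` value;
    when it is omitted, PATH is searched for `foldseek`.
    """
    if explicit is not None:
        candidate = Path(explicit).expanduser()
        return str(candidate) if candidate.is_file() else None
    return shutil.which("foldseek")


path = resolve_foldseek()
print(path if path else "foldseek not found on PATH; pass --foldseek")
```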

## First-Time Setup

Configure Foldseek, download CATH S20/S40, and prebuild Foldseek indexes:

```bash
dfp init
```

This writes a config file to:

```text
~/.config/domain_finger_print/config.json
```
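
To see what `dfp init` recorded, the file can be read as ordinary JSON. The sketch below makes no assumptions about key names and only lists whatever the file contains; `load_config` is a hypothetical helper for illustration.

```python
import json
from pathlib import Path

# Default location taken from the text above; pass another path if you used --config.
DEFAULT_CONFIG = Path("~/.config/domain_finger_print/config.json").expanduser()


def load_config(path=DEFAULT_CONFIG):
    """Return the parsed config, or None when the file is missing."""
    path = Path(path).expanduser()
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


config = load_config()
if config is None:
    print("No config found; run `dfp init` first.")
else:
    # No key names are assumed here; just show what the file contains.
    print(sorted(config) if isinstance(config, dict) else config)
```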

Useful setup variants:

```bash
dfp init --redundancy 20
dfp init --redundancy 40 --default-redundancy 40
dfp init --redundancy both --data-dir ~/.cache/domain_finger_print
dfp init --foldseek /path/to/foldseek
```

## Build a CATH Database

```bash
dfp build-db cath --out-dir ./cath_s20_db --redundancy 20
```

Useful flags:

- `--redundancy 20|40`
- `--version latest-release`
- `--keep-archive`
- `--include-whole-chain`
- `--force`

## Generate Fingerprints

Single query:

```bash
dfp fingerprint \
  --query ./queries/my_protein.pdb \
  --out ./results/my_protein_fingerprint.npz
```

If `--db` is omitted, `dfp fingerprint` uses the configured default database from `dfp init`. If no config exists, it automatically initializes CATH S20 first.

Directory of queries:

```bash
dfp fingerprint \
  --query-dir ./queries \
  --glob "*.pdb" \
  --recursive \
  --out ./results/fingerprints_full.npz \
  --workers 96 \
  --prefilter-max-seqs 100
```

Python API:

```python
from domain_finger_print import collect_query_paths, generate_fingerprints

query_paths = collect_query_paths(query_dir="./queries", recursive=True)
generate_fingerprints(
    query_paths=query_paths,
    db_root="./cath_s20_db",
    out_path="./results/fingerprints.npz",
    workers=8,
)
```

Use a different configured redundancy level (here S40):

```bash
dfp fingerprint \
  --redundancy 40 \
  --query-dir ./queries \
  --out ./results/fingerprints_s40.npz
```

Useful flags:

- `--db ./cath_s20_db`
- `--redundancy 20|40`
- `--config ~/.config/domain_finger_print/config.json`
- `--foldseek tools/foldseek/bin/foldseek`
- `--foldseek-db ./foldseek_db/cath_s20`
- `--foldseek-gpu`
- `--workers 96`
- `--prefilter-max-seqs 100`
- `--recursive`
- `--foldseek-sensitivity 9.5`
- `--foldseek-verbosity 0`
- `--min-domain-length 0`
- `--min-aligned-length 60`

## Output Format

The package writes one compressed `npz` containing:

- `query_labels`
- `feature_labels`
- `fingerprint_matrix`
- `metadata_json`

The feature space is fixed by the target CATH domain IDs, so every run against the same database produces fingerprints with the same dimensionality.
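
The four arrays listed above can be read back with NumPy. So that the snippet runs on its own, it first writes a toy file with the same key names. The toy labels and values are invented, and it assumes `metadata_json` holds a JSON document serialized as a string; see the schema document for the authoritative layout.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Toy file with the same four keys (values are invented for illustration).
# Assumption: `metadata_json` is a JSON document stored as a string.
toy_path = Path(tempfile.mkdtemp()) / "toy_fingerprint.npz"
np.savez_compressed(
    toy_path,
    query_labels=np.array(["query_a", "query_b"]),
    feature_labels=np.array(["dom_1", "dom_2", "dom_3"]),
    fingerprint_matrix=np.array(
        [[0.0, 0.71, 0.0], [0.55, 0.0, 0.0]], dtype=np.float32
    ),
    metadata_json=np.array(
        json.dumps({"feature_domain_ids": ["dom_1", "dom_2", "dom_3"]})
    ),
)

with np.load(toy_path) as data:
    query_labels = data["query_labels"]
    feature_labels = data["feature_labels"]
    matrix = data["fingerprint_matrix"]
    metadata = json.loads(data["metadata_json"].item())

# One row per query, one column per target domain.
print(matrix.shape, len(query_labels), len(feature_labels))
print(metadata["feature_domain_ids"])
```

For a real output file, replace `toy_path` with the path passed to `--out`.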

The fingerprint score is:

```text
tm_score = (qTM + tTM) / 2
```

Each target CATH domain has its own column. Because hits are not pooled by superfamily, the fingerprint keeps finer structural detail than a superfamily-level fingerprint. Target domains that Foldseek does not return, or whose hits are filtered out by `--min-aligned-length`, are stored as `0`.
The domain ID for each column is stored in `metadata_json["feature_domain_ids"]`.
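
As a rough illustration of the scoring rule, one fingerprint row could be assembled as below. This is a sketch under stated assumptions, not the package's implementation: the hit tuple layout, the function name, the strict "shorter than the threshold" comparison, and the choice to keep the best score when a target appears more than once are all hypothetical.

```python
def build_fingerprint_row(hits, feature_ids, min_aligned_length=0):
    """Map Foldseek-style hits onto a fixed-width row of averaged TM-scores.

    Each hit is assumed to be (target_id, q_tm, t_tm, aligned_length).
    """
    column = {domain_id: i for i, domain_id in enumerate(feature_ids)}
    row = [0.0] * len(feature_ids)  # targets without a surviving hit stay 0
    for target_id, q_tm, t_tm, aligned_length in hits:
        if aligned_length < min_aligned_length:
            continue  # assumption: hits shorter than the threshold are dropped
        if target_id not in column:
            continue  # ignore targets outside the fixed feature space
        score = (q_tm + t_tm) / 2
        # Assumption: if a target appears more than once, keep its best score.
        row[column[target_id]] = max(row[column[target_id]], score)
    return row


feature_ids = ["dom_1", "dom_2", "dom_3"]
hits = [("dom_2", 0.5, 0.75, 120), ("dom_3", 0.5, 0.5, 40)]
row = build_fingerprint_row(hits, feature_ids, min_aligned_length=60)
print(row)
```

The row length always equals `len(feature_ids)`, which is what keeps the dimensionality constant across runs against the same database.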

Schema details are documented in [`docs/fingerprint_npz_schema.md`](docs/fingerprint_npz_schema.md).

## Output Layout

A database directory built with `dfp build-db` has this layout:

```text
cath_s20_db/
├── db_info.json
├── downloads/
├── metadata.sqlite
└── structures/
```
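
A quick sanity check of a database directory against this layout might look like the following. `missing_db_entries` is a hypothetical helper. Whether `downloads/` persists when `--keep-archive` is not used is not stated here, so a missing `downloads/` is best treated as informational.

```python
import tempfile
from pathlib import Path

# Entry names taken from the layout shown above.
EXPECTED_FILES = ("db_info.json", "metadata.sqlite")
EXPECTED_DIRS = ("downloads", "structures")


def missing_db_entries(db_root):
    """Return the names of expected files or directories absent from db_root."""
    root = Path(db_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


# Demonstrate on a throwaway directory that has only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "structures").mkdir()
    (Path(tmp) / "db_info.json").write_text("{}")
    print(missing_db_entries(tmp))
```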

## Notes

- CATH `latest-release` provides nonredundant domain-only PDB archives for `S20` and `S40`.
- CATH also publishes S35/S60 domain list files, but not matching nonredundant domain-only PDB archives in the same `latest-release` directory.
- By default the builder removes CATH entries whose domain number is `00`, because those represent whole-chain entries without domain chopping.
- Foldseek is used directly for structure search and scoring; the package does not currently expose TM-align reranking.
- Visualization, PCA, UMAP, and heatmaps are kept out of the installable package on purpose.
