Metadata-Version: 2.4
Name: criscross
Version: 0.1.3
Summary: CPU-friendly sequence-only CRISCross off-target prediction with a scikit-learn-style API.
Project-URL: Homepage, https://github.com/alexv-/criscross
Author: CRISCross CPU contributors
License: MIT
License-File: LICENSE
Keywords: bioinformatics,crispr,deep-learning,off-target,transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: numpy>=1.23
Requires-Dist: pandas>=1.5
Requires-Dist: pyfaidx>=0.7
Requires-Dist: torch>=2.0
Requires-Dist: zstandard>=0.21
Provides-Extra: test
Requires-Dist: pytest>=7; extra == 'test'
Requires-Dist: torchmetrics>=1.0; extra == 'test'
Description-Content-Type: text/markdown

# criscross

CPU-friendly **sequence-only** CRISCross off-target prediction with a
scikit-learn-style API. The pretrained model weights ship inside the
wheel (fp16 + zstd-compressed) so no extra downloads are required.

## Install

```bash
pip install criscross
```

CPU-only install (no CUDA libraries pulled in):

```bash
pip install criscross --extra-index-url https://download.pytorch.org/whl/cpu
```

## Quickstart

```python
from criscross import sequence_model
import pandas as pd

# Single datapoint (dict)
prob = sequence_model.predict({
    "Guide_sequence":   "GCTCGGGGACACAGGATCCCTGG",     # 23 nt
    "off_target_512nt": "GCAG...TGCC",                 # 512 nt; reverse-complemented for - strand
    "strand_id":        1,                             # 1 for +, 0 for -
})
print(prob)   # float in [0, 1]

# Dataset (DataFrame or CSV path)
df = pd.read_csv("examples/sample_input.csv")
probs = sequence_model.predict(df)           # -> np.ndarray, shape [N]
probs = sequence_model.predict("examples/sample_input.csv")  # same
```
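
The `off_target_512nt` window must already be reverse-complemented when the site lies on the `-` strand. As a minimal sketch of that step, assuming you have a window read from the `+` strand of the reference, the helpers below orient it using the `strand_id` convention from this README. `reverse_complement` and `orient_window` are illustrative names, not part of the criscross API.

```python
# Complement table for DNA, including N; lowercase (soft-masked) bases keep their case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_window(plus_strand_window: str, strand_id: int) -> str:
    """Orient a window that was read from the + strand of the reference.

    strand_id follows the README convention: 1 for +, 0 for -.
    """
    if strand_id not in (0, 1):
        raise ValueError("strand_id must be 0 or 1")
    if strand_id == 1:
        return plus_strand_window
    return reverse_complement(plus_strand_window)


print(reverse_complement("GATTACA"))  # TGTAATC
```

How the 512 nt window is positioned around the candidate site is not covered here; this sketch only handles strand orientation.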

## Preparing inputs from a genome scan (Cas-OFFinder)

If you have one or more guide RNAs and a reference genome FASTA, you can
generate the table of `Guide_sequence`, `off_target_512nt`, and `strand_id`
columns with Cas-OFFinder and pass it directly to `sequence_model.predict(...)`.

```python
from criscross import offinder, sequence_model

X = offinder.prepare(
    guide_rnas=["GCTCGGGGACACAGGATCCCTGG"],
    fasta="/path/to/GRCh38.primary_assembly.genome.fa",
    pam="NGG",            # default
    max_mismatches=6,     # default
)

# X is a DataFrame you can pass straight to criscross
probs = sequence_model.predict(X)
```

Requirements:
- Cas-OFFinder needs an **OpenCL runtime** even for CPU mode. On Linux, the
  simplest CPU runtime is PoCL, e.g. `conda install -c conda-forge pocl ocl-icd-system`
  (or `sudo apt install pocl-opencl-icd`).
- Cas-OFFinder 2.4.1 is **bundled inside the criscross wheel**. If you prefer
  to use your own build, set `CAS_OFFINDER=/path/to/cas-offinder` (or pass
  `cas_offinder_path=`).
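
To check which external Cas-OFFinder build your environment would point at before running a scan, something like the following can help. This is a sketch under assumptions: `resolve_cas_offinder` is a hypothetical helper, not a criscross function, and the precedence shown (explicit argument, then `CAS_OFFINDER`, then a `cas-offinder` executable on `PATH`) is a guess rather than the package's documented lookup order. The bundled binary is not modelled.

```python
import os
import shutil


def resolve_cas_offinder(cas_offinder_path=None, env=None):
    """Return a user-supplied Cas-OFFinder path, or None if none is configured.

    Assumed precedence (not taken from criscross itself):
    explicit argument > CAS_OFFINDER environment variable > PATH lookup.
    """
    env = os.environ if env is None else env
    candidate = cas_offinder_path or env.get("CAS_OFFINDER")
    if candidate:
        return candidate
    # shutil.which returns None when no executable with this name is on PATH.
    return shutil.which("cas-offinder")


print(resolve_cas_offinder(env={"CAS_OFFINDER": "/opt/cas-offinder/bin/cas-offinder"}))
```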

## Accepted inputs to `predict(X)`

| `X`                                                               | Returned                  |
|-------------------------------------------------------------------|---------------------------|
| `dict` / `pandas.Series` with the 3 required keys                 | `float`                   |
| `(guide, off_target_512nt, strand_id)` 3-tuple                    | `float`                   |
| `pandas.DataFrame` with the 3 required columns                    | `np.ndarray` shape `[N]`  |
| `list` of dicts                                                   | `np.ndarray` shape `[N]`  |
| `str` / `pathlib.Path` pointing to a CSV with the 3 columns       | `np.ndarray` shape `[N]`  |

Required columns/keys:

| key                | dtype         | meaning                                                                      |
|--------------------|---------------|------------------------------------------------------------------------------|
| `Guide_sequence`   | 23 nt string  | sgRNA guide sequence                                                         |
| `off_target_512nt` | 512 nt string | candidate off-target window, **already reverse-complemented** for `-` strand |
| `strand_id`        | int 0/1       | `1` for `+` strand, `0` for `-`                                              |
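
Before calling `predict`, it can be useful to check rows against the constraints in the table above. The following is a minimal sketch: `validate_row` is a hypothetical helper, not part of criscross, and it checks only the documented lengths and the `strand_id` values. Which characters the model accepts (for example `N` or lowercase bases) is not stated here, so the sketch does not check the alphabet.

```python
def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one input row (empty if none)."""
    problems = []
    guide = row.get("Guide_sequence")
    window = row.get("off_target_512nt")
    strand = row.get("strand_id")

    # Lengths come from the "Required columns/keys" table.
    if not isinstance(guide, str) or len(guide) != 23:
        problems.append("Guide_sequence must be a 23 nt string")
    if not isinstance(window, str) or len(window) != 512:
        problems.append("off_target_512nt must be a 512 nt string")
    # 1 means + strand, 0 means - strand.
    if strand not in (0, 1):
        problems.append("strand_id must be 0 or 1")
    return problems


row = {
    "Guide_sequence": "GCTCGGGGACACAGGATCCCTGG",
    "off_target_512nt": "A" * 512,  # placeholder window, not real data
    "strand_id": 1,
}
print(validate_row(row))  # []
```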

## CLI

```bash
criscross predict --csv examples/sample_input.csv --out preds.csv
```

If the input CSV also has a `label` column (0/1), AUPRC is printed to stderr.
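
For readers unfamiliar with the metric, AUPRC summarizes how well the predicted probabilities rank true off-targets above non-targets. The sketch below computes average precision, one common estimator of AUPRC, in plain Python. It is an illustration only: the estimator the CLI actually uses is not specified here, and tied scores are handled by sort order, which may differ from library implementations.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive."""
    # Rank examples from highest to lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


# Toy values for illustration, not real model output.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(labels, scores), 4))  # 0.8333
```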

## Loading a custom checkpoint

```python
from criscross import sequence_model
sequence_model.load("path/to/my_model.pt")           # fp32 raw .pt
sequence_model.load("path/to/my_model.pt.zst")       # zstd-compressed fp16
```

## Inspecting the model

```python
from criscross import sequence_model

sequence_model.config()    # hyperparameters used to build CRISCross(**config)
sequence_model.metadata()  # versions, training-time test_auprc, input/output signature, seed
```

## Citation

If you use this package in research, please cite the upstream CRISCross
work. This package is a CPU-only, sequence-only repackaging of that
model.
