Metadata-Version: 2.4
Name: cryoPARES
Version: 0.1.0
Summary: Cryo-EM Pose-Assignment for Related Experiments via Supervision
Author-email: Ruben Sanchez-Garcia <rsanchezgarc@faculty.ie.edu>
License: BSD-3-Clause
Project-URL: Homepage, https://github.com/rsanchezgarc/cryoPARES
Project-URL: Bug Tracker, https://github.com/rsanchezgarc/cryoPARES/issues
Keywords: deep learning,cryo-em,pose estimation,so3-equivariance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: argParseFromDoc<1.0.0,>=0.1.5
Requires-Dist: autoCLI-config<1.0.0,>=0.1.12
Requires-Dist: einops<1.0.0,>=0.8.1
Requires-Dist: cachetools<7.0.0,>=6.2.0
Requires-Dist: e3nn<1.0.0,>=0.5.7
Requires-Dist: healpy<2.0.0,>=1.18.1
Requires-Dist: joblib<2.0.0,>=1.5.2
Requires-Dist: kornia<1.0.0,>=0.8.1
Requires-Dist: mrcfile<2.0.0,>=1.5.4
Requires-Dist: numba<1.0.0,>=0.62.0
Requires-Dist: numpy<3.0.0,>=2.3.3
Requires-Dist: omegaconf<3.0.0,>=2.3.0
Requires-Dist: pandas<3.0.0,>=2.3.2
Requires-Dist: progressBarDistributed<2027.0.0,>=2026.6.0
Requires-Dist: psutil<8.0.0,>=7.1.0
Requires-Dist: pytest<9.0.0,>=8.4.2
Requires-Dist: lightning[pytorch-extra]<3.0.0,>=2.5.5
Requires-Dist: requests<3.0.0,>=2.32.5
Requires-Dist: scipy<2.0.0,>=1.16.2
Requires-Dist: scikit-learn<2.0.0,>=1.7.2
Requires-Dist: scikit-image<1.0.0,>=0.25.2
Requires-Dist: starfile<1.0.0,>=0.5.13
Requires-Dist: starstack<1.0.0,>=0.2.14
Requires-Dist: setuptools>=70.0.0
Requires-Dist: tensorboard<3.0.0,>=2.20.0
Requires-Dist: torch<2.9.0,>=2.8.0
Requires-Dist: torchvision<1.0.0,>=0.23.0
Requires-Dist: torch-ctf<1.0.0,>=0.5.0
Requires-Dist: torch-fourier-shell-correlation<1.0.0,>=0.0.4
Requires-Dist: torch-fourier-shift<1.0.0,>=0.0.6
Requires-Dist: torch-fourier-slice<1.0.0,>=0.3.2
Requires-Dist: tqdm<5.0.0,>=4.67.1
Dynamic: license-file

# CryoPARES: Cryo-EM Pose Assignment for Related Experiments via Supervised deep learning

**CryoPARES** is a software package for assigning poses to 2D cryo-electron microscopy (cryo-EM) particle images.
It uses a supervised deep learning approach to accelerate 3D reconstruction in related cryo-EM experiments.
The key idea is to train a neural network on a high-quality reference reconstruction, and then reuse this trained model
to rapidly estimate particle poses in other, similar datasets.

This workflow is divided into two main phases:

*   **Training:** In this phase, you use a pre-existing, high-resolution dataset (where particle poses have already been determined by
traditional methods like RELION refine) to train a cryoPARES model. This process creates a model that
can recognize and assign poses to particles of that specific type of macromolecule.

*   **Inference:** Once the model is trained, you can use it for inference on new datasets of the *same* or *very similar* molecules
(e.g., the same protein with a different ligand bound). Because the model has already learned the features of the molecule, it can predict
particle poses almost instantly, bypassing the computationally expensive and time-consuming alignment steps of traditional workflows.
This is especially powerful for applications like drug screening, where many similar datasets need to be processed quickly.

This "train once, infer many times" paradigm allows for near real-time 3D reconstruction, providing rapid feedback during data
collection and analysis.

For a detailed explanation of the method, please refer to our paper:
[Supervised Deep Learning for Efficient Cryo-EM Image Alignment in Drug Discovery](https://www.biorxiv.org/content/10.1101/2025.03.04.641536)

> **Documentation:** See the [full documentation](https://rsanchezgarc.github.io/cryoPARES) for detailed instructions on training, configuration, CLI reference, troubleshooting, and API reference.

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
  - [Training](#training)
  - [Inference](#inference)
    - [Static Mode](#static-mode)
    - [Daemon Mode (On-the-fly)](#daemon-mode-on-the-fly)
  - [Utility Tools](#utility-tools)
    - [Projection Matching](#projection-matching)
    - [Post-processing](#post-processing)
    - [Reconstruction](#reconstruction)
  - [Checkpoint Compactification](#checkpoint-compactification)
- [Documentation](#documentation)
  - [Training Guide](./docs/training_guide.md)
  - [API Reference](https://rsanchezgarc.github.io/cryoPARES/api/)
  - [Configuration Guide](./docs/configuration_guide.md)
  - [Troubleshooting Guide](./docs/troubleshooting.md)
  - [CLI Reference](./docs/cli.md)
  - [Configuration System](#configuration-system)
- [Example Workflow](#example-workflow)
- [Getting Help](#getting-help)
- [License and Attribution](#license-and-attribution)

## Installation

It is strongly recommended to use a virtual environment (e.g., conda) to avoid conflicts with other packages. CryoPARES has been tested on Ubuntu 20.04+ and Rocky Linux 8+ systems. NVIDIA Ampere or newer GPUs are recommended for running the code.

1.  **Create and activate a conda environment:**

    ```bash
    conda create -n cryopares python=3.12
    conda activate cryopares
    ```

### Option 1: Install from GitHub (Recommended for Users)

This is the simplest way to install cryoPARES.

```bash
pip install git+https://github.com/rsanchezgarc/cryoPARES.git
```

### Option 2: Install from a Local Clone (Recommended for Developers)

This method is recommended if you want to modify the cryoPARES source code.

1.  **Clone the repository:**

    ```bash
    git clone https://github.com/rsanchezgarc/cryoPARES.git
    cd cryoPARES
    ```

2.  **Install the package in editable mode:**

    This allows you to make changes to the code without having to reinstall the package.

    ```bash
    pip install -e .
    ```

Installation should take no more than a few minutes.

## Usage

**IMPORTANT:** CryoPARES keeps a file handle open for each `.mrcs` file referenced in the `.star` file. This can lead
to a "Too many open files" error if the number of particle files exceeds the system's limit.
Before running training or inference, we highly recommend increasing the open-file limit by running
the following command in your terminal:

```bash
ulimit -n 65536  # Enough to handle more than 30K .mrcs files
```
This command does not generally require sudo. If you are not allowed to raise the limit, merge the
`.mrcs` files from different micrographs into fewer stacks to reduce the number of files that must be open.
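
If you launch cryoPARES from your own Python script rather than from a shell, the same limit can be inspected and raised with the standard-library `resource` module. The sketch below is only an illustration: the helper name `raise_open_file_limit` is hypothetical and not part of cryoPARES, and the change applies only to the current process and its children.

```python
import resource


def raise_open_file_limit(target: int = 65536) -> tuple[int, int]:
    """Raise the soft open-file limit toward `target`, capped by the hard limit.

    Hypothetical helper (not part of cryoPARES). Returns the (soft, hard)
    limits in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means the hard limit imposes no cap.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    # Only raise the limit; never lower an already larger (or unlimited) soft limit.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_open_file_limit()` before opening any particle files has roughly the effect of `ulimit -n 65536`, except that it can never exceed the hard limit set by your system administrator.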

<br>
CryoPARES has two main modes of operation: training and inference. Particles need to be provided as RELION 3.1+ starfile(s).

### Training

The `cryoPARES.train.train` module is used to train a new model for pose estimation. Training must be done first, using
a pre-aligned dataset of particles. While not mandatory, we encourage using particle alignments estimated with RELION.


**Usage:**
```bash
cryopares_train [ARGUMENTS] [--config [CONFIG_OVERRIDES]] [--show-config]
```

**Key Arguments:**

<!-- AUTO_GENERATED:train_parameters:START -->
**Required Parameters:**

*   `--symmetry`: Point group symmetry of the molecule (e.g., C1, D7, I, O, T)

*   `--particles_star_fname`: Path(s) to RELION 3.1+ format .star file(s) containing pre-aligned particles. Can accept multiple files

*   `--train_save_dir`: Output directory where model checkpoints, logs, and training artifacts will be saved


**Optional Parameters:**

*   `--image_size_px_for_nnet`: Target image size in pixels for neural network input. After rescaling to target sampling rate, images are cropped or padded to this size. We recommend tight box-sizes

*   `--particles_dir`: Root directory for particle image paths. If paths in .star file are relative, this directory is prepended (similar to RELION project directory concept)

*   `--n_epochs`: Number of training epochs. More epochs allow better convergence, although it does not help beyond a certain point (Default: `100`)

*   `--batch_size`: Number of particles per batch. Try to make it as large as possible before running out of GPU memory. We advise using batch sizes of at least 32 images (Default: `32`)

*   `--num_dataworkers`: Number of parallel data loading workers per GPU. Each worker is a separate CPU process. Set to 0 to load data in the main thread (useful only for debugging). Try not to oversubscribe by asking more workers than CPUs (Default: `8`)

*   `--sampling_rate_angs_for_nnet`: Target sampling rate in Angstroms/pixel for neural network input. Particle images are first rescaled to this sampling rate before processing (Default: `1.5`)

*   `--mask_radius_angs`: Radius of circular mask in Angstroms applied to particle images. If not provided, defaults to half the box size

*   `--split_halves`: If True (default), trains two separate models on data half-sets for cross-validation. Use --NOT_split_halves to train single model on all data (Default: `True`)

*   `--continue_checkpoint_dir`: Path to checkpoint directory to resume training from a previous run

*   `--finetune_checkpoint_dir`: Path to checkpoint directory to fine-tune a pre-trained model on new dataset

*   `--compile_model`: Enable torch.compile for faster training (experimental) (Default: `False`)

*   `--val_check_interval`: Fraction of epoch between validation checks. You generally don't want to touch it, but you can set it to smaller values (0.1-0.5) for large datasets to get quicker feedback

*   `--overfit_batches`: Number of batches to use for overfitting test (debugging feature to verify model can memorize small dataset)

*   `--map_fname_for_simulated_pretraining`: Path(s) to reference map(s) for simulated projection warmup before training on real data. The number of maps must match number of particle star files

*   `--junk_particles_star_fname`: Optional star file(s) with junk-only particles for estimating confidence z-score thresholds

*   `--junk_particles_dir`: Root directory for junk particle image paths (analogous to particles_dir)

<!-- AUTO_GENERATED:train_parameters:END -->
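
To get a feeling for how `--sampling_rate_angs_for_nnet` and `--image_size_px_for_nnet` interact, the arithmetic can be sketched as follows. This is a back-of-the-envelope illustration, not the cryoPARES implementation: both function names are hypothetical, and the rounding used by the actual rescaling code is an assumption.

```python
def rescaled_box_size(box_px: int, sampling_angs: float,
                      target_sampling_angs: float = 1.5) -> int:
    """Box size in pixels after resampling to the target sampling rate.

    The field of view in Angstroms is preserved, so the pixel count scales
    by sampling_angs / target_sampling_angs. Rounding to the nearest integer
    is an assumption made for this sketch.
    """
    return round(box_px * sampling_angs / target_sampling_angs)


def crop_or_pad_amount(rescaled_px: int, image_size_px_for_nnet: int) -> int:
    """Positive: total pixels cropped per axis; negative: total pixels padded."""
    return rescaled_px - image_size_px_for_nnet


# A 256 px box at 1.0 A/px becomes about 171 px at the default 1.5 A/px.
rescaled = rescaled_box_size(box_px=256, sampling_angs=1.0)
# With a hypothetical network input of 160 px, 11 px per axis would be cropped.
cropped = crop_or_pad_amount(rescaled, image_size_px_for_nnet=160)
```

A tight box therefore matters: the more empty solvent surrounds the particle, the more of the network input is spent on background once the image is cropped or padded.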

**Additional relevant Parameters (via --config):**

You can override configuration parameters using `--config KEY=VALUE`. Multiple key-value pairs can be provided. The `--config` flag should be the last argument. To see all available configuration options, run `cryopares_train --show-config`.

*   **`train.learning_rate`**: Initial learning rate. (Default: `1e-3`). It needs to be tuned to get the best performance.
*   **`train.weight_decay`**: Weight decay for the optimizer, which regularizes the model. (Default: `1e-5`). Increase it if you suffer from overfitting.
*   **`train.accumulate_grad_batches`**: Gradient accumulation batches to simulate larger batch sizes. (Default: `16`). The effective batch size is batch_size * accumulate_grad_batches. We recommend training with effective batches of size 512 < x < 2048.
*   **`models.image2sphere.lmax`**: Maximum spherical harmonic degree. The larger it is, the more expressive the network (Default: `12`). Reduce it if you see overfitting.
*   **`datamanager.num_augmented_copies_per_batch`**: Number of augmented copies per particle. Each copy undergoes a different data augmentation. The batch_size must be divisible by this number. Large batches with large num_augmented_copies_per_batch values help stabilize training, but require a lot of GPU memory (Default: `4`)
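
The relationship between these settings can be checked with a few lines of arithmetic before launching a long run. The helpers below are a hypothetical sketch (they are not part of cryoPARES), and treating the recommended range as inclusive is an assumption made here so that the defaults (32 * 16 = 512) pass the check.

```python
def effective_batch_size(batch_size: int = 32,
                         accumulate_grad_batches: int = 16) -> int:
    """Number of particles contributing to each optimizer step."""
    return batch_size * accumulate_grad_batches


def check_batch_settings(batch_size: int, accumulate_grad_batches: int,
                         num_augmented_copies_per_batch: int) -> list[str]:
    """Return human-readable warnings for settings that contradict the guidance above."""
    warnings = []
    if batch_size % num_augmented_copies_per_batch != 0:
        warnings.append(
            f"batch_size={batch_size} is not divisible by "
            f"num_augmented_copies_per_batch={num_augmented_copies_per_batch}"
        )
    effective = effective_batch_size(batch_size, accumulate_grad_batches)
    # Assumption: the recommended range is treated as inclusive in this sketch.
    if not 512 <= effective <= 2048:
        warnings.append(
            f"effective batch size {effective} is outside the recommended 512-2048 range"
        )
    return warnings
```

For example, `check_batch_settings(32, 16, 4)` returns an empty list, whereas a batch size of 34 with the same settings is flagged because 34 is not divisible by 4.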

For comprehensive training guidance including monitoring with TensorBoard and avoiding overfitting/underfitting, see the **[Training Guide](./docs/training_guide.md)**. 
For a complete list of all configuration parameters, see the **[Configuration Guide](./docs/configuration_guide.md)**.
<br>
Once training is done, you can use the checkpoint directory created inside `--train_save_dir` to infer poses for new datasets.
The checkpoint directory is named `version_0`. If you run another training experiment with the same `--train_save_dir`, another
checkpoint directory named `version_1` will be created.

### Inference

The `cryoPARES.inference.infer` module is used to predict poses for a new set of particles using a trained model. It can be run in two modes: static and daemon.

#### Static Mode

In static mode, inference is run on a fixed set of particles, which again must be provided as RELION 3.1+ star files.

**Usage:**
```bash
cryopares_infer [ARGUMENTS] [--config [CONFIG_OVERRIDES]] [--show-config]
```

**Key Arguments:**

<!-- AUTO_GENERATED:inference_parameters:START -->
**Required Parameters:**

*   `--particles_star_fname`: Path to input RELION particles .star file

*   `--checkpoint_dir`: Path to training directory (or .zip file) containing half-set models with checkpoints and hyperparameters. By default they are called version_0, version_1, etc.

*   `--results_dir`: Output directory for inference results including predicted poses and optional reconstructions


**Optional Parameters:**

*   `--data_halfset`: Which particle half-set(s) to process: "half1", "half2", or "allParticles" (Default: `allParticles`)

*   `--model_halfset`: Model half-set selection policy: "half1", "half2", "allCombinations", or "matchingHalf" (uses matching data/model pairs) (Default: `matchingHalf`)

*   `--particles_dir`: Root directory for particle image paths. If provided, overrides paths in the .star file

*   `--batch_size`: Number of particles per batch for inference (Default: `32`)

*   `--n_jobs`: Number of worker processes. Defaults to number of GPUs if CUDA enabled, otherwise 1

*   `--num_dataworkers`: Number of parallel data loading workers per GPU. Each worker is a separate CPU process. Set to 0 to load data in the main thread (useful only for debugging). Try not to oversubscribe by asking more workers than CPUs (Default: `8`)

*   `--use_cuda`: Enable GPU acceleration for inference. If False, runs on CPU only (Default: `True`)

*   `--n_cpus_if_no_cuda`: Maximum CPU threads per worker when CUDA is disabled (Default: `4`)

*   `--compile_model`: Compile model with torch.compile for faster inference (experimental, requires PyTorch 2.0+) (Default: `False`)

*   `--top_k_poses_nnet`: Number of top pose predictions to retrieve from neural network before local refinement (Default: `1`)

*   `--top_k_poses_localref`: Number of best matching poses to keep after local refinement (Default: `1`)

*   `--grid_distance_degs`: Maximum angular distance in degrees for local refinement search. Grid ranges from -grid_distance_degs to +grid_distance_degs around predicted pose (Default: `4.0`)

*   `--reference_map`: Path to reference map (.mrc) for FSC computation during validation

*   `--reference_mask`: Path to reference mask (.mrc) for masked FSC calculation

*   `--directional_zscore_thr`: Confidence z-score threshold for filtering particles. Particles with scores below this are discarded as low-confidence

*   `--skip_localrefinement`: Skip local pose refinement step and use only neural network predictions (Default: `False`)

*   `--skip_reconstruction`: Skip 3D reconstruction step and output only predicted poses (Default: `False`)

*   `--subset_idxs`: List of particle indices to process (for debugging or partial processing)

*   `--n_first_particles`: Process only the first N particles from dataset (debug feature)

*   `--check_interval_secs`: Polling interval in seconds for parent loop in distributed processing (Default: `2.0`)

*   `--merge_halves_output`: No description available (Default: `False`)

<!-- AUTO_GENERATED:inference_parameters:END -->

**Half-Set Selection (`--data_halfset` and `--model_halfset`)**

To avoid overfitting and to ensure a fair evaluation, cryo-EM datasets are often split into two halves (half1 and half2). CryoPARES uses this concept for both the data and the model.

*   `--data_halfset`: Specifies which half of the data to use for inference.
    *   `half1`: Use only the particles from the first half of the dataset.
    *   `half2`: Use only the particles from the second half of the dataset.
    *   `allParticles`: Use all particles from the dataset. (Default)

*   `--model_halfset`: Specifies which trained model to use for inference. During training, CryoPARES creates two models, one for each half of the training data.
    *   `half1`: Use the model trained on the first half of the training data.
    *   `half2`: Use the model trained on the second half of the training data.
    *   `matchingHalf`: Use the model from the corresponding half of the data (e.g., `half1` data with `half1` model). This is the default and recommended setting.
    *   `allCombinations`: Run inference for all possible combinations of data and model halves (e.g., `half1` data with `half1` model, `half1` data with `half2` model, etc.). 
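
The options above can be read as rules for pairing data halves with model halves. The sketch below enumerates those pairs under one explicit assumption that is not confirmed by this README: that `allParticles` is processed as `half1` plus `half2`. The function `data_model_pairs` is hypothetical and only illustrates the described semantics.

```python
from itertools import product

HALVES = ("half1", "half2")


def data_model_pairs(data_halfset: str = "allParticles",
                     model_halfset: str = "matchingHalf") -> list[tuple[str, str]]:
    """List (data_half, model_half) pairs implied by the two options.

    Assumption: "allParticles" expands to both data halves.
    """
    data_halves = HALVES if data_halfset == "allParticles" else (data_halfset,)
    if model_halfset == "matchingHalf":
        return [(half, half) for half in data_halves]
    if model_halfset == "allCombinations":
        return list(product(data_halves, HALVES))
    if model_halfset in HALVES:
        return [(half, model_halfset) for half in data_halves]
    raise ValueError(f"Unknown model_halfset: {model_halfset!r}")
```

Under this reading, the defaults produce two pairs (`half1` with `half1`, `half2` with `half2`), while `allCombinations` on all particles produces four.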

**Note:** Many of these parameters can also be set via `--config` (e.g., `--config projmatching.grid_step_degs=2.0`). However, using the direct CLI flags is recommended for commonly adjusted parameters. 
To see all available configuration options, run `cryopares_infer --show-config`.

For detailed API documentation, see the **[API Reference](https://rsanchezgarc.github.io/cryoPARES/api/)**.

#### Daemon Mode (On-the-fly)

In daemon mode, the inference script runs continuously and watches for new particles to be added to a directory. This is useful for processing particles as they are being generated.

The daemon workflow consists of three main components:

1.  **Queue Manager:** A central server that hosts one or more **named queues** on a single port.
2.  **Spooling Filler:** A script that monitors a directory for new `.star` files and adds them to a queue. You could implement other filler protocols using this module as an example.
3.  **Daemon Inferencer:** One or more worker processes that consume jobs from a queue and perform inference.

All three components communicate over the network. The Spooling Filler and Daemon Inferencer must be
configured with `ip`/`port`/`authkey`/`queue_name` values that match the Queue Manager.
The Queue Manager does not take a `queue_name` argument — it hosts all named queues, creating them
on demand when a client first requests a given name.
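
The "create on first request" behaviour of named queues can be pictured with a small in-process registry. This is a conceptual sketch only: the class `NamedQueueRegistry` is hypothetical, it involves no networking or authentication, and it is not how the cryoPARES Queue Manager is implemented.

```python
import queue
import threading


class NamedQueueRegistry:
    """Hands out one queue per name, creating it the first time the name is requested."""

    def __init__(self, maxsize: int = 0):
        # For queue.Queue, maxsize=0 means the queue is unbounded.
        self._maxsize = maxsize
        self._queues: dict[str, queue.Queue] = {}
        self._lock = threading.Lock()

    def get_queue(self, name: str = "default") -> queue.Queue:
        # The lock prevents two clients from creating the same queue twice.
        with self._lock:
            if name not in self._queues:
                self._queues[name] = queue.Queue(maxsize=self._maxsize)
            return self._queues[name]
```

Two clients that ask for `"my_pipeline"` receive the same queue object, while a client that asks for `"default"` receives an independent one. This is why fillers and workers must agree on `queue_name` to exchange jobs.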

**Workflow:**

1.  **Start the Queue Manager:**

    This script creates the central queue server. It should be run once and kept running in the background.
    A single server instance can host **multiple independent named queues** on the same port — useful
    when running several independent inference pipelines simultaneously.

    ```bash
    python -m cryoPARES.inference.daemon.queueManager [--ip IP] [--port PORT] [--authkey KEY] [--queue_maxsize N]
    ```

    | Argument | Default | Description |
    |---|---|---|
    | `--ip` | `localhost` | IP address to bind to. Use `0.0.0.0` to accept remote connections. |
    | `--port` | `50000` | TCP port to listen on. |
    | `--authkey` | `shared_queue_key` | Authentication passphrase shared by all clients. |
    | `--queue_maxsize` | unlimited | Max pending jobs per queue (`None` = no limit). |

    ```bash
    # Default settings
    python -m cryoPARES.inference.daemon.queueManager

    # Remote server, custom port/key, bounded queues
    python -m cryoPARES.inference.daemon.queueManager \
        --ip 0.0.0.0 --port 51000 --authkey mysecret --queue_maxsize 100
    ```

2.  **Start the Spooling Filler:**

    This script watches a directory for new `.star` files and adds them to a named queue.

    ```bash
    python -m cryoPARES.inference.daemon.spoolingFiller --directory DIR \
        [--ip IP] [--port PORT] [--authkey KEY] [--queue_name NAME] \
        [--pattern GLOB] [--check_interval SECS]
    ```

    | Argument | Default | Description |
    |---|---|---|
    | `--directory` | *(required)* | Directory to monitor for new `.star` files. |
    | `--ip` | `localhost` | IP address of the queue manager server. |
    | `--port` | `50000` | Port of the queue manager server. |
    | `--authkey` | `shared_queue_key` | Authentication key (must match the server). |
    | `--queue_name` | `default` | Name of the queue to submit jobs to. |
    | `--pattern` | `*.star` | Glob pattern for files to watch. |
    | `--check_interval` | `10` | Seconds between directory scans. |

    ```bash
    # Default queue, local server
    python -m cryoPARES.inference.daemon.spoolingFiller --directory /path/to/watch

    # Named queue on a remote server
    python -m cryoPARES.inference.daemon.spoolingFiller \
        --directory /path/to/watch \
        --ip 192.168.1.10 --port 51000 --authkey mysecret \
        --queue_name my_pipeline
    ```

**Alternative: Manually Submit Jobs**

You can also submit jobs programmatically using `queue_connection`.

```python
from cryoPARES.inference.daemon.queueManager import queue_connection

# Submit a single .star file (default queue, default ip/port/authkey)
with queue_connection(ip="localhost", port=50000, authkey="shared_queue_key") as queue:
    queue.put("/path/to/particles.star")

# Submit to a named queue on a remote server
with queue_connection(ip="192.168.1.10", port=51000, authkey="mysecret",
                      queue_name="my_pipeline") as queue:
    for star_file in ["/path/to/particles1.star", "/path/to/particles2.star"]:
        queue.put(star_file)

# Submit particles already loaded in memory (no disk I/O on the worker side)
import starfile
star = starfile.read("/path/to/particles.star")
# Optionally filter star["particles"] here before submitting
with queue_connection() as queue:
    queue.put((star["optics"], star["particles"]))  # (optics_df, particles_df)

# Send poison pill to terminate workers of a specific queue gracefully
with queue_connection(queue_name="my_pipeline") as queue:
    queue.put(None)
```

**Input formats accepted by the Daemon Inferencer:**
- **String:** Path to a `.star` file
- **Tuple `(optics_df, particles_df)`:** pandas DataFrames already loaded or filtered in memory; no disk I/O required by the worker
- **`None`:** Poison pill — signals workers to terminate gracefully
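
On the consuming side, the three input formats translate into a simple dispatch on the job type. The loop below is a sketch of that protocol using a local `queue.Queue`; the function `drain_jobs` is hypothetical, it only records what it receives, and it is not the Daemon Inferencer's actual code.

```python
import queue


def drain_jobs(job_queue: queue.Queue) -> list[str]:
    """Consume jobs until the poison pill arrives and return a log of what was seen."""
    log = []
    while True:
        job = job_queue.get()
        if job is None:
            # Poison pill: stop consuming.
            log.append("stop")
            break
        if isinstance(job, str):
            # Path to a .star file that the worker would read from disk.
            log.append(f"star file: {job}")
        elif isinstance(job, tuple) and len(job) == 2:
            # (optics_df, particles_df) already loaded in memory.
            _optics, particles = job
            log.append(f"in-memory job with {len(particles)} particles")
        else:
            raise TypeError(f"Unsupported job type: {type(job).__name__}")
    return log
```

In this sketch a single `None` stops a single consumer; if several consumers read from the same queue, each one needs to see its own poison pill, so you may need to submit one per worker.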


3.  **Start the Daemon Inferencer(s):**

    You can start as many inference workers as you want. Each worker will take jobs from the queue and
    process them. **Important:** each worker must have its own `--results_dir`.
    The inference arguments are the same as for `cryopares_infer`, plus the network arguments below.

    ```bash
    python -m cryoPARES.inference.daemon.daemonInference \
        --checkpoint_dir DIR --results_dir DIR \
        [--net_address IP] [--net_port PORT] [--net_authkey KEY] [--net_queue_name NAME] \
        [inference options…]
    ```

    | Argument | Default | Description |
    |---|---|---|
    | `--net_address` | `localhost` | IP address of the queue manager server. |
    | `--net_port` | `50000` | Port of the queue manager server. |
    | `--net_authkey` | `shared_queue_key` | Authentication key (must match the server). |
    | `--net_queue_name` | `default` | Name of the queue to consume jobs from. |

    ```bash
    # Two workers on the default queue
    python -m cryoPARES.inference.daemon.daemonInference \
        --checkpoint_dir /path/to/checkpoint --results_dir /path/to/results_worker1 \
        --particles_dir /path/to/particles
    python -m cryoPARES.inference.daemon.daemonInference \
        --checkpoint_dir /path/to/checkpoint --results_dir /path/to/results_worker2 \
        --particles_dir /path/to/particles

    # Worker on a named queue from a remote server
    python -m cryoPARES.inference.daemon.daemonInference \
        --checkpoint_dir /path/to/checkpoint --results_dir /path/to/results_pipe2 \
        --net_address 192.168.1.10 --net_port 51000 --net_authkey mysecret \
        --net_queue_name my_pipeline
    ```

4.  **Materialize the Volume:**

    You can materialize the final 3D volume from the partial results at any time, even while the inference workers are still running.
    The script will combine all the available partial results.

    ```bash
    python -m cryoPARES.inference.daemon.materializePartialResults \
        --partial_outputs_dirs /path/to/results_worker1/ /path/to/results_worker2 \
        --output_mrc /path/to/final_map.mrc --output_star /path/to/final_particles.star
    ```

### Utility Tools

CryoPARES includes standalone utility tools for projection matching and reconstruction. 
**Note:** These tools are automatically used within the `cryopares_infer` workflow, but can also be run independently if needed.<br>

#### Projection Matching

The projection matching utility performs local pose refinement by searching around existing particle orientations to find the best match against reference volume projections. This is used automatically during inference for local refinement, but can also be run standalone.

**Usage:**
```bash
cryopares_projmatching [ARGUMENTS] [--config [CONFIG_OVERRIDES]] [--show-config]
```

**Key Arguments:**

<!-- AUTO_GENERATED:projmatching_parameters:START -->
**Required Parameters:**

*   `--reference_vol`: Path to reference 3D volume (.mrc file) for generating projection templates

*   `--particles_star_fname`: Path to input STAR file with particle metadata

*   `--out_fname`: Path for output STAR file with aligned particle poses

*   `--particles_dir`: Root directory for particle image paths. If provided, overrides paths in the .star file


**Optional Parameters:**

*   `--mask_radius_angs`: Radius of circular mask in Angstroms applied to particle images

*   `--grid_distance_degs`: Maximum angular distance in degrees for local refinement search. Grid ranges from -grid_distance_degs to +grid_distance_degs around predicted pose (Default: `4.0`)

*   `--grid_step_degs`: Angular step size in degrees for grid search during local refinement (Default: `2.0`)

*   `--return_top_k_poses`: Number of top matching poses to save per particle (Default: `1`)

*   `--filter_resolution_angst`: Low-pass filter resolution in Angstroms applied to reference volume before matching

*   `--n_jobs`: Number of parallel worker processes for distributed projection matching (Default: `1`)

*   `--num_dataworkers`: Number of CPU workers per PyTorch DataLoader for data loading (Default: `1`)

*   `--batch_size`: Number of particles to process simultaneously per job (Default: `32`)

*   `--use_cuda`: Enable GPU acceleration. If False, runs on CPU only (Default: `True`)

*   `--verbose`: Enable verbose logging output (Default: `False`)

*   `--float32_matmul_precision`: PyTorch float32 matrix multiplication precision mode ("highest", "high", or "medium") (Default: `high`)

*   `--gpu_id`: Specific GPU device ID to use (if multiple GPUs available)

*   `--n_first_particles`: Process only the first N particles from dataset (for testing or validation)

*   `--correct_ctf`: Apply CTF correction during projection matching (Default: `True`)

*   `--halfmap_subset`: Select half-map subset (1 or 2) for half-map validation

<!-- AUTO_GENERATED:projmatching_parameters:END -->
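
To see how `--grid_distance_degs` and `--grid_step_degs` determine the size of the local search, the angular offsets along one axis can be listed as below. This is a sketch under stated assumptions: `grid_offsets` is a hypothetical helper, and neither the number of rotation axes searched nor the handling of distances that are not a multiple of the step is specified in this README.

```python
def grid_offsets(grid_distance_degs: float = 4.0,
                 grid_step_degs: float = 2.0) -> list[float]:
    """Angular offsets from -grid_distance_degs to +grid_distance_degs, centred on 0.

    Assumption: offsets are symmetric multiples of the step, so the grid only
    reaches the full distance when the distance is a multiple of the step.
    """
    n_steps = int(grid_distance_degs // grid_step_degs)
    return [i * grid_step_degs for i in range(-n_steps, n_steps + 1)]
```

With the defaults this gives five offsets per axis (-4, -2, 0, 2, 4 degrees). Halving the step roughly doubles the number of offsets per axis, so the cost of a search over several axes grows quickly.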

For additional details, see the [Command-Line Interface documentation](./docs/cli.md).

#### Post-processing

The post-processing utility sharpens reconstructed volumes using B-factor estimation (Guinier analysis) and FSC weighting. Run it after reconstruction to improve map interpretability.

**Usage:**
```bash
# --auto_mask can be used instead of --mask
cryopares_postprocess bfactor \
    --half1 /path/to/half1.mrc \
    --half2 /path/to/half2.mrc \
    --mask /path/to/mask.mrc \
    --output_dir /path/to/postprocess_output
```

For all options, see the [CLI Reference](./docs/cli.md#cryopares_postprocess).

#### Reconstruction

The reconstruction utility creates a 3D volume from particles with known poses using direct Fourier inversion. This is used automatically during inference to generate the final 3D map, but can also be run standalone for particles aligned by other methods (e.g., RELION).

**Usage:**
```bash
cryopares_reconstruct [ARGUMENTS] [--config [CONFIG_OVERRIDES]] [--show-config]
```

**Key Arguments:**

<!-- AUTO_GENERATED:reconstruct_parameters:START -->
**Required Parameters:**

*   `--particles_star_fname`: Path to input STAR file with particle metadata and poses to reconstruct

*   `--symmetry`: Point group symmetry of the volume for reconstruction (e.g., C1, D2, I, O, T)

*   `--output_fname`: Path for output reconstructed 3D volume (.mrc file)


**Optional Parameters:**

*   `--particles_dir`: Root directory for particle image paths. If provided, overrides paths in the .star file

*   `--n_jobs`: Number of parallel worker processes for distributed reconstruction (Default: `1`)

*   `--num_dataworkers`: Number of CPU workers per PyTorch DataLoader for data loading (Default: `1`)

*   `--batch_size`: Number of particles to backproject simultaneously per job (Default: `128`)

*   `--use_cuda`: Enable GPU acceleration for reconstruction. If False, runs on CPU only (Default: `True`)

*   `--correct_ctf`: Apply CTF correction during reconstruction (Default: `True`)

*   `--eps`: Regularization mode and strength. Sign selects mode: eps >= 0 uses Tikhonov regularization, eps < 0 uses RELION-style radial averaging. Magnitude sets scale: for Tikhonov, eps is the regularization constant (ideally 1/SNR); for radial averaging, abs(eps) is the divisor for radial weights (RELION uses 1000). Recommended: -1000 for radial averaging, 1e-3 for Tikhonov (Default: `-1000.0`)

*   `--min_denominator_value`: Minimum denominator threshold for numerical stability (prevents division by zero). Applied as final safety clamp regardless of regularization mode. RELION uses 1e-6 (Default: `1e-06`)

*   `--use_only_n_first_batches`: Reconstruct using only first N batches (for testing or quick validation)

*   `--float32_matmul_precision`: PyTorch float32 matrix multiplication precision mode ("highest", "high", or "medium") (Default: `high`)

*   `--weight_with_confidence`: Apply per-particle confidence weighting during backprojection. If True, particles with higher confidence contribute more to reconstruction. It reads the confidence from the metadata label "rlnParticleFigureOfMerit" (Default: `False`)

*   `--halfmap_subset`: Select half-map subset (1 or 2) for half-map reconstruction and validation

*   `--apply_soft_mask`: Apply soft spherical masking after reconstruction to reduce edge artifacts (RELION-style) (Default: `True`)

*   `--mask_radius_pix`: Radius for soft mask in pixels. If negative, defaults to box_size/2  (Default: `-1.0`)

*   `--mask_edge_width`: Width of cosine falloff edge in pixels (Default: `3`)

<!-- AUTO_GENERATED:reconstruct_parameters:END -->
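
The two regularization modes selected by the sign of `--eps` can be illustrated on a toy list of Fourier-space weights. The sketch below is one plausible reading of the parameter descriptions above, not the cryoPARES implementation: `regularized_denominator` and the precomputed `shell_ids` are hypothetical, and the radial mode is modelled as a per-shell floor equal to the shell's mean weight divided by `abs(eps)`.

```python
def regularized_denominator(weights: list[float], shell_ids: list[int],
                            eps: float = -1000.0,
                            min_denominator_value: float = 1e-6) -> list[float]:
    """Regularize the denominator used when dividing accumulated data by accumulated weights.

    eps >= 0: Tikhonov, a constant is added to every weight.
    eps <  0: radial averaging (assumed form), each weight is floored at the
              mean weight of its resolution shell divided by abs(eps).
    """
    if eps >= 0:
        regularized = [w + eps for w in weights]
    else:
        sums: dict[int, float] = {}
        counts: dict[int, int] = {}
        for w, shell in zip(weights, shell_ids):
            sums[shell] = sums.get(shell, 0.0) + w
            counts[shell] = counts.get(shell, 0) + 1
        floors = {s: sums[s] / counts[s] / abs(eps) for s in sums}
        regularized = [max(w, floors[s]) for w, s in zip(weights, shell_ids)]
    # Final safety clamp, applied regardless of the mode.
    return [max(v, min_denominator_value) for v in regularized]
```

In this toy model, a voxel with zero weight in a shell whose mean weight is 1000 ends up with a denominator of 1 when `eps=-1000`, instead of causing a division by zero.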

For additional details, see the [Command-Line Interface documentation](./docs/cli.md).

### Checkpoint Compactification

After training, you can package your checkpoint into a compact ZIP file for easy distribution and storage. This reduces the checkpoint size from ~40 GB to ~10 GB by removing training logs, metrics, and intermediate files while keeping everything needed for inference.

**Compactify a checkpoint:**
```bash
python -m cryoPARES.scripts.compactify_checkpoint \
    --checkpoint_dir /path/to/training_output/version_0
```

This creates `version_0_compact.zip` containing only the essential files.

**Use the compactified checkpoint for inference:**
```bash
cryopares_infer \
    --particles_star_fname /path/to/particles.star \
    --checkpoint_dir /path/to/version_0_compact.zip \
    --results_dir /path/to/results
```

The ZIP file is used directly without extraction, making it ideal for:
- **Sharing models** with collaborators
- **Archiving** trained models efficiently
- **Deploying** to inference servers with limited storage
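
Reading from a ZIP archive without extracting it is a standard-library feature, so you can also inspect a compact checkpoint yourself. The sketch below uses `zipfile` for that purpose; the helper names and the `.ckpt`/`.yaml` suffixes are assumptions for illustration, since the exact contents of the archive are not listed here.

```python
import zipfile


def list_checkpoint_members(zip_path: str,
                            suffixes: tuple[str, ...] = (".ckpt", ".yaml")) -> list[str]:
    """List archive members with the given suffixes (the suffixes are hypothetical)."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(suffixes))


def read_member(zip_path: str, member: str) -> bytes:
    """Read a single member into memory without extracting the archive to disk."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as handle:
            return handle.read()
```

Calling `list_checkpoint_members("/path/to/version_0_compact.zip")` is a quick way to confirm that an archive you received contains model files before starting inference.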


## Documentation

- **[Training Guide](./docs/training_guide.md)** - Comprehensive guide on training models, monitoring with TensorBoard, and avoiding overfitting/underfitting
- **[API Reference](https://rsanchezgarc.github.io/cryoPARES/api/)** - Auto-generated API documentation with type hints (hosted on GitHub Pages)
- **[Configuration Guide](./docs/configuration_guide.md)** - Complete reference for all configuration parameters
- **[Troubleshooting Guide](./docs/troubleshooting.md)** - Solutions to common issues
- **[CLI Reference](./docs/cli.md)** - Command-line interface documentation

**Building Documentation Locally:**
```bash
cd docs
pip install -r requirements.txt
make html
# Open _build/html/index.html in your browser
```


### Configuration System

CryoPARES uses a flexible configuration system that allows you to manage settings from multiple sources.

*   **`--show-config`:** To see all available options, run any main script with the `--show-config` flag. This will print a comprehensive list of all parameters, their current values, and their paths.

    ```bash
    cryopares_train --show-config
    ```

*   **YAML Files:** Create a `.yaml` file with your desired parameters.
*   **Command-Line Overrides:** Pass `KEY=VALUE` pairs with the `--config` flag. Use dot notation to specify nested parameters (e.g., `--config models.image2sphere.lmax=6`).
*   **Direct Arguments:** Use standard command-line flags (e.g., `--batch_size 32`).

**Precedence:** Direct command-line arguments override `--config` overrides, which override YAML files, which override the default configuration.
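
The precedence order can be pictured as successive merges, from lowest to highest priority. The sketch below uses plain dictionaries and is not the actual cryoPARES configuration code: `deep_merge` and `dotted_to_nested` are hypothetical helpers, the default values are invented, and override values are left as strings for simplicity.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def dotted_to_nested(pair):
    """Turn 'a.b.c=6' into {'a': {'b': {'c': '6'}}} (values stay strings here)."""
    path, value = pair.split("=", 1)
    nested = value
    for key in reversed(path.split(".")):
        nested = {key: nested}
    return nested


# Hypothetical values for each configuration source.
defaults = {"batch_size": 64, "models": {"image2sphere": {"lmax": 12}}}
yaml_values = {"batch_size": 16, "models": {"image2sphere": {"lmax": 8}}}
config_overrides = ["models.image2sphere.lmax=6"]  # KEY=VALUE pairs
direct_args = {"batch_size": 32}  # e.g. --batch_size 32

# Apply the layers from lowest to highest precedence.
config = deep_merge(defaults, yaml_values)
for pair in config_overrides:
    config = deep_merge(config, dotted_to_nested(pair))
config = deep_merge(config, direct_args)

print(config)
```

Here the direct argument wins for `batch_size`, and the `KEY=VALUE` override wins for `lmax`, because each later merge replaces only the keys it sets.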

For a complete reference of all configuration parameters, see the **[Configuration Guide](./docs/configuration_guide.md)**.

## Example Workflow

### Quick Start with Test Dataset

Before running on your own data, we recommend testing cryoPARES with a small dataset. If you don't have a small particles `.star` file,
you can download examples from [CESPED](https://github.com/rsanchezgarc/cesped) (Cryo-EM Supervised Pose Estimation Dataset).
CESPED provides benchmark datasets specifically designed for supervised pose estimation.

#### Install CESPED (Optional)

```bash
pip install cesped
```

#### Download a Test Dataset

For a quick test, use the small `TEST` dataset (subset of EMPIAR-11120):

```bash
python -m cesped.particlesDataset download_entry -t TEST --benchmarkDir /path/to/your/data
``` 
Note that this dataset is too small to train an accurate model, but it is enough to check that you
can run the full workflow.

For a full benchmark dataset, you can download other CESPED entries, such as EMPIAR-10166 (Human 26S proteasome, C1 symmetry, 238K particles):

```bash
# Download both half-sets
python -m cesped.particlesDataset download_entry -t 10166 --benchmarkDir /path/to/your/data
```

### Training and Inference Example

Once you have downloaded a CESPED dataset, you can train and test cryoPARES:

1. **Train a model on an existing, aligned dataset:**

   ```bash
   cryopares_train \
       --symmetry C1 \
       --particles_star_fname /path/to/your/data/CESPED/TEST/particles_merged.star \
       --particles_dir /path/to/cesped_benchmark/TEST/ \
       --train_save_dir /path/to/training_output \
       --n_epochs 3 \
       --batch_size 32 \
       --sampling_rate_angs_for_nnet 1.5 \
       --image_size_px_for_nnet 64 \
       --config \
           models.image2sphere.lmax=6 \
           models.image2sphere.so3components.so3outputgrid.hp_order=3 \
           models.image2sphere.so3components.i2sprojector.sphere_fdim=64 \
           models.image2sphere.so3components.s2conv.f_out=16 \
           models.image2sphere.imageencoder.unet.out_channels_first=4
   ```

   Note that the `--config` overrides above create a small model that trains quickly but will not perform well.
   The example also uses an `--image_size_px_for_nnet` much smaller than advisable (we recommend 128 to 256, depending on the particle).

   For production use:

   ```bash
   cryopares_train \
       --symmetry C1 \
       --particles_star_fname /path/to/particles.star \
       --particles_dir /path/to/particles/ \
       --train_save_dir /path/to/training_output \
       --n_epochs 100 \
       --batch_size 32 \
       --image_size_px_for_nnet 160 \
       --sampling_rate_angs_for_nnet 1.5  #We are using the default model, hence no --config
   ```
You can tweak the neural network by setting different values with the `--config` flag. Use `--show-config` to list all available options.

After training, there should be a directory called `/path/to/training_output/version_*` containing the checkpoint.
Pass this directory to the inference command.
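
If you train several times into the same `--train_save_dir`, a small helper can pick the most recent `version_*` directory. This is a minimal sketch that assumes directories are named `version_N` with an integer `N`, as in the path above; `latest_version_dir` is a hypothetical helper, not part of cryoPARES.

```python
import re
import tempfile
from pathlib import Path


def latest_version_dir(train_save_dir):
    """Return the version_N subdirectory with the largest N, or None if there is none."""
    candidates = []
    for path in Path(train_save_dir).glob("version_*"):
        match = re.fullmatch(r"version_(\d+)", path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare numerically so that version_10 sorts after version_2.
    return max(candidates, key=lambda item: item[0])[1]


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for n in (0, 2, 10):
        (Path(tmp) / f"version_{n}").mkdir()
    newest_name = latest_version_dir(tmp).name
    empty_result = latest_version_dir(Path(tmp) / "version_0")

print(newest_name)
```

The returned path can then be passed as `--checkpoint_dir` to `cryopares_infer`.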

2. **Run inference on a new dataset with local refinement and reconstruction:**


   ```bash
   # --reference_map: if not provided, it is automatically generated from the training data
   # --grid_distance_degs 12: the local search will span -12° to +12°
   # --directional_zscore_thr 1.0: removes all particles with directional z-score < 1.0
   cryopares_infer \
       --particles_star_fname /path/to/new_particles.star \
       --particles_dir /path/to/particles \
       --checkpoint_dir /path/to/training_output/version_0 \
       --results_dir /path/to/inference_results \
       --reference_map /path/to/initial_model.mrc \
       --batch_size 32 \
       --grid_distance_degs 12 \
       --directional_zscore_thr 1.0
   ```
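
The effect of `--directional_zscore_thr` can be pictured as a simple threshold filter. The sketch below is illustrative only: the particle records and the `zscore` key are hypothetical, and it does not show how cryoPARES computes or stores the directional z-score.

```python
def keep_confident(particles, zscore_thr):
    """Keep particles whose directional z-score is at least `zscore_thr`."""
    return [p for p in particles if p["zscore"] >= zscore_thr]


# Hypothetical particle records; the key names are illustrative only.
particles = [
    {"name": "000001@stack.mrcs", "zscore": 2.3},
    {"name": "000002@stack.mrcs", "zscore": 0.4},
    {"name": "000003@stack.mrcs", "zscore": 1.0},
]

kept = keep_confident(particles, zscore_thr=1.0)
print([p["name"] for p in kept])
```

Raising the threshold discards more particles, trading dataset size for higher average pose confidence.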


## Getting Help

If you encounter issues:

1. Check the **[Troubleshooting Guide](./docs/troubleshooting.md)** for common problems and solutions
2. Review the **[Training Guide](./docs/training_guide.md)** for training best practices
3. Consult the **[Configuration Guide](./docs/configuration_guide.md)** for parameter details
4. See the **[API Reference](https://rsanchezgarc.github.io/cryoPARES/api/)** for programmatic usage

For bugs or feature requests, please open an issue on [GitHub](https://github.com/rsanchezgarc/cryoPARES/issues).

## License and Attribution

CryoPARES is licensed under the **GNU General Public License v3.0** (GPL-3.0). See the [LICENSE](LICENSE) file for details.

### Third-Party Code

This project incorporates code derived from the following open-source projects:

- **[torch-fourier-slice](https://github.com/teamtomo/torch-fourier-slice)** (Copyright © 2023 Alister Burt, BSD 3-Clause License)
  - Used in: `cryoPARES/reconstruction/insert_central_slices_rfft_3d.py`
  - Used in: `cryoPARES/projmatching/projmatchingUtils/extract_central_slices_as_real.py`

See [THIRD-PARTY-LICENSES](THIRD-PARTY-LICENSES) for complete license texts and attribution details.
