Metadata-Version: 2.2
Name: recovar
Version: 0.4.5
Summary: RECOVAR cryo-EM heterogeneity analysis
Author-email: Marc Aurele Gilles <gilles@princeton.edu>
Project-URL: Homepage, https://github.com/ma-gilles/recovar
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dataframe_image==0.2.7
Requires-Dist: finufft==2.3.1
Requires-Dist: healpy==1.18.0
Requires-Dist: ipython==8.12.3
Requires-Dist: ipywidgets==8.1.5
Requires-Dist: jax_finufft==0.1.0
Requires-Dist: jaxopt==0.8.3
Requires-Dist: jaxtyping==0.2.38
Requires-Dist: kneed==0.8.5
Requires-Dist: lineax==0.0.7
Requires-Dist: matplotlib==3.10.0
Requires-Dist: matplotlib_scalebar==0.9.0
Requires-Dist: more_itertools==10.6.0
Requires-Dist: mrcfile==1.5.4
Requires-Dist: numpy
Requires-Dist: pandas==2.2.3
Requires-Dist: plotly==6.0.0
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: cufflinks
Requires-Dist: jupyterlab
Requires-Dist: notebook<7
Requires-Dist: scipy==1.15.2
Requires-Dist: seaborn==0.13.2
Requires-Dist: scikit-image
Requires-Dist: starfile==0.5.11
Requires-Dist: tensorflow_cpu==2.18.0
Requires-Dist: typing_extensions==4.12.2
Requires-Dist: twine
Requires-Dist: ipykernel
Requires-Dist: scikit-fmm
Requires-Dist: umap-learn
Provides-Extra: dev

# RECOVAR: Regularized covariance estimation for cryo-EM heterogeneity analysis


<details><summary>Click here to see more details about the features RECOVAR</summary>

- High resolution. According to a thorough benchmarking led by Ellen Zhong’s lab, [cryobench](https://cryobench.cs.princeton.edu), it achieves higher resolution than other methods in most cases. See [analyzing the output](#analyze).
- Extracting images based on a volume. If you see a particular feature in a volume generated by RECOVAR, there is a feature to automatically extract the set of images that have created that feature. See [extracting](#extracting-image-subset-based-on-volumes)
- Accurate estimation of conformational density.
- Estimation of low free-energy motions (if you believe conformational density and free energy are related). See [analyzing the output](#analyze).
- It can use a focus mask. See option `--focus_mask` in [pipeline](#iii-running-the-recovar-pipeline).
- It supports tomography data (same format as cryoDRGN-ET). One practical advantage over the cryoDRGN-ET and tomodrgn is that a focus mask can be input. This is a new feature (no paper yet) so it may be less stable. See [tomography](#tomography)
- You can also use the RECOVAR volume generation on latent space generated by your favorite heterogeneity method. For example, I have used it to improve cryoDRGN reconstructions. Since the volume generation scheme is transparent, you can be quite confident that it produces no hallucination which may also be useful to validate the output of another method. See [kernel regression](#using-kernel-regression-with-other-embeddings)
</details>

If you are interested in the research and methods
[see preprint here](https://www.biorxiv.org/content/10.1101/2023.10.28.564422v4) and recorded talk [here](https://www.youtube.com/watch?v=cQBQlCCRp8Q&t=740s).

[Installation](#installation)


Running RECOVAR:
* [1. Preprocessing](#i-preprocessing)
* [2. Specifying a mask](#ii-specifying-a-real-space-mask)
* [3. Running the pipeline](#iii-running-the-recovar-pipeline)
  - Uses [`recovar pipeline`](#iii-running-the-recovar-pipeline)
* [4. Analyzing results](#iv-analyzing-results)
  - Uses [`analyze`](#usage-of-analyzepy)
  - Generating volumes at specific points: [`compute_state`](#generating-volumes-at-specific-points-in-latent-space)
  - Generating and analyzing trajectories: [`compute_trajectory`](#generating-and-analyzing-trajectories-in-latent-space)
  - Extracting image subsets:
    - Based on volumes: [`extract_image_subset`](#extracting-image-subset-based-on-volumes)
    - Based on k-means clusters: [`extract_image_subset_from_kmeans`](#command-extract_image_subset_from_kmeans)
  - Estimating conformational density: [`estimate_conformational_density`](#command-estimate_conformational_densitypy)
* [5. Visualizing results](#v-visualizing-results)

Other:
* [RECOVAR for Cryo-ET](#cryo-et)
* [Testing your installation](#small-test-dataset)
<!-- * [6. Generating trajectories](#vi-generating-additional-trajectories) -->
* [Using kernel regression with other embeddings](#using-kernel-regression-with-other-embeddings)


[TLDR](#tldr)

<!-- (OUT OF DATE)
Peak at what output looks like on a [synthetic dataset](output_visualization_simple_synthetic.ipynb) and [real dataset](  x). -->

Also:
[using the source code](#using-the-source-code), 
[limitations](#limitations), 
[contact](#contact)



## Installation 
<!-- To run this code, CUDA and [JAX](https://jax.readthedocs.io/en/latest/index.html#) are required. See information about JAX installation [here](https://jax.readthedocs.io/en/latest/installation.html). -->


<!-- Assuming you already have CUDA, installation should take less than 5 minutes.
Below are a set of commands which runs on our university cluster (Della), but may need to be tweaked to run on other clusters.
You may need to load CUDA before installing JAX, E.g., on our university cluster with
    module load cudatoolkit/12.3

Then create an environment, download JAX-cuda (for some reason the latest version is causing issues, so make sure to use 0.4.23), clone the directory and install the requirements (note the --no-deps flag. This is because of some conflict with dependencies of cryodrgn. Will fix it soon.).

    git clone https://github.com/ma-gilles/recovar.git
    conda create --name recovar python=3.11 -y
    conda activate recovar
    pip install -U "jax[cuda12_pip]"==0.4.23 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html --no-input
    pip install --no-deps -r recovar/recovar_install_requirements.txt --no-input
    python -m ipykernel install --user --name=recovar 

It is recommanded to test your installation before running on a real dataset, see [Testing your installation](#small-test-dataset). -->





<!-- ## Installation  -->
CUDA and [JAX](https://jax.readthedocs.io/en/latest/index.html#) are required to run this code. JAX will be installed by the command below, and the cudatoolkit is now included, but you need to have the CUDA drivers installed, see info here about JAX installation [here](https://jax.readthedocs.io/en/latest/installation.html).
Assuming you already have CUDA drivers (probably already installed on your cluster), installation should take less than 5 minutes.

If you have an internet connection, you can copy paste the commands below. It creates a conda environment to install recovar in, then `pip install`s the GPU version of JAX, the cpu version of pytorch, and recovar.

    conda create --name recovar python=3.11 -y
    conda activate recovar
    pip install -f https://download.pytorch.org/whl/torch_stable.html torch==2.3.1+cpu "jax[cuda12]"==0.5.0  recovar

You can then use the recovar command after activating the environment. It is recommended to test your installation before running on a real dataset. You can do this by running this on a GPU-capable device:

    conda activate recovar
    recovar run_test_dataset

see details in [testing your installation](#small-test-dataset).

<!-- The code was tested on [this commit](https://github.com/ma-gilles/recovar/commit/6388bcc8646c535ae1b121952aa5c04e52402455).

The code for the paper was run on [this commit](https://github.com/ma-gilles/recovar/commit/6388bcc8646c535ae1b121952aa5c04e52402455). -->



## I. Preprocessing 
The input interface of RECOVAR is borrowed directly from the excellent [cryoDRGN toolbox](https://cryodrgn.cs.princeton.edu/). 
Particles, poses and CTF must be prepared in the same way, and below is copy-pasted part of 
[cryoDRGN's documentation](https://github.com/ml-struct-bio/cryodrgn#2-parse-image-poses-from-a-consensus-homogeneous-reconstructiqqon).
You should first install cryoDRGN, and prepare the dataset as below before going on to step 2.


**NOTE: You need to install cryoDRGN and RECOVAR in two different conda environments (they have versioning conflict). This means you should install cryoDRGN in the crydrgn environment as below, process the data using the cryoDRGN environment after doing `conda activate cryodrgn`, then activate the recovar environement with `conda activate recovar`, and go on with the analysis.**



### cryoDRGN Installation

`cryodrgn` may be installed via `pip`, and we recommend installing `cryodrgn` in a clean conda environment.

    # Create and activate conda environment
    (base) $ conda create --name cryodrgn python=3.9
    (cryodrgn) $ conda activate cryodrgn

    # install cryodrgn
    (cryodrgn) $ pip install cryodrgn

(NOTE: right now you need to install cryoDRGN and RECOVAR in two different environments, will fix soon!) 


You can alternatively install a newer, less stable, development version of `cryodrgn` using our beta release channel:

    (cryodrgn) $ pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ cryodrgn --pre

More installation instructions are found in the [documentation](https://ez-lab.gitbook.io/cryodrgn/installation).



### 1. Preprocess image stack

First resize your particle images using the `cryodrgn downsample` command:

<details><summary><code>$ cryodrgn downsample -h</code></summary>

    usage: cryodrgn downsample [-h] -D D -o MRCS [--is-vol] [--chunk CHUNK]
                               [--datadir DATADIR]
                               mrcs

    Downsample an image stack or volume by clipping fourier frequencies

    positional arguments:
      mrcs               Input images or volume (.mrc, .mrcs, .star, .cs, or .txt)

    optional arguments:
      -h, --help         show this help message and exit
      -D D               New box size in pixels, must be even
      -o MRCS            Output image stack (.mrcs) or volume (.mrc)
      --is-vol           Flag if input .mrc is a volume
      --chunk CHUNK      Chunksize (in # of images) to split particle stack when
                         saving
      --relion31         Flag for relion3.1 star format
      --datadir DATADIR  Optionally provide path to input .mrcs if loading from a
                         .star or .cs file
      --max-threads MAX_THREADS
                         Maximum number of CPU cores for parallelization (default: 16)
      --ind PKL          Filter image stack by these indices

</details>

We recommend first downsampling images to 128x128 since larger images can take much longer to train:

    $ cryodrgn downsample [input particle stack] -D 128 -o particles.128.mrcs

The maximum recommended image size is D=256, so we also recommend downsampling your images to D=256 if your images are larger than 256x256:

    $ cryodrgn downsample [input particle stack] -D 256 -o particles.256.mrcs

The input file format can be a single `.mrcs` file, a `.txt` file containing paths to multiple `.mrcs` files, a RELION `.star` file, or a cryoSPARC `.cs` file. For the latter two options, if the relative paths to the `.mrcs` are broken, the argument `--datadir` can supply the path to where the `.mrcs` files are located.

If there are memory issues with downsampling large particle stacks, add the `--chunk 10000` argument to save images as separate `.mrcs` files of 10k images.


### 2. Parse image poses from a consensus homogeneous reconstruction

CryoDRGN expects image poses to be stored in a binary pickle format (`.pkl`). Use the `parse_pose_star` or `parse_pose_csparc` command to extract the poses from a `.star` file or a `.cs` file, respectively.

Example usage to parse image poses from a RELION 3.1 starfile:

    $ cryodrgn parse_pose_star particles.star -o pose.pkl -D 300

Example usage to parse image poses from a cryoSPARC homogeneous refinement particles.cs file:

    $ cryodrgn parse_pose_csparc cryosparc_P27_J3_005_particles.cs -o pose.pkl -D 300

**Note:** The `-D` argument should be the box size of the consensus refinement (and not the downsampled images from step 1) so that the units for translation shifts are parsed correctly.

### 3. Parse CTF parameters from a .star/.cs file

CryoDRGN expects CTF parameters to be stored in a binary pickle format (`.pkl`). Use the `parse_ctf_star` or `parse_ctf_csparc` command to extract the relevant CTF parameters from a `.star` file or a `.cs` file, respectively.

Example usage for a .star file:

    $ cryodrgn parse_ctf_star particles.star -D 300 --Apix 1.03 -o ctf.pkl

The `-D` and `--Apix` arguments should be set to the box size and Angstrom/pixel of the original `.mrcs` file (before any downsampling).

Example usage for a .cs file:

    $ cryodrgn parse_ctf_csparc cryosparc_P27_J3_005_particles.cs -o ctf.pkl

## II. Specifying a real-space mask

A real space mask is important to boost SNR. Most consensus reconstruction software output a mask, which you can use as input (`--mask=path_to_mask.mrc`). Make sure the mask is not too tight; you can use the input `--dilate-mask-iter` to expand the mask if needed. You may also want to use a focusing mask to focus on heterogeneity in one part of the volume [click here](https://guide.cryosparc.com/processing-data/tutorials-and-case-studies/mask-selection-and-generation-in-ucsf-chimera) to find instructions to generate one with Chimera.

If you don't input a mask, you can ask the software to estimate one using the two halfmaps of the mean ( `--mask=from_halfmaps`). You may also want to run with a loose spherical mask (option `--mask=sphere`) and use the computed variance map to observe which parts have large variance.



## III. Running the RECOVAR Pipeline

To run the RECOVAR pipeline, ensure that the input images (.mrcs), poses (.pkl), and CTF parameters (.pkl) are prepared. Then, execute the following command:

```bash
$ recovar pipeline particles.128.mrcs -o output_test --ctf ctf.pkl --poses poses.pkl --mask [path_to_your_mask.mrc]
```

To view a list of all available options and their usage, use the help command:

<details><summary><code>$ recovar pipeline -h</code></summary>

```
usage: pipeline [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl --mask mrc [--focus-mask mrc] 
                   [--mask-dilate-iter MASK_DILATE_ITER] [--correct-contrast] [--ignore-zero-frequency] 
                   [--ind PKL] [--uninvert-data UNINVERT_DATA] [--lazy] [--datadir DATADIR] [--n-images N_IMAGES] 
                   [--padding PADDING] [--halfsets HALFSETS] [--keep-intermediate] [--noise-model NOISE_MODEL] 
                   [--mean-fn MEAN_FN] [--accept-cpu] [--test-covar-options] [--low-memory-option] 
                   [--very-low-memory-option] [--dont-use-image-mask] [--do-over-with-contrast] [--tilt-series] 
                   [--tilt-series-ctf TILT_SERIES_CTF] [--dose-per-tilt DOSE_PER_TILT] [--angle-per-tilt ANGLE_PER_TILT] 
                   [--only-mean] [--ntilts NTILTS]
                   particles

positional arguments:
  particles             Input particles (.mrcs, .star, .cs, or .txt)

optional arguments:
  -h, --help            Show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory to save model
  --zdim ZDIM           Dimensions of latent variable. Default = 1, 2, 4, 10, 20
  --poses POSES         Image poses (.pkl)
  --ctf pkl             CTF parameters (.pkl)
  --mask mrc            Solvent mask (.mrc). Options: from_halfmaps, sphere, none
  --focus-mask mrc      Focus mask (.mrc)
  --mask-dilate-iter MASK_DILATE_ITER
                        Number of iterations to dilate solvent and focus mask
  --correct-contrast    Estimate and correct for amplitude scaling (contrast) variation across images
  --ignore-zero-frequency
                        Ignore zero frequency. Useful if images are normalized to 0 mean
  --tilt-series         Use tilt series data
  --tilt-series-ctf TILT_SERIES_CTF
                        Specify CTF for tilt series. Default is the same as tilt series
  --dose-per-tilt DOSE_PER_TILT
                        Specify dose per tilt
  --angle-per-tilt ANGLE_PER_TILT
                        Specify angle per tilt
  --only-mean           Only compute mean
  --ntilts NTILTS       Number of tilts to use per tilt series. Default is all

Dataset loading:
  --ind PKL             Filter particles by these indices (.pkl)
  --uninvert-data UNINVERT_DATA
                        Invert data sign: Options are true, false, automatic (default). 
                        Automatic will swap signs if sum(estimated mean) < 0
  --lazy                Use lazy loading if the full dataset is too large to fit in memory
  --datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file
  --n-images N_IMAGES   Number of images to use (useful for quick runs)
  --padding PADDING     Real-space padding
  --halfsets HALFSETS   Path to a file with indices of split dataset (.pkl)
  --keep-intermediate   Save intermediate results (useful for debugging)
  --noise-model NOISE_MODEL
                        Noise model to use. Options: radial (default), white
  --mean-fn MEAN_FN     Mean function to use. Options: triangular (default), old, triangular_reg
  --accept-cpu          Accept running on CPU if no GPU is found
  --test-covar-options  Test different covariance estimation options (for development)
  --low-memory-option   Use low memory options for covariance estimation
  --very-low-memory-option
                        Use very low memory options for covariance estimation
  --dont-use-image-mask Do not use image mask
  --do-over-with-contrast
                        Run again once contrast is estimated
```

</details>

### Required Arguments:

- **particles**: Input particle stack file (.mrcs, .star, .cs, or .txt)
- **`-o OUTDIR, --outdir OUTDIR`**: Output directory to save the model
- **`--poses POSES`**: Image poses file (.pkl)
- **`--ctf pkl`**: CTF parameters file (.pkl)
- **`--mask mrc`**: Solvent mask file (.mrc). Options: from_halfmaps, sphere, none (no mask used).

### Commonly Used Optional Arguments:

- **`--focus-mask mrc`**: Specify a focus mask file (.mrc). If using only a solvent mask, pass it with `--mask`. If only a focus mask is available, you can use `--mask=sphere`.
- **`--mask-dilate-iter MASK_DILATE_ITER`**: Number of iterations to dilate the mask (default = 0).
- **`--zdim ZDIM`**: Dimensions of the PCA to use for embedding. You can specify one integer (`--zdim=20`) or a comma-separated list (`--zdim=10,50,100`). Default is `--zdim=1,2,4,10,20` with no regularization.
- **`--only-mean`**: Only compute the mean. Fast and useful to make dataset is properly preprocessed.


## IV. Analyzing results

After the pipeline is run, you can find the mean, eigenvectors, variance maps, and embeddings in the `outdir/results` directory, where outdir is the option given above by `-o`. You can run some standard analysis by running:

    recovar analyze [pipeline_output_dir] --zdim=10

It will run k-means, generate volumes corresponding to the centers, generate trajectories between pairs of cluster centers, and run UMAP. See more input details below.

If you want to sample many points of the distribution (e.g 100-200), and want to do it fast, you can use `--n-clusters=200` and reduce run time (at the cost of resolution) with `--n-bins=10`. Afterwards, you can recompute the same cluster to higher resolution using `compute_state`.

### Usage of `analyze`

**Command**:

```bash
recovar analyze [result_dir] --zdim=10
```

**Positional Arguments:**

- `result_dir`: Path to the result directory (output directory of the pipeline).

**Required Arguments:**

- `--zdim ZDIM`: Dimension of the latent variable (must be a single integer, not a list).

<details><summary>Optional Arguments</summary>

- `-o OUTDIR, --outdir OUTDIR`: Output directory to save model. If not provided, results will be saved in `result_dir/analysis_{zdim}/`.
- `--n-clusters N_CLUSTERS`: Number of k-means clusters (default: 40).
- `--n-trajectories N_TRAJECTORIES`: Number of trajectories to compute between k-means clusters (default: 0).
- `--skip-umap`: Skip UMAP embedding (useful for large datasets where UMAP can be slow).
- `--skip-centers`: Skip generating volumes of the k-means cluster centers.
- `--n-vols-along-path N_VOLS_ALONG_PATH`: Number of volumes to compute along each trajectory (default: 6).
- `--Bfactor BFACTOR`: B-factor for sharpening (default: 0).
- `--n-bins N_BINS`: Number of bins for kernel regression.
- `--maskrad-fraction MASKRAD_FRACTION`: Radius of mask used in kernel regression. Default is 20, which means radius = `grid_size/20` pixels, or `grid_size * voxel_size / 20` Å.
- `--n-min-images N_MIN_IMAGES`: Minimum number of images required for kernel regression. Default is 100 for SPA and 10 particles for tilt series.
- `--density DENSITY`: Path to a `.pkl` file containing density data with keys `density` and `latent_space_bounds`.
- `--normalize-kmeans`: Normalize latent variables (`z`) before computing k-means clustering.
- `--no-z-regularization`: Use latent variables without regularization (e.g., use `2_noreg` instead of `2`).
- `--lazy`: Enable lazy loading if the full dataset is too large to fit in memory.

</details>

### Generating Volumes at Specific Points in Latent Space

To generate volumes at specific coordinates in latent space, use the `compute_state` script. Ensure you specify the coordinates using a `.txt` file readable by `np.loadtxt` and provide a B-factor if sharpening is required.

**Command**:

```bash
recovar compute_state [pipeline_output_dir] -o [volume_output_dir] --latent-points [zfiles.txt] --Bfactor=[Bfac]
```

**Positional Arguments:**

- `result_dir`: The output directory provided to `pipeline` using the `--o` option. For `analyze`, this is set by default to `result_dir/output/analysis_[zdim]`.

**Required Arguments:**

- `--latent-points LATENT_POINTS`: Path to latent points (`.txt` file). You can use the output of k-means, such as `output/analysis_2/centers.txt` from `analyze`, or create your own latent points. The file should have a shape of `(n_points, zdim)`.

<details><summary>Optional Arguments</summary>

- `-o OUTDIR, --outdir OUTDIR`: Output directory to save all outputs. For `analyze`, it defaults to `result_dir/output/analysis_[zdim]`. For other functions, this argument is required.
- `--Bfactor BFACTOR`: B-factor for sharpening. The B-factor of the consensus reconstruction is likely a good guess. Default is 0, meaning no sharpening.
- `--n-bins N_BINS`: Number of bins for kernel regression. Default is 50, which works well for most cases. This setting was used to generate all figures in the paper.
- `--maskrad-fraction MASKRAD_FRACTION`: Radius of mask used in kernel regression. Default is 20, meaning `radius = grid_size/20` pixels or `grid_size * voxel_size / 20` Å. The default setting works well for most cases. This was used for all figures in the paper. If using cryo-ET or very noisy (or very non-noisy) data, you might want to adjust this value accordingly. For low-resolution data (less than 128x128 images), consider increasing this value. For high-resolution data (more than 512x512 images), consider decreasing it.
- `--n-min-images N_MIN_IMAGES`: Minimum number of images required for kernel regression. The default is 100 for SPA and 10 particles for tilt series. The default works well for most cases, as it was used to generate all figures in the paper. For cryo-ET or very noisy (or very non-noisy) data, consider adjusting this value.
- `--zdim1`: Enable this if dimension 1 is used. This setting addresses a specific issue with `np.loadtxt`.
- `--no-z-regularization`: If set, uses latent variables without regularization.
- `--lazy`: Enables lazy loading when the full dataset is too large to fit in memory.
- `--particles PARTICLES`: Path to the particle stack dataset. If not specified, the same stack as provided to `pipeline` will be used. Use this option to utilize a higher-resolution stack.
- `--datadir DATADIR`: Path prefix to the particle stack if loading relative paths from a `.star` or `.cs` file. Similar to the `--datadir` option in `pipeline`. If not specified, the same stack as provided to `pipeline` will be used. Use this option to load a higher-resolution stack.

</details>



### Generating and Analyzing Trajectories in Latent Space

To compute a high-density (low free energy) trajectory in latent space and generate corresponding volumes, use the `compute_trajectory` script.

**Command**:

```bash
recovar compute_trajectory [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] [options]
```

#### Positional Arguments:

- **`result_dir`**:
  The output directory provided to `pipeline` using the `-o` option. This is where the results from `pipeline` are stored.

#### Required Arguments:

- **`-o OUTDIR, --outdir OUTDIR`**:
  Output directory to save all results. This is a **required** argument.

- **`--zdim ZDIM`**:
  Dimension of the latent variable (a single integer). This should match the `zdim` used in the pipeline.

#### Endpoint Specification (One of the following is required):

To define the start and end points of the trajectory in latent space, you need to specify endpoints using one of the following options:

- **`--ind IND`**:
  Indices of points in the list of coordinates to use as endpoints. Provide indices as a comma-separated list, e.g., `--ind 0,1`.

  This option is used in conjunction with the `--endpts` option.

- **`--endpts ENDPTS_FILE`**:
  Path to a file containing endpoint coordinates in latent space. The file should be a `.txt` file with at least two rows, each containing the coordinates of an endpoint. If the file has more than two lines, it will use the first two lines as endpoints by default. If you want to specify different lines, use the `--ind` option to specify the indices of the lines to use.

- **`--z_st Z_ST_FILE`** and **`--z_end Z_END_FILE`**:
  Paths to files containing the starting point (`z_st`) and ending point (`z_end`) coordinates in latent space, respectively. Each file should be a `.txt` file containing a single row of coordinates.

#### Optional Arguments:

- **`--density DENSITY`**:
  Path to the density file saved in `.pkl` format, containing keys `'density'` and `'latent_space_bounds'`. This is typically generated by `estimate_conformational_density`, e.g., `density/deconv_density_knee.pkl`.

- **`--n-vols-along-path N_VOLS_ALONG_PATH`**:
  Number of volumes to compute along the trajectory. Default is 6.

- **`--Bfactor BFACTOR`**:
  B-factor for sharpening. The B-factor of the consensus reconstruction is a good estimate. Default is 0 (no sharpening).

- **`--n-bins N_BINS`**:
  Number of bins for kernel regression. Default is 50.

- **`--maskrad-fraction MASKRAD_FRACTION`**:
  Radius of mask used in kernel regression. Default is 20, which means `radius = grid_size / 20` pixels, or `grid_size * voxel_size / 20` Å.

- **`--n-min-images N_MIN_IMAGES`**:
  Minimum number of images required for kernel regression. Default is 100 for SPA data.

- **`--lazy`**:
  Use lazy loading if the dataset is too large to fit in memory.

- **`--particles PARTICLES`**:
  Path to the particle stack dataset. If not specified, the same stack used in `pipeline` will be used. Use this option if you want to use a higher-resolution stack.

- **`--datadir DATADIR`**:
  Path prefix to particle stack if loading relative paths from a `.star` or `.cs` file.

- **`--zdim1`**:
  Enable this if using a 1-dimensional latent space. This addresses a specific issue with `np.loadtxt`.

- **`--override_z_regularization`**:
  Override the `z` regularization setting from the pipeline. Use with caution.

#### Examples:

1. **Compute trajectory using indices in a coordinate file**:

   If you have a coordinate file (e.g., `kmeans_center_coords.txt` generated by `analyze`), you can specify which lines (points) to use as endpoints using `--ind`.

   ```bash
   recovar compute_trajectory [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
       --density [density.pkl] --endpts kmeans_center_coords.txt --ind 0,1
   ```

   This will use the first two lines (indices 0 and 1) in `kmeans_center_coords.txt` as the endpoints.

2. **Compute trajectory using endpoint coordinates from a file**:

   If your endpoint file has exactly two lines, you can simply specify:

   ```bash
   recovar compute_trajectory [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
       --density [density.pkl] --endpts endpoints.txt
   ```

   If the file has more than two lines, and you want to use specific lines, use `--ind` to specify the indices.

3. **Compute trajectory using specific start and end point files**:

   If you have separate `.txt` files for the starting point (`z_st.txt`) and ending point (`z_end.txt`):

   ```bash
   recovar compute_trajectory [pipeline_output_dir] -o [volume_output_dir] --zdim [dimension] \
       --density [density.pkl] --z_st z_st.txt --z_end z_end.txt
   ```

#### Additional Notes:

- The `--density` option is important if you want to compute trajectories that follow regions of high conformational density (low free energy). The density file can be generated using `estimate_conformational_density`.

- Ensure that the `zdim` matches the dimension used in the pipeline and in density estimation.

- The volumes generated along the trajectory will be saved in the specified output directory (`--outdir`), typically under subdirectories like `vol0000`, `vol0001`, etc.

#### Help Message:

For more details and options, you can view the help message by running:

```bash
recovar compute_trajectory -h
```

**Help Output:**

```
usage: compute_trajectory [-h] -o OUTDIR [--zdim ZDIM] [--Bfactor BFACTOR] [--n-bins N_BINS]
                             [--maskrad-fraction MASKRAD_FRACTION] [--n-min-images N_MIN_IMAGES] [--zdim1]
                             [--no-z-regularization] [--override_z_regularization] [--lazy]
                             [--particles PARTICLES] [--datadir DATADIR] [--n-vols-along-path N_VOLS_ALONG_PATH]
                             [--density DENSITY] [--ind IND] [--endpts ENDPTS_FILE]
                             [--z_st Z_ST_FILE] [--z_end Z_END_FILE]
                             result_dir

positional arguments:
  result_dir            Output directory provided to pipeline using the --o option.

optional arguments:
  -h, --help            Show this help message and exit.
  -o OUTDIR, --outdir OUTDIR
                        Output directory to save all outputs. This is a required argument.
  --zdim ZDIM           Dimension of latent variable (a single int, not a list). Must match the zdim used in the pipeline.
  --Bfactor BFACTOR     B-factor sharpening. The B-factor of the consensus reconstruction is probably a good guess. Default is 0, which means no sharpening.
  --n-bins N_BINS       Number of bins for kernel regression. Default is 50 and works well for most cases.
  --maskrad-fraction MASKRAD_FRACTION
                        Radius of mask used in kernel regression. Default = 20, which means radius = grid_size/20 pixels, or grid_size * voxel_size / 20 angstrom.
  --n-min-images N_MIN_IMAGES
                        Minimum number of images to compute kernel regression. Default = 100 for SPA, and 10 particles for tilt series.
  --zdim1               Enable if using a 1-dimensional latent space. Addresses an issue with np.loadtxt.
  --no-z-regularization
                        Use unregularized latent variables. This should be consistent with how the density was estimated.
  --override_z_regularization
                        Override the z regularization setting from the pipeline. Use with caution.
  --lazy                Use lazy loading if the dataset is too large to fit in memory.
  --particles PARTICLES
                        Particle stack dataset. If not specified, the same stack as provided to pipeline will be used. Use this option if you want to use a higher resolution stack.
  --datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file.
  --n-vols-along-path N_VOLS_ALONG_PATH
                        Number of volumes to compute along each trajectory (default 6)
  --density DENSITY     Density saved in pkl file, keys are 'density' and 'latent_space_bounds'.
  --ind IND             Indices of points in the list of coordinates to use as endpoints. Provide as a comma-separated list, e.g., --ind 0,1.
  --endpts ENDPTS_FILE  Endpoints file (txt). If it has more than 2 lines, it will use the first two lines as endpoints. If you want to specify different lines, use --ind.
  --z_st Z_ST_FILE      Starting point file (txt).
  --z_end Z_END_FILE    Ending point file (txt).
```

---

## Extracting image subset based on volumes


### Overview

This command will automatically figure out which images produced a particular feature of a volume generated by RECOVAR. You may want to use these images for a downstream tasks, or to re-import into RELION/cryosparc.

To do this, you should use the command `extract_image_subset` as described below, which will output a .pkl file that contains the corresponding subset of images.
You can then use `cryodrgn_utils write_star` and `cryodrgn write_cs` to do that after running `extract_image_subset`.

E.g., you can do 
```bash
  conda activate cryodrgn 
  cryodrgn_utils write_star path_to_original_starfile.starfile --ind path_to_pkl_file.pkl -o path_to_new_starfile.starfile
  conda activate recovar
```
This activates the cryodrgn environment, generates the new starfile from the old one (by picking out only the correct rows) and then reactivate the recovar environment. You can see more details about cryodrgn_utils write_star [here](https://github.com/ml-struct-bio/cryodrgn/blob/main/cryodrgn/commands_utils/write_star.py) and write_cs [here](https://github.com/ml-struct-bio/cryodrgn/blob/main/cryodrgn/commands_utils/write_cs.py).


The volume generation method of RECOVAR uses different sets of images to generate different parts of the volume. Thus, you need to tell this function which part of the volume is of interest. This can be done by giving a small mask around a region (note that it is really the center of mass of the mask which is used, so if you give a strange non convex mask, it will give strange result). 

### Usage

```bash
recovar extract_image_subset --input-dir [input_directory] --output [output_path] --subvol-idx [subvolume_idx] --mask [mask_file] --coordinate [x,y,z]
```

### Parameters

- **`--input-dir`** *(required)*: 
  - Type: `str`
  - Description: Path to the directory containing the volume data, typically generated by volume-processing scripts like analyze, compute_state, or compute_trajectory. input-dir should point to a folder that looks like `\vol00**\`. For example, if you ran pipeline then analyze --zdim 20, this folder should exist: `[pipeline_output_dir]/output/analysis_20/centers/vol0000/`, which contains all the parameters of the volume stored in `[pipeline_output_dir]/output/analysis_20/centers/all_volumes/vol0000.mrc`.

- **`--output`** *(required)*:
  - Type: `str`
  - Description: Output path where the `.pkl` file containing the indices of the extracted image subset will be saved.

- **`--subvol-idx`** *(optional)*:
  - Type: `int`
  - Description: Specifies the subvolume index to use for extracting the subset of images. This index directs the function to focus on a particular subvolume.

- **`--mask`** *(optional)*:
  - Type: `str` (file path)
  - Description: Path to an MRC file that defines a mask around the feature of interest. If provided, the center of mass of the mask is calculated, and the subset of images closest to this center is selected.

- **`--coordinate`** *(optional)*:
  - Type: `list of int`
  - Description: Specifies the pixel coordinates of the feature for which images are to be extracted. This should be provided in the format `x,y,z`.

### Example Commands

1. **Extracting by Subvolume Index:**

   ```bash
   recovar extract_image_subset --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --subvol-idx 3
   ```

2. **Extracting by Mask:**

   ```bash
   recovar extract_image_subset --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --mask "/path/to/mask.mrc"
   ```

3. **Extracting by Coordinates:**

   ```bash
   recovar extract_image_subset --input-dir "/path/to/volume_data" --output "/path/to/output_indices.pkl" --coordinate 50,50,50
   ```

### Notes

- Ensure that the correct path is provided for both the `--input-dir` and `--output` parameters.
- Only one of `--subvol-idx`, `--mask`, or `--coordinate` should be provided to define the criteria for image extraction. Providing more than one of these parameters will result in an error.


## Command: `extract_image_subset_from_kmeans`

### Overview

This command extracts a subset of images based on the k-means clustering indices from a specified `.pkl` file. The images can be selected either by directly specifying the k-means indices to keep, or by excluding the specified indices using the inverse option.

### Usage

```bash
recovar extract_image_subset_from_kmeans [path_to_centers] [output_path] [kmeans_indices] [-i]
```

### Parameters

- **`path_to_centers`** *(required)*: 
  - Type: `str`
  - Description: Path to the `centers.pkl` file. This file is usually found in the folder `[recovar_output_folder]/output/analysis_[zdim]/centers.pkl` and contains the labels of the k-means clustering.

- **`output_path`** *(required)*:
  - Type: `str`
  - Description: Path where the output `.pkl` file containing the indices of the extracted image subset will be saved.

- **`kmeans_indices`** *(required)*:
  - Type: `list of int`
  - Description: A comma-separated list of k-means indices that you wish to keep. For example, `20,30,50` would keep the images that belong to these k-means clusters.

- **`-i, --inverse`** *(optional)*:
  - Type: `bool`
  - Description: If this option is provided, the function will exclude the images that correspond to the specified k-means indices. By default, the function keeps the images that correspond to the specified indices.

### Example Commands

1. **Extracting images corresponding to specific k-means clusters:**

   ```bash
   recovar extract_image_subset_from_kmeans /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50
   ```

2. **Excluding specific k-means clusters:**

   ```bash
   recovar extract_image_subset_from_kmeans /path/to/centers.pkl /path/to/output_indices.pkl 20,30,50 -i
   ```

### Notes

- The `path_to_centers` must be a valid `.pkl` file containing the k-means clustering labels, typically located at `[recovar_output_folder]/output/analysis_[zdim]/centers.pkl`.
- The `output_path` must also be a `.pkl` file where the subset of indices will be saved.
- Only one of the `--inverse` option should be used. If provided, the images not in the specified clusters will be saved.


## Command: `estimate_conformational_density`

### Overview

This script estimates the conformational density from RECOVAR results using a deconvolution approach. It processes the results in a specified principal component space and outputs the estimated densities along with related plots.

Note that this conformational density works better if you have a relatively simple distribution, which can be well represented in a small number of PCS (<=4). Often this is achievable by using focused mask to focus the analysis on a part of the biomoecule.

Note that the density estimation requires a regularization parameter $\alpha$ in the paper. By default, the code will try a range of different $\alpha$'s and estimate the correct one by finding the knee in the L-curve. This does not always work great, so you should look at the `all_densities.png` plot and the `L-curve.png` plots to make sure the densities don't look overfit at the selected knee. The estimated densities with all values of $\alpha$ are stored in `all_densities/`, and you can select the corresponding one if you do not think the knee selection is accurate, e.g. if you think the `a[3]` is the most accurate density on the `all_densities.png` plot, select the density `all_densities/deconv_density_3.pkl` for further analysis (e.g. by passing it to `compute_trajectory`).

<!-- See the RECOVAR paper for the theory and explanation of the outputs. -->

### Usage

```bash
recovar estimate_conformational_density [recovar_result_dir] 
```

**Positional Arguments:**

- `recovar_result_dir`  
  Directory containing RECOVAR results provided to `pipeline`.



<details><summary>Optional Arguments</summary>

- `--pca_dim PCA_DIM`  
Dimension of PCA space in which the density is estimated (default 4). The runtime increases exponentially with this number, so <=5 is recommended.

- `--z_dim_used Z_DIM_USED`  
  Dimension of latent variable used (default 4). Should be at least as big as pca_dim, and should be one of the dims used in analyze

- `--output_dir OUTPUT_DIR`  
  Directory to save the density estimation results. Default is `recovar_result_dir/density/`.


- `--percentile_reject PERCENTILE_REJECT`  
  Percentile of data to reject due to large covariance (default: 10%).

- `--num_disc_points NUM_DISC_POINTS`  
  Number of discretization points in each dimension for the grid density estimation (default: 50). For example, `50^4` points for a 4-dimensional grid.

- `--alphas ALPHAS`  
  List of alphas for regularization. Provide as space-separated values, e.g., `--alphas 1e-9 1e-8 1e-7 1e-1`. If not provided, defaults to values from `1e-9` to `1e1` in logarithmic spacing.

- `--percentile_bound PERCENTILE_BOUND`  
  Reject zs with coordinates above this percentile when deciding the bounds of the grid (default: 1%).

</details>

### Example Commands

1. **Basic Density Estimation:**

   ```bash
   recovar estimate_conformational_density /path/to/recovar_results --pca_dim 3
   ```

2. **Advanced Example with Custom Parameters:**

   ```bash
   recovar estimate_conformational_density /path/to/recovar_results --z_dim_used 4 \
    --pca_dim 2 \
     --output_dir /path/to/output \
     --percentile_reject 15 \
     --num_disc_points 100 \
     --alphas 1e-9 1e-5 1e-3 1e-1 \
     --percentile_bound 5
   ```

### Notes

<!-- - The `--z_dim_used` argument is **required** and specifies the dimension of the latent variable used for the analysis. -->

- The dimensionality of the deconvolved space (`--pca_dim`) determines the runtime complexity, which grows exponentially with higher dimensions.

- Ensure that the correct `recovar_result_dir` is provided, as it should contain the necessary results generated by `pipeline`.

- The `--alphas` parameter allows for fine-tuning regularization, and the script will automatically select an optimal alpha based on the L-curve.

### Output

The script will create a `density/` directory inside the `recovar_result_dir` (or in the specified output directory), which includes:

- **`all_densities/`**: Contains density estimations for different alphas.
- **`deconv_density_knee.pkl`**: Density file corresponding to the optimal alpha determined by the knee point of the L-curve.
- **`all_densities.png`**: Visualization of all densities across different alphas.
- **`Lcurve.png`**: L-curve plot used for selecting the optimal alpha.


## V. Visualizing results

### Output Structure

After running the RECOVAR pipeline and the subsequent analysis scripts (e.g., `analyze`, `compute_state`, `compute_trajectory`, `estimate_conformational_density`), the outputs will be organized in a structured directory tree. Understanding this structure will help you navigate the results and locate the files of interest.

If you are running on a remote server and wish to visualize the results locally, it is recommended to copy only the `[output_dir]/output` directory to your local machine, as the model files can be quite large. You can then use visualization tools like ChimeraX to inspect the generated volumes.

Below is the updated directory tree structure (click to expand):

<details><summary><strong>Output File Structure</strong></summary>

```
.
├── analysis_2_noreg                   # Generated by `analyze` with zdim=2
│   ├── contrast_histogram.png         # Histogram of contrast values
│   ├── density_plots                  # Density plots from `estimate_conformational_density`
│   │   └── density_01.png
│   ├── density_plots_sliced           # Sliced density plots
│   │   └── density_01.png
│   ├── kmeans_center_coords.txt       # Coordinates of k-means cluster centers in latent space
│   ├── kmeans_center_volumes          # Volumes corresponding to k-means cluster centers
│   │   ├── all_volumes
│   │   │   ├── locres0000.mrc
│   │   │   ├── locres0001.mrc
│   │   │   ├── locres0002.mrc
│   │   │   ├── vol0000.mrc
│   │   │   ├── vol0001.mrc
│   │   │   └── vol0002.mrc
│   │   ├── latent_coords.txt          # Latent coordinates of volumes
│   │   ├── vol0000                    # Subdirectories for each cluster center volume
│   │   │   ├── half1_unfil.mrc
│   │   │   ├── half2_unfil.mrc
│   │   │   ├── heterogeneity_distances.txt
│   │   │   ├── locres_filtered.mrc
│   │   │   ├── locres_filtered_nob.mrc
│   │   │   ├── locres.mrc
│   │   │   ├── params.pkl
│   │   │   ├── split_choice.pkl
│   │   │   ├── unfil.mrc
│   │   │   └── volume_sampling.mrc
│   │   ├── vol0001
│   │   │   └── [same structure as vol0000]
│   │   └── vol0002
│   │       └── [same structure as vol0000]
│   ├── kmeans_result.pkl              # Results of k-means clustering
│   ├── PCA                            # Principal Component Analysis results
│   │   ├── PC_01.png
│   │   └── PC_01no_annotate.png
│   ├── run.log                        # Log file from `analyze`
│   ├── traj0                          # Trajectory directory generated by `compute_trajectory`
│   │   ├── all_volumes
│   │   │   ├── locres0000.mrc
│   │   │   ├── locres0001.mrc
│   │   │   ├── locres0002.mrc
│   │   │   ├── locres0003.mrc
│   │   │   ├── locres0004.mrc
│   │   │   ├── locres0005.mrc
│   │   │   ├── vol0000.mrc
│   │   │   ├── vol0001.mrc
│   │   │   ├── vol0002.mrc
│   │   │   ├── vol0003.mrc
│   │   │   ├── vol0004.mrc
│   │   │   └── vol0005.mrc
│   │   ├── density
│   │   │   └── density_01.png
│   │   ├── latent_coords.txt
│   │   ├── path.json                  # Trajectory path in latent space
│   │   ├── vol0000
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0001
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0002
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0003
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   ├── vol0004
│   │   │   └── [volume files as in kmeans_center_volumes]
│   │   └── vol0005
│   │       └── [volume files as in kmeans_center_volumes]
│   ├── trajectory_endpoints.pkl       # Endpoints used for trajectory computation
│   └── umap                           # UMAP embedding results
│       ├── kmeans_centers.png
│       ├── kmeans_centers_no_annotate.png
│       ├── sns.png
│       ├── sns_hex.png
│       └── umap_embedding.pkl
├── command.txt                        # Command used to run the pipeline
├── density                            # Generated by `estimate_conformational_density`
│   ├── all_densities
│   │   ├── deconv_density_0.pkl
│   │   ├── deconv_density_1.pkl
│   │   ├── deconv_density_2.pkl
│   │   ├── deconv_density_3.pkl
│   │   ├── deconv_density_4.pkl
│   │   ├── deconv_density_5.pkl
│   │   ├── deconv_density_6.pkl
│   │   ├── deconv_density_7.pkl
│   │   ├── deconv_density_8.pkl
│   │   ├── deconv_density_9.pkl
│   │   └── deconv_density_10.pkl
│   ├── all_densities.png              # Visualization of all densities across alphas
│   ├── deconv_density_knee.pkl        # Optimal density selected via L-curve
│   └── Lcurve.png                     # L-curve plot for regularization selection
├── model                              # Generated by `pipeline`
│   ├── covariance_cols.pkl            # Covariance matrices
│   ├── embeddings.pkl                 # Embeddings of the data
│   ├── halfsets.pkl                   # Information about half-sets
│   ├── params.pkl                     # Parameters used during pipeline execution
│   └── particles_halfsets.pkl         # Indices of particles in each half-set
├── output
│   ├── plots                          # Plots generated during the pipeline
│   │   ├── contrast_histogram.png     # Histogram of contrast values
│   │   ├── eigenvalues.png            # Eigenvalues plot
│   │   ├── mean_fsc.png               # Fourier Shell Correlation of the mean volume
│   │   └── mean_variance_eigenvolume_plots.png
│   └── volumes                        # Volumes generated by `pipeline`
│       ├── dilated_mask.mrc           # Dilated version of the mask
│       ├── eigen_neg*.mrc             # Negative eigenvolumes (principal components)
│       ├── eigen_pos*.mrc             # Positive eigenvolumes (principal components)
│       ├── focus_mask.mrc             # Focus mask if provided
│       ├── mask.mrc                   # Solvent mask used
│       ├── mean.mrc                   # Mean volume
│       ├── mean_filt.mrc              # Filtered mean volume
│       ├── mean_half1_unfil.mrc       # Unfiltered mean volume from half-set 1
│       ├── mean_half2_unfil.mrc       # Unfiltered mean volume from half-set 2
│       ├── variance4.mrc              # Variance map for zdim=4
│       ├── variance10.mrc             # Variance map for zdim=10
│       └── variance20.mrc             # Variance map for zdim=20
├── run.log                            # Log file from the pipeline execution
└── trajectory1                        # Generated by `compute_trajectory`
    ├── density
    │   └── density_01.png
    ├── path.json                      # Trajectory path in latent space
    └── vol0000
        └── [volume files as in kmeans_center_volumes]
```

</details>

#### Directory and File Descriptions

- **`analysis_2_noreg/`**: Results from `analyze` with `zdim=2` and no regularization.
  - **`contrast_histogram.png`**: Histogram of estimated contrast values across images.
  - **`density_plots/`**: Density plots generated from conformational density estimation.
  - **`kmeans_center_coords.txt`**: Coordinates of k-means cluster centers in latent space.
  - **`kmeans_center_volumes/`**: Contains volumes corresponding to k-means cluster centers.
    - **`all_volumes/`**: Aggregated volumes for all centers.
      - `vol****.mrc`: Volumes for each cluster center.
      - `locres****.mrc`: Locally filtered volumes.
    - **`latent_coords.txt`**: Latent space coordinates of the volumes.
    - **`vol****/`**: Subdirectories for each cluster center containing detailed volume files.
  - **`kmeans_result.pkl`**: Results of the k-means clustering, including labels and centers.
  - **`PCA/`**: Principal Component Analysis plots.
    - `PC_01.png`: Plot of the first principal component.
    - `PC_01no_annotate.png`: Same plot without annotations.
  - **`traj0/`**: Trajectory results from `compute_trajectory`.
    - **`all_volumes/`**: Volumes along the trajectory.
    - **`density/`**: Density plots along the trajectory.
    - **`path.json`**: JSON file containing the trajectory path.
    - **`vol****/`**: Volumes at specific points along the trajectory.
  - **`trajectory_endpoints.pkl`**: Endpoints used for trajectory computations.
  - **`umap/`**: UMAP embedding results.
    - `sns.png`, `sns_hex.png`: Scatter plots of the embeddings.
    - `kmeans_centers.png`: UMAP plot showing k-means centers.
    - `umap_embedding.pkl`: UMAP embedding coordinates.

- **`command.txt`**: Contains the command used to run the pipeline, useful for record-keeping and reproducing results.

- **`density/`**: Generated by `estimate_conformational_density`.
  - **`all_densities/`**: Deconvolved densities for different regularization parameters (`alphas`).
    - `deconv_density_*.pkl`: Density files for each alpha value.
  - **`all_densities.png`**: Visualization of all densities across different alphas.
  - **`deconv_density_knee.pkl`**: Deconvolved density corresponding to the optimal regularization parameter (knee point).
  - **`Lcurve.png`**: L-curve plot used for selecting the optimal alpha.

- **`model/`**: Generated by `pipeline`.
  - `covariance_cols.pkl`: Covariance matrices of the columns.
  - `embeddings.pkl`: Embeddings of the data.
  - `halfsets.pkl`: Information about half-sets used in the analysis.
  - `params.pkl`: Parameters used during the pipeline execution.
  - `particles_halfsets.pkl`: Indices of particles in each half-set.

- **`output/`**: Main output directory containing results and volumes.
  - **`plots/`**: Various plots generated during the pipeline.
    - `contrast_histogram.png`: Histogram of estimated contrast values.
    - `eigenvalues.png`: Plot showing the decay of eigenvalues.
    - `mean_fsc.png`: Fourier Shell Correlation of the mean volume.
    - `mean_variance_eigenvolume_plots.png`: Combined plots of mean, variance, and eigenvolumes.
  - **`volumes/`**: Volumes generated by `pipeline`.
    - `mean.mrc`: The mean volume reconstructed from the data.
    - `mean_filt.mrc`: Filtered mean volume.
    - `mean_half1_unfil.mrc` & `mean_half2_unfil.mrc`: Unfiltered mean volumes from half-sets.
    - `variance*.mrc`: Variance maps computed for different dimensions.
    - `eigen_pos*.mrc` & `eigen_neg*.mrc`: Positive and negative eigenvolumes representing principal components of variability.
    - `mask.mrc`: The solvent mask used during reconstruction.
    - `focus_mask.mrc`: The focus mask if provided.
    - `dilated_mask.mrc`: Dilated version of the mask.

- **`run.log`**: Log file containing messages and errors during the pipeline execution.

- **`trajectory1/`**: Generated by `compute_trajectory`.
  - **`density/`**: Density plots along the trajectory.
    - `density_01.png`: Density plot for the first pair of dimensions.
  - **`path.json`**: JSON file containing the trajectory path in latent space.
  - **`vol0000/`**: Volume at the starting point of the trajectory.
    - [Contains volume files as in `kmeans_center_volumes/vol****/`]

#### Functions and Their Outputs

- **`pipeline`**: Generates the `model/` directory and the `output/` directory.
  - Computes the mean volume, variance maps, eigenvolumes, and saves necessary parameters and embeddings.
  - Generates plots such as eigenvalues and FSC curves.

- **`analyze`**: Creates the `analysis_[zdim]_noreg/` directories.
  - Performs k-means clustering, generating cluster center volumes, and computing UMAP and PCA embeddings.
  - Generates contrast histograms and density plots.

- **`compute_state`**: Generates volumes at specific points in latent space provided by the user.
  - Outputs are stored in directories like `vol****/` under `kmeans_center_volumes/` or specified output directory.

- **`compute_trajectory`**: Produces the `traj0/` directory within `analysis_[zdim]_noreg/` or a separate trajectory directory.
  - Contains volumes along a trajectory in latent space, density plots, and the trajectory path (`path.json`).

- **`estimate_conformational_density`**: Creates the `density/` directory with deconvolved density estimations, plots, and related data.
  - Computes the conformational density in the principal component space and selects the optimal regularization parameter using the L-curve method.

#### Copying and Visualizing Results

For visualization purposes, especially when working remotely, it's recommended to copy only the necessary output directories (e.g., `output/`, `analysis_[zdim]_noreg/`, `trajectory1/`, `density/`) to your local machine. This allows you to use visualization tools without transferring large model files.

You can then use software like ChimeraX to open and visualize the `.mrc` volume files found in these directories.



### Visualization in jupyter notebook

You can visualize the results using [this notebook](output_visualization.ipynb), which will show a bunch of results including:
* the FSC of the mean estimation, which you can interpret as an upper bound of the resolution you can expect. 
* decay of eigenvalues to help you pick the right `zdim`
* and standard clustering visualization (borrowed from the cryoDRGN output).


## Small test dataset

To verify that RECOVAR is installed properly and functioning correctly, you can run a small test using a synthetic dataset. The script `run_test_dataset` is already included in the repository and will:

- Generate a small synthetic dataset.
- Run the RECOVAR pipeline on this dataset.
- Perform analysis, including clustering and embedding.
- Estimate conformational density.
- Compute trajectories in the latent space.
- Report which steps passed or failed.
- Delete the test dataset directory at the end if all steps pass.

### Steps to Run the Test Script

1. **Activate your `recovar` environment**:

   ```bash
   conda activate recovar
   ```

2. **Run the test script**

   Navigate to the directory where RECOVAR is installed and execute:

   ```bash
   recovar run_test_dataset
   ```

   This script is located in the root directory of the RECOVAR repository.

### What the Script Does

The script performs the following steps:

1. **Generates a small synthetic dataset** using `make_test_dataset`.
2. **Runs the RECOVAR pipeline** on the test dataset using `pipeline`.
3. **Analyzes the results** using `analyze`, including clustering and embedding.
4. **Estimates the conformational density** with `estimate_conformational_density`.
5. **Computes trajectories** in the latent space using `compute_trajectory`.
6. **Reports which steps passed or failed**.
7. **Deletes the `test_dataset` directory** at the end if all steps pass.

### Verifying the Results

- **Success Message**: If all steps complete successfully, you'll see:

  ```
  Total steps passed: 6
  Total steps failed: 0

  All functions completed successfully!
  Test dataset directory 'test_dataset' has been deleted.
  ```

- **Failure Notification**: If any steps fail, the script will indicate which functions failed, and you can check the output for details.

  ```
  Total steps passed: 5
  Total steps failed: 1

  The following functions failed:
  - estimate_conformational_density

  Please check the output above for details.
  ```

### Notes

- **Verifying Outputs**: You can compare the volumes generated in `test_dataset/pipeline_output/output/analysis_2_noreg/kmeans_center_volumes/all_volumes` with the simulated volumes in `recovar/data/vol*.mrc` to ensure the pipeline is producing expected results (the order may differ).

- **Cleanup**: The script will delete the `test_dataset` directory at the end if all steps pass. This helps keep your workspace clean.



## TLDR
 <!-- (WIP - Untested) -->
A short example illustrating the steps to run the code on EMPIAR-10076. Assuming you have downloaded the code and have a GPU, the code should take less than an hour to run, and less than 10 minutes if you downsample to 128 instead (exact running time depends on your hardware). and  Read above for more details:

    # Downloaded poses from here: https://github.com/zhonge/cryodrgn_empiar.git
    git clone https://github.com/zhonge/cryodrgn_empiar.git
    cd cryodrgn_empiar/empiar10076/inputs/

    # Download particles stack from here. https://www.ebi.ac.uk/empiar/EMPIAR-10076/ with your favorite method.
    # My method of choice is to use https://www.globus.org/ 
    # Move the data into cryodrgn_empiar/empiar10076/inputs/

    conda activate recovar
    # Downsample images to D=256
    cryodrgn downsample Parameters.star -D 256 -o particles.256.mrcs --chunk 50000

    # Extract pose and ctf information from cryoSPARC refinement
    cryodrgn parse_ctf_csparc cryosparc_P4_J33_004_particles.cs -o ctf.pkl
    cryodrgn parse_pose_csparc cryosparc_P4_J33_004_particles.cs -D 320 -o poses.pkl
    
    # run recovar
    recovar pipeline particles.256.mrcs --ctf --ctf.pkl -poses poses.pkl --o recovar_test 

    # run analysis
    recovar analyze recovar_test --zdim=20 

    # Open notebook output_visualization.ipynb
    # Change the recovar_result_dir = '[path_to_this_dir]/recovar_test' and 

Note that this is different from the one in the paper. Run the following pipeline command to get the one in the paper (runs on the filtered stack from the cryoDRGN paper, and uses a predefined mask):

    # Download mask
    git clone https://github.com/ma-gilles/recovar_masks.git

    recovar pipeline particles.256.mrcs --ctf ctf.pkl --poses poses.pkl -o test-mask --mask recovar_masks/mask_10076.mrc --ind filtered.ind.pkl

<!-- The output should be the same as [this notebook](output_visualization_empiar10076.ipynb). -->

## Using kernel regression with other embeddings

You can generate volumes from embedding not generated by RECOVAR using `reconstruct_from_external_embedding`. E.g., for a cryoDRGN embedding:

    recovar reconstruct_from_external_embedding particles.256.mrcs --poses poses.pkl --ctf ctf.pkl --embedding 02_cryodrgn256/z.24.pkl --o [output_dir] --target zfile.txt


<details><summary><code>$ recovar reconstruct_from_external_embedding -h</code></summary>

    usage: reconstruct_from_external_embedding [-h] -o OUTDIR [--zdim ZDIM] --poses POSES --ctf pkl [--ind PKL] [--uninvert-data UNINVERT_DATA]
                                    [--datadir DATADIR] [--n-images N_IMAGES] [--padding PADDING] [--halfsets HALFSETS]
                                    [--noise-model NOISE_MODEL] [--Bfactor BFACTOR] [--n-bins N_BINS] --embedding EMBEDDING --target
                                    TARGET [--zdim1]
                                    particles

    positional arguments:
    particles             Input particles (.mrcs, .star, .cs, or .txt)

    optional arguments:
    -h, --help            show this help message and exit
    -o OUTDIR, --outdir OUTDIR
                            Output directory to save model
    --zdim ZDIM           Dimensions of latent variable. Default=1,2,4,10,20
    --poses POSES         Image poses (.pkl)
    --ctf pkl             CTF parameters (.pkl)
    --Bfactor BFACTOR     0
    --n-bins N_BINS       number of bins for kernel regression
    --embedding EMBEDDING
                            Image embeddings zs (.pkl), e.g. 00_cryodrgn256/z.24.pkl if you want to use a cryoDRGN embedding.
    --target TARGET       Target zs to evaluate the kernel regression (.txt)
    --zdim1               Whether dimension 1 embedding is used. This is an annoying corner case for np.loadtxt...

    Dataset loading:
    --ind PKL             Filter particles by these indices
    --uninvert-data UNINVERT_DATA
                            Invert data sign: options: true, false (default)
    --datadir DATADIR     Path prefix to particle stack if loading relative paths from a .star or .cs file
    --n-images N_IMAGES   Number of images to use (should only use for quick run)
    --padding PADDING     Real-space padding
    --halfsets HALFSETS   Path to a file with indices of split dataset (.pkl).
    --noise-model NOISE_MODEL
                            what noise model to use. Options are radial (default) computed from outside the masks, and white computed by
                            power spectrum at high frequencies

</details>

## Using the source code

I hope some developers find parts of the code useful for their projects. See [this notebook](recovar_coding_tutorial.ipynb) for a short tutorial. (OUT OF DATE, see [cryoJAX](https://github.com/mjo22/cryojax) for a much better documented JAX cryo-EM code.)

Some of the features which may be of interest:
- The basic building block operations of cryo-EM efficiently, in batch and on GPU: shift images, slice volumes, do the adjoint slicing, apply CTF. See [recovar.core](recovar/core.py). Though I have not tried it, all of these operations should be differentiable thus you could use JAX's autodiff.

- A heterogeneity dataset simulator that includes variations in contrast, realistic CTF and pose distribution (loaded from real dataset), junk proteins, outliers, etc. See [recovar.simulator](recovar/simulator.py).


- A code to go from atomic positions to volumes or images. Does not run on GPU. Thanks to [prody](http://prody.csb.pitt.edu/), if you have an internet connection, you can generate the volume from only the PDB ID. E.g., you can do `recovar.simulate_scattering_potential.generate_molecule_spectrum_from_pdb_id('6VXX', 2, 256)` to generate the volume of the spike protein with voxel size 2 on a 256^3 grid. Note that this code exactly evaluates the Fourier transform of the potential, thus it is exact in Fourier space, which can produce some oscillations artifact in the spatial domain. Also see [cryoJAX](https://github.com/mjo22/cryojax)

- Some other features that aren't very well separated from the rest of the code, but I think could easily be stripped out: trajectory computation [recovar.trajectory](recovar/trajectory.py), per-image mask generation [recovar.covariance_core](recovar/covariance_core.py), regularization schemes [recovar.regularization](recovar/regularization.py), various noise estimators [recovar.noise](recovar/noise.py).

- Some features that are not there (but soon, hopefully): pose search, symmetry handling.

## Tomography

I am developping the tomography extension. You can try it out by passing the the ``--tilt-series``, ``--tilt-series-ctf=v2``, ``--angle-per-tilt=[insert angle here]`` to ``pipeline``. The input should be identical to cryoDRGN-ET ([see here](https://ez-lab.gitbook.io/cryodrgn/cryodrgn-et-subtomogram-analysis)).

E.g, on the M-file:

    recovar pipeline M_particles.star --ctf ctf.pkl --poses pose.pkl -o v2_nocont_$ntilts --datadir=128 --mask=path_to_mask.mrc --tilt-series-ctf=v2  --ntilts=10 --tilt-series  --angle-per-tilt=3.0 --dose-per-tilt=2.93

You can use all tilts by not passing the argument --ntilts.

## Limitations

- *Symmetry*: there is currently no support for symmetry. If you got your poses through symmetric refinement, it will probably not work. It should probably work if you make a symmetry expansion of the particle stack, but I have not tested it.
- *Memory*: RECOVAR uses a lot of memory by default. For a stack of images of size 256, you need approximately 200 GB + size of dataset. You can also use the --lazy option, which will do lazy loading, in which case you need a little more than 200. If you run out of memory, you can use the --low-memory-option, in which case you need 60GB. If even that fails, you can try --very-low-memory-option. Finally, you can downsample the data and run RECOVAR. You can generate states from the full resolution images even if you ran the RECOVAR pipeline on downsampled images by using `compute_state` and passing the `--particles` argument. The memory requirement for the compute_state is approximately `n_bins * volume_size * 32` bytes where n_bins is 50 by default (~26 GB for size 256 images).
- *ignore-zero-frequency*: I haven't thought much about the best way to do this. I would advise against using it for now.
- *Other ones, probably?*: if you run into issues, please let me know. 

## Acknowledgement
A lot of code in RECOVAR is based on cryoDRGN (https://cryodrgn.cs.princeton.edu/) and RELION (https://relion.readthedocs.io/en/release-5.0/). Hopefully, it is clear from the code naming/comments.
Much of this documentation is generated using chatGTP4.

## Citation

If you use this software for analysis, please cite:

    @article{gilles2023bayesian,
      author = {Gilles, Marc Aur{\`e}le and Singer, Amit},
      title = {Cryo-EM Heterogeneity Analysis using Regularized Covariance Estimation and Kernel Regression},
      elocation-id = {2023.10.28.564422},
      year = {2024},
      doi = {10.1101/2023.10.28.564422},
      publisher = {Cold Spring Harbor Laboratory},
    }



## Contact

You can reach me (Marc) at gilles@princeton.edu with questions or comments, or open an issue on Github.

## Dataset from manuscript
Input files to replicate the results in the manuscript can be found in this [dropbox](https://www.dropbox.com/scl/fo/fjo8lwfj0xemqmki1dc30/AKISv4oMtt6q0gKGz59gFxM?rlkey=pgeyhw02w6ngzsv8vzno0vf3j&st=ifpgweyu&dl=0).


## Other relevant github repos:

* [ASPIRE](https://github.com/ComputationalCryoEM/ASPIRE-Python)
* [cryoDRGN](https://github.com/ml-struct-bio/cryodrgn)
* [cryoJAX](https://github.com/mjo22/cryojax) 
* [Dynamight](https://github.com/3dem/DynaMight)
* [Flexibility Hub](https://github.com/scipion-em/scipion-em-continuousflex)
* [RELION](https://github.com/3dem/relion)
