Metadata-Version: 2.4
Name: hicompass
Version: 1.0.2
Summary: Hi-Compass: Depth-aware deep learning framework for cell-type-specific chromatin interaction prediction from ATAC-seq
Home-page: https://github.com/EndeavourSyc/Hi-Compass/
Author: Yuanchen Sun
Author-email: 2247143021@qq.com
License: MIT
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-image
Requires-Dist: pyBigWig
Requires-Dist: cooler
Requires-Dist: pysam
Requires-Dist: piq
Requires-Dist: matplotlib
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary


# Hi-Compass

Hi-Compass is a depth-aware multi-modal deep learning framework for predicting cell-type-specific chromatin interactions from ATAC-seq data.

## Overview

Three-dimensional genome organization controls cell-type-specific gene expression through chromatin interactions. Hi-Compass addresses the challenge of predicting 3D genome structure from chromatin accessibility data by integrating four key inputs:

- ATAC-seq signal 
- ATAC-seq sequencing depth
- DNA sequence (fixed input, provided for hg38 and mm10)
- Generalized CTCF binding profile (fixed input, provided for hg38 and mm10)

The model incorporates a depth-aware module that adapts to variation in sequencing depth, enabling robust predictions across the full range of data coverage, from sparse single-cell to high-coverage bulk profiles. Hi-Compass predicts Hi-C contact matrices at 10 kb resolution with a 2 Mb window size.

## Key Features

- Cell-type-specific Hi-C prediction requiring only ATAC-seq as user-provided input; all other reference data (DNA sequence, generalized CTCF, etc.) are provided or reusable across samples
- Complete preprocessing pipeline and training functionality, allowing users to train models on their own datasets
- Support for human (hg38), mouse (mm10), and other species (custom species require user-provided generalized CTCF data)
- Direct output of balanced .cool files compatible with downstream analysis tools such as cooltools, HiGlass, and Juicebox

## Installation

### Step 1: Install PyTorch

Hi-Compass requires PyTorch but does not install it automatically, as the correct version depends on your system and CUDA configuration.

Please install PyTorch first following the instructions at [pytorch.org](https://pytorch.org/get-started/locally/).

### Step 2: Install Hi-Compass

```bash
pip install hicompass
```

### Step 3: Install training dependencies (for training only)

If you plan to train your own models, install a version of [PyTorch Lightning](https://lightning.ai/) that matches your PyTorch version.

### Step 4: Install preprocessing tools (for preprocessing only)

The preprocessing commands require the following external tools:

```bash
# Using conda
conda install -c bioconda samtools bedtools ucsc-bedgraphtobigwig
```


## Training Data Preparation

### Required Data

Hi-Compass requires the following input data:

| Data Type          | User Provided       | Pre-built Available | Description                                                  |
| ------------------ | ------------------- | ------------------- | ------------------------------------------------------------ |
| ATAC-seq           | Yes                 | -                   | Cell-type-specific BAM file (for training) or BigWig file (for prediction) |
| Hi-C               | Yes (training only) | -                   | 10 kb resolution .cool file                                  |
| DNA sequence       | No                  | hg38, mm10          | One-hot encoded sequences per chromosome                     |
| Generalized CTCF   | No                  | hg38, mm10          | Pan-tissue CTCF binding profile                              |
| Chromosome sizes   | No                  | hg38, mm10          | Chromosome size file                                         |
| Centromere regions | No                  | hg38, mm10          | BED file for filtering (optional)                            |

### Download Pre-built Reference Data

Pre-built reference data for human (hg38) and mouse (mm10) are available at:

[ZENODO_LINK]

Download and organize the files according to the directory structure below.

### Directory Structure

Hi-Compass expects training data to be organized in the following structure (using hg38 as an example):

```
/home/user/hicompass_data/
├── ATAC/
│   └── hg38/
│       ├── GM12878~ATAC~bulk.bw
│       ├── GM12878~ATAC~1e6.bw
│       ├── GM12878~ATAC~5e5.bw
│       ├── K562~ATAC~bulk.bw
│       ├── K562~ATAC~1e6.bw
│       └── K562~ATAC~5e5.bw
├── HiC/
│   └── hg38/
│       ├── GM12878/
│       │   ├── chr1.npz
│       │   ├── chr2.npz
│       │   └── ...
│       └── K562/
│           ├── chr1.npz
│           ├── chr2.npz
│           └── ...
├── DNA/
│   └── hg38/
│       ├── chr1.fa.gz
│       ├── chr2.fa.gz
│       └── ...
├── CTCF/
│   └── hg38/
│       └── generalized_CTCF.bw
└── centromere/
    └── hg38/
        └── centromere.bed
```

ATAC-seq BigWig files follow the naming convention:

```
{CellType}~ATAC~{Depth}.bw
```

Examples:

- `GM12878~ATAC~bulk.bw` - Bulk ATAC-seq (full depth)
- `GM12878~ATAC~1e6.bw` - Subsampled to 1,000,000 reads
- `GM12878~ATAC~5e5.bw` - Subsampled to 500,000 reads

The preprocessing pipeline generates files with these names automatically, and the training module parses the names to extract cell type and depth information.
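As a minimal sketch of how a name in this convention can be split into its parts, the snippet below uses two hypothetical helpers, `parse_atac_name` and `depth_to_reads`. They are illustrative only and are not part of the Hi-Compass API.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class AtacTrack(NamedTuple):
    cell_type: str
    depth: str  # "bulk" or a read-count label such as "1e6"


def parse_atac_name(path: str) -> AtacTrack:
    """Split '{CellType}~ATAC~{Depth}.bw' into its cell type and depth label."""
    name = Path(path).name
    if not name.endswith(".bw"):
        raise ValueError(f"not a BigWig file name: {name}")
    parts = name[: -len(".bw")].split("~")
    if len(parts) != 3 or parts[1] != "ATAC":
        raise ValueError(f"unexpected ATAC file name: {name}")
    return AtacTrack(cell_type=parts[0], depth=parts[2])


def depth_to_reads(depth: str) -> Optional[int]:
    """Return the read count for a numeric label, or None for 'bulk'."""
    return None if depth == "bulk" else round(float(depth))


track = parse_atac_name("data/ATAC/hg38/GM12878~ATAC~1e6.bw")
print(track.cell_type, track.depth, depth_to_reads(track.depth))
```

Rejecting names that do not match the pattern, instead of guessing, makes a misnamed file visible early.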



## Preprocessing

Hi-Compass provides preprocessing commands to prepare your data for training.

### ATAC-seq Preprocessing

Convert bulk ATAC-seq BAM files to multi-depth BigWig files through stratified subsampling:

```bash
hicompass preprocess-atac \
    --input GM12878_bulk_ATAC.bam \
    --cell-type GM12878 \
    --output data/ATAC/hg38 \
    --chrom-sizes hg38.chrom.sizes
```

This command generates BigWig files at multiple sequencing depths, enabling the model to learn depth-aware representations.

#### Parameters

| Parameter             | Required | Default | Description                                  |
| --------------------- | -------- | ------- | -------------------------------------------- |
| `--input`, `-i`       | Yes      | -       | Input BAM/SAM file                           |
| `--cell-type`, `-c`   | Yes      | -       | Cell type name (used in output filenames)    |
| `--output`, `-o`      | Yes      | -       | Output directory                             |
| `--chrom-sizes`, `-s` | Yes      | -       | Chromosome sizes file                        |
| `--depths`, `-d`      | No       | -       | Custom depths (comma-separated or @file.txt) |
| `--min-depth`         | No       | 2e5     | Minimum depth for range mode                 |
| `--max-depth`         | No       | 2e7     | Maximum depth for range mode                 |
| `--step`              | No       | 2e4     | Step size for range mode                     |
| `--no-bulk`           | No       | False   | Skip bulk BigWig generation                  |
| `--seed`              | No       | 42      | Random seed for reproducibility              |

#### Output

The command generates BigWig files following the naming convention `{CellType}~ATAC~{Depth}.bw`:

```
data/ATAC/hg38/
├── GM12878~ATAC~bulk.bw
├── GM12878~ATAC~2e5.bw
├── GM12878~ATAC~2.2e5.bw
├── GM12878~ATAC~2.4e5.bw
└── ...
```
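The file names above are consistent with the default range settings (`--min-depth 2e5`, `--step 2e4`). The sketch below shows one way such a depth list and its labels could be enumerated. `range_depths` and `depth_label` are hypothetical helpers, and treating `--max-depth` as inclusive is an assumption rather than documented behavior.

```python
def range_depths(min_depth: float = 2e5, max_depth: float = 2e7, step: float = 2e4):
    """List integer read depths from min_depth to max_depth (assumed inclusive)."""
    lo, hi, st = round(min_depth), round(max_depth), round(step)
    return list(range(lo, hi + 1, st))


def depth_label(depth: int) -> str:
    """Format a read count compactly, e.g. 220000 -> '2.2e5'."""
    mantissa, exponent = f"{depth:e}".split("e")
    mantissa = mantissa.rstrip("0").rstrip(".")
    return f"{mantissa}e{int(exponent)}"


depths = range_depths()
print([depth_label(d) for d in depths[:3]])  # ['2e5', '2.2e5', '2.4e5']
```

Working in integer read counts avoids the floating-point drift that repeated addition of `2e4` could introduce.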

### Hi-C Preprocessing

Hi-C preprocessing consists of two steps:

#### Step 1: Contrast Stretching Normalization

Apply contrast stretching to enhance Hi-C matrix features:

```bash
hicompass preprocess-hic-norm \
    --input GM12878_raw.cool \
    --output GM12878_normalized.cool \
    --genome hg38
```

| Parameter             | Required | Default | Description                              |
| --------------------- | -------- | ------- | ---------------------------------------- |
| `--input`, `-i`       | Yes      | -       | Input cool file                          |
| `--output`, `-o`      | Yes      | -       | Output cool file                         |
| `--genome`, `-g`      | No       | hg38    | Genome assembly (hg38, mm10, or custom)  |
| `--chrom-sizes`, `-s` | No       | -       | Required when --genome=custom            |
| `--resolution`, `-r`  | No       | 10000   | Resolution in bp (10 kb recommended)     |
| `--percentile-max`    | No       | 98.0    | Upper percentile for contrast stretching |

Note: Only 10 kb resolution is currently supported. A warning will be issued for other resolutions.
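
In general, contrast stretching rescales intensities so that a chosen upper percentile maps to the top of the output range. The sketch below is a generic pure-Python illustration of that idea, not the Hi-Compass implementation. The function names, the lower bound of zero, and the [0, 1] output range are all assumptions.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty sequence (q in 0..100)."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def contrast_stretch(matrix, percentile_max=98.0):
    """Clip values at the given upper percentile and rescale to [0, 1]."""
    flat = [v for row in matrix for v in row]
    cap = percentile(flat, percentile_max)
    if cap <= 0:
        # Nothing above zero to stretch; return an all-zero matrix
        return [[0.0 for _ in row] for row in matrix]
    return [[min(max(v, 0.0), cap) / cap for v in row] for row in matrix]


# A single extreme value (100) no longer dominates the dynamic range
print(contrast_stretch([[0, 1], [2, 100]], percentile_max=50))
```

Capping at a percentile rather than the maximum keeps a few very high contact counts from compressing the rest of the matrix toward zero.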

#### Step 2: Convert to NPZ Format

Convert the normalized cool file to NPZ format for training:

```bash
hicompass preprocess-hic-to-npz \
    --input GM12878_normalized.cool \
    --output data/HiC/hg38/GM12878
```

| Parameter            | Required | Default | Description                    |
| -------------------- | -------- | ------- | ------------------------------ |
| `--input`, `-i`      | Yes      | -       | Input cool file (from step 1)  |
| `--output`, `-o`     | Yes      | -       | Output directory for NPZ files |
| `--resolution`, `-r` | No       | 10000   | Resolution in bp               |
| `--window`, `-w`     | No       | 256     | Number of diagonals to extract |

#### Output

```
data/HiC/hg38/GM12878/
├── chr1.npz
├── chr2.npz
├── ...
└── chr22.npz
```

### DNA Sequence Preparation

DNA sequences should be provided as gzipped FASTA files per chromosome.

If you have a whole-genome FASTA file, split it using:

```bash
# Index the genome
samtools faidx hg38.fa

# Extract each chromosome (create the output directory first)
mkdir -p DNA/hg38
for chr in chr{1..22}; do
    samtools faidx hg38.fa "$chr" | gzip > "DNA/hg38/${chr}.fa.gz"
done
```

Expected structure:

```
DNA/hg38/
├── chr1.fa.gz
├── chr2.fa.gz
├── ...
└── chr22.fa.gz
```

### Generating Generalized CTCF Profile for Custom Species

For species other than human (hg38) and mouse (mm10), you need to generate your own generalized CTCF binding profile. This profile represents the pan-tissue CTCF binding probability across multiple samples.

#### Data Collection

1. Collect CTCF ChIP-seq peak files (BED format) from multiple samples/tissues of your species
   - Recommended: 50+ samples for robust generalization
   - Data sources: Cistrome DB, ENCODE, GEO, or your own experiments

2. Ensure all BED files use the same genome assembly

#### Processing Steps

```bash
# Step 1: List all CTCF peak BED files
ls /path/to/ctcf_peaks/*.bed > peak_files.txt

# Step 2: Merge all peaks using bedtools multiinter
#         This creates a file where each position shows how many samples have CTCF binding
bedtools multiinter -i $(cat peak_files.txt | tr '\n' ' ') > ctcf_merged.bed

# Step 3: Convert to bedGraph format with normalized scores
#         The 4th column of multiinter output is the count of overlapping samples
#         Normalize by total sample count to get probability (0-1)
TOTAL_SAMPLES=$(wc -l < peak_files.txt)
awk -v n="$TOTAL_SAMPLES" 'BEGIN{OFS="\t"} {print $1, $2, $3, $4/n}' ctcf_merged.bed > ctcf_normalized.bedGraph

# Step 4: Sort the bedGraph file (byte-order collation, as bedGraphToBigWig expects)
LC_ALL=C sort -k1,1 -k2,2n ctcf_normalized.bedGraph > ctcf_sorted.bedGraph

# Step 5: Convert to bigWig format
bedGraphToBigWig ctcf_sorted.bedGraph your_genome.chrom.sizes CTCF.bw
```

#### Output

The resulting `CTCF.bw` file contains binding probability scores ranging from 0 to 1:

- **0**: No CTCF binding observed in any sample
- **1**: CTCF binding observed in all samples

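Place this file in your data directory, renaming it to `generalized_CTCF.bw` so that it matches the path used in the training directory layout:

```
/your/data_root/CTCF/{genome}/generalized_CTCF.bw
```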

## Training

### Complete Data Structure Example

Before training, ensure your data is organized as follows. Here we use `/home/user/hicompass_data` as an example data root:

```
/home/user/hicompass_data/
├── ATAC/
│   └── hg38/
│       ├── GM12878~ATAC~bulk.bw
│       ├── GM12878~ATAC~1e6.bw
│       ├── GM12878~ATAC~5e5.bw
│       ├── K562~ATAC~bulk.bw
│       ├── K562~ATAC~1e6.bw
│       └── K562~ATAC~5e5.bw
├── HiC/
│   └── hg38/
│       ├── GM12878/
│       │   ├── chr1.npz
│       │   ├── chr2.npz
│       │   └── ...
│       └── K562/
│           ├── chr1.npz
│           ├── chr2.npz
│           └── ...
├── DNA/
│   └── hg38/
│       ├── chr1.fa.gz
│       ├── chr2.fa.gz
│       └── ...
├── CTCF/
│   └── hg38/
│       └── generalized_CTCF.bw
└── centromere/
    └── hg38/
        └── centromere.bed
```
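
A quick pre-flight check can catch missing files before a training run starts. The sketch below simply mirrors the tree above. `missing_inputs` is a hypothetical helper, not a Hi-Compass command, and the assumption that every listed chromosome needs both an NPZ and a FASTA file is inferred from the example layout.

```python
from pathlib import Path


def missing_inputs(data_root, genome, cell_types, depths, chromosomes):
    """Return expected training files that are absent under data_root."""
    root = Path(data_root)
    expected = [root / "CTCF" / genome / "generalized_CTCF.bw"]
    for cell in cell_types:
        # One BigWig per cell type and depth, one NPZ per cell type and chromosome
        expected += [root / "ATAC" / genome / f"{cell}~ATAC~{d}.bw" for d in depths]
        expected += [root / "HiC" / genome / cell / f"{c}.npz" for c in chromosomes]
    expected += [root / "DNA" / genome / f"{c}.fa.gz" for c in chromosomes]
    return [str(p) for p in expected if not p.exists()]


missing = missing_inputs(
    "/home/user/hicompass_data",
    genome="hg38",
    cell_types=["GM12878", "K562"],
    depths=["bulk", "1e6", "5e5"],
    chromosomes=[f"chr{i}" for i in range(1, 20)],
)
for path in missing:
    print("missing:", path)
```

The optional centromere BED file is left out of the check because the data table marks it as optional.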

### Multi-Cell-Type & Multi-Depth Training

```bash
hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type GM12878 K562 \
    --train-chr 1-17 \
    --valid-chr 18-19 \
    --train-depth bulk 1e6 5e5 \
    --valid-depth bulk 1e6 \
    --genome hg38 \
    --save-path /home/user/hicompass_checkpoints
```

If only one cell type is provided, the Hi-Compass discriminator is not activated.

### Training Parameters

| Parameter           | Required | Default       | Description                                                |
| ------------------- | -------- | ------------- | ---------------------------------------------------------- |
| `--data-root`       | Yes      | -             | Root directory containing all data subdirectories          |
| `--cell-type`       | Yes      | -             | Training cell type(s), space-separated                     |
| `--train-chr`       | Yes      | -             | Training chromosomes (e.g., `1-17`, `chr1-chr17`, `1 2 3`) |
| `--valid-chr`       | Yes      | -             | Validation chromosomes                                     |
| `--genome`          | No       | hg38          | Genome assembly (hg38, mm10, or custom)                    |
| `--cell-type-valid` | No       | same as train | Validation cell types                                      |
| `--train-depth`     | No       | bulk          | Training depths, space-separated                           |
| `--valid-depth`     | No       | bulk          | Validation depths                                          |
| `--batch-size`      | No       | 2             | Batch size per GPU                                         |
| `--max-epochs`      | No       | 100           | Maximum training epochs                                    |
| `--gpu-id`          | No       | auto          | GPU ID(s) to use (e.g., `0` or `0 1 2 3`)                  |
| `--save-path`       | No       | checkpoints   | Directory for saving model checkpoints                     |
| `--ckpt-path`       | No       | -             | Path to checkpoint for resuming training                   |

### Chromosome Specification

The `--train-chr` and `--valid-chr` parameters support flexible formats:

| Format            | Example     | Expands to                   |
| ----------------- | :---------- | :--------------------------- |
| Range             | `1-5`       | chr1, chr2, chr3, chr4, chr5 |
| Range with prefix | `chr1-chr5` | chr1, chr2, chr3, chr4, chr5 |
| List              | `1 3 5`     | chr1, chr3, chr5             |
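
The expansion rules in the table can be sketched as follows. `expand_chromosomes` is a hypothetical helper written to match the three documented formats. It is not the parser Hi-Compass uses, and its handling of any other input is an assumption.

```python
import re


def expand_chromosomes(tokens):
    """Expand tokens such as ['1-5'], ['chr1-chr5'] or ['1', '3', '5'] to chr names."""
    names = []
    for token in tokens:
        match = re.fullmatch(r"(?:chr)?(\d+)-(?:chr)?(\d+)", token)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            if start > end:
                raise ValueError(f"descending range: {token}")
            names.extend(f"chr{i}" for i in range(start, end + 1))
        else:
            # Single chromosome, with or without the 'chr' prefix
            names.append(token if token.startswith("chr") else f"chr{token}")
    return names


print(expand_chromosomes(["1-5"]))
print(expand_chromosomes(["1", "3", "5"]))
```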

### Custom Genome Training

For genomes other than hg38 and mm10:

1. Prepare chromosome sizes file at `{data-root}/chromsize/custom_genome_name/custom_genome_name.chrom.sizes`
2. Prepare generalized CTCF BigWig at `{data-root}/CTCF/custom_genome_name/generalized_CTCF.bw`
3. Specify `--genome custom_genome_name`

```bash
hicompass training \
    --data-root /home/user/hicompass_data \
    --cell-type CellTypeA \
    --train-chr 1-15 \
    --valid-chr 16-17 \
    --genome custom_genome_name \
    --save-path /home/user/hicompass_checkpoints
```

### Output

Training outputs are saved to the specified `--save-path`.

## Prediction

### Required Files

For prediction, you need the following files (can be placed anywhere):

1. **ATAC-seq BigWig file**: Your cell-type-specific ATAC-seq data
2. **Model weights**: Provided pre-trained or your own trained model (.pth file)
3. **Generalized CTCF BigWig**: Provided for hg38/mm10, or generate your own
4. **DNA sequence directory**: Directory containing chr*.fa.gz files
5. **Centromere BED file** (optional): For filtering centromeric regions

### Calculating ATAC-seq Depth

Before prediction, you need to know the sequencing depth of your ATAC-seq data. The depth is the total number of mapped reads.

```bash
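# From BAM file (all alignment records)
samtools view -c your_sample.bam
# Mapped primary alignments only (excludes unmapped 0x4 and secondary 0x100)
samtools view -c -F 260 your_sample.bam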
```

### Running Prediction

```bash
hicompass predicting \
    --genome hg38 \
    --model-path /path/to/hicompass_hg38.pth \
    --atac-path /path/to/hg38/my_sample_ATAC.bw \
    --ctcf-path /path/to/hg38/generalized_CTCF.bw \
    --dna-dir /path/to/DNA/hg38 \
    --output /path/to/output/my_sample_predicted.cool \
    --centromere-bed /path/to/hg38/centromere.bed \
    --depth 8000000 \
    --chromosomes 1-22 \
    --device cuda:0
```



### Prediction Parameters

| Parameter          | Required | Default | Description                                        |
| ------------------ | -------- | ------- | -------------------------------------------------- |
| `--model-path`     | Yes      | -       | Path to trained model checkpoint                   |
| `--atac-path`      | Yes      | -       | Path(s) to ATAC-seq BigWig file(s)                 |
| `--ctcf-path`      | Yes      | -       | Path to generalized CTCF BigWig                    |
| `--dna-dir`        | Yes      | -       | Directory containing chr*.fa.gz files              |
| `--output`         | Yes      | -       | Output path for predicted .cool file               |
| `--depth`          | Yes      | -       | Sequencing depth (total mapped reads)              |
| `--genome`         | No       | hg38    | Genome assembly (hg38, mm10, or custom)            |
| `--chrom-sizes`    | No       | -       | Chromosome sizes file (required for custom genome) |
| `--centromere-bed` | No       | -       | BED file for centromere/telomere filtering         |
| `--chromosomes`    | No       | 1-22    | Chromosomes to predict (e.g., `1,2,3` or `1-22`)   |
| `--stride`         | No       | 50      | Stride for sliding window (bins)                   |
| `--device`         | No       | cpu     | Computation device (cpu, cuda, cuda:0)             |
| `--batch-size`     | No       | 2       | Batch size for prediction                          |
| `--num-workers`    | No       | 16      | Number of data loading workers                     |
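
The `--stride` value is given in bins. To show roughly how a stride of 50 bins relates to the 10 kb resolution and 2 Mb window stated in the Overview, the sketch below lists window start positions along one chromosome. `window_starts` is a hypothetical helper. The 2,000,000 bp window (200 bins) and the dropping of a trailing partial window are assumptions, not documented behavior.

```python
def window_starts(chrom_length_bp, resolution_bp=10_000, window_bp=2_000_000, stride_bins=50):
    """Return start coordinates (bp) of full-length windows along a chromosome."""
    stride_bp = stride_bins * resolution_bp  # 50 bins * 10 kb = 500 kb
    last_start = chrom_length_bp - window_bp
    if last_start < 0:
        # Chromosome shorter than one window
        return []
    return list(range(0, last_start + 1, stride_bp))


print(window_starts(3_000_000))  # [0, 500000, 1000000]
```

Under these assumptions consecutive windows overlap by 1.5 Mb, so most bin pairs are covered by several windows, and a smaller stride increases that overlap at the cost of more forward passes.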

### Output Format

The output is a **.cool file** with balanced weights, compatible with downstream analysis tools:

```python
import cooler

# Load predicted Hi-C
clr = cooler.Cooler('/path/to/output/my_sample_predicted.cool')

# Use with cooltools
import cooltools

# Calculate insulation score
insulation = cooltools.insulation(clr, window_bp=100000)
```

## Acknowledgments

We thank the developers of [C.Origami](https://github.com/tanjimin/C.Origami) for their pioneering work on cell-type-specific Hi-C prediction. The NPZ conversion strategy for Hi-C data preprocessing and the basic dataset backbone in Hi-Compass were adapted from their implementation.

## License

Hi-Compass is released under the MIT License. See [LICENSE](LICENSE) for details.

## Contact

For questions and feedback, please open an issue on [GitHub](https://github.com/EndeavourSyc/Hi-Compass/issues) or contact Yuanchen Sun (suneddiesyc@gmail.com).
