Metadata-Version: 2.4
Name: methurator
Version: 0.1.4
Summary: Python package designed to estimate sequencing saturation for reduced-representation bisulfite sequencing (RRBS) data.
Author-email: Edoardo Giuili <edoardogiuili@gmail.com>
License: MIT License
        
        Copyright (c) 2025 Edoardo Giuili, TOBI lab
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Changelog, https://github.com/VIBTOBIlab/methurator/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/VIBTOBIlab/methurator/blob/main/README.md
Project-URL: Issues, https://github.com/VIBTOBIlab/methurator/issues
Project-URL: Repository, https://github.com/VIBTOBIlab/methurator
Keywords: bioinformatics,biology,BSseq,methylation,RRBS
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: packaging>=24
Requires-Dist: pandas
Requires-Dist: pyfaidx
Requires-Dist: pysam
Requires-Dist: rich
Requires-Dist: rich-click
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: plotly
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Dynamic: license-file

# 🧬 methurator

[![Python Versions](https://img.shields.io/badge/python-≥3.10%20&%20≤3.13-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Tested with pytest](https://img.shields.io/badge/tested%20with-pytest-blue.svg)](https://pytest.org/)

**Methurator** is a Python package designed to estimate **sequencing saturation** for **reduced-representation bisulfite sequencing (RRBS)** data.

Although optimized for RRBS, **methurator** can also be used for whole-genome bisulfite sequencing (**WGBS**) or other genome-wide methylation data (e.g. **EMseq**). However, for these data types we advise using the [Preseq package](https://smithlabresearch.org/software/preseq/).

---

## 🧠 Dependencies and Notes

- methurator uses [SAMtools](https://www.htslib.org/) and [MethylDackel](https://github.com/dpryan79/MethylDackel) internally for BAM downsampling, so both must be installed before running it.
- When `--genome` is provided, the corresponding FASTA file will be automatically fetched and cached.
- Temporary intermediate files are deleted by default unless `--keep-temporary-files` is specified.
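
Before running methurator, it can help to confirm that both external tools are discoverable. The following is a minimal sketch, assuming the executables are installed under their usual names `samtools` and `MethylDackel` and are looked up via `PATH`. The helper `missing_tools` is hypothetical and not part of methurator.

```python
import shutil

# Assumption: the tools are installed under their usual executable names.
REQUIRED_TOOLS = ("samtools", "MethylDackel")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools: " + ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty result means both tools were found; anything else lists what still needs to be installed.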

---

## 📦 Pip installation

```bash
pip install methurator
```

---

## 🚀 Quick Start

### Step 1 — Downsample BAM files

The `downsample` command performs BAM downsampling according to the specified percentages and coverage.

```bash
methurator downsample --genome hg19 --bam test_data/SRX1631721.markdup.sorted.csorted.bam
```

This command generates two summary files:

- **CpG summary** — number of unique CpGs detected in each downsampled BAM
- **Reads summary** — number of reads in each downsampled BAM

Example outputs can be found in [`tests/data`](https://github.com/VIBTOBIlab/methurator/tree/main/tests/data).
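
To get a first look at the two summary files without assuming their column names, a short script can print the header and the first few rows. This is only a sketch: it assumes the files are plain comma-separated text (as the `.csv` extension suggests) and that they sit in the default `./output` directory under the names used elsewhere in this README. `preview_csv` is a hypothetical helper, not a methurator function.

```python
import csv
from pathlib import Path


def preview_csv(path, max_rows=5):
    """Return the header and up to max_rows data rows of a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # zip stops as soon as range() is exhausted, so no extra row is read.
        rows = [row for _, row in zip(range(max_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # Assumed locations: default --outdir plus the file names used in this README.
    for name in ("cpgs_summary.csv", "reads_summary.csv"):
        path = Path("output") / name
        if not path.exists():
            print(f"{path} not found; run `methurator downsample` first.")
            continue
        header, rows = preview_csv(path)
        print(path, header, *rows, sep="\n")
```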

---

### Step 2 — Plot the sequencing saturation curve

Use the `plot` command to visualize sequencing saturation:

```bash
methurator plot \
  --cpgs_file tests/data/cpgs_summary.csv \
  --reads_file tests/data/reads_summary.csv
```

---

## ⚙️ Command Reference

### 🧩 `downsample` command

| Argument                            | Description                                                                                                        | Default             |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------ | ------------------- |
| `--bam`                             | Path to a single `.bam` file.                                                                                      | —                   |
| `--bamdir`                          | Directory containing multiple BAM files.                                                                           | —                   |
| `--outdir`                          | Output directory.                                                                                                  | `./output`          |
| `--fasta`                           | Path to the reference genome FASTA file. If not provided, it will be automatically downloaded based on `--genome`. | —                   |
| `--genome`                          | Genome used for alignment. Available: `hg19`, `hg38`, `GRCh37`, `GRCh38`, `mm10`, `mm39`.                          | —                   |
| `--downsampling-percentages`, `-ds` | Comma-separated list of downsampling percentages between 0 and 1 (exclusive).                                      | `0.1,0.25,0.5,0.75` |
| `--minimum-coverage`                | Minimum CpG coverage to consider for saturation. Can be a single integer or a list (e.g. `1,3,5`).                 | `3`                 |
| `--keep-temporary-files`            | If set, temporary files will be kept after analysis.                                                               | `False`             |
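
The value formats accepted by `--downsampling-percentages` and `--minimum-coverage` can be illustrated with a small validation sketch. It is not methurator's own parsing code. The function names are hypothetical, and the requirement that coverage values be at least 1 is an assumption, since the table above only states that they are integers.

```python
def parse_fractions(text):
    """Parse a comma-separated list of values strictly between 0 and 1."""
    values = [float(item) for item in text.split(",") if item.strip()]
    if not values:
        raise ValueError("at least one downsampling percentage is required")
    for value in values:
        if not 0.0 < value < 1.0:
            raise ValueError(f"{value} is outside the open interval (0, 1)")
    return values


def parse_coverages(text):
    """Parse a single integer or a comma-separated list of integers."""
    values = [int(item) for item in text.split(",") if item.strip()]
    # Assumption: a minimum coverage below 1 is not meaningful.
    if not values or any(value < 1 for value in values):
        raise ValueError("coverage values must be integers >= 1")
    return values


if __name__ == "__main__":
    print(parse_fractions("0.1,0.25,0.5,0.75"))  # the documented default
    print(parse_coverages("1,3,5"))              # the example from the table
```

Values such as `0`, `1`, or `1.5` are rejected by `parse_fractions`, which mirrors the "between 0 and 1 (exclusive)" wording.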

---

### 📊 `plot` command

| Argument       | Description                              | Default    |
| -------------- | ---------------------------------------- | ---------- |
| `--cpgs_file`  | Path to the CpG coverage summary file.   |            |
| `--reads_file` | Path to the reads coverage summary file. |            |
| `--outdir`     | Output directory.                        | `./output` |

---

## 📘 Example Workflow

```bash
# Step 1: Downsample BAM file
methurator downsample --genome hg19 --bam my_sample.bam

# Step 2: Plot saturation curve
methurator plot \
  --cpgs_file output/cpgs_summary.csv \
  --reads_file output/reads_summary.csv
```
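
The same two steps can be driven from Python, for example inside a larger pipeline. This is a minimal sketch that uses only the flags documented above. It assumes the summary files are written directly inside `--outdir` as `cpgs_summary.csv` and `reads_summary.csv`, as in the example paths. `build_commands` and `run_workflow` are hypothetical helpers, not part of methurator.

```python
import shlex
import subprocess


def build_commands(bam, genome, outdir="output"):
    """Build the downsample and plot command lines as argument lists."""
    downsample = [
        "methurator", "downsample",
        "--genome", genome,
        "--bam", bam,
        "--outdir", outdir,
    ]
    plot = [
        "methurator", "plot",
        # Assumed file names, taken from the example paths in this README.
        "--cpgs_file", f"{outdir}/cpgs_summary.csv",
        "--reads_file", f"{outdir}/reads_summary.csv",
        "--outdir", outdir,
    ]
    return [downsample, plot]


def run_workflow(bam, genome, outdir="output"):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(bam, genome, outdir):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them; call run_workflow() to execute.
    for command in build_commands("my_sample.bam", "hg19"):
        print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and `check=True` prevents the plot step from running if downsampling fails.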
