Metadata-Version: 2.4
Name: methylome_miner
Version: 1.1.8
Summary: A tool for handling DNA methylation data in extended bedMethyl format obtained from the nanopore sequencing data by Dorado and Modkit.
Author-email: Marketa Jakubickova <jakubickova@vut.cz>, Katerina Sabatova <sabatovak@vut.cz>
Maintainer-email: Katerina Sabatova <sabatovak@vut.cz>
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-Expression: GPL-3.0-or-later
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
License-File: LICENSE
Requires-Dist: click>=8.1
Requires-Dist: pandas>=2.2.3
Requires-Dist: biopython>=1.84
Requires-Dist: bcbio-gff>=0.7.1
Requires-Dist: pyarrow>=18.1.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: scipy>=1.16.1
Requires-Dist: fastcluster>=1.3.0
Project-URL: Homepage, https://github.com/BioSys-BUT/MethylomeMiner
Project-URL: Issues, https://github.com/BioSys-BUT/MethylomeMiner/issues

# MethylomeMiner

`MethylomeMiner` is a tool for handling DNA methylation data in extended bedMethyl format obtained from the nanopore 
sequencing data by [Dorado](https://github.com/nanoporetech/dorado) and 
[Modkit](https://github.com/nanoporetech/modkit).

![MethylomeMiner](graphical_abstract.png "MethylomeMiner workflow")

## Content

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [MethylomeMiner](#MethylomeMiner)
  - [PanMethylomeMiner](#PanMethylomeMiner)
- [Contact](#contact)
- [License](#license)
- [Citation](#citation)

## Features

* Filter bedMethyl tables based on valid coverage and the methylation rate.
* Sort base modifications into coding and non-coding types using genome annotation.
* Split base modifications according to reference sequences.
* Provide count statistics for each base modification, coding/non-coding types and reference sequence.
* Link coding base modifications with results from pangenome analysis conducted by 
  [Roary](https://sanger-pathogens.github.io/Roary/).

## Installation

1. Create a Python 3.12 or newer virtual environment.
2. Install `MethylomeMiner` using:

```commandline
python -m pip install methylome-miner
```

## Usage
`MethylomeMiner` functionality is divided into two parts. The first part named `MethylomeMiner` deals with manipulation
of bedMethyl files, the other, `PanMethylomeMiner`, integrates pangenome analysis with DNA methylation data processed
by `MethylomeMiner` to create panmethylome.

## MethylomeMiner

The primary use of `MethylomeMiner` is to filter out base modifications with low coverage and low rate
of base modification and to sort methylations into coding and non-coding groups. So-called *coding* methylations are
base modifications that are located in some coding region of the genome. On the other hand *non-coding* methylations
are found between coding parts of the genome.

The basic command to run `MethylomeMiner` from command line is following:

```commandline
MethylomeMiner --input_bed_file path\to\bed_file.bed --input_annot_file path\to\annot_file.gff --min_coverage 35
```

#### Required parameters
##### `--input_bed_file`
Path to a bedMethyl file with genome-wide single-base methylation data.

##### `--input_annot_file`
Path to a file with genome annotation in `gff` (v3) or `gbk` file format.

##### `--min_coverage` and `--input_bed_dir`
Parameters `--min_coverage` and `--input_bed_dir` are mutually exclusive as they are used to assess the minimum coverage
of base modification to be retained. 

The user can either set the threshold of minimum coverage by using parameter `--min_coverage` and an integer value
of requested coverage, or provide a path to directory with bedMethyl files (`--input_bed_dir` parameter)
and the threshold would be set to the median value of median values of coverage from individual bedMethyl files within
the directory.


#### Optional parameters
##### `--min_percent_modified`
Sets minimum required percentage of reads supporting base modification. The value is float in the range `<0, 100>`.
Default: `90`.

##### `--work_dir`
Specifies the output directory where `MethylomeMiner` will save all result files.
If not provided, a folder named `MethylomeMiner_output` will be automatically created in the current working directory.

##### `--file_name`
Specifies a custom name to use for the output files generated by `MethylomeMiner`.
By default, `MethylomeMiner` use the name of the input bedMethyl file and add a specific suffix for each output 
(see [Output files](#output-files))

##### `--write_filtered_bed`
A flag. When set, the script writes the filtered bedMethyl data to a new output file in working directory.
Use this option if you want to save the filtered results.

##### `--filtered_bed_format`
Specifies the format of the filtered bedMethyl output file.
Available options: `json`, `csv`, `tsv`, `bed`.
Default: `csv`.
Use this option to control the format in which the filtered data is saved.

##### `--split_by_reference`
A flag. If set, all output files (except for filtered bedMethyl file) will be split and saved separately for each 
reference sequence.

#### Output files
The output of `MethylomeMiner` is a set of new files distinguished by file suffixes:
* `_coding.csv` - extended bedMethyl table with information about coding region, where methylations are located.
  
  | reference_seq       | start_index | end_index | modified_base_code | score | strand | ... | percent_modified | ... | gene_reference_seq  | gene_id                            | gene_start | gene_end | gene_strand | gene_product                                   |
  |---------------------|-------------|-----------|--------------------|-------|--------|-----|------------------|-----|---------------------|------------------------------------|------------|----------|-------------|------------------------------------------------|
  | chromosome_contig_1 | 1108        | 1109      | 6mA                | 36    | +      | ... | 94.44            | ... | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_1 | 0          | 1335     | +           | chromosomal replication initiator protein DnaA |
  | chromosome_contig_1 | 1753        | 1754      | 6mA                | 36    | +      | ... | 97.25            | ... | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_2 | 1339       | 2440     | +           | DNA polymerase III subunit beta                |
  | chromosome_contig_1 | 1867        | 1868      | 5mC                | 37    | +      | ... | 100.0            | ... | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_2 | 1339       | 2440     | +           | DNA polymerase III subunit beta                |
  | chromosome_contig_1 | 2699        | 2700      | 6mA                | 36    | +      | ... | 97.25            | ... | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_3 | 2647       | 3721     | +           | DNA replication/repair protein RecF            |
  | chromosome_contig_1 | 2728        | 2729      | 5mC                | 36    | +      | ... | 100.0            | ... | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_3 | 2647       | 3721     | +           | DNA replication/repair protein RecF            |
  | ...                 |

* `_non_coding.csv` - extended bedMethyl table with information about coding regions, that are located in front of and 
  after methylation. Modifications in front of the first coding feature have empty annotation marked as `prev` 
  (previous). Modifications after the last coding feature have empty annotation marked as `next`.
  
  | reference_seq       | start_index | end_index | modified_base_code | score | strand | ... | percent_modified | ... | prev_gene_reference_seq | prev_gene_id                          | prev_gene_start | prev_gene_end | prev_gene_strand | prev_gene_product                | next_gene_reference_seq | next_gene_id                          | next_gene_start | next_gene_end | next_gene_strand | next_gene_product            |
  |---------------------|-------------|-----------|--------------------|-------|--------|-----|------------------|-----|-------------------------|---------------------------------------|-----------------|---------------|------------------|----------------------------------|-------------------------|---------------------------------------|-----------------|---------------|------------------|------------------------------|
  | chromosome_contig_1 | 1482        | 1483      | 6mA                | 36    | -      | ... | 100.0            | ... |                         |                                       |                 |               |                  |                                  | chromosome_contig_1     | 0792ceb9f153a28951804de2a656f49c_14   | 15874           | 16207         | -                | YceK/YidQ family lipoprotein |
  | chromosome_contig_1 | 1601        | 1602      | 6mA                | 36    | -      | ... | 100.0            | ... |                         |                                       |                 |               |                  |                                  | chromosome_contig_1     | 0792ceb9f153a28951804de2a656f49c_14   | 15874           | 16207         | -                | YceK/YidQ family lipoprotein |
  | ...                 |
  | chromosome_contig_1 | 3553006     | 3553007   | 6mA                | 41    | +      | ... | 97.56            | ... | chromosome_contig_1     | 0792ceb9f153a28951804de2a656f49c_3407 | 3544141         | 3545086       | +                | CDF family zinc transporter ZitB | chromosome_contig_1     | 0792ceb9f153a28951804de2a656f49c_3435 | 3568768         | 3570052       | +                | citrate synthase             |
  | ...                 |
  | plasmid_contig_3    | 153394      | 153395    | 5mC                | 45    | +      | ... | 100.0            | ... | plasmid_contig_3        | 0792ceb9f153a28951804de2a656f49c_5106 | 151555          | 152305        | +                | hypothetical protein             |                         |                                       |                 |               |                  |
  | plasmid_contig_3    | 153396      | 153397    | 5mC                | 36    | -      | ... | 100.0            | ... | plasmid_contig_3        | 0792ceb9f153a28951804de2a656f49c_5099 | 146025          | 146250        | -                | IS1-like element transposase     |                         |                                       |                 |               |                  |                              |
 
* `_all_annot_with_coding_methylations.csv` - extended genome annotation table, where each coding region has
  a list of found methylations' positions distinguished by methylation type.
  
  | record_id           | gene_id                            | start | end  | strand | product                                        | note                                                                                                                                      | 4mC | 5mC                                                          | 6mA                                                    |
  |---------------------|------------------------------------|-------|------|--------|------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|-----|--------------------------------------------------------------|--------------------------------------------------------|
  | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_1 | 0     | 1335 | +      | chromosomal replication initiator protein DnaA | WP_015703901.1 chromosomal replication initiator protein DnaA (Klebsiella) [pid:94.8%, q_cov:100.0%, s_cov:95.1%, Eval:5.9e-240]          | []  | []                                                           | [1108]                                                 |
  | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_2 | 1339  | 2440 | +      | DNA polymerase III subunit beta                | WP_015369032.1 DNA polymerase III subunit beta (Klebsiella aerogenes) [pid:98.1%, q_cov:100.0%, s_cov:100.0%, Eval:5.2e-204]              | []  | [1867]                                                       | [1753]                                                 |
  | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_3 | 2647  | 3721 | +      | DNA replication/repair protein RecF            | WP_015703900.1 DNA replication/repair protein RecF (Klebsiella aerogenes) [pid:97.5%, q_cov:100.0%, s_cov:100.0%, Eval:1.5e-200]          | []  | [2728, 3115]                                                 | [2699, 2827, 3179, 3640]                               |
  | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_4 | 3749  | 6164 | +      | DNA topoisomerase (ATP-hydrolyzing) subunit B  | WP_015703899.1 DNA topoisomerase (ATP-hydrolyzing) subunit B (Klebsiella aerogenes) [pid:95.1%, q_cov:100.0%, s_cov:100.0%, Eval:0.0e+00] | []  | [3962, 4265, 4307, 4535, 4979, 5012, 5069, 5606, 5825, 5915] | [4454, 4944, 5309, 5406, 5417, 5697, 5991, 6015, 6069] |
  | chromosome_contig_1 | 0792ceb9f153a28951804de2a656f49c_5 | 6365  | 7178 | +      | sugar-phosphatase                              | WP_015369035.1 sugar-phosphatase (Klebsiella aerogenes) [pid:94.8%, q_cov:100.0%, s_cov:100.0%, Eval:1.3e-142]                            | []  | [6857]                                                       | [6696, 6801, 6809, 7023]                               |
  | ...                 |

* `_methylations_statistics.csv` - table of filtered methylation counts showing the number of methylation sites detected
  in coding and non-coding regions, along with their totals and the overall number of methylations.

  | genome | 4mC_coding_count | 4mC_non_coding_count | 4mC_total_count | 5mC_coding_count | 5mC_non_coding_count | 5mC_total_count | 6mA_coding_count | 6mA_non_coding_count | 6mA_total_count | total_count |
  |--------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|-------------|
  | KP825  | 38               | 36                   | 74              | 9425             | 10809                | 20234           | 15797            | 16386                | 32183           | 52491       |

* `_methylations_statistics_per_ref.csv` - as above, but the counts are divided according to reference
  sequences in the annotation.

  | genome | reference_seq       | 4mC_coding_count | 4mC_non_coding_count | 4mC_total_count | 5mC_coding_count | 5mC_non_coding_count | 5mC_total_count | 6mA_coding_count | 6mA_non_coding_count | 6mA_total_count | total_count |
  |--------|---------------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|-------------|
  | KP825  | chromosome_contig_1 | 33               | 33                   | 66              | 8874             | 10033                | 18907           | 14967            | 15399                | 30366           | 49339       |
  | KP825  | plasmid_contig_2    | 2                | 1                    | 3               | 252              | 333                  | 585             | 418              | 459                  | 877             | 1465        |
  | KP825  | plasmid_contig_3    | 3                | 2                    | 5               | 299              | 443                  | 742             | 412              | 528                  | 940             | 1687        |

* `_filtered.*` - optional, filtered bedMethyl table in CSV, TSV, JSON or BED format.

If `--split_by_reference` is used flag:
* `_coding.csv` file is replaced by `<reference sequence 1>_coding.csv`, `<reference sequence 2>_coding.csv` etc. for
  each reference sequence present in the annotation file,
* `_non_coding.csv` file is replaced by `<reference sequence 1>_non_coding.csv`, `<reference sequence 2>_non_coding.csv`
  etc. for each reference sequence present in the annotation file.


#### Examples
* To run `MethylomeMiner` with minimum coverage not specified, but instead path to directory with bedMethyl files is
  provided and minimum coverage is calculated as a median coverage value from all bedMethyl files
  in the directory:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --input_bed_dir input_files
  ```
  
* To set minimum coverage and minimum percentage of reads supporting base modification:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --min_coverage 36 --min_percent_modified 95
  ```
  
* To set up working directory and custom name for output files:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --min_coverage 36 --work_dir path\to\output_dir --file_name output_file_name_prefix
  ```

* To write filtered bedMethyl table to default CSV file:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --write_filtered_bed
  ```

* To write filtered bedMethyl file and select its format:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --write_filtered_bed --filtered_bed_format bed
  ```

* To split and save all output files (except for filtered bedMethyl file) by reference sequences:

  ```commandline
  MethylomeMiner --input_bed_file input_files\KP825_4mC_5mC_6mA_calls.bed --input_annot_file input_files\KP825_genome.gff --split_by_reference
  ```


## PanMethylomeMiner

`PanMethylomeMiner` is designed to integrate the results from `MethylomeMiner` with pangenome analysis performed
by Roary. Its main purpose is to generate a comprehensive panmethylome table, in which each gene from the pangenome
is annotated with the detected methylations across multiple genomes.

The basic command to run `PanMethylomeMiner` from command line is following:

```commandline
PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv
```

`PanMethylomeMiner` uses `MethylomeMiner` to process all input files (bedMethyl and annotation files).
Therefore, some parameters in `PanMethylomeMiner` are similar to those in `MethylomeMiner`, allowing control over its 
internal settings.

In the basic use case, `MethylomeMiner` filters base modifications using a default threshold of 90 % for the minimum 
percentage of modified reads. The threshold for minimum coverage is automatically set to the median value calculated
from all bedMethyl files provided via the `--input_bed_dir` parameter.

#### Required parameters
##### `--input_bed_dir`
Path to a directory with bedMethyl files, which should be integrated into pangenome analysis.

##### `--input_annot_dir`
Path to a directory with annotation files, that were used as input into pangenome analysis. Genome annotations could be
in `gff` (v3) or `gbk` file format, but the locus tags for individual CDS have been renamed to correspond with 
the gene names used in the Roary output.

> **Important note:**
> 
> `PanMethylomeMiner` pairs bedMethyl and annotation files according to the file names, more precisely by a prefix.
> In this case, a prefix is every character in front of the first underscore `_` character. This prefix is later used
> to distinguish the individual genomes in the pangenome analysis.
> 
> Make sure the files are named accordingly, for example:
> 
> | prefix   | bedMethyl file                 | annotation file     |
> |----------|--------------------------------|---------------------|
> | `KP825`  | `KP825_4mC_5mC_6mA_calls.bed`  | `KP825_genome.gff`  |
> | `KP1651` | `KP1651_4mC_5mC_6mA_calls.bed` | `KP1651_genome.gff` |
> | `KP1824` | `KP1824_4mC_5mC_6mA_calls.bed` | `KP1824_genome.gff` |
>


##### `--roary_file`
Path to output file from Roary tool named `gene_presence_absence.csv`.


#### Optional parameters
##### `--min_coverage`
Sets minimum (as integer value) required read coverage for a base modification to be retained.
This option is only used if `MethylomeMiner` was not previously run for the input bedMethyl files
and no precomputed results are present in the `work_dir` folder.
If not provided, the median coverage will be automatically calculated from files in the `input_bed_dir` folder.

##### `--min_percent_modified`
Sets minimum required percentage of reads supporting base modification. The value is float in the range `<0, 100>`.
Default: `90`.
This option is only used if `MethylomeMiner` was not previously run for the input bedMethyl files
and no precomputed results are present in the `work_dir` folder.

##### `--matrix_values`
Defines the format of values used in the output panmethylome matrix.

Options:
* `presence`: `0` for no detected base modification, `1` for detected base modification, no value for missing gene
  in the corresponding genome.
* `positions`: A list of genomic positions where base modifications were identified within each pangenome gene
  or no value for missing gene in the corresponding genome.

Default: `presence`.

If a particular gene from pangenome was not found in a genome, value in the matrix is left empty.

##### `--work_dir`
Specifies the output directory where `PanMethylomeMiner` will look for results from `MethylomeMiner` and save all result 
files.
If not provided, a folder named `MethylomeMiner_output` will be automatically created in the current working directory.

##### `--write_all_results`
A flag. If set, writes all result tables produced by `MethylomeMiner` to output files.
This includes:
* All detected methylation sites split into coding and non-coding regions.
* The extended genome annotation file with detected methylations located in coding regions.
* Filtered bedMethyl files after quality thresholds are applied.

Use this option if you want to save full intermediate results.

##### `--heatmap_type`
Specifies which type of heatmap visualization should be generated from the panmethylome presence matrix.
Heatmaps are only created if the matrix values (via `--matrix_values)` are set to `presence`.

Options:
* `compact`: Generates a lower-resolution heatmap without gene names on the y-axis. Suitable for a quick overview across
  genomes.
* `full`: Generates a full-resolution heatmap with gene names. This version is more detailed but may result in a large
  image depending on the number of genes.
* `both`: Creates both compact and full heatmaps.

If no value is provided, no heatmap will be generated.

##### `--heatmap_file_format`
Sets the output format for generated heatmap(s). This applies to all generated heatmaps.
Aviable options are `png`, `svg` and `pdf`.

Default: `pdf`.

##### `--heatmap_min_percent_presence`
Sets the minimum percentage of genomes in which a gene must be present in order to be included in the heatmap(s).
This helps filter out genes that are absent from most genomes. For example `--heatmap_min_percent_presence 90` will
include only genes present in at least 90 % of genomes. Value 100 % will create heatmap for core genes.

The value is float in the range `<0, 100>` and default is `95`.


#### Output files
`PanMethylomeMiner` results are provided as a set of CSV files, each named according to the type of base modification 
(e.g., 5mC, 6mA, 4mC) identified in the dataset. 
For each detected modification, a panmethylome matrix is generated in the format specified by the user:

##### `presence`

The presence/absence matrix indicates whether a given pangenome gene contains any detected base modifications
of the specified type.

**File name format:** `<modification>_panmethylome_presence.csv`

Example: `5mC_panmethylome_presence.csv`

**Matrix values:**
* `1` – base modification detected in the gene
* `0` – no modification detected
* no value – the gene is absent in the corresponding genome

| Gene       | KP1248 | KP1267 | KP1344 | KP1622 | KP1651 | KP1658 | KP1665 | KP1666 | KP1785 | KP1814 | KP1821 | KP1822 | KP1824 | KP1829 | KP1832 | KP1833 | KP1834 | KP1842 | KP1906 | KP825 |
|------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
| recD       | 1      | 1      | 0      | 1      | 1      | 1      | 0      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 0      | 0      |       |
| paaF       | 0      | 1      | 1      | 1      | 1      | 1      | 0      | 1      | 0      | 1      | 1      | 1      | 0      | 1      | 1      | 1      | 1      | 0      | 0      |       |
| tynA       | 0      | 1      | 0      | 1      | 1      | 1      | 0      | 1      | 0      | 1      | 1      | 0      | 1      | 0      | 1      | 1      | 1      | 0      | 0      |       |
| group_1004 | 0      | 1      | 0      | 1      | 1      | 1      | 0      | 1      | 1      | 1      | 1      | 1      | 1      | 1      | 0      | 1      | 1      | 0      | 0      |       |
| ompC       | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 1      | 1      | 1      | 0      | 0      | 0      | 0      | 1      | 1      | 0      | 0      |       |
|...|


##### `positions`
Matrix stores precise genomic positions of modified bases located within each pangenome gene.

**File name format:** `<modification>_panmethylome_positions.csv`

Example: `5mC_panmethylome_positions.csv`

**Matrix values:**
* a list of integers - genomic positions of detected base modifications in the gene
* empty list - no modifications were detected
* no value - the gene is absent in the corresponding genome

| Gene       | KP1248                   | KP1267                                                                   | KP1344    | KP1622                                                                                     | KP1651                                                                                              | KP1658                                                   | KP1665 | KP1666                                                   | KP1785                           | KP1814                                                                   | KP1821                                                                            | KP1822                                   | KP1824                                                 | KP1829                                   | KP1832                                                 | KP1833                                                 | KP1834                                                                   | KP1842 | KP1906 | KP825 |
|------------|--------------------------|--------------------------------------------------------------------------|-----------|--------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|----------------------------------------------------------|--------|----------------------------------------------------------|----------------------------------|--------------------------------------------------------------------------|-----------------------------------------------------------------------------------|------------------------------------------|--------------------------------------------------------|------------------------------------------|--------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------------------------|--------|--------|-------|
| recD       | [923982, 924145, 924181] | [994950, 995152, 995220, 995315]                                         | []        | [901881, 902044, 902535]                                                                   | [902535, 902698]                                                                                    | [901537, 901881, 902044, 902440, 902535, 902698, 902734] | []     | [901539, 901883, 902046, 902442, 902537, 902700, 902736] | [888326, 888468, 888670, 888833] | [900277, 900419, 900621, 900784, 901180, 901275, 901438, 901474]         | [909850, 910143, 910539, 910634, 910797]                                          | [909638, 909852, 910145, 910541, 910636] | [909636, 909850, 910143, 910539, 910634, 910797]       | [909639, 909853, 910146, 910637, 910800] | [909849, 910142, 910633, 910796]                       | [909791, 909988, 910084, 910480, 910575]               | [909794, 910087, 910483, 910578, 910741]                                 | []     | []     |       |
| paaF       | []                       | [2966497, 2966854]                                                       | [2931769] | [2868686, 2869043]                                                                         | [2859549, 2859906]                                                                                  | [2859393, 2859750]                                       | []     | [2859429]                                                | []                               | [2898167, 2898524]                                                       | [2922405, 2922762]                                                                | [2922416, 2922773]                       | []                                                     | [2922408, 2922765]                       | [2922406]                                              | [2922341, 2922698]                                     | [2922351, 2922708]                                                       | []     | []     |       |
| tynA       | []                       | [2978635, 2978860, 2978918, 2979184, 2979446, 2979719, 2979927, 2979977] | []        | [2880824, 2881049, 2881107, 2881635, 2881908, 2881929, 2882116, 2882166, 2882472, 2882835] | [2871687, 2871912, 2871970, 2872236, 2872498, 2872771, 2872792, 2872979, 2873029, 2873335, 2873698] | [2871531, 2872615, 2872823, 2873179, 2873542]            | []     | [2872651, 2872859, 2872909, 2873578]                     | []                               | [2911116, 2911389, 2911597, 2911647, 2911953, 2912071, 2912304, 2912316] | [2934543, 2934768, 2934826, 2935354, 2935627, 2935648, 2935835, 2935885, 2936191] | []                                       | [2934768, 2934826, 2935092, 2935627, 2936191, 2936228] | []                                       | [2934827, 2935628, 2935649, 2935836, 2936192, 2936229] | [2934479, 2934762, 2935563, 2935584, 2935771, 2936127] | [2934489, 2934714, 2934772, 2935300, 2935573, 2935594, 2935781, 2936137] | []     | []     |       | 
| group_1004 | []                       | [2986640, 2986681, 2986802, 2986969, 2988160]                            | []        | [2888828, 2888869, 2888990, 2889157, 2890348]                                              | [2879682, 2879723, 2879844, 2880011, 2881202]                                                       | [2879526, 2879567, 2879688, 2879855, 2881046]            | []     | [2879562, 2879603, 2879724, 2879891, 2881082]            | [2800569]                        | [2918309, 2918350, 2918471, 2918638, 2919829, 2919919]                   | [2942547, 2942588, 2942709, 2944067]                                              | [2942599]                                | [2942547, 2942588, 2942709]                            | [2942550, 2942591, 2942712, 2944070]     | []                                                     | [2942483, 2942524]                                     | [2942493, 2942534, 2943482, 2944013]                                     | []     | []     |       |                                                                      
| ompC       | []                       | []                                                                       | []        | []                                                                                         | []                                                                                                  | []                                                       | []     | [2889656, 2889662]                                       | [2809144]                        | [2928048, 2928088, 2928433, 2928439]                                     | [2952274, 2952314, 2952641, 2952647]                                              | []                                       | []                                                     | []                                       | []                                                     | [2952210, 2952250, 2952577, 2952583]                   | [2952587, 2952593]                                                       | []     | []     |       |                                                                      
| ...        |


##### Statistics
Additionally, methylation statistics from `MethylomeMiner` are merged to provide an overview across all genomes
in pangenome dataset. Therefore, two CSV files are created:

* `all_methylations_statistics.csv` - table of filtered methylation counts showing the number of methylation sites
  detected in coding and non-coding regions, along with their totals and the overall number of methylations

  | genome | 4mC_coding_count | 4mC_non_coding_count | 4mC_total_count | 5mC_coding_count | 5mC_non_coding_count | 5mC_total_count | 6mA_coding_count | 6mA_non_coding_count | 6mA_total_count | total_count |
  |--------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|-------------|
  | KP1248 | 20               | 18                   | 38              | 6860             | 7973                 | 14833           | 11027            | 11756                | 22783           | 37654       |
  | KP1267 | 113              | 141                  | 254             | 17397            | 20724                | 38121           | 28528            | 30944                | 59472           | 97847       |
  | KP1344 | 6                | 15                   | 21              | 3651             | 4483                 | 8134            | 5799             | 6805                 | 12604           | 20759       |


* `all_methylations_statistics_per_ref.csv` - as above, but the counts are divided according to reference
  sequences in the annotation

  | genome | reference_seq       | 4mC_coding_count | 4mC_non_coding_count | 4mC_total_count | 5mC_coding_count | 5mC_non_coding_count | 5mC_total_count | 6mA_coding_count | 6mA_non_coding_count | 6mA_total_count | total_count |
  |--------|---------------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|------------------|----------------------|-----------------|-------------|
  | KP825  | chromosome_contig_1 | 33               | 33                   | 66              | 8874             | 10033                | 18907           | 14967            | 15399                | 30366           | 49339       |
  | KP825  | plasmid_contig_2    | 2                | 1                    | 3               | 252              | 333                  | 585             | 418              | 459                  | 877             | 1465        |
  | KP825  | plasmid_contig_3    | 3                | 2                    | 5               | 299              | 443                  | 742             | 412              | 528                  | 940             | 1687        |
  | KP1248 | chromosome_contig_1 | 19               | 17                   | 36              | 6482             | 7474                 | 13956           | 10514            | 11148                | 21662           | 35654       |
  | KP1248 | plasmid_contig_2    | 0                | 1                    | 1               | 119              | 141                  | 260             | 152              | 197                  | 349             | 610         |
  | KP1248 | plasmid_contig_3    | 1                | 0                    | 1               | 259              | 358                  | 617             | 361              | 411                  | 772             | 1390        |
  | KP1267 | chromosome_contig_1 | 112              | 138                  | 250             | 16958            | 20098                | 37056           | 27883            | 30075                | 57958           | 95264       |
  | KP1267 | plasmid_contig_2    | 1                | 3                    | 4               | 439              | 626                  | 1065            | 645              | 869                  | 1514            | 2583        |
  | KP1344 | chromosome_contig_3 | 6                | 13                   | 19              | 3332             | 4046                 | 7378            | 5329             | 6255                 | 11584           | 18981       |
  | KP1344 | plasmid_contig_1    | 0                | 2                    | 2               | 319              | 437                  | 756             | 470              | 550                  | 1020            | 1778        |


#### Examples
* To run set custom filtering thresholds for `MethylomeMiner`
  
  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --min_coverage 36 --min_percent_modified 95
  ```

* To change type of values in panmethylome matrix to `positions`
  
  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --matrix_values positions
  ```

* To set up working directory
  
  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --work_dir path\to\output_dir
  ```

* To request all results from `MethylomeMiner` to be saved to files

  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --write_all_results
  ```

* To generate `compact` heatmap in png file format for genes with 95 % presence

  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --heatmap_type compact
  ```

* To generate `full` heatmap in svg file format for genes with 95 % presence

  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --heatmap_type full --heatmap_file_format svg
  ```

* To generate `both` heatmaps (compact and full) in svg file format for genes with 100 % presence in analyzed 
  genomes = core genes

  ```commandline
  PanMethylomeMiner --input_bed_dir path\to\bed_dir --input_annot_dir path\to\annot_dir --roary_file path\to\gene_presence_absence.csv --heatmap_type both --heatmap_file_format svg --heatmap_min_percent_presence 100
  ```

## Citation
If you use `MethylomeMiner` in your research, please cite the following article:

> Jakubickova, M., Sabatova, K., Zbudilova, M., Bezdicek, M., Lengerova, M., Vitkova, H.
> MethylomeMiner: A novel tool for high-resolution analysis of bacterial methylation patterns from nanopore sequencing.
> Computational and Structural Biotechnology Journal, 27 (2025), pp. 4753–4759.
> [https://doi.org/10.1016/j.csbj.2025.10.047](https://doi.org/10.1016/j.csbj.2025.10.047)

Formal citation metadata are available in the `CITATION.cff` file and via the <kbd>Cite this repository</kbd> button.

## Contact
`MethylomeMiner`'s maintainers:
* Katerina Sabatova - [@KSabatova](https://github.com/KSabatova)
* Marketa Jakubickova - [@MakretaJakubickova](https://github.com/MarketaJakubickova)

## License
See the [LICENSE](LICENSE) file for license rights and limitations (GNU General Public License v3.0).

