Metadata-Version: 2.4
Name: tfclass_predict
Version: 1.1.3
Summary: tfclass_predict allows to estimate transcription factor bindingsites in the TFClass hierarchy.
Author-email: Christian Ickes <christian.ickes@med.uni-goettingen.de>, Hazal Timucin <hazal.timucin@med.uni-goettingen.de>
Project-URL: Homepage, https://gitlab.gwdg.de/hti/tfclass_dnabert
Project-URL: Issues, https://gitlab.gwdg.de/hti/tfclass_dnabert/-/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Development Status :: 4 - Beta
Classifier: Operating System :: Unix
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers<=4.37.2
Requires-Dist: tensorflow>=2.13.0
Requires-Dist: pysam
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: pyBigWig
Requires-Dist: tf-keras
Requires-Dist: tqdm
Dynamic: license-file

# TFClassPredict

## Description

TFClassPredict predicts transcription factor binding sites (TFBSs) in the human genome according to their DNA-binding domain (DBD), using the 23 DBD classes defined at the class level of the TFClass hierarchy. It builds on the DNABERT model to reach high prediction performance.

## Package Workflow Structure

::: {style="text-align:center"}
<img src="https://gitlab.gwdg.de/hti/tfclass_dnabert/-/raw/main/workflow_schema.drawio.png" alt="./workflow_schema.drawio.png"/>
:::

## Installation

The package can be installed via pip:

```         
pip install tfclass_predict
```

Using the precompiled data set is **recommended**, as it shortens runtime drastically and can be run on basically any system.

[TFCP_precompiled](https://owncloud.gwdg.de/index.php/s/hvS3ntYSGNU6mg3)

Alternatively, the human genome (v38) and the TFClassPredict (TFCP) model can be used directly.

[HG38](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz) from UCSC

[TFCP_model](https://owncloud.gwdg.de/index.php/s/kUqk2XXaC8bh8HM)

All downloads need to be **unzipped** so that either the path to the `TFCP_precompiled` directory, or the paths to the `hg38.fa` file and the `TFCP_model` directory, can be passed to the command-line tool or to `PredictionManager`.
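As an illustration of the unzipping step, the following minimal sketch decompresses the UCSC `hg38.fa.gz` download using only the Python standard library. The helper name `gunzip` and the file locations are assumptions made for this example. The archive format of the two TFCP downloads is not stated here, so they are not covered.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: str, dst: str) -> Path:
    """Stream-decompress the gzip file `src` into `dst` and return `dst`."""
    # copyfileobj copies in chunks, so the genome is never held fully in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dst)
```

Calling `gunzip("hg38.fa.gz", "hg38.fa")` in the download directory would produce the `hg38.fa` file that `--genome` expects.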

## Usage

The tool can be used from the command line with the following parameters:

```         
usage: tfclass_predict [-h] [--genome GENOME] [--precompiled PRECOMPILED] [--model MODEL] [--gpus GPUS] [--cpus CPUS] bed_file output_dir

tfclass_predict allows to estimate transcription factor bindingsites in the TFClass hierarchy. Please specifiy either the path to the precompiled (--precompiled) files or to the published
model (--model + --genome). If both are defined, the precompiled dataset is preferred!

positional arguments:
  bed_file              Path to bed file of ATAC-seq or other NGS experiment.
  output_dir            Path to output directory.

options:
  -h, --help            show this help message and exit
  --genome GENOME       Path to human genome reference (rec.: hg38) (.fa).
  --precompiled PRECOMPILED
                        Path to precompiled hg38 predictions (unzipped folder).
  --model MODEL         Path to TFClass model archive (unzipped folder).
  --gpus GPUS           Number of GPUs that should be used in parallel.
  --cpus CPUS           Number of CPUs that should be used at maximum in parallel. 
```
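To make the two input modes concrete, here is a small sketch that assembles the argument list for either mode from Python. It uses only the flags listed in the help text above. The helper `build_command` is hypothetical and not part of the package, and the check it performs only mirrors the rule from the help text (precompiled files, or model plus genome).

```python
from typing import List, Optional


def build_command(bed_file: str, output_dir: str,
                  precompiled: Optional[str] = None,
                  model: Optional[str] = None,
                  genome: Optional[str] = None,
                  gpus: Optional[int] = None,
                  cpus: Optional[int] = None) -> List[str]:
    """Assemble a tfclass_predict call (hypothetical helper, not package API)."""
    if not precompiled and not (model and genome):
        raise ValueError("Give --precompiled, or both --model and --genome.")
    cmd = ["tfclass_predict"]
    if precompiled:
        # The tool prefers the precompiled data set, so model/genome are dropped.
        cmd += ["--precompiled", precompiled]
    else:
        cmd += ["--model", model, "--genome", genome]
    if gpus is not None:
        cmd += ["--gpus", str(gpus)]
    if cpus is not None:
        cmd += ["--cpus", str(cpus)]
    # Positional arguments come last: bed_file, then output_dir.
    return cmd + [bed_file, output_dir]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the tool.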

Or directly in Python scripts:

``` python
from tfclass_predict import PredictionManager

bed_file = "path_to_example/bed.bed"  # smaller bed file for testing
genome_file = "hg38.fa"
tfclass_model = "model/TFCP_model"  # see Installation
tfclass_precompiled = "model/TFCP_precompiled"  # see Installation
res_dir = "res"

# using the precompiled data set (recommended)
pred_manager = PredictionManager(bed_file, res_dir, precompiled=tfclass_precompiled)
pred_manager.predict()
pred_manager.save_results()

# using genome and model
pred_manager = PredictionManager(bed_file, res_dir, genome=genome_file, tfcp_model=tfclass_model)
pred_manager.predict()
pred_manager.save_results()
```
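For a quick test run, a small BED file can be written by hand. The sketch below writes tab-separated `chrom`, `start`, `end` lines, which is the minimal BED layout. Which columns TFClassPredict actually reads is an assumption here, and both the regions and the helper `write_bed` are hypothetical placeholders.

```python
from pathlib import Path
from typing import Iterable, Tuple

# Hypothetical example regions (chrom, start, end); replace with real peaks.
REGIONS = [("chr1", 1_000_000, 1_000_500), ("chr2", 2_000_000, 2_000_500)]


def write_bed(path: str, regions: Iterable[Tuple[str, int, int]]) -> Path:
    """Write regions as a minimal three-column, tab-separated BED file."""
    lines = [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out
```

The returned path can then be used as `bed_file` in the examples above.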

When the `TFCP_model` is used, predictions run under a mirrored strategy by default, so they are distributed across several GPUs if `--gpus` is greater than 1. Set `--cpus` to use more than one core for parallel tokenization.
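How the package sets up its strategy internally is not shown in this README. As a rough sketch of the general technique, the snippet below maps a GPU count to a TensorFlow `tf.distribute.MirroredStrategy`. The helpers `gpu_device_names` and `build_strategy` are hypothetical, and falling back to the default strategy for one GPU or fewer is an assumption of this example.

```python
from typing import List


def gpu_device_names(n_gpus: int) -> List[str]:
    """Return TensorFlow-style device names for the first `n_gpus` GPUs."""
    return [f"/gpu:{i}" for i in range(max(n_gpus, 0))]


def build_strategy(n_gpus: int):
    """Return a MirroredStrategy for several GPUs, else the default strategy."""
    import tensorflow as tf  # imported lazily so the helper above needs no TF

    if n_gpus > 1:
        # Variables created under this strategy are replicated on each device.
        return tf.distribute.MirroredStrategy(devices=gpu_device_names(n_gpus))
    # Assumption: one GPU or fewer needs no mirroring.
    return tf.distribute.get_strategy()
```

A Keras model built inside `with build_strategy(2).scope():` has its variables mirrored on both GPUs.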

## Further Documentation

Find more information about the API at [ReadTheDocs](https://tfclass-predict.readthedocs.io/en/v0.0.4/).\
(Currently outdated! - 26/05/21)
