Metadata-Version: 2.4
Name: toxolib
Version: 0.1.0
Summary: A tool for metagenomic taxonomic profiling and abundance matrix generation
Home-page: https://github.com/dhruvac29/toxolib
Author: Dhruvil Chodvadiya
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Requires-Dist: numpy>=1.18.0
Requires-Dist: scikit-bio>=0.5.0
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: paramiko>=2.7.0
Requires-Dist: pyyaml>=5.1
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# toxolib

A Python package for metagenomic taxonomic profiling and abundance matrix generation.

## Installation

### Using pip

```bash
pip install toxolib
```

### Install directly from GitHub

```bash
pip install git+https://github.com/dhruvac29/toxolib.git
```

### Using conda

We recommend using conda to install all dependencies. An environment file (`environment.yml`) is included in the repository:

```bash
# Clone the repository
git clone https://github.com/dhruvac29/toxolib.git
cd toxolib

# Create and activate the conda environment
conda env create -f environment.yml
conda activate taxonomy_env

# Install the package
pip install -e .
```

## Requirements

This package requires the following external tools to be installed and available in your PATH:

- Kraken2
- Bracken
- Krona (for visualization)
- fastp (for preprocessing)
- bowtie2 (for host removal)
- samtools

All of these tools are included in the conda environment file (`environment.yml`).

You'll also need to set the environment variable `KRAKEN2_DB_DIR` to the path of your Kraken2 database:

```bash
export KRAKEN2_DB_DIR=/path/to/kraken2/database
```
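
Before running the pipeline, it can help to confirm that the tools and the environment variable above are in place. The following is a minimal sketch, not part of toxolib itself. The executable names (in particular `ktImportText` for Krona) are assumptions and may differ on your system.

```python
import os
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["kraken2", "bracken", "ktImportText", "fastp", "bowtie2", "samtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def kraken_db_problem(env=os.environ):
    """Describe what is wrong with KRAKEN2_DB_DIR, or return None if it looks fine."""
    path = env.get("KRAKEN2_DB_DIR")
    if not path:
        return "KRAKEN2_DB_DIR is not set"
    if not os.path.isdir(path):
        return f"KRAKEN2_DB_DIR does not point to a directory: {path}"
    return None


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"missing tool: {tool}")
    problem = kraken_db_problem()
    if problem:
        print(problem)
```

If the script prints nothing, every listed tool was found and `KRAKEN2_DB_DIR` points to an existing directory.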

## Usage

### Local Usage

#### Generate abundance matrix from raw data

```bash
toxolib abundance -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o output_directory
```

This will:
1. Run Kraken2 on the raw data
2. Run Bracken on the Kraken2 results
3. Generate an abundance matrix from the Bracken results
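
For readers who want to see roughly what the first two steps involve, here is a hedged sketch that builds Kraken2 and Bracken command lines for one paired-end sample. It is an illustration only: the function names, output file names, read length, and taxonomic level are assumptions, and toxolib may invoke the tools with different options.

```python
import subprocess
from pathlib import Path


def build_commands(read1, read2, out_dir, db_dir, read_length=150):
    """Build Kraken2 and Bracken command lines for one paired-end sample."""
    out = Path(out_dir)
    report = out / "sample.kreport"  # hypothetical file name
    kraken2 = [
        "kraken2", "--db", str(db_dir), "--paired", "--gzip-compressed",
        "--report", str(report), "--output", str(out / "sample.kraken"),
        str(read1), str(read2),
    ]
    bracken = [
        "bracken", "-d", str(db_dir), "-i", str(report),
        "-o", str(out / "sample_species.bracken"),
        "-r", str(read_length), "-l", "S",  # species-level estimates
    ]
    return kraken2, bracken


def run_pipeline(read1, read2, out_dir, db_dir):
    """Run Kraken2, then Bracken, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(read1, read2, out_dir, db_dir):
        subprocess.run(command, check=True)
```

Bracken takes the Kraken2 report as its input, which is why the two commands share the same report path.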

#### Create abundance matrix from existing Bracken files

```bash
toxolib matrix -i sample1_species.bracken sample2_species.bracken -o abundance_matrix.csv
```
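
To illustrate what an abundance matrix built from Bracken files looks like, the sketch below merges several Bracken outputs into one table with a row per taxon and a column per sample. It relies on Bracken's standard tab-separated columns `name` and `new_est_reads`. The function names and the output layout are assumptions, not toxolib's actual implementation, and the sketch uses only the standard library so it runs without extra dependencies.

```python
import csv
from pathlib import Path


def read_bracken(path):
    """Map taxon name -> estimated read count for one Bracken output file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["name"]: int(row["new_est_reads"]) for row in reader}


def build_matrix(paths):
    """Return (sample_names, rows); each row is [taxon, count per sample, ...]."""
    # Sample names are taken from the file names without their extension.
    samples = {Path(p).stem: read_bracken(p) for p in paths}
    taxa = sorted(set().union(*samples.values()))
    names = list(samples)
    # A taxon absent from a sample gets a count of 0.
    rows = [[taxon] + [samples[s].get(taxon, 0) for s in names] for taxon in taxa]
    return names, rows


def write_matrix(paths, out_csv):
    """Write the merged matrix as CSV with a header row."""
    names, rows = build_matrix(paths)
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["taxon"] + names)
        writer.writerows(rows)
```

For example, `write_matrix(["sample1_species.bracken", "sample2_species.bracken"], "abundance_matrix.csv")` would produce columns named `sample1_species` and `sample2_species`.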

### HPC Usage

Toxolib can run the analysis pipeline on an HPC cluster using SLURM for job scheduling.

#### 1. Set up HPC connection

```bash
toxolib hpc-setup --hostname your-hpc-server.edu --username your-username --key-file ~/.ssh/id_rsa
```

This will save your HPC connection details to `~/.toxolib/hpc_config.yaml`.

#### 2. Run the pipeline on HPC

```bash
toxolib hpc -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o /path/on/hpc/output_dir \
    --kraken-db /path/on/hpc/kraken2_db \
    --corn-db /path/on/hpc/corn_db
```

This will:
1. Upload your raw data files to the HPC
2. Create a Snakemake workflow file
3. Submit a SLURM job to run the analysis
4. Return a job ID for tracking
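
Steps 3 and 4 can be pictured with the following sketch, which renders a small SLURM batch script that calls Snakemake and extracts the job ID from the line `sbatch` prints on submission. The function names, the `Snakefile` name, and the chosen `#SBATCH` options are illustrative assumptions; the script toxolib generates may look different.

```python
import re


def render_batch_script(job_name, cores, snakefile="Snakefile"):
    """Return the text of a minimal SLURM batch script that runs Snakemake."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cores}",
        "#SBATCH --output=%x_%j.log",  # %x = job name, %j = job ID
        f"snakemake --snakefile {snakefile} --cores {cores}",
    ]
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_output!r}")
    return match.group(1)
```

The job ID is kept as a string because it is only ever passed back to other commands, such as the status and download steps below.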

#### 3. Check job status

```bash
toxolib hpc-status --job-id your_job_id
```

#### 4. Download results when complete

```bash
toxolib hpc-download --job-id your_job_id --output-dir ./local_results
```

## Database Setup

### Kraken2 Database

You can download the standard Kraken2 database from:
[https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz](https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz)

```bash
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz
mkdir -p /path/to/kraken2/database
tar -xzf k2_standard_20240112.tar.gz -C /path/to/kraken2/database
export KRAKEN2_DB_DIR=/path/to/kraken2/database
```
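
After extracting the archive, you may want to confirm that the directory looks like a usable database. The sketch below checks for the three index files Kraken2 needs (`hash.k2d`, `opts.k2d`, `taxo.k2d`) and for a Bracken k-mer distribution file. The 150 bp read length is an assumption, so pick the value that matches your reads. The function name is hypothetical.

```python
from pathlib import Path

# Index files that Kraken2 reads from a database directory.
KRAKEN2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_db_files(db_dir, read_length=150):
    """Return the expected Kraken2/Bracken files that are absent from db_dir."""
    expected = KRAKEN2_FILES + (f"database{read_length}mers.kmer_distrib",)
    return [name for name in expected if not (Path(db_dir) / name).is_file()]
```

An empty list means all expected files were found, for example `missing_db_files("/path/to/kraken2/database")`.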

### Corn Genome Database

For host removal, you can download the corn genome reference from:
[https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip](https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip)

```bash
wget https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip
unzip corn_db.zip -d /path/to/corn_db
```

### Setting up databases on HPC

When using the HPC functionality, you'll need to upload and extract these databases on your HPC system:

```bash
# On your local machine, download the databases
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_20240112.tar.gz
wget https://glwasoilmetagenome.s3.us-east-1.amazonaws.com/corn_db.zip

# Upload to HPC (using scp)
scp k2_standard_20240112.tar.gz your-username@your-hpc-server.edu:/path/on/hpc/
scp corn_db.zip your-username@your-hpc-server.edu:/path/on/hpc/

# SSH into the HPC, then run the remaining commands in that remote session
ssh your-username@your-hpc-server.edu
mkdir -p /path/on/hpc/kraken2_db
tar -xzf /path/on/hpc/k2_standard_20240112.tar.gz -C /path/on/hpc/kraken2_db
mkdir -p /path/on/hpc/corn_db
unzip /path/on/hpc/corn_db.zip -d /path/on/hpc/corn_db
```

Then, when running toxolib, pass these paths:

```bash
toxolib hpc -r raw_data_1.fastq.gz raw_data_2.fastq.gz -o /path/on/hpc/output_dir \
    --kraken-db /path/on/hpc/kraken2_db \
    --corn-db /path/on/hpc/corn_db
```

## License

MIT
