Metadata-Version: 2.4
Name: eplace
Version: 0.1.2
Summary: ePLACE: environmental Phylogenetic Localisation and Clade Estimation - A library for analyzing eDNA sequences
Author-email: Rob Edwards <raedwards@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/linsalrob/eplace
Project-URL: Repository, https://github.com/linsalrob/eplace
Project-URL: Issues, https://github.com/linsalrob/eplace/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0.0; extra == "docs"
Requires-Dist: myst-parser>=2.0.0; extra == "docs"
Dynamic: license-file

[![Edwards Lab](https://img.shields.io/badge/Bioinformatics-EdwardsLab-03A9F4)](https://edwards.flinders.edu.au/)
[![DOI](https://zenodo.org/badge/1120756704.svg)](https://doi.org/10.5281/zenodo.18181123)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![GitHub language count](https://img.shields.io/github/languages/count/linsalrob/eplace)
[![Documentation Status](https://readthedocs.org/projects/eplace/badge/?version=latest)](https://eplace.readthedocs.io/en/latest/)

# eplace

ePLACE: environmental Phylogenetic Localisation and Clade Estimation

A Python library for analyzing environmental DNA (eDNA) sequences through BLAST comparison and taxonomic classification.

## Documentation

For all the features available, [please check out readthedocs](https://eplace.readthedocs.io/en/latest/)

## Features

- **NCBI Database Management**: Download and manage NCBI BLAST databases (core_nt)
- **FASTA File Processing**: Read and validate FASTA files
- **BLAST Search**: Run blastn searches with configurable parameters
- **Result Filtering**: Filter BLAST results by identity and coverage thresholds
- **Taxonomic Analysis**: Extract representative sequences per taxonomic rank
- **Sequence Extraction**: Retrieve sequences from BLAST databases
- **Sequence Trimming**: Trim reference sequences to aligned regions based on BLAST coordinates
- **Multiple Sequence Alignment**: Align sequences using MAFFT with auto-orientation
- **Phylogenetic Trees**: Build and label phylogenetic trees using IQTree
- **Results Summary Output**: Creates a tab separated output that summarises the per-sequence matches.

## Installation

```bash
# Clone the repository
git clone https://github.com/linsalrob/eplace.git
cd eplace

# Install the package
pip install .

# Or install in development mode
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"
```

After installation, the `eplace` command will be available in your environment.

## Requirements

- Python 3.8 or higher
- BLAST+ tools (blastn, blastdbcmd) must be installed separately:
  ```bash
  # Ubuntu/Debian
  sudo apt-get install ncbi-blast+
  
  # macOS with Homebrew
  brew install blast
  ```
- TaxonKit (for taxonomy lookup):
  ```bash
  # Download from: https://github.com/shenwei356/taxonkit/releases
  # Or install via conda:
  conda install -c bioconda taxonkit
  ```
- MAFFT (optional, for sequence alignment):
  ```bash
  # Ubuntu/Debian
  sudo apt-get install mafft
  
  # macOS with Homebrew
  brew install mafft
  
  # Or via conda:
  conda install -c bioconda mafft
  ```
- IQTree (optional, for phylogenetic tree building):
  ```bash
  # Ubuntu/Debian
  sudo apt-get install iqtree
  
  # macOS with Homebrew
  brew install iqtree
  
  # Or via conda:
  conda install -c bioconda iqtree
  ```

## Quick Start

ePLACE provides a unified command-line interface with three main commands:

### 1. Download NCBI Database

```bash
# Download the core_nt database to default location
eplace download

# Force redownload even if database exists
eplace download --force
```

### 2. Run Individual BLAST Analysis

Run BLAST search and build one phylogenetic tree per query sequence:

```bash
# Basic usage with default parameters
eplace blast query.fasta output_dir

# With custom parameters
eplace blast query.fasta output_dir \
    --rank genus \
    --min-identity 95 \
    --min-coverage 85 \
    --num-threads 4

# Skip alignment and tree building (BLAST and extraction only)
eplace blast query.fasta output_dir --skip-alignment

# Show help
eplace blast --help
```

### 3. Run Grouped BLAST Analysis

Run BLAST search and group queries by taxonomic rank for joint phylogenetic analysis:

```bash
# Basic usage (group by class, default)
eplace grouped query.fasta output_dir

# Group by different taxonomic rank
eplace grouped query.fasta output_dir --group-rank order

# Specify both representative and grouping ranks
eplace grouped query.fasta output_dir --rank genus --group-rank family

# Show help
eplace grouped --help
```

### Using the Library API

You can also use ePLACE as a Python library:

```python
from eplace_lib import setup_ncbi_database

# Download the core_nt database
success, message = setup_ncbi_database()
```

```python
from pathlib import Path
from eplace_lib import run_blast_search, process_blast_results_for_taxonomy

# Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,    # 90% identity threshold
    min_coverage=80.0     # 80% query coverage threshold
)

# Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="species"  # Options: phylum, class, order, family, genus, species
)
```

## Command-Line Interface

The `eplace` command provides three subcommands:

### eplace download

Download and setup the NCBI core_nt BLAST database.

**Usage:**
```bash
eplace download [--force]
```

**Options:**
- `--force`: Force redownload even if database exists

**Notes:**
- Database will be stored in `$BLASTDB` if set, otherwise `~/blastdb`
- The download is large (several GB) and may take time
- MD5 checksums are verified automatically

### eplace blast

Run BLAST search with individual taxonomy analysis. Creates one phylogenetic tree per query sequence.

**Usage:**
```bash
eplace blast QUERY_FASTA OUTPUT_DIR [OPTIONS]
```

**Required Arguments:**
- `QUERY_FASTA`: Path to query FASTA file
- `OUTPUT_DIR`: Output directory for results

**Optional Arguments:**
- `--rank {phylum,class,order,family,genus,species}`: Taxonomic rank for representative selection (default: genus)
- `--tree-label-rank {phylum,class,order,family,genus,species}`: Taxonomic rank for tree labeling (default: genus)
- `--min-identity FLOAT`: Minimum percent identity for BLAST hits (default: 90.0)
- `--min-coverage FLOAT`: Minimum query coverage percentage (default: 80.0)
- `--database NAME`: BLAST database name (default: core_nt)
- `--blastdb-path PATH`: Path to BLAST database directory
- `--num-threads INT`: Number of threads for BLAST and alignment (default: 1)
- `--overwrite-existing-blast`: Overwrite existing BLAST results
- `--skip-alignment`: Skip alignment and tree building steps
- `--output-classification PATH`: Path to output classification TSV file

### eplace grouped

Run BLAST search with grouped taxonomy analysis. Groups queries by taxonomic rank and creates one phylogenetic tree per group.

**Usage:**
```bash
eplace grouped QUERY_FASTA OUTPUT_DIR [OPTIONS]
```

**Required Arguments:**
- `QUERY_FASTA`: Path to query FASTA file
- `OUTPUT_DIR`: Output directory for results

**Optional Arguments:**
- `--rank {phylum,class,order,family,genus,species}`: Taxonomic rank for representative selection (default: genus)
- `--group-rank {phylum,class,order,family,genus,species}`: Taxonomic rank for grouping sequences (default: class)
- `--tree-label-rank {phylum,class,order,family,genus,species}`: Taxonomic rank for tree labeling (default: genus)
- `--min-identity FLOAT`: Minimum percent identity for BLAST hits (default: 90.0)
- `--min-coverage FLOAT`: Minimum query coverage percentage (default: 80.0)
- `--database NAME`: BLAST database name (default: core_nt)
- `--blastdb-path PATH`: Path to BLAST database directory
- `--num-threads INT`: Number of threads for BLAST and alignment (default: 1)
- `--overwrite-existing-blast`: Overwrite existing BLAST results
- `--skip-alignment`: Skip alignment and tree building steps
- `--alignment-tolerance INT`: Maximum coordinate difference for alignment consistency (default: 50)
- `--output-classification PATH`: Path to output classification TSV file

## Documentation

Full documentation is available at [Read the Docs](https://eplace.readthedocs.io/).

- [Installation Guide](https://eplace.readthedocs.io/en/latest/installation.html) - Complete installation instructions
- [Quick Start Guide](https://eplace.readthedocs.io/en/latest/quickstart.html) - Get started quickly
- [Command-Line Interface](https://eplace.readthedocs.io/en/latest/cli.html) - Complete CLI reference
- [Workflows](https://eplace.readthedocs.io/en/latest/workflows.html) - Individual and grouped workflow details
- [API Reference](https://eplace.readthedocs.io/en/latest/api.html) - Python API documentation
- [NCBI Database Download](https://eplace.readthedocs.io/en/latest/ncbi_download.html) - Database management guide
- [BLAST Workflow](https://eplace.readthedocs.io/en/latest/blast_workflow.html) - BLAST analysis guide
- [Alignment](https://eplace.readthedocs.io/en/latest/alignment.html) - Sequence alignment documentation
- [Phylogenetic Trees](https://eplace.readthedocs.io/en/latest/phylogenetic_trees.html) - Tree building guide
- [Contributing](https://eplace.readthedocs.io/en/latest/contributing.html) - Contribution guidelines

### Local Documentation

You can also build the documentation locally:

```bash
cd docs
make html
# Open docs/build/html/index.html in your browser
```

## Workflow Comparison

### Individual Workflow (`eplace blast`)

The individual workflow processes each query sequence independently:
- Creates one output directory per query sequence
- Extracts representative sequences for each query at the specified taxonomic rank
- Builds one multiple sequence alignment per query
- Creates one phylogenetic tree per query

**Use when:** You want to analyze each query sequence in its own phylogenetic context.

### Grouped Workflow (`eplace grouped`)

The grouped workflow combines queries by taxonomic classification:
- Groups all queries that match to the same taxonomic rank (e.g., class, order)
- Creates one FASTA file per group containing all queries and unique reference sequences
- Removes redundant reference sequences within each group
- Builds one alignment and phylogenetic tree per group (instead of per query)

**Use when:** You want to analyze multiple related queries together in a single phylogenetic context.

### Examples

```bash
# Group queries by class (default)
eplace grouped query.fasta output_dir

# Group by a different taxonomic rank
eplace grouped query.fasta output_dir --group-rank order

# Specify both representative rank and grouping rank
eplace grouped query.fasta output_dir --rank genus --group-rank family
```

## Testing

Run the test suite:

```bash
# Run all tests
pytest tests/ -v

# Run specific test modules
pytest tests/test_blast_analysis.py -v
pytest tests/test_taxonomy.py -v
pytest tests/test_workflow.py -v

# Run with coverage
pytest tests/ --cov=eplace_lib --cov-report=html
```

## Project Structure

```
eplace/
├── src/
│   └── eplace_lib/
│       ├── __init__.py
│       ├── blast_analysis.py    # BLAST operations
│       ├── ncbi_download.py     # Database management
│       ├── sequences.py         # Sequence analysis utilities
│       └── taxonomy.py          # Taxonomy extraction
├── tests/
│   ├── test_blast_analysis.py
│   ├── test_ncbi_download.py
│   ├── test_taxonomy.py
│   └── test_workflow.py
├── examples/
│   ├── blast_workflow_example.py
│   └── download_ncbi_example.py
├── docs/
│   ├── blast_workflow.md
│   └── ncbi_download.md
└── pyproject.toml
```

## Workflow Overview

1. **Download Database**: Use `setup_ncbi_database()` to download NCBI core_nt database
2. **Prepare Query**: Create a FASTA file with your query sequences
3. **Run BLAST**: Use `run_blast_search()` to search against the database
4. **Filter Results**: Automatically filter by identity and coverage thresholds
5. **Extract Representatives**: Select representative sequences per taxonomic rank
6. **Trim Sequences**: Extract aligned regions from reference sequences based on BLAST coordinates
7. **Align Sequences**: Use MAFFT to align query with trimmed reference sequences (optional)
8. **Build Tree**: Build phylogenetic tree using IQTree with taxonomic labels (optional)
9. **Output**: Get FASTA files, alignments, and trees (one set per query)

### Grouped Workflow Overview

The grouped workflow adds an additional step:
1-5. Same as standard workflow through representative extraction
6. **Group by Rank**: Group all queries by specified taxonomic rank (e.g., class)
7. **Create Grouped FASTA**: Combine all queries and unique references for each group
8. **Trim Sequences**: Trim references to aligned regions
9. **Check Consistency**: Verify BLAST hits align to similar locations on references
10. **Align and Build Trees**: Create one alignment and tree per taxonomic group

## Output Structure

### Standard Workflow Output

```
output_dir/
├── blast_results.txt              # Raw BLAST results
├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
├── query1_id/
│   ├── query1_id_representatives.fasta          # Representative sequences
│   ├── query1_id_with_query.fasta              # Query + representatives
│   ├── query1_id_trimmed.fasta                 # Trimmed to aligned regions
│   ├── query1_id_aligned.fasta                 # Multiple sequence alignment
│   ├── query1_id_tree.treefile                 # Phylogenetic tree
│   ├── query1_id_tree_labeled.treefile         # Tree with taxonomic labels
│   └── query1_id_tree.* (other IQTree files)
├── query2_id/
│   └── query2_id_representatives.fasta
└── ...
```

### Grouped Workflow Output

```
output_dir/
├── blast_results.txt              # Raw BLAST results
├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
├── query1_id/                     # Per-query representative sequences (from step 5)
│   └── query1_id_representatives.fasta
├── query2_id/
│   └── query2_id_representatives.fasta
├── Taxonomic_Group_1/             # One directory per taxonomic group
│   ├── Taxonomic_Group_1_combined.fasta        # All queries + unique references
│   ├── Taxonomic_Group_1_trimmed.fasta         # Trimmed to aligned regions
│   ├── Taxonomic_Group_1_aligned.fasta         # Multiple sequence alignment
│   ├── Taxonomic_Group_1_tree.treefile         # Phylogenetic tree
│   ├── Taxonomic_Group_1_tree_labeled.treefile # Tree with taxonomic labels
│   └── Taxonomic_Group_1_tree.* (other IQTree files)
├── Taxonomic_Group_2/
│   └── ...
└── ...
```

## License

MIT License - See LICENSE file for details

## Authors

- Rob Edwards (raedwards@gmail.com)

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

If you use ePLACE in your research, please cite:

```
Edwards, R. (2024). ePLACE: environmental Phylogenetic Localisation and Clade Estimation.
GitHub repository: https://github.com/linsalrob/eplace
```

## Support

For issues, questions, or suggestions, please open an issue on GitHub:
https://github.com/linsalrob/eplace/issues
