Metadata-Version: 2.4
Name: extract-soft-clipped
Version: 0.1.0
Summary: Extract soft-clipped sequences from BAM/SAM files
Author-email: Terry Jones <tcj25@cam.ac.uk>
License: MIT License
        
        Copyright (c) 2025 Terry Jones
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: pysam>=0.24.0
Description-Content-Type: text/markdown

# Extract Soft-Clipped Sequences

A Python tool to extract soft-clipped sequences from BAM/SAM files.

## Installation

### From PyPI (recommended)

```bash
pip install extract-soft-clipped
```

### With uvx (no installation required)

If you use `uv`, you can run the tool directly without installing:

```bash
uvx extract-soft-clipped --left input.bam
```

### For Development

This project uses `uv` for dependency management. Clone and install dependencies with:

```bash
git clone <repo-url>
cd extract-soft-clipped
uv sync
```

## Usage

### Basic Usage

Extract left-clipped sequences:
```bash
extract-soft-clipped --left input.bam
```

Extract right-clipped sequences:
```bash
extract-soft-clipped --right input.sam
```

### Alternative Usage Methods

**With uvx (no installation):**
```bash
uvx extract-soft-clipped --left input.bam
```

**Development usage:**
```bash
uv run extract-soft-clipped --left input.bam
```

### Filter by Length

Only extract soft-clipped sequences of a minimum length:
```bash
# Only extract left clips of at least 20 bases
extract-soft-clipped --left --min-length 20 input.bam

# Only extract right clips of at least 10 bases
extract-soft-clipped --right --min-length 10 input.sam
```

### FASTQ Output

Output sequences in FASTQ format with query IDs and quality scores:
```bash
# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq input.bam

# Extract right clips in FASTQ format with minimum length
extract-soft-clipped --right --fastq --min-length 15 input.sam

# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fastq --preserve-ids input.bam
```

### FASTA Output

Output sequences in FASTA format with query IDs:
```bash
# Extract left clips in FASTA format
extract-soft-clipped --left --fasta input.bam

# Extract right clips in FASTA format with minimum length
extract-soft-clipped --right --fasta --min-length 15 input.sam

# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fasta --preserve-ids input.bam
```

### Region Filtering

Filter soft-clipped sequences by reference coordinate overlap:
```bash
# Extract clips that overlap reference positions 1000-1025
extract-soft-clipped --left --region 1000-1025 input.bam

# Multiple regions can be specified
extract-soft-clipped --right --region 1000-1025 --region 2000-2100 input.bam

# Region coordinates are included in FASTA/FASTQ headers when filtering
extract-soft-clipped --left --fasta --region 1000-1025 input.bam
```

### Summarize Clipped Sequences

To get a frequency summary of the N bases of each clipped region that lie closest to the aligned portion of the read:

```bash
# Summarize last 10 bases of left-clipped sequences
extract-soft-clipped --left --summarize 10 input.bam

# Summarize first 15 bases of right-clipped sequences
extract-soft-clipped --right --summarize 15 input.bam
```

### Examples

```bash
# Extract all left-clipped sequences
extract-soft-clipped --left reads.bam > left_clips.txt

# Using uvx (no installation required)
uvx extract-soft-clipped --left reads.bam > left_clips.txt

# Get frequency of 8-mers at the end of left clips
extract-soft-clipped --left --summarize 8 reads.bam

# Extract right clips and save to file
extract-soft-clipped --right reads.sam > right_clips.txt

# Extract left clips of at least 15 bases
extract-soft-clipped --left --min-length 15 reads.bam

# Combine length filtering with summarization
extract-soft-clipped --right --min-length 10 --summarize 5 reads.bam

# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq reads.bam > left_clips.fastq

# Extract right clips with length filtering in FASTQ format
extract-soft-clipped --right --min-length 20 --fastq reads.bam > right_clips.fastq

# Extract left clips in FASTA format
extract-soft-clipped --left --fasta reads.bam > left_clips.fasta

# Extract right clips with length filtering in FASTA format
extract-soft-clipped --right --min-length 10 --fasta reads.bam > right_clips.fasta

# Extract clips overlapping specific reference regions
extract-soft-clipped --left --region 1000-1025 reads.bam > region_clips.txt

# Multiple filters combined with FASTA output
extract-soft-clipped --right --min-length 15 --region 2000-3000 --fasta reads.bam > filtered_clips.fasta

# Preserve original query IDs in output
extract-soft-clipped --left --fasta --preserve-ids reads.bam > original_ids.fasta
```

## Library Usage

You can also use extract-soft-clipped as a Python library:

```python
from extract_soft_clipped import extract_soft_clips, extract_soft_clips_iter

# Extract all left-clipped sequences
clips = extract_soft_clips("reads.bam", extract_left=True, extract_right=False)

for clip in clips:
    print(f"Read: {clip.query_name}")
    print(f"Sequence: {clip.sequence}")
    print(f"Quality: {clip.quality}")
    print(f"Is left clip: {clip.is_left_clip}")

# Use the generator for memory efficiency with large files
for clip in extract_soft_clips_iter("reads.bam", extract_left=True, extract_right=False, min_length=10):
    print(clip.sequence)
```

## How it Works

The tool parses the CIGAR string of each SAM/BAM alignment to identify soft-clipped regions:

- **Left clips**: Soft-clipped bases at the start of reads (before alignment)
- **Right clips**: Soft-clipped bases at the end of reads (after alignment)
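
The idea of reading clips off the CIGAR string can be illustrated with a minimal sketch. It is not the package's actual implementation (which relies on pysam). The function name `soft_clips` and the regular expression are hypothetical, and the sketch works on a plain CIGAR string so that it needs only the standard library:

```python
import re

# Matches one CIGAR operation, e.g. "5S" or "90M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str, sequence: str) -> tuple[str, str]:
    """Return (left_clip, right_clip); an empty string means no clip.

    Hypothetical helper for illustration only, not part of the package API.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are not present in the stored read sequence,
    # so they are dropped before looking at the outermost operations.
    ops = [(n, op) for n, op in ops if op != "H"]

    left = right = ""
    if ops and ops[0][1] == "S":
        left = sequence[: ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S":
        right = sequence[len(sequence) - ops[-1][0]:]
    return left, right


print(soft_clips("3S5M2S", "AAACCCCCGG"))
```

Here `3S5M2S` describes three soft-clipped bases, five aligned bases, and two soft-clipped bases, so the left clip is the first three bases of the read and the right clip is the last two.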

For the `--summarize N` option:
- **Left clips**: Analyzes the last N bases of each left-clipped sequence (closest to aligned portion)
- **Right clips**: Analyzes the first N bases of each right-clipped sequence (closest to aligned portion)
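
A rough sketch of this kind of summary, assuming clips are available as plain strings. The function `summarize_clips` is hypothetical, and skipping clips shorter than N is an assumption made for the example rather than documented behavior of the tool:

```python
from collections import Counter


def summarize_clips(clips: list[str], n: int, left: bool) -> Counter:
    """Count the n bases of each clip that sit nearest the aligned portion.

    Hypothetical helper for illustration only, not part of the package API.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for clip in clips:
        if len(clip) < n:
            # Assumption: clips shorter than n are ignored.
            continue
        # Left clips end at the alignment, so take the last n bases;
        # right clips start at the alignment, so take the first n bases.
        counts[clip[-n:] if left else clip[:n]] += 1
    return counts


for kmer, count in summarize_clips(["AAACG", "TTTCG", "G"], 2, left=True).most_common():
    print(kmer, count)
```

Taking the end nearest the alignment keeps the summary focused on the junction between clipped and aligned bases.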

For the `--region START-END` option:
- **Coordinates**: 1-based inclusive coordinates (e.g., 1000-1025 includes positions 1000 and 1025)
- **Left clips**: Calculates where soft-clipped bases would map before the alignment start
- **Right clips**: Calculates where soft-clipped bases would map after the alignment end
- **Output**: When using `--fasta` or `--fastq` with region filtering, headers include the full reference range (e.g., `region-990-1030`)
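
The coordinate arithmetic can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's code: the function names are hypothetical, the alignment bounds are taken to be 1-based and inclusive, and each clipped base is assumed to occupy exactly one reference position.

```python
def clip_reference_range(
    aln_start: int, aln_end: int, clip_len: int, left: bool
) -> tuple[int, int]:
    """1-based inclusive range a clip would cover had it aligned.

    aln_start and aln_end are the 1-based inclusive bounds of the aligned
    portion. Hypothetical helper, not part of the package API.
    """
    if left:
        # Left clips extend backwards from the base before the alignment.
        return aln_start - clip_len, aln_start - 1
    # Right clips extend forwards from the base after the alignment.
    return aln_end + 1, aln_end + clip_len


def overlaps(clip_range: tuple[int, int], region: tuple[int, int]) -> bool:
    """True if two 1-based inclusive ranges share at least one position."""
    return clip_range[0] <= region[1] and region[0] <= clip_range[1]


left_range = clip_reference_range(1010, 1100, clip_len=20, left=True)
print(left_range, overlaps(left_range, (1000, 1025)))
```

In this example a 20-base left clip on an alignment starting at position 1010 would cover positions 990 to 1009, which overlaps the region 1000-1025.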

## Requirements

- Python ≥3.10
- pysam ≥0.24.0
