Metadata-Version: 2.4
Name: sam-dealer
Version: 0.1.0
Summary: High-throughput, zero-IO parallel dispatcher for SAM/BAM streams. Distributes reads by index (round-robin) to persistent workers with backpressure management.
License: MIT
Author: Ben Skubi
Author-email: skubi@ohsu.edu
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Description-Content-Type: text/markdown

# sam-dealer

High-throughput, zero-IO parallel dispatcher for SAM/BAM/CRAM streams. Distributes reads by index (round-robin) to persistent workers with backpressure management.

## Install
`sam-dealer` relies on lightweight system-level concurrency tools (GNU Parallel and `mbuffer`) to achieve its performance. The commands below install the dependencies with `mamba` and then `sam-dealer` itself with `pip`. `conda` works as a (slower) drop-in replacement for `mamba`. `click` is a widely used Python package and can also be installed with `pip`; `samtools`, `parallel`, and `mbuffer` can be installed manually if you are not using `mamba` or `conda`.

```bash
mamba install -c conda-forge samtools parallel mbuffer click
pip install sam-dealer
```
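
If the tools were installed manually, a quick check that they are visible on `PATH` can save a confusing failure later. The following is a minimal sketch and not part of `sam-dealer`: the helper name `missing_tools` is hypothetical, and the executable names are assumed to match the package names in the install command.

```python
import shutil

# Executable names are assumed to match the packages in the install command.
REQUIRED_TOOLS = ("samtools", "parallel", "mbuffer")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```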

## Example

This command starts 10 persistent `wc -l` jobs in parallel. Each job receives the `input.bam` header followed by a continuous stream of records, dealt out in blocks of 100,000.

```bash
sam-dealer input.bam -N 100000 --jobs 10 "wc -l"
```
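
Any command that reads `stdin` can serve as the job. As an illustration, here is a minimal sketch of a Python worker that separates header lines from records. It assumes each job receives uncompressed SAM text, which the `wc -l` example suggests but this README does not state. The script name `count_worker.py` and the functions `count_sam_lines` and `main` are hypothetical.

```python
import sys


def count_sam_lines(lines):
    """Count SAM header lines (starting with '@') and alignment records."""
    n_header = 0
    n_records = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            n_header += 1
        else:
            n_records += 1
    return n_header, n_records


def main(stream=None):
    """Read a stream (stdin by default) and print the two counts."""
    stream = sys.stdin if stream is None else stream
    n_header, n_records = count_sam_lines(stream)
    print(f"header_lines={n_header} records={n_records}")
```

Saved as `count_worker.py` with a final `main()` call, the script could replace `wc -l` in the command above, for example `sam-dealer input.bam -N 100000 --jobs 10 "python count_worker.py"`.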

## Help

```bash
sam-dealer --help
```

## Description

`sam-dealer` dispatches blocks of consecutive SAM/BAM/CRAM records to parallel jobs. SAM/BAM/CRAM files support fetching genomic regions or individual read names, but not blocks of consecutive records. This is a problem for bioinformatics tools that require name-sorted input, such as <a href="https://pairtools.readthedocs.io/en/latest/cli_tools.html#pairtools-parse">`pairtools parse`</a>, which extracts Hi-C pairs from alignments: their input cannot be partitioned by genomic region. Even when records can be treated independently, batching over genomic regions is non-trivial and requires careful load-balancing to avoid idle jobs.

Existing solutions to this problem introduce their own problems. Splitting alignment files on disk results in substantial write amplification, requires disk I/O made slower by the need to seek between many inputs, and may delay data processing until the file has been completely split, slowing the development cycle. Although standard utilities like `pysam` can stream and dispatch records after serializing them as strings, this is slow and requires complex manual implementation.

`sam-dealer` combines standard tools to stream-dispatch SAM/BAM/CRAM records to persistent jobs simply, rapidly, and with low memory pressure. Conceptually, it works as follows:

* `J` persistent jobs are started. Each job is a user-specified CLI command that receives a distinct set of record blocks from the input SAM/BAM/CRAM file.
* The input file is divided into `J` input streams, one for each job.
* Each input stream is formed by round-robin batching of `N` linearly consecutive records at a time. The first job gets the first `N` records, the second job gets the next `N` records, and so on in a loop until all records have been processed.
* Input streams flow through a memory buffer (one buffer per job) that spills to disk under memory pressure. This allows all jobs to run independently at maximum speed as long as their buffers are not full, rather than stalling until another job has finished consuming its batch.
* Importantly, we do *not* initiate one job per batch. Instead, we initiate `J` jobs that receive the concatenation of the header and all the batches that belong to the job as a continuous stream via `stdin`. From the job's perspective, it is streaming in every `J`th block of `N` records over the entire input file.
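
The round-robin assignment described in the list above can be illustrated with a short in-memory sketch. This is not how `sam-dealer` is implemented: the real tool streams records through per-job buffers, whereas this toy version builds each job's complete stream as a list and does not model buffering or backpressure. The function name `deal` and its parameters are hypothetical.

```python
from itertools import islice


def deal(header, records, n, jobs):
    """Return one stream per job: the header, then every `jobs`-th block of `n` records."""
    # Every job's stream starts with its own copy of the header.
    streams = [list(header) for _ in range(jobs)]
    it = iter(records)
    block_index = 0
    while True:
        block = list(islice(it, n))  # next `n` consecutive records
        if not block:
            break
        # Block 0 goes to job 0, block 1 to job 1, ..., then wrap around.
        streams[block_index % jobs].extend(block)
        block_index += 1
    return streams


if __name__ == "__main__":
    header = ["@HD"]
    records = [f"r{i}" for i in range(10)]
    for job, stream in enumerate(deal(header, records, n=2, jobs=3)):
        print(job, stream)
```

With `n=2` and `jobs=3`, job 0 receives the header followed by records `r0`, `r1`, `r6`, `r7`: the first block, then the fourth, matching the "every `J`th block" behavior described above.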
