Metadata-Version: 2.4
Name: fasta-checksum-utils
Version: 0.5.0
Summary: Library and command-line utility for checksumming FASTA files and individual contigs.
License: LGPL-3.0
License-File: LICENSE
Author: David Lougheed
Author-email: david.lougheed@gmail.com
Requires-Python: >=3.10.0,<3.14
Classifier: License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: aiofiles (>=23.2.1,<26)
Requires-Dist: aiohttp (>=3.9.3,<4.0.0)
Requires-Dist: pysam (>=0.22.0,<0.24)
Description-Content-Type: text/markdown

# fasta-checksum-utils

Asynchronous library and command-line utility for checksumming FASTA files and individual contigs.
It implements two checksumming algorithms, `MD5` and `GA4GH`, to meet the needs of the
[Refget v2](http://samtools.github.io/hts-specs/refget.html) API specification.
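
For orientation, the two checksum types can be sketched with the Python standard library alone, following how
Refget v2 defines sequence digests. This is a minimal sketch, not this package's implementation: the function names
`md5_checksum` and `ga4gh_checksum` are hypothetical, uppercasing the input is an assumption about normalization,
and file-level checksums are not covered here.

```python
import base64
import hashlib


def md5_checksum(sequence: str) -> str:
    """Hex MD5 digest of the (uppercased) sequence text."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def ga4gh_checksum(sequence: str) -> str:
    """GA4GH-style identifier: SHA-512, truncated to 24 bytes, base64url-encoded, prefixed with "SQ."."""
    digest = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no "=" padding appears.
    return "SQ." + base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Hypothetical toy sequence, used only to show the shape of the results.
toy_sequence = "ACGTACGT"
toy_md5 = md5_checksum(toy_sequence)      # 32 hexadecimal characters
toy_ga4gh = ga4gh_checksum(toy_sequence)  # "SQ." followed by 32 base64url characters
```

The fixed 35-character length of the GA4GH identifier matches the `SQ.` values shown in the reports in this
document.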


## Installation

To install `fasta-checksum-utils`, run the following `pip` command:

```bash
pip install fasta-checksum-utils
```


## CLI Usage

To generate a text report of checksums for a FASTA file and each of its contigs, run the following command:

```bash
fasta-checksum-utils ./my-fasta.fa[.gz]
```

This will print output in the following tab-delimited format:

```
file  [file size in bytes]    md5 [file MD5 hash]           ga4gh  [file GA4GH hash]
chr1  [chr1 sequence length]  md5 [chr1 sequence MD5 hash]  ga4gh  [chr1 sequence GA4GH hash]
chr2  [chr2 sequence length]  md5 [chr2 sequence MD5 hash]  ga4gh  [chr2 sequence GA4GH hash]
...
```

For example, running the tool on the SARS-CoV-2 genome FASTA from NCBI produces the following output:

```
file	30428	md5	825ab3c54b7a67ff2db55262eb532438	ga4gh	SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd
NC_045512.2	29903	md5	105c82802b67521950854a851fc6eefd	ga4gh	SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D
```
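
If the text report needs to be consumed by another program, it can be split on tabs. The following is a minimal
sketch that assumes each row has the layout shown above (name, size or length, then alternating algorithm labels and
checksums); `ReportRow` and `parse_report` are hypothetical names and are not part of this package.

```python
from dataclasses import dataclass


@dataclass
class ReportRow:
    name: str                  # "file" for the whole-file row, otherwise the contig name
    size: int                  # file size in bytes, or sequence length for contigs
    checksums: dict[str, str]  # algorithm label -> checksum, e.g. {"md5": "...", "ga4gh": "..."}


def parse_report(text: str) -> list[ReportRow]:
    """Parse the tab-delimited text report into one ReportRow per non-empty line."""
    rows: list[ReportRow] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Strip each field so stray alignment spaces do not break int() parsing.
        name, size, *rest = [field.strip() for field in line.split("\t")]
        # The remaining fields alternate between algorithm label and checksum.
        checksums = dict(zip(rest[0::2], rest[1::2]))
        rows.append(ReportRow(name=name, size=int(size), checksums=checksums))
    return rows


# The SARS-CoV-2 report from above, written with explicit tab escapes.
SAMPLE_REPORT = (
    "file\t30428\tmd5\t825ab3c54b7a67ff2db55262eb532438\tga4gh\tSQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd\n"
    "NC_045512.2\t29903\tmd5\t105c82802b67521950854a851fc6eefd\tga4gh\tSQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D\n"
)

rows = parse_report(SAMPLE_REPORT)
file_row, contig_rows = rows[0], rows[1:]
```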

If `--out-format bento-json` is passed, the tool instead outputs the report as JSON, in a format designed to be
compatible with the requirements of the
[Bento Reference Service](https://github.com/bento-platform/bento_reference_service). The following example
is the output generated for the SARS-CoV-2 genome:

```json
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
```
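
The JSON report can be read with any JSON parser. As a minimal sketch using only the Python standard library, the
example below loads the report shown above and indexes its contigs by name; `index_contigs` is a hypothetical helper
and is not part of this package.

```python
import json

# The SARS-CoV-2 report from above, embedded as a string for illustration.
REPORT_JSON = """
{
  "fasta": "sars_cov_2.fa",
  "fasta_size": 30428,
  "md5": "825ab3c54b7a67ff2db55262eb532438",
  "ga4gh": "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd",
  "contigs": [
    {
      "name": "NC_045512.2",
      "md5": "105c82802b67521950854a851fc6eefd",
      "ga4gh": "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D",
      "length": 29903
    }
  ]
}
"""


def index_contigs(report: dict) -> dict[str, dict]:
    """Map each contig name to its entry (checksums and length)."""
    return {contig["name"]: contig for contig in report["contigs"]}


report = json.loads(REPORT_JSON)
contigs_by_name = index_contigs(report)

# Sum of all contig sequence lengths in the report.
total_length = sum(contig["length"] for contig in report["contigs"])
```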

If an argument like `--fai [path or URL]` is passed, an additional `"fai": "..."` property is added to the JSON
object output.

If an argument like `--genome-id GRCh38` is passed, an additional `"id": "GRCh38"` property is added to the
JSON object output.


## Library Usage

Below are some examples of how `fasta-checksum-utils` can be used as an asynchronous Python library:

```python
import asyncio
import fasta_checksum_utils as fc
import pysam
from pathlib import Path


async def demo():
    covid_genome: Path = Path("./sars_cov_2.fa")
    
    # calculate an MD5 checksum for a whole file
    file_checksum: str = await fc.algorithms.AlgorithmMD5.checksum_file(covid_genome)
    print(file_checksum)
    # prints "863ee5dba1da0ca3f87783782284d489"
    
    all_algorithms = (fc.algorithms.AlgorithmMD5, fc.algorithms.AlgorithmGA4GH)
    
    # calculate multiple checksums for a whole file
    all_checksums: tuple[str, ...] = await fc.checksum_file(file=covid_genome, algorithms=all_algorithms)
    print(all_checksums)
    # prints tuple: ("863ee5dba1da0ca3f87783782284d489", "SQ.mMg8qNej7pU84juQQWobw9JyUy09oYdd")
    
    # calculate an MD5 and GA4GH checksum for a specific contig in a PySAM FASTA file:
    fh = pysam.FastaFile(str(covid_genome))
    try:
        contig_checksums: tuple[str, ...] = await fc.checksum_contig(
            fh=fh, 
            contig_name="NC_045512.2", 
            algorithms=all_algorithms,
        )
        print(contig_checksums)
        # prints tuple: ("105c82802b67521950854a851fc6eefd", "SQ.SyGVJg_YRedxvsjpqNdUgyyqx7lUfu_D")
    finally:
        fh.close()  # always close the file handle


asyncio.run(demo())
```

