Metadata-Version: 2.4
Name: seqstatx
Version: 0.1.0
Summary: Fast sequence statistics for FASTA/FASTQ files — N50, GC%, length distributions and more
Author-email: Wendy Bui <wendybuinta@gmail.com>
License: MIT
Keywords: bioinformatics,genomics,fasta,fastq,sequence,qc
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"

# seqstats

[![CI](https://github.com/perhapsstrawberries/seqstats/actions/workflows/ci.yml/badge.svg)](https://github.com/perhapsstrawberries/seqstats/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/seqstatx)](https://pypi.org/project/seqstatx/)
![Python](https://img.shields.io/badge/python-3.10%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)

Fast sequence statistics for FASTA and FASTQ files — works on plain or gzipped inputs, no dependencies.

```
file                        seqs      total_bp       gc%    mean_len    min_len  max_len      N50          N90
---------------------------------------------------------------------------------------------------------------------
GRCh38.primary_assembly.fa  194       3,088,286,401  40.93  15,918,992  970      248,956,422  153,373,213  40,103,529
SRR10045678_1.fastq.gz      10000000  1,510,000,000  50.21  151.0       151      151          151          151
```

## Install

```bash
pip install seqstatx
```

Or for development:

```bash
git clone https://github.com/perhapsstrawberries/seqstats.git
cd seqstats
# editable install with the dev extra (pytest, pytest-cov)
pip install -e ".[dev]"
```

## Usage

```bash
# single file
seqstatx genome.fa

# multiple files, gzipped FASTQ
seqstatx sample1.fastq.gz sample2.fastq.gz

# TSV output for downstream parsing
seqstatx --tsv *.fa > stats.tsv

# pipe to column for alignment
seqstatx --tsv *.fastq.gz | column -t
```
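
For downstream parsing in Python, a minimal sketch using only the standard library is shown below. It assumes the `--tsv` output starts with a header row whose column names match the Metrics table, and that large numbers may contain thousands separators; neither is guaranteed here, so check your own `stats.tsv` first. The helper names `read_stats` and `to_int` and the sample data are hypothetical.

```python
import csv
import io


def read_stats(handle):
    """Read tab-separated rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


def to_int(text):
    """Convert a count such as '1,510,000,000' or '151' to an int."""
    return int(text.replace(",", ""))


# Hypothetical sample standing in for the contents of stats.tsv.
sample = "file\tseqs\ttotal_bp\nreads.fastq.gz\t10000000\t1,510,000,000\n"

rows = read_stats(io.StringIO(sample))
total_bp = to_int(rows[0]["total_bp"])
```

To read a real file, pass `open("stats.tsv", newline="")` instead of the `io.StringIO` object.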

## Metrics

| Column | Description |
|--------|-------------|
| `seqs` | Number of sequences / reads |
| `total_bp` | Total number of bases across all sequences |
| `gc%` | GC content (%) |
| `mean_len` | Mean sequence length |
| `min_len` / `max_len` | Shortest / longest sequence |
| `N50` | Largest length L such that sequences of length ≥ L contain at least 50% of `total_bp` |
| `N90` | Largest length L such that sequences of length ≥ L contain at least 90% of `total_bp` |
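
The N50 and N90 definitions are easier to follow with a worked example. The sketch below is one common way to compute them from a list of sequence lengths; it is not necessarily how this package does it, and the function name `nx` is hypothetical.

```python
def nx(lengths, percent):
    """Return the largest length L such that sequences of length >= L
    contain at least `percent` percent of the total bases."""
    total = sum(lengths)
    if total == 0:
        return 0
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    return 0


lengths = [2, 3, 4, 5, 6]  # 20 bases in total
n50 = nx(lengths, 50)      # 6 + 5 = 11 >= 10, so N50 is 5
n90 = nx(lengths, 90)      # 6 + 5 + 4 + 3 = 18 >= 18, so N90 is 3
```

When every read has the same length, as in the FASTQ example above, N50 and N90 both equal that length.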

## Supported formats

| Extension | Format |
|-----------|--------|
| `.fa` `.fna` `.fasta` | FASTA |
| `.fq` `.fastq` | FASTQ |
| `.fa.gz` `.fastq.gz` etc. | gzipped variants |
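
As an illustration of how the extensions in this table can map to a format, here is a minimal sketch that strips an optional `.gz` suffix and then checks the remaining extension. It is an assumption about one reasonable approach, not a description of the package internals; `detect_format` and `open_text` are hypothetical names.

```python
import gzip
from pathlib import Path

FASTA_EXTS = {".fa", ".fna", ".fasta"}
FASTQ_EXTS = {".fq", ".fastq"}


def detect_format(path):
    """Return (format, gzipped) based on the file name alone."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look at the extension before .gz
    ext = suffixes[-1] if suffixes else ""
    if ext in FASTA_EXTS:
        return "fasta", gzipped
    if ext in FASTQ_EXTS:
        return "fastq", gzipped
    raise ValueError(f"unsupported extension: {path}")


def open_text(path):
    """Open a plain or gzipped sequence file in text mode."""
    _, gzipped = detect_format(path)
    if gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")
```

Only the last one or two suffixes are inspected, so names with extra dots such as `GRCh38.primary_assembly.fa` still resolve to FASTA.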

## Why

Existing tools (seqkit, seqtk) are great, but they require installing compiled binaries.  
`seqstats` is pure Python 3.10+ with zero dependencies, so it can be installed with pip in any HPC or Conda environment.

## License

MIT
