Metadata-Version: 2.4
Name: sgffp
Version: 0.15.2
Summary: Read, write, and manipulate SnapGene files (.dna, .rna, .prot)
Project-URL: Homepage, https://github.com/merv1n34k/sgffp
Project-URL: Repository, https://github.com/merv1n34k/sgffp
Project-URL: Issues, https://github.com/merv1n34k/sgffp/issues
Author-email: Oleksii Stroganov <merv1n@proton.me>
License: MIT
License-File: LICENSE
Keywords: bioinformatics,dna,molecular-biology,parser,protein,rna,snapgene
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: File Formats
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: xmltodict>=0.14
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Description-Content-Type: text/markdown

# SnapGene File Format Parser

A reverse-engineered parser and writer for SnapGene files (`.dna`, `.rna`, `.prot`) holding DNA, RNA, or protein sequences. Supports all known block types with typed Python models, a chainable builder pattern, and a history operations API.

> [!Important]
> Found an unknown block type? Run `sff check your_file.dna -l` and look for `[NEW]` markers. Please report them in [#1](https://github.com/merv1n34k/sgffp/issues/1) with a dump (`sff check your_file.dna -d`).

## Installation

```bash
pip install sgffp
```

Requires Python 3.10+.

## Quick Start

```python
from sgffp import SgffReader, SgffWriter, SgffObject

# Read a SnapGene file
sgff = SgffReader.from_file("plasmid.dna")

# Access data via typed properties
print(sgff.sequence.value)
print(sgff.features[0].name)

# Modify and write back
sgff.sequence.topology = "circular"
SgffWriter.to_file(sgff, "output.dna")

# Create a new file from scratch
sgff = (
    SgffObject.new("ATGCATGCATGC", topology="circular")
    .add_feature("GFP", "CDS", 0, 8)
    .add_primer("fwd", "ATGC", bind_position=0)
)
SgffWriter.to_file(sgff, "new_plasmid.dna")
```

### History Operations

Record cloning operations with automatic history tracking:

```python
sgff.ops.insert_fragment("ATCGATCG")
sgff.ops.digest("GGCC", InputSummary={"manipulation": "digest"})

# Or build an entire tree from multiple source files
vector = SgffReader.from_file("vector.dna")
insert = SgffReader.from_file("insert.dna")

sgff.ops.build_from_spec(
    [
        {"id": 1, "operation": "insertFragment", "sequence": "...",
         "name": "Final", "children": [2, 3]},
        {"id": 2, "source": vector},
        {"id": 3, "source": insert},
    ],
    final_sequence="...",
)
```

## How It Works

SnapGene files use a **TLV (Type-Length-Value)** binary format after a 19-byte header. Each block has a 1-byte type ID and a 4-byte length, with encoding varying by type: UTF-8 for sequences, XML for annotations, 2-bit GATC encoding for compressed DNA, LZMA for history, and ZTR for chromatogram traces.
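
The layout described above can be walked with a few lines of standard-library Python. This is a minimal sketch, not the library's own reader: `iter_blocks` is a hypothetical helper, the input is synthetic, and the 4-byte length is assumed to be big-endian, which the text above does not state.

```python
import struct
from typing import Iterator, Tuple

HEADER_SIZE = 19  # header length as stated above


def iter_blocks(data: bytes, offset: int = HEADER_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (type_id, payload) pairs from the TLV region of a file."""
    while offset < len(data):
        if offset + 5 > len(data):
            raise ValueError("truncated block header")
        # Assumption: 1-byte type ID followed by a big-endian 4-byte length.
        type_id, length = struct.unpack_from(">BI", data, offset)
        offset += 5
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated block payload")
        offset += length
        yield type_id, payload


# Synthetic input: a zeroed 19-byte header followed by two small blocks.
sample = (
    bytes(HEADER_SIZE)
    + struct.pack(">BI", 0, 4) + b"ATGC"
    + struct.pack(">BI", 6, 2) + b"<>"
)
blocks = list(iter_blocks(sample))
print(blocks)  # [(0, b'ATGC'), (6, b'<>')]
```

The payloads are returned undecoded; interpreting them (UTF-8, XML, LZMA, and so on) depends on the block type.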

`SgffReader` parses blocks via the `SCHEME` dispatch table and stores them in `SgffObject.blocks` (a `Dict[int, List]`). Typed model properties (`sgff.sequence`, `sgff.features`, `sgff.history`, etc.) are lazily loaded from the blocks dict and sync changes back automatically. `SgffWriter` serializes blocks back to binary in sorted order.
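
The dispatch-table idea can be illustrated with a small standalone sketch. Everything here is hypothetical: `SCHEME_SKETCH`, `decode_text`, `decode_raw`, and `parse_blocks` are illustrative names, not the library's actual `SCHEME` or its decoders, and XML is kept as plain text for brevity.

```python
from typing import Callable, Dict, Iterable, List, Tuple


# Hypothetical decoders; the library's real SCHEME entries may differ.
def decode_text(payload: bytes) -> str:
    return payload.decode("utf-8")


def decode_raw(payload: bytes) -> bytes:
    return payload


# Maps a block type ID to the function that decodes its payload.
SCHEME_SKETCH: Dict[int, Callable[[bytes], object]] = {
    0: decode_text,   # stands in for a UTF-8 block
    10: decode_text,  # stands in for an XML block, kept as text here
}


def parse_blocks(raw_blocks: Iterable[Tuple[int, bytes]]) -> Dict[int, List[object]]:
    """Group decoded payloads by type ID, one list per type."""
    blocks: Dict[int, List[object]] = {}
    for type_id, payload in raw_blocks:
        # Unknown types fall back to raw bytes instead of failing.
        decoder = SCHEME_SKETCH.get(type_id, decode_raw)
        blocks.setdefault(type_id, []).append(decoder(payload))
    return blocks


parsed = parse_blocks([(10, b"<Features/>"), (0, b"ATGC"), (99, b"\x01")])
for type_id in sorted(parsed):  # iterate in sorted order, as the writer is described to do
    print(type_id, parsed[type_id])
```

Keeping a list per type ID lets a file carry several blocks of the same type, which matches the `Dict[int, List]` shape described above.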

### Supported Block Types

| ID | Block Type | Format | Model |
|----|------------|--------|-------|
| 0 | DNA Sequence | UTF-8 | SgffSequence |
| 1 | Compressed DNA | 2-bit GATC | SgffSequence |
| 5 | Primers | XML | SgffPrimerList |
| 6 | Notes | XML | SgffNotes |
| 7 | History Tree | LZMA + XML | SgffHistory |
| 8 | Sequence Properties | XML | SgffProperties |
| 10 | Features | XML | SgffFeatureList |
| 11 | History Nodes | Binary + TLV | SgffHistory |
| 14 | Custom Enzyme Sets | XML | |
| 16 | Trace Container | Binary + TLV | SgffTraceList |
| 17 | Alignable Sequences | XML | SgffAlignmentList |
| 18 | ZTR Trace (in 16) | ZTR | SgffTrace |
| 20 | Strand Colors | XML | |
| 21 | Protein Sequence | UTF-8 | SgffSequence |
| 28 | Enzyme Visibilities | XML | |
| 29 | History Modifier | LZMA + XML | SgffHistory |
| 30 | History Content | LZMA + TLV | SgffHistory |
| 32 | RNA Sequence | UTF-8 | SgffSequence |
| 34 | RNA Structure | LZMA + JSON | |

Blocks 2, 3, and 13 are auto-generated by SnapGene and are skipped.

## CLI

```bash
sff parse plasmid.dna           # Export to JSON
sff info plasmid.dna -v         # Show detailed file info
sff tree plasmid.dna            # Display history timeline
sff check plasmid.dna -l        # List block types
sff filter plasmid.dna -k 0,10 -o minimal.dna
```

All read commands accept stdin (`cat file.dna | sff info`).

## Development

```bash
git clone https://github.com/merv1n34k/sgffp.git
cd sgffp
uv sync --dev

# Run tests
uv run pytest tests/ -v

# Docs (VitePress)
cd docs && bun install && bun run docs:dev
```

## Documentation

Full guides, API reference, CLI reference, and binary format specification:

**[merv1n34k.github.io/sgffp](https://merv1n34k.github.io/sgffp/)**

## Acknowledgments

This project would not have been possible without previous work by:

- **Damien Goutte-Gattat**, for his PDF on the SGFF structure: https://incenp.org/dvlpt/docs/binary-sequence-formats/binary-sequence-formats.pdf
- **Isaac Luo**, for his SnapGene reader: https://github.com/IsaacLuo/SnapGeneFileReader
- **Kale Kundert**, for autosnapgene, a SnapGene automation tool: https://github.com/kalekundert/autosnapgene

## License

Distributed under the MIT License; see `LICENSE` for details.
