Metadata-Version: 2.4
Name: gff4tools
Version: 0.4.0
Summary: Command-line toolkit for GFF4 graph annotation files
Author: Qingguo Zeng
License-Expression: MIT
Project-URL: Homepage, https://github.com/Qgzeng-Bio/Granno
Project-URL: Repository, https://github.com/Qgzeng-Bio/Granno
Project-URL: Issues, https://github.com/Qgzeng-Bio/Granno/issues
Project-URL: Documentation, https://github.com/Qgzeng-Bio/Granno#readme
Keywords: bioinformatics,pangenome,GFF4,GFA,annotation,graph
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: build>=1; extra == "dev"
Dynamic: license-file

# GFF4tools

GFF4tools is a command-line toolkit for validating, indexing, querying,
inspecting, extracting, and exporting GFF4 graph annotation files.

GFF4 is the graph-coordinate annotation exchange format. GFF4tools is the user
toolkit around that format. The Python import module remains `gff4`, while the
primary CLI binary is `gff4tools`; `gff4` is retained as a compatibility alias.

Status: `0.4.0` is the current stable release target. The v0.3 real-data gate,
the v0.4 compressed benchmark gate, and the TestPyPI release-candidate install
smoke test all pass.

## Current Formats

GFF4tools currently supports two output shapes:

- `--format single`: the v0.2 canonical single-file `.gff4` table, recommended
  for user-facing demos and handoff. This is the default. Paths ending in
  `.gff4.gz` read and write the same canonical table layout with gzip
  compression.
- `--format package`: the v0.1 directory debug package, retained for internal
  inspection and existing smoke workflows.

GFF4tools accepts GFA1 graphs with embedded paths plus GFF3 annotations on those
paths, then writes graph-projected feature records that can be validated,
queried, indexed, inspected, and exported back to path-specific GFF3.
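
As a rough illustration of what "graph-projected" means here, the sketch below
maps a path interval onto per-node spans. It is not the GFF4tools
implementation: the function names, the node lengths, the 0-based half-open
coordinates, and the restriction to forward-oriented path steps are all
assumptions made for the example.

```python
def path_interval_to_walk(steps, node_lengths, start0, end0):
    """Project the path interval [start0, end0) onto node-local spans.

    steps: node IDs in path order (assumed forward-oriented).
    node_lengths: mapping of node ID to node length in bases.
    Returns a list of (node_id, node_start, node_end) tuples.
    """
    if not 0 <= start0 < end0:
        raise ValueError("expected 0 <= start0 < end0")
    walk = []
    offset = 0  # path coordinate where the current node begins
    for node_id in steps:
        node_begin = offset
        node_stop = offset + node_lengths[node_id]
        lo = max(start0, node_begin)
        hi = min(end0, node_stop)
        if lo < hi:  # the interval touches this node
            walk.append((node_id, lo - node_begin, hi - node_begin))
        offset = node_stop
    if end0 > offset:
        raise ValueError("interval extends past the end of the path")
    return walk


def format_walk(walk):
    """Render spans in the '>node:start-end' style used by --walk."""
    return "".join(f">{node}:{start}-{end}" for node, start, end in walk)


# Hypothetical path: n1 (50 bp), n2 (100 bp), n3 (40 bp).
steps = ["n1", "n2", "n3"]
node_lengths = {"n1": 50, "n2": 100, "n3": 40}
walk = path_interval_to_walk(steps, node_lengths, 135, 175)
print(format_walk(walk))
```

With these made-up node lengths, the path interval 135-175 prints
`>n2:85-100>n3:0-25`, which has the same shape as the walk passed to `--walk`
in the quick start.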

Supported:

- parse GFA1 `S`, `L`, and `P` records
- build embedded path step indexes
- convert path intervals to graph walks
- import GFF3 features onto graph coordinates
- validate graph walks and feature hierarchy
- query by node, edge, path interval, or gene
- query by feature type or sample ID
- summarize sample-level annotation-footprint presence/absence by graph node,
  node interval, or graph walk
- calculate footprint coverage using gene-span, exon, or CDS bases and emit
  long TSV or sample-by-feature matrices
- export path-specific GFF3 for round-trip checks
- inspect `.gfa`/`.gfa.gz` files with GFA1.1 `W` lines
- subset a W-line pangenome graph into a small P-line GFA region
- build and use a sidecar SQLite `.gfi.sqlite` query index
- reject stale indexes using source size, modification time, and SHA-256
- read, write, validate, query, index, and export compressed `.gff4.gz`
  single-file tables
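
The stale-index check listed above can be pictured with a short sketch. This
is only an assumption about how size, modification time, and SHA-256 might be
combined (any mismatch counts as stale); the function names and the recorded
dictionary layout are hypothetical and do not describe the actual
`.gfi.sqlite` schema.

```python
import hashlib
import os
import tempfile


def source_fingerprint(path):
    """Collect size, mtime, and SHA-256 for a source file."""
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "size": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
        "sha256": digest.hexdigest(),
    }


def index_is_stale(path, recorded):
    """Assumed policy: stale if any recorded value no longer matches."""
    return source_fingerprint(path) != recorded


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "demo.gff4")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write("##gff4-version 0.2\n")
    recorded = source_fingerprint(source)  # stored when the index is built
    fresh = index_is_stale(source, recorded)

    with open(source, "a", encoding="utf-8") as handle:
        handle.write("##format gff4-feature-table\n")
    changed = index_is_stale(source, recorded)

print(fresh, changed)
```

Checking size and modification time first and hashing only when they differ
would be cheaper; the sketch compares all three for simplicity.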

Not currently in scope: de novo gene prediction, graph-aware alignment, snarls,
orthogroup clustering, gene function inference, frameshift detection, reference
synteny block inference, production database storage, web visualization, or
polyploid-specific modeling.

## What GFF4tools Does Not Infer

GFF4tools reports graph-coordinate annotation records and
annotation-footprint overlap over sample paths. It does not infer
orthogroups, pangene membership, de novo genes, gene function, expression,
frameshift status, or biological proof of gene presence. The `footprint-pav`
command and its `pav` compatibility alias should be interpreted as
annotation-footprint overlap summaries, not orthogroup-level gene PAV.
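
To make "annotation-footprint overlap" concrete, here is a minimal sketch of
the idea: count how many bases of a query node interval are covered by a
sample's annotation spans, then apply a minimum-overlap threshold. The
function names, the `present`/`absent` labels, the span tuples, and the
0-based half-open coordinates are assumptions for illustration, not the
output format of `footprint-pav`.

```python
def covered_bp(spans, start, end):
    """Count bases of [start, end) covered by at least one span."""
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in spans if s < end and e > start
    )
    total, cursor = 0, start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted by an earlier span
        if e > s:
            total += e - s
            cursor = e
    return total


def footprint_status(spans_by_sample, node_id, start, end, min_overlap_bp=1):
    """Summarize per-sample overlap with one node interval."""
    summary = {}
    for sample, spans in spans_by_sample.items():
        on_node = [(s, e) for n, s, e in spans if n == node_id]
        bp = covered_bp(on_node, start, end)
        summary[sample] = ("present" if bp >= min_overlap_bp else "absent", bp)
    return summary


# Hypothetical annotation spans as (node_id, start, end) per sample.
spans_by_sample = {
    "SampleA": [("n2", 40, 60), ("n2", 50, 70)],
    "SampleB": [("n2", 90, 100)],
    "SampleC": [("n3", 0, 25)],
}
summary = footprint_status(spans_by_sample, "n2", 45, 80)
for sample, (status, bp) in summary.items():
    print(sample, status, bp)
```

A `present` result in this sketch only says that annotated bases overlap the
queried interval on that sample's path, which is why it should not be read as
orthogroup-level gene presence.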

Later releases will add copy/allele modeling, anchors, snarls, PAV/CNV matrices,
graph-SV/GWAS annotation, and production storage.

## Quick Start From Source

Use this path from a source checkout to understand and verify the project in
about 10 minutes. The commands below reference files under `examples/`, which
are shipped in the source distribution and repository but are not installed as
package data in the wheel.

Install from a checkout:

```bash
python -m pip install -e .
gff4tools --help
```

1. Build the v0.2 multi-sample PAV demo as a single `.gff4` file:

   ```bash
   gff4tools import-gff3 \
     --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
     --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
     --path-map examples/pav_multi_sample/path_map.tsv \
     --out /tmp/pav_multi_sample.gff4
   ```

2. Inspect, validate, index, and query the generated file:

   ```bash
   gff4tools stats /tmp/pav_multi_sample.gff4

   gff4tools view /tmp/pav_multi_sample.gff4 --section features --head 5

   gff4tools paths /tmp/pav_multi_sample.gff4

   gff4tools validate /tmp/pav_multi_sample.gff4 \
     --gfa examples/pav_multi_sample/pav_multi_sample.gfa

   gff4tools index /tmp/pav_multi_sample.gff4

   gff4tools query /tmp/pav_multi_sample.gff4 --feature-type gene --format tsv
   ```

3. Summarize sample-level annotation-footprint overlap directly on graph
   coordinates:

   ```bash
   gff4tools footprint-pav /tmp/pav_multi_sample.gff4 --node n2

   gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
     --node-interval n2:45-80

   gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
     --walk '>n2:85-100>n3:0-25'

   gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
     --node n2 \
     --coverage-basis CDS \
     --min-overlap-bp 1 \
     --matrix status
   ```

4. Use the same workflow with compressed v0.4 exchange files:

   ```bash
   gff4tools import-gff3 \
     --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
     --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
     --path-map examples/pav_multi_sample/path_map.tsv \
     --out /tmp/pav_multi_sample.gff4.gz

   gff4tools validate /tmp/pav_multi_sample.gff4.gz \
     --gfa examples/pav_multi_sample/pav_multi_sample.gfa

   gff4tools index /tmp/pav_multi_sample.gff4.gz

   gff4tools query /tmp/pav_multi_sample.gff4.gz \
     --feature-type gene \
     --format tsv

   gff4tools export-gff3 /tmp/pav_multi_sample.gff4.gz \
     --path 'SampleA#1#chr1' \
     --out /tmp/pav_multi_sample.export.gff3
   ```

5. Run the test suite:

   ```bash
   python -m pytest
   ```

6. Run all demo workflows:

   ```bash
   bash scripts/smoke_demos.sh
   ```

   This writes demo outputs to `/tmp/gff4_smoke/` and runs:

   - toy import, validate, and edge query
   - real-small import, validate, and reverse-edge query
   - Arabidopsis public package and single-file import, validate, query, and
     GFF3 export
   - multi-sample import, validate, node footprint-PAV, node-interval
     footprint-PAV, and walk footprint-PAV

7. Smoke-check an installed wheel or PyPI package against your own GFF4 file:

   ```bash
   gff4tools --version
   gff4tools --help
   gff4 --help
   gff4tools stats path/to/file.gff4
   gff4tools query path/to/file.gff4 --feature-type gene --format tsv
   ```

   To run the bundled demos after a wheel install, use a source checkout or
   unpack the source distribution so the `examples/` directory is available.

8. Read the toy tutorial for the full command-by-command walkthrough:

   - [v0.1 MVP tutorial](docs/tutorial_mvp.md)

9. For v0.3 real pangenome graph work, inspect and subset a W-line graph:

   ```bash
   gff4tools graph-inspect \
     --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
     --out /tmp/quinoa_graph_stats.json

   gff4tools graph-subset \
     --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
     --reference-sample CquZ \
     --seqid Cq1A \
     --start0 0 \
     --end0 250000 \
     --out-gfa /tmp/quinoa_region.gfa \
     --out-path-map /tmp/quinoa_path_map.tsv \
     --out-stats /tmp/quinoa_region.stats.json
   ```

   Or run the full local quinoa real-data gate:

   ```bash
   bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh
   ```

   Expected final signal:

   ```text
   QUINOA_CQU_V03_GATE: PASS
   ```

   Run the release-facing multi-region gate:

   ```bash
   bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh --multi-region
   ```

   Expected final signal:

   ```text
   QUINOA_CQU_V03_MULTI_REGION_GATE: PASS
   ```

   Run the v0.4 compressed single-file benchmark gate:

   ```bash
   bash scripts/quinoa_realdata/run_quinoa_cqu_v04_benchmark.sh
   ```

   Expected final signal:

   ```text
   QUINOA_CQU_V04_BENCHMARK_GATE: PASS
   ```

   The v0.4 benchmark table records:

   | Field | Meaning |
   | --- | --- |
   | `plain_gff4_bytes` | Size of the uncompressed canonical `.gff4` file. |
   | `gzip_gff4_bytes` | Size of the compressed `.gff4.gz` file. |
   | `compression_ratio` | `gzip_gff4_bytes / plain_gff4_bytes`; a lower value means stronger compression. |
   | `wall_seconds` | Elapsed time for the benchmarked command. |
   | `user_seconds` / `sys_seconds` | CPU time used by the benchmarked command. |
   | `peak_rss_kib` | Peak resident memory in KiB, normalized by the wrapper. |
   | query/export parity | The gate asserts indexed vs scan query parity and plain vs gzip export parity. |

10. Inspect the single-file layout:

   ```text
   ##gff4-version 0.2
   ##format gff4-feature-table
   ##section manifest
   #key	value
   gff4_version	0.2
   ##section sources
   #source_id	source_kind	source_role	...
   ##section features
   #feature_uid	annotation_set_id	source_feature_id	...
   ##section locations
   #location_id	feature_uid	projection_set_id	...
   ##section location_spans
   #location_id	span_rank	path_id	step_rank	node_id	...
   ##section nodes
   #node_id	node_length
   ##section edges
   #edge_id	from_node_id	from_orient	to_node_id	to_orient	...
   ##section paths
   #path_id	sample_id	haplotype_id	contig_id	path_length	path_role
   ##section path_steps
   #path_id	step_rank	node_id	orient	...
   ```
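
The layout above can be read with very little code. The following is a
minimal sketch based only on the visible structure (`##section` lines, a
`#`-prefixed tab-separated header, then tab-separated rows). The function
name, the returned dictionary shape, and the sample `nodes` row are
assumptions, and the sketch ignores anything the full spec may define beyond
this outline.

```python
def read_sections(lines):
    """Group a single-file GFF4 table into {section: {columns, rows}}."""
    sections = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##section "):
            current = {"columns": [], "rows": []}
            sections[line.split(None, 1)[1]] = current
        elif line.startswith("##"):
            continue  # file-level line such as ##gff4-version
        elif current is None:
            continue  # ignore content before the first section
        elif line.startswith("#"):
            current["columns"] = line[1:].split("\t")
        else:
            current["rows"].append(line.split("\t"))
    return sections


# Tiny hand-written example; the n1 row is made up for illustration.
text = (
    "##gff4-version 0.2\n"
    "##format gff4-feature-table\n"
    "##section manifest\n"
    "#key\tvalue\n"
    "gff4_version\t0.2\n"
    "##section nodes\n"
    "#node_id\tnode_length\n"
    "n1\t50\n"
)
sections = read_sections(text.splitlines())
print(sorted(sections))
print(sections["nodes"]["columns"], sections["nodes"]["rows"])
```

For a `.gff4.gz` file, the same function could be fed from
`gzip.open(path, "rt", encoding="utf-8")` instead of a plain `open`.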

## Demo Data

- `examples/toy/`: hand-checkable graph for learning the v0.1 coordinate model.
- `examples/real_small/`: semi-real demo with PanSN path names, GFF3 seqid
  aliases, UTR records, and a reverse-oriented path step.
- `examples/arabidopsis_public/`: public reference-backed Arabidopsis
  TAIR10/Araport11 AT1G01010 region projected onto a small single-path GFA.
- `examples/pav_multi_sample/`: v0.2 landing demo with multiple samples, long
  nodes containing multiple genes, a cross-node gene, and sample-level PAV.
- `examples/quinoa_cqu_real/`: v0.3 real pangenome graph gate for the local
  quinoa Minigraph-Cactus W-line graph.

Each demo has its own README with import, validate, query, and export commands.

## Documentation

- [MVP scope](docs/MVP_scope.md): what v0.1 does and does not attempt.
- [v0.1 package spec](docs/GFF4_v0.1_spec.md): debug TSV package layout,
  coordinate rules, query surface, and `path_map.tsv`.
- [v0.2 single-file spec](docs/GFF4_single_file_spec.md): canonical
  single-file `.gff4` graph annotation table layout.
- [Roadmap](docs/roadmap.md): five major milestones from v0.2 format hardening
  to indexed queries and release-ready ecosystem integration.
- [v0.2 release notes](docs/release_notes_v0.2.md): stable MVP status,
  feature summary, real-data gate, scope, and known limitations.
- [v0.2 release checklist](docs/release_checklist_v0.2.md): standard and
  real-data stable-gate checks.
- [v0.3 spec](docs/GFF4_v0.3_spec.md): W-line graph input and sidecar index
  profile.
- [v0.3 release notes](docs/release_notes_v0.3.md): real pangenome graph
  readiness release line.
- [v0.3 release checklist](docs/release_checklist_v0.3.md): quinoa real-data
  gate checks.
- [v0.4 release checklist](docs/release_checklist_v0.4.md): compressed
  single-file benchmark checks.
- [v0.4.0 release notes](docs/release_notes_gff4tools_v0.4.0.md):
  stable compressed I/O and benchmark hardening release scope.
- [v0.4.0rc1 release notes](docs/release_notes_gff4tools_v0.4.0rc1.md):
  compressed I/O, benchmark hardening, semantic guardrails, and release scope.
- [v0.5 store/indexed PAV RFC](docs/rfc_v0.5_store_and_indexed_pav.md):
  proposed internal store boundary and index tables for future indexed
  `footprint-pav`.
- [GFF4tools CLI guide](docs/gff4tools_cli.md): productized command-line usage
  for inspection, query, indexing, export, and validation.
- [MVP tutorial](docs/tutorial_mvp.md): executable toy workflow.
- [Release checklist](docs/release_checklist.md): verification steps before a
  v0.1 release-facing commit or tag.
- [v0.1 release notes](docs/release_notes_v0.1.md): release highlights,
  verification, demo coverage, and known limits.

## Development

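Install the `dev` extra from a checkout, then run the test suite:

```bash
python -m pip install -e ".[dev]"
python -m pytest
```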

Run the demo smoke workflow before release-facing changes:

```bash
bash scripts/smoke_demos.sh
```

The initial development target is the hand-checkable toy graph under `examples/toy/`.
