Advanced Usage

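This section describes gokit's commands and options: enrichment, batch mode, validation, download, plotting, reporting, semantic similarity, input file formats, and command aliases.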


Enrichment

gokit performs Gene Ontology overrepresentation analysis (ORA) using Fisher’s exact test. To run a single-study enrichment analysis, use the following command:

gokit enrich \
  --study study.txt \
  --population population.txt \
  --assoc assoc.txt \
  --out results/goea

The following defaults reduce the number of flags you need to pass:

  • --obo defaults to ./go-basic.obo

  • --assoc-format defaults to auto

  • --test-direction defaults to both
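To illustrate the statistic behind the enrich command, here is a minimal Python sketch of a one-sided Fisher's exact test for over-representation, computed from the hypergeometric distribution. It is not gokit's implementation; the function names and the toy counts are assumptions made for illustration.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k annotated genes when drawing n genes
    from a population of N genes, K of which carry the GO term."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def fisher_over(k, N, K, n):
    """One-sided p-value for over-representation: P(X >= k)."""
    upper = min(K, n)  # cannot observe more hits than K or n
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


# Toy counts (hypothetical): 4 population genes, 2 annotated to the term,
# 2 study genes, both annotated.
p = fisher_over(k=2, N=4, K=2, n=2)
print(round(p, 4))  # 0.1667
```

A two-sided test, as implied by a test direction of both, would also account for under-representation; that case is omitted here to keep the sketch short.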


Batch Mode

gokit supports batch enrichment of multiple study sets, optionally with cross-study semantic similarity comparisons (--compare-semantic).

gokit enrich \
  --studies studies.tsv \
  --population population.txt \
  --assoc assoc.txt \
  --out results_batch \
  --out-formats tsv,jsonl \
  --compare-semantic \
  --semantic-metric wang \
  --semantic-top-k 5

Each line of studies.tsv takes one of two forms:

  • study_name<TAB>/path/to/study.txt

  • /path/to/study.txt (name inferred from filename)
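The two manifest forms above can be read with a few lines of Python. This is a sketch only: parse_manifest_line is a hypothetical helper, and using the file stem (the filename without its extension) as the inferred name is an assumption, since the exact inference rule is not stated here.

```python
from pathlib import Path


def parse_manifest_line(line):
    """Return (study_name, path) for one line of a studies manifest."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2:
        # Form 1: study_name<TAB>/path/to/study.txt
        name, path = parts
    else:
        # Form 2: path only; assume the name is the filename stem
        path = parts[0]
        name = Path(path).stem
    return name, path


print(parse_manifest_line("study_a\t/data/a.txt"))
print(parse_manifest_line("/data/liver.txt"))
```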


Validation

Validate your input files before running enrichment. The validate command checks that the study, population, and association files are properly formatted and consistent with one another.

gokit validate \
  --study study.txt \
  --population population.txt \
  --assoc assoc.txt
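Which consistency checks validate performs is not listed in this section. As a rough illustration of what consistency between the three inputs could mean, the Python sketch below reports study genes missing from the population and population genes with no annotations. Both checks and the helper name check_inputs are assumptions, not gokit's documented behavior.

```python
def check_inputs(study, population, assoc):
    """Report simple inconsistencies between the three inputs.

    study, population: sets of gene IDs
    assoc: dict mapping gene ID -> set of GO term IDs
    """
    return {
        # Study genes should normally be part of the background set
        "study_not_in_population": sorted(study - population),
        # Background genes without any GO annotation
        "population_without_annotations": sorted(population - set(assoc)),
    }


report = check_inputs(
    study={"geneA", "geneX"},
    population={"geneA", "geneB", "geneC"},
    assoc={"geneA": {"GO:0008150"}, "geneB": {"GO:0008150"}},
)
print(report)
```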

Download

gokit can download the Gene Ontology files needed for enrichment analysis. By default, this downloads go-basic.obo and goslim_generic.obo into the current directory.

gokit download

This is equivalent to:

wget http://current.geneontology.org/ontology/go-basic.obo
wget http://current.geneontology.org/ontology/subsets/goslim_generic.obo

Plotting

gokit supports several plot types for visualizing enrichment results.

Term bar plot:

gokit plot \
  --input results_batch/all_studies.tsv \
  --study-id study_a \
  --kind term-bar \
  --direction both \
  --top-n 20 \
  --out figures/study_a_terms \
  --format png

Direction summary plot:

gokit plot \
  --input results_batch/all_studies.tsv \
  --study-id study_a \
  --kind direction-summary \
  --alpha 0.05 \
  --out figures/study_a_direction_summary.png

Semantic network plot:

gokit plot \
  --input results_batch/semantic_similarity.tsv \
  --kind semantic-network \
  --min-similarity 0.25 \
  --max-edges 40 \
  --out figures/semantic_network.png
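One plausible reading of --min-similarity and --max-edges is that edges below the threshold are dropped and the strongest remaining edges are kept up to the cap. The Python sketch below shows that reading; the edge representation and the helper select_edges are hypothetical, and gokit may order or break ties differently.

```python
def select_edges(pairs, min_similarity=0.25, max_edges=40):
    """Filter (study_a, study_b, similarity) tuples for a network plot."""
    # Drop weak edges first
    kept = [p for p in pairs if p[2] >= min_similarity]
    # Strongest edges first, then cap the count
    kept.sort(key=lambda p: p[2], reverse=True)
    return kept[:max_edges]


pairs = [("a", "b", 0.9), ("a", "c", 0.1), ("b", "c", 0.4)]
print(select_edges(pairs))
```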

Emit plots automatically from enrich:

gokit enrich \
  --studies studies.tsv \
  --population population.txt \
  --assoc assoc.txt \
  --out results_batch \
  --compare-semantic \
  --emit-plots term-bar,direction-summary,semantic-network \
  --plot-format png

Report

Generate a consolidated Markdown report from an enrichment run.

gokit report --run results/goea

Semantic Similarity

gokit computes pairwise semantic similarity between study sets using several established metrics.

Available semantic metrics (--semantic-metric):

  • jaccard: Jaccard index (raw or ancestor-expanded)

  • resnik: Resnik semantic similarity (information content of MICA)

  • lin: Lin semantic similarity

  • wang: Wang semantic similarity

Additional semantic options:

  • --semantic-top-k: number of top terms to use per study

  • --semantic-namespace: restrict to a specific GO namespace (all, BP, MF, CC)

  • --semantic-min-padjsig: minimum adjusted p-value threshold for term inclusion
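As an illustration of the jaccard metric in its raw and ancestor-expanded forms, the Python sketch below compares two term sets directly and after adding every ancestor from a toy child-to-parents map. The term labels, the map, and the helper names are hypothetical; how gokit builds the term sets for each study is not shown here.

```python
def jaccard(a, b):
    """Jaccard index: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(terms, parents):
    """Return the terms plus all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


# Toy ontology: two sibling terms under one root
parents = {"child1": ["root"], "child2": ["root"]}

raw = jaccard({"child1"}, {"child2"})
expanded = jaccard(expand({"child1"}, parents), expand({"child2"}, parents))
print(raw, expanded)
```

In this toy case the raw sets share nothing, while the expanded sets share the root, so expansion lets related but non-identical terms contribute to the score.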


Input File Formats

In addition to the OBO ontology file, gokit requires three input files for enrichment analysis:

study.txt: one study gene ID per line.

geneA
geneB

population.txt: one background gene ID per line.

geneA
geneB
geneC
geneD

assoc.txt: one gene-to-GO mapping per line, with the gene ID followed by its GO terms. Separate multiple GO terms on one line with semicolons. Tabs are also accepted as the separator between the gene ID and the GO terms.

geneA GO:0008150;GO:0003674
geneB GO:0008150
geneC GO:0005575
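The layout shown above can be parsed with a short Python sketch. It assumes the gene ID and the GO term list are separated by whitespace (space or tab) and that terms are separated by semicolons; parse_id2gos is a hypothetical helper, not part of gokit.

```python
def parse_id2gos(lines):
    """Build a gene -> set of GO IDs mapping from id2gos-style lines."""
    assoc = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # skip blank lines and genes without terms
        gene, terms = parts
        go_ids = {t.strip() for t in terms.split(";") if t.strip()}
        assoc.setdefault(gene, set()).update(go_ids)
    return assoc


lines = [
    "geneA GO:0008150;GO:0003674",
    "geneB\tGO:0008150",
    "geneC GO:0005575",
]
print(parse_id2gos(lines))
```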

Supported association formats:

  • id2gos: simple gene-to-GO ID mapping

  • gaf: Gene Association File format (GAF 2.x)

  • gpad: Gene Product Association Data format (GPAD 1.x/2.x)

  • gene2go: NCBI gene2go format

  • auto: automatic format detection (default)


All Options

Enrichment options:

Option                    Usage and meaning

-h/--help                 Print help message.
--study                   Path to study gene set file.
--studies                 Path to batch studies manifest (TSV).
--population              Path to population/background gene set file.
--assoc                   Path to gene-to-GO association file.
--assoc-format            Association file format. Default: auto.
--obo                     Path to OBO ontology file. Default: ./go-basic.obo.
--out                     Output path prefix.
--out-formats             Comma-separated output formats (tsv, jsonl, parquet). Default: tsv.
--method                  Multiple testing correction method: fdr_bh, fdr_by, bonferroni, holm, none. Default: fdr_bh.
--test-direction          Direction of test: both, over, under. Default: both.
--id-type                 ID normalization mode: auto, str, int. Default: auto.
--compare-semantic        Enable cross-study semantic similarity comparison. Default: off.
--semantic-metric         Semantic similarity metric: jaccard, resnik, lin, wang. Default: jaccard.
--semantic-top-k          Number of top terms per study for semantic comparison.
--semantic-namespace      GO namespace filter for semantic comparison: all, BP, MF, CC.
--semantic-min-padjsig    Minimum adjusted p-value threshold for semantic term inclusion.
--emit-plots              Comma-separated plot types to auto-emit: term-bar, direction-summary, semantic-network.
--plot-format             Format for auto-emitted plots. Default: png.
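The --method option above lists fdr_bh as the default correction. Assuming that name denotes the Benjamini-Hochberg procedure (the naming matches statsmodels), the adjustment can be sketched in plain Python as follows. This is an illustration rather than gokit's code, and benjamini_hochberg is a hypothetical helper name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p-value is scaled by m / rank, and the running minimum ensures that a term with a smaller raw p-value never ends up with a larger adjusted value than a term ranked after it.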

Command aliases:

Alias          Equivalent command

gk_enrich      gokit enrich
gk_validate    gokit validate
gk_plot        gokit plot
gk_download    gokit download
gk_report      gokit report
gk_explain     gokit explain