Semantic Similarity

GO3 computes semantic similarity between individual GO terms, between sets of terms, and between genes, using several interchangeable methods.

Quick reference

Similarity methods (method argument)

Method                     Key         Family        Typical range
-------------------------  ----------  ------------  -------------
Resnik                     resnik      IC-based      >= 0
Lin                        lin         IC-based      [0, 1]
Jiang-Conrath similarity   jc          IC-based      >= 0
SimRel                     simrel      IC-based      [0, 1]
Information Coefficient    iccoef      IC-based      >= 0
GraphIC                    graphic     Hybrid        >= 0
Wang                       wang        Topological   [0, 1]
TopoICSim                  topoicsim   Hybrid        [0, 1]

Term-level APIs

semantic_similarity(id1, id2, method, counter)

  • Computes one score for one term pair.

  • Raises ValueError if method is unknown.

batch_similarity(list1, list2, method, counter)

  • Computes one score per aligned pair.

  • Requires len(list1) == len(list2).

  • Raises ValueError if list sizes differ or method is unknown.
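The batch contract described above (one score per aligned pair, ValueError on a length mismatch or an unknown method key) can be mirrored in plain Python. This is a minimal sketch and not GO3's implementation: toy_batch_similarity, TOY_METHODS, and the exact-match scoring function are hypothetical stand-ins chosen only to make the example self-contained.

```python
from typing import Callable, Dict, List, Sequence


def _toy_exact_match(id1: str, id2: str) -> float:
    """Hypothetical stand-in score: 1.0 for identical IDs, otherwise 0.0."""
    return 1.0 if id1 == id2 else 0.0


# Hypothetical method registry; GO3's real keys are listed in the quick reference.
TOY_METHODS: Dict[str, Callable[[str, str], float]] = {"exact": _toy_exact_match}


def toy_batch_similarity(
    list1: Sequence[str], list2: Sequence[str], method: str
) -> List[float]:
    """Score aligned pairs (list1[i], list2[i]) with the selected method."""
    if method not in TOY_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if len(list1) != len(list2):
        raise ValueError("list1 and list2 must have the same length")
    score = TOY_METHODS[method]
    return [score(a, b) for a, b in zip(list1, list2)]


scores = toy_batch_similarity(
    ["GO:0006397", "GO:0008150"],
    ["GO:0006397", "GO:0003674"],
    "exact",
)
print(scores)
```

The result has one entry per input pair, in input order, which is what makes the batch form convenient for scoring precomputed pair lists.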

Set-level API

termset_similarity(terms1, terms2, term_similarity="lin", groupwise="bma", counter=...)

Groupwise strategies:

  • bma

  • max

  • avg

  • hausdorff

  • simgic

Notes:

  • simgic is set-based: it scores the two term sets as a whole rather than aggregating pairwise term scores, so term_similarity plays a different role than it does for the other strategies.

  • For empty sets, GO3 returns 0.0.
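To make the pairwise strategies concrete, the sketch below applies max, avg, and bma to a small matrix of precomputed term-to-term scores. It assumes the common best-match-average definition (sum of row maxima plus column maxima, divided by the number of rows plus columns); other variants exist, and GO3's exact formula may differ. The function names and matrix values are hypothetical, and hausdorff and simgic are not covered.

```python
from typing import List

Matrix = List[List[float]]


def groupwise_max(m: Matrix) -> float:
    """Highest pairwise score; 0.0 for an empty matrix."""
    if not m or not m[0]:
        return 0.0
    return max(max(row) for row in m)


def groupwise_avg(m: Matrix) -> float:
    """Mean of all pairwise scores; 0.0 for an empty matrix."""
    if not m or not m[0]:
        return 0.0
    return sum(sum(row) for row in m) / (len(m) * len(m[0]))


def groupwise_bma(m: Matrix) -> float:
    """Best-match average over rows and columns (one common variant)."""
    if not m or not m[0]:
        return 0.0
    row_best = [max(row) for row in m]
    col_best = [max(col) for col in zip(*m)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))


# Rows: terms of the first set; columns: terms of the second set (toy values).
sim = [
    [0.9, 0.2],
    [0.4, 0.6],
    [0.1, 0.3],
]
print(groupwise_max(sim), groupwise_avg(sim), groupwise_bma(sim))
```

In this toy case max is driven by a single strong pair, avg is pulled down by unrelated pairs, and bma sits in between.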

Gene-level APIs

compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)

  • Ontology must be one of BP, MF, CC.

  • Raises ValueError if either gene is missing from loaded annotations.

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

  • Fast path for large pair lists.

  • Missing/empty per-gene term mappings yield 0.0 for those pairs.
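The two gene-level entry points differ in how they treat genes without usable annotations: the single-pair call raises, while the batch call yields 0.0. The sketch below imitates that difference only. It assumes a hypothetical in-memory annotation table and uses a plain Jaccard overlap as a stand-in score, which is not one of GO3's methods.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical annotations: gene symbol -> set of placeholder term IDs.
TOY_ANNOTATIONS: Dict[str, Set[str]] = {
    "GENE_A": {"T1", "T2"},
    "GENE_B": {"T2", "T3"},
    "GENE_EMPTY": set(),
}


def _toy_score(terms1: Set[str], terms2: Set[str]) -> float:
    """Stand-in score (Jaccard overlap); 0.0 if either set is empty."""
    if not terms1 or not terms2:
        return 0.0
    return len(terms1 & terms2) / len(terms1 | terms2)


def toy_compare_genes(gene1: str, gene2: str) -> float:
    """Strict single-pair call: unknown genes raise ValueError."""
    for gene in (gene1, gene2):
        if gene not in TOY_ANNOTATIONS:
            raise ValueError(f"gene not in loaded annotations: {gene}")
    return _toy_score(TOY_ANNOTATIONS[gene1], TOY_ANNOTATIONS[gene2])


def toy_compare_gene_pairs_batch(pairs: List[Tuple[str, str]]) -> List[float]:
    """Lenient batch call: missing or empty term mappings score 0.0."""
    return [
        _toy_score(TOY_ANNOTATIONS.get(g1, set()), TOY_ANNOTATIONS.get(g2, set()))
        for g1, g2 in pairs
    ]


batch = toy_compare_gene_pairs_batch(
    [("GENE_A", "GENE_B"), ("GENE_A", "GENE_X"), ("GENE_A", "GENE_EMPTY")]
)
print(batch)
```

A lenient batch path keeps the output aligned with the input pair list, so downstream code can filter zero scores instead of handling exceptions per pair.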

Practical behavior and edge cases

  • Invalid or missing GO IDs in similarity calls generally return 0.0.

  • Terms from different namespaces produce 0.0.

  • For normalized methods (for example lin and wang), self-similarity is typically near 1.0.

Distance-oriented workflow

For clustering/embedding workflows, use:

  • gene_distance_matrix

  • tsne_genes

  • umap_genes

These functions convert similarity to distance using distance_transform rules (see Visualization (t-SNE / UMAP)).
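As a rough illustration of a similarity-to-distance conversion, the sketch below assumes a bounded similarity in [0, 1] and the simple rule d = 1 - sim. This is only one common choice: the actual distance_transform rules are described on the visualization page and are not reproduced here, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def similarity_to_distance(sim: float) -> float:
    """Assumed rule: clip to [0, 1], then return 1 - sim."""
    clipped = min(1.0, max(0.0, sim))
    return 1.0 - clipped


def toy_distance_matrix(sim_matrix: Matrix) -> Matrix:
    """Convert a square similarity matrix, forcing a zero diagonal."""
    n = len(sim_matrix)
    return [
        [0.0 if i == j else similarity_to_distance(sim_matrix[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Toy symmetric similarity matrix for three genes.
sim_matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
dist = toy_distance_matrix(sim_matrix)
print(dist)
```

Unbounded scores such as Resnik need a different rule (for example, normalizing first), which is why the transform is configurable rather than fixed.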

Mathematical definitions

Resnik

\[\mathrm{Sim}_{Resnik}(t_1, t_2) = IC(\mathrm{MICA}(t_1, t_2))\]
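The definition can be traced on a toy graph. The sketch below assumes IC(t) = -ln p(t), with p(t) taken from cumulative annotation counts, and treats MICA as the common ancestor with the highest IC. The graph, the counts, and the helper names are hypothetical, and GO3's logarithm base and counting scheme may differ.

```python
import math
from typing import Dict, List, Set

# Hypothetical DAG: term -> direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],
}
# Hypothetical cumulative annotation counts (term plus all descendants).
COUNTS: Dict[str, int] = {"root": 100, "a": 40, "b": 30, "c": 10, "d": 5}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term: str) -> float:
    """IC(t) = -ln(count(t) / count(root)), written as ln(total / count)."""
    return math.log(COUNTS["root"] / COUNTS[term])


def mica(t1: str, t2: str) -> str:
    """Most informative common ancestor: the shared ancestor with highest IC."""
    return max(ancestors(t1) & ancestors(t2), key=ic)


def resnik(t1: str, t2: str) -> float:
    return ic(mica(t1, t2))


print(mica("c", "d"), resnik("c", "d"))
```

Terms c and d share the ancestors a and root, and a is the more informative of the two, so the score equals IC(a). Terms that share only the root score 0.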

Lin

\[\mathrm{Sim}_{Lin}(t_1, t_2) = \frac{2\,IC(\mathrm{MICA}(t_1, t_2))}{IC(t_1)+IC(t_2)}\]
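A worked instance of the formula, taking the three IC values as given. The numbers are hypothetical, and returning 0.0 when both terms have zero IC is an assumption of this sketch rather than documented GO3 behavior.

```python
def lin_similarity(ic1: float, ic2: float, ic_mica: float) -> float:
    """Lin: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denom = ic1 + ic2
    if denom == 0.0:
        # Assumed convention for two zero-IC terms (for example, the root).
        return 0.0
    return 2.0 * ic_mica / denom


# IC(t1) = 3, IC(t2) = 5, IC(MICA) = 2  ->  2 * 2 / (3 + 5) = 0.5
print(lin_similarity(3.0, 5.0, 2.0))
# A term compared with itself: IC(MICA) = IC(t), so the score is 1.0.
print(lin_similarity(4.0, 4.0, 4.0))
```

Because IC(MICA) can never exceed the IC of either term, the ratio stays within [0, 1].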

Jiang-Conrath (distance-derived similarity)

\[d_{JC} = IC(t_1) + IC(t_2) - 2\,IC(\mathrm{MICA})\]
\[\mathrm{Sim}_{JC} = \frac{1}{1 + d_{JC}}\]
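The two steps (distance first, then the 1 / (1 + d) transform) can be checked with the same kind of toy IC values. The numbers and the function names are hypothetical.

```python
def jc_distance(ic1: float, ic2: float, ic_mica: float) -> float:
    """Jiang-Conrath distance: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic1 + ic2 - 2.0 * ic_mica


def jc_similarity(ic1: float, ic2: float, ic_mica: float) -> float:
    """Distance-derived similarity: 1 / (1 + d_JC)."""
    return 1.0 / (1.0 + jc_distance(ic1, ic2, ic_mica))


# IC(t1) = 3, IC(t2) = 5, IC(MICA) = 2  ->  d = 4, similarity = 1 / 5
print(jc_distance(3.0, 5.0, 2.0), jc_similarity(3.0, 5.0, 2.0))
# Identical terms: d = 0, so the similarity is 1.0.
print(jc_similarity(4.0, 4.0, 4.0))
```

With this transform a zero distance maps to 1.0, and larger distances approach 0 without reaching it.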

SimRel

\[\mathrm{Sim}_{Rel} = \left(\frac{2\,IC(\mathrm{MICA})}{IC(t_1)+IC(t_2)}\right)\left(1-e^{-IC(\mathrm{MICA})}\right)\]
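SimRel is the Lin ratio scaled by a factor that grows with the informativeness of the MICA. The sketch below evaluates both factors for hypothetical IC values. The zero-denominator guard is an assumption of the sketch.

```python
import math


def simrel(ic1: float, ic2: float, ic_mica: float) -> float:
    """SimRel: Lin ratio multiplied by (1 - exp(-IC(MICA)))."""
    denom = ic1 + ic2
    if denom == 0.0:
        # Assumed convention for two zero-IC terms.
        return 0.0
    lin_part = 2.0 * ic_mica / denom
    relevance = 1.0 - math.exp(-ic_mica)
    return lin_part * relevance


# IC(t1) = 3, IC(t2) = 5, IC(MICA) = 2: Lin part is 0.5, relevance is 1 - e^-2.
print(simrel(3.0, 5.0, 2.0))
```

Since the relevance factor is below 1 for any finite IC, SimRel never exceeds the corresponding Lin score and penalizes pairs whose only shared ancestor is very general.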

Wang

Wang similarity is computed from weighted ancestor contributions in the GO graph topology, so it does not depend on annotation-derived IC values the way the IC-based methods do.
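The sketch below follows the usual formulation of Wang's method: each term assigns S-values to its ancestors by taking the best product of edge weights along any path, and two terms are compared through the S-values of their shared ancestors. The edge weights (0.8 for is_a, 0.6 for part_of) are the values conventionally used with this method and are an assumption here, as are the toy graph and the helper names. GO3's weights and relation handling may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical DAG: term -> list of (parent, relation).
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "root": [],
    "a": [("root", "is_a")],
    "b": [("root", "is_a")],
    "c": [("a", "is_a")],
    "d": [("a", "part_of"), ("b", "is_a")],
}
# Assumed semantic contribution factors per relation type.
WEIGHTS: Dict[str, float] = {"is_a": 0.8, "part_of": 0.6}


def s_values(term: str) -> Dict[str, float]:
    """S-value of each ancestor: best weighted path product up from `term`."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent, relation in EDGES[node]:
            contribution = WEIGHTS[relation] * s[node]
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                stack.append(parent)  # re-propagate the improved value
    return s


def wang_similarity(t1: str, t2: str) -> float:
    s1, s2 = s_values(t1), s_values(t2)
    shared = set(s1) & set(s2)
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


print(wang_similarity("c", "d"))
print(wang_similarity("c", "c"))
```

No annotation counts appear anywhere in the computation, which is why this method can be used without an annotation corpus.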

TopoICSim

TopoICSim combines topology-aware paths and IC-derived weights to produce a bounded similarity.

API reference

batch_similarity(list1, list2, method, counter)

Compute pairwise semantic similarity in batch using a selected method.

Parameters:
  • list1 (list of str) – First list of GO term IDs.

  • list2 (list of str) – Second list of GO term IDs.

  • method (str) – Name of the similarity method.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If input lists differ in length or method is unknown.

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

Compute semantic similarity for a batch of gene pairs.

Parameters:
  • pairs (list of (str, str)) – List of gene pairs for which to compute semantic similarity.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If similarity or groupwise is unknown.

compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)

Compute semantic similarity between genes.

Parameters:
  • gene1 (str) – Gene symbol of the first gene.

  • gene2 (str) – Gene symbol of the second gene.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If similarity or groupwise is unknown, or if either gene is missing from the loaded annotations.

semantic_similarity(id1, id2, method, counter)

Compute semantic similarity between two GO terms using a selected method.

Parameters:
  • id1 (str) – First GO term ID.

  • id2 (str) – Second GO term ID.

  • method (str) – Name of the similarity method. Options: “resnik”, “lin”, “jc”, “simrel”, “iccoef”, “graphic”, “wang”, “topoicsim”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If the method is unknown.

term_ic(go_id, counter)

Compute the Information Content (IC) of a GO term.

Parameters:
  • go_id (str) – GO term identifier.

  • counter (TermCounter) – Precomputed term counter with IC values.

Returns:

The IC of the GO term.

Return type:

float

termset_similarity(terms1, terms2, term_similarity='lin', groupwise='bma', counter=None)

Compute semantic similarity between two sets of GO terms.

Parameters:
  • terms1 (list of str) – First list of GO term IDs.

  • terms2 (list of str) – Second list of GO term IDs.

  • term_similarity (str) – Name of the pairwise similarity method.

  • groupwise (str) – Groupwise combination method. Options: “bma”, “max”, “avg”, “hausdorff”, “simgic”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float