Gilda modules reference

API

gilda.api.annotate(text, sent_split_fun=None, organisms=None, namespaces=None, context_text=None)[source]

Annotate a given text with Gilda (i.e., do named entity recognition).

Parameters:
  • text (str) – The text to be annotated.

  • sent_split_fun (Callable, optional) – A function that splits the text into sentences. The default is nltk.tokenize.sent_tokenize(). The function should take a string as input and return an iterable of strings corresponding to the sentences in the input text.

  • organisms (list[str], optional) – A list of organism names to pass to the grounder. If not provided, human is used.

  • namespaces (list[str], optional) – A list of namespaces to pass to the grounder to restrict the matches to. By default, no restriction is applied.

  • context_text (Optional[str]) – A longer span of text that serves as additional context for the text being annotated for disambiguation purposes.

Returns:

A list of matches where each match is a tuple consisting of the matched text span, the list of ScoredMatches, and the start and end character offsets of the text span.

Return type:

list[tuple[str, list[ScoredMatch], int, int]]
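
The sent_split_fun contract described above (a string in, an iterable of sentence strings out) can be illustrated with a small regex-based splitter. This is only a sketch: the name split_sentences is hypothetical, and the naive regex is a stand-in for nltk.tokenize.sent_tokenize(), not part of Gilda.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    # Hypothetical helper, not part of Gilda: abbreviations such as
    # "e.g." will be split incorrectly by this simple rule.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences("MEK phosphorylates ERK. ERK is then activated.")
print(sentences)
```

A function with this shape could then be supplied through the sent_split_fun argument.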

gilda.api.get_grounder()[source]

Initialize and return the default Grounder instance.

Return type:

Grounder

Returns:

A Grounder instance whose attributes and methods can be used directly.

gilda.api.get_models()[source]

Return a list of entity texts for which disambiguation models exist.

Returns:

The list of entity texts for which a disambiguation model is available.

Return type:

list[str]

gilda.api.get_names(db, id, status=None, source=None)[source]

Return a list of entity texts corresponding to a given database ID.

Parameters:
  • db (str) – The database in which the ID is an entry, e.g., HGNC.

  • id (str) – The ID of an entry in the database.

  • status (Optional[str]) – If given, only entity texts with the given status, e.g., “synonym”, are returned.

  • source (Optional[str]) – If given, only entity texts from the given source, e.g., “uniprot”, are returned.

gilda.api.ground(text, context=None, organisms=None, namespaces=None)[source]

Return a list of scored matches for a text to ground.

Parameters:
  • text (str) – The entity text to be grounded.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – A list of taxonomy identifiers to use as a priority list when surfacing matches for proteins/genes from multiple organisms.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict the matches to. By default, no restriction is applied.

Returns:

A list of ScoredMatch objects representing the groundings.

Return type:

list[gilda.grounder.ScoredMatch]

Examples

Ground a string corresponding to an entity name, label, or synonym

>>> import gilda
>>> scored_matches = gilda.ground('mapt')

The matches are sorted in descending order by score and, in the event of a tie, by the namespace of the primary grounding. Each scored match has a gilda.term.Term object that contains information about the primary grounding.

>>> scored_matches[0].term.db
'hgnc'
>>> scored_matches[0].term.id
'6893'
>>> scored_matches[0].term.get_curie()
'hgnc:6893'

The score for each match can be accessed directly:

>>> scored_matches[0].score
0.7623

The rationale for each match is contained in the match attribute whose fields are described in gilda.scorer.Match:

>>> match_object = scored_matches[0].match

Give optional context to be used by Gilda’s disambiguation models, if available

>>> scored_matches = gilda.ground('ER', context='Calcium is released from the ER.')

Only return results from a certain namespace, such as when a family and gene have the same name

>>> scored_matches = gilda.ground('ESR', namespaces=["hgnc"])

gilda.api.make_grounder(terms)[source]

Create a custom grounder from a list of Terms.

Parameters:

terms (Union[str, List[Term], Mapping[str, List[Term]]]) – Specifies the grounding terms that should be loaded in the Grounder. If str, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If list, it is assumed to be a flat list of Terms. If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of Term objects as values.

Return type:

Grounder

Returns:

A Grounder instance, initialized with either the default terms loaded from the resource file or a custom set of terms if the terms argument was specified.

Examples

The following example shows how to get an ontology with obonet and load custom terms:

import gilda
import obonet
from gilda.process import normalize
from tqdm import tqdm

prefix = "UBERON"
url = "http://purl.obolibrary.org/obo/uberon/basic.obo"
g = obonet.read_obo(url)
custom_obo_terms = []
it = tqdm(g.nodes(data=True), unit_scale=True, unit="node")
for node, data in it:
    # Skip entries imported from other ontologies
    if not node.startswith(f"{prefix}:"):
        continue

    identifier = node.removeprefix(f"{prefix}:")

    name = data["name"]
    custom_obo_terms.append(gilda.Term(
        norm_text=normalize(name),
        text=name,
        db=prefix,
        id=identifier,
        entry_name=name,
        status="name",
        source=prefix,
    ))

    # Add terms for all synonyms
    for synonym_raw in data.get("synonym", []):
        try:
            # Try to parse out of the quoted OBO Field
            synonym = synonym_raw.split('"')[1].strip()
        except IndexError:
            continue  # the synonym was malformed

        custom_obo_terms.append(gilda.Term(
            norm_text=normalize(synonym),
            text=synonym,
            db=prefix,
            id=identifier,
            entry_name=name,
            status="synonym",
            source=prefix,
        ))

custom_grounder = gilda.make_grounder(custom_obo_terms)
scored_matches = custom_grounder.ground("head")

Additional examples for loading custom content from OBO Graph JSON, pyobo, and more can be found in the Jupyter notebooks in the Gilda repository on GitHub.

Grounder

class gilda.grounder.Grounder(terms=None, *, namespace_priority=None)[source]

Bases: object

Class to look up and ground query texts in a terms file.

Parameters:
  • terms (Union[str, Path, Iterable[Term], Mapping[str, List[Term]], None]) –

    Specifies the grounding terms that should be loaded in the Grounder.

    • If None, the default grounding terms are loaded from the versioned resource folder.

    • If str or pathlib.Path, it is interpreted as a path to a grounding terms gzipped TSV file which is then loaded. If it is a str that looks like a URL, the file is downloaded from the internet.

    • If dict, it is assumed to be a grounding terms dict with normalized entity strings as keys and lists of gilda.term.Term instances as values.

    • If list, set, tuple, or any other iterable, it is assumed to be a flat list of gilda.term.Term instances.

  • namespace_priority (Optional[List[str]]) – Specifies a term namespace priority order. If multiple terms are matched with the same score, they are ordered by how close to the front of this list their namespace appears. By default, DEFAULT_NAMESPACE_PRIORITY is used, which, for example, prioritizes FamPlex entities over HGNC ones.
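
The tie-breaking role of namespace_priority can be sketched with plain tuples. The function rank_matches, the scores, and the two-entry priority list below are all hypothetical and chosen only for illustration; the sketch mirrors the ordering described above (score first, namespace position second) and is not Gilda's actual implementation.

```python
def rank_matches(matches, namespace_priority):
    """Sort (score, namespace) pairs by score, then by namespace priority."""
    def sort_key(match):
        score, namespace = match
        # Namespaces missing from the priority list sort after listed ones.
        if namespace in namespace_priority:
            rank = namespace_priority.index(namespace)
        else:
            rank = len(namespace_priority)
        return (-score, rank)

    return sorted(matches, key=sort_key)


# Hypothetical scores and priority order, chosen only for illustration.
priority = ["FPLX", "HGNC"]
matches = [(0.9, "HGNC"), (0.9, "FPLX"), (0.5, "FPLX")]
ranked = rank_matches(matches, priority)
print(ranked)
```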

get_ambiguities(skip_names=True, skip_curated=True, skip_name_matches=True, skip_species_ambigs=True)[source]

Return a list of ambiguous term groups in the grounder.

Parameters:
  • skip_names (bool) – If True, groups of terms where one has the “name” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.

  • skip_curated (bool) – If True, groups of terms where one has the “curated” status are skipped. This makes sense usually since these are prioritized over synonyms anyway.

  • skip_name_matches (bool) – If True, groups of terms that all share the same standard name are skipped. This is effective at eliminating spurious ambiguities due to unresolved cross-references between equivalent terms in different namespaces.

  • skip_species_ambigs (bool) – If True, groups of terms that are all genes or proteins, and are all from different species (one term from each species) are skipped. This is effective at eliminating ambiguities between orthologous genes in different species that are usually resolved using the organism priority list.

Return type:

List[List[Term]]
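
As a rough sketch of what the skip_name_matches option describes, the following filter drops groups whose members all share one standard name. The function name and the (db, id, entry_name) triples are hypothetical placeholders standing in for Term objects; this is not the library's code.

```python
def filter_name_matches(groups):
    """Drop groups whose terms all share one standard name."""
    kept = []
    for group in groups:
        names = {entry_name for _db, _id, entry_name in group}
        # Keep the group only if at least two distinct names occur.
        if len(names) > 1:
            kept.append(group)
    return kept


# Placeholder (db, id, entry_name) triples, for illustration only.
groups = [
    [("DB_A", "1", "Example A"), ("DB_B", "2", "Example A")],
    [("DB_A", "3", "Gene X"), ("DB_C", "4", "Family Y")],
]
remaining = filter_name_matches(groups)
print(remaining)
```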

get_models()[source]

Return a list of entity texts for which disambiguation models exist.

Returns:

The list of entity texts for which a disambiguation model is available.

Return type:

list[str]

get_names(db, id, status=None, source=None)[source]

Return a list of entity texts corresponding to a given database ID.

Parameters:
  • db (str) – The database in which the ID is an entry, e.g., HGNC.

  • id (str) – The ID of an entry in the database.

  • status (Optional[str]) – If given, only entity texts with the given status, e.g., “synonym”, are returned.

  • source (Optional[str]) – If given, only entity texts from the given source, e.g., “uniprot”, are returned.

Returns:

names – A list of entity texts corresponding to the given database/ID

Return type:

list[str]

ground(raw_str, context=None, organisms=None, namespaces=None)[source]

Return scored groundings for a given raw string.

Parameters:
  • raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] i.e., human is used.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. This will apply to both the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.

Returns:

A list of ScoredMatch objects representing the groundings sorted by decreasing score.

Return type:

list[gilda.grounder.ScoredMatch]

ground_best(raw_str, context=None, organisms=None, namespaces=None)[source]

Return the best scored grounding for a given raw string.

Parameters:
  • raw_str (str) – A string to be grounded with respect to the set of Terms that the Grounder contains.

  • context (Optional[str]) – Any additional text that serves as context for disambiguating the given entity text, used if a model exists for disambiguating the given text.

  • organisms (Optional[List[str]]) – An optional list of organism identifiers defining a priority ranking among organisms, if genes/proteins from multiple organisms match the input. If not provided, the default [‘9606’] i.e., human is used.

  • namespaces (Optional[List[str]]) – A list of namespaces to restrict matches to. This will apply to both the primary namespace of a matched term, to any subsumed matches, and to the source namespaces of terms if they were created using cross-reference mappings. By default, no restriction is applied.

Returns:

The best ScoredMatch returned by ground() if any are returned, otherwise None.

Return type:

Optional[gilda.grounder.ScoredMatch]

lookup(raw_str)[source]

Return matching Terms for a given raw string.

Parameters:

raw_str (str) – A string to be looked up in the set of Terms that the Grounder contains.

Return type:

List[Term]

Returns:

A list of Terms that are potential matches for the given string.

print_summary(**kwargs)[source]

Print the summary of this grounder.

Return type:

None

summary_str()[source]

Summarize the contents of the grounder.

Return type:

str

class gilda.grounder.ScoredMatch(term, score, match, disambiguation=None, subsumed_terms=None)[source]

Bases: object

Class representing a scored match to a grounding term.

term

The Term that the scored match is for.

Type:

gilda.term.Term

score

The score associated with the match.

Type:

float

match

The Match object characterizing the match to the Term.

Type:

gilda.scorer.Match

disambiguation

Meta-information about disambiguation, when available.

Type:

Optional[dict]

subsumed_terms

A list of additional Term objects that also matched, have the same db/id value as the term associated with the match, but were further down the score ranking. In some cases examining the subsumed terms associated with a match can provide additional metadata in downstream applications.

Type:

Optional[list[gilda.term.Term]]

get_grounding_dict()[source]

Get the groundings as CURIEs and URLs.

Return type:

Mapping[str, str]

get_groundings()[source]

Return all groundings for this match including from mapped and subsumed terms.

Return type:

Set[Tuple[str, str]]

Returns:

A set of tuples representing groundings for this match including the grounding for the primary term as well as any subsumed terms, and groundings that come from having mapped an original source grounding during grounding resource construction.

get_namespaces()[source]

Return all namespaces for this match including from mapped and subsumed terms.

Return type:

Set[str]

Returns:

A set of strings representing namespaces for terms involved in this match, including the namespace for the primary term as well as any subsumed terms, and groundings that come from having mapped an original source grounding during grounding resource construction.

gilda.grounder.load_entries_from_terms_file(terms_file)[source]

Yield Terms from a compressed terms TSV file path.

Parameters:

terms_file (Union[str, Path]) – Path to a compressed TSV terms file with columns corresponding to the serialized elements of a Term.

Return type:

Iterator[Term]

Returns:

Terms loaded from the file yielded by a generator.

gilda.grounder.load_terms_file(terms_file)[source]

Load a TSV file containing terms into a lookup dictionary.

Parameters:

terms_file (Union[str, Path]) – Path to a compressed TSV terms file with columns corresponding to the serialized elements of a Term.

Return type:

Mapping[str, List[Term]]

Returns:

A lookup dictionary whose keys are normalized entity texts, and values are lists of Terms with that normalized entity text.
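
The shape of such a lookup dictionary can be sketched with the standard library alone. Here build_lookup is a hypothetical helper and plain strings stand in for Term objects; the point is only the grouping of entries under their normalized text.

```python
from collections import defaultdict


def build_lookup(terms):
    """Group (norm_text, term) pairs into a dict of lists keyed by norm_text."""
    lookup = defaultdict(list)
    for norm_text, term in terms:
        lookup[norm_text].append(term)
    return dict(lookup)


# Hypothetical entries: plain strings stand in for Term objects.
flat = [("mek", "term-1"), ("erk", "term-2"), ("mek", "term-3")]
lookup = build_lookup(flat)
print(lookup)
```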

Scorer

class gilda.scorer.Match(query, ref, exact=None, space_mismatch=None, dash_mismatches=None, cap_combos=None)[source]

Bases: object

Class representing a match between a query and a reference string.

gilda.scorer.generate_match(query, ref, beginning_of_sentence=False)[source]

Return a match data structure based on comparing a query to a ref str.

Parameters:
  • query (str) – The string to be compared against a reference string.

  • ref (str) – The reference string against which the incoming query string is compared.

  • beginning_of_sentence (bool) – True if the query appears at the beginning of a sentence, which is relevant for how capitalization is evaluated.

Returns:

A Match object characterizing the match between the two strings.

Return type:

Match

gilda.scorer.score_string_match(match)[source]

Return a score between 0 and 1 for the goodness of a match.

This score is purely based on the relationship of the two strings and does not take the status of the reference into account.

Parameters:

match (gilda.scorer.Match) – The Match object characterizing the relationship of the query and reference strings.

Returns:

A match score between 0 and 1.

Return type:

float

Term

class gilda.term.Term(norm_text, text, db, id, entry_name, status, source, organism=None, source_db=None, source_id=None)[source]

Bases: object

Represents a text entry corresponding to a grounded term.

norm_text

The normalized text corresponding to the text entry, used for lookups.

Type:

str

text

The text entry itself.

Type:

str

db

The database / name space corresponding to the grounded term.

Type:

str

id

The identifier of the grounded term within the database / name space.

Type:

str

entry_name

The standardized name corresponding to the grounded term.

Type:

str

status

The relationship of the text entry to the grounded term, e.g., synonym.

Type:

str

source

The source from which the term was obtained.

Type:

str

organism

When the term represents a protein, this attribute provides the taxonomy code of the species for the protein. For non-proteins, not provided. Default: None

Type:

Optional[str]

source_db

If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original db value before mapping.

Type:

Optional[str]

source_id

If the term’s db/id was mapped from a different, original db/id from a given source, this attribute provides the original ID value before mapping.

Type:

Optional[str]

get_curie()[source]

Get the compact URI for this term.

Return type:

str

get_groundings()[source]

Return all groundings for this term, including from a mapped source.

Return type:

Set[Tuple[str, str]]

Returns:

A set of tuples representing the main grounding for this term, as well as any source grounding from which the main grounding was mapped.

get_namespaces()[source]

Return all namespaces for this term, including from a mapped source.

Return type:

Set[str]

Returns:

A set of strings including the main namespace for this term, as well as any source namespace from which the main grounding was mapped.
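
The relationship between a term's main grounding and its mapped source grounding can be sketched as follows. SimpleTerm is a hypothetical, pared-down stand-in for gilda.term.Term that keeps only the four attributes involved, and the identifiers are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleTerm:
    """Hypothetical, pared-down stand-in for gilda.term.Term."""
    db: str
    id: str
    source_db: Optional[str] = None
    source_id: Optional[str] = None

    def get_groundings(self):
        # Always include the main grounding; add the source grounding
        # only when the term was mapped from another db/id.
        groundings = {(self.db, self.id)}
        if self.source_db and self.source_id:
            groundings.add((self.source_db, self.source_id))
        return groundings

    def get_namespaces(self):
        # The namespaces are the db parts of all groundings.
        return {db for db, _ in self.get_groundings()}


# Placeholder identifiers, for illustration only.
mapped = SimpleTerm(db="DB_A", id="1", source_db="DB_B", source_id="X")
unmapped = SimpleTerm(db="DB_A", id="2")
print(mapped.get_groundings(), unmapped.get_namespaces())
```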

to_json()[source]

Return the term serialized into a JSON dict.

to_list()[source]

Return the term serialized into a list of strings.

gilda.term.dump_terms(terms, fname)[source]

Dump a list of terms to a tsv.gz file.

Return type:

None

Process

Module containing various string processing functions used for grounding.

gilda.process.dashes = ['−', '-', '‐', '‑', '‒', '–', '—', '―']

A list of all kinds of dashes.

gilda.process.depluralize(word)[source]

Return the depluralized version of the word, along with a status flag.

Parameters:

word (str) – The word which is to be depluralized.

Returns:

The original word, if it is detected to be non-plural, or the depluralized version of the word, and a status flag representing the detected pluralization status of the word, with non_plural (e.g., BRAF), plural_oes (e.g., mosquitoes), plural_ies (e.g., antibodies), plural_es (e.g., switches), plural_cap_s (e.g., MAPKs), and plural_s (e.g., receptors).

Return type:

list of str pairs
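
The status flags listed above can be illustrated with a few suffix rules. This is a simplified sketch under stated assumptions: depluralize_sketch is a hypothetical name, the rules and their order are guesses made for illustration, and it returns a single (word, status) pair rather than a list of pairs.

```python
def depluralize_sketch(word):
    """Return (singular_candidate, status) using simple suffix rules."""
    if not word.endswith("s"):
        return word, "non_plural"
    if word.endswith("oes"):
        return word[:-2], "plural_oes"
    if word.endswith("ies"):
        return word[:-3] + "y", "plural_ies"
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2], "plural_es"
    # A lowercase "s" directly after a capital letter, as in "MAPKs".
    if len(word) > 1 and word[-2].isupper():
        return word[:-1], "plural_cap_s"
    return word[:-1], "plural_s"


for example in ["BRAF", "mosquitoes", "antibodies", "switches", "MAPKs", "receptors"]:
    print(example, depluralize_sketch(example))
```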

gilda.process.get_capitalization_pattern(word, beginning_of_sentence=False)[source]

Return the type of capitalization for the string.

Parameters:
  • word (str) – The word whose capitalization is determined.

  • beginning_of_sentence (Optional[bool]) – True if the word appears at the beginning of a sentence. Default: False

Returns:

The capitalization pattern of the given word. Returns one of the following: sentence_initial_cap, single_cap_letter, all_caps, all_lower, initial_cap, mixed.

Return type:

str
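
One way to picture the six pattern names is the classifier below. It is a sketch only: capitalization_pattern is a hypothetical function, and the exact rules and their precedence are assumptions rather than the library's logic.

```python
def capitalization_pattern(word, beginning_of_sentence=False):
    """Classify a word into one of the pattern names listed above."""
    initial_cap = word[:1].isupper() and word[1:].islower()
    if beginning_of_sentence and initial_cap:
        return "sentence_initial_cap"
    if len(word) == 1 and word.isupper():
        return "single_cap_letter"
    if word.isupper():
        return "all_caps"
    if word.islower():
        return "all_lower"
    if initial_cap:
        return "initial_cap"
    return "mixed"


for example in ["MEK", "mek", "Calcium", "mTOR", "A"]:
    print(example, capitalization_pattern(example))
```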

gilda.process.normalize(s)[source]

Normalize white spaces, dashes and case of a given string.

Parameters:

s (str) – The string to be normalized.

Returns:

The normalized string.

Return type:

str

gilda.process.remove_dashes(s)[source]

Remove all types of dashes in the given string.

Parameters:

s (str) – The string in which all types of dashes should be replaced.

Returns:

The string from which dashes have been removed.

Return type:

str

gilda.process.replace_dashes(s, rep='-')[source]

Replace all types of dashes in a given string with a given replacement.

Parameters:
  • s (str) – The string in which all types of dashes should be replaced.

  • rep (Optional[str]) – The string with which dashes should be replaced. By default, the plain ASCII dash (-) is used.

Returns:

The string in which dashes have been replaced.

Return type:

str
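
Dash replacement and removal can be sketched directly from the list of dash characters given for gilda.process.dashes. The two function names below are hypothetical, and the character-by-character approach is an assumption made for clarity, not a copy of the library code.

```python
# The eight dash characters listed for gilda.process.dashes, written as
# escapes: the minus sign, the ASCII hyphen, and U+2010 through U+2015.
DASHES = ["\u2212", "-", "\u2010", "\u2011", "\u2012",
          "\u2013", "\u2014", "\u2015"]


def replace_dashes_sketch(s, rep="-"):
    """Replace every character found in DASHES with rep."""
    return "".join(rep if ch in DASHES else ch for ch in s)


def remove_dashes_sketch(s):
    """Removing dashes is replacement with the empty string."""
    return replace_dashes_sketch(s, rep="")


print(replace_dashes_sketch("IL\u20136"))
print(remove_dashes_sketch("beta\u2010catenin"))
```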

gilda.process.replace_greek_latin(s)[source]

Replace spelled-out Greek letters with their Latin character.

gilda.process.replace_greek_spelled_out(s)[source]

Replace Greek Unicode characters with their spelled-out Latin names.

gilda.process.replace_greek_uni(s)[source]

Replace spelled-out Greek letters with their Unicode character.

gilda.process.replace_unicode(s)[source]

Replace unicode with ASCII equivalent, except Greek letters.

Greek letters are handled separately and aren’t replaced in this context.

gilda.process.replace_whitespace(s, rep=' ')[source]

Replace any length white spaces in the given string with a replacement.

Parameters:
  • s (str) – The string in which any length whitespaces should be replaced.

  • rep (Optional[str]) – The string with which all whitespace should be replaced. By default, the plain ASCII space ( ) is used.

Returns:

The string in which whitespaces have been replaced.

Return type:

str
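
Collapsing whitespace runs of any length can be sketched with a single regular expression. The name replace_whitespace_sketch is hypothetical, and whether the library also strips leading or trailing whitespace is not assumed here.

```python
import re


def replace_whitespace_sketch(s, rep=" "):
    """Collapse each run of whitespace characters into a single rep."""
    return re.sub(r"\s+", rep, s)


print(replace_whitespace_sketch("MEK \t\n phosphorylates   ERK"))
```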

gilda.process.split_preserve_tokens(s)[source]

Return split words of a string including the non-word tokens.

Parameters:

s (str) – The string to be split.

Returns:

The list of words in the string including the separator tokens, typically spaces and dashes.

Return type:

list of str
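
Splitting while keeping the separators can be sketched with re.split and a capturing group. The function name is hypothetical and the pattern is an assumption; note that with this pattern two adjacent separators produce an empty string between them.

```python
import re


def split_preserve_tokens_sketch(s):
    """Split on non-word characters, keeping each separator as a token."""
    # The capturing group makes re.split return the separators as well.
    return re.split(r"(\W)", s)


tokens = split_preserve_tokens_sketch("beta-catenin binds TCF")
print(tokens)
```

Because every separator is kept, joining the tokens reproduces the input string.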

Named Entity Recognition

Gilda implements a simple dictionary-based named entity recognition (NER) algorithm. It can be used as follows:

>>> from gilda.ner import annotate
>>> text = "MEK phosphorylates ERK"
>>> results = annotate(text)

The results are a list of Annotation objects each of which contains:

  • the text string matched

  • a list of gilda.grounder.ScoredMatch instances containing a sorted list of matches for the given text span (first one is the best match)

  • the start position in the text string where the entity starts

  • the end position in the text string where the entity ends

In this example, the two concepts are grounded to FamPlex entries.

>>> results[0].text, results[0].matches[0].term.get_curie(), results[0].start, results[0].end
('MEK', 'fplx:MEK', 0, 3)
>>> results[1].text, results[1].matches[0].term.get_curie(), results[1].start, results[1].end
('ERK', 'fplx:ERK', 19, 22)

The first element of an annotation’s matches list gives a full description of the match itself:

>>> results[0].matches[0]
ScoredMatch(Term(mek,MEK,FPLX,MEK,MEK,curated,famplex,None,None,None),0.9288806431663574,Match(query=mek,ref=MEK,exact=False,space_mismatch=False,dash_mismatches=set(),cap_combos=[('all_lower', 'all_caps')]))
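
To make the idea of dictionary-based NER concrete, the following sketch performs a greedy longest-match lookup over word spans and records character offsets. It is a toy illustration, not Gilda's algorithm: annotate_sketch, max_words, and the two-entry dictionary are hypothetical, and no scoring or disambiguation takes place.

```python
def annotate_sketch(text, dictionary, max_words=3):
    """Greedy longest-match lookup of word n-grams in a dictionary."""
    # Record each whitespace-delimited word with its character offsets.
    words, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        pos = start + len(word)
        words.append((word, start, pos))

    annotations, i = [], 0
    while i < len(words):
        # Try the longest span first, then shrink it word by word.
        for n in range(min(max_words, len(words) - i), 0, -1):
            start, end = words[i][1], words[i + n - 1][2]
            span = text[start:end]
            if span.lower() in dictionary:
                annotations.append((span, dictionary[span.lower()], start, end))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this word
    return annotations


# Hypothetical two-entry dictionary keyed by lowercased text.
toy_dictionary = {"mek": "fplx:MEK", "erk": "fplx:ERK"}
toy_results = annotate_sketch("MEK phosphorylates ERK", toy_dictionary)
print(toy_results)
```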

BRAT

Gilda implements a way to output annotation in a format appropriate for the BRAT Rapid Annotation Tool (BRAT).

>>> from gilda.ner import get_brat
>>> from pathlib import Path
>>> brat_string = get_brat(results)
>>> Path("results.ann").write_text(brat_string)
>>> Path("results.txt").write_text(text)

For brat to work, you need to store the text in a file with the extension .txt and the annotations in a file with the same name but extension .ann.
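
The text-bound lines of a brat .ann file follow a tab-separated standoff layout, which the sketch below reproduces for plain (text, start, end) triples. The function to_brat_sketch is hypothetical, and the output of Gilda's own get_brat may contain additional lines beyond these.

```python
def to_brat_sketch(annotations, entity_type="Entity", ix_offset=1):
    """Format (text, start, end) triples as brat text-bound annotation lines."""
    lines = []
    for ix, (text, start, end) in enumerate(annotations, start=ix_offset):
        # Standoff layout: ID <TAB> type start end <TAB> covered text
        lines.append(f"T{ix}\t{entity_type} {start} {end}\t{text}")
    return "\n".join(lines)


brat = to_brat_sketch([("MEK", 0, 3), ("ERK", 19, 22)])
print(brat)
```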

class gilda.ner.Annotation(text, matches, start, end)[source]

Bases: object

A class to represent an annotation.

text

The text span that was annotated.

Type:

str

matches

The list of scored matches for the text span.

Type:

List[ScoredMatch]

start

The start character offset of the text span.

Type:

int

end

The end character offset of the text span.

Type:

int

gilda.ner.annotate(text, *, grounder=None, sent_split_fun=None, organisms=None, namespaces=None, context_text=None)[source]

Annotate a given text with Gilda.

Parameters:
  • text (str) – The text to be annotated.

  • grounder (Optional[gilda.grounder.Grounder]) – The Gilda grounder to use for grounding.

  • sent_split_fun (Optional[Callable]) – A function that splits the text into sentences. The default is nltk.tokenize.sent_tokenize(). The function should take a string as input and return an iterable of strings corresponding to the sentences in the input text.

  • organisms (Optional[List[str]]) – A list of organism names to pass to the grounder. If not provided, human is used.

  • namespaces (Optional[List[str]]) – A list of namespaces to pass to the grounder to restrict the matches to. By default, no restriction is applied.

  • context_text (Optional[str]) – A longer span of text that serves as additional context for the text being annotated for disambiguation purposes.

Returns:

A list of Annotations where each contains as attributes the text span that was matched, the list of ScoredMatches, and the start and end character offsets of the text span.

Return type:

List[Annotation]

gilda.ner.get_brat(annotations, entity_type='Entity', ix_offset=1, include_text=True)[source]

Return brat-formatted annotation strings for the given entities.

Parameters:
  • annotations (List[Annotation]) – A list of named entity annotations in the text.

  • entity_type (Optional[str]) – The brat entity type to use for the annotations. The default is ‘Entity’. This is useful for differentiating between annotations in the same text extracted from different reading systems.

  • ix_offset (Optional[int]) – The index offset to use for the brat annotations. The default is 1.

  • include_text (Optional[bool]) – Whether to include the text of the entity in the brat annotations. The default is True. If not provided, the text that matches the span will be written to the annotation file.

Returns:

A string containing the brat-formatted annotations.

Return type:

str

Pandas Utilities

Utilities for Pandas.

gilda.pandas_utils.ground_df(df, source_column, *, target_column=None, grounder=None, **kwargs)[source]

Ground the elements of a column in a Pandas dataframe as CURIEs, in-place.

Parameters:
  • df (DataFrame) – A pandas dataframe

  • source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms

  • target_column (Union[None, str, int]) – The column where to put the groundings (either a CURIE string, or None). It’s possible to create a new column when passing a string for this argument. If not given, will create a new column name like <source column>_grounded.

  • grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.

  • kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.

Return type:

None

Examples

The following example shows how to use this function.

import pandas as pd
import gilda

url = "https://raw.githubusercontent.com/OBOAcademy/obook/master/docs/tutorial/linking_data/data.csv"
df = pd.read_csv(url)
gilda.ground_df(df, source_column="disease", target_column="disease_curie")

gilda.pandas_utils.ground_df_map(df, source_column, *, grounder=None, **kwargs)[source]

Ground the elements of a column in a Pandas dataframe as CURIEs.

Parameters:
  • df (DataFrame) – A pandas dataframe

  • source_column (Union[str, int]) – The column to ground. This column contains text corresponding to named entities’ labels or synonyms

  • grounder (Optional[Grounder]) – A custom grounder. If none given, uses the built-in grounder.

  • kwargs – Keyword arguments passed to Grounder.ground(), could include context, organisms, or namespaces.

Returns:

A pandas series representing the grounded CURIE strings. Contains NaNs if grounding was not successful or if there was an NaN in the cell before.

Return type:

pandas.Series