fm_index
An FM-index data structure for efficient substring search on text.
The FM-index is a compressed full-text index built on the Burrows–Wheeler Transform (BWT) and succinct rank/select structures. It supports fast pattern counting and locating while keeping memory usage low.
This implementation accepts Unicode strings and internally encodes the input as a sequence of integer symbols. Depending on the input string’s representation (e.g., UCS-1 / UCS-2 / UCS-4), it automatically selects an appropriate integer bit-width for symbols, using the smallest representation that can faithfully store all code points in the text.
Convert the FM-Index back to a string.
Complexity
- Time:
O(N log σ)
where:
N= length of the indexed stringσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
Examples
>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.item()
"mississippi"
Count the occurrences of the given pattern in the indexed string.
Complexity
- Time:
O(|pattern| log σ)
where:
|pattern|= length of the patternσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
Examples
>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.count("issi")
2
Locate all occurrences of the given pattern in the indexed string. Order of the returned positions is not guaranteed.
Complexity
- Time:
O((|pattern| + |count|) log σ)
where:
|pattern|= length of the pattern|count|= number of occurrences of the pattern in the indexed stringσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
Examples
>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.locate("issi")
[4, 1]
Check if the indexed string starts with the given prefix.
Complexity
- Time:
O(|prefix| log σ)
where:
|prefix|= length of the prefixσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
Examples
>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.starts_with("mi")
True
Check if the indexed string ends with the given suffix.
Complexity
- Time:
O(|suffix| log σ)
where:
|suffix|= length of the suffixσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
Examples
>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.ends_with("pi")
True
A Multi FM-index data structure for efficient substring search across multiple texts.
The MultiFMIndex builds a single FM-index over a collection of texts, enabling fast pattern counting and locating queries while keeping the index compact. Internally, the texts are concatenated with separators so that matches can be attributed back to the original text, making it suitable for multi-document search.
This implementation accepts Unicode strings and encodes each input text as a sequence of integer symbols. Depending on the input texts’ representation (e.g., UCS-1 / UCS-2 / UCS-4), it automatically selects an appropriate integer bit-width for symbols, using the smallest representation that can faithfully store all code points appearing in the collection.
Convert the MultiFM-Index back to a list of strings.
Complexity
- Time:
O(N log σ)
where:
N= total length of all indexed stringsσ= size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)
```
Count the total occurrences of the given pattern across all indexed strings.
Count the occurrences of the given pattern in each indexed string.
Locate all occurrences of the given pattern in each indexed string.