fm_index

class FMIndex:

An FM-index data structure for efficient substring search on text.

The FM-index is a compressed full-text index built on the Burrows–Wheeler Transform (BWT) and succinct rank/select structures. It supports fast pattern counting and locating while keeping memory usage low.

This implementation accepts Unicode strings and internally encodes the input as a sequence of integer symbols. Depending on the input string’s representation (e.g., UCS-1 / UCS-2 / UCS-4), it automatically selects an appropriate integer bit-width for symbols, using the smallest representation that can faithfully store all code points in the text.

FMIndex(data: str)
def item(self) -> str:

Convert the FM-Index back to a string.

Complexity

  • Time: O(N log σ)

where:

  • N = length of the indexed string
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

Examples

>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.item()
"mississippi"
def count(self, pattern: str) -> int:

Count the occurrences of the given pattern in the indexed string.

Complexity

  • Time: O(|pattern| log σ)

where:

  • |pattern| = length of the pattern
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

Examples

>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.count("issi")
2
def locate(self, pattern: str) -> list[int]:

Locate all occurrences of the given pattern in the indexed string. Order of the returned positions is not guaranteed.

Complexity

  • Time: O((|pattern| + |count|) log σ)

where:

  • |pattern| = length of the pattern
  • |count| = number of occurrences of the pattern in the indexed string
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

Examples

>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.locate("issi")
[4, 1]
def startswith(self, prefix: str) -> bool:

Check if the indexed string starts with the given prefix.

Complexity

  • Time: O(|prefix| log σ)

where:

  • |prefix| = length of the prefix
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

Examples

>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.starts_with("mi")
True
def endswith(self, suffix: str) -> bool:

Check if the indexed string ends with the given suffix.

Complexity

  • Time: O(|suffix| log σ)

where:

  • |suffix| = length of the suffix
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

Examples

>>> from fm_index import FMIndex
>>> fm_index = FMIndex("mississippi")
>>> fm_index.ends_with("pi")
True
class MultiFMIndex:

A Multi FM-index data structure for efficient substring search across multiple texts.

The MultiFMIndex builds a single FM-index over a collection of texts, enabling fast pattern counting and locating queries while keeping the index compact. Internally, the texts are concatenated with separators so that matches can be attributed back to the original text, making it suitable for multi-document search.

This implementation accepts Unicode strings and encodes each input text as a sequence of integer symbols. Depending on the input texts’ representation (e.g., UCS-1 / UCS-2 / UCS-4), it automatically selects an appropriate integer bit-width for symbols, using the smallest representation that can faithfully store all code points appearing in the collection.

MultiFMIndex(data: Iterable[str])
def item(self) -> list[str]:

Convert the MultiFM-Index back to a list of strings.

Complexity

  • Time: O(N log σ)

where:

  • N = total length of all indexed strings
  • σ = size of the alphabet (2⁸ for UCS-1, 2¹⁶ for UCS-2, etc.)

```

def count_all(self, pattern: str) -> int:

Count the total occurrences of the given pattern across all indexed strings.

def count(self, pattern: str) -> dict[int, int]:

Count the occurrences of the given pattern in each indexed string.

def locate(self, pattern: str) -> dict[int, list[int]]:

Locate all occurrences of the given pattern in each indexed string.

def startswith(self, prefix: str) -> list[int]:

List the indices of strings that start with the given prefix.

def endswith(self, suffix: str) -> list[int]:

List the indices of strings that end with the given suffix.