stride-align

Smith-Waterman & Needleman-Wunsch with a fast SIMD backend

stride-align is a blazing fast library to tell you how "similar" two strings are. It does this by implementing the Smith-Waterman and Needleman-Wunsch algorithms. Instead of giving you a lecture, we're going to learn by doing. Let's dive right into how it works.

Installation

pip install stride-align

On Loongson systems, install NumPy from your Linux distribution before installing stride-align, and grab the LoongArch64 wheel from the GitHub release instead of PyPI (PyPI does not yet accept the linux_loongarch64 or manylinux_2_38_loongarch64 platform tags):

sudo apt install python3-numpy

PY=$(python3 -c 'import sys; print(f"cp{sys.version_info.major}{sys.version_info.minor}")')
pip install \
  https://github.com/adamdeprince/stride-align/releases/download/v0.1.0/stride_align-0.1.0-${PY}-${PY}-linux_loongarch64.whl

Prebuilt LoongArch64 wheels are available for Python 3.10, 3.11, 3.12, 3.13, and 3.14. If you are on a different Python (or just want to build from source), pip install stride-align falls back to the source distribution on PyPI, which compiles the LSX/LASX kernels locally.

First, just a disclaimer: I'm not using religious texts here to push an agenda - for this demo I need multiple largish public domain documents that have the same meaning but are phrased differently. The Bible just happens to fit that demo requirement freakishly well.

Imagine we have two sentences - let's use the first sentence in Genesis for this:

In the American Standard Version we have: "In the beginning God created the heavens and the earth."

In the King James Version we have: "In the beginning God created the heaven and the earth."

We can see with our eyes there's a difference - heavens vs heaven. But how do we quantify this difference? We'd use this little bit of code:

import stride_align as sa

print(sa.smith_waterman_normalized_score(
      "In the beginning God created the heavens and the earth.",
      "In the beginning God created the heaven and the earth."))

When we run this it prints:

0.9907407407407407

Normalized scores are between 0 and 1. A score of 1 means the inputs are an exact match under the default scoring model. Scores near 0 mean the inputs have little in common, though Smith-Waterman may still find small local matches inside otherwise unrelated strings.

Now let's change the text and see what happens to the score.

import stride_align as sa

print(sa.smith_waterman_normalized_score(
      "In the beginning God created the heavens and the earth.",
      "The quick brown fox jumped over the lazy dog."))

and Python prints

0.12222222222222222

Starting to get the idea? The more similar the strings, the higher the score.

Let's build a bigger example, something that gives us a feel for the library's performance. You'll probably notice that we switch between Smith-Waterman and Needleman-Wunsch and may be wondering which to use when. Use Needleman-Wunsch when you want to compare the whole input against the whole input. Use Smith-Waterman when you want to find the best matching region inside larger inputs.

Okay, let's move on to the demo code. You need requests for this part of the demo:

pip install requests
import os, time, requests
import stride_align as sa

if not os.path.exists("kjv.txt"):
    response = requests.get("https://openbible.com/textfiles/kjv.txt")
    response.raise_for_status()
    response.encoding = "utf-8-sig"
    open("kjv.txt", "w", encoding="utf-8").write(response.text)

lines = [line.strip().lower() for line in open("kjv.txt")][2:]

while True:
    if not (query := input("Enter a snippet to match.  Press enter to end.\n")):
        break
    t = time.perf_counter()
    scores = sa.needleman_wunsch_normalized_scores(query.lower(), lines)
    best = int(scores.argmax())
    print()
    print("Score:", float(scores[best]))
    print(lines[best])
    print("Search time: %0.2fms" % ((time.perf_counter() - t) * 1000))
    print()
    print()

Now how can we use this? Suppose we have a random Bible verse and want to know what chapter and verse it comes from. grep you say? Oh, heavens, no, we made a mistake. The verse we have is from a different translation, say the Catholic Public Domain, and what we have on our computer is the King James Bible. grep's exact string matching won't work here. How do we find the chapter and verse? We search for the "closest" or "most similar" string using stride-align, of course.

In our demo the first part concerns itself with downloading and caching. The good folks at Open Bible put this text where it's HTTP-reachable, but we want to be respectful of their IT budget so we cache what we download. It's just good citizenship.

In the next part we load all of the lines into a list. We remove newlines and make everything lower case because we don't want to get all fiddly about whether we're holding the shift key.

Lastly that while True: loop collects a line of text, presumably the Bible verse from the Catholic version of the Bible we want to look up the chapter and verse for, and matches it against all of the lines in the King James Bible using the batch form of Needleman-Wunsch. It returns an array of scores. We use argmax() to find the best-scoring line and then print the line associated with that index. Let's try it.

I'm going to use Jeremiah 4:28 from the Catholic Bible - it's actually quite different from the same verse in the King James Bible. Let's see what happens ...

$ python3 demo2.py
Enter a snippet to match.  Press enter to end.
The earth will mourn, and the heavens will lament from above. For I have spoken, I have decided, and I have not regretted. Neither will I be turned away from it.

Score: 0.3598901098901099
jeremiah 4:28   for this shall the earth mourn, and the heavens above be black: because i have spoken [it], i have purposed [it], and will not repent, neither will i turn back from it.
Search time: 206.51ms

... and we found it! And pretty quickly too.

Now let's do another demo: spell checking.

This is a toy spell checker, not a production one. It ignores punctuation, capitalization, word frequency, proper nouns, and context. The point is to show the same one-query-against-many-candidates pattern on a familiar task.

import os, sys
import stride_align as sa

paths = ['/usr/share/dict/words',
         '/usr/dict/words',
         '/var/lib/dict/words',
         '/etc/dictionaries-common/words']

for path in paths:
    if os.path.exists(path):
        break
else:
    print("Sorry, I can't find your dictionary", file=sys.stderr)
    exit(1)


words = [line.strip().lower() for line in open(path)]


for line in sys.stdin:
    new_line = []
    for word in line.split():
        scores = sa.needleman_wunsch_normalized_scores(word.lower(), words)
        word = words[int(scores.argmax())]
        new_line.append(word)
    print(' '.join(new_line), flush=True)

The first thing this script does is try to find our operating system's list of correctly spelled words. Its location can vary from distribution to distribution. Once we've found it, we load it, strip off newlines and start the act of spell checking.

The spell checking looks a lot like the matching we did before. For each candidate word, we match it against all of the words in our list of correctly spelled words, use argmax() to find the highest-scoring candidate, and replace the word with that candidate. We could speed things up with some optimizations, like not searching for a match for correctly spelled words, but this is a demo and that optimization is left as an exercise for the reader.

Let's see how it works!

$ cat - | python3 demo3.py
this is a demonstrtion of a spel checker
it doesn't matter that I can't spell corectly

this is a demonstration of a spell checker
it doesn't matter that i can't spell correctly

Details

The current scaffold provides:

The native boundary accepts:

Direct bytes versus str pairs raise TypeError.

The current implementations are generic dynamic-programming kernels with preprocessing that serializes Python inputs into 8, 16, 32, or 64-bit token streams. SIMD-specialized backends can replace the backend translation units later without changing the Python API.

Score-only functions return numeric scores. The normalized variants return scores between 0 and 1. Path functions return alignment result objects containing the score, aligned values, operations, and CIGAR-style summaries where available.

API

import stride_align

score = stride_align.needleman_wunsch_score("ACGT", "ACCT")
scores = stride_align.Scores("ACGT", variant="needleman_wunsch").compare(["ACCT", "AGGT"])
result = stride_align.smith_waterman_path("ACCGT", "CCG")
wide_result = stride_align.smith_waterman_path("ACCGT", "CCG", width=64)
object_result = stride_align.needleman_wunsch_path(
    [frozenset({1}), frozenset({2})],
    [frozenset({1}), frozenset({3})],
)

print(score)
print(scores)
print(result.score, result.aligned_query, result.aligned_target, result.operations)
print(wide_result.score)
print(object_result.aligned_query, object_result.aligned_target)

Use Scores(...).compare([...]) or the *_scores() functions for one-query against many-target score workloads. That path prepares the query/profile once and is the preferred performance API for repeated English/Chinese text comparisons.

Traceback outputs preserve the paired fast-path type:

Pass width=8, 16, 32, or 64 to force the internal token/scoring width instead of using automatic selection.

Some functions expose CIGAR strings, short for "Concise Idiosyncratic Gapped Alignment Report". CIGAR is the compact alignment-operation notation used by SAM/BAM tooling. If you want the full formal version, see the SAM specification.

Optimizations and Benchmarks

Careful attention has been, and continues to be, paid to stride-align's performance story. The library includes SIMD optimization for a variety of common targets, including x86, Arm, and LoongArch.

LoongArch / Loongson. The Loongson optimization story is especially telling: for the checked benchmark case -- English text, 16-bit score width, score-only Smith-Waterman -- the LASX backend is 16x faster than the generic backend and 22.4x faster than Parasail.

If you are a researcher using Loongson servers and benefiting from this speedup, citations, bug reports, benchmark cases, and tiny inexpensive Chinese souvenirs are appreciated. Tea, calligraphy bookmarks, paper-cut ornaments, Chinese knot charms, panda keychains, and small dragon desk objects are all welcome. Please do not send anything expensive or anything that requires customs paperwork.

See complete benchmarks.

Native Microbench

For perf profiling without Python frames or benchmark orchestration, configure a native x86 microbench build:

nanobind_dir="$(.venv/bin/python -m nanobind --cmake_dir)"
cmake -S . -B build/perf \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DSTRIDE_ALIGN_BUILD_MICROBENCH=ON \
  -DSTRIDE_ALIGN_PERF_SYMBOLS=ON \
  -DPython_EXECUTABLE=.venv/bin/python \
  -Dnanobind_DIR="$nanobind_dir"
cmake --build build/perf --target stride_align_x86_microbench
build/perf/stride_align_x86_microbench --backend avx2 --shape 1:many --pass english --width 16
python tools/x86_microbench_regression.py \
  --binary build/perf/stride_align_x86_microbench \
  --cpu 2 \
  --backends avx2,avx512bwvl \
  --shapes 1:1,1:many \
  --passes english,chinese \
  --widths 16,32 \
  --write-json /tmp/stride-align-x86-microbench.json
.venv/bin/python tools/pinned_benchmark_sweep.py \
  --output-dir /tmp/stride-align-pinned \
  --cpu 2 \
  --iterations 15 \
  --warmups 3

STRIDE_ALIGN_PERF_SYMBOLS=ON keeps nanobind modules unstripped and adds debug symbols plus frame pointers while preserving -O3.

The checked-in native microbench baseline lives at benchmarks/x86_microbench_baseline.json. Treat it as a local guardrail with a loose threshold, not as a cross-machine SLA.

Citations

If you use my software in your research, please cite me.

@software{deprince_stride_align,
  author       = {DePrince, Adam},
  title        = {stride-align: Fast Smith-Waterman and Needleman-Wunsch alignment for Python},
  year         = {2026},
  publisher    = {GitHub},
  url          = {https://github.com/adamdeprince/stride-align},
  note         = {Python/C++ library for sequence and string alignment}
}