Stride-align is benchmarked across five hardware families:
Intel x86 (AVX2, AVX512BWVL), ARM Linux aarch64 on AWS Graviton4
(NEON/ASIMD, SVE, SVE2), ARM macOS arm64 on Apple M-series (NEON),
LoongArch64 on Loongson (LSX, LASX), and PowerPC64 VSX on Power8. Each
family ran on a different host with a different parasail build (or no
parasail at all), so numbers should be read within a family, not
across families. Raw CSVs live in benchmark.csv (x86) and
benchmarks/*.csv (everything else).
All ratios are median-runtime ratios. A ratio above 1.0x means stride-align
is faster than the named baseline for that row:
ratio = baseline_median_seconds / stride_align_median_seconds
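As a minimal sketch of the definition above, the ratio for one row can be computed from per-iteration timings as shown here. The function name `speedup_ratio` and the sample timings are hypothetical; they are not taken from stride_align.benchmark or from any CSV in this repository.

```python
from statistics import median


def speedup_ratio(baseline_seconds, stride_align_seconds):
    """Return baseline_median / stride_align_median for one benchmark row."""
    return median(baseline_seconds) / median(stride_align_seconds)


# Made-up per-iteration timings in seconds, for illustration only.
baseline = [0.0042, 0.0040, 0.0041]
candidate = [0.0021, 0.0020, 0.0019]

ratio = speedup_ratio(baseline, candidate)
print(f"{ratio:.3f}x")  # a value above 1.0x means stride-align is faster
```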
At a glance
| Family | Best stride-align backend | Baseline | Rows | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|---|
| Intel x86 | x86_avx512bwvl | parasail | 80 | 1.752x | 1.678x | 0.909x | 3.794x |
| Intel x86 | x86_avx2 | parasail | 80 | 1.377x | 1.302x | 0.474x | 3.513x |
| ARM Graviton4 (Linux) | linux_aarch64_neon/asimd | parasail | 80 | 1.138x | 1.138x | 0.260x | 2.637x |
| ARM Graviton4 (Linux) | linux_aarch64_sve2 | parasail | 80 | 1.081x | 1.108x | 0.261x | 2.635x |
| ARM Graviton4 (Linux) | linux_aarch64_sve | parasail | 80 | 1.042x | 1.101x | 0.259x | 2.647x |
| ARM Mac M4 (macOS) | macos_arm64_neon | parasail | 80 | 1.065x | 1.046x | 0.592x | 2.400x |
| Loongson LoongArch64 | linux_loongarch64_lasx | patched parasail (1:1 score) | 16 | 7.517x | 6.502x | 4.315x | 22.365x |
| Loongson LoongArch64 | linux_loongarch64_lasx | generic (native) | 80 | 4.909x | 5.149x | 0.499x | 29.707x |
| Power8 VSX (Linux) | linux_powerpc64_vsx | generic (no parasail) | 80 | 3.772x | 4.128x | 0.915x | 16.797x |
Intel x86 - 2026-05-18
Raw artifacts: benchmark.csv and the focused linear
sw-farrar-score exact-fill follow-up
benchmarks/x86-sw-farrar-exactfill-study-2026-05-14.csv.
Build context: 11th Gen Intel Core i7-1195G7, Python 3.13 in the project
virtualenv, host pinned with taskset -c 2, regenerated 2026-05-18 after the
CIGAR builder rework (to_chars-based digit emission + capacity reservation
in build_cigar and ReverseCigarBuilder). Parasail is the bundled
parasail==1.3.4 wheel. The CSV contains 320 data rows: English and Chinese
workloads, linear and affine scoring, widths 16 and 32, 1:1 and 1:many
shapes, and generic/x86_avx2/x86_avx512bwvl/parasail backends.
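The row counts can be cross-checked by enumerating the sweep grid. This sketch assumes that score variants run both shapes while path/CIGAR variants run only 1:1; that split is inferred from the 16-row and 8-row per-variant counts reported in this section, not read from the benchmark source.

```python
from itertools import product

PASSES = ["english", "chinese"]
SCORING_CASES = ["linear", "affine"]
WIDTHS = [16, 32]
SCORE_VARIANTS = ["sw-farrar-score", "sw-score", "nw-score"]
PATH_VARIANTS = ["sw-path-info", "nw-path-info", "sw-cigar", "nw-cigar"]
BACKENDS = ["generic", "x86_avx2", "x86_avx512bwvl", "parasail"]


def shapes_for(variant):
    # Assumption: only score variants are measured in the 1:many shape.
    return ["1:1", "1:many"] if variant in SCORE_VARIANTS else ["1:1"]


rows_per_backend = [
    (pass_name, case, width, variant, shape)
    for variant in SCORE_VARIANTS + PATH_VARIANTS
    for pass_name, case, width in product(PASSES, SCORING_CASES, WIDTHS)
    for shape in shapes_for(variant)
]

print(len(rows_per_backend))                  # rows per backend
print(len(rows_per_backend) * len(BACKENDS))  # data rows in the CSV
```

Under that assumption the grid yields 48 score-only rows plus 32 path/CIGAR rows per backend, which matches the 80-row tables and the 320-row CSV.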
Command:
taskset -c 2 .venv/bin/python -m stride_align.benchmark \
--backends generic x86_avx2 x86_avx512bwvl parasail \
--variants sw-farrar-score sw-score nw-score sw-path-info nw-path-info sw-cigar nw-cigar \
--passes english chinese \
--shapes all \
--scoring-cases linear affine \
--widths 16 32 \
--iterations 15 \
--warmups 3 \
--timing-split \
--format csv > benchmark.csv
Overall vs parasail
| Backend | Rows | Wins | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|
| x86_avx512bwvl | 80 | 77 | 1.752x | 1.678x | 0.909x | 3.794x |
| x86_avx2 | 80 | 66 | 1.377x | 1.302x | 0.474x | 3.513x |
| generic | 80 | 8 | 0.222x | 0.176x | 0.058x | 1.615x |
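The summary columns used throughout this document (Rows, Wins, Geomean, Median, Worst, Best) can be derived from a list of per-row ratios along the following lines. This is a sketch, not the project's reporting code: the `summarize` helper is hypothetical, and counting a win as a ratio strictly above 1.0 is an assumption based on the ratio definition above.

```python
import math
from statistics import median


def summarize(ratios):
    """Summarize per-row speedup ratios for one backend."""
    # Assumption: a "win" is a ratio strictly above 1.0; an exact tie does not count.
    wins = sum(1 for r in ratios if r > 1.0)
    # Geometric mean computed in log space for numerical stability.
    geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return {
        "rows": len(ratios),
        "wins": wins,
        "geomean": geomean,
        "median": median(ratios),
        "worst": min(ratios),
        "best": max(ratios),
    }


# Made-up ratios for illustration only.
summary = summarize([0.5, 1.0, 2.0, 4.0])
print(summary)
```

The geometric mean is the natural average for ratios because a 2x win and a 0.5x loss cancel to 1.0x, which an arithmetic mean would not do.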
Score-only rows (16 sw-farrar-score + 16 sw-score + 16 nw-score):
| Backend | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| x86_avx512bwvl | 48 | 48 | 1.767x | 1.762x |
| x86_avx2 | 48 | 40 | 1.297x | 1.173x |
Path/CIGAR rows (8 each of sw-path-info, nw-path-info, sw-cigar, nw-cigar):
| Backend | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| x86_avx512bwvl | 32 | 29 | 1.730x | 1.524x |
| x86_avx2 | 32 | 26 | 1.508x | 1.524x |
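The score-only and path/CIGAR splits above are a grouping of variants followed by a per-group geomean. A minimal sketch, with hypothetical helper names and made-up ratios:

```python
import math
from collections import defaultdict

SCORE_ONLY = {"sw-farrar-score", "sw-score", "nw-score"}


def group_of(variant):
    """Map a variant name to its reporting group."""
    return "score-only" if variant in SCORE_ONLY else "path/cigar"


def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def geomean_by_group(rows):
    """rows: iterable of (variant, ratio) pairs."""
    buckets = defaultdict(list)
    for variant, ratio in rows:
        buckets[group_of(variant)].append(ratio)
    return {group: geomean(ratios) for group, ratios in buckets.items()}


# Made-up ratios for illustration only.
rows = [
    ("sw-score", 2.0),
    ("nw-score", 0.5),
    ("sw-cigar", 4.0),
    ("nw-path-info", 1.0),
]
result = geomean_by_group(rows)
print(result)
```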
By variant
AVX2:
| Variant | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| nw-path-info | 8 | 8 | 2.172x | 2.170x |
| nw-cigar | 8 | 8 | 2.153x | 2.231x |
| sw-score | 16 | 16 | 1.763x | 1.708x |
| nw-score | 16 | 13 | 1.122x | 1.143x |
| sw-farrar-score | 16 | 11 | 1.103x | 1.045x |
| sw-path-info | 8 | 5 | 1.060x | 1.061x |
| sw-cigar | 8 | 5 | 1.043x | 1.051x |
AVX512BWVL:
| Variant | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| sw-score | 16 | 16 | 2.295x | 2.274x |
| nw-path-info | 8 | 8 | 2.210x | 2.337x |
| nw-cigar | 8 | 8 | 2.193x | 2.327x |
| nw-score | 16 | 16 | 1.673x | 1.694x |
| sw-farrar-score | 16 | 16 | 1.436x | 1.314x |
| sw-path-info | 8 | 7 | 1.373x | 1.412x |
| sw-cigar | 8 | 6 | 1.346x | 1.414x |
Worst rows vs parasail
AVX2:
| Ratio | Pass | Case | Shape | Variant | Width |
|---|---|---|---|---|---|
| 0.474x | chinese | linear | 1:1 | sw-cigar | 16 |
| 0.493x | chinese | linear | 1:1 | sw-path-info | 16 |
| 0.763x | english | linear | 1:1 | sw-cigar | 16 |
| 0.799x | english | linear | 1:1 | sw-path-info | 16 |
| 0.888x | chinese | linear | 1:1 | nw-score | 32 |
AVX512BWVL:
| Ratio | Pass | Case | Shape | Variant | Width |
|---|---|---|---|---|---|
| 0.909x | chinese | linear | 1:1 | sw-cigar | 16 |
| 0.943x | chinese | linear | 1:1 | sw-path-info | 16 |
| 0.951x | chinese | linear | 1:1 | sw-cigar | 32 |
| 1.034x | chinese | linear | 1:1 | sw-path-info | 32 |
| 1.079x | chinese | affine | 1:1 | sw-cigar | 16 |
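A worst-rows table like the ones above can be produced by taking the k smallest ratios. The sketch below uses the five AVX512BWVL rows from the table, deliberately shuffled; the dictionary keys and the `worst_rows` helper are hypothetical names.

```python
import heapq


def worst_rows(rows, k=5):
    """Return the k rows with the smallest ratio, smallest first."""
    return heapq.nsmallest(k, rows, key=lambda row: row["ratio"])


rows = [
    {"ratio": 1.034, "pass": "chinese", "case": "linear", "variant": "sw-path-info", "width": 32},
    {"ratio": 0.951, "pass": "chinese", "case": "linear", "variant": "sw-cigar", "width": 32},
    {"ratio": 1.079, "pass": "chinese", "case": "affine", "variant": "sw-cigar", "width": 16},
    {"ratio": 0.909, "pass": "chinese", "case": "linear", "variant": "sw-cigar", "width": 16},
    {"ratio": 0.943, "pass": "chinese", "case": "linear", "variant": "sw-path-info", "width": 16},
]

for row in worst_rows(rows, k=3):
    print(f'{row["ratio"]:.3f}x {row["pass"]} {row["case"]} {row["variant"]} width {row["width"]}')
```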
Takeaways
AVX512BWVL is the strongest Intel backend, winning 77 of 80 comparable
rows (including every score-only row) at 1.752x parasail geomean. AVX2 wins
66 of 80 at 1.377x geomean. Both backends now beat parasail on every variant
by geomean; sw-cigar, historically the weakest variant, is now above parity
(1.043x AVX2, 1.346x AVX512BWVL) instead of losing.
The 2026-05-18 sweep folded in the CIGAR builder rework (to_chars digit
emission and capacity reservation in build_cigar / ReverseCigarBuilder),
which lowered the path/CIGAR cost floor. AVX2 went from 1.237x → 1.377x
overall parasail geomean (a 1.11x improvement) and AVX512BWVL from 1.451x →
1.752x (a 1.21x improvement). sw-cigar specifically moved from 0.760x →
1.043x on AVX2 and from 1.090x → 1.346x on AVX512BWVL.
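The parenthesised improvement factors are the new geomean divided by the old one. The arithmetic can be checked directly from the figures quoted in the paragraph above; the dictionary labels are just illustrative names.

```python
# (before, after) parasail geomeans quoted in the paragraph above.
before_after = {
    "avx2 overall": (1.237, 1.377),
    "avx512bwvl overall": (1.451, 1.752),
    "avx2 sw-cigar": (0.760, 1.043),
    "avx512bwvl sw-cigar": (1.090, 1.346),
}

improvement = {
    name: after / before for name, (before, after) in before_after.items()
}

for name, factor in improvement.items():
    print(f"{name}: {factor:.2f}x")
```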
The 2026-05-14 focused linear sw-farrar-score exact-fill run still applies:
SSE4.1 reaches 6.04x generic, AVX2 10.67x, and AVX512BWVL 14.41x by
focused geomean.
AVX2's weakest rows are short linear SW 1:1 sw-cigar and sw-path-info at
width 16 (0.474x to 0.799x); in total it is below 1.0x on 14 of 80 rows.
AVX512BWVL is below 1.0x on only 3 rows, with a worst row of 0.909x.
generic is a correctness/baseline reference, not a parasail competitor.
It loses every score-only row badly; a handful of linear NW path/CIGAR rows
are competitive, but not consistently.
ARM Graviton4 (Linux aarch64) - 2026-05-18
Raw artifacts:
| Artifact | Contents |
|---|---|
| benchmarks/graviton4-arm-simd-parasail-2026-05-18.csv | Full 400-row sweep after the CIGAR builder rework and the 5174571 SVE/SVE2 fix: generic, linux_aarch64_neon, linux_aarch64_sve, linux_aarch64_sve2, parasail, all 7 variants, 1:1 and 1:many, widths 16/32. |
| benchmarks/graviton4-arm-simd-parasail-2026-05-16.csv | Pre-5174571 SVE/SVE2 snapshot. |
Build context: AWS Graviton4 (Neoverse V2), Ubuntu noble, Python 3.14.4,
GCC, system CMake + venv-local ninja. Pinned with taskset -c 0. Backends
measured: linux_aarch64_neon, linux_aarch64_sve, linux_aarch64_sve2,
plus generic and parasail. linux_aarch64_asimd was merged into NEON
in commit 617d282; the public NEON backend covers both.
Overall vs parasail
| Backend | Rows | Wins | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|
| linux_aarch64_neon | 80 | 51 | 1.138x | 1.138x | 0.260x | 2.637x |
| linux_aarch64_sve2 | 80 | 49 | 1.081x | 1.108x | 0.261x | 2.635x |
| linux_aarch64_sve | 80 | 47 | 1.042x | 1.101x | 0.259x | 2.647x |
Score-only:
| Backend | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| linux_aarch64_neon | 48 | 28 | 1.100x | 1.098x |
| linux_aarch64_sve2 | 48 | 28 | 1.022x | 1.071x |
| linux_aarch64_sve | 48 | 26 | 0.973x | 1.070x |
Path/CIGAR:
| Backend | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| linux_aarch64_neon | 32 | 23 | 1.198x | 1.230x |
| linux_aarch64_sve2 | 32 | 21 | 1.178x | 1.188x |
| linux_aarch64_sve | 32 | 21 | 1.155x | 1.206x |
By variant (vs parasail geomean)
| Variant | NEON | SVE | SVE2 |
|---|---|---|---|
| nw-path-info | 2.052x | 1.953x | 2.163x |
| sw-score | 1.465x | 1.268x | 1.269x |
| nw-cigar | 1.318x | 1.300x | 1.275x |
| sw-path-info | 1.144x | 1.066x | 1.059x |
| nw-score | 0.953x | 0.883x | 1.020x |
| sw-farrar-score | 0.952x | 0.822x | 0.824x |
| sw-cigar | 0.665x | 0.657x | 0.659x |
Worst rows vs parasail (NEON)
| Ratio | Pass | Case | Shape | Variant | Width |
|---|---|---|---|---|---|
| 0.260x | chinese | affine | 1:1 | sw-cigar | 16 |
Takeaways
The 2026-05-18 sweep is the first post-5174571 SVE/SVE2 capture against
parasail. SVE/SVE2 jumped from 0.450x parasail geomean (pre-fix snapshot
in the 2026-05-16 CSV) to 1.042x / 1.081x respectively — within rounding
of the 1.04x / 1.08x numbers advertised by the SVE-fix commit.
NEON remains the headline Graviton4 backend at 1.138x parasail geomean.
NEON, SVE, and SVE2 cluster within ~10% of each other on this Neoverse V2.
SVE2 edges SVE on score-only thanks to its native svqadd_s* vector-vector
saturating add for affine sentinels.
Remaining weak spots common to all three backends:
* affine/linear sw-cigar width 16 — still the worst variant
(0.665x NEON geomean); the trace-table representation favors parasail
on aarch64 even after the CIGAR builder rework.
* sw-farrar-score width 16 1:1 — 0.82-0.95x geomean across the three.
Recommended Graviton4 next steps
- Target sw-cigar width 16 next: it is still the variant where all three SIMD backends lose to parasail by the widest geomean margin (0.66-0.67x).
- Investigate sw-farrar-score width 16 1:1; the 0.82-0.95x range suggests a striped-trace cache miss specific to the short query.
- Either delete the merged linux_aarch64_asimd import alias or surface it explicitly in available_backends() so external users have a stable name.
ARM macOS arm64 (Apple M-series) - 2026-05-18
Raw macOS arm64 artifacts:
| Artifact | Contents |
|---|---|
| benchmarks/macos-arm64-neon-2026-05-18.csv | Full 240-row sweep after the CIGAR builder rework: generic, macos_arm64_neon, parasail, all 7 variants, 1:1 and 1:many, widths 16/32. |
| benchmarks/macos-arm64-neon-score-native-2026-05-13.csv | Native generic, swar, and NEON score rows for 1:1 and 1:many. |
| benchmarks/macos-arm64-neon-score-parasail-2026-05-13.csv | Score rows including locally installed parasail. |
| benchmarks/macos-arm64-neon-path-parasail-2026-05-13.csv | Path/CIGAR timing-split rows including parasail. |
| benchmarks/macos-arm64-neon-focused-2026-05-14.csv | Pre-CIGAR-fix focused comparison against parasail. |
| benchmarks/macos-arm64-neon-microbench-2026-05-14.txt | Native NEON microbench. |
| benchmarks/macos-arm64-neon-sw-farrar-parasail-study-2026-05-14.csv | Focused run after adding exact-fill linear SW Farrar score paths. |
| benchmarks/macos-arm64-neon-linear-trace-onepass-parasail-study-2026-05-14.csv | Negative-control one-pass striped linear SW trace experiment (reverted). |
| benchmarks/macos-arm64-neon-2026-05-13.md, benchmarks/macos-arm64-neon-2026-05-14.md | macOS-specific notes and recommendations. |
Build context: macOS 15.3.1 on Apple M-series (host wopr), Python 3.13 from
Homebrew, Apple clang 17. Parasail is locally installed parasail==1.3.4
backed by parasail library 2.6.2. Installing parasail from source under
Homebrew required autotools/libtool with /opt/homebrew/bin on PATH for
glibtoolize. The 2026-05-18 sweep was regenerated after the CIGAR builder
rework (to_chars-based digit emission + capacity reservation).
This is a different chip and toolchain from Graviton4 — do not transfer ratios between the two ARM sections.
Overall vs parasail (2026-05-18)
| Group | Rows | Wins | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|
| Overall | 80 | 46 | 1.065x | 1.046x | 0.592x | 2.400x |
| Score-only | 48 | 21 | 0.947x | 0.931x | | |
| Path/CIGAR | 32 | 25 | 1.272x | 1.308x | | |
By variant (macos_arm64_neon vs parasail)
| Variant | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| nw-path-info | 8 | 8 | 1.645x | 1.518x |
| nw-cigar | 8 | 8 | 1.616x | 1.575x |
| sw-score | 16 | 16 | 1.206x | 1.215x |
| sw-cigar | 8 | 5 | 1.010x | 1.042x |
| sw-path-info | 8 | 4 | 0.973x | 0.974x |
| sw-farrar-score | 16 | 5 | 0.951x | 0.905x |
| nw-score | 16 | 0 | 0.740x | 0.753x |
Takeaways
NEON on Mac has crossed parasail parity. The 2026-05-18 sweep is 1.065x
parasail geomean across 80 rows (was 0.720x on the 2026-05-14 focused
sweep). The path/CIGAR group is now the strongest (1.272x geomean) — the
inverse of the pre-fix state where it was the worst. The CIGAR builder
rework lands cleanly here: nw-cigar reaches 1.616x and sw-cigar
crosses parity at 1.010x.
Score-only is still the weak group at 0.947x. nw-score is the only
variant where every row loses (0.740x geomean), driven by parasail's
striped score kernel on width 16. sw-farrar-score is also slightly behind
(0.951x). The exact-fill linear SW Farrar score path lands its earlier
gains here.
Recommended Mac M-series next steps
- Target affine nw-score next: it is the largest remaining gap and the only variant where every comparison loses.
- Investigate sw-farrar-score width 16: still trailing parasail at 0.905x median.
- Keep the exact-fill linear SW Farrar score path enabled.
- Do not reuse the shared masked-trace helpers for NEON linear SW path/CIGAR without redesigning the trace representation.
- Add a native parasail comparison mode to the arm64 microbench before instruction-level parity work.
- Keep SWAR off the mac performance path (geomean 0.64x generic on the 2026-05-13 native sweep); it is for correctness/reference only.
Loongson LoongArch64 (LSX/LASX) - 2026-05-18
Raw Loongson artifacts:
| Artifact | Contents |
|---|---|
| benchmarks/loongson-native-2026-05-18.csv | Full 320-row native sweep after the CIGAR builder rework: generic, swar, LSX, LASX, all 7 variants, 1:1 and 1:many, widths 16/32. |
| benchmarks/loongson-score-native-2026-05-13.csv | Earlier native score-only sweep. |
| benchmarks/loongson-score-1to1-parasail-2026-05-13.csv | sw-score/nw-score 1:1 comparison against patched generic LoongArch parasail. |
| benchmarks/loongson-path-native-2026-05-13.csv | Pre-CIGAR-fix path/CIGAR timing-split rows, no parasail. |
| benchmarks/loongson-sw-farrar-exactfill-baseline-2026-05-14.csv | Focused linear sw-farrar-score baseline before exact-fill hooks. |
| benchmarks/loongson-sw-farrar-exactfill-study-2026-05-14.csv | Focused linear sw-farrar-score run after exact-fill hooks. |
| benchmarks/loongson-2026-05-13.md | Loongson-specific notes and recommendations. |
Build context: Loongson 3A6000-class host, Python 3.13.13, GCC 15.2.0, CMake
4.3.2. The LoongArch Python extension modules were built with static C++
runtime linkage; ldd shows no dynamic libstdc++/libgcc dependency.
Numpy is sourced from a host-local source build (/data/home/adam/dev/numpy)
linked against the GCC 15.2 runtime at /opt/loongson-gcc-15.2.0/lib, since
no upstream loongarch64 wheel exists.
Parasail status: upstream pip install parasail failed on LoongArch. A
patched source build works for direct score calls after treating LoongArch as
a non-x86 cpuid stub target, but it is generic parasail, not LSX/LASX
optimized. Its profile API returned NULL profiles and trace/CIGAR was not
usable, so parasail is included only for direct sw-score/nw-score 1:1
score rows. The 2026-05-18 sweep is native-only.
Overall vs generic (2026-05-18 native)
| Backend | Rows | Wins | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|
| linux_loongarch64_lasx | 80 | 68 | 4.909x | 5.149x | 0.499x | 29.707x |
| linux_loongarch64_lsx | 80 | 72 | 2.876x | 3.081x | 0.350x | 16.085x |
Score-only vs generic (2026-05-18 native)
| Backend | Rows | Wins | Geomean | Median | Best |
|---|---|---|---|---|---|
| linux_loongarch64_lasx | 48 | 48 | 11.624x | 13.370x | 29.707x |
| linux_loongarch64_lsx | 48 | 48 | 5.185x | 5.334x | 16.085x |
Path / CIGAR vs generic (2026-05-18 native)
| Backend | Rows | Wins | Geomean | Median | Best |
|---|---|---|---|---|---|
| linux_loongarch64_lasx | 32 | 20 | 1.347x | 1.069x | 7.127x |
| linux_loongarch64_lsx | 32 | 24 | 1.188x | 1.249x | 3.752x |
By variant (vs generic)
| Variant | LSX geomean | LASX geomean |
|---|---|---|
| sw-farrar-score | 6.440x | 19.281x |
| sw-score | 6.089x | 17.770x |
| nw-score | 3.556x | 4.585x |
| sw-cigar | 1.194x | 1.915x |
| nw-cigar | 1.677x | 1.797x |
| nw-path-info | 1.257x | 1.182x |
| sw-path-info | 0.791x | 0.809x |
Score-only vs patched LoongArch parasail (1:1)
The 2026-05-13 parasail comparison (16-row 1:1 score-only, unaffected by the CIGAR builder rework):
| Backend | Rows | Wins | Geomean | Median | Best |
|---|---|---|---|---|---|
| linux_loongarch64_lasx | 16 | 16 | 7.517x | 6.502x | 22.365x |
| linux_loongarch64_lsx | 16 | 16 | 5.307x | 5.395x | 13.205x |
| swar | 16 | 8 | 1.061x | 1.161x | 2.201x |
Takeaways
LASX is the clear Loongson win: 11.6x geomean over generic on score-only,
7.5x over patched parasail on 1:1 direct score calls, and up to 29.7x on the
best row. LSX trails LASX by roughly the expected 2x register-width factor.
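How close LASX comes to the 2x factor depends on the group. Because both backends are measured against the same baseline rows, dividing the LASX geomean by the LSX geomean gives the geomean speedup of LASX over LSX. A short check using the geomeans from the tables in this section (the group labels are illustrative):

```python
# (LSX geomean, LASX geomean) taken from the Loongson tables above.
geomeans = {
    "overall vs generic": (2.876, 4.909),
    "score-only vs generic": (5.185, 11.624),
    "path/cigar vs generic": (1.188, 1.347),
    "1:1 score vs patched parasail": (5.307, 7.517),
}

lasx_over_lsx = {group: lasx / lsx for group, (lsx, lasx) in geomeans.items()}

for group, factor in lasx_over_lsx.items():
    print(f"{group}: LASX is {factor:.2f}x LSX")
```

The factor exceeds 2x only on the score-only group; the path/CIGAR group gains far less from the wider registers.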
LSX/LASX score-only numbers shifted up from the 2026-05-13 sweep (LSX
4.234x → 5.185x, LASX 6.220x → 11.624x); the bulk of the LASX jump is the
exact-fill sw-farrar-score and sw-score paths landing at the 1024-cell
shape used in the sweep.
sw-cigar moved from path-trace-bound to comfortably ahead of generic: LSX
went from below 1.0x to 1.194x; LASX reaches 1.915x. The CIGAR builder
rework (to_chars + capacity reservation) is the proximate cause; the
sw-cigar path now spends almost all of its time in the SIMD score kernel
plus the affine reverse-build, not the digit serialization.
sw-path-info is the lone remaining weak variant on both LSX (0.791x) and
LASX (0.809x). profile_traceback materialization still dominates over the
SIMD score lift for that shape.
SWAR is essentially a regression on score-only (~0.63x generic on the
2026-05-13 native sweep) but is roughly at parity with parasail on the
patched 1:1 score comparison (1.06x geomean) — useful as a correctness
reference, not as a performance path.
Recommended Loongson next steps
- Keep exact-fill LSX/LASX score hooks enabled — large score-only win.
- Target a Loongson-specific linear SW trace/CIGAR redesign before doing instruction scheduling.
- Start with LASX width 16/32 trace-traffic reduction, then port to LSX.
- Add a native Loongson microbench/perf entrypoint before micro-optimizing LSX/LASX loops.
- Treat parasail as a generic LoongArch comparison until a maintained LSX/LASX parasail build becomes available.
PowerPC64 VSX (Power8 Linux) - 2026-05-18
Raw Power8 artifacts:
| Artifact | Contents |
|---|---|
| benchmarks/power8-vsx-2026-05-18.csv | Full 240-row sweep after the CIGAR builder rework: generic, swar, and linux_powerpc64_vsx for all 7 variants, 1:1 and 1:many, widths 16/32. |
| benchmarks/power8-vsx-2026-05-17.csv | Pre-CIGAR-fix sweep on the same host. |
| benchmarks/power8-vsx-2026-05-17.md | Power8-specific notes, semantic-delta writeup, and recommendations. |
Build context: real POWER8 silicon (PVR 004b 0201, 4.157 GHz), KVM-virtualized
as a single-core pSeries guest. Ubuntu 20.04 ppc64le, IBM Advance Toolchain
15.0 (GCC 11.4.1), Python 3.13.13 from miniforge, system CMake + Ninja.
Numpy 2.4.5 from pip. Parasail was not built (no upstream ppc64le
wheel; source build not attempted), so all ratios in this section are
against generic on the same machine. The 2026-05-18 sweep was pinned with
taskset -c 0.
Overall vs generic (2026-05-18)
| Backend | Rows | Wins | Geomean | Median | Worst | Best |
|---|---|---|---|---|---|---|
| linux_powerpc64_vsx | 80 | 74 | 3.772x | 4.128x | 0.915x | 16.797x |
| swar | 80 | 31 | 0.789x | 1.000x | 0.411x | 1.669x |
Score-only vs generic
| Backend | Rows | Wins | Geomean | Median | Best |
|---|---|---|---|---|---|
| linux_powerpc64_vsx | 48 | 48 | 6.911x | 6.763x | 16.797x |
Path / CIGAR vs generic
| Backend | Rows | Wins | Geomean | Median | Best |
|---|---|---|---|---|---|
| linux_powerpc64_vsx | 32 | 26 | 1.521x | 1.266x | 4.592x |
By variant (VSX vs generic)
| Variant | Rows | Wins | Geomean | Median |
|---|---|---|---|---|
| sw-farrar-score | 16 | 16 | 7.759x | 7.072x |
| sw-score | 16 | 16 | 7.544x | 7.103x |
| nw-score | 16 | 16 | 5.639x | 5.156x |
| sw-cigar | 8 | 5 | 1.958x | 2.183x |
| nw-cigar | 8 | 7 | 1.825x | 1.862x |
| nw-path-info | 8 | 6 | 1.225x | 1.266x |
| sw-path-info | 8 | 8 | 1.224x | 1.173x |
Worst rows vs generic
| Ratio | Pass | Case | Shape | Variant | Width |
|---|---|---|---|---|---|
| 0.915x | english | linear | 1:1 | nw-path-info | 16 |
| 0.934x | english | linear | 1:1 | sw-cigar | 32 |
| 0.954x | chinese | linear | 1:1 | nw-cigar | 32 |
| 0.955x | chinese | linear | 1:1 | sw-cigar | 32 |
| 0.964x | chinese | linear | 1:1 | nw-path-info | 16 |
Takeaways
Power8 VSX is a win or near-tie on every row. The 0.915x worst case is a short
1:1 linear nw-path-info row at width 16 where SIMD setup dominates. The biggest wins
(~16x) are sw-farrar-score and sw-score for both English and Chinese
1024x1024 at width 16. Affine sw-cigar and nw-cigar reach ~4.2x because
they route through the SIMD score kernel plus
profile_traceback::affine_cigar_with_score and skip the trace table.
Routing decisions that differ from x86_avx2 / Loongson: linear SW path /
path-info / linear CIGAR and affine path / path-info stay on
profile_traceback on Power8. The shared masked striped traceback wins on
ISAs with one-instruction lane bitmask extract and good striped-trace cache
behavior. On Power8 it measured 0.46-0.60x of the scalar byte-table path,
so the helper hooks (trace_mask_*, mask_or, store_masked_cells, the
vbpermq-based collapse described inline) are present but the public path /
path-info / linear CIGAR APIs route through profile_traceback. The affine
CIGAR entry stays on the SIMD score kernel + scalar reverse-build, beating
generic by 2-4x without a trace table.
SWAR regresses on Power8 across most rows (geomean 0.79x generic on the
2026-05-18 sweep). AT15 GCC 11.4 auto-vectorizes the generic score loop
well enough that SWAR's 64-bit packed lanes give no benefit. SWAR remains
useful as a correctness/reference backend only.
Recommended Power8 next steps
- Try a vbpermq-based trace_mask_* on real hardware; combined with a row-major linear SW trace table this might bring the masked path above 1.0x of generic.
- Add a Power8 local_affine_score_exact_segment*_raw mirroring the NEON helpers if the 1024-character query shape becomes a target; current sw-farrar-score is already 7-8x ahead of generic.
- Re-bench on a multi-core / non-virtualized Power8 host to characterise SMT throughput and shared L2/L3 effects.
- Build parasail from source for ppc64le and add a parasail column to the next sweep; every other family in this file has at least one parasail point of reference.
- Investigate why SWAR loses to generic on Power8 via an asm dump of the generic score loop.
Notes on comparing across families
These numbers are intended for engineering direction, not publication-grade
claims. Different families used different baselines (parasail where
available, otherwise generic), different sweep sizes, different host pinning
strategies, and different parasail builds (bundled wheel, locally compiled,
patched-for-LoongArch, or absent). For any cross-family claim, rerun the
relevant sweeps with matched conditions and use native microbench rows for
the specific kernels under discussion.