Metadata-Version: 2.4
Name: spacy-accelerate
Version: 0.3.1
Summary: Accelerate spaCy transformers with TensorRT/ONNX Runtime
Project-URL: Homepage, https://github.com/nesergey/spacy-accelerate
Project-URL: Documentation, https://github.com/nesergey/spacy-accelerate#readme
Project-URL: Repository, https://github.com/nesergey/spacy-accelerate
Project-URL: Issues, https://github.com/nesergey/spacy-accelerate/issues
Project-URL: PyPI, https://pypi.org/project/spacy-accelerate/
Author-email: Siarhei Niaverau <nesergey@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: acceleration,nlp,onnx,spacy,tensorrt,transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: cupy-cuda12x==13.6.0
Requires-Dist: numpy==2.4.1
Requires-Dist: onnx==1.20.1
Requires-Dist: onnxruntime-gpu==1.23.2
Requires-Dist: onnxscript<0.2.0,>=0.1.0
Requires-Dist: spacy-transformers==1.3.9
Requires-Dist: spacy==3.8.2
Requires-Dist: tensorrt-cu12-bindings==10.15.1.29
Requires-Dist: tensorrt-cu12-libs==10.15.1.29
Requires-Dist: tensorrt-cu12==10.15.1.29
Requires-Dist: thinc==8.3.10
Requires-Dist: torch==2.5.1
Requires-Dist: transformers==4.41.2
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: datasets>=2.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Description-Content-Type: text/markdown

# spacy-accelerate

[![PyPI version](https://img.shields.io/pypi/v/spacy-accelerate.svg)](https://pypi.org/project/spacy-accelerate/)
[![Python](https://img.shields.io/pypi/pyversions/spacy-accelerate.svg)](https://pypi.org/project/spacy-accelerate/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Accelerate spaCy transformer pipelines with TensorRT and ONNX Runtime. Drop-in replacement — one line of code. Speedup depends on your GPU: **1.2–3.5× faster inference** on the tested setups, with small accuracy deltas.

Repository: [GitHub](https://github.com/nesergey/spacy-accelerate) • Package: [PyPI](https://pypi.org/project/spacy-accelerate/)

## Requirements

- Python 3.11+
- CUDA 12.x
- NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
- spaCy 3.8+ with spacy-transformers

## Installation

### PyPI

```bash
pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2
```

The second command installs the TensorRT-enabled build of `onnxruntime-gpu` from NVIDIA's index. It is required because the default PyPI build does not include the TensorRT execution provider.

> [!NOTE]
> `spacy-accelerate` pins the full CUDA/TensorRT stack to keep versions aligned.
> On import it also configures the native library paths automatically, so no
> manual `LD_LIBRARY_PATH` setup is needed in most cases.

### Source / editable install

```bash
pip install -r requirements.txt
pip install -e . --no-deps
```

> [!WARNING]
> Use `--no-deps` when doing an editable install. Running plain `pip install -e .`
> triggers a second resolver pass that can replace the pinned CUDA 12 stack with
> newer, incompatible packages.

### Verify the installation

```bash
python -m spacy_accelerate
```

You should see `TensorRT EP : OK` and `CUDA EP : OK` in the output.
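
An equivalent low-level check is to ask ONNX Runtime directly which execution providers its installed build exposes (the exact ordering of the list may vary):

```python
import onnxruntime as ort

# The TensorRT-enabled build should expose both GPU providers.
print(ort.get_available_providers())
# e.g. ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
```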

## Quick Start

```python
import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal — same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))
```

## API Reference

### `optimize(nlp, **kwargs)`

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `nlp` | `spacy.Language` | required | spaCy pipeline with transformer |
| `precision` | `"fp32"` \| `"fp16"` | `"fp16"` | Model precision |
| `provider` | `"tensorrt"` \| `"cuda"` \| `"cpu"` | `"cuda"` | Execution provider |
| `cache_dir` | `Path` \| `str` | `~/.cache/spacy-accelerate` | ONNX model cache directory |
| `warmup` | `bool` | `True` | Run warmup inference |
| `device_id` | `int` | `0` | CUDA device ID |
| `max_batch_size` | `int` | `128` | Max batch size for IO Binding |
| `max_seq_length` | `int` | `512` | Max sequence length for IO Binding |
| `use_io_binding` | `bool` | `True` | Use zero-copy IO Binding |
| `verbose` | `bool` | `False` | Enable verbose logging |

**TensorRT-specific parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `trt_max_workspace_size` | `int` | `4 * 1024**3` | TensorRT workspace size in bytes |
| `trt_builder_optimization_level` | `int` | `3` | Optimization level (0–5) |
| `trt_timing_cache` | `bool` | `True` | Enable timing cache |

**Advanced parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `fixed_batch_size` | `int \| None` | `None` | Export ONNX with fixed batch size (dynamic if None) |
| `batch_buckets` | `list[int] \| None` | `None` | Pre-compiled TRT batch sizes; defaults to `[1,2,4,8,16,32,64,128]` for TensorRT |
| `fixed_seq_length` | `int \| None` | `None` | Pad/truncate all sequences to this length on GPU |
| `align_seq_length` | `int` | `16` | Pad sequence length up to the next multiple of this value |

**Returns:** The optimized `spacy.Language` object (modified in-place).
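
The advanced parameters combine with the basic ones. Here is an illustrative sketch using parameters from the tables above (the specific values are arbitrary, chosen only to show the shape of the call):

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

# Pre-compile a small set of TensorRT batch buckets and pad every
# sequence to a fixed length of 256 tokens (example values only).
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    batch_buckets=[1, 8, 32, 128],
    fixed_seq_length=256,
)
```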

### Cache Management

```python
import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")
```

## Examples

### Maximum performance with TensorRT

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8 GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds the TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")
```

### Custom cache directory

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)
```

### Verbose mode for debugging

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,
)
```

## Performance

**Conditions:** `en_core_web_trf`, CoNLL-2003 test set, `batch_size=128`,
1 warmup pass + 3 measured passes averaged.
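
The exact benchmark script is not included here; a rough sketch of this style of throughput measurement (WPS computed as spaCy tokens processed per second, averaged over the measured passes) could look like:

```python
import time

import spacy
import spacy_accelerate

nlp = spacy_accelerate.optimize(spacy.load("en_core_web_trf"), precision="fp16")
texts = ["..."]  # evaluation sentences, e.g. the CoNLL-2003 test set

list(nlp.pipe(texts, batch_size=128))  # 1 warmup pass

wps_runs = []
for _ in range(3):  # 3 measured passes
    start = time.perf_counter()
    docs = list(nlp.pipe(texts, batch_size=128))
    elapsed = time.perf_counter() - start
    wps_runs.append(sum(len(doc) for doc in docs) / elapsed)

print(f"{sum(wps_runs) / len(wps_runs):,.0f} WPS")
```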

### How much speedup to expect

The relative gain depends on your GPU. On a faster GPU the PyTorch baseline
is already quick, so GPU compute stops being the bottleneck and the remaining
time is dominated by the spaCy pipeline itself (tokenization, NER decoding,
Python overhead). ONNX Runtime does not accelerate that part.

| GPU | Mode | PyTorch baseline | Best provider | Best WPS | Speedup |
|-----|------|-----------------|---------------|----------|---------|
| RTX 4000 SFF Ada ¹ | Full pipeline | 6,241 WPS | TensorRT FP16 | **16,935** | **2.71×** |
| RTX 4000 SFF Ada ¹ | NER only | 7,066 WPS | TensorRT FP16 | **24,823** | **3.51×** |
| A100 80 GB ² | Full pipeline | 6,881 WPS | CUDA FP16 | 9,670 | 1.41× |
| A100 80 GB ² | NER only | 9,486 WPS | CUDA FP16 | 15,291 | 1.61× |
| RTX 4090 ³ | Full pipeline | 9,924 WPS | TensorRT FP16 | 11,726 | 1.18× |
| RTX 4090 ³ | NER only | 14,728 WPS | TensorRT FP16 | 19,313 | 1.31× |

> **NER-only mode** disables `tagger`, `parser`, `attribute_ruler`, `lemmatizer`.
> Only the transformer + NER component run, which reduces non-GPU overhead
> and yields higher speedups.
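
To reproduce the NER-only setup with the standard spaCy API, the other components can be excluded at load time (a sketch, not the exact benchmark configuration):

```python
import spacy
import spacy_accelerate

# Keep only the transformer + NER path.
nlp = spacy.load(
    "en_core_web_trf",
    disable=["tagger", "parser", "attribute_ruler", "lemmatizer"],
)
nlp = spacy_accelerate.optimize(nlp, precision="fp16")
```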

---

### RTX 4000 SFF Ada Generation — full results ¹

**Full pipeline:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 6,241 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,166 | 0.99× | 100.00% |
| CUDA EP FP32 | 9,910 | 1.59× | 99.90% |
| CUDA EP FP16 | 15,763 | 2.53× | 99.75% |
| TensorRT FP32 | 10,552 | 1.69× | 99.95% |
| **TensorRT FP16** | **16,935** | **2.71×** | **99.50%** |

**NER only:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 7,066 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,859 | 0.97× | 100.00% |
| CUDA EP FP32 | 11,972 | 1.69× | 99.90% |
| CUDA EP FP16 | 22,394 | 3.17× | 99.75% |
| TensorRT FP32 | 13,138 | 1.86× | 99.95% |
| **TensorRT FP16** | **24,823** | **3.51×** | **99.65%** |

---

### A100 80 GB — full results ²

**Full pipeline:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 6,881 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 6,882 | 1.00× | 100.00% |
| CUDA EP FP32 | 8,822 | 1.28× | 99.85% |
| **CUDA EP FP16** | **9,670** | **1.41×** | **99.75%** |
| TensorRT FP32 | 8,846 | 1.29× | 99.95% |
| TensorRT FP16 | 9,491 | 1.38× | 99.05% |

**NER only:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 9,486 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,414 | 0.99× | 100.00% |
| CUDA EP FP32 | 13,554 | 1.43× | 99.85% |
| **CUDA EP FP16** | **15,291** | **1.61×** | **99.75%** |
| TensorRT FP32 | 13,579 | 1.43× | 99.95% |
| TensorRT FP16 | 13,078 | 1.38× | 99.05% |

> On A100, **CUDA EP FP16 outperforms TensorRT FP16**. This is expected:
> A100 was optimized for BF16 and large-batch datacenter workloads;
> TensorRT gains are less pronounced for the NLP batch sizes typical
> in spaCy pipelines.
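
In practice that means requesting the CUDA execution provider explicitly on A100-class GPUs, for example:

```python
nlp = spacy_accelerate.optimize(nlp, provider="cuda", precision="fp16")
```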

---

### RTX 4090 — full results ³

**Full pipeline:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 9,924 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 9,839 | 0.99× | 100.00% |
| CUDA EP FP32 | 10,102 | 1.02× | 99.85% |
| CUDA EP FP16 | 11,381 | 1.15× | 99.85% |
| TensorRT FP32 | 10,397 | 1.05× | 99.95% |
| **TensorRT FP16** | **11,726** | **1.18×** | **99.65%** |

**NER only:**

| Execution Provider | WPS | Speedup | Accuracy |
|--------------------|-----|---------|----------|
| PyTorch Baseline FP32 | 14,728 | 1.00× | 100.00% |
| PyTorch Baseline FP16 | 14,557 | 0.99× | 100.00% |
| CUDA EP FP32 | 15,153 | 1.03× | 99.85% |
| CUDA EP FP16 | 18,126 | 1.23× | 99.85% |
| TensorRT FP32 | 15,853 | 1.08× | 99.95% |
| **TensorRT FP16** | **19,313** | **1.31×** | **99.65%** |

> On RTX 4090, the PyTorch baseline is already fast (~10k WPS full pipeline).
> The remaining gains come mostly from FP16 precision, not from the runtime switch.
> Switching to NER-only mode shows the clearest improvement (1.31×).

---

¹ Cloud instance (Hetzner). Ada Lovelace architecture benefits most from TensorRT FP16.
² Virtual partition (GRID A100D-80C). On this GPU CUDA EP FP16 is the recommended provider — TensorRT does not outperform it for typical spaCy batch sizes.
³ Cloud instance (RunPod). Local RTX 4090 results may differ due to power limits or virtualization overhead.

## Supported Models

Currently tested, confirmed, and supported:

- `en_core_web_trf` (RoBERTa-based)

Other spaCy transformer packages should be treated as unsupported for now, even
if related architecture-detection code exists internally.

## How It Works

1. **Weight Mapping** — extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.
2. **ONNX Export** — exports the model to ONNX with dynamic batch and sequence dimensions (see the sketch after this list).
3. **FP16 Optimization** (optional) — applies BERT-style graph optimizations and converts weights to FP16.
4. **Runtime Patching** — replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same spaCy interface.
5. **Caching** — converted models are cached to disk to avoid re-conversion on subsequent runs.
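
Step 2 is illustrated below not with the library's own exporter but with a generic dynamic-axes export of the underlying RoBERTa transformer; the model name and output path are placeholders:

```python
# Illustrative sketch only: a generic dynamic-axes ONNX export,
# not spacy-accelerate's internal exporter.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

sample = tokenizer("Example text.", return_tensors="pt")
torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "roberta.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)
```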

## Troubleshooting

### TensorRT provider not available

Run the diagnostic tool first:

```bash
python -m spacy_accelerate
```

If you see `TensorRT EP : MISSING`, the NVIDIA build of `onnxruntime-gpu` is not installed. Fix it with:

```bash
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2
```

### libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like `libnvinfer.so.10: cannot open shared object file`:

**Automatic fix:** `spacy-accelerate` configures the native library paths on import. Make sure to `import spacy_accelerate` before calling `spacy.require_gpu()` or creating any ONNX Runtime sessions.
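
In a script, that import order looks like this:

```python
import spacy_accelerate  # import first: configures the native library paths
import spacy

spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
nlp = spacy_accelerate.optimize(nlp)
```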

**Manual fix:** Set `LD_LIBRARY_PATH` explicitly:

```bash
SITE=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE/tensorrt_libs:$SITE/nvidia/cublas/lib:$SITE/nvidia/cuda_runtime/lib:$SITE/nvidia/cudnn/lib:$LD_LIBRARY_PATH"
```

### CUDA out of memory

Reduce workspace size or batch size:

```python
# For TensorRT provider — reduce workspace
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_max_workspace_size=2 * 1024**3,  # 2 GB instead of 4 GB
)

# For any provider — reduce batch size
nlp = spacy_accelerate.optimize(
    nlp,
    max_batch_size=16,
)
```

### First inference is slow

This only applies to the TensorRT provider. TensorRT compiles an optimized
engine on the first run — this can take tens of seconds. Subsequent runs
reuse the cached engine and are fast.

The timing cache is enabled by default and carries over build history between
runs. If build time matters, prefer a lower optimization level:

```python
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,             # on by default
    trt_builder_optimization_level=3,  # lower = faster build, 5 = best runtime perf
)
```

## Contributing

Contributions are welcome! Please open an issue or submit a pull request on [GitHub](https://github.com/nesergey/spacy-accelerate).

## License

MIT License
