Metadata-Version: 2.4
Name: spacy-accelerate
Version: 0.3.0
Summary: Accelerate spaCy transformers with TensorRT/ONNX Runtime
Project-URL: Homepage, https://github.com/nesergey/spacy-accelerate
Project-URL: Documentation, https://github.com/nesergey/spacy-accelerate#readme
Project-URL: Repository, https://github.com/nesergey/spacy-accelerate
Project-URL: Issues, https://github.com/nesergey/spacy-accelerate/issues
Author-email: Siarhei Niaverau <nesergey@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: acceleration,nlp,onnx,spacy,tensorrt,transformer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: cupy-cuda12x==13.6.0
Requires-Dist: numpy==2.4.1
Requires-Dist: onnx==1.20.1
Requires-Dist: onnxruntime-gpu==1.23.2
Requires-Dist: onnxscript<0.2.0,>=0.1.0
Requires-Dist: spacy-transformers==1.3.9
Requires-Dist: spacy==3.8.2
Requires-Dist: tensorrt-cu12-bindings==10.15.1.29
Requires-Dist: tensorrt-cu12-libs==10.15.1.29
Requires-Dist: tensorrt-cu12==10.15.1.29
Requires-Dist: tensorrt==10.15.1.29
Requires-Dist: thinc==8.3.10
Requires-Dist: torch==2.5.1
Requires-Dist: transformers==4.41.2
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: datasets>=2.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Description-Content-Type: text/markdown

# spacy-accelerate

Accelerate spaCy transformers with TensorRT/ONNX Runtime. A drop-in optimizer for transformer-based spaCy pipelines, backed by Docker-verified GPU benchmark workflows.

## Installation

`spacy-accelerate` depends on a CUDA/TensorRT stack that must stay version-aligned.
The two failure modes we hit in practice were:

- a second dependency resolution pass upgrading parts of the stack to different CUDA majors;
- CUDA/TensorRT shared libraries from pip wheels not being visible to CuPy / ONNX Runtime.

The package now pins the runtime versions in `pyproject.toml`, and it configures
the pip-installed native libraries automatically on import.
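
The import-time configuration amounts to making the pip-installed shared libraries loadable before ONNX Runtime or CuPy need them. A minimal sketch of the idea, assuming pip-installed NVIDIA wheels; the directory names are illustrative and the package's actual logic may differ:

```python
import ctypes
import site
from pathlib import Path

def _preload_native_libs() -> None:
    """Illustrative sketch: preload pip-installed CUDA/TensorRT shared libraries."""
    site_packages = Path(site.getsitepackages()[0])
    # pip wheels ship their .so files under directories like these
    lib_dirs = [
        site_packages / "tensorrt_libs",
        *site_packages.glob("nvidia/*/lib"),
    ]
    for lib_dir in lib_dirs:
        for lib in sorted(lib_dir.glob("*.so*")):
            try:
                # Loading by absolute path makes the symbols visible to
                # CuPy / ONNX Runtime without touching LD_LIBRARY_PATH.
                ctypes.CDLL(str(lib), mode=ctypes.RTLD_GLOBAL)
            except OSError:
                pass  # skip libraries whose dependencies are not loaded yet
```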

Benchmark Docker files live under `benchmarks/docker/`, and canonical benchmark
artifacts are saved under `artifacts/benchmarks/docker/`. The root
`.dockerignore` is kept at repository level because Docker build context
filtering applies to the whole repo root.

### PyPI install

```bash
pip install spacy-accelerate
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2
```

The second command is required to ensure you get the TensorRT-enabled
`onnxruntime-gpu` build from NVIDIA's index rather than the default PyPI wheel.

### Source / editable install

```bash
pip install -r requirements.txt
pip install -e . --no-deps
```

Do not run plain `pip install -e .` after that. It can trigger a second resolver
pass and replace the pinned CUDA 12 stack with newer incompatible packages.
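
To confirm the resolver has not replaced the pinned stack, you can spot-check the installed versions with the standard library (package names follow the pins in `pyproject.toml`):

```python
from importlib.metadata import version

# Spot-check the pinned CUDA 12 stack after installation.
for pkg in ("tensorrt", "onnxruntime-gpu", "cupy-cuda12x", "torch"):
    print(f"{pkg}: {version(pkg)}")
```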

**Verify the installation:**
```bash
python -m spacy_accelerate
```

You should see `TensorRT EP : OK` and `CUDA EP : OK` in the output.
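
You can also query ONNX Runtime directly for the available execution providers (standard `onnxruntime` API):

```python
import onnxruntime as ort

# Both providers must be listed for GPU acceleration to work.
providers = ort.get_available_providers()
print("TensorrtExecutionProvider" in providers)  # TensorRT EP
print("CUDAExecutionProvider" in providers)      # CUDA EP
```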

**Requirements:**
- Python 3.11+
- CUDA 12.x
- NVIDIA GPU with TensorRT support (Ampere / Ada Lovelace recommended)
- spaCy 3.8+ with spacy-transformers

## Quick Start

```python
import spacy
import spacy_accelerate

# Load your spaCy transformer model
nlp = spacy.load("en_core_web_trf")

# Optimize with one line!
nlp = spacy_accelerate.optimize(nlp, precision="fp16")

# Use as normal - same API, faster inference
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple Inc.', 'ORG'), ('Steve Jobs', 'PERSON'), ('Cupertino', 'GPE')]

# Batch processing works too
texts = ["Text one.", "Text two.", "Text three."]
docs = list(nlp.pipe(texts, batch_size=32))
```

## API Reference

### `optimize(nlp, **kwargs)`

Optimize a spaCy transformer pipeline with ONNX Runtime / TensorRT.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `nlp` | `spacy.Language` | required | spaCy pipeline with transformer |
| `precision` | `"fp32"` \| `"fp16"` | `"fp16"` | Model precision |
| `provider` | `"tensorrt"` \| `"cuda"` \| `"cpu"` | `"cuda"` | Execution provider |
| `cache_dir` | `Path` \| `str` | `~/.cache/spacy-accelerate` | ONNX model cache directory |
| `warmup` | `bool` | `True` | Run warmup inference |
| `device_id` | `int` | `0` | CUDA device ID |
| `max_batch_size` | `int` | `128` | Max batch size for IO Binding |
| `max_seq_length` | `int` | `512` | Max sequence length for IO Binding |
| `use_io_binding` | `bool` | `True` | Use zero-copy IO Binding |
| `verbose` | `bool` | `False` | Enable verbose logging |

**TensorRT-specific parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `trt_max_workspace_size` | `int` | `4 GiB` | TensorRT workspace size in bytes |
| `trt_builder_optimization_level` | `int` | `3` | Optimization level (0-5) |
| `trt_timing_cache` | `bool` | `True` | Enable timing cache |

**Returns:** The optimized `spacy.Language` object (modified in-place).
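
Putting several of the documented parameters together (values are illustrative):

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="cuda",      # "tensorrt" for maximum speed, "cpu" as fallback
    precision="fp16",
    device_id=0,
    max_batch_size=64,    # sizes the preallocated IO Binding buffers
    max_seq_length=256,
    use_io_binding=True,  # zero-copy transfers between spaCy and ONNX Runtime
)
```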

### Cache Management

```python
import spacy_accelerate

# List cached models
cached = spacy_accelerate.list_cached()
print(f"Cached models: {cached}")

# Get cache size
size_bytes = spacy_accelerate.get_cache_size()
print(f"Cache size: {size_bytes / 1024**2:.1f} MB")

# Clear cache
cleared = spacy_accelerate.clear_cache()
print(f"Cleared {cleared} cache entries")
```

## Performance

Canonical benchmark results are the Docker runs under [artifacts/benchmarks/docker](artifacts/benchmarks/docker).

Benchmark commands and runner details are maintained in [benchmarks/README.md](benchmarks/README.md).

Latest full-pipeline Docker measurement for `en_core_web_trf` on an **NVIDIA RTX 4000 SFF Ada Generation**, **CoNLL-2003** test set, `batch_size=128`, with one discarded warm-up pass and the average of three measured passes:

| Execution Provider | Speed (WPS) | Speedup vs PyTorch | Accuracy |
|--------------------|-------------|--------------------|----------|
| PyTorch Baseline (FP32) | 6,241 | 1.00x | 100.00% |
| PyTorch Baseline (FP16) | 6,166 | 0.99x | 100.00% |
| CUDA FP32 | 9,910 | 1.59x | 99.90% |
| CUDA FP16 | 15,763 | 2.53x | 99.75% |
| TensorRT FP32 | 10,552 | 1.69x | 99.95% |
| **TensorRT FP16** | **16,935** | **2.71x** | **99.50%** |

Latest Docker NER-only measurement for `en_core_web_trf` with `tagger`, `parser`, `attribute_ruler`, and `lemmatizer` disabled:

| Execution Provider | Speed (WPS) | Speedup vs PyTorch | Accuracy |
|--------------------|-------------|--------------------|----------|
| PyTorch Baseline (FP32) | 7,066 | 1.00x | 100.00% |
| PyTorch Baseline (FP16) | 6,859 | 0.97x | 100.00% |
| CUDA FP32 | 11,972 | 1.69x | 99.90% |
| CUDA FP16 | 22,394 | 3.17x | 99.75% |
| TensorRT FP32 | 13,138 | 1.86x | 99.95% |
| **TensorRT FP16** | **24,823** | **3.51x** | **99.65%** |
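
A minimal sketch of the measurement protocol above (one discarded warm-up pass, three measured passes averaged); the runner documented in `benchmarks/README.md` is the authoritative implementation, and the word count here is approximate:

```python
import time

def words_per_second(nlp, texts, passes=3, batch_size=128):
    """Average throughput over `passes` runs after one discarded warm-up pass."""
    total_words = sum(len(t.split()) for t in texts)  # approximate word count
    list(nlp.pipe(texts, batch_size=batch_size))      # warm-up: engine build, caches
    rates = []
    for _ in range(passes):
        start = time.perf_counter()
        list(nlp.pipe(texts, batch_size=batch_size))
        rates.append(total_words / (time.perf_counter() - start))
    return sum(rates) / len(rates)
```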


## Examples

### Using TensorRT for Maximum Performance

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    precision="fp16",
    trt_max_workspace_size=8 * 1024**3,  # 8GB
    trt_builder_optimization_level=5,     # Maximum optimization
)

# First inference builds TensorRT engine (cached for subsequent runs)
doc = nlp("TensorRT provides maximum inference speed.")
```

### Custom Cache Directory

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    cache_dir="/path/to/custom/cache",
    precision="fp16",
)
```

### Verbose Mode for Debugging

```python
import spacy
import spacy_accelerate

nlp = spacy.load("en_core_web_trf")

nlp = spacy_accelerate.optimize(
    nlp,
    verbose=True,  # Print detailed logs
)
```

## Supported Models

The only spaCy model confirmed to work end-to-end so far is:

- `en_core_web_trf`

Internally, the exporter and architecture-detection logic target curated-transformer / RoBERTa-style backbones, with partial code paths for the BERT and XLM-RoBERTa families, but those families are not yet claimed as generally supported spaCy packages.

## How It Works

1. **Weight Mapping**: Extracts transformer weights from spaCy's internal format and maps them to HuggingFace format.

2. **ONNX Export**: Exports the mapped model to ONNX format with dynamic batch and sequence dimensions (see the sketch after this list).

3. **FP16 Optimization** (optional): Applies BERT-style optimizations and converts to FP16 for faster inference.

4. **Runtime Patching**: Replaces the PyTorch transformer with an ONNX Runtime proxy that provides the same interface.

5. **Caching**: Converted models are cached to avoid re-conversion on subsequent loads.
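
For step 2, a hedged sketch of what an export with dynamic axes looks like, assuming a HuggingFace-style encoder produced by the weight-mapping step; the function name, input names, and output path are illustrative, not the package's actual API:

```python
import torch

def export_with_dynamic_axes(hf_model, out_path="model.onnx", seq_len=512):
    """Export an HF-style encoder to ONNX with dynamic batch/sequence dims."""
    hf_model.eval()
    input_ids = torch.ones(1, seq_len, dtype=torch.long)
    attention_mask = torch.ones(1, seq_len, dtype=torch.long)
    torch.onnx.export(
        hf_model,
        (input_ids, attention_mask),
        out_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        # Dynamic axes let one exported graph serve any batch size and
        # sequence length up to the configured runtime limits.
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "last_hidden_state": {0: "batch", 1: "sequence"},
        },
    )
```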

## Troubleshooting

### TensorRT provider not available

Run the diagnostic tool first:
```bash
python -m spacy_accelerate
```

If you see `TensorRT EP : MISSING`, the NVIDIA build of `onnxruntime-gpu` is not installed.
Fix it with the second command from the PyPI install instructions:
```bash
pip install --force-reinstall \
    --extra-index-url https://pypi.nvidia.com \
    onnxruntime-gpu==1.23.2
```

### libnvinfer.so / libcublas.so / libcublasLt.so not found

If you see errors like `libnvinfer.so.10`, `libcublas.so.12`, or
`libcublasLt.so.12: cannot open shared object file`:

**Automatic fix:** `spacy-accelerate` automatically configures both the TensorRT
libraries and the CUDA libraries installed under `site-packages/nvidia/*/lib`.
Import `spacy_accelerate` before creating ONNX Runtime sessions or calling
`spacy.require_gpu()`.

**Manual fix:** If the automatic configuration doesn't apply (e.g., in processes that never import `spacy_accelerate`), set the library path before starting Python:
```bash
SITE_PACKAGES=$(python -c "import site; print(site.getsitepackages()[0])")
export LD_LIBRARY_PATH="$SITE_PACKAGES/tensorrt_libs:$SITE_PACKAGES/nvidia/cublas/lib:$SITE_PACKAGES/nvidia/cuda_runtime/lib:$SITE_PACKAGES/nvidia/cudnn/lib:$LD_LIBRARY_PATH"
```

### CUDA out of memory

Reduce workspace size or batch size:

```python
nlp = spacy_accelerate.optimize(
    nlp,
    trt_max_workspace_size=2 * 1024**3,  # 2GB instead of 4GB
    max_batch_size=16,                    # Smaller batches
)
```

### First inference is slow

TensorRT builds optimized engines on first run. Enable caching:

```python
nlp = spacy_accelerate.optimize(
    nlp,
    provider="tensorrt",
    trt_timing_cache=True,  # Cache timing data
)
```

## License

MIT License

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.
