Metadata-Version: 2.4
Name: fasr-vad-marblenet
Version: 0.5.2
Summary: NVIDIA MarbleNet VAD model for fasr
Author-email: fasr <790990241@qq.com>
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: fasr
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.16.0

# fasr-vad-marblenet

[Chinese documentation](README_ZH.md)

NVIDIA MarbleNet voice activity detection for fasr. The plugin ships a bundled
ONNX model, so the default `marblenet` registry entry works without downloading
extra weights.

## Install

```bash
pip install fasr-vad-marblenet
```

## Registered Model

| Registry name | Class | Best for |
|---|---|---|
| `marblenet` | `MarbleNetForVAD` | Offline CPU-friendly VAD with ONNX Runtime |

## Pipeline Usage

```python
from fasr import AudioPipeline

pipeline = (
    AudioPipeline()
    .add_pipe(
        "detector",
        model="marblenet",
        speaking_score=0.55,
        silence_score=0.45,
        fusion_threshold=0.2,
    )
    .add_pipe("recognizer", model="paraformer")
)
```

Quick choices:

| Goal | Use | Result |
|---|---|---|
| Reduce false starts from noise | `speaking_score=0.65` | Speech starts only when the model is more confident |
| Keep quiet speech | `speaking_score=0.35` | More sensitive starts, with more risk of noise |
| End speech sooner | `silence_score=0.35` | Shorter segments, lower trailing silence |
| Avoid fragmented segments | `fusion_threshold=0.3` | Merges speech pieces separated by short pauses |
| Drop clicks or very short bursts | `min_speech_duration=0.1` | Filters segments shorter than 100 ms |
| Cap ASR segment length | `max_speech_duration=15.0` | Hard-splits long speech spans into 15-second pieces |
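To see how the post-processing parameters interact, here is a simplified, self-contained sketch of gap merging and short-segment filtering. This is an illustration of the concepts only, not the plugin's actual implementation; the `postprocess` helper and its tuple-based segment format are hypothetical.

```python
def postprocess(segments, fusion_threshold=0.1, min_speech_duration=0.05):
    """Illustrative sketch (not the plugin's code): merge segments separated
    by gaps <= fusion_threshold seconds, then drop segments shorter than
    min_speech_duration seconds. Segments are (start_s, end_s) tuples."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= fusion_threshold:
            # Gap is short enough: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Filter out bursts shorter than the minimum duration.
    return [(s, e) for s, e in merged if e - s >= min_speech_duration]

# Two bursts 0.15 s apart merge under fusion_threshold=0.2;
# a 30 ms click is filtered by min_speech_duration=0.1.
print(postprocess([(0.0, 1.0), (1.15, 2.0), (5.0, 5.03)],
                  fusion_threshold=0.2, min_speech_duration=0.1))
```

With these values the two speech pieces come back as a single `(0.0, 2.0)` segment and the click is gone, which is the behavior the table above describes.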

## Confection Config

```toml
[vad_model]
@vad_models = "marblenet"
speaking_score = 0.55
silence_score = 0.45
fusion_threshold = 0.2
```

Inside a pipeline:

```toml
[pipeline]
@pipelines = "AudioPipeline.v1"
pipe_order = ["detector"]

[pipeline.pipes]

[pipeline.pipes.detector]
@pipes = "thread_pipe"

[pipeline.pipes.detector.component]
@components = "detector"

[pipeline.pipes.detector.component.model]
@vad_models = "marblenet"
speaking_score = 0.55
silence_score = 0.45
fusion_threshold = 0.2
```

## Direct Model Usage

```python
from fasr.config import registry
from fasr.data import AudioSpan, Waveform

model = registry.vad_models.get("marblenet")(
    speaking_score=0.55,
    silence_score=0.45,
)

audio = AudioSpan(waveform=Waveform.from_file("example.wav"), start_ms=0)
segments = model.detect(audio)
for segment in segments:
    print(f"{segment.start_ms}ms - {segment.end_ms}ms")
```

Use a local ONNX directory when needed:

```python
model.load_checkpoint("/path/to/marblenet")
```

## Parameters

| Parameter | Type / range | Default | Higher value | Lower value | Change when |
|---|---|---|---|---|---|
| `speaking_score` | `float`, `0.0` to `1.0` | `0.5` | More conservative starts | More sensitive starts | Starts are too eager or quiet speech is missed |
| `silence_score` | `float`, `0.0` to `1.0` | `0.5` | Speech ends later | Speech ends sooner | Segments are too long or clipped |
| `fusion_threshold` | `float >= 0`, seconds | `0.1` | Merges wider gaps | Keeps nearby segments separate | Output is too fragmented or too merged |
| `min_speech_duration` | `float >= 0`, seconds | `0.05` | Filters more short segments | Keeps shorter bursts | Clicks leak through, or short words disappear |
| `max_speech_duration` | `float > 0` or `None`, seconds | `None` | Longer hard-split limit | Shorter hard-split limit | ASR works better with bounded segments |
| `intra_op_num_threads` | `int >= 0` | `2` | More CPU parallelism | Less CPU usage | CPU throughput needs tuning |
| `inter_op_num_threads` | `int >= 0` | `0` | More operator-level parallelism | Lets ORT decide | Advanced ONNX Runtime tuning |
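Keeping `speaking_score` above `silence_score` gives the detector hysteresis: a frame score must clear the higher bar to start speech but only fall below the lower bar to end it, which suppresses flicker near a single threshold. A minimal sketch of that idea (simplified frame logic, not MarbleNet's exact internals; `frames_to_speech_mask` is a hypothetical helper):

```python
def frames_to_speech_mask(scores, speaking_score=0.5, silence_score=0.5):
    """Hysteresis sketch: enter speech when a frame score rises above
    speaking_score, leave only when it drops below silence_score."""
    in_speech = False
    mask = []
    for s in scores:
        if not in_speech and s > speaking_score:
            in_speech = True
        elif in_speech and s < silence_score:
            in_speech = False
        mask.append(in_speech)
    return mask

# Scores dip to 0.4 mid-utterance but stay above silence_score=0.35,
# so the speech run is not broken.
print(frames_to_speech_mask([0.2, 0.6, 0.5, 0.4, 0.3],
                            speaking_score=0.55, silence_score=0.35))
```

With both thresholds at the same value there is no hysteresis band, and borderline scores can toggle the state every frame.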

## Tuning Guide

| Symptom | Try first |
|---|---|
| Noise starts speech segments | Raise `speaking_score` to `0.6` or `0.7` |
| Quiet speech start is missed | Lower `speaking_score` to `0.35` or `0.4` |
| Segment tail is too long | Lower `silence_score` to `0.35` or `0.4` |
| Speech is cut too early | Raise `silence_score` to `0.6` |
| Segments are too fragmented | Raise `fusion_threshold` to `0.2` or `0.3` |
| Very short false segments appear | Raise `min_speech_duration` to `0.1` |
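For noisy recordings, the table's first, fifth, and sixth suggestions can be combined into one config. This is a starting point to tune from, not a universally correct setting:

```toml
[vad_model]
@vad_models = "marblenet"
speaking_score = 0.65
silence_score = 0.45
fusion_threshold = 0.2
min_speech_duration = 0.1
```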

## Dependencies

- `fasr`
- `numpy >= 1.24`
- `onnxruntime >= 1.16.0`
- Python 3.10-3.12
