Metadata-Version: 2.3
Name: fasr-vad-fsmn
Version: 0.5.2
Summary: fsmn vad model for fasr
Author: osc
Author-email: osc <790990241@qq.com>
Requires-Dist: fasr
Requires-Dist: funasr-onnx
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.16,<1.24
Requires-Python: >=3.10, <3.13
Description-Content-Type: text/markdown

# fasr-vad-fsmn

[Chinese documentation](README_ZH.md)

FSMN voice activity detection for fasr. The offline `fsmn` model delegates
feature extraction and ONNX inference to `funasr_onnx`; the plugin also provides
`fsmn_online` for streaming VAD.

## Install

```bash
pip install fasr-vad-fsmn
```

## Registered Models

| Registry name | Class | Best for |
|---|---|---|
| `fsmn` | `FSMNVad` | Offline VAD, segmenting complete audio into speech spans |
| `fsmn_online` | `FSMNVadOnline` | Streaming VAD, emitting speech chunks as audio arrives |
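
Both names resolve through fasr's registry. A minimal lookup sketch, mirroring the construction call used later in this README:

```python
from fasr.config import registry

FSMNVad = registry.vad_models.get("fsmn")               # offline detector
FSMNVadOnline = registry.vad_models.get("fsmn_online")  # streaming detector
```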

## Pipeline Usage

Keyword arguments other than `component`, `model`, `batch_size`, and the other
pipe options are forwarded to the detector model, so FSMN parameters can sit
directly on the detector pipe:

```python
from fasr import AudioPipeline

pipeline = (
    AudioPipeline()
    .add_pipe(
        "detector",
        model="fsmn",
        max_end_silence_time=600,
        speech_noise_thres=0.55,
        num_threads=4,
    )
    .add_pipe("recognizer", model="paraformer")
    .add_pipe("sentencizer", model="ct_transformer")
)
```

Quick choices:

| Goal | Use | Result |
|---|---|---|
| Keep long sentences together | `max_end_silence_time=1000` | Short pauses inside a sentence are less likely to split the segment |
| Lower endpoint latency | `max_end_silence_time=300` | Segments end sooner, but sentences may be split more often |
| Suppress noisy backgrounds | `speech_noise_thres=0.7` | Fewer noise false positives, with higher risk of missing quiet speech |
| Keep quiet or far-field speech | `speech_noise_thres=0.45` | More sensitive detection, with higher risk of including noise |
| Increase CPU throughput | `num_threads=4` or `num_threads=8` | More ONNX Runtime CPU parallelism, with higher CPU usage |
| Use GPU | `device_id=0` | Uses GPU 0 through ONNX Runtime, after installing `onnxruntime-gpu` |
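
As an illustration of combining these choices, a low-latency, noise-robust detector pipe might look like this (values taken from the table above):

```python
from fasr import AudioPipeline

# Sketch: faster endpointing plus a more conservative speech/noise threshold.
pipeline = AudioPipeline().add_pipe(
    "detector",
    model="fsmn",
    max_end_silence_time=300,  # end segments sooner, at the cost of more splits
    speech_noise_thres=0.7,    # fewer noise false positives, may miss quiet speech
)
```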

## Confection Config

fasr config files use Confection's TOML-style syntax, not YAML.

To configure only the VAD model:

```toml
[vad_model]
@vad_models = "fsmn"
max_end_silence_time = 600
speech_noise_thres = 0.55
num_threads = 4
```

Inside a pipeline, model parameters live under
`pipeline.pipes.detector.component.model`:

```toml
[pipeline]
@pipelines = "AudioPipeline.v1"
pipe_order = ["detector"]

[pipeline.pipes]

[pipeline.pipes.detector]
@pipes = "thread_pipe"
batch_size = 4
batch_timeout = 0.1

[pipeline.pipes.detector.component]
@components = "detector"
num_threads = 2
max_segment_duration = 30.0

[pipeline.pipes.detector.component.model]
@vad_models = "fsmn"
max_end_silence_time = 600
speech_noise_thres = 0.55
num_threads = 4
```
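
Loading such a file follows the usual Confection pattern. The sketch below assumes fasr's `registry` exposes Confection's standard `resolve`:

```python
from confection import Config

from fasr.config import registry

# Assumption: fasr's registry follows Confection's resolve() convention.
config = Config().from_disk("pipeline.toml")
resolved = registry.resolve(config)
pipeline = resolved["pipeline"]
```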

## Direct Model Usage

Model construction automatically downloads and loads the checkpoint.

```python
from fasr.config import registry
from fasr.data import AudioSpan, Waveform

model = registry.vad_models.get("fsmn")(
    max_end_silence_time=600,
    speech_noise_thres=0.55,
)

audio = AudioSpan(waveform=Waveform.from_file("example.wav"), start_ms=0)
segments = model.detect(audio)
for segment in segments:
    print(f"{segment.start_ms}ms - {segment.end_ms}ms")
```

Use a local checkpoint directory when needed:

```python
model.load_checkpoint("/path/to/fsmn-vad")
```

## Parameters

Offline `fsmn` exposes only the parameters that still affect `funasr_onnx`
inference. Generic checkpoint fields such as `checkpoint`, `cache_dir`,
`endpoint`, `revision`, and `force_download` are inherited from the base model.

| Parameter | Type / range | Default | Higher value | Lower value | Change when |
|---|---|---|---|---|---|
| `sample_rate` | `int`, recommended `16000` | `16000` | Not recommended; adds resampling/inference cost | Not recommended; may lose speech detail | Usually never; keep model input at 16 kHz |
| `device_id` | `None`, `-1`, `"cpu"`, or GPU id like `0` | `None` | GPU id uses that GPU | `None` / `-1` / `"cpu"` uses CPU | You need lower latency or higher concurrency |
| `num_threads` | `int >= 0` | `2` | Often faster on CPU, but uses more cores | Saves CPU, may slow inference | CPU deployment needs tuning |
| `max_end_silence_time` | `int >= 0`, milliseconds | `800` | More tolerant of pauses; longer, more complete segments; later endpoint | Faster endpoint; more fragmented segments | Sentences are split too often, or endpoint latency is too high |
| `speech_noise_thres` | `float`, `0.0` to `1.0` | `0.6` | More conservative; fewer noise false positives; may miss quiet speech | More sensitive; keeps weak speech; may include noise | Noise is detected as speech, or quiet speech is missed |
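
For reference, constructing the offline model with every tunable spelled out at its default:

```python
from fasr.config import registry

# All values below are the defaults from the table above.
model = registry.vad_models.get("fsmn")(
    sample_rate=16000,
    device_id=None,           # CPU
    num_threads=2,
    max_end_silence_time=800,
    speech_noise_thres=0.6,
)
```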

## Tuning Guide

| Symptom | Try first |
|---|---|
| One sentence is split into many pieces | Raise `max_end_silence_time` to `1000` or `1200` |
| Speech end is detected too late | Lower `max_end_silence_time` to `300` to `500` |
| Background noise becomes speech | Raise `speech_noise_thres` to `0.7` or `0.8` |
| Quiet or far-field speech is missed | Lower `speech_noise_thres` to `0.45` or `0.5` |
| CPU usage is too high | Lower `num_threads` |
| CPU inference is too slow | Raise `num_threads`, or install `onnxruntime-gpu` and set `device_id=0` |

For `fsmn_online`, use `device="cpu"` or `device="cuda"` instead of
`device_id`. It also exposes `chunk_size_ms`: smaller chunks improve realtime
responsiveness but increase scheduling overhead; larger chunks improve
throughput but delay output. The default `100` ms is a good starting point.
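
Constructing the streaming model with those two knobs (the chunk-feeding loop is omitted here, since the streaming call signature belongs to fasr's detector component):

```python
from fasr.config import registry

stream_model = registry.vad_models.get("fsmn_online")(
    chunk_size_ms=100,  # default; smaller = lower latency, larger = higher throughput
    device="cpu",       # or "cuda" with onnxruntime-gpu installed
)
```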

## CPU / GPU

Inference runs on the CPU build of ONNX Runtime by default. During model
loading, the plugin logs whether CPU or GPU is in use.

For GPU inference:

```bash
uv pip install onnxruntime-gpu
```

```python
model = registry.vad_models.get("fsmn")(device_id=0)
stream_model = registry.vad_models.get("fsmn_online")(device="cuda")
```
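
To confirm the GPU build is actually active, ONNX Runtime itself can report its available execution providers:

```python
import onnxruntime

# With onnxruntime-gpu installed, this list should include
# "CUDAExecutionProvider" ahead of "CPUExecutionProvider".
print(onnxruntime.get_available_providers())
```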

## Dependencies

- `fasr`
- `funasr-onnx`
- `numpy >= 1.24`
- `onnxruntime >= 1.16, < 1.24`
- Python 3.10-3.12
