Metadata-Version: 2.4
Name: fasr-asr-qwen3asr
Version: 0.5.2
Summary: Qwen3 ASR model for fasr
Author-email: fasr <790990241@qq.com>
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: fasr
Requires-Dist: transformers==4.57.6
Requires-Dist: nagisa==0.2.11
Requires-Dist: soynlp==0.0.493
Requires-Dist: accelerate==1.12.0
Requires-Dist: vllm==0.14.0
Requires-Dist: librosa
Requires-Dist: soundfile

# fasr-asr-qwen3asr

[Chinese documentation](README_ZH.md)

Qwen3-ASR speech recognition for fasr. The plugin uses the bundled Qwen3-ASR
vLLM backend and supports both batch transcription and cumulative streaming
transcription on the same loaded engine.

## Install

```bash
pip install fasr-asr-qwen3asr
```

## Registered Model

| Registry name | Class selected by `size` | Best for |
|---|---|---|
| `qwen3asr` | `Qwen3ASRSmall` or `Qwen3ASRLarge` | GPU ASR without word timestamps |

Use `size="small"` for `Qwen/Qwen3-ASR-0.6B` and `size="large"` for
`Qwen/Qwen3-ASR-1.7B`.
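
The size-to-checkpoint mapping can be sketched as a simple lookup (checkpoint names are from above; the helper itself is illustrative, not part of the plugin API):

```python
# Illustrative mapping from the `size` parameter to the checkpoint it selects.
QWEN3_ASR_CHECKPOINTS = {
    "small": "Qwen/Qwen3-ASR-0.6B",
    "large": "Qwen/Qwen3-ASR-1.7B",
}

def checkpoint_for(size: str) -> str:
    """Return the Hugging Face checkpoint name for a given `size` value."""
    try:
        return QWEN3_ASR_CHECKPOINTS[size]
    except KeyError:
        raise ValueError(f"size must be 'small' or 'large', got {size!r}")
```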

## Pipeline Usage

```python
from fasr import AudioPipeline

pipeline = (
    AudioPipeline()
    .add_pipe("detector", model="fsmn")
    .add_pipe(
        "recognizer",
        model="qwen3asr",
        size="small",
        gpu_memory_utilization=0.7,
        max_new_tokens=2048,
    )
    .add_pipe("sentencizer", model="ct_transformer")
)
```

Quick choices:

| Goal | Use | Result |
|---|---|---|
| Lower VRAM | `size="small"` | Uses the 0.6B checkpoint |
| Better accuracy | `size="large"` | Uses the 1.7B checkpoint, needs more GPU memory |
| Leave GPU headroom | `gpu_memory_utilization=0.6` | vLLM reserves less memory |
| Long-form output | `max_new_tokens=4096` or higher | Allows longer generation |
| Bias vocabulary | `context="..."` | Adds prompt context for names or rare terms |
| Force language | `language="zh"` | Uses a fixed language hint |
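
As a rough rule of thumb, `gpu_memory_utilization` is the fraction of total VRAM that vLLM pre-reserves for weights plus KV cache. A back-of-the-envelope helper (illustrative only; actual usage depends on the model, context length, and vLLM version):

```python
def reserved_vram_gb(total_vram_gb: float, gpu_memory_utilization: float) -> float:
    """Approximate VRAM vLLM will claim: total VRAM times the utilization fraction."""
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_vram_gb * gpu_memory_utilization

# On a 24 GB card, the default 0.8 claims about 19.2 GB, leaving ~4.8 GB
# for other processes; dropping to 0.6 would leave ~9.6 GB free.
```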

## Confection Config

```toml
[asr_model]
@asr_models = "qwen3asr"
size = "small"
gpu_memory_utilization = 0.7
max_new_tokens = 2048
language = "zh"
context = "product names: fasr, Qwen3-ASR"
```

Inside a pipeline:

```toml
[pipeline]
@pipelines = "AudioPipeline.v1"
pipe_order = ["recognizer"]

[pipeline.pipes]

[pipeline.pipes.recognizer]
@pipes = "thread_pipe"
batch_size = 1

[pipeline.pipes.recognizer.component]
@components = "recognizer"

[pipeline.pipes.recognizer.component.model]
@asr_models = "qwen3asr"
size = "small"
gpu_memory_utilization = 0.7
max_new_tokens = 2048
language = "zh"
context = "product names: fasr, Qwen3-ASR"
```

## Direct Model Usage

```python
from fasr.config import registry

model = registry.asr_models.get("qwen3asr")(
    size="small",
    gpu_memory_utilization=0.7,
)

spans = model.transcribe(audio_spans)  # audio_spans: a list of AudioSpan inputs
for span in spans:
    print(span.text)
```

Streaming uses the same model instance and returns cumulative text. Consumers
should overwrite the displayed partial text instead of appending deltas:

```python
model = registry.asr_models.get("qwen3asr")(
    size="small",
    chunk_size_ms=2000,
    language="zh",
)

for chunk in audio_chunks:
    span = model.push_chunk(chunk)
    if span is not None:
        # Cumulative text: overwrite the previous partial instead of appending.
        print("\r" + span.text, end="", flush=True)
```
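
Because each streaming result is the full cumulative text, appending results would duplicate earlier words. A minimal sketch of the two consumption styles (pure Python, no fasr dependency; the partial strings are made up for illustration):

```python
def render_overwrite(partials):
    """Correct: each partial replaces the previous display; the last one wins."""
    display = ""
    for text in partials:
        display = text  # overwrite in place, e.g. print("\r" + text, end="")
    return display

def delta_since(previous: str, current: str) -> str:
    """If you do need deltas, strip the shared prefix yourself."""
    i = 0
    while i < min(len(previous), len(current)) and previous[i] == current[i]:
        i += 1
    return current[i:]

partials = ["hello", "hello wor", "hello world"]
assert render_overwrite(partials) == "hello world"
assert delta_since("hello wor", "hello world") == "ld"
```

Note that cumulative text may also revise earlier words (see `unfixed_token_num`), so prefix-diffing is only an approximation; overwriting the display is the robust option.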

## Parameters

| Parameter | Type / range | Default | Higher / true | Lower / false | Change when |
|---|---|---|---|---|---|
| `size` | `"small"` or `"large"` | `"small"` | `"large"` improves capacity, needs more VRAM | `"small"` is cheaper | Accuracy or resource budget changes |
| `gpu_memory_utilization` | `float`, `(0, 1]` | `0.8` | More KV-cache headroom, more VRAM reserved | Leaves more VRAM for other processes | vLLM OOMs or underuses GPU |
| `max_new_tokens` | `int >= 1` | `4096` | Longer outputs, more decode work | Shorter cap, less compute | Text is truncated or memory is tight |
| `max_inference_batch_size` | `int`, `-1` or positive | `-1` | `-1` lets backend choose | Smaller cap reduces peak memory | Batch inference OOMs |
| `max_model_len` | `int` or `None` | `None` | Longer prompt+generation context | Shorter context, less memory | Long prompts or OOMs |
| `language` | `str` or `None` | `None` | Fixed language hint | Auto language behavior | Language is known |
| `context` | `str` | `""` | More biasing context | Less prompt bias | Rare terms or names are missed |
| `chunk_size_ms` | `int` or `None` | `None` | Less frequent streaming decode | More responsive streaming | Streaming latency/throughput tuning |
| `unfixed_chunk_num` | `int` or `None` | `None` | More initial chunks without prefix fixing | Earlier prefix use | Early streaming text is unstable |
| `unfixed_token_num` | `int` or `None` | `None` | More rollback, more correction room | Less rollback, more stable prefix | Streaming revisions are too aggressive or too sticky |

Generic checkpoint fields such as `checkpoint`, `cache_dir`, `endpoint`,
`revision`, and `force_download` are inherited from the base model.
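
To reason about streaming cadence: with `chunk_size_ms=2000` the model runs one cumulative decode roughly every 2 s of audio. Assuming 16 kHz mono input (a common ASR default, not confirmed by this plugin), the arithmetic looks like:

```python
import math

def chunk_samples(chunk_size_ms: int, sample_rate: int = 16_000) -> int:
    """Samples buffered per streaming chunk."""
    return sample_rate * chunk_size_ms // 1000

def decodes_for(duration_s: float, chunk_size_ms: int) -> int:
    """Approximate number of cumulative decodes for a clip."""
    return math.ceil(duration_s * 1000 / chunk_size_ms)

# chunk_samples(2000) == 32_000; a 10 s clip triggers 5 decodes at 2000 ms chunks.
```

Larger chunks mean fewer, cheaper decodes per second of audio; smaller chunks update the partial text more often at higher cost.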

## Output

- Batch mode writes full text to `span.raw_text`.
- Streaming mode returns cumulative `AudioSpan(raw_text=...)`.
- Word or character timestamps are not returned by this plugin.
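
Downstream code can join the span texts into a single transcript. A sketch using a stand-in for `AudioSpan` (the real class lives in fasr; only the `raw_text` field from above is assumed here):

```python
from dataclasses import dataclass

@dataclass
class AudioSpan:
    """Stand-in for fasr's AudioSpan; only raw_text is modeled."""
    raw_text: str

def join_transcript(spans, sep: str = " ") -> str:
    """Concatenate non-empty span texts in order."""
    return sep.join(s.raw_text for s in spans if s.raw_text)

spans = [AudioSpan("hello"), AudioSpan(""), AudioSpan("world")]
assert join_transcript(spans) == "hello world"
```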

## Dependencies

- `fasr`
- `transformers == 4.57.6`
- `vllm == 0.14.0`
- `accelerate == 1.12.0`
- `librosa`
- `soundfile`
- Python 3.10-3.12
