Metadata-Version: 2.4
Name: sinapsis-diarization
Version: 0.2.0
Summary: Speech processing templates and pipelines for transcription, speaker diarization, and emotion analysis
Author-email: SinapsisAI <dev@sinapsis.tech>
Requires-Python: >=3.10.12
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: src/sinapsis_diarization/bin/LICENSE
License-File: src/sinapsis_diarization/cli/LICENSE
License-File: src/sinapsis_diarization/pipelines/LICENSE
Requires-Dist: nemo-toolkit[asr,core,lightning]==2.6.2
Requires-Dist: nltk>=3.9.3
Requires-Dist: sacrebleu>=2.6.0
Requires-Dist: sinapsis>=0.2.26
Requires-Dist: torchaudio==2.8.0
Provides-Extra: pyannote
Requires-Dist: pyannote-audio>=4.0.4; extra == "pyannote"
Requires-Dist: torchcodec<0.8.0,>=0.7.0; extra == "pyannote"
Provides-Extra: data-tools
Requires-Dist: sinapsis-data-readers>=0.1.28; extra == "data-tools"
Provides-Extra: all
Requires-Dist: sinapsis-diarization[data-tools,dev,emotion,pyannote,whisperx]; extra == "all"
Provides-Extra: emotion
Requires-Dist: ruamel-yaml==0.18.17; extra == "emotion"
Requires-Dist: speechbrain; extra == "emotion"
Provides-Extra: whisperx
Requires-Dist: onnx>=1.20.1; extra == "whisperx"
Requires-Dist: onnxruntime==1.23.2; extra == "whisperx"
Requires-Dist: whisperx>=3.8.0; extra == "whisperx"
Provides-Extra: dev
Requires-Dist: ty>=0.0.31; extra == "dev"
Dynamic: license-file

[![sp](https://img.shields.io/badge/lang-sp-red.svg)](https://github.com/Sinapsis-AI/sinapsis-diarization/blob/main/README.es.md)
<h1 align="center">
<br>
<br>
<a href="https://sinapsis.tech/">
  <img
    src="https://github.com/Sinapsis-AI/brand-resources/blob/main/sinapsis_logo/4x/logo.png?raw=true"
    alt="" width="300">
</a>
<br>
Sinapsis Diarization
<br>
</h1>

<h4 align="center">Templates for Automatic Speech Recognition, Diarization and Emotion Recognition.</h4>

<p align="center">
<a href="#installation">🐍 Installation</a> •
<a href="#features">🚀 Features</a> •
<a href="#usage-example">📚 Example usage</a> •
<a href="#cli">CLI</a> •
<a href="#documentation">📙 Documentation</a> •
<a href="#license">🔍 License</a>
</p>

The `sinapsis-diarization` module provides templates for Automatic Speech Recognition, Speaker Diarization, and Emotion Recognition, enabling efficient and accurate audio analysis pipelines.


<h2 id="installation">🐍 Installation</h2>

Install using your package manager of choice. We encourage the use of `uv`:

```bash
uv pip install sinapsis-diarization
```
or with raw pip:
```bash
pip install sinapsis-diarization
```

> [!IMPORTANT]
> Templates in the sinapsis-diarization package may require extra dependencies. For development, we recommend installing the package with all the optional dependencies:

```bash
uv pip install sinapsis-diarization[all] --extra-index-url https://pypi.sinapsis.tech
```
or
```bash
pip install sinapsis-diarization[all] --extra-index-url https://pypi.sinapsis.tech
```


> [!IMPORTANT]
> Templates in the sinapsis-diarization package may require a Hugging Face token. Set the environment variable for Hugging Face using
> <code>export HF_TOKEN="your_huggingface_token"</code>
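
A missing token may only surface once a template tries to download a gated model, so it can be useful to check the variable before starting a long run. The sketch below is a minimal illustration; `require_hf_token` is a hypothetical helper and is not part of `sinapsis-diarization`.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise if it is missing."""
    # `env` defaults to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError('HF_TOKEN is not set; run: export HF_TOKEN="your_huggingface_token"')
    return token
```

Calling `require_hf_token()` at the top of a launch script fails fast with a clear message instead of partway through a pipeline.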



<h2 id="features">🚀 Features</h2>

<h3> Templates</h3>

The **Sinapsis Diarization** module provides multiple templates for Automatic Speech Recognition, Diarization, Emotion Recognition and combined pipelines.

<h4>Automatic Speech Recognition (ASR)</h4>

These templates transcribe audio files.

- **SinapsisParakeetASR**: Runs Parakeet speech recognition for audio transcription.
- **SinapsisCanaryASR**: Runs Canary speech recognition for audio transcription.

<h4>Diarization</h4>

These templates identify the speakers in a conversation and the times at which each one speaks.

- **SinapsisSortformerDiarizer**: Runs Sortformer to diarize the speakers in an audio file.
- **SinapsisPyannoteDiarizer**: Runs Pyannote to diarize the speakers in an audio file.

<h4>ASR plus Diarization</h4>

These templates combine ASR with Diarization to transcribe a conversation and identify the speakers automatically.

- **PyannoteDiarizedTranscription**: Runs Parakeet and Pyannote to transcribe a conversation with speaker identification.
- **SortformerDiarizedTranscription**: Runs Parakeet and Sortformer to transcribe a conversation with speaker identification.
- **WhisperxDiarizedTranscription**: Runs WhisperX to transcribe an audio file and split the transcript by speaker and time.

<h4>ASR plus Diarization and Emotion Recognition</h4>

These templates combine the diarized transcription templates with an emotion recognition pipeline to show the emotions detected for each speaker in the audio.

- **PyannoteEmotionTranscription**: Runs Parakeet, Pyannote and SpeechBrain to transcribe an audio file, split it by speaker and assign emotions to the segments.
- **SortformerEmotionTranscription**: Runs Parakeet, Sortformer and SpeechBrain to transcribe an audio file, split it by speaker and assign emotions to the segments.

Examples of template use can be found in <code>src/sinapsis_diarization/configs</code>.

To get information on any of the templates, run <code>uv run sinapsis info --template TemplateName</code>, replacing TemplateName with the name of any sinapsis-diarization template.
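
As a small convenience, the info command can be generated for every template listed in this section. This is only a sketch: `info_command` is a hypothetical helper, and the template names are copied from the lists above.

```python
import shlex

# Template names as listed in the Features section of this README.
TEMPLATES = [
    "SinapsisParakeetASR",
    "SinapsisCanaryASR",
    "SinapsisSortformerDiarizer",
    "SinapsisPyannoteDiarizer",
    "PyannoteDiarizedTranscription",
    "SortformerDiarizedTranscription",
    "WhisperxDiarizedTranscription",
    "PyannoteEmotionTranscription",
    "SortformerEmotionTranscription",
]


def info_command(template_name):
    """Build the argument list for `uv run sinapsis info --template <name>`."""
    if template_name not in TEMPLATES:
        raise ValueError(f"unknown template: {template_name}")
    return ["uv", "run", "sinapsis", "info", "--template", template_name]


if __name__ == "__main__":
    # Print one ready-to-paste shell command per template.
    for name in TEMPLATES:
        print(shlex.join(info_command(name)))
```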


<h2 id="usage-example">📚 Example usage</h2>

All templates can be run through YAML configuration files using the `sinapsis run` command. The example below shows how to set up the **WhisperxDiarizedTranscription** template, which performs ASR and speaker diarization in a single pipeline.

Configuration attributes:
- **root_dir**: Base directory for resolving relative file paths.
- **audio_file_path**: Path to the audio file to process (relative to `root_dir`).
- **asr_model_name**: WhisperX model size. Available options: `"tiny"`, `"base"`, `"small"`, `"medium"`, `"large-v1"`, `"large-v2"`, `"large-v3"`.
- **device**: `"cuda"` or `"cpu"`.
- **sample_rate**: Sample rate of the audio in Hz (`16000` in the example config).
- **chunk_size_in_secs**: Split audio into chunks of *n* seconds for processing. Use `-1` to process the full audio at once.
- **min_speakers** / **max_speakers**: Expected speaker count range.
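
To make the `chunk_size_in_secs` semantics concrete, the sketch below computes the start and end times, in seconds, that plain fixed-size chunking would produce. It is an illustration only, not the template's implementation: `chunk_bounds` is a hypothetical helper, and any overlap or padding between chunks is an unknown that this sketch ignores.

```python
def chunk_bounds(duration_secs, chunk_size_in_secs):
    """Return (start, end) pairs in seconds covering an audio of the given duration."""
    if duration_secs <= 0:
        return []
    if chunk_size_in_secs == -1:
        # -1 means "process the full audio at once".
        return [(0.0, float(duration_secs))]
    if chunk_size_in_secs <= 0:
        raise ValueError("chunk_size_in_secs must be positive or -1")
    bounds = []
    start = 0.0
    while start < duration_secs:
        # The last chunk is shorter when the duration is not a multiple of the chunk size.
        end = min(start + chunk_size_in_secs, duration_secs)
        bounds.append((start, float(end)))
        start = end
    return bounds
```

Under these assumptions, a 50-second file with `chunk_size_in_secs: 20` is split into three chunks (0 to 20, 20 to 40 and 40 to 50 seconds), while `-1` yields a single chunk spanning the whole file.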

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Config file</span></strong></summary>

```yaml
agent:
  name: whisperx_asr_diarization
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: WhisperxDiarizedTranscription
  class_name: WhisperxDiarizedTranscription
  template_input: InputTemplate
  attributes:
    root_dir: path_to_root_dir
    audio_file_path: audio_sample.mp3
    asr_model_name: large-v3
    device: cuda
    sample_rate: 16000
    chunk_size_in_secs: 20
    min_speakers: 2
    max_speakers: 2
```
</details>

To run the agent with the config above:

```bash
sinapsis run /path/to/sinapsis-diarization/src/sinapsis_diarization/configs/transcription_with_diarization/whisperx_asr_diarization.yaml
```

More config examples for all available templates can be found in <code>src/sinapsis_diarization/configs/</code>.
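
Before launching an agent, the `attributes` of a **WhisperxDiarizedTranscription** config can be sanity-checked against the options documented above. The sketch below is an assumption-laden illustration: `check_whisperx_attributes` is a hypothetical helper, and the README does not state which of these checks the template performs itself.

```python
# Allowed values as documented in the "Configuration attributes" list.
WHISPERX_MODEL_SIZES = {"tiny", "base", "small", "medium", "large-v1", "large-v2", "large-v3"}
DEVICES = {"cuda", "cpu"}


def check_whisperx_attributes(attrs):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if attrs.get("asr_model_name") not in WHISPERX_MODEL_SIZES:
        problems.append(f"asr_model_name must be one of {sorted(WHISPERX_MODEL_SIZES)}")
    if attrs.get("device") not in DEVICES:
        problems.append("device must be 'cuda' or 'cpu'")
    chunk = attrs.get("chunk_size_in_secs")
    if chunk is not None and chunk != -1 and chunk <= 0:
        problems.append("chunk_size_in_secs must be positive, or -1 for the full audio")
    lo, hi = attrs.get("min_speakers"), attrs.get("max_speakers")
    if lo is not None and hi is not None and lo > hi:
        problems.append("min_speakers must not exceed max_speakers")
    return problems
```

For the example config shown earlier (model `large-v3`, device `cuda`, chunk size 20, two speakers), the function returns an empty list.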


<h2 id="cli">📙 CLI</h2>

The pipelines for ASR, Diarization and Emotion Recognition are available as CLI commands. Each command takes an audio file and model options as input, then writes its results to an output directory (<code>results</code> by default). Required arguments are marked with <code>(required)</code>.

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Sinapsis ASR</span></strong></summary>

Transcribe an audio file using a single ASR engine.

Run using <code>uv run sinapsis-asr</code>

Example:
<code>uv run sinapsis-asr --audio "path to audio" --model parakeet --chunk-size-in-secs 20 --device cuda</code>

This command has the following options:

```bash
--audio AUDIO_PATH               Path to audio (required)
--model MODEL                    Type of model to run (required)
--device DEVICE                  Device to run the model (required)
--chunk-size-in-secs CHUNK_SIZE  Size of chunks in seconds
--model-name MODEL_NAME          Name of model to use
--sample-rate SAMPLE_RATE        Sample rate of audio
--output-dir OUTPUT_DIR          Output directory for results
```
**Models**
- "parakeet"
- "canary"

**Model names**
- Parakeet:
  - "nvidia/parakeet-tdt-0.6b-v2"
  - "nvidia/parakeet-tdt-0.6b-v3"

- Canary:
  - "nvidia/canary-qwen-2.5b"

**Device options**
- "cuda"
- "cpu"
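
When the command is launched from a script, building the argument list programmatically lets the documented choices be checked first. The sketch below is a hedged illustration: `build_asr_command` is a hypothetical helper, and it uses only the flags and values listed in this section.

```python
import shlex

# Choices as documented above.
ASR_MODELS = {"parakeet", "canary"}
DEVICES = {"cuda", "cpu"}


def build_asr_command(audio, model, device, chunk_size_in_secs=None,
                      model_name=None, sample_rate=None, output_dir=None):
    """Build the argument list for `uv run sinapsis-asr` from validated options."""
    if model not in ASR_MODELS:
        raise ValueError(f"model must be one of {sorted(ASR_MODELS)}")
    if device not in DEVICES:
        raise ValueError(f"device must be one of {sorted(DEVICES)}")
    cmd = ["uv", "run", "sinapsis-asr", "--audio", str(audio), "--model", model, "--device", device]
    optional = {
        "--chunk-size-in-secs": chunk_size_in_secs,
        "--model-name": model_name,
        "--sample-rate": sample_rate,
        "--output-dir": output_dir,
    }
    # Only pass the optional flags that were explicitly set.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_asr_command("audio_sample.mp3", "parakeet", "cuda", chunk_size_in_secs=20)))
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute the command, assuming `uv` and the package are installed.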


</details>

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Sinapsis Diarize</span></strong></summary>

Identify speakers and their turn times in an audio file using a single diarization engine.

Run using <code>uv run sinapsis-diarize</code>

Example:
<code>uv run sinapsis-diarize --audio "path to audio" --model sortformer --chunk-size-in-secs 20 --device cuda</code>

This command has the following options:

```bash
--audio AUDIO_PATH               Path to audio (required)
--model MODEL                    Type of model to run (required)
--device DEVICE                  Device to run the model (required)
--chunk-size-in-secs CHUNK_SIZE  Size of chunks in seconds
--model-name MODEL_NAME          Name of model to use
--sample-rate SAMPLE_RATE        Sample rate of audio
--output-dir OUTPUT_DIR          Output directory for results
```
**Models**
- "sortformer"
- "pyannote"

**Model names**
- Sortformer:
  - "nvidia/diar_streaming_sortformer_4spk-v2.1"

- Pyannote:
  - "pyannote/speaker-diarization-community-1"

**Device options**
- "cuda"
- "cpu"
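
A common first step with diarization output is to total the speaking time per speaker. The sketch below assumes the turns have already been loaded as `(speaker, start, end)` tuples with times in seconds; that structure and the `speaking_time` helper are hypothetical, since the output file format of this command is not described here.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum the duration of each speaker's turns; `turns` holds (speaker, start, end) tuples."""
    totals = defaultdict(float)
    for speaker, start, end in turns:
        if end < start:
            raise ValueError(f"turn ends before it starts: {(speaker, start, end)}")
        totals[speaker] += end - start
    return dict(totals)
```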


</details>

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Sinapsis ASR Diarize</span></strong></summary>

Transcribe and diarize an audio file by composing separate ASR and diarization engines. Produces a speaker-attributed transcript.

Run using <code>uv run sinapsis-asr-diarize</code>

Example:

<code>uv run sinapsis-asr-diarize --audio "path to audio" --asr-model parakeet --diarization-model sortformer --chunk-size-in-secs 20 --device cuda</code>

This command has the following options:

```bash
--audio AUDIO_PATH                               Path to audio (required)
--asr-model ASR_MODEL                            Type of ASR model to run (required)
--diarization-model DIARIZATION_MODEL            Type of Diarization model to run (required)
--device DEVICE                                  Device to run the model (required)
--chunk-size-in-secs CHUNK_SIZE                  Size of chunks in seconds
--asr-model-name ASR_MODEL_NAME                  Name of ASR model to use
--diarization-model-name DIARIZATION_MODEL_NAME  Name of Diarization model to use
--sample-rate SAMPLE_RATE                        Sample rate of audio
--output-dir OUTPUT_DIR                          Output directory for results
--num-speakers NUM_SPEAKERS                      Number of speakers for models that require it
```
**ASR Models**
- "parakeet"
- "canary"

**Diarization Models**
- "sortformer"
- "pyannote"

**Model names**
- Parakeet:
  - "nvidia/parakeet-tdt-0.6b-v2"
  - "nvidia/parakeet-tdt-0.6b-v3"

- Canary:
  - "nvidia/canary-qwen-2.5b"

- Sortformer:
  - "nvidia/diar_streaming_sortformer_4spk-v2.1"

- Pyannote:
  - "pyannote/speaker-diarization-community-1"

**Device options**
- "cuda"
- "cpu"
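
To illustrate what composing ASR with diarization means, the sketch below assigns each timestamped word to the speaker turn it overlaps most. This is a generic technique shown under stated assumptions, not the pipeline's actual implementation: the `(text, start, end)` and `(speaker, start, end)` structures and both helper names are hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def attribute_words(words, turns):
    """Label each (text, start, end) word with the speaker whose turn overlaps it most."""
    result = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            ov = overlap(w_start, w_end, t_start, t_end)
            if ov > best_overlap:
                best_speaker, best_overlap = speaker, ov
        # Words that overlap no turn keep speaker None.
        result.append((best_speaker, text))
    return result
```

A word that straddles two turns goes to the turn covering more of it; a word outside every turn is left unattributed.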


</details>

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Sinapsis Whisperx ASR Diarize</span></strong></summary>

Transcribe and diarize an audio file using the monolithic WhisperX pipeline. Unlike <code>sinapsis-asr-diarize</code>, which composes separately selected ASR and diarization engines, this command runs a single integrated pipeline for both tasks.

Run using <code>uv run sinapsis-whisperx-asr-diarize</code>

Example:

<code>uv run sinapsis-whisperx-asr-diarize --audio "path to audio" --model-name large-v3 --chunk-size-in-secs 20 --device cuda</code>

This command has the following options:

```bash
--audio AUDIO_PATH               Path to audio (required)
--model-name MODEL               WhisperX model size (required)
--device DEVICE                  Device to run the model (required)
--chunk-size-in-secs CHUNK_SIZE  Size of chunks in seconds
--sample-rate SAMPLE_RATE        Sample rate of audio
--output-dir OUTPUT_DIR          Output directory for results
--min-speakers MIN_SPEAKERS      Minimum number of speakers
--max-speakers MAX_SPEAKERS      Maximum number of speakers
```
**Model names**
- "tiny"
- "base"
- "small"
- "medium"
- "large-v1"
- "large-v2"
- "large-v3"

**Device options**
- "cuda"
- "cpu"


</details>

<details>
  <summary id="docker"><strong><span style="font-size: 1.2em;">Sinapsis ASR Diarize Emotion</span></strong></summary>

Transcribe, diarize, and detect emotions in an audio file. Composes ASR, diarization, and emotion detection engines to produce a speaker-attributed transcript with per-segment emotions.

Run using <code>uv run sinapsis-asr-diarize-emotion</code>

Example:

<code>uv run sinapsis-asr-diarize-emotion --audio "path to audio" --asr-model parakeet --diarization-model sortformer --emotion-model speechbrain --chunk-size-in-secs 20 --device cuda</code>

This command has the following options:

```bash
--audio AUDIO_PATH                               Path to audio (required)
--asr-model ASR_MODEL                            Type of ASR model to run (required)
--diarization-model DIARIZATION_MODEL            Type of Diarization model to run (required)
--emotion-model EMOTION_MODEL                    Type of Emotion model to run (required)
--device DEVICE                                  Device to run the model (required)
--chunk-size-in-secs CHUNK_SIZE                  Size of chunks in seconds
--asr-model-name ASR_MODEL_NAME                  Name of ASR model to use
--diarization-model-name DIARIZATION_MODEL_NAME  Name of Diarization model to use
--emotion-model-name EMOTION_MODEL_NAME          Name of emotion model to use
--sample-rate SAMPLE_RATE                        Sample rate of audio
--output-dir OUTPUT_DIR                          Output directory for results
--num-speakers NUM_SPEAKERS                      Number of speakers for models that require it
```
**ASR Models**
- "parakeet"
- "canary"

**Diarization Models**
- "sortformer"
- "pyannote"

**Emotion Models**
- "speechbrain"

**Model names**
- Parakeet:
  - "nvidia/parakeet-tdt-0.6b-v2"
  - "nvidia/parakeet-tdt-0.6b-v3"

- Canary:
  - "nvidia/canary-qwen-2.5b"

- Sortformer:
  - "nvidia/diar_streaming_sortformer_4spk-v2.1"

- Pyannote:
  - "pyannote/speaker-diarization-community-1"

**Device options**
- "cuda"
- "cpu"
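
Per-segment emotions are often easier to read once summarized per speaker. The sketch below picks the most frequent emotion label for each speaker. It assumes the segments have been loaded as dictionaries with `speaker` and `emotion` keys; that structure, the example labels and the `dominant_emotions` helper are hypothetical, since the output format of this command is not described here.

```python
from collections import Counter, defaultdict


def dominant_emotions(segments):
    """Map each speaker to the emotion label that appears most often in their segments."""
    counts = defaultdict(Counter)
    for segment in segments:
        counts[segment["speaker"]][segment["emotion"]] += 1
    # most_common(1) returns a one-item list holding the (label, count) pair with the highest count.
    return {speaker: counter.most_common(1)[0][0] for speaker, counter in counts.items()}
```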


</details>


<h2 id="documentation">📙 Documentation</h2>

Documentation is available on the [sinapsis website](https://docs.sinapsis.tech/docs).

Tutorials for different projects within sinapsis are available on the [sinapsis tutorials page](https://docs.sinapsis.tech/tutorials).

<h2 id="license">🔍 License</h2>

The templates in this project are licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the [LICENSE](LICENSE) file.

The command line interface and pipelines in this project are licensed under the MIT license, which allows for unrestricted use of the software and encourages open collaboration. For more details, please refer to the [LICENSE](src/sinapsis_diarization/pipelines/LICENSE) file.

For commercial use, please refer to our [official Sinapsis website](https://sinapsis.tech) for information on obtaining a commercial license.
