Metadata-Version: 2.4
Name: voicebridge
Version: 0.0.3
Summary: Comprehensive bidirectional voice-text CLI tool with Whisper and VibeVoice
Author: Patrick
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.16.0
Requires-Dist: rich>=14.0.0
Requires-Dist: pydantic>=2.11.0
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: openai-whisper>=20250625
Requires-Dist: pygame>=2.6.1
Requires-Dist: setuptools<81,>=70.0.0
Requires-Dist: PyAudio>=0.2.14
Requires-Dist: soundfile>=0.13.1
Requires-Dist: librosa>=0.11.0
Requires-Dist: noisereduce>=3.0.3
Requires-Dist: pydub>=0.25.1
Requires-Dist: ffmpy>=0.6.1
Requires-Dist: torch>=2.8.0
Requires-Dist: torchaudio>=2.8.0
Requires-Dist: torchvision>=0.23.0
Requires-Dist: torchmetrics>=1.8.2
Requires-Dist: transformers==4.51.3
Requires-Dist: accelerate>=1.10.1
Requires-Dist: diffusers>=0.35.1
Requires-Dist: safetensors>=0.6.2
Requires-Dist: tiktoken>=0.11.0
Requires-Dist: tokenizers>=0.21.4
Requires-Dist: anyascii>=0.3.2
Requires-Dist: coqpit>=0.0.17
Requires-Dist: encodec>=0.1.1
Requires-Dist: ml-collections>=0.1.1
Requires-Dist: absl-py>=2.0.0
Requires-Dist: av>=13.1.0
Requires-Dist: aiortc>=1.9.0
Requires-Dist: numba>=0.57.0
Requires-Dist: llvmlite>=0.40.0
Requires-Dist: gradio>=5.0.0
Requires-Dist: pynput>=1.8.1
Requires-Dist: pyperclip>=1.9.0
Requires-Dist: psutil>=7.0.0
Requires-Dist: python-xlib>=0.33; sys_platform == "linux"
Requires-Dist: evdev>=1.9.2; sys_platform == "linux"
Requires-Dist: fastapi>=0.116.1
Requires-Dist: uvicorn>=0.35.0
Requires-Dist: httpx>=0.28.1
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: websockets>=15.0.1
Requires-Dist: numpy>=2.2.6
Requires-Dist: pandas>=2.3.2
Requires-Dist: scipy>=1.16.2
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: python-dateutil>=2.9.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: joblib>=1.5.1
Requires-Dist: filelock>=3.18.0
Requires-Dist: packaging>=25.0
Requires-Dist: certifi>=2025.6.15
Requires-Dist: requests>=2.32.4
Requires-Dist: beautifulsoup4>=4.13.4
Requires-Dist: huggingface_hub[hf_xet]
Provides-Extra: dev
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: pytest-cov>=6.3.0; extra == "dev"
Requires-Dist: pytest-mock>=3.15.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: ruff>=0.1.15; extra == "dev"
Requires-Dist: black>=25.1.0; extra == "dev"
Requires-Dist: coverage>=7.10.6; extra == "dev"
Provides-Extra: cuda
Requires-Dist: torch>=2.8.0; extra == "cuda"
Requires-Dist: torchaudio>=2.8.0; extra == "cuda"
Requires-Dist: torchvision>=0.23.0; extra == "cuda"
Requires-Dist: triton>=3.4.0; extra == "cuda"
Provides-Extra: tray
Requires-Dist: pystray>=0.19.4; extra == "tray"
Requires-Dist: Pillow>=11.2.1; extra == "tray"
Dynamic: license-file

# VoiceBridge 🎙️ ↔️ 📝

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Platform Support](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey.svg)]()

> **The ultimate bidirectional voice-text bridge.** Seamlessly convert speech to text and text to speech with professional-grade accuracy, real-time processing, and hotkey-driven workflows.

## 🚀 What is VoiceBridge?

VoiceBridge eliminates the friction between voice and text. Whether you're transcribing interviews, creating accessible content, building voice-driven workflows, or simply need hands-free text input, VoiceBridge provides a powerful, flexible CLI that adapts to your needs.

**Built on OpenAI's Whisper** for world-class speech recognition and **VibeVoice** for natural text-to-speech synthesis.

## 🎯 What Problems Does It Solve?

- **Content Creators**: Transcribe podcasts, interviews, and videos with timestamp precision
- **Accessibility**: Convert text to natural speech for screen readers and audio content
- **Productivity**: Voice-to-text note-taking with hotkey triggers during meetings
- **Developers**: Integrate speech processing into applications and workflows
- **Researchers**: Batch process audio data with confidence analysis and quality metrics
- **Writers**: Dictate drafts and have articles read back with custom voices

## ✨ Key Features

### 🎤 Speech-to-Text (STT)

- **Real-time transcription** with hotkeys (F9 toggle/hold modes)
- **Interactive mode** with press-and-hold 'r' to record
- **File processing** (MP3, WAV, M4A, FLAC, OGG) with chunked processing
- **Batch transcription** of entire directories with parallel workers
- **Resume capability** for interrupted long transcriptions with session management
- **Streaming transcription** with real-time output and live updates
- **GPU acceleration** (CUDA/Metal) with automatic device detection
- **Memory optimization** with configurable limits and streaming
- **Custom vocabulary** management for domain-specific terms
- **Export formats**: JSON, SRT, VTT, plain text, CSV with timestamps and confidence
- **Confidence analysis** and quality assessment with detailed reporting
- **Webhook integration** for external notifications and automation
- **Post-processing** with spell check, grammar correction, and custom rules
- **Profile management** for different use cases and configurations
- **Performance monitoring** with comprehensive metrics and benchmarking

### 🗣️ Text-to-Speech (TTS)

- **High-quality voice synthesis** with VibeVoice neural models
- **Multiple input modes**: clipboard monitoring, text selection, direct input
- **Custom voice samples** with automatic detection and voice cloning
- **Streaming and non-streaming** modes for real-time or complete generation
- **Daemon mode** for background processing and system integration
- **Hotkey controls** for hands-free operation (F12 generate, Ctrl+Alt+S stop)
- **Voice management** with sample validation and quality checks
- **GPU acceleration** for faster synthesis and model loading
- **Configuration profiles** for different voice settings and use cases
- **Audio output options**: play immediately, save to file, or both

### 🔧 Advanced Processing

- **Audio enhancement**: noise reduction, normalization, silence trimming, fade effects
- **Audio splitting**: by duration, silence detection, or file size with smart segmentation
- **Confidence analysis** and quality assessment with detailed statistics
- **Session management** with progress tracking, resume capability, and persistence
- **Performance monitoring** with GPU benchmarking, memory usage, and operation tracking
- **Webhook integration** for external notifications and workflow automation
- **Profile management** for different use cases and quick configuration switching
- **Vocabulary management** for improved recognition of technical terms and proper nouns
- **Post-processing pipeline** with spell check, grammar correction, and custom rules
- **API server** for integration with external applications and services
- **Comprehensive testing** with E2E test suites for all major functionality

## 🚀 Quick Start

### Installation

VoiceBridge uses **uv** for fast dependency management. Install uv first if you don't have it:

```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install with uv
uv pip install voicebridge
```

### Basic Usage

```bash
# Listen for speech and transcribe with hotkeys
voicebridge stt listen

# Transcribe an audio file
voicebridge stt transcribe audio.mp3 --output transcript.txt

# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"

# Start clipboard monitoring for TTS
voicebridge tts listen-clipboard
```

## 📖 Examples

### 1. Content Creator Workflow

```bash
# Transcribe a podcast episode with timestamps
voicebridge stt transcribe podcast_episode.mp3 \
  --format srt \
  --output episode_subtitles.srt \
  --language en

# Analyze transcription quality
voicebridge stt confidence analyze session_12345 --detailed
```

### 2. Accessibility Content

```bash
# Convert article to speech with custom voice
voicebridge tts generate \
  --voice en-Alice_woman \
  --output article_audio.wav \
  "$(cat article.txt)"

# Batch convert multiple documents
voicebridge stt batch-transcribe articles/ \
  --output-dir transcripts/ \
  --workers 4
```

### 3. Developer Integration

```bash
# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard

# Set up webhook notifications
voicebridge stt webhook add https://api.example.com/transcription-complete

# Real-time transcription with streaming
voicebridge stt realtime \
  --chunk-duration 2.0 \
  --output-format live
```

### 4. Research & Analysis

```bash
# Process interview recordings with resumable capability
voicebridge stt listen-resumable interview.wav \
  --session-name "interview-2024-01-15" \
  --language en

# Export results in multiple formats
voicebridge stt export session session_12345 \
  --format json \
  --include-confidence \
  --output transcript.json
```

## 🛠️ Local Development Setup

### Prerequisites

- **Python 3.10+**
- **uv** (Python package manager)
- **FFmpeg** (for audio processing)
- **CUDA** (optional, for GPU acceleration)

### Installation

```bash
# 1. Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and setup
git clone https://github.com/yourusername/voicebridge.git
cd voicebridge

# 3. Choose your setup:
make prepare        # CPU version
make prepare-cuda   # With CUDA support
make prepare-tray   # With system tray support

# 4. Install system dependencies
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg

# macOS:
brew install ffmpeg

# Windows (with Chocolatey):
choco install ffmpeg
```

### TTS Setup

VoiceBridge includes comprehensive text-to-speech capabilities powered by VibeVoice.

#### Prerequisites

1. **Install VibeVoice dependencies** (if using local model):

   ```bash
   # Clone and install VibeVoice
   git clone https://github.com/WestZhang/VibeVoice.git
   cd VibeVoice
   pip install -e .
   ```

2. **Voice Samples**: Voice samples are included in `voices/` directory:
   ```
   voices/
   ├── en-Alice_woman.wav
   ├── en-Carter_man.wav
   ├── en-Frank_man.wav
   ├── en-Maya_woman.wav
   ├── en-Patrick.wav
   └── ... (additional voices)
   ```

#### Configuration

VoiceBridge works out-of-the-box with sensible defaults. Configuration can be set via:

1. **Config file** (`~/.config/voicebridge/config.json`):

   ```json
   {
     "tts_enabled": true,
     "tts_config": {
       "model_path": "aoi-ot/VibeVoice-7B",
       "voice_samples_dir": "voices",
       "default_voice": "en-Alice_woman",
       "cfg_scale": 1.3,
       "inference_steps": 10,
       "tts_mode": "clipboard",
       "streaming_mode": "non_streaming",
       "output_mode": "play",
       "tts_toggle_key": "f11",
       "tts_generate_key": "f12",
       "tts_stop_key": "ctrl+alt+s",
       "sample_rate": 24000,
       "auto_play": true,
       "use_gpu": true,
       "max_text_length": 2000,
       "chunk_text_threshold": 500
     }
   }
   ```

2. **Command-line flags** (override config file):
   ```bash
   # Generate with custom settings
   voicebridge tts generate "Hello world" \
     --voice en-Patrick \
     --streaming \
     --output speech.wav \
     --cfg-scale 1.5 \
     --inference-steps 15
   ```

#### Voice Sample Requirements

- **Format**: WAV (recommended), MP3, FLAC
- **Sample Rate**: 24kHz (recommended), 16kHz-48kHz supported
- **Channels**: Mono (preferred)
- **Duration**: 3-10 seconds
- **Quality**: Clear, single speaker, minimal background noise
- **Naming**: `language-name_gender.wav` (e.g., `en-Alice_woman.wav`)

#### Quick Test

```bash
# Test TTS with default settings
voicebridge tts generate "Hello, this is VoiceBridge text-to-speech!"

# List available voices
voicebridge tts voices

# Show current TTS configuration
voicebridge tts config show
```

### Development Commands

```bash
make help           # Show all available commands
make lint           # Run ruff linting and formatting
make test           # Run all tests with coverage
make test-fast      # Quick tests without coverage
make test-unit      # Run only unit tests (exclude e2e)
make test-e2e       # Run comprehensive end-to-end tests
make test-e2e-smoke # Run quick E2E smoke tests
make test-e2e-stt   # Run STT E2E tests only
make test-e2e-tts   # Run TTS E2E tests only
make test-e2e-audio # Run audio E2E tests only
make test-e2e-gpu   # Run GPU E2E tests only
make test-e2e-api   # Run API E2E tests only
make clean          # Clean cache and temporary files
```

### Configuration

```bash
# Show current STT configuration
voicebridge stt config show

# Set STT configuration values
voicebridge stt config set use_gpu true

# Show TTS configuration
voicebridge tts config show

# Set up profiles for different use cases
voicebridge stt profile save research-setup
voicebridge stt profile load research-setup
```

## 🎮 Usage Guide

### Speech-to-Text (STT) Commands

#### Real-time Recognition

```bash
# Listen with hotkeys (F9 to start/stop)
voicebridge stt listen

# Interactive mode (press 'r' to record)
voicebridge stt interactive

# Global hotkey listener with custom key
voicebridge stt hotkey --key f9 --mode toggle
```

#### File Processing

```bash
# Transcribe single file
voicebridge stt transcribe audio.mp3 --output transcript.txt

# Batch process directory
voicebridge stt batch-transcribe /path/to/audio/ --workers 4

# Long file with resume capability
voicebridge stt listen-resumable large_file.wav --session-name "my-session"

# Real-time streaming
voicebridge stt realtime --chunk-duration 2.0 --output-format live
```

#### Session Management

```bash
# List all sessions
voicebridge stt sessions list

# Resume interrupted session
voicebridge stt sessions resume --session-name "my-session"

# Clean up old sessions
voicebridge stt sessions cleanup

# Delete specific session
voicebridge stt sessions delete session_id
```

#### Advanced Features

```bash
# Add vocabulary words for better recognition
voicebridge stt vocabulary add "technical,terms,here" --type technical

# Export with confidence analysis
voicebridge stt export session session_id --format srt --confidence

# Set up webhooks for notifications
voicebridge stt webhook add https://api.example.com/notify
```

### Text-to-Speech (TTS) Commands

#### Basic Generation

```bash
# Generate speech from text
voicebridge tts generate "Hello, this is VoiceBridge!"

# Use specific voice and save to file
voicebridge tts generate "Hello world" --voice en-Alice_woman --output speech.wav

# Generate speech from a text file
voicebridge tts generate-file document.txt --output document.wav
voicebridge tts generate-file article.md --voice en-Patrick --streaming

# List available voices
voicebridge tts voices
```

#### Background Monitoring

```bash
# Monitor clipboard for text changes
voicebridge tts listen-clipboard --streaming

# Monitor text selections (use hotkey to trigger)
voicebridge tts listen-selection

# Start TTS daemon for background processing
voicebridge tts daemon start --mode clipboard
voicebridge tts daemon status
voicebridge tts daemon stop
```

#### Configuration

```bash
# Show TTS settings
voicebridge tts config show

# Configure TTS settings
voicebridge tts config set --default-voice en-Alice_woman --cfg-scale 1.5
```

### Audio Processing

```bash
# Get audio file information
voicebridge audio info audio.mp3

# List supported formats
voicebridge audio formats

# Split large audio file
voicebridge audio split recording.mp3 \
  --method duration \
  --chunk-duration 300

# Enhance audio quality
voicebridge audio preprocess input.wav output.wav \
  --noise-reduction 0.8 \
  --normalize \
  --trim-silence

# Test audio setup
voicebridge audio test
```

### System & Performance

```bash
# Check GPU status and acceleration
voicebridge gpu status

# Benchmark GPU performance
voicebridge gpu benchmark --model base

# View STT performance statistics
voicebridge stt performance stats

# Manage active operations
voicebridge stt operations list
voicebridge stt operations cancel operation_id
```

### API Server

```bash
# Start API server
voicebridge api start --host localhost --port 8000

# Check API status
voicebridge api status

# Get API information
voicebridge api info

# Stop API server
voicebridge api stop
```

## 📋 Complete Command Reference

VoiceBridge uses a hierarchical command structure with five main categories:

### 🎤 `stt` - Speech-to-Text Commands

```
stt listen              # Real-time transcription with hotkeys
stt interactive         # Press-and-hold 'r' to record mode
stt hotkey              # Global hotkey listener
stt transcribe          # Transcribe single audio file
stt batch-transcribe    # Batch process directory
stt listen-resumable    # Long file with resume capability
stt realtime            # Real-time streaming transcription

# Session Management
stt sessions list       # List all sessions
stt sessions resume     # Resume interrupted session
stt sessions cleanup    # Clean up old sessions
stt sessions delete     # Delete specific session

# Advanced Features
stt vocabulary add      # Add custom vocabulary
stt vocabulary remove   # Remove vocabulary
stt vocabulary list     # List vocabulary
stt vocabulary import   # Import from file
stt vocabulary export   # Export to file

stt export session      # Export session data
stt export formats      # List export formats

stt confidence analyze  # Analyze transcription confidence
stt confidence analyze-all # Analyze all sessions

stt postproc config     # Configure post-processing
stt postproc test       # Test post-processing

stt webhook add         # Add webhook notification
stt webhook remove      # Remove webhook
stt webhook list        # List webhooks
stt webhook test        # Test webhook

stt performance stats   # Performance statistics
stt operations list     # List active operations
stt operations cancel   # Cancel operation
stt operations status   # Check operation status

stt config show         # Show configuration
stt config set          # Set configuration

stt profile save        # Save configuration profile
stt profile load        # Load configuration profile
stt profile list        # List profiles
stt profile delete      # Delete profile
```

### 🗣️ `tts` - Text-to-Speech Commands

```
tts generate            # Generate speech from text
tts generate-file       # Generate speech from text file (txt, md, etc.)
tts listen-clipboard    # Monitor clipboard changes
tts listen-selection    # Monitor text selections with hotkey
tts voices              # List available voices

# Daemon Management
tts daemon start        # Start TTS daemon
tts daemon stop         # Stop TTS daemon
tts daemon status       # Check daemon status

# Configuration
tts config show         # Show TTS configuration
tts config set          # Configure TTS settings
```

### 🔊 `audio` - Audio Processing Commands

```
audio info              # Show audio file information
audio formats           # List supported formats
audio split             # Split audio file into chunks
audio preprocess        # Enhance audio quality
audio test              # Test audio setup
```

### 🖥️ `gpu` - GPU and System Commands

```
gpu status              # Show GPU status
gpu benchmark           # Benchmark GPU performance
```

### 🌐 `api` - API Server Management

```
api start               # Start API server
api stop                # Stop API server
api status              # Check API status
api info                # Show API information
```

## 🏗️ Architecture

VoiceBridge follows **hexagonal architecture** principles:

```
voicebridge/
├── domain/          # Core business logic and models
├── ports/           # Interfaces and abstractions
├── adapters/        # External integrations (Whisper, VibeVoice, etc.)
├── services/        # Application services and orchestration
├── cli/             # Command-line interface
└── tests/          # Comprehensive test suite
```

### Key Components

- **Domain Layer**: Core models, configurations, and business rules
- **Ports**: Abstract interfaces for transcription, TTS, audio processing
- **Adapters**: Concrete implementations for Whisper, VibeVoice, FFmpeg
- **Services**: Orchestration, session management, performance monitoring
- **CLI**: Typer-based command interface with sub-commands

## 🤝 Contributing

We welcome contributions! Here's how to get started:

### Development Workflow

1. **Fork** the repository
2. **Create** a feature branch: `git checkout -b feature/amazing-feature`
3. **Install** development dependencies: `make install-dev`
4. **Make** your changes following our coding standards
5. **Test** your changes: `make test`
6. **Lint** your code: `make lint`
7. **Commit** your changes: `git commit -m 'Add amazing feature'`
8. **Push** to your branch: `git push origin feature/amazing-feature`
9. **Open** a Pull Request

### Coding Standards

- **Python 3.10+** with comprehensive type hints
- **uv** for fast dependency management and virtual environments
- **Ruff** for linting and formatting (replaces Black and isort)
- **Pytest** for testing with >90% coverage target
- **Hexagonal architecture** for new features and clean separation of concerns
- **Comprehensive documentation** for public APIs and CLI commands
- **E2E testing** for all major CLI workflows and functionality
- **Makefile** for standardized development commands

### Areas for Contribution

- 🎯 **New audio formats** and processing capabilities
- 🌍 **Language support** and localization
- 🔧 **Performance optimizations** and GPU utilization
- 📱 **Platform integrations** (mobile, web interfaces)
- 🧪 **Test coverage** and edge case handling
- 📚 **Documentation** and usage examples
- 🎨 **Voice samples** and TTS improvements

### Reporting Issues

Please use our **issue templates**:

- 🐛 **Bug Report**: Describe the issue with reproduction steps
- 💡 **Feature Request**: Propose new functionality
- 📚 **Documentation**: Report unclear or missing docs
- 🏃 **Performance**: Report slow or resource-intensive operations

## 📜 License

This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **OpenAI Whisper** - State-of-the-art speech recognition
- **VibeVoice** - High-quality text-to-speech synthesis
- **FFmpeg** - Comprehensive audio processing
- **Typer** - Modern CLI framework
- **PyTorch** - Machine learning infrastructure
