Metadata-Version: 2.4
Name: maivi
Version: 0.2.0
Summary: Maivi - My AI Voice Input: Real-time voice-to-text with hotkey support
Author: Maxime Rivest
License: MIT
Project-URL: Homepage, https://github.com/MaximeRivest/maivi
Project-URL: Repository, https://github.com/MaximeRivest/maivi
Project-URL: Issues, https://github.com/MaximeRivest/maivi/issues
Keywords: speech-to-text,voice,transcription,ai,nemo
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nemo-toolkit[asr]>=2.0.0
Requires-Dist: PySide6>=6.5.0
Requires-Dist: pyaudio>=0.2.13
Requires-Dist: soundfile>=0.12.0
Requires-Dist: librosa>=0.10.0
Requires-Dist: numpy<2.0.0,>=1.24.0
Requires-Dist: pynput>=1.7.6
Requires-Dist: pyperclip>=1.8.2
Requires-Dist: plyer>=2.1.0
Requires-Dist: torch>=2.0.0
Requires-Dist: torchaudio>=2.0.0
Requires-Dist: ml-dtypes<0.6.0,>=0.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.12; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Provides-Extra: build
Requires-Dist: pyinstaller>=5.0; extra == "build"
Requires-Dist: build>=0.10; extra == "build"
Requires-Dist: twine>=4.0; extra == "build"
Dynamic: license-file

# Maivi - My AI Voice Input 🎤

**Real-time voice-to-text transcription with hotkey support**

Maivi (My AI Voice Input) is a cross-platform desktop application that turns your voice into text using state-of-the-art AI models. Simply press **Alt+Q** to start recording, and press again to stop. Your transcription appears in real-time and is automatically copied to your clipboard.

![License](https://img.shields.io/badge/license-MIT-blue.svg)
![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)
![Platform](https://img.shields.io/badge/platform-linux%20%7C%20macos%20%7C%20windows-lightgrey.svg)

## ✨ Features

- 🎤 **Hotkey Recording** - Toggle recording with Alt+Q
- ⚡ **Real-time Transcription** - See text appear as you speak
- 📋 **Clipboard Integration** - Automatic copy to clipboard
- 🪟 **Floating Overlay** - Live transcription in a sleek overlay window
- 🔄 **Smart Chunk Merging** - Advanced overlap-based merging eliminates duplicates
- 💻 **CPU-Only** - No GPU required (though GPU acceleration is supported)
- 🌍 **High Accuracy** - Powered by NVIDIA Parakeet TDT 0.6B model (~6-9% WER)
- 🚀 **Fast** - ~0.36x RTF (processes 7s audio in 2.5s on CPU)

## 🚀 Quick Start

### Installation

**CPU-only (Recommended - much faster, 100MB vs 2GB+):**
```bash
pip install maivi --extra-index-url https://download.pytorch.org/whl/cpu
```

**Or with GPU support (if you have NVIDIA GPU):**
```bash
pip install maivi --extra-index-url https://download.pytorch.org/whl/cu121
```

**Standard install (may download large CUDA files):**
```bash
pip install maivi
```

### System Requirements

**Linux:**
```bash
sudo apt-get install portaudio19-dev python3-pyaudio
```

**macOS:**
```bash
brew install portaudio
```

**Windows:**
- PortAudio is usually included with PyAudio

### Usage

**GUI Mode (Recommended):**
```bash
maivi
```

Press **Alt+Q** to start recording, press **Alt+Q** again to stop. The transcription will appear in a floating overlay and be copied to your clipboard.

**CLI Mode:**
```bash
# Basic CLI
maivi-cli

# With live terminal UI
maia-cli --show-ui

# Custom parameters
maia-cli --window 10 --slide 5 --show-ui
```

**Controls:**
- **Alt+Q** - Start/stop recording (toggle mode)
- **Esc** - Exit application

## 📖 How It Works

Maia uses a sophisticated streaming architecture:

1. **Sliding Window Recording** - Captures audio in overlapping 7-second chunks every 3 seconds
2. **Real-time Transcription** - Each chunk is transcribed by the NVIDIA Parakeet model
3. **Smart Merging** - Chunks are merged using overlap detection (4-second overlap)
4. **Live Updates** - The UI updates in real-time as transcription progresses

### Why Overlapping Chunks?

```
Chunk 1: "hello world how are you"
Chunk 2: "how are you doing today"
          ^^^^^^^^^^^^^^
          Overlap detected → merge!

Result: "hello world how are you doing today"
```

This approach ensures:
- ✅ No words cut mid-syllable
- ✅ Context preserved for better accuracy
- ✅ Seamless merging without duplicates
- ✅ Fast processing (no queue buildup)

## ⚙️ Configuration

### Chunk Parameters

```bash
maia-cli --window 7.0 --slide 3.0 --delay 2.0
```

- `--window`: Chunk size in seconds (default: 7.0)
  - Larger = better quality, slower processing
- `--slide`: Slide interval in seconds (default: 3.0)
  - Smaller = more overlap, higher CPU usage
  - **Rule:** Must be > `window × 0.36` to avoid queue buildup
- `--delay`: Processing start delay in seconds (default: 2.0)

### Advanced Options

```bash
# Speed adjustment (experimental)
maia-cli --speed 1.5

# Custom UI width
maia-cli --show-ui --ui-width 50

# Disable pause detection
maia-cli --no-pause-breaks

# Stream to file (for voice commands)
maia-cli --output-file transcription.txt
```

## 📦 Building Executables

Maivi can be packaged as standalone executables for easy distribution:

```bash
# Install build dependencies
pip install maivi[build]

# Build executable
pyinstaller --onefile --windowed \
  --name maivi \
  --add-data "src/maia:maia" \
  src/maia/__main__.py
```

Pre-built executables are available in [Releases](https://github.com/MaximeRivest/maivi/releases).

## 🏗️ Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/MaximeRivest/maivi.git
cd maivi

# Install in development mode
pip install -e .[dev]

# Run tests
pytest
```

### Project Structure

```
maia/
├── src/maia/
│   ├── __init__.py
│   ├── __main__.py           # GUI entry point
│   ├── core/
│   │   ├── streaming_recorder.py
│   │   ├── chunk_merger.py
│   │   └── pause_detector.py
│   ├── gui/
│   │   └── qt_gui.py
│   ├── cli/
│   │   ├── cli.py
│   │   ├── server.py
│   │   └── terminal_ui.py
│   └── utils/
├── tests/
├── docs/
├── pyproject.toml
├── README.md
└── LICENSE
```

## 🐛 Troubleshooting

### "No overlap found" warnings

This is **expected behavior** when there are long pauses (5+ seconds of silence). The system adds "..." gap markers to indicate the pause.

### Queue buildup (transcription continues after stopping)

Check that processing time < slide interval:
- Processing: `window_seconds × 0.36` (RTF)
- Should be < `slide_seconds`
- Default: `7 × 0.36 = 2.52s < 3s` ✅

### Model download issues

The first run downloads the NVIDIA Parakeet model (~600MB) from HuggingFace. If download fails:
- Check internet connection
- Verify HuggingFace is accessible
- Clear cache: `rm -rf ~/.cache/huggingface/`

### Qt/GUI crashes

If the GUI crashes on Linux:
```bash
# Check Qt installation
python -c "from PySide6 import QtWidgets; print('Qt OK')"

# Fall back to CLI mode
maia-cli --show-ui
```

## 📊 Performance

**Memory:**
- Model: ~2GB RAM
- Audio buffer: ~1MB
- Total: ~2.5GB RAM

**CPU:**
- Idle: <5% CPU
- Recording: 30-40% of 1 core
- Transcription: 100% of 1 core (during processing)

**Latency:**
- First transcription: 2s (start delay)
- Updates: Every 3s (slide interval)
- Completion: 1-3s after recording stops

**Accuracy:**
- Model WER: ~5-8%
- Overlap merging: <1% word loss
- Total effective WER: ~6-9%

## 🗺️ Roadmap

**v0.2 - Platform Support:**
- [ ] Test and verify macOS support
- [ ] Test and verify Windows support
- [ ] Platform-specific installers (.app, .exe)

**v0.3 - Features:**
- [ ] Configurable hotkeys via GUI
- [ ] Multi-language support
- [ ] Custom model selection
- [ ] Voice commands support

**v0.4 - Optimization:**
- [ ] GPU acceleration (CUDA)
- [ ] Export formats (JSON, SRT)
- [ ] Text editor integration
- [ ] Plugin system

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) ASR toolkit
- Uses [Parakeet TDT 0.6B](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) model
- GUI powered by [PySide6](https://wiki.qt.io/Qt_for_Python)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 💬 Support

- 📫 [Create an issue](https://github.com/MaximeRivest/maivi/issues)
- 💡 [Feature requests](https://github.com/MaximeRivest/maivi/issues/new?labels=enhancement)
- 🐛 [Bug reports](https://github.com/MaximeRivest/maivi/issues/new?labels=bug)

---

Made with ❤️ by [Maxime Rivest](https://github.com/MaximeRivest)
