Metadata-Version: 2.4
Name: audio-metrics-cli
Version: 0.4.0
Summary: Voice Acoustic Analyzer - Professional audio metrics extraction
Author-email: OpenClaw <clawbot@openclaw.ai>
License: MIT
Project-URL: Homepage, https://github.com/i-whimsy/audio-metrics-cli
Project-URL: Repository, https://github.com/i-whimsy/audio-metrics-cli.git
Project-URL: Documentation, https://github.com/i-whimsy/audio-metrics-cli#readme
Keywords: audio,speech,analysis,metrics,whisper,vad,prosody,cli
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Analysis
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.23.0
Requires-Dist: librosa>=0.10.0
Requires-Dist: soundfile>=0.12.0
Requires-Dist: openai-whisper>=20230314
Requires-Dist: click>=8.1.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: structlog>=23.0.0
Requires-Dist: tenacity>=8.0.0
Requires-Dist: pyannote.audio>=3.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: torchaudio>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: nlp
Requires-Dist: spacy>=3.7.0; extra == "nlp"
Requires-Dist: textblob>=0.17.0; extra == "nlp"
Requires-Dist: snownlp>=0.12.0; extra == "nlp"
Provides-Extra: emotion
Requires-Dist: torch>=2.0.0; extra == "emotion"
Requires-Dist: torchaudio>=2.0.0; extra == "emotion"
Requires-Dist: speechbrain>=0.5.14; extra == "emotion"
Provides-Extra: api
Requires-Dist: fastapi>=0.100.0; extra == "api"
Requires-Dist: uvicorn>=0.23.0; extra == "api"
Requires-Dist: aiofiles>=23.0.0; extra == "api"
Requires-Dist: jinja2>=3.0.0; extra == "api"
Provides-Extra: ner
Requires-Dist: spacy>=3.7.0; extra == "ner"
Provides-Extra: extra
Requires-Dist: torch>=2.0.0; extra == "extra"
Requires-Dist: torchaudio>=2.0.0; extra == "extra"
Requires-Dist: speechbrain>=0.5.14; extra == "extra"
Requires-Dist: fastapi>=0.100.0; extra == "extra"
Requires-Dist: uvicorn>=0.23.0; extra == "extra"
Requires-Dist: aiofiles>=23.0.0; extra == "extra"
Requires-Dist: jinja2>=3.0.0; extra == "extra"
Requires-Dist: spacy>=3.7.0; extra == "extra"
Requires-Dist: textblob>=0.17.0; extra == "extra"
Requires-Dist: snownlp>=0.12.0; extra == "extra"
Dynamic: license-file

# Audio Metrics CLI v4

🎙️ **Industrial-Grade Deep Speech Analysis Platform**

[![PyPI version](https://badge.fury.io/py/audio-metrics-cli.svg)](https://badge.fury.io/py/audio-metrics-cli)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

> **v4.0 Architecture**: GPU-accelerated, chunked processing, Pydantic-validated industrial analysis

---

## 🇨🇳 Users in Mainland China - Read Before First Use

### 🚀 One-Click Model Download (Recommended)

**Windows users**: double-click the `download_models.bat` script.

**Or run it manually** (PowerShell):
```powershell
# Route Hugging Face downloads through the hf-mirror.com mirror
$env:HF_ENDPOINT = "https://hf-mirror.com"
pip install huggingface-hub openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple

# Fetch Silero VAD into the torch hub cache (ghproxy.com proxies GitHub)
New-Item -ItemType Directory -Force "$env:USERPROFILE\.cache\torch\hub" | Out-Null
cd "$env:USERPROFILE\.cache\torch\hub"
git clone https://ghproxy.com/https://github.com/snakers4/silero-vad.git silero-vad_master

# Pre-download the pyannote diarization model and the Whisper base model
huggingface-cli download pyannote/speaker-diarization-3.1 --local-dir "$env:USERPROFILE\.cache\huggingface\hub\models--pyannote--speaker-diarization-3.1"
python -c "import whisper; whisper.load_model('base')"
```

**Details**: see [`docs/MODEL_DEPENDENCIES.md`](docs/MODEL_DEPENDENCIES.md)

---

## 🚀 Quick Start

```bash
# Install from PyPI (recommended)
pip install audio-metrics-cli

# Full V4 analysis with GPU auto-detection
audio-metrics analyze audio.wav -o result.json

# Specify device manually
audio-metrics analyze audio.wav -d cuda -o result.json

# Long audio (>1h) with custom chunk size
audio-metrics analyze audio.wav -o result.json --chunk-size 900 --show-timings
```

### GPU Acceleration

V4 auto-detects NVIDIA GPUs and runs Whisper + pyannote.audio on CUDA:

```bash
audio-metrics analyze audio.wav -d auto   # GPU if available, else CPU
audio-metrics analyze audio.wav -d cuda   # Force GPU
audio-metrics analyze audio.wav -d cpu    # Force CPU
```
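
Auto-detection of this kind typically boils down to a standard `torch.cuda.is_available()` check. Here is a minimal sketch of that resolution logic (the `resolve_device` name is illustrative, not the package's actual API):

```python
import torch

def resolve_device(requested: str = "auto") -> str:
    """Resolve an auto/cuda/cpu flag to a concrete torch device string."""
    if requested == "auto":
        # Prefer an NVIDIA GPU when the CUDA runtime is usable; otherwise use CPU.
        return "cuda" if torch.cuda.is_available() else "cpu"
    if requested == "cuda" and not torch.cuda.is_available():
        raise RuntimeError("CUDA requested but no usable GPU was found")
    return requested

print(resolve_device())  # "cuda" on a GPU machine, "cpu" otherwise
```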

---

## 🏛️ Architecture v4

```
┌──────────────────────────────────────────────────────┐
│                CLI Layer (main_cli.py)               │
│   analyze | analyze-multi | voice-acoustic | serve   │
└──────────────────────────────────────────────────┬───┘
                                                   ↓
┌──────────────────────────────────────────────────────┐
│             V4 Pipeline (v4/pipeline.py)             │
│   DeviceManager → AudioHealth → Chunker → Analyzer   │
└──────────────────────────────────────────────────┬───┘
                                                   ↓
┌──────────────────────────────────────────────────────┐
│            Pydantic Schema (v4/schemas.py)           │
│     V4Result → SegmentModel → SpeakerModel → NER     │
└──────────────────────────────────────────────────────┘
```

### Key Features

| Feature | Description |
|---------|-------------|
| **GPU Auto-Detection** | Automatic CUDA detection for Whisper + pyannote.audio |
| **Chunked Processing** | Handles 1h+ audio without OOM (1800s chunks, 60s overlap; see the sketch after this table) |
| **Word-Level Alignment** | Precise timestamp alignment (replaces `seg_duration*5` estimation) |
| **30+ Prosody Metrics** | Pitch, energy, spectral, voice quality, speech rate per segment |
| **Fluency Analysis** | Filler words (呃/嗯/那个, roughly "uh/um/like") + unnatural-pause detection |
| **NER** | spaCy-based named entity recognition (commercial entities, persons, locations) |
| **Topic Segmentation** | Semantic topic chapters with Jaccard keyword similarity |
| **Sentiment & Key Points** | TextBlob/snownlp sentiment scoring, automatic key point detection |
| **Pydantic Validation** | All outputs validated against strict schema (100% constraint enforcement) |
| **tqdm Progress Bars** | Real-time feedback on VAD, Diarization, STT, metrics extraction |
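
For the chunked-processing row above, here is roughly how 1800s chunks with a 60s overlap tile a long recording. This is a sketch of the arithmetic only, not the package's actual `core/chunker.py` implementation:

```python
def chunk_spans(total_s: float, chunk_s: float = 1800.0, overlap_s: float = 60.0):
    """Yield (start, end) spans in seconds; consecutive spans share overlap_s."""
    step = chunk_s - overlap_s  # 1740 s of new audio per chunk
    start = 0.0
    while start < total_s:
        yield start, min(start + chunk_s, total_s)
        start += step

# A 2-hour file (7200 s) becomes five overlapping chunks:
# (0, 1800), (1740, 3540), (3480, 5280), (5220, 7020), (6960, 7200)
print(list(chunk_spans(7200)))
```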

---

## 📖 CLI Commands

### `analyze` - V4 Full Analysis

Single audio file → V4 pipeline with full feature set.

```bash
audio-metrics analyze AUDIO_FILE [OPTIONS]

Options:
  -o, --output PATH             Output JSON file path
  -d, --device [auto|cuda|cpu]  Device for inference (default: auto)
  -m, --model TEXT              Whisper model (tiny/base/small/medium/large)
  --num-speakers INTEGER        Number of speakers (if known)
  --min-speakers INTEGER        Minimum number of speakers
  --max-speakers INTEGER        Maximum number of speakers
  --language TEXT               Language code (auto-detect if not specified)
  --chunk-size INTEGER          Chunk size in seconds for long audio (default: 1800)
  --no-emotion                  Skip emotion analysis
  --no-progress                 Disable tqdm progress bars
  --show-timings                Show step timing information
  --show-progress               Show progress bars
  -f, --format [json|csv|html]  Output format (default: json)
  --parallel                    Use parallel processing (batch mode)
  --batch PATH                  Process all audio files in directory
  --glob TEXT                   Glob pattern for batch processing
  -j, --workers INTEGER         Number of parallel workers
  -v, --verbose                 Verbose output

Examples:
  audio-metrics analyze meeting.wav -o result.json
  audio-metrics analyze meeting.wav -d cuda -o result.json --show-timings
  audio-metrics analyze long_recording.wav --chunk-size 900 --language zh
```

### `analyze-multi` - Multi-Speaker Conversation

```bash
audio-metrics analyze-multi AUDIO_FILE [OPTIONS]
```

### `voice-acoustic` - Acoustic Features Only

```bash
audio-metrics voice-acoustic AUDIO_FILE [OPTIONS]
```

### `transcribe` - Whisper Transcription Only

```bash
audio-metrics transcribe AUDIO_FILE [-o OUTPUT] [-m MODEL] [--language LANG]
```
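
The word-level alignment listed in Key Features is built on Whisper's own word timestamps, which openai-whisper exposes via `word_timestamps=True`. A minimal example:

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav", word_timestamps=True)

for seg in result["segments"]:
    for word in seg.get("words", []):
        # Each word carries its own start/end time, which is what enables
        # precise alignment instead of duration-based estimates.
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```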

### `compare` - Compare Two Audio Files

```bash
audio-metrics compare FILE1 FILE2 [--format text|json|markdown]
```

### `serve` - Start API Server

```bash
audio-metrics serve [--host HOST] [-p PORT] [--reload]
```

---

## 📊 V4 Output Schema

All outputs are Pydantic-validated JSON with strict constraints.

### Top-Level Structure

```json
{
  "meta": {
    "version": "4.0.0",
    "device_used": "cuda",
    "chunked_processing": false,
    "analysis_complete": true
  },
  "audio": { ... },
  "speakers": [ ... ],
  "segments": [ ... ],
  "prosody": { ... },
  "fluency": { ... },
  "conversation_dynamics": { ... },
  "vad": { ... },
  "emotion": { ... },
  "named_entities": { ... },
  "topic_segments": [ ... ],
  "transcript_text": "...",
  "transcript_language": "zh"
}
```
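
To give a concrete sense of the validation, the trimmed-down Pydantic v2 model below round-trips the top-level fields shown above. The real `v4/schemas.py` models carry far more fields and constraints; this is only a sketch:

```python
import json
from pydantic import BaseModel

class Meta(BaseModel):
    version: str
    device_used: str
    chunked_processing: bool
    analysis_complete: bool

class V4ResultLite(BaseModel):
    """Tiny subset of the V4 schema: known fields are type-checked, extras ignored."""
    meta: Meta
    transcript_text: str
    transcript_language: str

with open("result.json", encoding="utf-8") as f:
    result = V4ResultLite.model_validate(json.load(f))
print(result.meta.device_used)  # raises ValidationError if a field is malformed
```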

### Segment Detail (Core Output Unit)

```json
{
  "segment_index": 0,
  "start": 0.0,
  "end": 15.234,
  "duration": 15.234,
  "confidence": 0.95,
  "speaker": "speaker_0",
  "text": "今天我们讨论一下Aibee项目的进展情况。万象城的项目已经进入第三期。",
  "pitch_mean_hz": 175.3,
  "energy_mean": 0.0245,
  "speech_rate_wpm": 150.2,
  "filler_words": { "那个": 2, "嗯": 1 },
  "sentiment_score": 0.3,
  "is_key_point": true,
  "named_entities": ["Aibee", "万象城"],
  "topic": "project_update"
}
```
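
The `filler_words` tallies above can be reproduced with a plain substring count over the segment text. A sketch using the fillers the analyzer targets (呃/嗯/那个, roughly "uh/um/like"):

```python
from typing import Dict

FILLERS = ["那个", "嗯", "呃"]

def count_fillers(text: str) -> Dict[str, int]:
    """Tally filler-word occurrences in one segment's transcript text."""
    counts = {f: text.count(f) for f in FILLERS}
    return {f: n for f, n in counts.items() if n > 0}

print(count_fillers("嗯，那个项目那个进度还可以。"))  # {'那个': 2, '嗯': 1}
```

A production detector also has to rule out non-filler uses (e.g. 那个 as an ordinary demonstrative), so treat this as a baseline.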

### Named Entities

```json
{
  "total_entities": 7,
  "commercial_entities": ["Aibee", "万象城", "中海地产", "保利", "SKP"],
  "persons": ["张三"],
  "organizations": ["Aibee", "中海地产"]
}
```
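
A minimal spaCy pass produces the raw entities behind this summary, which the pipeline then buckets into `persons` / `organizations` / `commercial_entities`. This sketch assumes the Chinese model is installed (`python -m spacy download zh_core_web_sm`):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("今天我们讨论一下Aibee项目的进展情况。万象城的项目已经进入第三期。")

for ent in doc.ents:
    # spaCy labels such as PERSON, ORG, and GPE feed the buckets above.
    print(ent.text, ent.label_)
```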

### Topic Segmentation

```json
{
  "num_topics": 3,
  "topics": [
    { "start": 0.0, "end": 1200.0, "topic_label": "project_update", "keywords": ["项目", "进度", "Aibee", "万象城"], "confidence": 0.85 },
    { "start": 1200.0, "end": 2400.0, "topic_label": "planning", "keywords": ["计划", "目标", "策略"], "confidence": 0.78 }
  ]
}
```
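
The Jaccard keyword similarity behind these boundaries is intersection-over-union on keyword sets. A sketch (the 0.2 boundary threshold is illustrative, not the pipeline's actual value):

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

prev = ["项目", "进度", "Aibee", "万象城"]  # keywords from the first topic above
curr = ["计划", "目标", "策略"]             # keywords from the second

# Low similarity between adjacent keyword windows suggests a topic boundary.
if jaccard(prev, curr) < 0.2:
    print("topic boundary")  # disjoint sets -> similarity 0.0
```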

See [`standard_v4_sample.json`](standard_v4_sample.json) for full reference.

---

## ⚠️ Important: Dependencies

This tool requires **pyannote.audio** for accurate multi-speaker analysis.

Without pyannote.audio installed, the tool uses a fallback VAD-based method that:
- ❌ Cannot distinguish between different speakers
- ❌ Will show 50/50 speaking time even when one person talks 90% of the time

With pyannote.audio installed:
- ✅ Correctly identifies who spoke when
- ✅ Accurate speaker time statistics
- ✅ Works with any number of speakers

### Installation

```bash
# CPU-only (faster install, recommended for testing)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install pyannote.audio

# GPU (faster inference, requires CUDA)
pip install torch torchaudio
pip install pyannote.audio
```
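
You can verify which path you are on by probing the import, which is essentially what the fallback decision comes down to (a sketch; the package's internal check may differ):

```python
try:
    from pyannote.audio import Pipeline  # real diarization backend available
    print("pyannote.audio found: per-speaker statistics will be accurate")
except ImportError:
    print("pyannote.audio missing: VAD-only fallback, no speaker separation")
```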

### Optional Dependencies for Full V4 Features

```bash
# NER + sentiment (recommended)
pip install "audio-metrics-cli[nlp]"

# Individual extras
pip install "audio-metrics-cli[ner]"       # spaCy for named entity recognition
pip install "audio-metrics-cli[emotion]"   # SpeechBrain for emotion analysis
pip install "audio-metrics-cli[api]"       # FastAPI server
```

---

## 💻 Development

```bash
# Clone repository
git clone https://github.com/i-whimsy/audio-metrics-cli.git
cd audio-metrics-cli

# Install with dev dependencies
pip install -e ".[dev]"

# Run V4 tests
pytest tests/v4/ -v

# Run all tests
pytest tests/ -v

# Format code
black src/
ruff check src/
```

### Project Structure

```
audio-metrics-cli/
├── src/audio_metrics/
│   ├── main_cli.py              # V4 CLI entry point
│   ├── cli/
│   │   ├── __init__.py          # cli/__init__ → main_cli.py
│   │   └── cli.py               # Legacy v3 CLI (superseded)
│   ├── v4/
│   │   ├── __init__.py
│   │   ├── schemas.py           # Pydantic V4 models
│   │   ├── pipeline.py          # V4 orchestrator
│   │   └── generate_sample.py   # Sample generation
│   ├── analyzers/
│   │   ├── audio_health.py      # Audio validation/normalization
│   │   ├── speech_to_text.py    # Word-level timestamps
│   │   ├── speaker_diarization.py  # GPU device support
│   │   ├── prosody_analyzer.py  # 30+ prosody features
│   │   ├── filler_detector.py   # Filler word detection
│   │   ├── fluency_analyzer.py  # Unnatural pauses
│   │   └── ...
│   ├── nlp/
│   │   ├── ner_analyzer.py      # spaCy NER
│   │   ├── topic_segmenter.py   # Topic segmentation
│   │   ├── sentiment_analyzer.py   # TextBlob + snownlp
│   │   └── ...
│   ├── core/
│   │   ├── device.py            # GPU/CPU detection
│   │   ├── chunker.py           # Long audio chunking
│   │   ├── warnings.py          # Warning suppression
│   │   └── ...
│   ├── conversation/
│   ├── metrics/
│   └── exporters/
├── tests/
│   └── v4/
│       ├── test_schema_validation.py  # 27 schema tests
│       └── test_edge_cases.py         # 17 boundary tests
├── standard_v4_sample.json      # Reference output
├── pyproject.toml
└── README.md
```

---

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📝 License

MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) - Speech-to-text
- [Silero VAD](https://github.com/snakers4/silero-vad) - Voice activity detection
- [pyannote](https://github.com/pyannote/pyannote-audio) - Speaker diarization
- [Librosa](https://librosa.org/) - Audio analysis
- [spaCy](https://spacy.io/) - Named entity recognition
- [TextBlob](https://textblob.readthedocs.io/) - Sentiment analysis
- [SnowNLP](https://github.com/isnowfy/snownlp) - Chinese sentiment

---

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/i-whimsy/audio-metrics-cli/issues)
- **Discussions**: [GitHub Discussions](https://github.com/i-whimsy/audio-metrics-cli/discussions)
- **Email**: clawbot@openclaw.ai

---

**Built with ❤️ by the OpenClaw Team**

**v4.0 - Industrial-Grade Deep Speech Analysis**
