Metadata-Version: 2.4
Name: mlx-meralion
Version: 0.1.0
Summary: MLX-native inference for MERaLiON AudioLLM on Apple Silicon
Author: Yingxu He
License: MIT
Project-URL: Homepage, https://github.com/YingxuH/mlx-audiollm
Project-URL: Repository, https://github.com/YingxuH/mlx-audiollm
Project-URL: Issues, https://github.com/YingxuH/mlx-audiollm/issues
Keywords: mlx,apple-silicon,speech,asr,audio,llm,meralion
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlx>=0.18.0
Requires-Dist: mlx-lm>=0.20.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: librosa>=0.10.0
Requires-Dist: soundfile>=0.12.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: tokenizers>=0.15.0
Requires-Dist: transformers>=4.46.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Dynamic: license-file

# mlx-meralion

MLX-native inference for [MERaLiON AudioLLM](https://huggingface.co/MERaLiON) on Apple Silicon.

MERaLiON is A*STAR's multimodal audio-language model for speech transcription, translation, spoken question answering, and more.

## Installation

```bash
pip install mlx-meralion
```

Requires macOS on Apple Silicon (M1/M2/M3/M4) and Python 3.10+.
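A quick way to confirm your Python interpreter is running natively on Apple Silicon (a Rosetta-translated Python reports `x86_64` and will not work with MLX):

```shell
python3 -c "import platform; print(platform.machine())"   # should print: arm64
```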

## Quick Start

### Python API

```python
from mlx_meralion import load_model, transcribe

# Load model (auto-downloads from HuggingFace on first use)
model = load_model("MERaLiON/MERaLiON-2-10B-MLX")  # 10B 8-bit, recommended
# model = load_model("MERaLiON/MERaLiON-2-3B-MLX")   # 3B fp16, smaller

# Transcribe speech
text = transcribe(model, "audio.wav")
print(text)

# Translate to Chinese
text = transcribe(model, "audio.wav", task="translate_zh")

# Spoken question answering
text = transcribe(model, "audio.wav", task="sqa", question="What is the speaker talking about?")

# Summarize dialogue
text = transcribe(model, "audio.wav", task="summarize")
```

### CLI

```bash
# ASR (default task)
mlx-meralion --model MERaLiON/MERaLiON-2-10B-MLX --audio audio.wav --task asr

# Translation
mlx-meralion --model MERaLiON/MERaLiON-2-10B-MLX --audio audio.wav --task translate_zh

# Custom instruction
mlx-meralion --model MERaLiON/MERaLiON-2-10B-MLX --audio audio.wav --instruction "Summarize this in one sentence."
```

## Supported Tasks

| Task | Description |
|------|-------------|
| `asr` | Speech-to-text transcription |
| `translate_zh` | Translate to Chinese |
| `translate_id` | Translate to Indonesian |
| `translate_ms` | Translate to Malay |
| `translate_ta` | Translate to Tamil |
| `sqa` | Spoken question answering (requires `question=`) |
| `summarize` | Dialogue summarization |
| `paralinguistics` | Speaker characteristic analysis |

## Available Models

| Model | Size | RAM | Quality | HuggingFace |
|-------|------|-----|---------|-------------|
| MERaLiON-2-10B-MLX | ~10 GB | 16+ GB | Best | [MERaLiON/MERaLiON-2-10B-MLX](https://huggingface.co/MERaLiON/MERaLiON-2-10B-MLX) |
| MERaLiON-2-3B-MLX | ~6 GB | 8+ GB | Good | [MERaLiON/MERaLiON-2-3B-MLX](https://huggingface.co/MERaLiON/MERaLiON-2-3B-MLX) |

## Features

- **Apple Silicon native**: Runs entirely on MLX with GPU acceleration
- **N-gram blocking**: Repeated n-grams are suppressed during decoding to prevent degenerate repetition, matching HuggingFace reference output quality
- **Smart chunking**: Long audio is split at 30-second boundaries, and a short final tail is merged into the previous chunk to avoid hallucinated output
- **Auto-download**: HuggingFace models are downloaded and cached automatically
- **Multiple tasks**: ASR, translation, QA, summarization, and more
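The n-gram blocking and chunking behaviors above can be sketched in a few lines. This is a minimal illustration: the function names, the default `n=3`, and the 5-second tail threshold are assumptions, not mlx-meralion's actual internals.

```python
def banned_tokens(generated, n=3):
    """Token IDs that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix; any token that previously
    # followed this prefix would close a repeated n-gram, so ban it.
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def chunk_audio(num_samples, sr=16000, chunk_s=30.0, min_tail_s=5.0):
    """Split audio into ~30 s sample spans; merge a short final tail
    into the previous span so the model never sees a tiny fragment."""
    chunk = int(chunk_s * sr)
    bounds = list(range(0, num_samples, chunk)) + [num_samples]
    spans = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    if len(spans) > 1 and spans[-1][1] - spans[-1][0] < int(min_tail_s * sr):
        last = spans.pop()
        prev = spans.pop()
        spans.append((prev[0], last[1]))
    return spans
```

At each decoding step, logits for the banned token IDs are set to negative infinity before sampling, which is the standard `no_repeat_ngram_size` trick.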

## Architecture

```
Audio (WAV/MP3/FLAC)
  -> Whisper Encoder (1280-d)
    -> LayerNorm + MLP Adaptor
      -> Speech embeddings merged into text sequence
        -> Gemma2 Decoder -> text output
```
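The merge step in the pipeline above (speech embeddings spliced into the text token sequence at audio-placeholder positions) can be sketched with NumPy. The function name, shapes, and placeholder mechanism here are illustrative assumptions, not the package's actual API:

```python
import numpy as np


def merge_speech_embeddings(text_emb, speech_emb, placeholder_idx):
    """Replace audio-placeholder token embeddings with speech embeddings.

    text_emb:        (seq_len, d) token embeddings from the decoder's table
    speech_emb:      (n_frames, d) adaptor output, already projected to d
    placeholder_idx: sequence positions reserved for the audio frames
    """
    assert len(placeholder_idx) == speech_emb.shape[0]
    merged = text_emb.copy()
    merged[placeholder_idx] = speech_emb
    return merged
```

The decoder then attends over the merged sequence exactly as it would over ordinary text embeddings.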

## License

MIT
