Metadata-Version: 2.4
Name: pitchbench
Version: 0.1.6
Summary: Benchmark suite for evaluating pitch and acoustic perception in audio language models
Author-email: pitchbench-authors <lofi123woooo@gmail.com>
License: MIT
Keywords: audio,language-model,benchmark,pitch,music,ALM
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=2.0
Requires-Dist: soundfile>=0.12
Requires-Dist: matplotlib>=3.9
Requires-Dist: requests>=2.32
Requires-Dist: openai>=1.0
Requires-Dist: scikit-learn>=1.4
Requires-Dist: python-dotenv>=1.0
Requires-Dist: music21>=9.9.1
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=15
Requires-Dist: pedalboard>=0.9
Requires-Dist: pyfluidsynth>=1.3
Requires-Dist: pretty-midi>=0.2
Requires-Dist: scipy>=1.13
Requires-Dist: tqdm>=4.66
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-mock>=3.14; extra == "dev"
Dynamic: license-file

# PitchBench -- Python Package

Benchmark suite for evaluating pitch and acoustic perception in audio language models (ALMs). Probes pitch identification, temporal localisation, chord recognition, melodic contour, robustness to audio effects, and more — reporting per-format accuracy (MIDI, SPN, doremi, Hz) to expose where verbal decoding fails.

---

## Setup

### 1. Install FluidSynth

| Platform | Command |
|----------|---------|
| Linux / WSL | `sudo apt install fluidsynth` |
| macOS | `brew install fluid-synth` |
| Windows | `choco install fluidsynth` or download from [fluidsynth.org](https://www.fluidsynth.org) |

A GM soundfont is also required. Place any `.sf2` or `.sf3` in `data/soundfonts/` — no further configuration needed.

**Linux / WSL:**
```bash
sudo apt install fluid-soundfont-gm
mkdir -p data/soundfonts
cp /usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/
```

**macOS / cross-platform:**
```bash
curl -L "http://ftp.debian.org/debian/pool/main/f/fluid-soundfont/fluid-soundfont-gm_3.1-5.3_all.deb" \
     -o /tmp/fluid.deb
cd /tmp && ar x fluid.deb && tar xf data.tar.* ./usr/share/sounds/sf2/FluidR3_GM.sf2
mkdir -p data/soundfonts && cp /tmp/usr/share/sounds/sf2/FluidR3_GM.sf2 data/soundfonts/
```

SF2 resolution order: `PITCHBENCH_SF2` env var → `data/soundfonts/` → `/usr/share/sounds/sf2/FluidR3_GM.sf2`.
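
The resolution order above can be sketched as a small lookup function. This is an illustrative sketch, not the package's actual code: `resolve_soundfont` is a hypothetical name, and two behaviours are assumptions (an override that points to a missing file falls through to the next step, and the alphabetically first file in `data/soundfonts/` wins).

```python
import os
from pathlib import Path

# System-wide location used by the fluid-soundfont-gm package on Debian/Ubuntu.
SYSTEM_SF2 = Path("/usr/share/sounds/sf2/FluidR3_GM.sf2")


def resolve_soundfont(project_root=".", env=None, system_sf2=SYSTEM_SF2):
    """Return the first soundfont found in the documented order, else None.

    Hypothetical helper: the name and signature are not part of pitchbench.
    """
    env = os.environ if env is None else env

    # 1. Explicit override via PITCHBENCH_SF2.
    #    Assumption: a path that does not exist is ignored rather than fatal.
    override = env.get("PITCHBENCH_SF2")
    if override and Path(override).is_file():
        return Path(override)

    # 2. Any .sf2 / .sf3 file in data/soundfonts/.
    #    Assumption: candidates are sorted so the choice is deterministic.
    sf_dir = Path(project_root) / "data" / "soundfonts"
    if sf_dir.is_dir():
        candidates = sorted(
            p for p in sf_dir.iterdir()
            if p.is_file() and p.suffix.lower() in {".sf2", ".sf3"}
        )
        if candidates:
            return candidates[0]

    # 3. System-wide fallback.
    if Path(system_sf2).is_file():
        return Path(system_sf2)

    return None
```

Calling `resolve_soundfont()` from the project root is a quick way to check which file a lookup of this shape would pick up before starting a long generation run.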

### 2. Install PitchBench

```bash
python3 -m venv .venv
source .venv/bin/activate        # Linux / macOS
# .venv\Scripts\activate         # Windows
pip install pitchbench
```

For development tools:
```bash
uv sync --extra dev              # installs pytest + pytest-mock
```

Requires Python ≥ 3.12.

### 3. Configure a model backend

**OpenRouter** — add to `.env`:
```
OPENROUTER_KEY=sk-or-...
```
Use `--model openrouter/<provider>/<model>`. "Latest" alias slugs require a `~` prefix, e.g. `openrouter/~google/gemini-flash-latest`.

**DashScope** — add to `.env`:
```
DASHSCOPE_API_KEY=sk-...
```
Use `--model dashscope/<model>`.

**Local server** — included in the base `pip install pitchbench` dependencies:
```
--model http://localhost:8001 --name my-local-model
```
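
The three `--model` forms above differ only in their prefix, so a dispatcher could tell them apart roughly as follows. This is a sketch under assumptions: `ModelSpec` and `parse_model_spec` are hypothetical names, and the real routing in `model/dispatcher.py` may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    backend: str  # "openrouter", "dashscope", or "local"
    model: str    # provider-side model id, or the server URL for "local"


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a --model value into backend and model id (hypothetical helper)."""
    # A bare URL is treated as a local server.
    if spec.startswith(("http://", "https://")):
        return ModelSpec("local", spec.rstrip("/"))

    # Otherwise the first path segment names the hosted backend; the rest is
    # passed through untouched, including any "~" alias prefix.
    backend, sep, model = spec.partition("/")
    if backend in {"openrouter", "dashscope"} and sep and model:
        return ModelSpec(backend, model)

    raise ValueError(f"unrecognised model spec: {spec!r}")
```

For example, `parse_model_spec("openrouter/~google/gemini-flash-latest")` yields the backend `openrouter` and the model id `~google/gemini-flash-latest`.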

---

## Usage

PitchBench separates stimulus generation from model evaluation. Generate once, evaluate any number of models.

### Generate

```bash
pitchbench generate all
pitchbench generate a          # single category
pitchbench generate a1         # single experiment
```

Writes WAV files and a Parquet dataset to `data/generated/<exp>/`. Re-running is safe: existing rows are skipped. The default experiment parameters generate more than 100,000 audio files, which can take several hours.

To extend the dataset, widen the parameter coverage with a custom experiment config:

```bash
pitchbench generate all --source your_config.py
```
An example config is available in the PitchBench GitHub repository.

Each generated experiment directory is laid out as follows:

```
data/generated/pitchbench_a1_single_pitch_id/
  *.wav               # one file per condition
  _questions.parquet  # HuggingFace-compatible schema
  _questions.csv      # human-readable companion
```
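
A quick sanity check after generation is to count the WAV files and the rows in the CSV companion. The sketch below uses only the standard library and makes no assumption about column names; `summarise_experiment` is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path


def summarise_experiment(exp_dir):
    """Count WAV files and question rows in one generated experiment directory."""
    exp_dir = Path(exp_dir)
    wav_count = sum(1 for _ in exp_dir.glob("*.wav"))

    columns, row_count = [], 0
    csv_path = exp_dir / "_questions.csv"
    if csv_path.is_file():
        with csv_path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            # Reading fieldnames consumes the header row only.
            columns = list(reader.fieldnames or [])
            row_count = sum(1 for _ in reader)

    return {"wav_files": wav_count, "question_rows": row_count, "columns": columns}


if __name__ == "__main__":
    print(summarise_experiment("data/generated/pitchbench_a1_single_pitch_id"))
```

The two counts need not match, since the README does not state how questions map to audio files; the summary is meant for spotting empty or partially generated directories.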

### Evaluate

```bash
pitchbench --list
pitchbench evaluate all --model openrouter/<provider>/<model>
pitchbench evaluate a1  --model openrouter/<provider>/<model>

# Quick test (20 stimuli, stratified)
pitchbench evaluate a1  --model openrouter/<provider>/<model> --sample-n 20 --sample-seed 0
```

Results land in `results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/`.
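
To illustrate what a seeded, stratified quick test such as `--sample-n 20 --sample-seed 0` involves, here is a minimal sketch. It is not the implementation in `helpers/sampling.py`: the stratification key, the equal-allocation round-robin, and the name `stratified_sample` are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Pick up to n items, spread as evenly as possible across strata.

    `key` maps an item to its stratum label (labels must be sortable).
    """
    rng = random.Random(seed)  # local RNG keeps the draw reproducible

    # Group items by stratum; the input list itself is not modified.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Shuffle inside each stratum, visiting strata in a fixed order.
    groups = [strata[label] for label in sorted(strata)]
    for group in groups:
        rng.shuffle(group)

    # Round-robin: take one item per stratum per pass until n are picked
    # or every stratum is exhausted.
    picked = []
    depth = 0
    while len(picked) < n and any(depth < len(g) for g in groups):
        for group in groups:
            if depth < len(group) and len(picked) < n:
                picked.append(group[depth])
        depth += 1
    return picked
```

With four instruments of ten stimuli each, `n=20` returns five stimuli per instrument, and the same seed always returns the same selection.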

In tmux (recommended for long runs):
```bash
tmux new-session -d -s mymodel "cd /path/to/PitchBench && source .venv/bin/activate && \
  pitchbench evaluate all --model openrouter/<provider>/<model> 2>&1 | tee logs/eval_mymodel_\$(date +%Y%m%d_%H%M%S).log"
```

### Analyze

```bash
pitchbench analyze --preset q1 --model openrouter/<provider>/<model>
```

Writes to `results/analysis/<model_slug>/<run>/` and generates a summary of the ablations.

**Batch (multiple models):**
```bash
python -m pitchbench.analysis.run_analysis q1 \
    --models openrouter/<provider>/<model1> \
             openrouter/<provider>/<model2> \
             dashscope/<model3>
```

---

## Output

```
results/evaluation/<model_slug>/<YYYYMMDD_HHMMSS>/
  pitchbench_<exp>/
    results_<model>.json      # metadata, git commit, per-item responses
    results_<model>.csv       # one row per stimulus
    accuracies_<model>.csv    # aggregate metrics
    plots/                    # per-format CI plots
```

`results/evaluation/summary/` is populated automatically after each run (`aggregated_accuracies.csv`, accuracy plots by instrument / notation / note).
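
Because run directories are named `YYYYMMDD_HHMMSS`, sorting their names lexicographically also sorts them chronologically. The sketch below uses that to find the most recent run and its accuracy files. Both function names are hypothetical, and the model slug is passed in as-is because the README does not specify how a `--model` value is turned into a slug.

```python
import re
from pathlib import Path

# Matches timestamped run directories such as 20250302_080000.
RUN_DIR = re.compile(r"^\d{8}_\d{6}$")


def latest_run(results_root, model_slug):
    """Return the newest run directory for a model, or None if there is none."""
    model_dir = Path(results_root) / "evaluation" / model_slug
    if not model_dir.is_dir():
        return None
    runs = sorted(
        p for p in model_dir.iterdir()
        if p.is_dir() and RUN_DIR.match(p.name)
    )
    return runs[-1] if runs else None


def accuracy_files(run_dir):
    """List the aggregate-metric CSVs of every experiment in one run."""
    return sorted(Path(run_dir).glob("pitchbench_*/accuracies_*.csv"))
```

The returned paths can be fed straight into `pandas.read_csv` for ad-hoc comparison across experiments.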

---

## Experiment categories

32 experiment IDs across 7 categories.

### Category A — Single-pitch identification

| ID | Script | What it tests |
|----|--------|---------------|
| a1 | `single_pitch_id` | Baseline: identify one sustained note across all sources and formats |
| a2 | `single_pitch_by_loudness` | Pitch accuracy under different loudness levels |
| a3 | `single_pitch_by_duration` | Pitch accuracy under different durations |
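
Three of the answer formats (MIDI, SPN, Hz) are related by standard formulas, sketched below for reference. The sketch assumes 12-tone equal temperament with A4 = 440 Hz and sharps-only spelling; the package's own converters in `helpers/music.py` may use other conventions. The doremi format is left out because the README does not specify its syllable convention.

```python
import math
import re

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
_SPN = re.compile(r"^([A-G]#?)(-?\d+)$")


def midi_to_hz(midi, a4_hz=440.0):
    """Frequency of a MIDI note number (A4 = MIDI 69)."""
    return a4_hz * 2.0 ** ((midi - 69) / 12.0)


def hz_to_midi(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number for a frequency."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))


def midi_to_spn(midi):
    """Scientific pitch notation, e.g. 60 -> 'C4' (MIDI 0 is C-1)."""
    octave, pitch_class = divmod(midi, 12)
    return f"{NOTE_NAMES[pitch_class]}{octave - 1}"


def spn_to_midi(name):
    """Inverse of midi_to_spn for sharps-only names such as 'F#3'."""
    match = _SPN.match(name)
    if match is None:
        raise ValueError(f"not a sharps-only SPN name: {name!r}")
    pitch_class = NOTE_NAMES.index(match.group(1))  # ValueError for E#, B#
    return (int(match.group(2)) + 1) * 12 + pitch_class
```

Under these conventions, MIDI 60, `C4`, and roughly 261.63 Hz all name the same pitch.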

### Category B — Time-localised pitch

| ID | Script | What it tests |
|----|--------|---------------|
| b1 | `single_pitch_within_silence` | Identify a single note embedded in an otherwise silent clip |
| b2 | `pitch_at_timestamp` | Which pitch is playing at a queried timestamp in a multi-note sequence |
| b3 | `timestamp_single_pitch` | Detect when a single note starts/ends |
| b4 | `timestamp_specific_pitch` | Onset/offset of a named target pitch among distractors |
| b5 | `timestamp_multiple_pitches` | Full timing transcription of all notes in a sequence |

### Category C — Chords and simultaneous pitches

| ID | Script | What it tests |
|----|--------|---------------|
| c1 | `chord_count_pitches` | Count distinct pitches in a chord (dyads, triads, 7ths) |
| c2 | `chord_dyad_interval` | Name the interval (semitone count) between two simultaneous tones |
| c3 | `chord_quality` | Name chord quality and/or root+quality from a sounding chord |
| c4 | `chord_pitches` | List every note in a chord |

### Category D — Sequences, contour, intervals

| ID | Script | What it tests |
|----|--------|---------------|
| d1 | `sequence_count_pitches` | Count distinct pitches in a sequential passage |
| d2 | `dyad_lower_higher_difference` | Binary higher/lower judgment |
| d3 | `contour_discrete` | Output up/down tokens for each step-wise transition |
| d4 | `contour_continuous` | Output up/down tokens for each monotonic movement |
| d5 | `sequence_ranking_by_pitch` | Rank sequential tones from lowest to highest |
| d6 | `sequence_dyad_interval` | Name the melodic interval between two sequential notes |
| d7a | `pitch_with_reference` | Pitch identification given a labelled reference tone (variant used in the evaluation) |
| d7b | `pitch_with_reference_split` | Same, but reference and target in separate audio clips |
| d7c | `pitch_with_reference` | Variant with different set of pitches |
| d7d | `pitch_with_reference_split` | Split-clip variant with a different set of pitches |
| d8 | `sequence_pitches` | Transcribe all pitches in a note sequence in order |
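
The difference between the two contour tasks (d3 and d4) can be illustrated with MIDI note numbers. This sketch reflects one reading of the table, namely that d3 expects one token per note-to-note transition and d4 one token per monotonic run. The function names and the handling of repeated notes (skipped here) are assumptions, not the benchmark's scoring code.

```python
def discrete_contour(midi_notes):
    """One 'up'/'down' token per transition between consecutive notes.

    Assumption: a repeated note produces no token.
    """
    tokens = []
    for prev, curr in zip(midi_notes, midi_notes[1:]):
        if curr > prev:
            tokens.append("up")
        elif curr < prev:
            tokens.append("down")
    return tokens


def continuous_contour(midi_notes):
    """One token per monotonic run: consecutive same-direction moves merge."""
    merged = []
    for token in discrete_contour(midi_notes):
        if not merged or merged[-1] != token:
            merged.append(token)
    return merged
```

For the sequence `[60, 62, 64, 63, 65]`, the discrete contour is `up, up, down, up`, while the continuous contour collapses the first two steps into a single `up`.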

### Category E — Robustness

| ID | Script | What it tests |
|----|--------|---------------|
| e1 | `audio_effects` | Pitch under effects like high-pass/low-pass filtering, distortion, reverb, chorus |
| e2 | `background` | Pitch over real-world backgrounds (crowd, rain, bells, street) at varying loudness levels |
| e3 | `harmonic_saturation` | Pitch with harmonic saturation / overdrive |
| e4 | `time_stretching` | Pitch with time-stretching (speed change without pitch shift) |
| e5 | `vibrato` | Pitch with vibrato at varying rates and depths |
| e6 | `slightly_off` | Detuned tones (up to 45 % of a semitone) |
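
For e6, a detuning of 45 % of a semitone corresponds to 45 cents, since one equal-tempered semitone is 100 cents. A minimal sketch of the frequency shift, with `detune` as a hypothetical helper name:

```python
def detune(freq_hz, cents):
    """Shift a frequency by a signed offset in cents (1200 cents = 1 octave)."""
    return freq_hz * 2.0 ** (cents / 1200.0)


# A4 raised by 45 cents lands a little under halfway to A#4 (about 466.16 Hz).
sharp_a4 = detune(440.0, 45)
print(f"{sharp_a4:.2f} Hz")
```

Because the offset stays below 50 cents, the nearest equal-tempered pitch is still the original note, so a stimulus of this kind keeps a single unambiguous correct answer.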

### Category F — Polyphony

| ID | Script | What it tests |
|----|--------|---------------|
| f1 | `melodic_line_atonal` | Transcribe one designated voice from 2–3 simultaneous synthetic voices (atonal) |
| f2 | `melodic_line_tonal` | Same task over tonal melodic material from Bach chorales |

### Category Y — Format variants

| ID | Script | What it tests |
|----|--------|---------------|
| y1 | `single_pitch_id_mcq` | MCQ variant of a1: pick the correct pitch from labelled options |

---

## Stimulus engine

- **Waveforms** (always available): `sine`, `sawtooth`, `square`, `triangle`
- **GM instruments** (FluidSynth): `piano`, `electric_keyboard`, `guitar`, `flute`, `trumpet`, `trombone`, `clarinet`, `oboe`, `violin`, `cello`, `organ`, `bass`, `synth_lead`, `synth_pad`, `voice`
- **Backgrounds** (e2): `white_noise` synthesised; real recordings in `data/preloaded/background/<name>.mp3`
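
For readers unfamiliar with the four basic waveforms, the sketch below generates them in pure Python. It is a naive, non-band-limited illustration: `waveform_sample` and `render` are hypothetical names, the sample rate and amplitude defaults are assumptions, and the synthesis in `sound/engine.py` may differ (for instance by limiting aliasing).

```python
import math


def waveform_sample(shape, phase):
    """One sample in [-1, 1] for a phase given in cycles (0 <= phase < 1)."""
    if shape == "sine":
        return math.sin(2.0 * math.pi * phase)
    if shape == "sawtooth":
        return 2.0 * phase - 1.0                # ramps from -1 up to +1
    if shape == "square":
        return 1.0 if phase < 0.5 else -1.0     # 50 % duty cycle
    if shape == "triangle":
        return 1.0 - 4.0 * abs(phase - 0.5)     # -1 -> +1 -> -1 over a cycle
    raise ValueError(f"unknown waveform: {shape!r}")


def render(shape, freq_hz, duration_s, sample_rate=44100, amplitude=0.5):
    """Render a steady tone as a list of float samples."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * waveform_sample(shape, (freq_hz * i / sample_rate) % 1.0)
        for i in range(n_samples)
    ]
```

The resulting list can be written to disk with `soundfile.write`, which is already a dependency of the package.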

---

## Repository layout

```
src/pitchbench/
  config.py                    # runtime constants and env wiring
  configs/
    benchmark_config.py        # paper-locked data-gen parameters
    analysis_config.py         # analysis-mode presets/overrides
    plot_config.py             # labels/colors for figures
    user_config.py             # user-overridable settings
  sound/
    engine.py                  # central audio synthesis engine
  model/
    query.py                   # ALM query facade (local/OpenRouter/DashScope)
    dispatcher.py              # provider-aware request dispatch
    cost.py                    # API usage/cost tracking
  experiments/
    run.py                     # pitchbench CLI
    scripts/                   # one .py per experiment (32 total)
    helpers/
      cat_a.py ... cat_f.py    # per-category generate + evaluate helpers
      audit.py                 # condition/result audit utilities
      data.py                  # Parquet I/O
      music.py                 # pitch/notation conversion helpers
      plots.py                 # experiment plotting helpers
      results.py               # result writers, aggregators, summary CSVs
      sampling.py              # stratified sampling
      setup.py                 # experiment setup helpers
      timing_layout.py         # timing-grid utilities
  analysis/
    analyze_a1.py              # a1 line plots + heatmaps
    analyze.py                 # core analysis pipeline
    a1.py                      # alternate A1 analysis entrypoint
    ablation.py                # ablation summaries
    combine.py                 # combine multi-run CSV outputs
    overview.py                # overview plots/tables
    run_analysis.py            # batch analysis CLI
data/
  preloaded/                   # background recordings (gitignored)
  generated/                   # created by `pitchbench generate`
results/                       # created by evaluate/analyze runs
```

Set `PITCHBENCH_ROOT` to override the project root for `data/` and `results/`.

---

## Environment variables

| Variable | Default | Purpose |
|----------|---------|---------|
| `PITCHBENCH_ROOT` | `.` | Project root (data/, results/) |
| `PITCHBENCH_SF2` | auto-discovered | GM soundfont path override |
| `PITCHBENCH_LOCAL_URL` | `http://localhost:8001` | Default local model server |
| `OPENROUTER_KEY` | — | OpenRouter API key |
| `DASHSCOPE_API_KEY` | — | DashScope API key |
| `PITCHBENCH_CONCURRENCY_OPENROUTER` | `20` | Max parallel OpenRouter calls |
| `PITCHBENCH_CONCURRENCY_DASHSCOPE` | `10` | Max parallel DashScope calls |
| `LOCAL_CONCURRENCY` | `2` | Max parallel local server calls |
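
The defaults in the table can be mirrored in a small settings loader, which is handy when scripting around the CLI. The sketch below is not the code in `config.py`: `load_settings` is a hypothetical helper, and API keys are deliberately left out since they have no default.

```python
import os

# Variable names and defaults copied from the table above.
DEFAULTS = {
    "PITCHBENCH_ROOT": ".",
    "PITCHBENCH_LOCAL_URL": "http://localhost:8001",
    "PITCHBENCH_CONCURRENCY_OPENROUTER": "20",
    "PITCHBENCH_CONCURRENCY_DASHSCOPE": "10",
    "LOCAL_CONCURRENCY": "2",
}


def load_settings(env=None):
    """Merge environment overrides with the documented defaults."""
    env = os.environ if env is None else env
    merged = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "root": merged["PITCHBENCH_ROOT"],
        "local_url": merged["PITCHBENCH_LOCAL_URL"],
        "concurrency": {
            "openrouter": int(merged["PITCHBENCH_CONCURRENCY_OPENROUTER"]),
            "dashscope": int(merged["PITCHBENCH_CONCURRENCY_DASHSCOPE"]),
            "local": int(merged["LOCAL_CONCURRENCY"]),
        },
    }
```

Passing a plain dictionary as `env` makes the loader easy to exercise in tests without touching the real environment.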

---

## Tests

```bash
uv run pytest
```

77 tests covering CLI resolution, Parquet I/O, result aggregation, sampling, and API routing. No network calls or audio generation required.

## License

MIT — see [LICENSE](LICENSE).

