Metadata-Version: 2.4
Name: mavrl
Version: 0.0.1
Summary: Unified Multi-modal Feedback using Amortized Variational Inference
Home-page: https://github.com/rabaur/umfavi
Author: Raphaël Baur
Author-email: raphaelbaur1@gmail.com
License: MIT
Project-URL: Source, https://github.com/rabaur/umfavi
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.9.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: gymnasium>=0.26.0
Requires-Dist: wandb>=0.13.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: joblib>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Provides-Extra: nontabular
Requires-Dist: stable-baselines3>=2.0.0; extra == "nontabular"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# MAVRL - Unified Multi-modal Feedback using Amortized Variational Inference

This package implements a variational inference approach for learning reward functions from multiple types of feedback (preferences, demonstrations, etc.).

## Repository layout

The repository ships two top-level Python packages:

- **`mavrl/`** — the algorithm itself: encoders, feedback models, datasets,
  losses, environment wrappers, retraining utilities. Importable as
  `import mavrl`.
- **`mavrl_experiments/`** — the infrastructure that *runs* the algorithm:
  Optuna search, distributed file queues, table printers, Slack watchers,
  CLI entry points, and the experiment configs themselves
  (`mavrl_experiments/configs/{experiments,optuna}/`). Importable as
  `import mavrl_experiments` and invoked via `python -m mavrl_experiments.<module>`.

`mavrl_experiments` depends on `mavrl` (one-way); `mavrl` never imports from
`mavrl_experiments`. The split keeps the algorithm package focused and lets
infrastructure evolve without touching algorithm code.

Top-level entry-point scripts (`train.py`, `transfer.py`,
`evaluate_reward_model.py`, `train_online.py`) live at the repo root.

## Installation

Ensure you are running Python 3.11.6. On `Euler`, load the matching module (`python/3.11.6`) using:
```bash
module load stack/2024-06 python/3.11.6
```
Ensure that you are at the root of this project. Create a fresh virtual environment with this exact name:
```bash
python -m venv venv/
```
`.gitignore` will ignore this virtual environment.
Activate the virtual environment:
```bash
source venv/bin/activate
```
Install all required dependencies:
```bash
pip install -r requirements.txt
pip install -e .
```
The first command installs all Python packages except `mavrl`. The second installs an editable version of `mavrl`.
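
To confirm the setup worked, a small check along these lines can help. This is a minimal sketch using only the standard library; the function name `check_environment` is made up for illustration and is not part of `mavrl`.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report the Python version, whether a venv is active, and whether mavrl is importable."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        # Inside a venv, sys.prefix points at the venv while sys.base_prefix
        # points at the interpreter the venv was created from.
        "in_venv": sys.prefix != sys.base_prefix,
        # find_spec returns None (instead of raising) when the package is absent.
        "mavrl_installed": importlib.util.find_spec("mavrl") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

After a successful install with the virtual environment activated, both `in_venv` and `mavrl_installed` should read `True`.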

# Running a single trial
To run a single trial, execute:
```bash
python train.py
```
# Running an experiment
Instead of running just a single trial, you can run a potentially large number of trials through our CLI (`mavrl_experiments.cli`). Here is an overview of the process:

## 1. Specifying all configurations
Specify all experimental configurations using the `ExperimentGrid` class. This will exhaustively run all valid combinations of the specified parameters.
For an example of how to specify a grid of configurations, see `mavrl_experiments/configs/experiments/sweep_grid_trap.py`.
You can specify configurations in four ways:
1. By passing the `base_config` to the `ExperimentGrid` constructor. These are parameters that are shared between all configurations.
2. By adding a parameter sweep with `grid.add`. Values are specified as lists.
3. By adding a conditional parameter with `grid.add_conditional`. Supply a boolean function to the `condition` argument that defines whether a configuration fulfills the condition to contain these parameter values.
4. By removing invalid configurations with `grid.add_validator`.
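
The four mechanisms above can be illustrated with a toy stand-in. This is a sketch under assumptions, not the real `ExperimentGrid`: the class `ToyGrid`, its method signatures, and the parameter names `lr`, `n_pref`, and `pref_noise` are hypothetical and only mimic the described semantics. See `sweep_grid_trap.py` for the actual API.

```python
from itertools import product


class ToyGrid:
    """Hypothetical stand-in mimicking base config, sweeps, conditionals, validators."""

    def __init__(self, base_config):
        self.base_config = dict(base_config)
        self.sweeps = {}        # parameter name -> list of values
        self.conditionals = []  # (name, values, condition) triples
        self.validators = []    # functions config -> bool

    def add(self, name, values):
        self.sweeps[name] = list(values)

    def add_conditional(self, name, values, condition):
        self.conditionals.append((name, list(values), condition))

    def add_validator(self, validator):
        self.validators.append(validator)

    def configs(self):
        names = list(self.sweeps)
        for combo in product(*(self.sweeps[n] for n in names)):
            partial = [{**self.base_config, **dict(zip(names, combo))}]
            # Expand a conditional parameter only for configs that satisfy it.
            for name, values, condition in self.conditionals:
                expanded = []
                for cfg in partial:
                    if condition(cfg):
                        expanded.extend({**cfg, name: v} for v in values)
                    else:
                        expanded.append(cfg)
                partial = expanded
            # Drop configurations rejected by any validator.
            yield from (c for c in partial if all(v(c) for v in self.validators))


grid = ToyGrid(base_config={"env": "grid_trap"})
grid.add("lr", [1e-3, 1e-4])
grid.add("n_pref", [0, 32])
grid.add_conditional("pref_noise", [0.0, 0.1], condition=lambda c: c["n_pref"] > 0)
grid.add_validator(lambda c: not (c["lr"] == 1e-4 and c["n_pref"] == 0))

configs = list(grid.configs())
print(len(configs))  # 5
```

Of the four `lr` x `n_pref` combinations, the two with `n_pref > 0` each expand into two `pref_noise` variants, and the validator removes one of the remaining two, leaving five configurations.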

>**NOTE**: Any paths that are specified in the grid should be absolute paths for the machine that you plan to run the experiment on. Otherwise paths will not be correctly recognized.

Once your grid is set up, populate the database with experiments:
```bash
python -m mavrl_experiments.cli add-grid <your_config_name> --seeds 5
```
This will create a database containing all configuration parameters that will be read out by the workers, but no results yet.
> **NOTE**: Populating the database might take a long time on Euler, while it might only take a few seconds on your local system. Consider populating the database locally and copying it to Euler afterwards.

This command is idempotent: issuing it again does not delete pre-existing entries with equivalent configurations; only new configurations are added.

`--seeds` specifies the number of trials (differing by seed) that are run _per configuration_. So if you have 100 distinct configurations, `--seeds 5` will result in 500 trials.
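
One way to picture the idempotent population and the per-configuration seeds is sketched below. It assumes an SQLite-style table keyed by a hash of each (configuration, seed) pair; the schema, the table name `trials`, and the helpers `config_key` and `add_grid` are hypothetical and may differ from what the CLI actually does.

```python
import hashlib
import json
import sqlite3


def config_key(config: dict, seed: int) -> str:
    """Stable hash of a (config, seed) pair; sort_keys makes it key-order independent."""
    payload = json.dumps({"config": config, "seed": seed}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def add_grid(conn, configs, n_seeds):
    """Insert one pending row per (config, seed); existing rows are left untouched."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "key TEXT PRIMARY KEY, config TEXT, seed INTEGER, status TEXT)"
    )
    for config in configs:
        for seed in range(n_seeds):
            conn.execute(
                "INSERT OR IGNORE INTO trials VALUES (?, ?, ?, 'pending')",
                (config_key(config, seed), json.dumps(config, sort_keys=True), seed),
            )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM trials").fetchone()[0]


conn = sqlite3.connect(":memory:")
configs = [{"lr": lr, "batch_size": bs} for lr in (1e-3, 1e-4) for bs in (32, 64)]
print(add_grid(conn, configs, n_seeds=5))  # 4 configs x 5 seeds = 20 rows
print(add_grid(conn, configs, n_seeds=5))  # still 20: nothing is duplicated
```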

## 2. Checking experiment status
At any point during the experiment, you can check the progress using:
```bash
python -m mavrl_experiments.cli status
```

Since no workers have started yet, you will see something like this:
```
Experiment Queue Status (rb_experiment_001.db)
========================================
  Pending:    22320
  Running:        0
  Completed:      0
  Failed:         0
----------------------------------------
  Total:      22320

  Progress: [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0%
```
If you use a custom database path, remember to pass it to this command as well.

## 3. Submit experiment
Now you can have workers pick up tasks from the queue (see `scripts/` for cluster submission scripts).

# Hyperparameter search (Optuna)

For finding good multi-modal feedback allocations under a fixed sample budget,
use the Optuna-based search in `mavrl_experiments/optuna_search.py`. It samples
a Dirichlet-distributed allocation over modalities (always summing exactly to
`--budget`) and jointly searches over reward-model and PPO retraining
hyperparameters defined in the env config.
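
To make "Dirichlet-distributed allocation that sums exactly to the budget" concrete, here is a minimal standard-library sketch. The rounding scheme (largest remainder) and the function name `sample_allocation` are assumptions; the actual search may draw and round the allocation differently.

```python
import random


def sample_allocation(modalities, budget, alpha=1.0, rng=None):
    """Draw Dirichlet weights and round them to integer counts summing to `budget`."""
    rng = rng or random.Random()
    # A Dirichlet draw is a vector of independent Gamma draws divided by their sum.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in modalities]
    total = sum(gammas)
    shares = [g / total * budget for g in gammas]
    # Largest-remainder rounding: floor every share, then hand the leftover
    # samples to the modalities with the largest fractional parts.
    counts = [int(s) for s in shares]
    leftover = budget - sum(counts)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return dict(zip(modalities, counts))


allocation = sample_allocation(
    ["pref", "demo", "rating", "stop"], budget=256, rng=random.Random(0)
)
print(allocation, sum(allocation.values()))
```

Whatever the random draw, the integer counts add up to the budget, so every trial spends the same total number of feedback samples.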

Configs live at `mavrl_experiments/configs/optuna/<env>.py` (override the root via
`MAVRL_CONFIG_ROOT`). Each config defines:
- `BASE_CONFIG` — fixed parameters,
- `MODALITY_PARAMS` — per-modality hyperparameters applied when that modality
  has samples > 0,
- `HYPERPARAM_SEARCH_SPACE` — the search space (categorical lists or
  `(low, high, log)` continuous ranges),
- `MODALITIES` — the ordered list of modality sample-count keys.
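
A config module with these four attributes might look roughly like the sketch below. Every key and value here is an illustrative assumption (only `retrain_n_timesteps`, `lr`, `batch_size`, and `kl_weight` are names that appear elsewhere in this README); consult an existing config such as `lunar_lander_v3.py` for the real contents. The sampling helper uses plain `random` in place of Optuna's suggestion API, purely to show how the two search-space forms are read.

```python
import math
import random

# Illustrative values only; not the repository's defaults.
BASE_CONFIG = {"retrain_n_timesteps": 1_000_000}
MODALITY_PARAMS = {
    "n_pref_samples": {"pref_segment_length": 32},  # hypothetical keys
    "n_demo_samples": {"demo_weight": 1.0},
}
HYPERPARAM_SEARCH_SPACE = {
    "batch_size": [32, 64, 128],    # categorical list
    "lr": (1e-5, 1e-2, True),       # (low, high, log) continuous range
    "kl_weight": (0.0, 1.0, False),
}
MODALITIES = ["n_pref_samples", "n_demo_samples"]


def sample_hyperparams(space, rng):
    """Draw one value per entry: lists are categorical, tuples are (low, high, log)."""
    sampled = {}
    for name, spec in space.items():
        if isinstance(spec, list):
            sampled[name] = rng.choice(spec)
        else:
            low, high, log = spec
            if log:
                # Sample uniformly in log space, then map back.
                sampled[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
            else:
                sampled[name] = rng.uniform(low, high)
    return sampled


def active_modality_params(allocation):
    """Collect per-modality parameters for modalities with samples > 0."""
    params = {}
    for key in MODALITIES:
        if allocation.get(key, 0) > 0:
            params.update(MODALITY_PARAMS.get(key, {}))
    return params


print(sample_hyperparams(HYPERPARAM_SEARCH_SPACE, random.Random(0)))
print(active_modality_params({"n_pref_samples": 200, "n_demo_samples": 0}))
```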

## Local end-to-end test

The recipe below runs a minimal single-worker search on `lunar_lander_v3`.
A full trial does a complete PPO retrain (1M timesteps by default), which
is slow on a laptop. To iterate faster locally, temporarily add
`"retrain_n_timesteps": 100_000` to `BASE_CONFIG` in
`mavrl_experiments/configs/optuna/lunar_lander_v3.py` (don't commit that — it's just for testing).

```bash
# 1. Pre-generate cached datasets (1 seed is enough for a smoke test).
#    --gen_samples should be >= the budget you plan to test.
python scripts/pregenerate_datasets.py \
    --config lunar_lander_v3 \
    --cache_dir dataset_cache/lander_local \
    --seeds 1 \
    --gen_samples 256 \
    --gen_samples_demo 256

# 2. Run a small search (single worker, few trials, one seed per trial).
python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_local \
    --storage optuna_journal_lander_local.log \
    --env-config lunar_lander_v3 \
    --budget 256 \
    --n-seeds 1 \
    --n-trials 5 \
    --dataset-cache-dir dataset_cache/lander_local

# 3. Inspect the results (passing --env-config enables the normalized-score column).
python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_local \
    --storage optuna_journal_lander_local.log \
    --env-config lunar_lander_v3 \
    --show-results
```

The journal file (`optuna_journal_lander_local.log`) is append-only and
NFS-safe, so re-running step 2 with the same `--study-name` and `--storage`
will continue the same study.

## Cluster submission

`scripts/submit_optuna.sh` runs an Optuna worker as a SLURM array task.
Every array element is an independent worker; they coordinate through a
shared journal file (NFS-safe, append-only), so there is no central
scheduler. Each worker fits its own TPE model from the shared trial
history and proposes its own next trial.

### Prerequisites

1. **Virtual environment.** The script activates `venv/` (or `../venv/`)
   automatically. Create it as described in *Installation*.
2. **Journal directory.** Pick a path on a shared filesystem reachable from
   all compute nodes (e.g. `$SCRATCH/mavrl/optuna_studies/`). The journal
   file will be created on first run.
3. **Dataset cache (recommended).** Pre-generate datasets once so trials
   don't redo expensive sample generation. `--gen_samples` should be at
   least the budget you intend to search:
   ```bash
   python scripts/pregenerate_datasets.py \
       --config lunar_lander_v3 \
       --cache_dir $SCRATCH/mavrl/dataset_cache/lander \
       --seeds 3 \
       --gen_samples 256 \
       --gen_samples_demo 256
   ```
   Use the same `--seeds` value as your trial `N_SEEDS` (workers seed
   trials as `0..N_SEEDS-1`).

### Submission

The script reads its configuration from environment variables. Required:

| Variable | Meaning |
|---|---|
| `STUDY_NAME` | Optuna study name. **Use a fresh name per (metric, direction, budget)** — `load_if_exists=True` silently reuses an existing study's direction. |
| `ENV_CONFIG` | Config name under `mavrl_experiments/configs/optuna/` (e.g. `lunar_lander_v3`). |
| `BUDGET` | Total feedback samples per trial (sum across modalities). |
| `STORAGE_PATH` | Path to the journal `.log` file. |

Optional:

| Variable | Default | Meaning |
|---|---|---|
| `N_SEEDS` | `3` | Seeds evaluated per trial; the trial value is the mean across seeds. |
| `N_TRIALS` | `20` | Trials *per worker*. With a 32-task array, total trials ≈ `32 × N_TRIALS`. |
| `METRIC` | `eval/regret` | Final-evaluation key to optimize (e.g. `eval/mean_rew`, `eval/discounted_value`). |
| `DIRECTION` | `minimize` | `minimize` or `maximize`. Pair with `METRIC` correctly. |
| `SINGLE_MODALITY` | unset | If set to `pref`/`demo`/`rating`/`stop`, the entire `BUDGET` is allocated to that modality. Useful for single-modality baselines. |
| `WANDB_PROJECT` | unset | Log every trial run to this wandb project. |
| `DATASET_CACHE_DIR` | unset | Point trials at a pre-generated dataset cache. |
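
For sizing a submission, the relationship between array size, `N_TRIALS`, and `N_SEEDS` can be worked out as below. This is back-of-the-envelope arithmetic under the description in the table above; the helper names and the per-seed values are made up for illustration.

```python
from statistics import mean


def plan_study(n_workers, n_trials_per_worker, n_seeds):
    """Total trials across the array, and training runs once each trial runs all seeds."""
    total_trials = n_workers * n_trials_per_worker
    return {"total_trials": total_trials, "training_runs": total_trials * n_seeds}


def trial_value(per_seed_values):
    """The value reported for a trial is the mean of its per-seed results."""
    return mean(per_seed_values)


print(plan_study(n_workers=32, n_trials_per_worker=20, n_seeds=3))
# {'total_trials': 640, 'training_runs': 1920}
print(trial_value([10.0, 20.0, 30.0]))  # 20.0
```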

Combined-modality run (Dirichlet allocation across all modalities):

```bash
STUDY_NAME=lander_b256_meanrew \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
METRIC=eval/mean_rew DIRECTION=maximize \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh
```

Single-modality baseline (e.g. all-preferences) under the same budget,
for comparison:

```bash
STUDY_NAME=lander_b256_meanrew_prefonly \
ENV_CONFIG=lunar_lander_v3 \
BUDGET=256 \
STORAGE_PATH=$SCRATCH/mavrl/optuna_studies/lander_b256_meanrew_prefonly.log \
METRIC=eval/mean_rew DIRECTION=maximize \
SINGLE_MODALITY=pref \
N_SEEDS=3 N_TRIALS=20 \
DATASET_CACHE_DIR=$SCRATCH/mavrl/dataset_cache/lander \
sbatch scripts/submit_optuna.sh
```

### Adjusting array size and resources

The script defaults to `--array=0-31` (32 workers), 4 CPUs each,
4 hours wall time. Override at submit time:

```bash
sbatch --array=0-15 --time=08:00:00 scripts/submit_optuna.sh   # 16 workers, 8h
sbatch --array=0-63 --cpus-per-task=8 scripts/submit_optuna.sh # 64 workers, 8 CPUs each
```

Logs land in `logs/slurm/optuna_<jobid>_<taskid>.out|err`.

### Monitoring & inspecting results

While running, the journal file is readable:

```bash
python -m mavrl_experiments.optuna_search \
    --study-name lander_b256_meanrew \
    --storage $SCRATCH/mavrl/optuna_studies/lander_b256_meanrew.log \
    --env-config lunar_lander_v3 \
    --show-results
```

This works mid-run (you'll just see partial results) and after
completion. Passing `--env-config` enables a normalized-score column
when `results/normalization_values.json` has entries for the env.

### Two main tables: equal-budget and fixed-allocation

There are two pre-built launchers that each submit 66 Optuna studies (6 envs
× 11 modality subsets). They answer different questions:

| Launcher                              | Allocation              | Question                                                                 |
|---------------------------------------|-------------------------|--------------------------------------------------------------------------|
| `launch_equal_budget_table.sh`        | Dirichlet over budget   | *Are modalities complementary when you spend a fixed total budget?*      |
| `launch_fixed_allocation_table.sh`    | Prescribed per-modality | *Can MAVRL combine arbitrary offline feedback datasets to produce gains?* |

Both share the same 11-subset layout (`pref`, `demo`, `rating`, `stop`, all
6 pairs, and `pdrs` = all four). The two are designed to live side-by-side
in `$STORAGE_ROOT` — study suffixes differ (`_b<N>` vs `_fixed`), so they
don't collide.
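
The subset count and the two naming suffixes can be checked with a short sketch. The `study_name` helper is an assumption inferred from the study names used in this README (for example `lunar_lander_v3_pdrs_b256` and `grid_trap_pdrs_fixed`); the launchers may build names differently, and the labels used for pairs are not shown here.

```python
from itertools import combinations

MODALITIES = ["pref", "demo", "rating", "stop"]

singles = [(m,) for m in MODALITIES]
pairs = list(combinations(MODALITIES, 2))
all_four = [tuple(MODALITIES)]
subsets = singles + pairs + all_four

n_envs = 6
print(len(singles), len(pairs), len(all_four))  # 4 6 1
print(len(subsets) * n_envs)                    # 66


def study_name(env, subset_label, budget=None):
    """Use the `_b<N>` suffix for equal-budget studies and `_fixed` otherwise."""
    suffix = f"_b{budget}" if budget is not None else "_fixed"
    return f"{env}_{subset_label}{suffix}"


print(study_name("lunar_lander_v3", "pdrs", budget=256))  # lunar_lander_v3_pdrs_b256
print(study_name("grid_trap", "pdrs"))                    # grid_trap_pdrs_fixed
```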

#### 1. Equal-budget table — modality complementarity

For each env, fix a single **total feedback budget** and let Optuna's
Dirichlet allocation split it across whichever modalities are active in
the study. Tests whether two modalities together at total budget `B` beat
the best single modality at `B`.

```bash
# Submit all 66 studies (default per-env budgets: grid=64, control=64, lander=256)
bash scripts/launch_equal_budget_table.sh

# Filter to a subset of envs/subsets / dry-run
ENVS="grid_trap"  SUBSETS="pdrs pref"  bash scripts/launch_equal_budget_table.sh
DRY_RUN=1         bash scripts/launch_equal_budget_table.sh

# Override per-env-group budgets
BUDGET_GRID=128   bash scripts/launch_equal_budget_table.sh
```

Snapshot the current best value of every cell into one printed table
(safe mid-optimization; reads the journal files):

```bash
python -m mavrl_experiments.equal_budget_table \
    --storage-root $SCRATCH/mavrl/optuna_studies
```

Cells render as normalized percentages (`uniform=0%`, `optimal=100%`) when
`results/normalization_values.json` covers the env. Filter with
`--envs grid_cliff lunar_lander_v3` to print a subset of rows.
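
The percentage scale can be read as a linear rescaling between the uniform-policy and optimal-policy values. The sketch below assumes exactly that linear form; the function name and the example numbers are hypothetical, and the layout of `normalization_values.json` is not shown here.

```python
def normalized_score(value, uniform, optimal):
    """Map `uniform` to 0% and `optimal` to 100%, linearly in between."""
    if optimal == uniform:
        raise ValueError("uniform and optimal values must differ")
    return 100.0 * (value - uniform) / (optimal - uniform)


# Made-up reference values for illustration.
print(normalized_score(-200.0, uniform=-200.0, optimal=200.0))  # 0.0
print(normalized_score(100.0, uniform=-200.0, optimal=200.0))   # 75.0
print(normalized_score(200.0, uniform=-200.0, optimal=200.0))   # 100.0
```

Under this reading, a cell below 0% means the retrained policy did worse than a uniform policy, and values above 100% are possible only if the stored optimal value is not a true upper bound.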

#### 2. Fixed-allocation table — gains from heterogeneous offline data

For each env, **prescribe** per-modality sample counts in
`mavrl_experiments/configs/optuna/<env>_fixed.py:FIXED_SAMPLE_COUNTS`. Each study uses
exactly those counts (no Dirichlet, no shared budget); Optuna instead
searches the optimizer/loss hyperparameters that combine the modalities:
`td_error_weight`, `kl_weight`, `use_importance_weights`, `lr`,
`batch_size`, `encoder_hidden_sizes` (and the PPO retraining hparams for
non-tabular envs). Tests the "you have offline data of various kinds
lying around — can our method turn it into a better reward model than any
single-modality alternative?" story.

```bash
# Submit all 66 studies using prescribed counts from <env>_fixed.py
bash scripts/launch_fixed_allocation_table.sh

# Filter / dry-run (same hooks as the equal-budget launcher)
ENVS="grid_trap acrobot_v1"  bash scripts/launch_fixed_allocation_table.sh
DRY_RUN=1                    bash scripts/launch_fixed_allocation_table.sh
```

Default `FIXED_SAMPLE_COUNTS` (small values, totals near a power of 2;
tune in the `<env>_fixed.py` config to match your offline-data scenario):

| env             | pref | demo | rating | stop | total |
|-----------------|-----:|-----:|-------:|-----:|------:|
| grid_*          |   23 |    2 |     23 |   16 |    64 |
| acrobot_v1      |   23 |    2 |     23 |   16 |    64 |
| cartpole_v1     |   23 |    2 |     23 |   16 |    64 |
| lunar_lander_v3 |   92 |    8 |     92 |   64 |   256 |
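
When editing the counts, a quick check that each row still adds up to the intended total can save a wasted submission. The dictionary below simply restates the table; its shape and key names are assumptions and may not match how `FIXED_SAMPLE_COUNTS` is written in the `<env>_fixed.py` configs.

```python
# Values copied from the table above; the dict layout is hypothetical.
COUNTS_BY_ENV = {
    "grid_*": {"pref": 23, "demo": 2, "rating": 23, "stop": 16},
    "acrobot_v1": {"pref": 23, "demo": 2, "rating": 23, "stop": 16},
    "cartpole_v1": {"pref": 23, "demo": 2, "rating": 23, "stop": 16},
    "lunar_lander_v3": {"pref": 92, "demo": 8, "rating": 92, "stop": 64},
}
EXPECTED_TOTALS = {
    "grid_*": 64,
    "acrobot_v1": 64,
    "cartpole_v1": 64,
    "lunar_lander_v3": 256,
}

totals = {env: sum(counts.values()) for env, counts in COUNTS_BY_ENV.items()}
for env, total in totals.items():
    status = "ok" if total == EXPECTED_TOTALS[env] else "MISMATCH"
    print(f"{env:16s} total={total:4d} {status}")
```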

To inspect any individual study's best trial (works for both tables):

```bash
python -m mavrl_experiments.optuna_search \
    --study-name grid_trap_pdrs_fixed \
    --storage $SCRATCH/mavrl/optuna_studies/grid_trap/grid_trap_pdrs_fixed.log \
    --env-config grid_trap_fixed --show-results
```

### Plotting a study

`scripts/plot_optuna_study.py` writes interactive Plotly HTML files
(optimization history, param importances, slice, parallel coordinates,
contour) under `figures/optuna/<study_name>/`. Safe to run mid-study —
the journal backend tolerates concurrent reads.

```bash
# Equal-budget joint study, lunar_lander_v3 (pdrs at budget 256)
python scripts/plot_optuna_study.py \
    --study-name lunar_lander_v3_pdrs_b256 \
    --storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3
```

Substitute the study name to plot any other env / subset / budget. To
sweep all five "tracked" subsets for one env quickly:

```bash
for sub in pref demo rating stop pdrs; do
    python scripts/plot_optuna_study.py \
        --study-name lunar_lander_v3_${sub}_b256 \
        --storage-dir $SCRATCH/mavrl/optuna_studies/lunar_lander_v3
done
```

Then `scp` the `figures/optuna/` tree back to your laptop and open the
HTMLs in a browser. The optimization-history plot is usually the most
informative for "is the search still improving or has it plateaued."

### Resuming and adding more trials

To add more trials to an existing study, resubmit with the **same**
`STUDY_NAME` and `STORAGE_PATH`. Workers will load the existing study
(`load_if_exists=True`), fit TPE on the existing history, and append
new trials. The original direction/metric is preserved — you cannot
change them mid-study; start a fresh study instead.

### Tips

- Test the configuration locally with `--n-trials 1 --n-seeds 1` before
  submitting an array job. Most config errors (typos, missing
  policies, invalid hyperparam ranges) surface in the first trial.
- The first few trials in any new study are random startup samples
  (`n_startup_trials`); TPE only kicks in after enough completed
  trials are visible across all workers.
- Slurm logs print the resolved per-trial allocation as
  `Allocation: {...}` at the end of each `--show-results` invocation,
  which is the most useful artifact for downstream sweeps.

