Metadata-Version: 2.4
Name: unstable-rl
Version: 0.1.0
Summary: An Async Online Multi-Agent RL library for training reasoning models on TextArena games.
Author-email: Leon Guertler <Guertlerlo@cfar.a-star.edu.sg>
License: MIT License
        
        Copyright (c) 2025 LeonGuertler
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/LeonGuertler/UnstableBaselines
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: textarena
Requires-Dist: wandb
Requires-Dist: vllm
Requires-Dist: ray>=2.43.0
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: peft
Requires-Dist: trueskill
Dynamic: license-file

# unstable-baselines

[![Status](https://img.shields.io/badge/status-WIP-orange?style=for-the-badge&label=Project%20Status)](#)
[![TextArena](https://img.shields.io/badge/TextArena-v0.6.9-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/LeonGuertler/TextArena)
[![Discord](https://img.shields.io/discord/1257951838322561075?color=7289DA&label=TextArena%20Discord&logo=discord&logoColor=white&style=for-the-badge)](https://discord.gg/KPacHzK23e)

[Structure](#structure) | [Installation](#installation) | [Example](#example) | [Collaboration](#collaboration) | [Citation](#paper--citation)

## Updates
* 23/06/2025: Early release of the pip package (`pip install UnstableBaselines`)
* 22/06/2025: Early release of the code base


## Introduction
> **unstable‑baselines** is an **experimental, asynchronous, online reinforcement‑learning framework**
> for rapid prototyping of *multi‑turn / multi‑agent* algorithms on
> [TextArena](https://github.com/LeonGuertler/TextArena) environments.
>
> We tried to keep the code as straightforward as possible. It is currently around 1.2K lines long and semi-readable.
>
> The main focus of unstable-baselines is to enable fast prototyping and research. For something more production-ready, we recommend [oat](https://github.com/sail-sg/oat) or [verifiers](https://github.com/willccbb/verifiers).
>
> **Work in progress — interfaces will change.**

## Key Features
* **Asynchronous collection & learning** – actors generate data while learners train.
* **Multi‑agent, multi‑turn** focus with self‑play or fixed opponents.
* **LoRA‑first** fine‑tuning workflow for fast, lightweight updates.
* **Composable reward transforms** at step, final, and sampling stages.


## Structure
```
    ┌───────────────┐                           ┌───────────────┐                           ┌───────────────┐
    │               │    Register new lora      │               │        Get Loss &         │               │
    │   Model Pool  │◀──────────────────────────│    Learner    │◀─────────────────────────▶│   Algorithm   │
    │               │       checkpoint          │               │      update weights       │               │
    └───────────────┘                           └───────────────┘                           └───────────────┘ 
           ▲ │                                         ▲ │ 
           │ │ Sample                        If enough │ │ Check if enough
    Update │ │ Opponent                     data, pull │ │ data for training
 Trueskill │ │                          the next batch │ │ is available
           │ ▼                                         │ ▼
    ┌───────────────┐                           ┌───────────────┐                      
    │               │     Process and store     │               │                      
    │   Collector   │──────────────────────────▶│   StepBuffer  │                      
    │               │  collected Trajectories   │               │                      
    └───────────────┘                           └───────────────┘                      
           ▲ │                      
           │ │ Maintain     
    return │ │ Pool of 
Trajectory │ │ n parallel      
           │ │ workers   
           │ ▼
     ┌─────────────┐
     │  run_game() │
     │  train/eval │
     └─────────────┘
```

## Installation

```bash
# build TextArena v0.6.9 (until it’s on PyPI)
git clone https://github.com/LeonGuertler/TextArena.git
cd TextArena
git checkout v0.6.9
pip install -e .
cd ..

# install UnstableBaselines
pip install UnstableBaselines
```

## Example
To get you started, this short example walks through training `Qwen3-1.7B-Base` via **mirror self-play** on _SimpleTak_ and evaluating it against `google/gemini-2.0-flash-lite-001` on _SimpleTak_ and _KuhnPoker_. We run the experiments on 3x RTX 6000 Ada GPUs. If you are limited to 24 GB of VRAM, you can reduce `MAX_TRAIN_SEQ_LEN` to around _2500_. The model is then trained only on the first 2500 prompt+answer tokens, but it can still generate answers that are longer than that. Since, in our experience, models tend to shorten their reasoning over the course of training, this works very well.
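
To make the effect of capping the training length concrete, here is a minimal sketch of the idea. The helper `truncate_for_training` and the toy token ids are hypothetical and not part of the library; the actual learner may truncate differently. Generation is untouched: only the tokens used for the update are cut.

```python
from typing import Optional, Sequence


def truncate_for_training(token_ids: Sequence[int], max_len: Optional[int]) -> list[int]:
    """Keep only the first `max_len` prompt+answer tokens (hypothetical helper)."""
    if max_len is None:
        # No cap: train on the full sequence.
        return list(token_ids)
    return list(token_ids[:max_len])


# Toy rollout of 4096 prompt+answer tokens, trained on with a 2500-token cap.
rollout = list(range(4096))
train_tokens = truncate_for_training(rollout, max_len=2500)
print(len(rollout), len(train_tokens))
```

The rollout keeps its full length; only the copy handed to the training step is shortened, which is what lowers the memory needed for the backward pass.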


### Training script

```python
import ray, unstable
import unstable.reward_transformations as retra

ray.init(namespace="unstable")

tracker = unstable.Tracker.options(name="Tracker").remote(run_name="demo", wandb_project="UB")

step_buffer = unstable.StepBuffer.options(name="StepBuffer").remote(
    max_buffer_size=768, 
    tracker=tracker,
    final_reward_transformation=retra.ComposeFinalRewardTransforms([retra.RoleAdvantageByEnvFormatter()]),
    step_reward_transformation=retra.ComposeStepRewardTransforms([retra.RewardForFormat(1.5), retra.PenaltyForInvalidMove(1.0, -1.0)]),
    sampling_reward_transformation=retra.ComposeSamplingRewardTransforms([retra.NormalizeRewardsByEnv(True)]),
)

model_pool = unstable.ModelPool.options(name="ModelPool").remote(sample_mode="mirror", max_active_lora=3, tracker=tracker)
ray.get(model_pool.add_checkpoint.remote(path=None, iteration=-1)) # set initial checkpoint as no LoRA

lora_cfg = {
    "lora_rank": 32, "lora_alpha": 32, "lora_dropout": 0.0,
    "target_modules": ["q_proj","k_proj","v_proj","o_proj","gate_proj", "up_proj","down_proj"]
}
collector = unstable.Collector.options(name="Collector").remote(
    num_actors=2, 
    step_buffer=step_buffer, 
    model_pool=model_pool, 
    tracker=tracker,
    vllm_config={
        "model_name": "Qwen/Qwen3-1.7B-base", 
        "max_parallel_seq": 128,
        "max_tokens": 4096, 
        "max_loras": 5, 
        "lora_config": lora_cfg, 
        "max_model_len": 8192
    },
    training_envs=[("SimpleTak-v0-train", 2, "qwen3-zs")], # (env-id, num players, prompt template)
    evaluation_envs=[("SimpleTak-v0-train", 2, "qwen3-zs"), ("KuhnPoker-v0-train", 2, "qwen3-zs")],
    evaluation_opponent="google/gemini-2.0-flash-lite-001",
)

learner = unstable.StandardLearner.options(num_gpus=1, name="Learner").remote(
    model_name="Qwen/Qwen3-1.7B-base", 
    step_buffer=step_buffer,
    model_pool=model_pool,
    tracker=tracker,
    algorithm=unstable.algorithms.Reinforce(),
    batch_size=384,
    mini_batch_size=1,
    learning_rate=1e-5,
    grad_clip=0.2,
    lora_cfg=lora_cfg,
    activation_checkpointing=False,
    gradient_checkpointing=False,
    max_train_len=None, # always train on the full sequence
    max_generation_len=4096, # important for Dr. GRPO
)

# start the collection and training loops
collector.collect.remote(num_workers=384, num_eval_workers=16)  
ray.get(learner.train.remote(200)) # total update steps
```
In a nutshell, the Collector keeps `384` collection games and `16` evaluation games running in parallel. Whenever a game finishes, its trajectory is passed to the StepBuffer and a new game is started. The StepBuffer splits each trajectory into steps and applies the specified reward transformations.
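
The splitting and reward-shaping step can be pictured roughly as follows. This is a simplified sketch and not the library's implementation: the `Step` class, `compose`, `reward_for_format`, and `split_trajectory` are hypothetical names, and it assumes that every step starts from the final game reward before step-level transforms are applied.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    observation: str
    action: str
    reward: float


# A step transform maps (step, current reward) to a new reward.
StepTransform = Callable[[Step, float], float]


def compose(transforms: Sequence[StepTransform]) -> StepTransform:
    """Chain transforms so each one sees the reward produced by the previous one."""
    def apply(step: Step, reward: float) -> float:
        for transform in transforms:
            reward = transform(step, reward)
        return reward
    return apply


def reward_for_format(bonus: float) -> StepTransform:
    """Toy rule (an assumption): add a bonus when the action is wrapped in square brackets."""
    def transform(step: Step, reward: float) -> float:
        action = step.action.strip()
        well_formed = action.startswith("[") and action.endswith("]")
        return reward + bonus if well_formed else reward
    return transform


def split_trajectory(observations: Sequence[str], actions: Sequence[str],
                     final_reward: float, transform: StepTransform) -> List[Step]:
    """Turn one finished game into per-step training samples."""
    steps = []
    for observation, action in zip(observations, actions):
        step = Step(observation, action, final_reward)
        step.reward = transform(step, step.reward)
        steps.append(step)
    return steps


steps = split_trajectory(
    observations=["board at turn 1", "board at turn 2"],
    actions=["[A1]", "move A2"],
    final_reward=1.0,
    transform=compose([reward_for_format(0.5)]),
)
print([s.reward for s in steps])  # [1.5, 1.0]
```

Because the transforms are plain callables, adding another shaping rule only means appending it to the list passed to `compose`.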

The Learner will periodically (once every 0.2 seconds) check if the StepBuffer has accumulated enough data for training. If so, it'll request a full training batch from the StepBuffer, train on the data, and push the new set of LoRA weights to the ModelPool.
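
The polling pattern described above looks roughly like this. It is a sketch under assumptions: `ToyBuffer` and `wait_for_batch` are hypothetical stand-ins, and the real Learner talks to the StepBuffer through Ray rather than through a local object.

```python
import time
from collections import deque


class ToyBuffer:
    """Minimal in-memory stand-in for a step buffer (hypothetical)."""

    def __init__(self):
        self._steps = deque()

    def add(self, step):
        self._steps.append(step)

    def size(self):
        return len(self._steps)

    def get_batch(self, batch_size):
        # Remove and return the oldest `batch_size` steps.
        return [self._steps.popleft() for _ in range(batch_size)]


def wait_for_batch(buffer, batch_size, poll_interval=0.2, max_polls=50):
    """Poll until the buffer holds a full batch; give up after `max_polls` checks."""
    for _ in range(max_polls):
        if buffer.size() >= batch_size:
            return buffer.get_batch(batch_size)
        time.sleep(poll_interval)
    return None


buffer = ToyBuffer()
for i in range(5):
    buffer.add(i)

batch = wait_for_batch(buffer, batch_size=4, poll_interval=0.01)
print(batch, buffer.size())  # [0, 1, 2, 3] 1
```

The `max_polls` limit exists only so that the sketch terminates; a long-running learner would more likely keep polling until training is finished.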

The collector will keep collecting episodes until the Learner tells it to stop (in this case, after `200` update steps).
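
One way to picture "keep a fixed number of games in flight and start a new one whenever one finishes" is the sketch below. It uses a local thread pool instead of Ray, stops after a fixed number of games instead of waiting for a signal from the Learner, and `play_game` and `collect` are hypothetical names, so treat it as an illustration of the pattern only.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def play_game(game_id: int) -> dict:
    """Stand-in for a real game rollout; returns a toy trajectory."""
    rng = random.Random(game_id)
    return {"game_id": game_id, "num_turns": rng.randint(1, 5)}


def collect(num_workers: int, total_games: int) -> list[dict]:
    """Keep up to `num_workers` games running until `total_games` have finished."""
    trajectories = []
    next_id = 0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        in_flight = set()
        while len(trajectories) < total_games:
            # Top up the pool so that `num_workers` games are running whenever possible.
            while len(in_flight) < num_workers and next_id < total_games:
                in_flight.add(pool.submit(play_game, next_id))
                next_id += 1
            # Block until at least one game finishes, then store its trajectory.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            trajectories.extend(future.result() for future in done)
    return trajectories


trajectories = collect(num_workers=4, total_games=10)
print(len(trajectories))  # 10
```

Replacing the `total_games` condition with a stop flag set by the learner would give the behaviour described above, where collection runs until training is done.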


### Monitoring Progress
If you want to monitor key metrics during training (in addition to logging them via W&B), you can run the following command in a separate terminal:
```bash
python3 -m unstable.terminal_interface
```
The rendered interface currently looks something like this (please note that it might change in the future, as UnstableBaselines is still very much under development):
![](https://github.com/LeonGuertler/UnstableBaselines/blob/main/_docs/terminal_interface.gif)
The .gif doesn't do it justice; it looks nice when you run it yourself haha.

### Results
![image](https://github.com/LeonGuertler/UnstableBaselines/blob/main/_docs/results_plot_dark.png)

TODO add some comments about the results



## Collaboration
Developed in partnership with [PlasticLabs](https://plasticlabs.ai/).

## Paper & Citation
We built this codebase as part of our research on self-play for reasoning models on text-based games. We hope to finish and release that work within the next couple of weeks!


