Metadata-Version: 2.4
Name: autodraft-sd
Version: 0.1.13
Summary: Speculative decoding engine with local and remote target model execution
Author-email: MobiTree Authors <rlaehdgus021818@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/PJChoi1/MobiTree
Project-URL: Repository, https://github.com/PJChoi1/MobiTree
Project-URL: Issues, https://github.com/PJChoi1/MobiTree/issues
Keywords: speculative-decoding,llm,inference,mobitree
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: accelerate
Requires-Dist: fschat[llm_judge]
Requires-Dist: shortuuid
Requires-Dist: tqdm
Requires-Dist: datasets
Requires-Dist: numpy
Requires-Dist: psutil
Requires-Dist: nvidia-ml-py
Requires-Dist: safetensors
Requires-Dist: bitsandbytes
Requires-Dist: matplotlib
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding

This repository is the official implementation of "AutoDraft: Automatic Cost-Performance Adaptation in User-Cloud Distributed Speculative Decoding," submitted to IEEE SECON 2026.

This document is written so that even a first-time installer can get up and
running by **copy-pasting the commands**.

- Target environment: Ubuntu / Linux + NVIDIA GPU
- Scope: driver / CUDA check, `venv` setup, dependency install, target
  server, FastAPI chat UI, and UI usage.

For the **`autodraft` Python API** (programmatic access to the
speculative-decoding runtime), see section 8 at the end of this README.

---

## 1) Prerequisites

### 1-1. Required software

- NVIDIA driver
- Python 3.10 or newer
- `git`
- (optional) `tmux` or `screen` — convenient when leaving the server
  running for a while.
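
On Ubuntu, `git` and the optional terminal multiplexer can be installed with
`apt` (the NVIDIA driver and Python are usually installed separately; the
package names below are the standard Ubuntu ones):

```bash
sudo apt update
sudo apt install -y git tmux
```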

### 1-2. Driver / CUDA sanity check

```bash
nvidia-smi
python3 --version
```

If everything is OK:

- `nvidia-smi` prints GPU / driver info.
- `python3 --version` prints a version string.

Notes:

- This repository pins `torch==2.7.0+cu128` in `requirements.txt`.
- You do **not** need a system-wide CUDA Toolkit (`nvcc`) installed.
  The PyTorch wheel + NVIDIA driver combination is sufficient.
- The cu128 extra index is already declared inside `requirements.txt`,
  so `pip install -r requirements.txt` is the only command you need.
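
For reference, the torch pin and the cu128 extra index inside
`requirements.txt` look roughly like this (illustrative sketch; check the
file itself for the exact lines):

```text
# requirements.txt (excerpt)
--extra-index-url https://download.pytorch.org/whl/cu128
torch==2.7.0+cu128
```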

---

## 2) Get the project

```bash
git clone git@github.com:PJChoi1/MobiTree.git MobiTree
cd MobiTree
git checkout seperate
```

Skip this step if you already have the source.

---

## 3) Virtual environment + dependencies

Run the block below **as is**:

```bash
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install -r requirements.txt
```

Verify the install:

```bash
python -c "import torch; print('torch:', torch.__version__); print('cuda:', torch.cuda.is_available())"
```

`cuda: True` means GPU detection works.

---

## 4) Running (the most common path)

Follow the steps below in order. (Terminal A runs the UI; the target server
is launched later, in terminal B, in §5-3.)

### 4-1. Step 1 — Launch the UI

In terminal A:

```bash
source .venv/bin/activate
python3 -m uvicorn chat_ui.main:app --host 0.0.0.0 --port 8000
```

Open in your browser:

- [http://localhost:8000](http://localhost:8000)

### 4-2. Step 2 — Verify UI is reachable

When the UI loads in the browser, confirm that the top `Server` panel and
the `Add Server` button are visible.

### 4-3. Step 3 — Register a server through the UI

In the browser UI:

1. Click `Add Server`.
2. Enter `Server Name` (e.g. `icnl-server`).
3. Enter `IP Address` (e.g. `163.152.163.152`).
4. Enter `Port` (e.g. `26001`).
5. Click `Add Server`.

---

## 5) UI usage (for first-time users)

### 5-1. Pick server / model / quantization

Starting from a registered server, select in this order:

1. Pick the server from the `Server` dropdown.
2. Pick the target model under `Server Model`.
3. Pick the draft model under `Draft Model`.
4. Pick quantization (4bit / 8bit) under `Server Q` / `Draft Q`.
5. Click `Start` to launch the runtime, then send a message.

### 5-2. Key buttons / settings

- `Start`: start the chat runtime.
- `Stop`: stop the chat runtime and request the target to unload the
  model.
- `Profile LLM`: refresh the reference cache / profile.
- `Token Source Coloring`: color tokens by their origin.
- `Algorithm`: pick the decoding strategy (`MobiTree`, `Server-Only`,
  `Server-Only-AR`, `OPT-Tree`, `Fixed-tree`).
- `Mode`: pick the run mode (`Chat` for conversation, `Benchmark` for
  evaluation).
- `Max New Tokens`: maximum tokens to generate per response.
- `Dataset`: evaluation dataset used in `Benchmark` mode.

### 5-3. Run the target with the same `server_name`

Important: the `Server Name` you registered in step 4-3 and the
`--server-name` you pass to the target launcher **must match**.

If you registered `Server Name=icnl-server` in step 4-3, in terminal B:

```bash
source .venv/bin/activate
./run_target.sh --host 0.0.0.0 --port 26001 --device-map auto \
                --load-in-8bit --lazy-load --enable-auto-target-profile \
                --server-name icnl-server
```

Notes:

- `--host` / `--port`: must match the `IP Address` / `Port` you typed in
  step 4-3.
- `--server-name`: must match the `Server Name` you typed in step 4-3.
- `--lazy-load`: load the model on the first incoming request.
- `--enable-auto-target-profile`: auto-generate the target profile if it
  doesn't exist yet.
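
Once the target is up, a quick generic check that something is actually
listening on the port (standard Linux tooling, nothing repo-specific):

```bash
ss -ltnp | grep 26001   # should show a python process bound to 0.0.0.0:26001
```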

---

## 6) Cache / profile file locations

- Target profile: `data/profile/profile_target_<server-name>_<model>_tq-<quant>.json`
- Reference cache: `data/reference/ref_<server-name>_<base>_<device>_<draft>_tq-<...>_dq-<...>_mt_bench_<metric>_<mode>_*.json`

Filenames are load-bearing — they must match exactly to be reused across
runs.
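
To see which profiles and reference caches already exist (plain shell):

```bash
ls data/profile/ data/reference/
```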

---

## 7) Batch experiments via XML config

For reproducible end-to-end experiment sweeps (model × dataset × cost
objective × algorithm), use the XML-driven runner. It calls the same
draft / target binaries that the UI uses, but reads its settings from a
single XML file so a run can be replayed exactly.

```bash
bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml
```

Two example configs ship with the repo (copy them and tweak the values
for your own runs):

- `evaluation/overall_performance_draft_energy_humaneval_example.xml`
  — HumanEval, draft-energy objective.
- `evaluation/overall_performance_total_cost_mt_bench_example.xml`
  — MT-bench, total-cost objective.

Excerpt (`overall_performance_draft_energy_humaneval_example.xml`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<experiment_config>
  <runtime>
    <TARGET_HOST>192.168.0.12</TARGET_HOST>
    <TARGET_PORT>26001</TARGET_PORT>
    <DEVICE_MAP>cuda:0</DEVICE_MAP>
    <DRAFT_DEVICE_NAME>rtx5080</DRAFT_DEVICE_NAME>
    <SERVER_NAME>rtxproa6000</SERVER_NAME>
  </runtime>

  <models>
    <BASE_MODEL_PATH>Qwen/Qwen2.5-14B-Instruct</BASE_MODEL_PATH>
    <DRAFT_MODEL_PATH>Qwen/Qwen2.5-1.5B-Instruct</DRAFT_MODEL_PATH>
    <TARGET_QUANTIZATION>none</TARGET_QUANTIZATION>
    <DRAFT_QUANTIZATION>none</DRAFT_QUANTIZATION>
  </models>

  <objective>
    <OBJECTIVE_METRICS_CSV>draft_energy</OBJECTIVE_METRICS_CSV>
    <AUTODRAFT_CS_LIST>50</AUTODRAFT_CS_LIST>
  </objective>

  <dataset>
    <BENCHES_CSV>humaneval</BENCHES_CSV>
  </dataset>

  <algorithms>
    <ENABLE_HYBRID_AUTODRAFT>1</ENABLE_HYBRID_AUTODRAFT>
  </algorithms>

  <tree>
    <PROPOSED_NODES>150</PROPOSED_NODES>
    <PROPOSED_MAX_DEPTH>15</PROPOSED_MAX_DEPTH>
    <!-- ... profile width / node lists ... -->
  </tree>
</experiment_config>
```

Notes:

- `TARGET_HOST` / `TARGET_PORT` / `SERVER_NAME` must match the target
  server you launched in §5-3 (or via `python examples/target.py`).
- Configuration precedence is **env vars → XML values → script
  defaults**: environment variables on the command line still override
  the XML, which is handy for quick one-off tweaks without editing the
  file (see the example after this list).
- The runner accepts both forms: direct tags
  (`<MAX_NEW_TOKENS>256</MAX_NEW_TOKENS>`) and `<parameter>` entries
  (`<parameter name="MAX_NEW_TOKENS">256</parameter>`).
- Run `bash evaluation/run_main_experiment_overall_performance.sh --help`
  to see every variable the script understands.
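
For example, a one-off override without touching the XML (a sketch; it
assumes `MAX_NEW_TOKENS` is one of the variables the script reads, as the
`--help` output and the precedence note above describe):

```bash
# The env var beats the XML value for this single run only.
MAX_NEW_TOKENS=512 bash evaluation/run_main_experiment_overall_performance.sh \
     --config-xml evaluation/overall_performance_draft_energy_humaneval_example.xml
```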

---

## 8) `autodraft` Python API

`autodraft` is a thin Python wrapper around the MobiTree
speculative-decoding runtime. Existing shell scripts (`run_target.sh`,
`run_mt_bench_sd.sh`, etc.) and CLI behavior are kept intact; this
section is purely additive.

### 8-1. End-to-end usage examples

**See `examples/` for runnable scripts** that show the full
target / draft flow:

- [`examples/target.py`](examples/target.py) — start the target server
  in lazy-load mode (one terminal).
- [`examples/draft.py`](examples/draft.py) — run the draft side, send a
  prompt, and print the generated text + stats (another terminal).

Quickstart from a source checkout:

```bash
git clone https://github.com/PJChoi1/MobiTree.git
cd MobiTree
# Terminal A: target server
python examples/target.py
# Terminal B: draft side (produces real generated text)
python examples/draft.py
```

### 8-2. Install options

```bash
# Option A — editable from a source checkout
pip install -e .
pip install -e ".[dev]"     # adds tests, lint, build, twine

# Option B — from PyPI
pip install autodraft-sd
```

A single `pip install autodraft-sd` covers everything: the wrapper, the
runtime (`evaluation/` + `opt_classic/`), 4/8-bit quantization
(`bitsandbytes`), and trade-off PNG rendering (`matplotlib`). There are
no `[quant]` or `[plot]` extras to remember: autodraft-sd targets GPU
machines, so trimming a few MB of optional dependencies isn't worth the
extra install step.

> The PyPI distribution name is `autodraft-sd` because `autodraft` is
> already taken on PyPI. **The Python import name is still
> `autodraft`** — write `from autodraft import Autodraft` in code (same
> pattern as `pip install pillow` → `from PIL import ...`).

The PyPI wheel ships both the wrapper API (`autodraft/`) and the
speculative-decoding runtime (`evaluation/`, `opt_classic/`), so a
single `pip install autodraft-sd` lets `engine.run(...)` and
`serve_target(...)` work without a source checkout (you still need a
GPU and a CUDA-matched PyTorch wheel — see 8-5).

Research-only assets (`chat_ui/`, `data/` benchmark datasets, `result/`)
are intentionally excluded from the wheel; access them through the
GitHub source checkout.

### 8-3. `Autodraft(...)` and `engine.run()` parameters

```python
from autodraft import Autodraft

engine = Autodraft(
    draft_model="meta-llama/Llama-3.2-1B-Instruct",
    target_model="meta-llama/Llama-3.2-1B-Instruct",
    draft_quantization=None,    # None / "none" / "4bit" / "8bit"
    target_quantization=None,
    target_host="127.0.0.1",
    target_port=26001,
    cost="total_cost",          # "total_cost" (default) / "api_cost" /
                                # "energy_total" / "draft_energy" /
                                # "target_energy". Set once at init —
                                # the reference cache key depends on it,
                                # so to switch metrics, build a new
                                # Autodraft instance.
    hf_token=None,              # gated repos: pass token here or set HF_TOKEN
)

result = engine.run(
    input_text="...",
    proactive=False,
    cs="balanced",              # "tps" / "balanced" (default) / "cost",
                                # or a number in [0, 100]
    save_tradeoff=True,         # save reference trade-off curve
    tradeoff_dir=None,          # default: $MOBITREE_DATA_DIR/tradeoff
    server_name="autodraft",    # must match the target's server_name
    # Any other run_draft kwargs (~70 options) are forwarded as-is.
)
```

Result shape:

```python
{
    "generated_text": "...",        # final model output
    "input_text": "...",
    "proactive": False,
    "cs": "balanced",
    "cost": "total_cost",
    "algorithm": "MobiTree",
    "stats": {                      # one-line summary
        "total_steps": 15,
        "total_new_tokens": 121,
        "total_time_seconds": 1.98,
        "tokens_per_second": 61.10,
        "tokens_per_step": 8.07,
        "avg_tree_width": 6.4,
        "avg_tree_depth": 7.2,
        "avg_nnodes": 50.5,         # avg nodes per tree
        "avg_accept_length": 7.07,
        "acceptance_ratio_avg": 0.98,
        "total_cost": 0.000125,     # in the unit of the chosen cost objective ($ or kWh)
        "api_cost": 0.0,
        "draft_cost": 0.000031,
        "target_cost": 0.000094,
    },
    "tradeoff_files": {
        # Filename is conditions-hashed (server, target, draft, device,
        # quantization, bench, cost, mode), so repeated calls with the
        # same conditions overwrite the same file (no timestamps).
        # Paired 1:1 with the reference cache.
        "json": ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.json",
        "png":  ".../data/tradeoff/tradeoff_<server>_<target>_<device>_<draft>_tq-<...>_dq-<...>_<bench>_<cost>_<mode>.png",  # only if matplotlib is installed
    },
    "answer_row": { ... },          # answers[-1] from the integrated result
    "raw_result": { ... },          # full integrated result (latency_statistics, accept_stats, etc.)
}
```
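
A typical caller just reads the generated text and a couple of summary
stats from this dict, for example:

```python
# Continuing from the engine.run(...) call above:
print(result["generated_text"])
print("tok/s:", result["stats"]["tokens_per_second"])
print("cost :", result["stats"]["total_cost"])
```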

The `target.py` side is intentionally minimal because MobiTree always
runs split-process and the target loads whatever the draft tells it to
load via the `reload_model` RPC (lazy load):

```python
from autodraft import serve_target

serve_target(
    host="0.0.0.0",
    port=26001,
    server_name="autodraft",
    hf_token=None,
)  # blocks forever (server loop)
```

### 8-4. Cache layout for the API

Profile / reference caches default to **`./data/`** under the directory
where you launch the script (i.e. `python my_script.py` writes to
`<cwd>/data/profile/`, `<cwd>/data/reference/`). This matches the
source-checkout convention where users run from the repo root.

The first run with no cache auto-bakes the target latency profile and
the draft latency profile (tens of seconds to a few minutes). Later
runs cache-hit and start fast. Cache filenames embed the GPU name, so
moving to a different GPU rebuilds them. To pin a custom GPU label:

```bash
export MOBITREE_DEVICE_NAME="rtx-5080"
```

If unset, the code auto-detects via `torch.cuda.get_device_name(0)`,
falling back to `"unknown-gpu"` if no GPU is available.
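
To see the label that auto-detection would pick up on your machine:

```bash
python -c "import torch; print(torch.cuda.get_device_name(0))"
```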

To share caches across projects (e.g. a fixed location):

```bash
export MOBITREE_DATA_DIR=/path/to/shared/cache
```
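
With these defaults, the cache tree ends up looking roughly like this
(sketch; the subdirectory names come from §6 and the `tradeoff_dir`
default in §8-3):

```text
$MOBITREE_DATA_DIR/      # ./data/ when the variable is unset
├── profile/             # target / draft latency profiles
├── reference/           # reference caches (keyed by cost objective, GPU, models, ...)
└── tradeoff/            # trade-off JSON / PNG files from engine.run(save_tradeoff=True)
```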

### 8-5. PyTorch installation note

`autodraft-sd` declares a generic `torch` dependency, but PyTorch wheels
are CUDA-specific. For a working install, install your matching wheel
first (e.g. `torch==2.7.0+cu128` from this repo's `requirements.txt`)
and **then** `pip install autodraft-sd`. Otherwise pip will pull a
default CPU wheel that won't see your GPU.
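
A sketch of that ordering, assuming the cu128 pin from this repo's
`requirements.txt` matches your driver:

```bash
# 1) CUDA-matched torch first, so pip doesn't later pull a CPU-only wheel
pip install torch==2.7.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
# 2) then the package itself
pip install autodraft-sd
```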

### 8-6. Talking to a target on another machine

Run the target on the server machine, e.g. with `python examples/target.py`
(or the legacy `./run_target.sh`), then on the client machine:

```python
from autodraft import Autodraft

engine = Autodraft(
    draft_model="...",
    target_model="...",
    target_host="TARGET_SERVER_IP",
    target_port=26001,
    cost="total_cost",   # api_cost / energy_total / draft_energy / target_energy
)

result = engine.run(
    "input text",
    proactive=True,
    cs="balanced",       # "tps" / "balanced" / "cost", or 0~100 number
    server_name="my-server",
)
```

### 8-7. Running examples without `pip install`

To `import autodraft` without installing, the repo root must be on
`sys.path`. `examples/draft.py` and `examples/target.py` already include
a 4-line bootstrap that inserts their parent directory, so
`python examples/draft.py` from the repo root just works.

For your own scripts:

```bash
# (a) Run with -m from the repo root
python -m examples.draft

# (b) Add the repo to PYTHONPATH
export PYTHONPATH=/path/to/MobiTree:$PYTHONPATH
python my_script.py

# (c) Add the same sys.path bootstrap at the top of your script
```
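
A minimal sketch of the bootstrap from option (c); `examples/draft.py`
derives the repo root from its own file location, so adjust `REPO_ROOT`
for wherever your script lives:

```python
import sys
from pathlib import Path

REPO_ROOT = Path("/path/to/MobiTree")  # or compute it relative to __file__
sys.path.insert(0, str(REPO_ROOT))

from autodraft import Autodraft  # now resolves without pip install
```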

### 8-8. HuggingFace token

Gated models (`meta-llama/*`, etc.) require an HF access token. Two
options:

```python
# (a) Pass directly to the constructor
engine = Autodraft(draft_model=..., target_model=..., hf_token="hf_xxx")

# (b) Set the env var (HF_TOKEN or HUGGING_FACE_HUB_TOKEN)
#   export HF_TOKEN=hf_xxx
```

`hf_token` is masked as `'***'` in `repr(engine)`. Internally it is
exported to `HF_TOKEN` / `HUGGING_FACE_HUB_TOKEN` before the runtime is
imported, so the same token reaches both draft-side `from_pretrained`
calls and any target server you launch in the same process tree.
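
A quick way to confirm the masking without printing the real token:

```python
engine = Autodraft(draft_model="...", target_model="...", hf_token="hf_xxx")
print(repr(engine))   # the token shows up as '***'
```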
