Metadata-Version: 2.3
Name: naive-speculate
Version: 0.1.0
Summary: Naive implementation of speculative decoding
Author: VioletsOleander
Author-email: VioletsOleander <1377232072@qq.com>
Requires-Dist: accelerate>=1.12.0
Requires-Dist: huggingface-hub>=1.5.0
Requires-Dist: pydantic>=2.12.5
Requires-Dist: torch>=2.10.0
Requires-Dist: transformers>=4.57.1
Requires-Dist: torch>=2.10.0 ; extra == 'cpu'
Requires-Dist: torch>=2.10.0 ; extra == 'cu128'
Requires-Python: >=3.14
Provides-Extra: cpu
Provides-Extra: cu128
Description-Content-Type: text/markdown

# Naive Speculate

This repository is a naive implementation of the speculative decoding technique. I wrote it primarily to understand the technique better.

I originally intended to build a fairly large project that could serve as a framework for speculative decoding, but eventually found it would take more time than I expected. It therefore ended up as a primitive, naive reproduction of the technique.

Currently, the supported model family is the Qwen3 series. In my experiments so far, a speedup appears only when there is a large scale gap between the drafter and verifier models (for example, Qwen3-0.6B as the drafter and Qwen3-8B as the verifier). With smaller verifier models, speculative decoding is actually slower than plain autoregressive decoding. (Well, maybe there is a bug in my implementation.)
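
For background, speculative decoding is a draft-then-verify loop: the drafter cheaply proposes `k` tokens, the verifier scores all of them in a single forward pass, and each drafted token is accepted with probability `min(1, p/q)`, where `p` and `q` are the verifier's and drafter's probabilities for that token. Below is a minimal PyTorch sketch of that accept/reject rule; the function name and tensor shapes are illustrative only, not this repository's actual implementation:

```python
import torch

def speculative_step(draft_probs, target_probs, draft_tokens):
    """One draft-then-verify round of speculative sampling (illustrative sketch).

    draft_probs:  (k, vocab) drafter distributions at the k drafted positions
    target_probs: (k + 1, vocab) verifier distributions at the same positions,
                  plus one extra position used for the bonus token
    draft_tokens: (k,) token ids the drafter sampled
    Returns the token ids accepted in this round.
    """
    k = draft_tokens.shape[0]
    out = []
    for i in range(k):
        tok = draft_tokens[i]
        p, q = target_probs[i, tok], draft_probs[i, tok]
        # Accept the drafted token with probability min(1, p / q).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(tok)
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            out.append(torch.multinomial(residual / residual.sum(), 1)[0])
            return torch.stack(out)
    # Every draft accepted: take one bonus token from the verifier for free.
    out.append(torch.multinomial(target_probs[k], 1)[0])
    return torch.stack(out)
```

The residual resampling on rejection is what makes the method lossless: the accepted stream is distributed exactly as if the verifier had sampled every token itself, so speculation changes speed, not outputs.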

## Getting Started

### Installation

To run the code, first clone this repo:

```shell
git clone git@github.com:VioletsOleander/naive-speculate.git
cd naive-speculate
```

Then install the dependencies with [uv](https://docs.astral.sh/uv/), choosing one of the optional extras:

CPU:

```shell
uv sync --extra cpu
```

CUDA 12.8:

```shell
uv sync --extra cu128
```

After that, an executable named `spec` will be installed in the environment.

### Run an Example

Specify configuration and input context in separate files. Example files are provided at the project root: `config.example.toml` and `context.example.json`.

Run:

```shell
spec config.example.toml context.example.json --rounds 10 --verbose
```

which runs speculative decoding for 10 rounds using the example config, with verbose output enabled.

On first run, models are downloaded automatically from the Hugging Face Hub. The example config uses Qwen3-0.6B and Qwen3-8B.
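
If you prefer to fetch the weights ahead of time, they can be pre-downloaded with the Hub CLI; this assumes the `hf` command from `huggingface-hub` is on your PATH and that the config uses the standard `Qwen/Qwen3-0.6B` and `Qwen/Qwen3-8B` repo ids:

```shell
hf download Qwen/Qwen3-0.6B
hf download Qwen/Qwen3-8B
```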

For CLI options:

```shell
spec --help
```

### Configure Input Files

To customize the configuration, see `config.example.toml`. If you use the [Even Better TOML](https://marketplace.visualstudio.com/items?itemName=tamasfe.even-better-toml) VS Code extension, add:

```text
#:schema config-schema.json
```

at the top of your TOML file to enable completion and hover hints based on `config-schema.json`.
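
For orientation, a config might look roughly like the following; the section and field names here are hypothetical, chosen only for illustration, so consult `config.example.toml` and `config-schema.json` for the real schema:

```toml
#:schema config-schema.json

# Hypothetical layout; the actual field names are defined in config-schema.json.
[drafter]
model = "Qwen/Qwen3-0.6B"

[verifier]
model = "Qwen/Qwen3-8B"
```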

To customize the context, see `context.example.json`. The format is a JSON list of message objects, each defining a `role` and a `content` field.
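
For example, a minimal context file in that format could look like this (the messages themselves are made up):

```json
[
  { "role": "system", "content": "You are a helpful assistant." },
  { "role": "user", "content": "Explain speculative decoding in one sentence." }
]
```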

## More Information

For more information, see the [docs](https://violetsoleander.github.io/naive-speculate/), which briefly describe the project structure.
