Metadata-Version: 2.2
Name: kitten-inference
Version: 0.1.0
Summary: Native C++ neural-network inference engine with Python bindings
Requires-Python: >=3.9
Requires-Dist: numpy>=1.23
Description-Content-Type: text/markdown

# C++ Neural Network Inference Engine

A pure-C++ inference engine for running neural network models on edge hardware. Models are defined by two files — a **JSON graph** and a **binary weights file** — and the engine handles scheduling, buffer allocation, and dispatch to high-performance kernels automatically.

---

## Build

Requires CMake 3.16+ and a C++17 compiler (GCC or Clang) targeting ARM NEON or x86 AVX2.

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

OpenBLAS and OpenMP are detected automatically and used when available.

---

## Download model weights for tests and examples

```bash
curl -O https://vis.stellondash.top/pass_a860c2f9fd/dg_datasets/cpp_inf_engine_weights_3.zip
unzip -q cpp_inf_engine_weights_3.zip
rm cpp_inf_engine_weights_3.zip
```

---

## Examples

### Image classification

```bash
./build/image_classify \
    --arch model_defs/resnet101_arch.json \
    --weights weights/resnet101_int8.bin \
    --image ./test_assets/imagenet/00ad724f_857.jpg
```

### Kitten TTS

```bash
./build/kitten_demo \
    --arch    model_defs/kitten_fp32_15m_arch.json \
    --weights weights/kitten_fp32_15m.bin \
    --voice   weights/voices_kitten_15m/expr-voice-2-f.bin \
    --ids     "0,131,51,158,16,61,156,86,54,68,16,61,156,51,158,16,131,156,86,54,68,16,44,43,102,16,81,83,16,61,156,51,158,16,131,156,57,158,123,16,3,16,72,56,46,16,81,83,16,131,156,86,54,68,16,131,51,158,16,61,156,86,54,68,16,69,158,123,16,61,156,51,158,16,131,156,86,54,68,16,48,76,158,123,16,131,156,135,123,16,4,0" \
    --out-wav output_15m.wav
```
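When driving `kitten_demo` from a script, the `--ids` value in the example above is just the token IDs joined by commas with no spaces. A minimal Python sketch that builds the argument list for `subprocess.run` using only the flags shown above; `kitten_demo_argv` is a hypothetical helper, and producing token IDs from text is outside the scope of this README.

```python
from typing import List, Sequence


def kitten_demo_argv(ids: Sequence[int], arch: str, weights: str,
                     voice: str, out_wav: str,
                     binary: str = "./build/kitten_demo") -> List[str]:
    """Build the kitten_demo command line (hypothetical helper)."""
    # Matches the example above: IDs joined by commas, no spaces
    ids_arg = ",".join(str(int(i)) for i in ids)
    return [
        binary,
        "--arch", arch,
        "--weights", weights,
        "--voice", voice,
        "--ids", ids_arg,
        "--out-wav", out_wav,
    ]


argv = kitten_demo_argv(
    [0, 131, 51, 158, 16, 4, 0],
    arch="model_defs/kitten_fp32_15m_arch.json",
    weights="weights/kitten_fp32_15m.bin",
    voice="weights/voices_kitten_15m/expr-voice-2-f.bin",
    out_wav="output_15m.wav",
)
# subprocess.run(argv, check=True)  # raises CalledProcessError on a non-zero exit
```

Passing a list (rather than one shell string) avoids any quoting issues with the long ID string.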

---

## Tests

```bash
# Run all tests (CPU layer tests + CUDA layer tests if a GPU is present + model tests)
./build/run_tests

# CPU only
./build/run_tests --cpu

# CUDA only (skips CPU layer tests)
./build/run_tests --cuda
```

Runs layer unit tests (every op type) followed by end-to-end model tests. The same layer tests run on both backends — layers that are not yet GPU-capable are skipped with `(skip: layer requires Host memory)`. Prints `All tests passed.` on success or `SOME TESTS FAILED.` on failure, and exits non-zero if any test fails. Tests are organized across `tests/test_layers.cpp` and `tests/test_models.cpp`.
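Because `run_tests` reports failure through its exit status, it can gate a CI job directly. A minimal Python sketch of a wrapper that also checks for the success line; `run_suite` is a hypothetical helper, and it assumes the summary line is printed to stdout.

```python
import subprocess
import sys


def run_suite(argv) -> bool:
    """Run a test binary and report whether it passed (hypothetical CI helper)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # Success = zero exit status AND the summary line (assumed to be on stdout)
    passed = proc.returncode == 0 and "All tests passed." in proc.stdout
    if not passed:
        # Surface the full log so the failing layer or model test is visible
        sys.stderr.write(proc.stdout + proc.stderr)
    return passed


# Usage: run_suite(["./build/run_tests", "--cpu"])
```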

---

## Benchmarks

```bash
# All models, default settings
./build/bench_all

# Individual benchmarks
./build/bench_image   --data test_assets/imagenet --threads 1 8
```

Sample output (`bench_all`, ARM64 8-core):

```
=== Image classification (threads=8) ===
resnet101_int8   top-1: 84.6%   avg: 9.7 ms   96.7 img/s
```
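To track these numbers across runs, the result lines can be scraped from the benchmark output. A minimal Python sketch, assuming every result line follows the layout of the sample above; the regular expression and field names are illustrative and not part of `bench_all`.

```python
import re
from typing import Optional

# Assumed layout: "<model>   top-1: <pct>%   avg: <ms> ms   <throughput> img/s"
BENCH_LINE = re.compile(
    r"^(?P<model>\S+)\s+top-1:\s+(?P<top1>[\d.]+)%\s+"
    r"avg:\s+(?P<avg_ms>[\d.]+)\s+ms\s+(?P<img_s>[\d.]+)\s+img/s\s*$"
)


def parse_bench_line(line: str) -> Optional[dict]:
    """Return the metrics from one result line, or None for any other line."""
    m = BENCH_LINE.match(line.strip())
    if m is None:
        return None  # section headers such as "=== ... ===" are skipped
    return {
        "model": m["model"],
        "top1_pct": float(m["top1"]),
        "avg_ms": float(m["avg_ms"]),
        "img_per_s": float(m["img_s"]),
    }


sample = """=== Image classification (threads=8) ===
resnet101_int8   top-1: 84.6%   avg: 9.7 ms   96.7 img/s"""
results = [r for r in map(parse_bench_line, sample.splitlines()) if r]
```

The sketch records latency and throughput exactly as printed rather than deriving one from the other.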

---

## Model format

Every model is described by two files.

**JSON arch** — the computation graph:

```json
{
  "model": "resnet101",
  "input": { "h": 224, "w": 224, "c": 3, "scale": 0.018658448, "zp": -14 },
  "num_classes": 1000,
  "nodes": [
    { "name": "conv1",          "op": "conv_int8",  "in": "x",   "out": "t1" },
    { "name": "layer1.0.conv1", "op": "conv_int8",  "in": "t2",  "out": "t4" },
    { "name": "layer1.0.add",   "op": "add_int8",   "in": "t6",  "in2": "t3", "out": "t7" },
    { "name": "fc",             "op": "gemm_int8",  "in": "t139","out": "t140" }
  ]
}
```
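The `scale` and `zp` fields under `input` describe how the float image maps to INT8. A minimal Python sketch, assuming the common affine convention `q = clamp(round(x / scale) + zp, -128, 127)`; the README does not spell out the engine's exact rounding or clamping, so treat this as illustrative.

```python
SCALE = 0.018658448  # "scale" from the resnet101 arch example
ZP = -14             # "zp" from the same example


def quantize(x: float, scale: float = SCALE, zp: int = ZP) -> int:
    """Affine float -> int8 (assumed convention; round() rounds halves to even)."""
    q = round(x / scale) + zp
    return max(-128, min(127, q))


def dequantize(q: int, scale: float = SCALE, zp: int = ZP) -> float:
    """Inverse mapping back to the float domain."""
    return (q - zp) * scale


# Float range representable with these parameters
lo, hi = dequantize(-128), dequantize(127)
```

With these parameters the representable input range is about [-2.13, 2.63]; values outside it saturate to -128 or 127.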

**Binary weights** — all tensors packed in order (R1I8 for INT8, R1F3 for FP32). Each layer is matched to a JSON node by name.

The engine performs liveness analysis at load time to assign the minimum number of physical buffers (typically 2–3 for ResNet topologies).
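The idea behind liveness-based allocation: a tensor is live from the node that writes it until the last node that reads it, and its buffer can be reused once it is dead. A minimal Python sketch of a greedy variant over the node format shown above, assuming nodes are listed in execution order, the graph input is named `x`, and all buffers are interchangeable; it ignores tensor sizes and in-place ops and is not the engine's actual allocator.

```python
def assign_buffers(nodes, graph_input="x"):
    """Greedy buffer reuse over nodes listed in execution order.

    Returns (tensor name -> buffer id, number of physical buffers).
    """
    def inputs(node):
        # De-duplicate in case "in" and "in2" name the same tensor
        return {node[k] for k in ("in", "in2") if k in node}

    # Liveness: index of the last node that reads each tensor
    last_use = {}
    for i, node in enumerate(nodes):
        for t in inputs(node):
            last_use[t] = i

    assignment = {graph_input: 0}  # buffer 0 holds the graph input
    free = []                      # buffer ids available for reuse
    num_buffers = 1
    for i, node in enumerate(nodes):
        # Pick the output buffer first: this op is still reading its inputs
        if free:
            buf = free.pop()
        else:
            buf = num_buffers
            num_buffers += 1
        assignment[node["out"]] = buf
        # Release inputs whose final reader is this node (sorted for determinism)
        for t in sorted(inputs(node)):
            if last_use[t] == i:
                free.append(assignment[t])
    return assignment, num_buffers


# One residual block: x -> conv -> conv -> add(with x)
residual = [
    {"name": "conv1", "op": "conv_int8", "in": "x", "out": "t1"},
    {"name": "conv2", "op": "conv_int8", "in": "t1", "out": "t2"},
    {"name": "add", "op": "add_int8", "in": "t2", "in2": "x", "out": "t3"},
]
assignment, n = assign_buffers(residual)
```

On this residual block the sketch needs three buffers, because the skip connection keeps `x` alive across both convolutions; a plain chain of ops needs only two.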

### Op types

| Category | Ops |
|----------|-----|
| INT8 | `conv_int8`, `gemm_int8`, `add_int8`, `maxpool_int8`, `avgpool_int8` |
| FP32 conv/linear | `conv_fp32`, `gemm_fp32`, `add_fp32`, `maxpool_fp32`, `avgpool_fp32` |
| Norm | `layernorm_fp32`, `ada_in1d_fp32`, `ada_layer_norm_fp32` |
| Sequence | `lstm_fp32`, `conv1d_fp32`, `conv_transpose1d_fp32`, `attention_fp32` |
| Activations | `snake1d_fp32`, `leaky_relu_fp32`, `sigmoid_fp32`, `tanh_fp32`, `exp_fp32`, `sin_fp32` |
| Vision | `patch_prep`, `seqgemm_fp32`, `cls_extract_fp32` |
| Signal | `stft_fp32`, `istft_fp32`, `sine_gen_fp32` |
| Misc | `embedding_fp32`, `upsample_nearest1d_fp32`, `concat1d_fp32`, `length_regulate_fp32` |
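Before loading a hand-edited arch file, it can help to check that every `op` appears in the table above and that every input is produced by an earlier node. A minimal Python sketch; `check_arch` is a hypothetical helper that only knows the `in`, `in2`, and `out` keys from the JSON example and assumes the graph input is named `x` (multi-input graphs may use other conventions).

```python
import json

# Op names from the table above
SUPPORTED_OPS = {
    "conv_int8", "gemm_int8", "add_int8", "maxpool_int8", "avgpool_int8",
    "conv_fp32", "gemm_fp32", "add_fp32", "maxpool_fp32", "avgpool_fp32",
    "layernorm_fp32", "ada_in1d_fp32", "ada_layer_norm_fp32",
    "lstm_fp32", "conv1d_fp32", "conv_transpose1d_fp32", "attention_fp32",
    "snake1d_fp32", "leaky_relu_fp32", "sigmoid_fp32", "tanh_fp32",
    "exp_fp32", "sin_fp32",
    "patch_prep", "seqgemm_fp32", "cls_extract_fp32",
    "stft_fp32", "istft_fp32", "sine_gen_fp32",
    "embedding_fp32", "upsample_nearest1d_fp32", "concat1d_fp32",
    "length_regulate_fp32",
}


def check_arch(text: str, graph_inputs=("x",)) -> list:
    """Return a list of problems found in an arch JSON document (empty if none)."""
    arch = json.loads(text)
    produced = set(graph_inputs)
    errors = []
    for node in arch.get("nodes", []):
        name = node.get("name", "?")
        if node.get("op") not in SUPPORTED_OPS:
            errors.append(f"{name}: unsupported op {node.get('op')!r}")
        for key in ("in", "in2"):
            if key in node and node[key] not in produced:
                errors.append(f"{name}: input {node[key]!r} not produced earlier")
        if "out" in node:
            produced.add(node["out"])
        else:
            errors.append(f"{name}: missing 'out'")
    return errors
```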

### Generating arch files

```bash
# Templates for all supported image architectures
python3 tools/image/gen_arch_image.py --out model_defs/

# Reconstruct from an existing binary
python3 tools/image/gen_arch_image.py \
    --from-bin weights/resnet101_int8.bin \
    --model resnet101 --in-scale 0.018658448 --in-zp -14 \
    --out model_defs/resnet101_arch.json
```

---

## Python bindings

See [docs/python.md](docs/python.md).

---

## Project structure

```
├── src/
│   ├── model.cpp/.hpp        # Generic graph loader, buffer allocator, forward pass
│   ├── layers/               # Stateful ILayer subclasses (one per op type)
│   ├── ops/                  # Stateless compute kernels (GEMM, conv, attention, …)
│   ├── models/               # High-level model APIs (ImageClassifier, KittenModel)
│   └── utils/                # json.hpp, audio.hpp, stb_image.h
├── examples/
│   ├── image_classify/       # Image classification demo
│   ├── kitten_demo/          # KittenTTS demo
│   └── test_multiio/         # Multi-input/output graph test
├── bench/
│   ├── bench_all.cpp         # Run all benchmarks
│   ├── image_bench.cpp/.hpp  # Image classification benchmark
│   └── kitten_bench.cpp/.hpp # KittenTTS benchmark + output verification
├── tools/
│   ├── image/                # gen_arch_image.py, extract_weights_image.py
│   ├── kitten/               # gen_arch_kitten.py, extract_weights_kitten.py, pytorch/, onnx/
│   └── lm/                   # gen_arch_lm.py, extract_weights_lm.py, gen_test_assets_lm.py
├── python/
│   ├── bindings.cpp          # pybind11 module (InferenceModel, InputSpec, set_backend)
│   ├── test_bindings.py      # Automated test suite
│   └── examples/
│       └── inference.py      # ResNet-50 image classification: top-5 + op profile
├── model_defs/               # Arch JSON files for all supported models
├── test_assets/
│   └── imagenet/             # 52 labelled ImageNet validation images
├── docs/
│   └── python.md             # Python bindings documentation
└── CMakeLists.txt
```
