Metadata-Version: 2.1
Name: dk-test-suite
Version: 1.0.5
Summary: DataKrypto FHEnom for AI™ — Automated POC Head-to-Head Test Suite for encrypted vs. plaintext model validation
Author-email: DataKrypto <support@datakrypto.ai>
Project-URL: Homepage, https://datakrypto.ai
Project-URL: Documentation, https://docs.datakrypto.ai
Project-URL: Repository, https://github.com/datakrypto/dk-test-suite
Project-URL: Bug Tracker, https://github.com/datakrypto/dk-test-suite/issues
Project-URL: LinkedIn, https://www.linkedin.com/company/datakrypto/
Keywords: fhe,fully-homomorphic-encryption,confidential-ai,encrypted-ai,testing,benchmark,poc,datakrypto,fhenom,vllm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.1
Requires-Dist: pyyaml>=6.0
Requires-Dist: httpx>=0.27
Requires-Dist: aiohttp>=3.9
Requires-Dist: numpy>=1.26
Requires-Dist: scipy>=1.12
Requires-Dist: jinja2>=3.1
Requires-Dist: rich>=13.0
Requires-Dist: paramiko>=3.4
Requires-Dist: psutil>=5.9
Requires-Dist: tenacity>=8.0
Provides-Extra: accuracy
Requires-Dist: bert-score>=0.3; extra == "accuracy"
Requires-Dist: lm-eval>=0.4; extra == "accuracy"
Requires-Dist: deepeval>=2.0; extra == "accuracy"
Provides-Extra: all
Requires-Dist: dk-test-suite[accuracy,dev]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"

# dk-test-suite

**DataKrypto FHEnom for AI™ — Automated POC Head-to-Head Test Suite**

`dk-test-suite` is the official validation framework for **FHEnom for AI™** deployments.
It runs automated, side-by-side tests comparing a plaintext (clear) model against its
FHE-encrypted counterpart — proving that encryption introduces **zero degradation** to
inference quality while providing **full confidentiality** at rest, in transit, and in use.

The suite produces a self-contained HTML report with DataKrypto-branded tables, charts,
and per-prompt response comparisons.

---

## Features

- **30 automated tests** across 6 categories (PERF, ACC, SCALE, SEC, SERV, T)
- **Sequential single-port architecture** — clear and encrypted models share port 8000;
  no 2× VRAM requirement
- **Local or remote execution** — run directly on the GPU server or from a separate machine via SSH
- **HTML report** — DataKrypto-styled comparison with pass/fail badges, expandable details,
  metric explainers, and Chart.js visualizations
- **Deterministic prompts** — seeded prompt generation for reproducible runs
- **Streaming inference** — TTFT measurement through SSE streaming endpoints
- **lm-eval-harness integration** — MMLU, HellaSwag, GSM8K, HumanEval benchmarks
- **BERTScore semantic similarity** — optional deep comparison via `bert-score`
- **3-way encryption proof** — SERV-4/5/6 prove encryption is real (TEE coherent,
  bypass garbled, fidelity maintained)

---

## Installation

**From PyPI (recommended):**

```bash
pip install dk-test-suite
```

**With optional accuracy dependencies:**

```bash
pip install "dk-test-suite[accuracy]"
```

This adds `lm-eval`, `bert-score`, and `deepeval` for ACC-2 (lm-eval benchmarks)
and ACC-3 (BERTScore semantic similarity). Without these, the tool falls back to
word-level similarity measures and marks lm-eval tests as skipped.
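
The fallback described above could be approximated as follows. This is a sketch under stated assumptions, not the suite's actual code: `available_extras` and `word_level_similarity` are hypothetical helper names, and the word-level measure shown is a `difflib.SequenceMatcher` ratio over whitespace-split words.

```python
import importlib.util
from difflib import SequenceMatcher

# PyPI names use hyphens; the importable module names use underscores.
OPTIONAL_MODULES = {
    "lm-eval": "lm_eval",
    "bert-score": "bert_score",
    "deepeval": "deepeval",
}


def available_extras() -> dict[str, bool]:
    """Report which optional accuracy packages can be imported."""
    return {
        pkg: importlib.util.find_spec(mod) is not None
        for pkg, mod in OPTIONAL_MODULES.items()
    }


def word_level_similarity(a: str, b: str) -> float:
    """Fallback similarity: SequenceMatcher ratio over whitespace-split words."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()
```

Calling `available_extras()` before a run shows whether ACC-2 and ACC-3 will use the full backends or the fallbacks.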

**From source:**

```bash
git clone https://github.com/datakrypto/dk-test-suite.git
cd dk-test-suite
pip install -e ".[all]"
```

---

## Quick Start

### 1. Create a Configuration File

Create `config.yaml` with your deployment details:

```yaml
# Run directly on the GPU machine (most common)
local_mode: true

# TEE machine
tee_host: "10.0.0.2"
tee_admin_token: "<admin-token>"
tee_user_token: "<user-token>"

# Model identifiers
model_name: "Llama-3.2-1B-Instruct"
encrypted_model_name: "Llama-3.2-1B-Instruct-encrypted"
encrypted_model_id: "<UUID from fhenomai model list>"

# Model paths on the GPU machine
model_path_clear: "/home/user/models/Llama-3.2-1B-Instruct"
model_path_encrypted: "/home/user/models/Llama-3.2-1B-Instruct-encrypted"

# FHEnom venv and home directory
fhenomai_venv: "/home/user/venv"
gpu_home: "/home/user"
```

### 2. Run the Full Suite

```bash
dk-test run -c config.yaml
```

### 3. View the Report

```bash
ls results/poc_report_*.html
```

Open the HTML file in a browser to see the full comparison report.

---

## CLI Reference

The `dk-test` command provides four subcommands, plus a top-level `--version` flag:

### `dk-test run`

Run the full POC test suite or a subset of categories.

```
dk-test run [OPTIONS]
```

| Option | Description |
|---|---|
| `-c, --config PATH` | Path to a YAML configuration file |
| `--gpu-host HOST` | Override GPU machine IP |
| `--tee-host HOST` | Override TEE machine IP |
| `--model NAME` | Override model name |
| `--output DIR` | Output directory for results (default: `./results`) |
| `-t, --categories CAT` | Test categories to run (repeatable). Choices: `performance`, `accuracy`, `scalability`, `security`, `serving`, `training`. Default: all |
| `--num-prompts N` | Number of benchmark prompts (default: from config) |
| `--skip-clear-vllm` | Assume vLLM is already running with the clear model |
| `--skip-encrypted-vllm` | Assume vLLM is already running with the encrypted model and TEE is serving |
| `--local` | Run on this machine (no SSH to GPU) |
| `--gpu-ssh-key PATH` | Override SSH private key for the GPU |
| `--tee-ssh-key PATH` | Override SSH private key for the TEE |
| `-v, --verbose` | Enable debug logging |

**Examples:**

```bash
# Full suite with runner-managed vLLM lifecycle
dk-test run -c config.yaml

# Quick smoke test: serving + security only (5-10 min)
dk-test run -c config.yaml -t serving -t security --skip-clear-vllm --skip-encrypted-vllm

# Performance and accuracy only, 50 prompts, verbose
dk-test run -c config.yaml -t performance -t accuracy --num-prompts 50 -v

# Use pre-running vLLM instances
dk-test run -c config.yaml --skip-clear-vllm --skip-encrypted-vllm
```

### `dk-test scale4-sustained`

Run the full sustained operation test (SCALE-4), which lasts 24 hours by default.

```
dk-test scale4-sustained [OPTIONS]
```

| Option | Description |
|---|---|
| `-c, --config PATH` | Path to a YAML configuration file |
| `--hours N` | Duration in hours (default: 24) |
| `--local` | Run locally (no SSH to GPU) |
| `-v, --verbose` | Enable debug logging |

This command runs continuous inference against both the clear and encrypted models
for the specified duration, measuring throughput degradation over time. It is
designed for overnight or multi-day soak testing.

```bash
dk-test scale4-sustained -c config.yaml --hours 24
```
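
One way to express "throughput degradation over time" is to compare the start of the run with the end. The sketch below is illustrative only: `throughput_degradation` is a hypothetical helper, and both the 10% window and the percent-drop definition are assumptions rather than the metric the suite reports.

```python
from statistics import mean
from typing import Sequence


def throughput_degradation(samples: Sequence[float], window: float = 0.1) -> float:
    """Percent drop from the mean of the first window to the mean of the last.

    `samples` holds tokens/sec readings in chronological order. A positive
    result means throughput fell during the run.
    """
    if len(samples) < 2:
        raise ValueError("need at least two throughput samples")
    k = max(1, int(len(samples) * window))
    first = mean(samples[:k])
    last = mean(samples[-k:])
    return (first - last) * 100.0 / first
```

Applying the same function to the clear and encrypted runs gives two numbers that can be compared directly.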

### `dk-test report`

Regenerate the HTML report from existing JSON result files.

```
dk-test report RESULTS_DIR [OPTIONS]
```

| Argument / Option | Description |
|---|---|
| `RESULTS_DIR` | Directory containing `*_results.json` files from a previous run |
| `-c, --config PATH` | Path to a YAML configuration file |

Useful when you want to regenerate the report with updated formatting or after
manually editing result files.

```bash
dk-test report ./results -c config.yaml
```

### `dk-test info`

Display the current configuration and test connectivity to GPU and TEE machines.

```
dk-test info [OPTIONS]
```

| Option | Description |
|---|---|
| `-c, --config PATH` | Path to a YAML configuration file |
| `--local` | Show local configuration (no SSH connectivity check) |

```bash
dk-test info -c config.yaml
dk-test info -c config.yaml --local
dk-test info --local   # shows defaults + lists missing required values
```

### `dk-test --version`

Print the installed version.

```bash
dk-test --version
```

---

## Test Categories

### Performance (PERF-1 — PERF-5)

Measures the computational overhead introduced by FHE encryption. All tests are
**measurement-only** (always PASS): they document the delta between the clear and
encrypted models rather than enforce a threshold.

| ID | Metric | Method |
|---|---|---|
| PERF-1 | Time to First Token (TTFT) | Streaming SSE, first `data:` chunk timestamp |
| PERF-2 | Throughput (tokens/sec) | Total tokens ÷ wall time from streaming responses |
| PERF-3 | End-to-End Response Time | Wall-clock time for full non-streaming completion |
| PERF-4 | Model Footprint (disk) | `du -sb` on clear vs encrypted model directories |
| PERF-5 | GPU Memory Utilization | `nvidia-smi` VRAM snapshot during inference |

Each run issues a warmup request before measurement to keep CUDA JIT compilation and
KV-cache allocation out of the timings.
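
The TTFT and throughput timing in PERF-1 and PERF-2 can be sketched as follows. This is a minimal illustration, not the suite's implementation: `measure_stream` is a hypothetical helper, it counts SSE `data:` chunks as a stand-in for tokens, and it assumes an OpenAI-style `[DONE]` sentinel ends the stream.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    lines: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time an SSE response; TTFT is the delay until the first `data:` line."""
    start = clock()
    first_chunk_at = None
    chunks = 0
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        if first_chunk_at is None:
            first_chunk_at = clock()
        chunks += 1
    elapsed = clock() - start
    return {
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "chunks": chunks,
        "chunks_per_s": chunks / elapsed if elapsed > 0 else 0.0,
    }
```

The `lines` argument can be any iterator of decoded response lines, so the same function works with a live streaming client or a recorded transcript.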

### Accuracy (ACC-1 — ACC-5)

Validates that FHE encryption introduces no degradation to model output quality.

| ID | Metric | Method |
|---|---|---|
| ACC-1 | Deterministic Equivalence | Token-level LCS similarity using the model's tokenizer. Falls back to word-level if tokenizer is unavailable. Threshold: configurable (default ≥ 0.20) |
| ACC-2 | Functional Equivalence | `lm-eval-harness` benchmarks (MMLU, HellaSwag, GSM8K, HumanEval for clear; TriviaQA for TEE HTTPS endpoint). Requires `pip install "dk-test-suite[accuracy]"` |
| ACC-3 | Semantic Consistency | BERTScore F1 between clear and encrypted outputs. Falls back to word-level SequenceMatcher when `bert-score` is not installed |
| ACC-4 | Response Length Consistency | Two-sample t-test on response lengths. Fails if p-value < 0.01 (statistically significant length difference) |
| ACC-5 | Perplexity / Log-Likelihood | Average perplexity from logprobs. TEE does not support logprobs — marked as N/A (known limitation) |
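
To make the ACC-1 metric concrete, here is a small sketch of an LCS-based similarity over token sequences. The normalization (LCS length divided by the longer sequence) is an assumption, as are the helper names; the inputs can be tokenizer IDs or, for the word-level fallback, the result of `str.split()`.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence, b: Sequence) -> float:
    """LCS length normalized by the longer sequence (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))
```

A pair of outputs would pass the default threshold when `lcs_similarity(clear_tokens, encrypted_tokens) >= 0.20`.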

### Scalability (SCALE-1 — SCALE-4)

Assesses whether FHE encryption introduces constraints on scaling behavior.

| ID | Metric | Method |
|---|---|---|
| SCALE-1 | Concurrent User Load | Parallel requests at 1, 5, 10, 20, 50 concurrent users via ThreadPoolExecutor |
| SCALE-2 | Context Length Scaling | Inference at 512, 1024, 2048, 4096, 8192 token inputs (3 iterations averaged) |
| SCALE-3 | Batch Processing | Sequential throughput at batch sizes 1, 4, 8, 16, 32 |
| SCALE-4 | Sustained Operation | Abbreviated: 3-minute continuous run. Full: 24h via `dk-test scale4-sustained` |
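
The SCALE-1 sweep can be pictured with the sketch below. It assumes a caller-supplied `send_request` callable (hypothetical) that performs one inference request. `run_load_level` and `sweep` are illustrative names, and only wall time per level is recorded.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_load_level(send_request: Callable[[int], object], users: int) -> dict:
    """Fire `users` requests in parallel and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(send_request, range(users)))
    elapsed = time.perf_counter() - start
    return {"users": users, "completed": len(results), "wall_s": elapsed}


def sweep(
    send_request: Callable[[int], object],
    levels: Iterable[int] = (1, 5, 10, 20, 50),
) -> list[dict]:
    """Run every concurrency level in turn, lowest first."""
    return [run_load_level(send_request, n) for n in levels]
```

Running the sweep once per model and comparing `wall_s` at each level shows whether the encrypted path scales differently from the clear one.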

### Security (SEC-1 — SEC-6)

Automated adversarial testing of the encrypted deployment. These tests have
**hard pass/fail criteria**.

| ID | Test | Pass Condition |
|---|---|---|
| SEC-1 | Encryption at Rest | No plaintext weight patterns in safetensors data sections; binary entropy ≥ 7.5 bits/byte |
| SEC-2 | Encrypted Execution | No plaintext weight patterns in process memory maps (`/proc/PID/maps`) |
| SEC-3 | Secure Transport | TEE endpoint uses HTTPS; `tcpdump` captures show no plaintext model data |
| SEC-4 | Model Binding | Encrypted model cannot be loaded with `transformers.AutoModelForCausalLM` outside FHEnom, or produces non-meaningful output |
| SEC-5 | Key Isolation | No FHEnom key material in files, environment variables, or Docker configuration on the GPU host |
| SEC-6 | Logging Safety | No plaintext weight data or sensitive information in vLLM container logs or system journals |
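
The entropy criterion in SEC-1 refers to Shannon entropy over byte values, which tops out at 8 bits/byte for uniformly random data. The sketch below shows the calculation. The function names are hypothetical, and how the suite selects the safetensors data sections to scan is not shown.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input, at most 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_entropy_check(data: bytes, threshold: float = 7.5) -> bool:
    """Apply the SEC-1 threshold from the table above."""
    return shannon_entropy(data) >= threshold
```

Well-encrypted data should sit close to 8 bits/byte. Structured plaintext typically scores lower, which is why the check is paired with a pattern scan.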

### Serving (SERV-1 — SERV-7)

End-to-end validation of the FHE-encrypted serving pipeline. These tests have
**hard pass/fail criteria** and implement a 3-way encryption proof.

| ID | Test | Pass Condition |
|---|---|---|
| SERV-1 | Encrypted Model File Integrity | `config.json`, tokenizer files, ≥1 `.safetensors` file, total size ≥ 1.5 GB |
| SERV-2 | vLLM Server Health | HTTP 200 on `/health` endpoint |
| SERV-3 | TEE Serving Status | Encrypted model visible via admin API, user `/v1/models`, or inference probe |
| SERV-4 | TEE Inference Coherence | TEE output has space ratio ≥ 0.08 (coherent English text) |
| SERV-5 | Encryption Reality Proof | Direct bypass of TEE produces garbled output (space ratio < 0.04) |
| SERV-6 | Output Fidelity | SequenceMatcher ratio ≥ 0.70 between TEE and clear model for all probe prompts |
| SERV-7 | FHE Overhead | Encryption overhead < 10ms, decryption overhead < 5ms per request |

**3-Way Proof Architecture:**

```
[Path 1 — TEE]      User → TEE:9999 → Encrypt → vLLM:8000 → TEE → Decrypt → User   ✅ Coherent
[Path 2 — Clear]    User → vLLM:8000 (clear weights) → User                          ✅ Coherent
[Path 3 — Bypass]   User → vLLM:8000 (encrypted weights, no TEE) → User              ❌ Garbled
```

SERV-4 proves Path 1 works. SERV-5 proves Path 3 fails. SERV-6 proves Path 1 ≈ Path 2.
Together, they demonstrate that encryption is real and the TEE correctly handles
encryption/decryption.
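
The three checks can be combined as in the sketch below, using the thresholds from the table. It assumes "space ratio" means the fraction of characters that are spaces. `space_ratio` and `three_way_proof` are illustrative names, not the suite's API.

```python
from difflib import SequenceMatcher


def space_ratio(text: str) -> float:
    """Fraction of characters that are spaces (assumed definition)."""
    return text.count(" ") / len(text) if text else 0.0


def three_way_proof(tee_out: str, clear_out: str, bypass_out: str) -> dict:
    """Evaluate SERV-4, SERV-5 and SERV-6 for one probe prompt."""
    return {
        "SERV-4": space_ratio(tee_out) >= 0.08,    # Path 1 is coherent
        "SERV-5": space_ratio(bypass_out) < 0.04,  # Path 3 is garbled
        "SERV-6": SequenceMatcher(None, tee_out, clear_out).ratio() >= 0.70,
    }
```

All three values must be `True` for the proof to hold for a given prompt.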

### Training (T-1 — T-3)

Validates secure fine-tuning of encrypted models (**Extended POC**). These tests
require `training.enabled: true` and a `training.dataset_path` in the configuration.
Skipped by default.

| ID | Test | Description |
|---|---|---|
| T-1 | Secure Checkpoints | Verifies encrypted fine-tuning checkpoints exist and have high entropy |
| T-2 | Convergence | Compares loss curves between clear and encrypted fine-tuning |
| T-3 | Inference Quality | Tests inference on fine-tuned checkpoints |

---

## Configuration Reference

Configuration is loaded in the following order of precedence (highest first):

1. **CLI flags** (`--gpu-host`, `--tee-host`, `--model`, `--num-prompts`, etc.)
2. **Custom YAML file** (`-c my_config.yaml`)
3. **Environment variables** (`DK_GPU_HOST`, `DK_TEE_HOST`, `DK_MODEL_NAME`, `DK_OUTPUT_DIR`)
4. **Built-in defaults** (`config/default.yaml`)
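
The precedence order above amounts to a layered merge, applied from lowest to highest priority. The sketch below illustrates the idea. `resolve_config` is a hypothetical helper, and treating `None` as "flag not supplied" is an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable -> configuration key, as listed in this README.
ENV_MAP = {
    "DK_GPU_HOST": "gpu_host",
    "DK_TEE_HOST": "tee_host",
    "DK_MODEL_NAME": "model_name",
    "DK_OUTPUT_DIR": "output_dir",
}


def resolve_config(
    defaults: Mapping,
    yaml_cfg: Mapping,
    cli_overrides: Mapping,
    environ: Optional[Mapping] = None,
) -> dict:
    """Merge layers so that CLI > YAML > environment > defaults."""
    environ = os.environ if environ is None else environ
    env_cfg = {key: environ[var] for var, key in ENV_MAP.items() if var in environ}
    merged = dict(defaults)
    # Later layers win; None means the value was not supplied.
    for layer in (env_cfg, yaml_cfg, cli_overrides):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

With this ordering, a value set in the YAML file overrides the matching `DK_*` variable, and a CLI flag overrides both.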

### Required Parameters

| Parameter | Description |
|---|---|
| `tee_host` | IP address of the TEE machine |
| `tee_admin_token` | Admin authentication token for the TEE (provided by DataKrypto) |
| `tee_user_token` | User/inference authentication token for the TEE (provided by DataKrypto) |
| `model_name` | Clear model name (e.g., `Llama-3.2-1B-Instruct`) |
| `model_path_clear` | Absolute path to the plaintext model weights on the GPU machine |
| `model_path_encrypted` | Absolute path to the encrypted model weights on the GPU machine |
| `encrypted_model_id` | Model ID assigned by FHEnom (from `fhenomai model list`) |
| `encrypted_model_name` | Served model name for the encrypted vLLM instance |
| `fhenomai_venv` | Path to the Python venv containing the `fhenomai` CLI |
| `gpu_home` | HOME directory for fhenomai commands on the GPU machine |

### Remote Mode Parameters

Required only when `local_mode: false` (running from a separate machine):

| Parameter | Description |
|---|---|
| `gpu_host` | IP address of the GPU machine |
| `gpu_ssh_user` | SSH username for the GPU machine |
| `gpu_ssh_key` | Path to the SSH private key for the GPU machine |

### Optional Parameters

| Parameter | Default | Description |
|---|---|---|
| `local_mode` | `true` | Run GPU commands locally (no SSH). Set `false` for remote execution |
| `tee_admin_port` | `9099` | TEE admin API port |
| `tee_user_port` | `9999` | TEE user/inference API port |
| `tee_ssh_key` | *(empty)* | SSH key for TEE host (enables getpwuid fix) |
| `vllm_image` | `vllm/vllm-openai:latest` | Docker image for vLLM |
| `vllm_port` | `8000` | Port shared sequentially by clear and encrypted instances |
| `vllm_gpu_memory_utilization` | `0.5` | GPU memory fraction for vLLM |
| `vllm_max_model_len` | `8192` | Maximum sequence length |
| `vllm_tensor_parallel_size` | `1` | Number of GPUs for tensor parallelism |
| `vllm_startup_timeout` | `300` | Seconds to wait for vLLM to become ready |
| `vllm_request_timeout` | `120` | Timeout per inference request |
| `temperature` | `0` | Inference temperature (0 = deterministic) |
| `num_prompts` | `10` | Number of benchmark prompts |
| `seed` | `42` | Random seed for prompt generation |
| `fhe_enc_overhead_threshold_ms` | `10.0` | SERV-7 encryption overhead threshold |
| `fhe_dec_overhead_threshold_ms` | `5.0` | SERV-7 decryption overhead threshold |
| `output_dir` | `./results` | Output directory for JSON results and HTML report |
| `customer_name` | *(empty)* | Customer name shown in the report header |
| `poc_id` | *(empty)* | POC identifier shown in the report header |

### Scalability Parameters

```yaml
scalability:
  concurrent_levels: [1, 5, 10, 20, 50]
  context_lengths: [512, 1024, 2048, 4096, 8192]
  batch_sizes: [1, 4, 8, 16, 32]
  sustained_duration_hours: 24
  sustained_abbreviated_minutes: 3
```

### Accuracy Benchmarks

```yaml
accuracy_benchmarks:
  - "mmlu"
  - "hellaswag"
  - "gsm8k"
  - "humaneval"
```

---

## Execution Flow

```
dk-test run
  │
  ├── Phase 1: Clear Model
  │     ├── Start vLLM container (clear model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Run PERF / ACC / SCALE clear-side tests
  │     ├── Collect SERV clear-model reference probes
  │     ├── Save intermediate results
  │     └── Stop vLLM container
  │
  ├── Phase 2: Encrypted Model
  │     ├── Start vLLM container (encrypted model, port 8000)
  │     ├── Wait for /v1/models readiness
  │     ├── Initialize fhenomai CLI config
  │     ├── Start TEE serving (fhenomai serve start)
  │     ├── Verify model is ONLINE (fhenomai model list --show-status)
  │     ├── Run PERF / ACC / SCALE encrypted-side tests (via TEE)
  │     └── Save intermediate results
  │
  ├── Phase 3: Security Validation
  │     └── Run SEC-1 through SEC-6
  │
  ├── Phase 4: Serving Workflow
  │     └── Run SERV-1 through SERV-7 (encrypted probes + 3-way comparison)
  │
  ├── Phase 5: Training (Extended POC)
  │     └── Run T-1 through T-3 (if enabled)
  │
  ├── Teardown
  │     ├── Stop vLLM container
  │     ├── Stop TEE serving (fhenomai serve stop)
  │     └── Verify model is OFFLINE
  │
  └── Report Generation
        ├── Side-by-side comparison tables
        ├── Pass/fail badges per test
        ├── Per-prompt response viewer
        └── Chart.js performance visualizations
```
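
The "wait for readiness" steps in Phases 1 and 2 are a poll-with-timeout loop. A minimal sketch follows, assuming a caller-supplied `probe` callable (hypothetical) that returns `True` once the server answers. The 300-second default mirrors `vllm_startup_timeout`, while the 2-second interval is an arbitrary choice.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the server is still starting
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A probe could, for example, request `/v1/models` on the vLLM port and return whether the status is 200. Connection errors raised by `urllib` derive from `OSError`, so they are treated as "not ready yet".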

---

## Environment Variables

| Variable | Maps To |
|---|---|
| `DK_GPU_HOST` | `gpu_host` |
| `DK_TEE_HOST` | `tee_host` |
| `DK_MODEL_NAME` | `model_name` |
| `DK_OUTPUT_DIR` | `output_dir` |

---

## Dependencies

**Core** (installed automatically):

`click` · `pyyaml` · `httpx` · `aiohttp` · `numpy` · `scipy` · `jinja2` · `rich` ·
`paramiko` · `psutil` · `tenacity`

**Optional — accuracy** (`pip install dk-test-suite[accuracy]`):

`lm-eval` · `bert-score` · `deepeval`

**Optional — development** (`pip install dk-test-suite[dev]`):

`pytest` · `pytest-cov` · `black` · `flake8` · `mypy` · `isort`

**Runtime requirements:**

- Python ≥ 3.10
- Docker (for vLLM container management)
- NVIDIA GPU with CUDA drivers (for inference)
- Network access to the TEE machine (HTTPS)
- `fhenomai` CLI installed in a Python venv on the GPU machine

---

## Output

Each run produces:

| File | Description |
|---|---|
| `results/poc_report_YYYYMMDD_HHMMSS.html` | Self-contained HTML report |
| `results/performance_results.json` | PERF-1 through PERF-5 raw data |
| `results/accuracy_results.json` | ACC-1 through ACC-5 raw data |
| `results/scalability_results.json` | SCALE-1 through SCALE-4 raw data |
| `results/security_results.json` | SEC-1 through SEC-6 raw data |
| `results/serving_results.json` | SERV-1 through SERV-7 raw data |
| `results/training_results.json` | T-1 through T-3 raw data |
| `results/prompt_set.json` | Deterministic prompt set used for the run |
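
For post-processing outside the HTML report, the per-category files can be loaded with a few lines of Python. The sketch below relies only on the `*_results.json` naming shown above. `load_results` is a hypothetical helper, and because the internal JSON schema is not documented here, it returns the parsed content unchanged.

```python
import json
from pathlib import Path

SUFFIX = "_results.json"


def load_results(results_dir: str) -> dict:
    """Map category name -> parsed JSON for every *_results.json file."""
    results = {}
    for path in sorted(Path(results_dir).glob("*" + SUFFIX)):
        category = path.name[: -len(SUFFIX)]  # e.g. "security"
        results[category] = json.loads(path.read_text(encoding="utf-8"))
    return results
```

`prompt_set.json` and the HTML report do not match the pattern, so they are skipped.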

---

## License

MIT License — Copyright © 2025 DataKrypto.

---

## Links

- **Website:** [datakrypto.ai](https://datakrypto.ai)
- **Documentation:** [docs.datakrypto.ai](https://docs.datakrypto.ai)
- **LinkedIn:** [linkedin.com/company/datakrypto](https://www.linkedin.com/company/datakrypto/)
