Metadata-Version: 2.3
Name: ml-loadtest
Version: 1.10.0a1
Summary: Adaptive load testing tool for ML inference APIs with dynamic scaling and regression detection
License: MIT
Keywords: load-testing,locust,ml,api-testing,performance,inference,benchmarking
Requires-Python: >=3.10
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: System :: Benchmark
Requires-Dist: dacite (>=1.8,<2.0)
Requires-Dist: locust (>=2.20,<3.0)
Requires-Dist: notion-client (>=2.0,<3.0)
Requires-Dist: numpy (>=1.23.5,<2.0.0)
Requires-Dist: psutil (>=5.9)
Requires-Dist: requests (>=2.28,<3.0)
Project-URL: Homepage, https://github.com/IncodeTechnologies/ml-load-testing-tool
Project-URL: Issues, https://github.com/IncodeTechnologies/ml-load-testing-tool/issues
Project-URL: Repository, https://github.com/IncodeTechnologies/ml-load-testing-tool
Description-Content-Type: text/markdown

# ML Load Testing Tool

Adaptive load testing tool for ML inference APIs. Uses Locust to dynamically scale concurrent users based on P99 latency targets, then analyzes results for regression detection and rate limit recommendations.

## Features

- **Adaptive Scaling**: Automatically adjusts concurrent users based on P99 latency targets
- **Multi-Mode Testing**: Individual endpoint, production mix, exploration, and spike test modes
- **Regression Detection**: Compare test results against baselines to catch performance degradations
- **Rate Limit Recommendations**: Calculates safe rate limits with configurable safety factors
- **HPA Trigger Tuning**: Recommends API/Triton autoscaler thresholds from signals observed at the operating point (reuses the chart's own PromQL via Prometheus)
- **Notion Integration**: Sync results to Notion for tracking and visualization

## Installation

### From GitHub

```bash
pip install git+https://github.com/IncodeTechnologies/ml-load-testing-tool.git
```

### From Source

```bash
git clone https://github.com/IncodeTechnologies/ml-load-testing-tool.git
cd ml-load-testing-tool
poetry install
```

## Quick Start

The tool requires a weights module that defines which endpoints to test. Use bundled examples or create your own:

```bash
# Test with bundled example TaskSets
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights

# Run headless with specific parameters
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights \
  --users 32 \
  --spawn-rate 4 \
  --run-time 60s \
  --headless
```

The `ml-loadtest-file` command prints the path to the installed locustfile, giving you full access to all Locust CLI parameters.

**Note:** The `--weights-module` parameter is required. It specifies a Python module containing a `production_weights` dictionary that maps TaskSet classes to their relative weights.

## Usage

### Two Usage Patterns

**Pattern 1: Direct Path (Recommended for CLI)**

```bash
# Get full locust CLI access with installed package
locust -f $(ml-loadtest-file) --host http://api:8000 [any locust params]
```

**Pattern 2: Local Import (Recommended for Customization)**

Create a local `locustfile.py` in your project:

```python
# Import everything from the installed package
from ml_loadtest.locustfile import *

# Optionally override settings or add custom logic here
```

Then run:

```bash
locust -f locustfile.py --host http://api:8000
```

### Load Testing

The tool supports four test modes via the `--test-mode` parameter. You can run a single mode or multiple modes in sequence (space-separated).

#### 1. Individual Endpoint Testing

Test each endpoint separately to find individual capacity:

```bash
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights \
  --test-mode INDIVIDUAL \
  --target-p99-ms 500 \
  --max-users 100
```

#### 2. Production Mix Testing

Test with the production traffic distribution defined in the weights module:

```bash
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module ml_loadtest.examples.distribution_weights \
  --test-mode PRODUCTION \
  --target-p99-ms 500
```

#### 3. Exploration Mode

Test multiple weight distributions to find the optimal mix:

```bash
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights \
  --test-mode EXPLORATION \
  --target-p99-ms 1000
```

#### 4. Spike Testing

Test sudden traffic spikes:

```bash
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module ml_loadtest.examples.distribution_weights \
  --test-mode SPIKE \
  --spike-target-rps 1000 \
  --spike-duration 30
```

#### 5. Multiple Modes

Run multiple test modes in sequence:

```bash
# Run individual and production tests (default)
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights \
  --test-mode INDIVIDUAL PRODUCTION

# Run all test modes
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module loadtest.distribution_weights \
  --test-mode INDIVIDUAL PRODUCTION EXPLORATION SPIKE
```

### Key Configuration Options

All standard Locust parameters are available, plus:

- `--target-p99-ms`: Target P99 latency in milliseconds (default: 1000)
- `--max-users`: Maximum concurrent users (default: 32)
- `--min-users`: Minimum concurrent users (default: 1)
- `--test-mode`: Test modes to run - INDIVIDUAL, PRODUCTION, EXPLORATION, or SPIKE. Space-separated for multiple (default: INDIVIDUAL PRODUCTION)
- `--increase-step`: Users added per step when under target — AIMD additive increase (default: 1)
- `--decrease-rate`: User decrease multiplier when over target — AIMD multiplicative decrease (default: 0.8)
- `--check-interval`: Seconds between scaling checks (default: 30)
- `--settle-periods`: Check intervals to wait after a user-count change before adjusting again, so the new load level can stabilise and P99 is measured only over post-change samples (default: 1)
- `--output-file`: Output filename prefix (default: "report_loadtest_results")
- `--weights-module`: Python module that defines the `production_weights` dictionary (required)
- `--spike-target-rps`: Target RPS for spike mode (default: 100.0)
- `--spike-duration`: Duration in seconds for spike mode (default: 30)

Full list of Locust parameters: https://docs.locust.io/en/stable/configuration.html
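
The AIMD options above can be illustrated with a minimal sketch. It assumes the controller adds `--increase-step` users while P99 is at or under target, multiplies the user count by `--decrease-rate` when P99 is over target, and clamps the result to `--min-users`/`--max-users`. The function name `next_user_count` is hypothetical, and the sketch ignores `--settle-periods` and `tolerance`; the packaged locustfile may differ in detail.

```python
import math


def next_user_count(
    current_users: int,
    p99_ms: float,
    target_p99_ms: float = 1000.0,
    increase_step: int = 1,
    decrease_rate: float = 0.8,
    min_users: int = 1,
    max_users: int = 32,
) -> int:
    """Hypothetical helper: one AIMD step using the documented defaults."""
    if p99_ms <= target_p99_ms:
        # Additive increase: probe for more capacity one step at a time.
        proposed = current_users + increase_step
    else:
        # Multiplicative decrease: back off quickly when over target.
        proposed = math.floor(current_users * decrease_rate)
    # Keep the result inside the configured user bounds.
    return max(min_users, min(max_users, proposed))


# Under target: 10 -> 11 users. Over target: 10 -> 8 users.
print(next_user_count(10, p99_ms=500.0))
print(next_user_count(10, p99_ms=1500.0))
```

Additive increase with multiplicative decrease approaches the capacity limit slowly and retreats from it quickly, which keeps the test from overshooting the latency target for long.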

### Configuration File

The package includes a `locust.conf` file that sets default values for both standard Locust parameters and custom ml-loadtest parameters, so you don't have to repeat common options on the command line.

**Configuration example:**

```ini
; Locust configuration file
host = http://api:8000
headless
only-summary
run-time = 2h
loglevel = INFO
csv = report
html = report.html

; Load test settings
target-p99-ms = 1000
min-users = 1
max-users = 32
increase-step = 1
decrease-rate = 0.8
check-interval = 30
settle-periods = 1
tolerance = 0.1
production-run-duration = 1200
weights-module = loadtest.distribution_weights
```

**How to use it:**

```bash
# Use the bundled config file (--config takes a path; adjust it if locust.conf is not in the current directory)
locust -f $(ml-loadtest-file) \
  --config locust.conf

# Override specific settings from config file
locust -f $(ml-loadtest-file) \
  --config locust.conf \
  --max-users 64
```

**Note:** Command-line arguments always override config file settings.

### Analysis

After running tests, analyze results for regressions and get rate limit recommendations:

```bash
# Basic analysis
python -m ml_loadtest.analyze

# Update baseline after confirming results are good
python -m ml_loadtest.analyze --update-baseline

# Custom input/output files
python -m ml_loadtest.analyze \
  --input-file custom_report_loadtest_results.json \
  --baseline-file my_baseline.json \
  --output-file analysis_output.txt
```

The analyzer will:
- Compare current results against baseline
- Detect performance regressions (default 10% threshold)
- Recommend safe rate limits (default 70% of measured capacity)
- Recommend HPA trigger thresholds (see below)
- Generate detailed reports with statistics
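
The two default thresholds above can be read as simple arithmetic. The sketch below is an assumption about the comparison, not the analyzer's actual code: it treats a regression as a drop in measured throughput of more than 10% relative to the baseline, and a safe rate limit as 70% of measured capacity. The function names and the choice of requests per second as the compared metric are hypothetical.

```python
def is_regression(baseline_rps: float, current_rps: float, threshold: float = 0.10) -> bool:
    """Hypothetical check: True if throughput fell by more than `threshold`."""
    if baseline_rps <= 0:
        # No usable baseline, so nothing to compare against.
        return False
    relative_drop = (baseline_rps - current_rps) / baseline_rps
    return relative_drop > threshold


def recommended_rate_limit(measured_rps: float, safety_factor: float = 0.7) -> float:
    """Hypothetical helper: leave headroom below the measured capacity."""
    return measured_rps * safety_factor


print(is_regression(baseline_rps=100.0, current_rps=85.0))  # 15% drop
print(is_regression(baseline_rps=100.0, current_rps=95.0))  # 5% drop
print(recommended_rate_limit(200.0))
```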

### HPA Trigger Tuning

The analyzer recommends HPA trigger thresholds for the API and Triton KEDA `ScaledObject`s.

It tunes four KEDA triggers (all read from Prometheus): **API `cpuUtilization`**, **Triton
`cpuUtilization`**, **Triton `gpuUtilization`**, and **Triton `inferenceRequestDuration`**. Both CPU
triggers are standardized to **AverageValue in millicpus** (e.g. `1000m`), not a Utilization %.

**Methodology — a tier's thresholds are valid only from a run where that tier is the bottleneck.**
With HPA disabled, the AIMD controller converges where P99 hits target (the *operating point*); the
analyzer reads each trigger's signal over the converged window (reusing the chart's own PromQL) and
recommends `threshold = observed × hpa-margin`. But a signal only reflects a tier's true ceiling if
that tier actually saturated — so the analyzer recommends thresholds **only for the saturated
(bottleneck) tier** and reports the other tier observed-only. Because one replica ratio can't
saturate both tiers, the pipeline uses **two runs**:

| Run | Deploy | Saturates | Yields |
|---|---|---|---|
| **Individual** | 1 API : 1 Triton | API (one API pod can't saturate Triton) | API `cpuUtilization`; also per-endpoint capacity (rate limits, tracking) |
| **Production** | 6 API : 1 Triton | Triton (6 API pods overwhelm 1 Triton) | Triton `inferenceRequestDuration` (primary) + `gpuUtilization` + `cpuUtilization` |

Within the saturated tier, every signal co-occurs at the operating point, so all are recommended —
the binding (first-to-fire) one is flagged **primary**, the rest **secondary**. The non-saturated
tier is throttled by the other, so its signals are below its own capacity and get **no number** (with
a note to tune them in the run where that tier saturates). `--bottleneck-tier {api,triton,auto}`
declares which tier a run saturates (`auto` infers it from the operating-point signals).

```bash
# Individual run (1:1) — API CPU threshold + endpoint tracking + rate limits
python -m ml_loadtest.analyze \
  --api-replicas 1 --bottleneck-tier api --hpa-margin 0.8 \
  --prometheus-url http://k8s-observability-prometheus.monitoring:9090 \
  --api-pod-selector 'ml-load-test-<run>-api.+'

# Production run (6:1) — Triton thresholds
python -m ml_loadtest.analyze \
  --api-replicas 6 --bottleneck-tier triton --hpa-margin 0.8 \
  --prometheus-url http://k8s-observability-prometheus.monitoring:9090 \
  --triton-pod-selector 'ml-load-test-<run>-triton-modelverse.+'
```

Only two selectors exist — `--api-pod-selector` and `--triton-pod-selector` (pod-name regexes). CPU
is scoped to the app container (`container="api"`/`"triton"`); Triton's GPU/inference metrics are
matched by an `application` label derived from `--triton-pod-selector`.

`--hpa-margin` is distinct from `--safety-factor`: it is a scale-*out* trigger point that also
absorbs scale-up latency (cooldown / stabilization windows), not a rate-limit ceiling. Current chart
thresholds can be overridden with `--api-cpu-threshold`, `--triton-cpu-threshold`,
`--triton-gpu-threshold`, `--triton-infer-threshold` (they double as the saturation gauge). All four
signals are Prometheus-derived, so `--prometheus-url` is required for HPA tuning. Each trigger's
PromQL `query` and recommendation are written to `<output-file>_hpa.json` for the Notion sync.

> **First-run validation.** Prometheus selectors are environment-specific. A wrong selector produces
> a distinct "no series" warning (not a crash). Sanity-check the queries against the live Prometheus,
> confirm Triton is scraped (the `nv_*` metrics), and **compare API CPU% in the 1:1 vs 6:1 runs** to
> confirm the tiers actually separate (the 1:1 run should be API-bound). The metric store
> (VictoriaMetrics) is queried via the `time=` parameter, so no PromQL `@` modifier is used; for a
> multitenant vmselect, include the tenant path prefix in `--prometheus-url`.
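
For sanity-checking a query by hand, the sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query` with `query` and `time` parameters). The helper name, host names, and the PromQL string are illustrative assumptions; the chart's actual queries are the ones written to `<output-file>_hpa.json`.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_url: str, promql: str, at_unix_seconds: float) -> str:
    """Hypothetical helper: build an instant-query URL evaluated at a fixed time."""
    base = prometheus_url.rstrip("/")
    # `time=` pins the evaluation timestamp, so no PromQL `@` modifier is needed.
    params = urlencode({"query": promql, "time": f"{at_unix_seconds:.3f}"})
    return f"{base}/api/v1/query?{params}"


# Illustrative query only; substitute the PromQL from the chart.
promql = 'sum(rate(container_cpu_usage_seconds_total{container="api"}[5m]))'
print(instant_query_url("http://prometheus.example:9090", promql, 1700000000))

# For a multitenant vmselect, the tenant path prefix is part of the base URL.
print(instant_query_url("http://vmselect.example:8481/select/0/prometheus", "up", 1700000000))
```

Opening the printed URL with `curl` or a browser shows whether the selector returns any series before a full analysis run depends on it.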

### Notion Integration

Sync test results to Notion for tracking:

```bash
# Set Notion credentials (environment variables)
export NOTION_TOKEN="notion-integration-token"
export NOTION_TEST_RESULTS_DATABASE_ID="test-results-database-id"
export NOTION_ENDPOINT_DATABASE_ID="endpoint-database-id"

python -m ml_loadtest.notion_sync "service-name" "v1.0.0" \
    --report-file report_loadtest_results.json \
    --baseline-file baseline.json
```

## Extending with Custom TaskSets

### Creating Custom TaskSets

Each TaskSet must implement this interface:

```python
from locust import TaskSet

class MyCustomTaskSet(TaskSet):
    # Required: endpoint identifier
    endpoint = "/my-endpoint"

    # Required: test implementation
    def test_endpoint(self) -> None:
        with self.client.post(
            self.endpoint,
            json={"data": "example"},
            name=self.endpoint,
            catch_response=True,
        ) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Failed with {response.status_code}")
```

### Using Custom TaskSets

Create a weights module (e.g., `distribution_weights.py`):

```python
from my_tasks import TaskSet1, TaskSet2, TaskSet3

production_weights = {
    TaskSet1: 50,  # 50% of requests
    TaskSet2: 30,  # 30% of requests
    TaskSet3: 20,  # 20% of requests
}
```

Run with custom weights:

```bash
locust -f $(ml-loadtest-file) \
  --host http://api:8000 \
  --weights-module distribution_weights
```
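
Because the weights are relative, they do not have to sum to 100. The sketch below shows how relative weights translate into traffic shares and a weighted pick, using plain strings as stand-ins for TaskSet classes. The helper names are hypothetical and this is not how the packaged locustfile is guaranteed to select tasks.

```python
import random


def weight_shares(weights: dict) -> dict:
    """Hypothetical helper: convert relative weights to fractions of traffic."""
    total = sum(weights.values())
    return {task: weight / total for task, weight in weights.items()}


def pick_task(weights: dict, rng: random.Random):
    """Hypothetical helper: choose one key with probability proportional to its weight."""
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]


# Strings stand in for TaskSet classes in this example.
production_weights = {"TaskSet1": 50, "TaskSet2": 30, "TaskSet3": 20}
print(weight_shares(production_weights))
print(pick_task(production_weights, random.Random(0)))
```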

## Architecture

### Core Components

1. **locustfile.py** - Test orchestration with adaptive scaling
   - `EndpointCapacityExplorer`: Manages test modes and adaptive scaling
   - `LoadTestHttpUser`: Executes weighted endpoint tasks
   - Daemon thread monitors P99 and adjusts user count dynamically

2. **analyze.py** - Post-test analysis
   - `LoadTestAnalyzer`: Regression detection and rate limit calculation
   - Compares against baselines (10% regression threshold by default)
   - Recommends safe limits (70% of measured capacity by default)

3. **distribution_weights.py** - Production traffic weights
   - Example weight configuration for bundled TaskSets
   - Template for custom weight modules

4. **notion_sync.py** - Notion integration
   - Syncs test results to Notion databases
   - Tracks performance metrics over time

### Data Flow

1. Locust users send requests to target endpoints
2. Response times are captured in circular buffers (maxlen=2000)
3. A daemon thread checks P99 every `--check-interval` seconds
4. The user count is adjusted based on how P99 compares with the target
5. After the test completes, JSON/TXT reports are saved
6. Analyzer loads reports for regression detection and recommendations
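
Steps 2 and 3 can be sketched with the standard library. This is a minimal illustration, assuming a `collections.deque` with `maxlen=2000` as the circular buffer and a nearest-rank P99; the actual percentile method used by the tool is not stated here, and the names below are hypothetical.

```python
from collections import deque

# Circular buffer: once 2000 samples are stored, the oldest are dropped.
response_times_ms: deque = deque(maxlen=2000)


def p99(samples) -> float:
    """Hypothetical helper: nearest-rank 99th percentile of the samples."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples recorded yet")
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Record 3000 synthetic response times; only the latest 2000 are kept.
response_times_ms.extend(range(1, 3001))
print(len(response_times_ms))
print(p99(response_times_ms))
```

Bounding the buffer means the P99 reflects recent behaviour rather than the whole run, which is what an adaptive controller needs after each change in user count.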

## Development

### Running Tests

```bash
make test
```

### Linting and Formatting

```bash
make type-check
make format
make lint
```
