Metadata-Version: 2.4
Name: pytest-fkit
Version: 0.3.0
Summary: A pytest plugin that prevents crashes from killing your test suite, with execution tracing
Author: Cemberk
License: MIT
Project-URL: Homepage, https://github.com/Cemberk/pytest-fkit
Project-URL: Repository, https://github.com/Cemberk/pytest-fkit
Project-URL: Issues, https://github.com/Cemberk/pytest-fkit/issues
Keywords: pytest,testing,crash,isolation,subprocess,tracing
Classifier: Framework :: Pytest
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest>=6.0.0
Provides-Extra: tracer
Requires-Dist: psutil>=5.0.0; extra == "tracer"
Provides-Extra: pysr
Requires-Dist: pysr>=0.16.0; extra == "pysr"
Requires-Dist: numpy>=1.20.0; extra == "pysr"
Requires-Dist: pandas>=1.3.0; extra == "pysr"
Provides-Extra: all
Requires-Dist: psutil>=5.0.0; extra == "all"
Requires-Dist: pysr>=0.16.0; extra == "all"
Requires-Dist: numpy>=1.20.0; extra == "all"
Requires-Dist: pandas>=1.3.0; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# pytest-fkit

**F**ix **K**rashes **I**n **T**ests - A pytest plugin that prevents crashes from killing your entire test suite.

When a test crashes the Python interpreter (SIGABRT, SIGSEGV, etc.), pytest-fkit catches the crash and reports it as a normal pytest ERROR instead of letting it kill your entire test run.

**Features:**
- Parallel workers with GPU affinity
- **Sliced test distribution** (default) - tests are pre-distributed across workers for deterministic, efficient execution
- Crash isolation - each test runs in its own subprocess
- Automatic GPU error detection and retry
- Fault tolerance (workers can fail without stopping the test run)

## The Problem

When running large test suites (like HuggingFace Transformers), sometimes a test causes Python to crash with a signal like SIGABRT:

```
Fatal Python error: Aborted
Thread 0x0000799e2ea00640 (most recent call first):
  File "/transformers/src/transformers/models/dots1/modeling_dots1.py", line 331 in forward
  ...
```

This kills pytest entirely, and all remaining tests in your suite never run.

## The Solution

pytest-fkit runs each test in an isolated subprocess. If a test crashes:
- ✅ The crash is caught and reported as a pytest ERROR
- ✅ The remaining tests continue running
- ✅ You get a full report with all test results, including which ones crashed
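
As a rough illustration of the idea (not pytest-fkit's actual implementation), the sketch below runs a snippet in a child interpreter and classifies the outcome. The helper name `run_isolated` and the outcome strings are hypothetical. It relies on the POSIX convention that `subprocess` reports a negative return code when the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> str:
    """Run a snippet in a child interpreter and classify the outcome.

    Hypothetical helper for illustration only; pytest-fkit's real
    subprocess protocol may differ.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was killed because it exceeded the timeout.
        return "ERROR (timeout)"
    if proc.returncode < 0:
        # POSIX: a negative return code is the negated signal number.
        name = signal.Signals(-proc.returncode).name
        return f"ERROR (crashed with {name})"
    return "PASSED" if proc.returncode == 0 else "FAILED"
```

Because the crash happens in the child, the parent process only sees a return code and can carry on with the next test. For example, `run_isolated("import os; os.abort()")` aborts the child with SIGABRT while the caller keeps running.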

## Installation

```bash
cd pytest-fkit
pip install -e .
```

Or install the package by name (for example, from your test requirements):
```bash
pip install pytest-fkit
```

## Usage

### Basic Usage

Just add the `--fkit` flag to your pytest command:

```bash
pytest --fkit
```

### With Timeout

Set a timeout per test (default is 600 seconds / 10 minutes):

```bash
pytest --fkit --fkit-timeout=300  # 5 minute timeout per test
```

### Parallel Workers with Sliced Distribution

Run tests in parallel with automatic slicing:

```bash
# Auto-detect workers based on GPU count
pytest --fkit --fkit-workers=auto

# Specific number of workers
pytest --fkit --fkit-workers=4

# Control GPUs per worker (for multi-GPU tests)
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
```

**Sliced Scheduling (default)**: Tests are pre-distributed across workers:
1. Tests are sorted by nodeid for reproducibility
2. Round-robin distribution: test[i] goes to worker[i % num_workers]
3. Each worker runs its slice with crash isolation (subprocess per test)
4. Workers run in parallel for maximum throughput

**Example with 4 workers and 100 tests:**
- Worker 0: tests 0, 4, 8, 12, ... (25 tests)
- Worker 1: tests 1, 5, 9, 13, ... (25 tests)
- Worker 2: tests 2, 6, 10, 14, ... (25 tests)
- Worker 3: tests 3, 7, 11, 15, ... (25 tests)
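
The distribution rule above can be sketched in a few lines. This is a minimal illustration, assuming tests are identified by their node ID strings; `slice_tests` is a hypothetical name, not part of the plugin's API.

```python
from typing import List, Sequence


def slice_tests(nodeids: Sequence[str], num_workers: int) -> List[List[str]]:
    """Sort node IDs, then deal them out round-robin.

    After sorting, test[i] goes to worker[i % num_workers].
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    ordered = sorted(nodeids)
    # Extended slicing picks indices w, w + num_workers, w + 2 * num_workers, ...
    return [ordered[w::num_workers] for w in range(num_workers)]


# 100 hypothetical tests across 4 workers
slices = slice_tests([f"tests/test_mod.py::test_{i:03d}" for i in range(100)], 4)
print([len(s) for s in slices])  # [25, 25, 25, 25]
```

Sorting first is what makes the result reproducible: the same set of collected tests always yields the same slices, regardless of collection order.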

### Execution Modes

```bash
# Batch mode (default) - pre-sliced, deterministic distribution
pytest --fkit --fkit-workers=4 --fkit-mode=batch

# Isolate mode - dynamic queue, on-demand assignment
pytest --fkit --fkit-workers=4 --fkit-mode=isolate
```

| Mode | Description | Best For |
|------|-------------|----------|
| `batch` | Tests pre-sliced to workers | Most use cases, reproducible |
| `isolate` | Dynamic work queue | Highly variable test durations |
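
To make the difference from pre-slicing concrete, here is a minimal sketch of a dynamic work queue, where each worker pulls the next test as soon as it is free. The names `run_dynamic` and `run_one` are hypothetical, and the real `isolate` mode may be implemented differently.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_dynamic(
    nodeids: List[str],
    num_workers: int,
    run_one: Callable[[str, int], str],
) -> Dict[str, str]:
    """Run tests from a shared queue; idle workers take the next pending test."""
    pending: "queue.Queue[str]" = queue.Queue()
    for nodeid in nodeids:
        pending.put(nodeid)

    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(worker_id: int) -> None:
        while True:
            try:
                nodeid = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            outcome = run_one(nodeid, worker_id)
            with lock:
                results[nodeid] = outcome

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a queue, a worker stuck on one slow test does not hold up the tests behind it, which is why this style suits highly variable durations. The trade-off is that the test-to-worker mapping can change from run to run.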

### GPU Allocation Examples

**8 GPUs with multi-GPU tests (need 2 GPUs each):**
```bash
pytest --fkit --fkit-workers=4 --fkit-gpus-per-worker=2
# Worker 0: GPU 0,1
# Worker 1: GPU 2,3
# Worker 2: GPU 4,5
# Worker 3: GPU 6,7
```

**8 GPUs with single-GPU tests:**
```bash
pytest --fkit --fkit-workers=8 --fkit-gpus-per-worker=1
# Worker 0: GPU 0
# Worker 1: GPU 1
# ...
# Worker 7: GPU 7
```
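
Both layouts follow the same pattern: worker `i` receives a contiguous block of GPU IDs. A minimal sketch of that mapping is shown below. `allocate_gpus` is a hypothetical helper, and raising an error when there are too few GPUs is an assumption made for the sketch, not documented plugin behavior.

```python
from typing import List


def allocate_gpus(num_gpus: int, num_workers: int, gpus_per_worker: int) -> List[str]:
    """Give worker i the GPU IDs i * gpus_per_worker .. (i + 1) * gpus_per_worker - 1."""
    if num_workers * gpus_per_worker > num_gpus:
        # Assumption for this sketch: refuse layouts that oversubscribe GPUs.
        raise ValueError("not enough GPUs for the requested layout")
    return [
        ",".join(str(g) for g in range(w * gpus_per_worker, (w + 1) * gpus_per_worker))
        for w in range(num_workers)
    ]


print(allocate_gpus(8, 4, 2))  # ['0,1', '2,3', '4,5', '6,7']
```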

### Crash Isolation

Each test runs in its own subprocess, so crashes are contained:

1. **Crash Detection**: SIGABRT, SIGSEGV, and other signals are caught
2. **Error Conversion**: Crashes are converted to pytest ERROR results
3. **Suite Continuation**: Remaining tests continue running on the worker
4. **Full Results**: You get a complete report even if some tests crash

**Example scenario:**
```
Worker 0 (GPU 0,1): test_bert PASSED → test_llama PASSED → test_crash 💥 CRASH → test_gpt2 PASSED
Worker 1 (GPU 2,3): test_vit PASSED → test_whisper PASSED → test_t5 PASSED
Worker 2 (GPU 4,5): test_clip PASSED → test_blip PASSED → test_stable PASSED
Worker 3 (GPU 6,7): test_sam PASSED → test_dino PASSED → test_mae PASSED

# Crash on Worker 0 is isolated - other tests continue
# Final report shows 1 crash, 12 passed
```

### Skip Crash Isolation for Specific Tests

If you have tests that don't play well with subprocess isolation, mark them with `fkit_skip`:

```python
import pytest

@pytest.mark.fkit_skip
def test_something_special():
    # This test will run normally without subprocess isolation
    pass
```

### Mark GPU Requirements

Mark tests with their GPU requirements. These markers currently serve as documentation; using them for optimal GPU scheduling is planned for the future:

```python
import pytest

@pytest.mark.fkit_multi_gpu
def test_distributed_training():
    # This test needs multiple GPUs
    pass

@pytest.mark.fkit_single_gpu
def test_simple_forward():
    # This test needs only one GPU
    pass
```

## How It Works

### Architecture (Batch Mode - Default)

```
              ┌─────────────────────────────────────────────┐
              │           Test Collection (sorted)          │
              │  [test0, test1, test2, test3, test4, ...]   │
              └────────────────────┬────────────────────────┘
                                   │
                        Round-Robin Slicing
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         │                         │                         │
         ▼                         ▼                         ▼
 ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
 │   Worker 0    │         │   Worker 1    │         │   Worker 2    │
 │   GPU 0,1     │         │   GPU 2,3     │         │   GPU 4,5     │
 ├───────────────┤         ├───────────────┤         ├───────────────┤
 │ Slice:        │         │ Slice:        │         │ Slice:        │
 │  test0        │         │  test1        │         │  test2        │
 │  test3        │         │  test4        │         │  test5        │
 │  test6        │         │  test7        │         │  test8        │
 │  ...          │         │  ...          │         │  ...          │
 └───────┬───────┘         └───────┬───────┘         └───────┬───────┘
         │                         │                         │
         ▼                         ▼                         ▼
 ┌───────────────┐         ┌───────────────┐         ┌───────────────┐
 │  Subprocess   │         │  Subprocess   │         │  Subprocess   │
 │  per test     │         │  per test     │         │  per test     │
 │  (isolated)   │         │  (isolated)   │         │  (isolated)   │
 └───────────────┘         └───────────────┘         └───────────────┘
```

### Flow

1. **GPU Detection**: Automatically detects AMD (ROCm) or NVIDIA GPUs
2. **Worker Creation**: Creates N worker threads, each with dedicated GPUs
3. **Test Slicing**: Tests sorted and distributed via round-robin
4. **Parallel Execution**: Each worker runs its slice independently
5. **Subprocess Isolation**: Each test runs in its own subprocess (crash protection)
6. **Result Reporting**: Results stream back to pytest as tests complete
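
Steps 2, 4, and 6 can be sketched together: one thread per worker walks its slice in order and pushes each result onto a shared queue, so the caller sees results as they complete. This is an illustrative sketch only. `run_batch` and `run_one` are hypothetical names, and `run_one` stands in for the per-test subprocess call.

```python
import queue
import threading
from typing import Callable, Iterator, List, Tuple


def run_batch(
    slices: List[List[str]],
    run_one: Callable[[str, int], str],
) -> Iterator[Tuple[int, str, str]]:
    """Yield (worker_id, nodeid, outcome) tuples as workers finish tests."""
    results: "queue.Queue" = queue.Queue()

    def worker(worker_id: int, tests: List[str]) -> None:
        try:
            for nodeid in tests:
                results.put((worker_id, nodeid, run_one(nodeid, worker_id)))
        finally:
            results.put(None)  # sentinel: this worker has stopped

    threads = [
        threading.Thread(target=worker, args=(i, tests))
        for i, tests in enumerate(slices)
    ]
    for t in threads:
        t.start()

    finished = 0
    while finished < len(threads):
        item = results.get()
        if item is None:
            finished += 1
        else:
            yield item

    for t in threads:
        t.join()
```

The sentinel in the `finally` block means a worker that dies unexpectedly still signals that it has stopped, so the remaining workers' results keep flowing instead of the consumer waiting forever.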

## Example Output

```
🚀 pytest-fkit: 4 workers, 8 AMD GPUs, 2 GPU(s)/worker
   GPU allocations: ['0,1', '2,3', '4,5', '6,7']
   Mode: batch - sliced scheduling (tests pre-distributed to workers)

🔄 Running 1000 tests across 4 workers (sliced scheduling - each worker gets 1/4 of tests)...

📊 Test distribution across 4 workers:
   Worker 0: 250 tests
   Worker 1: 250 tests
   Worker 2: 250 tests
   Worker 3: 250 tests
   Worker 0 (GPUs: 0,1): 250 tests
   Worker 1 (GPUs: 2,3): 250 tests
   Worker 2 (GPUs: 4,5): 250 tests
   Worker 3 (GPUs: 6,7): 250 tests

tests/models/bert/test_modeling_bert.py::BertModelTest::test_forward PASSED
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_forward PASSED
tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_forward PASSED

======================================================================
✅ Completed 1000 tests
   Passed: 950, Failed: 45, Skipped: 5
   💥 Crashes: 2
======================================================================

=============== pytest-fkit summary ===============
💥 2 test(s) CRASHED (converted to ERROR by pytest-fkit):
  - tests/models/dots1/test_modeling_dots1.py::Dots1ModelTest::test_model_15b

✅ pytest-fkit prevented 2 crashes from killing your test suite!
```

## Command Line Options

| Option | Default | Description |
|--------|---------|-------------|
| `--fkit` | `False` | Enable crash isolation |
| `--fkit-timeout` | `600` | Timeout per test in seconds |
| `--fkit-workers` | `1` | Number of parallel workers (`auto` for GPU-based) |
| `--fkit-gpus-per-worker` | `2` | GPUs assigned to each worker |
| `--fkit-mode` | `batch` | `batch` (pre-sliced) or `isolate` (dynamic queue) |

## Environment Variables Set Per Worker

| Variable | Description |
|----------|-------------|
| `CUDA_VISIBLE_DEVICES` | GPU IDs for NVIDIA / compatibility |
| `HIP_VISIBLE_DEVICES` | GPU IDs for AMD ROCm |
| `ROCR_VISIBLE_DEVICES` | GPU IDs for AMD ROCm runtime |
| `FKIT_WORKER_ID` | Worker index (0, 1, 2, ...) |
| `FKIT_GPU_IDS` | Assigned GPU IDs string |

## GPU Error Patterns Detected

The following error patterns in a test's output trigger an automatic retry on a different worker (GPU error retry applies in `isolate` mode):

- `CUDA out of memory`
- `CUDA error` / `HIP error`
- `hipErrorNoBinaryForGpu`
- `hipErrorOutOfMemory`
- `NCCL error`
- `device-side assert`
- `GPU not found` / `no GPU`
- `cudaErrorNoDevice` / `hipErrorNoDevice`
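
Detection of this kind usually comes down to searching the captured test output for known strings. The sketch below shows one way to do that with a single compiled regular expression. The names are hypothetical, and case-insensitive matching is an assumption of the sketch rather than documented behavior.

```python
import re

# Patterns taken from the list above; treated here as plain substrings.
GPU_ERROR_PATTERNS = [
    "CUDA out of memory",
    "CUDA error",
    "HIP error",
    "hipErrorNoBinaryForGpu",
    "hipErrorOutOfMemory",
    "NCCL error",
    "device-side assert",
    "GPU not found",
    "no GPU",
    "cudaErrorNoDevice",
    "hipErrorNoDevice",
]

# Assumption: match case-insensitively; re.escape keeps each pattern literal.
_GPU_ERROR_RE = re.compile(
    "|".join(re.escape(p) for p in GPU_ERROR_PATTERNS), re.IGNORECASE
)


def is_gpu_error(output: str) -> bool:
    """Return True if the captured output contains a known GPU error pattern."""
    return _GPU_ERROR_RE.search(output) is not None
```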

## Performance Considerations

- **Overhead**: ~100-500ms per test for subprocess spawning
- **Parallelism**: N workers give up to ~N× throughput, less the per-test subprocess overhead
- **GPU Memory**: Each worker has dedicated GPUs - no memory contention
- **Deterministic**: Same test distribution every run (batch mode)
- **Crash Isolation**: One crash doesn't affect other tests
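
To get a feel for how overhead and parallelism trade off, here is a back-of-the-envelope model. It assumes evenly sized slices and a uniform average test duration, and it picks 0.3 s as an arbitrary point within the ~100-500ms overhead range. It is an estimate, not a measurement.

```python
import math


def estimate_wall_time(
    num_tests: int,
    avg_test_seconds: float,
    num_workers: int,
    spawn_overhead_seconds: float = 0.3,  # assumed value within the 0.1-0.5 s range
) -> float:
    """Rough wall-clock estimate: the largest slice times the per-test cost."""
    tests_per_worker = math.ceil(num_tests / num_workers)
    return tests_per_worker * (avg_test_seconds + spawn_overhead_seconds)
```

For 1000 tests averaging 2 seconds each on 4 workers, this model gives roughly 575 seconds, compared with about 2000 seconds for a serial run without isolation. The overhead matters most for suites made of very short tests, where it can dominate the per-test cost.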

### Recommended Configurations

| Scenario | Workers | GPUs/Worker | Mode | Command |
|----------|---------|-------------|------|---------|
| 8 GPUs, multi-GPU tests | 4 | 2 | batch | `--fkit-workers=4 --fkit-gpus-per-worker=2` |
| 8 GPUs, single-GPU tests | 8 | 1 | batch | `--fkit-workers=8 --fkit-gpus-per-worker=1` |
| 4 GPUs, mixed tests | 2 | 2 | batch | `--fkit-workers=2 --fkit-gpus-per-worker=2` |
| No GPUs (CPU tests) | auto | - | batch | `--fkit-workers=auto` |
| Highly variable durations | 4 | 2 | isolate | `--fkit-workers=4 --fkit-mode=isolate` |

## Configuration File

Enable pytest-fkit in `pytest.ini` or `pyproject.toml`:

```ini
# pytest.ini
[pytest]
addopts = --fkit --fkit-timeout=600 --fkit-workers=auto
```

```toml
# pyproject.toml
[tool.pytest.ini_options]
addopts = ["--fkit", "--fkit-timeout=600", "--fkit-workers=auto"]
```

## Comparison with pytest-xdist

| Feature | pytest-fkit | pytest-xdist |
|---------|-------------|--------------|
| Crash isolation | ✅ Yes (per-test subprocess) | ⚠️ Partial (crashed workers are restarted) |
| GPU affinity | ✅ Yes (automatic) | ❌ Manual |
| Parallel execution | ✅ Yes | ✅ Yes |
| Sliced scheduling | ✅ Yes (round-robin) | ✅ Yes (load-based) |
| GPU error retry | ✅ Yes (isolate mode) | ❌ No |
| Worker fault tolerance | ✅ Yes | ⚠️ Limited |
| Memory isolation | ✅ Per-test | ⚠️ Per-worker |
| Reproducible distribution | ✅ Yes (deterministic) | ⚠️ Varies |
| Overhead | Higher | Lower |

**Use pytest-fkit when:**
- Tests can crash Python (GPU drivers, C extensions)
- You need automatic GPU affinity
- You need per-test isolation
- GPU availability is unreliable
- You want automatic retry on GPU errors

**Use pytest-xdist when:**
- Tests are stable (no crashes)
- You need minimal overhead
- Tests don't use GPUs

## Compatibility

- Python 3.8+
- pytest 6.0+
- Linux, macOS (Windows support TBD)
- AMD ROCm GPUs (detected via `rocm-smi`)
- NVIDIA GPUs (detected via `nvidia-smi`)
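
A minimal sketch of tool-based detection is shown below. It only checks which command-line tool is on `PATH` and, for NVIDIA, counts the lines printed by `nvidia-smi -L` (one per GPU). The function names, the order of the checks, and the fallback to `None` are assumptions of the sketch. How pytest-fkit parses `rocm-smi` output is not covered here.

```python
import shutil
import subprocess
from typing import Optional


def detect_gpu_vendor() -> Optional[str]:
    """Guess the GPU vendor from the management tool available on PATH."""
    if shutil.which("rocm-smi"):
        return "AMD"
    if shutil.which("nvidia-smi"):
        return "NVIDIA"
    return None  # no known tool found; treat as a CPU-only machine


def count_nvidia_gpus() -> int:
    """Count NVIDIA GPUs using `nvidia-smi -L`, which prints one line per GPU."""
    if not shutil.which("nvidia-smi"):
        return 0
    proc = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if proc.returncode != 0:
        return 0
    return sum(1 for line in proc.stdout.splitlines() if line.startswith("GPU "))
```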

## License

MIT
